Skip to content

Testing and Validation

Generated Output Tests

TestGenerator creates pytest code from:

  • task description
  • success criteria
  • testable assertions
  • output artifacts

Then executes tests in isolated container:

  • image: validtr-test-runner:latest
  • network: disabled (network_mode=none)
  • mounts: tests and output as read-only

Score Dimensions

Current CodeScorer weights:

  • Test passing: 40
  • Execution: 25
  • Syntax validity: 15
  • Completeness (LLM judge): 20

Composite score determines pass/fail by threshold.

Completeness Judge

Uses provider JSON response (score 0-100). If judge fails, completeness defaults to 50% of its weight.

Task-Type Coverage

Only code-generation has dedicated scoring today. Other task types use code scorer fallback.

MIT Licensed