Testing and Validation

Generated Output Tests

TestGenerator creates pytest code from:

task description
success criteria
testable assertions
output artifacts

It then runs the generated tests in an isolated container:

image: validtr-test-runner:latest
network: disabled (network_mode=none)
mounts: tests and output as read-only
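
The container settings above map onto a plain `docker run` invocation. The sketch below only builds the argument list and does not start a container. The helper name `build_test_command`, the container-side paths `/tests` and `/output`, and the `pytest` command line are assumptions for illustration; only the image name, the disabled network, and the read-only mounts come from the list above.

```python
from __future__ import annotations

from pathlib import Path


def build_test_command(tests_dir: str, output_dir: str) -> list[str]:
    """Build a `docker run` argument list for an isolated, read-only test run.

    Container paths and the pytest arguments are assumed, not confirmed.
    """
    tests = Path(tests_dir).resolve()
    output = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        "--network", "none",             # no network access inside the container
        "-v", f"{tests}:/tests:ro",      # generated tests, mounted read-only
        "-v", f"{output}:/output:ro",    # artifacts under test, mounted read-only
        "validtr-test-runner:latest",
        "pytest", "/tests", "-q",
        "-p", "no:cacheprovider",        # avoid cache writes on the read-only mount
    ]
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. `docker run` propagates the exit code of the container command, and pytest exits with 0 only when every collected test passes.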

Score Dimensions

Current CodeScorer weights:

Test passing: 40
Execution: 25
Syntax validity: 15
Completeness (LLM judge): 20

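The weights sum to 100. The composite score is the weighted sum of the four dimensions, and a run passes or fails depending on whether that score meets a threshold.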
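
As a worked example of the weighting, the sketch below treats each dimension as a fraction between 0 and 1 and sums the weighted results. The dictionary keys, the function names, the threshold value of 70, and the use of `>=` for the pass check are assumptions; only the four weights come from the list above.

```python
from __future__ import annotations

# Weights from the list above; the key names are illustrative.
WEIGHTS = {
    "tests": 40,
    "execution": 25,
    "syntax": 15,
    "completeness": 20,
}

PASS_THRESHOLD = 70.0  # hypothetical value; the real threshold is not stated here


def composite_score(fractions: dict[str, float]) -> float:
    """Return the weighted sum of per-dimension fractions, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        fraction = min(max(fractions.get(name, 0.0), 0.0), 1.0)
        total += weight * fraction
    return total


def passed(fractions: dict[str, float], threshold: float = PASS_THRESHOLD) -> bool:
    """Compare the composite score against the (assumed) threshold."""
    return composite_score(fractions) >= threshold


example = {"tests": 0.75, "execution": 1.0, "syntax": 1.0, "completeness": 0.5}
print(composite_score(example))  # 30 + 25 + 15 + 10 = 80.0
```

A missing dimension counts as 0 in this sketch, so an empty input scores 0.0 rather than raising an error.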

Completeness Judge

The judge reads a score from 0 to 100 out of the provider's JSON response. If the judge fails, completeness defaults to 50% of its weight (10 of the 20 points).
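
The fallback rule can be sketched as follows. The JSON key `score`, the function name `completeness_points`, the clamping of out-of-range values, and the set of conditions treated as a judge failure (unparseable JSON, a missing key, a non-numeric value) are all assumptions; the text above only states the 0-100 range and the 50% default.

```python
import json
import math

COMPLETENESS_WEIGHT = 20   # from the CodeScorer weights
FALLBACK_FRACTION = 0.5    # "defaults to 50% of its weight"


def completeness_points(raw_response: str) -> float:
    """Convert a judge response into completeness points out of 20.

    Falls back to half the weight when no usable score can be read.
    """
    fallback = COMPLETENESS_WEIGHT * FALLBACK_FRACTION
    try:
        score = float(json.loads(raw_response)["score"])
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing "score" key, or a non-numeric value.
        return fallback
    if not math.isfinite(score):
        return fallback
    score = min(max(score, 0.0), 100.0)  # clamp to the documented 0-100 range
    return COMPLETENESS_WEIGHT * score / 100.0


print(completeness_points('{"score": 85}'))  # 17.0
print(completeness_points("not json"))       # 10.0
```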

Task-Type Coverage

Only code-generation tasks have a dedicated scorer today. All other task types fall back to the code scorer.
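
One way to express this fallback is a registry lookup with a default. The sketch below is illustrative only: the `SCORERS` registry, the `scorer_for` helper, and the `"documentation"` task type are hypothetical, and the `CodeScorer` class here is an empty stand-in for the real one.

```python
class CodeScorer:
    """Empty stand-in for the real CodeScorer; only the name comes from the text."""


# Task types with a dedicated scorer. Only code-generation has one today.
SCORERS = {"code-generation": CodeScorer}


def scorer_for(task_type: str) -> type:
    """Return the dedicated scorer class, or CodeScorer as the fallback."""
    return SCORERS.get(task_type, CodeScorer)
```

With this shape, adding a dedicated scorer for another task type would only require a new registry entry.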