Iterative Evaluation Framework
SForge uses an iterative evaluate-and-improve loop: the agent edits code, submits for evaluation, reads feedback, and keeps improving until timeout.
The mechanisms fall into four groups:
| Mechanism | Purpose |
|---|---|
sforge-submit | Agent-initiated submission with immediate feedback |
| Auto-eval | Periodically evaluate current code to add sampling points |
| Stop hook / Auto-resume | Keep the agent working until timeout |
| Submission limits | Control the number and frequency of agent-initiated submissions |
Workflow
- The agent works in the Work container
- The agent calls
sforge-submitto submit current code - The Judge Server starts a temporary Judge container and runs
eval_cmd - SForge returns
pass_rate,score, failed items, and summary - The agent uses the feedback to improve and submit again
- Auto-eval periodically submits current code in the background, creating extra sampling points
- The final result is the best result across all submissions
Agent-Initiated Submission: sforge-submit
sforge-submit is installed in the Work container at /usr/local/bin/sforge-submit.
It:
- Archives code from
submit_paths, excludingsubmit_exclude - Submits it to the Judge Server
- Polls for evaluation results
- Displays pass rate, score, failed items, and summary
- Updates
/tmp/sforge_state.json
Common commands:
| Command | Description |
|---|---|
sforge-submit | Submit and wait for result |
sforge-submit --details / -d | Show per-item details |
sforge-submit --list / -l | List previous submissions |
Auto-Eval: Adding Sampling Points
Auto-eval is a background evaluation daemon. Its purpose is not to inject extra feedback into the agent, but to periodically record the current code state as additional evaluation samples.
This is useful for long runs: even if the agent does not call sforge-submit at the right time, SForge still records intermediate versions instead of only the final state.
Properties:
- Default interval:
300seconds - Adjust with
--eval-interval - Disable with
--disable-auto-eval - Uses separate rounds:
auto-1,auto-2, ... - Results are not injected into the agent context
- Auto-eval submissions are not limited by
--max-submissionsor--submission-cooldown
# Sample every 600 seconds
sforge run --task ad_placement_optimization --agent claude-code --eval-interval 600
# Disable automatic sampling
sforge run --task ad_placement_optimization --agent claude-code --disable-auto-evalKeeping the Agent Working
Stop hook
The stop hook prevents the agent from deciding it is done and exiting early. When the agent tries to exit, the hook tells it to keep working until the run timeout.
- Prevents premature natural exits
- Enabled by default for agents that support stop hooks
- Can be disabled for debugging or short runs
sforge run --task ad_placement_optimization --agent claude-code --disable-stop-hookAuto-resume
Auto-resume handles abnormal exits such as API disconnects, transient errors, or unexpected agent process termination. These exits can bypass the stop hook.
- After an abnormal exit, SForge resumes the session using the agent's native resume mechanism
- The remaining timeout budget is passed to the resumed session
- Enabled by default for agents with a
resume_cmd - Safety guards stop retrying after very fast exits or too many resume attempts
sforge run --task ad_placement_optimization --agent claude-code --disable-auto-resumeSubmission Count and Cooldown
Some tasks need to limit agent-initiated evaluations or prevent very frequent submissions. Use:
| Flag | Description |
|---|---|
--max-submissions N | Limit the total number of agent-initiated submissions |
--submission-cooldown S | Require at least S seconds between two agent-initiated submissions |
sforge run \
--task ad_placement_optimization \
--agent claude-code \
--max-submissions 10 \
--submission-cooldown 120These limits apply only to agent-initiated sforge-submit calls. Auto-eval samples do not count toward the limit and do not observe the cooldown.
If a limit is hit, sforge-submit tells the agent how many submissions remain or how long to wait. The agent should continue local checks and improvements, then submit later.
Scoring and Final Result
Each submission records pass rate, score, summary, and details.
| Metric | Description |
|---|---|
pass_rate | Fraction of tests passed |
score | Continuous numeric score, often used by optimization tasks |
score_direction | maximize or minimize |
The final result is the best result across all submissions, including:
- Agent-initiated submissions:
agent-1,agent-2, ... - Auto-eval samples:
auto-1,auto-2, ...
The best-submission policy is controlled by judge.selection in the task JSON:
pass_rate_first: prioritize pass rate, then score after 100% or equal pass ratescore_first: compare score directlyvalid_then_score: filter invalid submissions, then compare score
State File
/tmp/sforge_state.json inside the Work container tracks submission history:
{
"best_pass_rate": 0.95,
"best_score": 250,
"best_round": "agent-3",
"submissions": []
}