Skip to content

CLI Commands

SForge exposes one CLI entry point: sforge <command>.

For most evaluation users, the core workflow only needs four commands:

bash
sforge fetch-tasks edgebench
sforge pull --task ad_placement_optimization --registry seededge
sforge serve
sforge run --task ad_placement_optimization --agent claude-code

Developer-oriented commands such as build, push, and hub are grouped at the end of this page. They are mainly useful when authoring tasks, maintaining a benchmark, or operating shared infrastructure.

Command Groups

Evaluation workflow

SubcommandPurposeKey Flags
fetch-tasksDownload benchmark task definitionsbenchmark, --repo, --revision
listList available task definitionsnone
pullPull pre-built task images from a registry--task, --registry
serveStart the Judge HTTP server--host, --port
runRun an agent on tasks--task, --experiment, --agent, --model, --timeout, --judge-url
evalEvaluate a solution archive directly--task, --archive, --run-id, --timeout, --json
proxyStart a host-side LLM API reverse proxy--target, --host, --port
visualizerStart the local results viewer--runs-dir, --tasks-dir, --host, --port

Developer / benchmark maintenance

SubcommandPurposeKey Flags
buildBuild base, work, and judge images locally--task, --force-rebuild, --force-rebuild-with-base
pushPush locally-built images to a registry--task, --registry

Global Options

These flags are available for all subcommands:

FlagDefaultDescription
--log-dirlogs/Override log output directory
--tasks-dirtasks/Override task definitions directory
--silentfalseSuppress verbose output (auto-enabled for multi-task runs)

sforge fetch-tasks

Download benchmark task definitions from HuggingFace Hub. For EdgeBench:

bash
sforge fetch-tasks edgebench

Advanced usage:

bash
sforge fetch-tasks --repo ByteDance-Seed/EdgeBench --revision main
Flag / ArgumentDescription
benchmarkBenchmark name, for example edgebench
--repoHuggingFace dataset repo ID. Overrides benchmark lookup
--revisionGit revision, branch, tag, or commit hash to download

sforge list

List available tasks.

bash
sforge list

Output columns: ID, Name, Base Image, Parser.

sforge pull

Pull pre-built images from a remote container registry. In the standard evaluation workflow, run this after sforge fetch-tasks edgebench and before sforge serve / sforge run.

bash
sforge pull --task ad_placement_optimization --registry seededge
sforge pull --task ad_placement_optimization gitlet --registry seededge
FlagDescription
--taskOne or more task IDs (required)
--registryRemote container registry or configured registry alias (overrides SFORGE_REGISTRY env var)

Each image is tagged by a content hash, so only images matching the current task definition are pulled.

sforge serve

Start the Judge HTTP server. This must be running before sforge run is invoked --- the agent container connects to it to submit code for evaluation.

bash
sforge serve
sforge serve --port 8080
sforge serve --host 0.0.0.0 --port 9090
FlagDefaultDescription
--host0.0.0.0Bind address
--port8080Listen port

The server exposes the Judge HTTP API and handles both standard test-driven submissions and interactive game sessions.

sforge run

Run an agent on one or more tasks. This is the primary command for evaluation.

Basic usage

bash
# Standard agent mode
sforge run --task ad_placement_optimization --agent claude-code

# Multiple tasks in parallel
sforge run --task ad_placement_optimization gitlet rookiedb --agent claude-code

# Experiment config mode
sforge run --experiment experiment.yaml

Full options

FlagDefaultDescription
--taskrequired*One or more task IDs (space-separated). Multiple tasks run fully in parallel.
--agentrequired*Agent name (e.g., claude-code, codex). Required unless --experiment is specified.
--experiment---Path to experiment YAML config file. If --task is omitted, all YAML tasks run; if --task is provided, only that subset runs with the experiment settings.
--model---Model override (e.g., claude-opus-4-8)
--timeout3600Agent timeout in seconds
--eval-interval300Auto-eval daemon interval in seconds
--run-idrandomRun identifier for tracking and log organization
--judge-urlhttp://host.docker.internal:8080Judge server URL as seen from inside the container
--backenddockerContainer backend (docker or k8s)
--stagger---Spread task launches evenly over N seconds (e.g., --stagger 300)
--max-submissions---Maximum number of agent submissions per run
--submission-cooldown---Minimum seconds between agent submissions
--work-cpu-limit---Number of CPUs for work containers
--work-mem-limit---Memory limit for work containers (e.g., '8g')
--judge-cpu-limit---Number of CPUs for judge containers
--judge-mem-limit---Memory limit for judge containers (e.g., '4g')
--disable-stop-hookfalseDisable the stop hook (allow agent to exit normally)
--disable-auto-evalfalseDisable the background auto-evaluation daemon
--disable-auto-resumefalseDisable auto-resume on abnormal agent exit
--disable-internetfalseForce network isolation (only judge server + API allowed). Requires sudo for iptables. Mutually exclusive with --enable-internet.
--enable-internetfalseForce full internet access (overrides per-task internet: false setting). Mutually exclusive with --disable-internet.

Requirement

--task or --experiment is required. --agent is required unless --experiment is specified.

Output files

After completion, results are written to logs/runs/<run_id>/<task_id>/:

logs/runs/<run_id>/<task_id>/
├── run_config.json      # Effective configuration for this run
├── run_agent.log        # Harness-level log
├── install_output.txt   # Agent installation output
├── agent_prompt.md      # Enhanced prompt sent to agent
├── agent_output.txt     # Full agent conversation log
├── final_archive.tar.gz # Final code snapshot
├── final_result.json    # Summary with best_pass_rate, total_rounds, etc.
└── submissions/         # Per-round evaluation details

sforge eval

Evaluate an archive directly against a task's test suite, without running an agent. Useful for testing solutions manually.

bash
sforge eval --task ad_placement_optimization --archive solution.tar.gz
sforge eval --task ad_placement_optimization --archive solution.tar.gz --json
sforge eval --task ad_placement_optimization --archive - < solution.tar.gz
FlagDescription
--taskTask ID (required)
--archivePath to .tar.gz archive, or - to read from stdin
--run-idCustom run ID for log organization
--timeoutEvaluation timeout in seconds
--jsonAlso output the full JSON report
--backendContainer backend (docker or k8s)
--judge-cpu-limitNumber of CPUs for judge container
--judge-mem-limitMemory limit for judge container (e.g., '4g')

sforge proxy

Start a local API reverse proxy. This is designed for use with --disable-internet.

bash
sforge proxy --target https://api.anthropic.com --port 9090
FlagDefaultDescription
--targetrequiredUpstream API URL to forward to
--host0.0.0.0Bind address
--port9090Listen port

Requires SFORGE_HTTPS_PROXY (or HTTPS_PROXY) to be configured.

sforge visualizer

Start a web-based results viewer for browsing run outputs and comparing scores across tasks and runs.

bash
sforge visualizer
sforge visualizer --runs-dir logs/runs --port 8000
FlagDefaultDescription
--runs-dirlogs/runsDirectory containing run folders
--tasks-dirtasks/Directory of task JSON definitions (used for score direction)
--host127.0.0.1Bind address
--port8000Listen port

Open http://127.0.0.1:8000/ in your browser after starting.

Developer Commands

The following commands are mainly for users who develop tasks, maintain benchmark JSON definitions, or operate shared infrastructure.

sforge build

Build Docker images (base + work + judge) for one or more tasks. Evaluation users normally use sforge pull instead.

bash
sforge build --task ad_placement_optimization
sforge build --task ad_placement_optimization gitlet rookiedb
sforge build --task ad_placement_optimization --force-rebuild
FlagDescription
--taskOne or more task IDs (required, space-separated)
--force-rebuildForce rebuild work + judge images (skip base)
--force-rebuild-with-baseForce rebuild ALL images including base

sforge push

Push locally-built images to a remote container registry.

bash
sforge push --task ad_placement_optimization --registry registry.example.com/sforge
sforge push --task ad_placement_optimization gitlet --registry registry.example.com/sforge
FlagDescription
--taskOne or more task IDs (required)
--registryRemote container registry URL (overrides SFORGE_REGISTRY env var)