Skip to content

Experiment Configuration

Overview

Experiment configuration files provide a YAML-based way to run multiple tasks with shared model settings and per-task overrides. This decouples "how to run" (model, timeouts, agent settings) from "what to test" (task JSON definitions).

YAML Format

yaml
model:
  api_key: ${OPUS_KEY}
  api_base_url: https://api.anthropic.com
  model: claude-opus-4-6

stagger: 300  # Spread task launches evenly over 300 seconds

defaults:
  agent: claude-code
  timeout: 7200
  eval_interval: 300

tasks:
  ad_placement_optimization:
    eval_interval: 1800
  tinykv:
    extra_env:
      GOPROXY: "https://goproxy.cn"
  gitlet: {}

Fields

model

Configures the LLM used by the agent. All string values support ${ENV_VAR} expansion --- the referenced environment variable must be set at runtime.

FieldTypeDescription
api_keystringAPI key (supports ${ENV_VAR} expansion)
api_base_urlstringAPI base URL
modelstringModel name (e.g., claude-opus-4-6, claude-sonnet-4-20250514)

defaults

Default settings applied to all tasks unless overridden per-task.

FieldTypeDefaultDescription
agentstring---Agent name (claude-code, codex)
modelstring---Model override (overrides model.model for specific tasks if set per-task)
timeoutint---Agent timeout in seconds
eval_intervalint---Auto-eval interval in seconds
disable_stop_hookboolfalseDisable the agent stop hook
disable_auto_evalboolfalseDisable background auto-evaluation
disable_auto_resumeboolfalseDisable auto-resume on abnormal agent exit
internetbool---Enable/disable internet access
extra_envdict---Extra environment variables injected into the agent container
backendstring---Container backend (docker or k8s)
judge_urlstring---Judge server URL override
max_submissionsint---Maximum number of agent submissions per run
submission_cooldownint---Minimum seconds between agent submissions
work_cpu_limitint---CPU limit for work containers
work_mem_limitstring---Memory limit for work containers
judge_cpu_limitint---CPU limit for judge containers
judge_mem_limitstring---Memory limit for judge containers

tasks

Per-task overrides. Each key is a task ID that must correspond to a JSON file in the tasks directory. The value accepts the same fields as defaults. Tasks listed here define which tasks are run --- if --task is not specified on the CLI, the experiment's task list is used.

An empty object ({}) means "use all defaults, no overrides."

stagger

A top-level integer field that spreads task launches evenly over the specified number of seconds when running multiple tasks. This prevents a thundering-herd of simultaneous container starts and API calls.

yaml
stagger: 300  # Launch tasks at even intervals over 5 minutes

For example, with 6 tasks and stagger: 300, each task starts 50 seconds apart.

Priority Chain

Configuration values are resolved in the following order (highest priority first):

CLI flags  >  per-task overrides  >  experiment defaults  >  env vars / task JSON

For example:

  • --timeout 3600 on the CLI always wins
  • A per-task timeout: 1800 overrides the experiment-level defaults.timeout
  • The experiment defaults.timeout overrides SFORGE_AGENT_TIMEOUT

For extra_env, dictionaries are merged (not replaced). Per-task keys override default keys on conflict:

yaml
defaults:
  extra_env:
    FOO: "from-defaults"
    BAR: "shared"

tasks:
  ad_placement_optimization:
    extra_env:
      FOO: "from-task"   # overrides defaults
      BAZ: "task-only"   # added
    # Result: FOO=from-task, BAR=shared, BAZ=task-only

Usage

Run all tasks in an experiment

bash
sforge run --experiment experiments/my_config.yaml --run-id batch-001

When --task is not specified, all tasks listed in the experiment file are run in parallel.

Run a subset of tasks

bash
sforge run --experiment experiments/my_config.yaml --task ad_placement_optimization tinykv

This uses the experiment's model and default settings but only runs the specified tasks.

CLI overrides

bash
sforge run \
  --experiment experiments/my_config.yaml \
  --timeout 3600 \
  --disable-stop-hook \
  --run-id experiment-v2

CLI flags take highest priority and override both per-task and default settings.

Environment Variable Expansion

String values in the model section support ${VAR_NAME} syntax:

yaml
model:
  api_key: ${MY_API_KEY}
  api_base_url: ${API_ENDPOINT}

If a referenced variable is not set, SForge exits with an error. This lets you keep secrets out of config files while still having reproducible experiment definitions.

Run Artifacts

When running with an experiment config, SForge saves:

  • logs/runs/<run-id>/experiment.yaml --- a copy of the experiment file
  • logs/runs/<run-id>/run_config.json --- the fully resolved configuration for all tasks
  • logs/runs/<run-id>/<task-id>/run_config.json --- per-task resolved configuration (with API keys redacted)
  • logs/runs/<run-id>/summary.json --- final results summary (multi-task runs)

This ensures every run is fully reproducible from its log directory.