All Tasks (Kubernetes)
Run the full EdgeBench suite (~50 tasks) across a Kubernetes cluster.
Prerequisites
| Requirement | Check |
|---|---|
| Linux host with kubectl configured | kubectl cluster-info |
| NetworkPolicy support (for network isolation) | CNI enforces NetworkPolicy; kubeconfig can create/delete NetworkPolicy in the target namespace |
| Docker Engine (for pushing images) | docker run hello-world |
| Python >= 3.10 | python --version |
| A container registry reachable from K8s nodes | e.g. <registry-ip>:5000 |
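The local part of this table can be checked in one pass. The following is a minimal sketch, not part of SForge: it only confirms that the Python version is new enough and that `kubectl` and `docker` are on `PATH`. It does not test cluster connectivity, NetworkPolicy enforcement, or registry reachability, which still need the manual checks listed above. The function name `check_prerequisites` is made up for this illustration.

```python
import shutil
import sys

# Tools the prerequisites table expects on the host.
REQUIRED_TOOLS = ("kubectl", "docker")
MIN_PYTHON = (3, 10)


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    label = f"python>={min_python[0]}.{min_python[1]}"
    results = {label: sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```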
Step-by-step
1. Install SForge
pip install sforge
2. Fetch task definitions
sforge fetch-tasks edgebench
sforge list # verify tasks are visible
3. Push images to the cluster registry
K8s pods pull images at runtime. Pulling from Docker Hub or other public registries is slow: image pulls can take minutes per task and eat into the evaluation budget, which severely impacts agent performance. Push all images to a private registry in the same VPC as your K8s cluster beforehand.
# Pull pre-built images from the public registry
sforge pull --all --registry seededge
# Push to your private registry
sforge push --all --registry <registry-ip>:5000
Verify an image is available in your registry:
curl -s http://<registry-ip>:5000/v2/_catalog | head
Setting up a private Docker registry
If you don't already have a private registry, you can start one with a single command:
docker run -d -p 5000:5000 --restart=always --name registry registry:2
Configuring insecure (HTTP) registries: By default Docker only trusts HTTPS registries. For a plain HTTP registry on your LAN, you need to configure every machine that pushes or pulls (including K8s nodes) to trust it:
Edit /etc/docker/daemon.json (create it if it doesn't exist):
{ "insecure-registries": ["<registry-ip>:5000"] }
Restart Docker:
sudo systemctl restart docker
For K8s nodes using containerd, add the registry to /etc/containerd/config.toml on every node in the cluster (pods can be scheduled to any node):
[plugins."io.containerd.grpc.v1.cri".registry.configs."<registry-ip>:5000".tls]
insecure_skip_verify = true
Then restart containerd:
sudo systemctl restart containerd
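If /etc/docker/daemon.json already exists, overwriting it with only the `insecure-registries` key would drop any other settings. The sketch below shows one way to merge the entry into existing content. It works on the file's text rather than the file itself, so reading and writing the file (with root privileges) is left to the caller. The helper name `add_insecure_registry`, the sample `log-driver` setting, and the address `10.0.0.5:5000` are illustrative assumptions.

```python
import json


def add_insecure_registry(daemon_json_text, registry):
    """Return daemon.json text with `registry` added to "insecure-registries".

    Existing keys are preserved and the registry is not duplicated.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    registries = config.setdefault("insecure-registries", [])
    if registry not in registries:
        registries.append(registry)
    return json.dumps(config, indent=2) + "\n"


# Example with a hypothetical existing config and registry address.
existing = '{"log-driver": "json-file"}'
print(add_insecure_registry(existing, "10.0.0.5:5000"))
```

Docker still needs a restart afterwards for the change to take effect.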
4. Set up the judge server
Start the judge server on a host that K8s pods can reach. When you later run the experiment, set --judge-url to that host's IP address (not localhost or host.docker.internal):
sforge serve --port 8080
5. Configure LLM-graded tasks
The college_english_exam_bank task uses an LLM to grade agent submissions. Its judge container calls out to a model API, so it needs credentials passed via SFORGE_JUDGE_EXTRA_ENV.
Set this before starting the judge server:
export SFORGE_JUDGE_EXTRA_ENV="SFORGE_JUDGE_API_KEY=your-key,SFORGE_JUDGE_API_BASE_URL=https://api.openai.com/v1,SFORGE_JUDGE_MODEL=gpt-5.5"
sforge serve --port 8080
| Variable (inside judge container) | Purpose |
|---|---|
| SFORGE_JUDGE_API_KEY | API key the grading script uses to call the LLM |
| SFORGE_JUDGE_API_BASE_URL | Base URL of the LLM endpoint for grading |
| SFORGE_JUDGE_MODEL | Model ID used by the grading script |
Non-LLM-graded tasks ignore these variables, so it is safe to set them unconditionally for all tasks.
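The value of SFORGE_JUDGE_EXTRA_ENV in the export above is a comma-separated list of KEY=VALUE pairs. A small helper can build it from a dict and catch values that would break that format. This is a sketch under one assumption: the format has no escaping mechanism, so a comma inside a value cannot be represented. The function name `build_judge_extra_env` is hypothetical.

```python
def build_judge_extra_env(variables):
    """Join a dict into the comma-separated KEY=VALUE form shown above."""
    for key, value in variables.items():
        # Assumption: the format has no escaping, so "," would split a value
        # and "=" or "," in a key would be ambiguous.
        if "," in value or "=" in key or "," in key:
            raise ValueError(f"cannot encode {key!r} in comma-separated form")
    return ",".join(f"{key}={value}" for key, value in variables.items())


extra_env = build_judge_extra_env({
    "SFORGE_JUDGE_API_KEY": "your-key",
    "SFORGE_JUDGE_API_BASE_URL": "https://api.openai.com/v1",
    "SFORGE_JUDGE_MODEL": "gpt-5.5",
})
print(f'export SFORGE_JUDGE_EXTRA_ENV="{extra_env}"')
```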
6. Run the experiment
The experiment.yaml defines the full EdgeBench suite — all tasks, per-task overrides, model config, and resource limits. See Experiment YAML Walkthrough below for details.
sforge run --experiment experiment.yaml \
--judge-url http://<judge-host-ip>:8080 \
--run-id edgebench-001
This launches all tasks staggered over 600 seconds total (the delay is evenly divided among tasks). Monitor progress with:
sforge visualizer
# Open http://127.0.0.1:8000
Or watch a specific task's log:
tail -f logs/runs/*/ad_placement_optimization/agent_output.txt
Experiment YAML Walkthrough
The experiment.yaml in this directory is annotated with comments. Key sections:
env:
Environment variables injected into the host process before parsing the rest of the config. Most important for K8s:
env:
  SFORGE_K8S_IMAGE_REGISTRY: "<registry-ip>:5000" # REQUIRED for k8s
stagger:
Total time, in seconds, over which task launches are spread; the delay between consecutive tasks is this value divided evenly among the tasks. With 50+ tasks, launching them all at once causes API rate-limit storms and K8s scheduling pressure. 600 seconds (10 minutes) is a safe default. Set to 0 to launch all tasks simultaneously.
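The run step describes the 600 seconds as a total that is evenly divided among tasks. Assuming this means the total is divided by the task count (SForge's exact rounding and whether it divides by N or N-1 are not stated here), the launch schedule for 50 tasks works out as in this sketch. The function name `launch_offsets` is illustrative.

```python
def launch_offsets(total_stagger_seconds, task_count):
    """Return the launch offset (seconds from start) for each task.

    Assumption: the total is divided by the number of tasks, so task i starts
    at i * (total / task_count).
    """
    if task_count <= 0:
        return []
    interval = total_stagger_seconds / task_count
    return [i * interval for i in range(task_count)]


offsets = launch_offsets(600, 50)
print(offsets[1] - offsets[0])  # gap between consecutive launches: 12.0
print(offsets[-1])              # offset of the last launch: 588.0
```

Under that assumption, a new task starts every 12 seconds and the last one launches just under 10 minutes after the first.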
model:
model:
  api_key: "sk-xxxx"
  model: claude-opus-4-8
  # Only needed for third-party / self-hosted endpoints.
  # Omit when using the Anthropic API directly.
  # api_base_url: "https://api.deepseek.com/anthropic"
defaults:
Applied to every task unless overridden per-task.
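One way to picture how defaults combine with the per-task overrides is a key-by-key overlay in which the per-task value wins. The sketch below assumes a shallow merge, which this guide does not confirm. The default `work_mem_limit` of "8g" is a placeholder, since only the default `work_cpu_limit` of 4 is stated here.

```python
def effective_settings(defaults, overrides=None):
    """Overlay per-task overrides on the defaults; overrides win on conflict."""
    return {**defaults, **(overrides or {})}


# Hypothetical defaults; only work_cpu_limit=4 is stated in this guide.
defaults = {"work_cpu_limit": 4, "work_mem_limit": "8g"}
smt_solver = effective_settings(
    defaults, {"work_cpu_limit": 16, "work_mem_limit": "16g"}
)
other_task = effective_settings(defaults)  # no overrides: defaults apply as-is
print(smt_solver)
print(other_task)
```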
Per-task overrides
tasks:
  smt_solver:
    work_cpu_limit: 16 # needs more CPU than the default 4
    work_mem_limit: "16g"
  anchorhead_text_adventure:
    submission_cooldown: 0 # game-mode task, no cooldown
  carleson_formalization: *lean_task # YAML anchor for Lean tasks
YAML anchors
Shared override blocks can be defined under a top-level key with an x- prefix, anchored with &, and referenced from tasks with *:
x-lean-task: &lean_task
  work_cpu_limit: 8
  work_mem_limit: "16g"
  judge_cpu_limit: 8
  judge_mem_limit: "16g"
tasks:
  carleson_formalization: *lean_task
  pfr_formalization: *lean_task
Other Experiment Configs
The examples directory includes additional experiment configs for other agents and third-party models:
- experiment-codex.yaml: Codex with GPT-5.5
- experiment-deepseek.yaml: Claude Code with DeepSeek V4 Pro (1M context)
- experiment-glm.yaml: Claude Code with GLM 5.1 (200K context)
sforge run --experiment experiment-codex.yaml \
--judge-url http://<judge-host-ip>:8080 \
--run-id edgebench-codex-001
For a detailed explanation of the Claude Code third-party model settings (cache optimization, model routing variables, context window configuration), see Single Task (Docker).
K8s-Specific Environment Variables
| Variable | Required | Purpose |
|---|---|---|
| SFORGE_K8S_IMAGE_REGISTRY | Yes | Registry that K8s pods pull images from. Backend init fails without it. |
| SFORGE_K8S_NAMESPACE | No (default: default) | Kubernetes namespace for pods |
| SFORGE_K8S_KUBECONFIG | No | Path to kubeconfig file (uses default context if omitted) |
| SFORGE_K8S_NODE_SELECTOR | No | Node selector for pods, format: "key1=val1,key2=val2" |
For all other environment variables, see Environment Variables.
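The "key1=val1,key2=val2" format of SFORGE_K8S_NODE_SELECTOR is easy to mistype. The following sketch parses such a string into a dict so it can be validated before a run. It illustrates the format only and is not SForge's own parser. Tolerating whitespace and trailing commas is an assumption made here for convenience.

```python
def parse_node_selector(raw):
    """Parse "key1=val1,key2=val2" into a dict, rejecting malformed pairs."""
    selector = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        selector[key.strip()] = value.strip()
    return selector


print(parse_node_selector("key1=val1,key2=val2"))
```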