OpenBench Benchmark Report

Run ID: latest
Suites: practical, swe-bench
Agents: 4
Tasks: 20
Mode: native
Generated: 2026-04-12T09:20:56+00:00
| Environment | Value |
|---|---|
| cpu | x86_64 |
| memory_gb | 62.72 |
| os | Linux-6.8.0-106-generic-x86_64-with-glibc2.35 |
| python | 3.11.15 |
| Agent | Startup | Memory | Binary size |
|---|---|---|---|
| omc | 293.26 ms · score 76.64 | 192.21 MB · score 71.27 | 0.00005 MB · score 100.00 |
| omx | 399.67 ms · score 69.91 | 62.23 MB · score 95.76 | 0.00004 MB · score 100.00 |
| Agent | Successful tasks | Failed tasks | Total tasks |
|---|---|---|---|
| claude | 5 | 0 | 5 |
| codex | 5 | 0 | 5 |
| omc | 5 | 0 | 5 |
| omx | 5 | 0 | 5 |
Raw values remain visible alongside normalized scores. Lower values are better for startup, memory, and binary-size metrics. Binary-size currently reflects the resolved command path footprint, which may be a launcher wrapper rather than the full installation size.
Measures how quickly the agent CLI starts up and becomes responsive. Uses hyperfine to run `<command> --version` with warmup, reporting the mean execution time. Lower is better.
Measures peak memory consumption (maximum resident set size) during a single CLI invocation. Uses GNU time to record the peak RSS of the process. Lower is better.
Measures the on-disk footprint of the agent CLI entry point. Uses du to calculate the resolved command path size. Lower is better.
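The three measurements can be approximated with the Python standard library. A rough sketch only: the harness actually uses hyperfine, GNU time, and du, and here the current Python interpreter stands in for an agent CLI binary.

```python
import os
import resource
import subprocess
import sys
import time

def measure_cli(cmd_path: str = sys.executable):
    """Stdlib stand-ins for the startup, memory, and binary-size metrics.

    In the harness the command name would first be resolved on PATH
    (e.g. with shutil.which); here we default to the running interpreter.
    """
    path = os.path.realpath(cmd_path)

    # Startup: wall-clock time of one `<command> --version` run.
    # (hyperfine adds warmup runs and reports the mean of many runs.)
    start = time.perf_counter()
    subprocess.run([path, "--version"], capture_output=True, check=True)
    startup_ms = (time.perf_counter() - start) * 1000

    # Memory: max RSS of finished child processes, as GNU time reports it.
    # Note: ru_maxrss is kilobytes on Linux but bytes on macOS.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

    # Binary size: footprint of the resolved command path, which may be
    # a small launcher wrapper rather than the full installation.
    size_bytes = os.path.getsize(path)

    return startup_ms, max_rss, size_bytes
```

All three values are lower-is-better, matching the scoring convention above.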
Each agent receives the same prompt and works in an identical sandboxed environment. Three independent axes are measured: correctness (did the task pass?), duration (time to completion), and token usage (cost efficiency). Lower duration and fewer tokens are better, given equal correctness.
| Agent | Pass | Time | Tokens (in) | Tokens (out) |
|---|---|---|---|---|
| claude | 5/5 | 1m 33s | 834k | 2k |
| codex | 5/5 | 3m 12s | 767k | 8k |
| omc | 5/5 | 1m 56s | 880k | 3k |
| omx | 5/5 | 4m 39s | 1.9M | 10k |
| Metric | claude | codex | omc | omx |
|---|---|---|---|---|
| Pass@1 | 100.0% | 100.0% | 100.0% | 100.0% |
| Tokens/success | 492 | 154,876 | 575 | 374,276 |
| Duration/success | 20.2s | 40.6s | 22.3s | 54.3s |
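The report does not spell out how the per-success metrics above are derived. A minimal sketch, assuming they are simple run-level totals divided by the number of successful runs (the function name and sample numbers are illustrative, not from the harness):

```python
def per_success(total: float, successes: int) -> float:
    """Average a run-level total over successful runs; inf if none passed."""
    return total / successes if successes else float("inf")

# Illustrative numbers only, not the report's exact inputs:
durations = [12.0, 15.0, 20.0, 21.0, 25.0]   # seconds, one per task
mean_duration = per_success(sum(durations), 5)
```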
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 12.5s | 131k | 131k | 241 |
| codex | PASS | 27.8s | 122k | 101k | 1k |
| omc | PASS | 14.2s | 131k | 131k | 246 |
| omx | PASS | 37.0s | 257k | 230k | 1k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 15.4s | 176k | 176k | 467 |
| codex | PASS | 34.7s | 149k | 132k | 2k |
| omc | PASS | 21.3s | 176k | 176k | 471 |
| omx | PASS | 1m 24s | 572k | 543k | 3k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 20.2s | 176k | 176k | 467 |
| codex | PASS | 40.6s | 174k | 152k | 2k |
| omc | PASS | 33.0s | 221k | 221k | 742 |
| omx | PASS | 54.3s | 359k | 315k | 2k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 20.3s | 176k | 176k | 542 |
| codex | PASS | 44.5s | 122k | 82k | 1k |
| omc | PASS | 22.3s | 176k | 176k | 560 |
| omx | PASS | 43.4s | 315k | 296k | 2k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 24.4s | 176k | 176k | 692 |
| codex | PASS | 44.4s | 200k | 178k | 2k |
| omc | PASS | 25.2s | 176k | 176k | 803 |
| omx | PASS | 1m 0s | 359k | 345k | 3k |
Each task is a self-contained Python fixture with a deliberate bug or missing feature. Agents receive only a task description and the list of editable files. Tests are hidden and applied only during evaluation.
Easy Bug Fix
| Field | Description |
|---|---|
| Bug | The add_numbers() function uses subtraction (-) instead of addition (+). |
| Goal | Fix the operator so the calculator returns correct sums. |
| Files | calculator.py |
| Why this test | Tests the most basic capability: reading code, spotting a single-character bug, and making a minimal fix. |
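A hypothetical sketch of what this fixture and its fix might look like; the actual calculator.py contents are not shown in this report.

```python
# Buggy version shipped to the agent (sketch):
def add_numbers(a, b):
    return a - b  # deliberate bug: subtraction instead of addition

# The minimal one-character fix the hidden tests expect:
def add_numbers_fixed(a, b):
    return a + b
```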
Easy Bug Fix
| Field | Description |
|---|---|
| Bug | The slugify() function replaces spaces with underscores (_) but tests expect hyphens (-). |
| Goal | Change the replacement character from underscore to hyphen. |
| Files | text_utils.py |
| Why this test | Tests whether the agent can infer the expected behavior from context without seeing the test file. |
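A sketch of the described bug and fix; the real slugify() may also normalize punctuation, which this minimal version ignores.

```python
# Buggy version (sketch): spaces become underscores.
def slugify(text):
    return text.strip().lower().replace(" ", "_")

# Expected behavior: spaces become hyphens.
def slugify_fixed(text):
    return text.strip().lower().replace(" ", "-")
```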
Easy Feature
| Field | Description |
|---|---|
| Bug | load_timeout() only reads timeout_ms but the new schema uses timeout_seconds. |
| Goal | Support both timeout_seconds (new) and timeout_ms (legacy) fields with backward compatibility. |
| Files | config_loader.py |
| Why this test | Tests the agent's ability to add a feature while preserving existing behavior — a common real-world pattern. |
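One way the backward-compatible version could look. The return unit (seconds) and the precedence of the new field over the legacy one are assumptions; the report does not show config_loader.py.

```python
def load_timeout(config: dict) -> float:
    """Return the timeout in seconds, accepting both schema versions."""
    if "timeout_seconds" in config:           # new schema takes precedence
        return float(config["timeout_seconds"])
    if "timeout_ms" in config:                # legacy field, converted
        return config["timeout_ms"] / 1000.0
    raise KeyError("no timeout field present")
```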
Medium Debug
| Field | Description |
|---|---|
| Bug | Two files import from helpers.math_ops, but the module was moved to utils.math_ops. |
| Goal | Update imports in both app.py and report.py to use the new path. |
| Files | app.py, report.py |
| Why this test | Tests multi-file awareness: the agent must find and fix the same broken import in two separate files. |
Easy Feature
| Field | Description |
|---|---|
| Bug | create_user() accepts any email including blank strings without validation. |
| Goal | Add email trimming and raise ValueError for blank emails. |
| Files | user_service.py |
| Why this test | Tests the agent's ability to add input validation logic based on a natural-language description. |
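A minimal sketch of the expected fix, assuming create_user() returns a simple record; the real user_service.py likely does more.

```python
def create_user(email: str) -> dict:
    """Trim the email and reject blank values, per the task description."""
    email = email.strip()
    if not email:
        raise ValueError("email must not be blank")
    return {"email": email}
```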
SWE-bench evaluates agents on real-world GitHub issues from open-source Python projects. Each agent must resolve the issue by editing the codebase so that the hidden test suite passes. Three independent axes are measured: correctness (did the patch resolve the issue?), duration (time to completion), and token usage (cost efficiency).
| Agent | Pass | Time | Tokens (in) | Tokens (out) |
|---|---|---|---|---|
| claude | 1/1 | 9m 8s | 4.9M | 32k |
| omc | 3/3 | 11m 45s | 3.0M | 12k |
| Metric | claude | omc |
|---|---|---|
| Pass@1 | 100.0% | 100.0% |
| Tokens/success | 31,899 | 4,061 |
| Duration/success | 9m 8s | 4m 19s |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 9m 8s | 4.9M | 4.9M | 32k |
| omc | — | — | — | — | — |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 4m 19s | 1.2M | 1.2M | 5k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 6m 13s | 1.2M | 1.2M | 4k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 1m 12s | 514k | 514k | 3k |