OpenBench Benchmark Report

Run ID: latest

Suites: practical, swe-bench
Agents: 4
Practical tasks: 20
Runtime mode: n/a
Practical mode: native
Timestamp: 2026-04-12T09:20:56+00:00

Environment
cpu: x86_64
memory_gb: 62.72
os: Linux-6.8.0-106-generic-x86_64-with-glibc2.35
python: 3.11.15

Runtime comparison

Agent  Startup                  Memory                   Binary size
omc    293.26 ms · score 76.64  192.21 MB · score 71.27  0.00005 MB · score 100.00
omx    399.67 ms · score 69.91  62.23 MB · score 95.76   0.00004 MB · score 100.00

Runtime agent cards

oh-my-claudecode

omc

  • Startup OK
    Raw: 293.26 ms
    Normalized score: 76.64
  • Memory OK
    Raw: 192.21 MB
    Normalized score: 71.27
  • Binary size OK
    Raw: 0.00005 MB
    Normalized score: 100.00

oh-my-codex

omx

  • Startup OK
    Raw: 399.67 ms
    Normalized score: 69.91
  • Memory OK
    Raw: 62.23 MB
    Normalized score: 95.76
  • Binary size OK
    Raw: 0.00004 MB
    Normalized score: 100.00

Practical task summary

mode: native
Agent   Successful tasks  Failed tasks  Total tasks
claude  5                 0             5
codex   5                 0             5
omc     5                 0             5
omx     5                 0             5

Practical task cards

Claude Code (native)

claude · 5/5 passed

  • single-file-bug-fix SUCCESS
    Fix the single-file arithmetic bug so the calculator tests pass.
    Changed files: calculator.py
    Duration: 12.5s
    Tokens: 131,058 in / 131,050 cached / 241 out
  • failing-unit-test-repair SUCCESS
    Repair the slugify implementation so the failing unit tests pass.
    Changed files: text_utils.py
    Duration: 15.4s
    Tokens: 175,674 in / 175,663 cached / 467 out
  • config-schema-migration SUCCESS
    Migrate timeout loading to support the new schema while preserving backward compatibility.
    Changed files: config_loader.py
    Duration: 20.2s
    Tokens: 175,569 in / 175,558 cached / 467 out
  • multi-file-import-repair SUCCESS
    Repair the broken imports after the math_ops module was moved.
    Changed files: app.py, report.py
    Duration: 20.3s
    Tokens: 175,877 in / 175,866 cached / 542 out
  • validation-error-handling-patch SUCCESS
    Add the missing email validation and trimming behavior.
    Changed files: user_service.py
    Duration: 24.4s
    Tokens: 175,893 in / 175,882 cached / 692 out

Codex CLI (native)

codex · 5/5 passed

  • single-file-bug-fix SUCCESS
    Fix the single-file arithmetic bug so the calculator tests pass.
    Changed files: calculator.py
    Duration: 27.8s
    Tokens: 121,980 in / 100,864 cached / 1,000 out
  • failing-unit-test-repair SUCCESS
    Repair the slugify implementation so the failing unit tests pass.
    Changed files: text_utils.py
    Duration: 34.7s
    Tokens: 148,964 in / 131,584 cached / 1,756 out
  • config-schema-migration SUCCESS
    Migrate timeout loading to support the new schema while preserving backward compatibility.
    Changed files: config_loader.py
    Duration: 40.6s
    Tokens: 174,048 in / 152,320 cached / 1,602 out
  • multi-file-import-repair SUCCESS
    Repair the broken imports after the math_ops module was moved.
    Changed files: app.py, report.py
    Duration: 44.5s
    Tokens: 121,834 in / 82,048 cached / 1,360 out
  • validation-error-handling-patch SUCCESS
    Add the missing email validation and trimming behavior.
    Changed files: user_service.py
    Duration: 44.4s
    Tokens: 199,933 in / 178,048 cached / 1,905 out

oh-my-claudecode

omc · 5/5 passed

  • single-file-bug-fix SUCCESS
    Fix the single-file arithmetic bug so the calculator tests pass.
    Changed files: calculator.py
    Duration: 14.2s
    Tokens: 131,061 in / 131,053 cached / 246 out
  • failing-unit-test-repair SUCCESS
    Repair the slugify implementation so the failing unit tests pass.
    Changed files: text_utils.py
    Duration: 21.3s
    Tokens: 175,699 in / 175,688 cached / 471 out
  • config-schema-migration SUCCESS
    Migrate timeout loading to support the new schema while preserving backward compatibility.
    Changed files: config_loader.py
    Duration: 33.0s
    Tokens: 221,296 in / 221,282 cached / 742 out
  • multi-file-import-repair SUCCESS
    Repair the broken imports after the math_ops module was moved.
    Changed files: app.py, report.py
    Duration: 22.3s
    Tokens: 176,123 in / 176,112 cached / 560 out
  • validation-error-handling-patch SUCCESS
    Add the missing email validation and trimming behavior.
    Changed files: user_service.py
    Duration: 25.2s
    Tokens: 175,797 in / 175,786 cached / 803 out

oh-my-codex

omx · 5/5 passed

  • single-file-bug-fix SUCCESS
    Fix the single-file arithmetic bug so the calculator tests pass.
    Changed files: calculator.py
    Duration: 37.0s
    Tokens: 256,535 in / 229,504 cached / 1,291 out
  • failing-unit-test-repair SUCCESS
    Repair the slugify implementation so the failing unit tests pass.
    Changed files: text_utils.py
    Duration: 1m 24s
    Tokens: 571,645 in / 543,104 cached / 2,870 out
  • config-schema-migration SUCCESS
    Migrate timeout loading to support the new schema while preserving backward compatibility.
    Changed files: config_loader.py
    Duration: 54.3s
    Tokens: 359,063 in / 315,136 cached / 2,184 out
  • multi-file-import-repair SUCCESS
    Repair the broken imports after the math_ops module was moved.
    Changed files: app.py, report.py
    Duration: 43.4s
    Tokens: 314,753 in / 295,680 cached / 1,500 out
  • validation-error-handling-patch SUCCESS
    Add the missing email validation and trimming behavior.
    Changed files: user_service.py
    Duration: 1m 0s
    Tokens: 358,975 in / 344,960 cached / 2,568 out

Notes

Raw values remain visible alongside normalized scores. Lower values are better for the startup, memory, and binary-size metrics. The binary-size metric currently reflects the on-disk size of the resolved command path, which may be a launcher wrapper rather than the full installation.

Measures how quickly the agent CLI starts up and becomes responsive. Uses hyperfine to run `<command> --version` with warmup, reporting the mean execution time. Lower is better.
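A minimal sketch of how such a startup measurement could be scripted, assuming hyperfine is on PATH. The `--warmup` and `--export-json` flags are real hyperfine options, but the warmup count of 3 and the helper names are assumptions for illustration, not the harness's actual settings.

```python
import json
import os
import subprocess
import tempfile


def hyperfine_args(command: str, warmup: int, json_path: str) -> list[str]:
    """Build a hyperfine invocation that benchmarks `<command> --version`."""
    return [
        "hyperfine",
        "--warmup", str(warmup),        # warmup count is an assumption
        "--export-json", json_path,
        f"{command} --version",
    ]


def mean_ms_from_report(report: dict) -> float:
    """hyperfine's JSON export stores times in seconds; convert the mean to ms."""
    return report["results"][0]["mean"] * 1000.0


def measure_startup_ms(command: str, warmup: int = 3) -> float:
    """Run hyperfine and return the mean wall-clock time in milliseconds."""
    with tempfile.TemporaryDirectory() as tmp:
        json_path = os.path.join(tmp, "startup.json")
        subprocess.run(
            hyperfine_args(command, warmup, json_path),
            check=True,
            stdout=subprocess.DEVNULL,
        )
        with open(json_path) as fh:
            return mean_ms_from_report(json.load(fh))
```

Calling `measure_startup_ms("omc")` would return a raw value comparable to the figures in this report; how the raw value is turned into a normalized score is not described here, so it is left out.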

Startup

omc 293.26 ms · score 76.64
omx 399.67 ms · score 69.91

oh-my-claudecode OK

Raw value 293.26 ms
Normalized score 76.64

oh-my-codex OK

Raw value 399.67 ms
Normalized score 69.91

Measures peak memory consumption (Maximum Resident Set Size) during a single CLI invocation. Uses GNU time to track memory allocation. Lower is better.
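A rough sketch of reading the peak RSS from GNU time's verbose output. It assumes GNU time is installed at `/usr/bin/time`, that the measured invocation is `<command> --version` (the report only says "a single CLI invocation"), and that MB means 1024 kB; the function names are hypothetical.

```python
import re
import subprocess

# GNU time -v prints a line such as:
#   Maximum resident set size (kbytes): 196823
_MAX_RSS_RE = re.compile(r"Maximum resident set size \(kbytes\):\s*(\d+)")


def parse_max_rss_mb(time_output: str) -> float:
    """Extract the peak RSS from `time -v` output and convert kB to MB."""
    match = _MAX_RSS_RE.search(time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1)) / 1024.0


def measure_peak_memory_mb(command: str) -> float:
    """Run the command under GNU time; its report is written to stderr."""
    proc = subprocess.run(
        ["/usr/bin/time", "-v", command, "--version"],  # assumed invocation
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    return parse_max_rss_mb(proc.stderr)
```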

Memory

omc 192.21 MB · score 71.27
omx 62.23 MB · score 95.76

oh-my-claudecode OK

Raw value 192.21 MB
Normalized score 71.27

oh-my-codex OK

Raw value 62.23 MB
Normalized score 95.76

Measures the on-disk footprint of the agent CLI entry point. Uses du to calculate the resolved command path size. Lower is better.
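The sketch below approximates this measurement in pure Python rather than calling du. It assumes that "resolved" means following symlinks to the final file and that MB means 1024 × 1024 bytes; `os.path.getsize` returns the apparent file size, which can differ from du's block-based figure. The helper names are hypothetical.

```python
import os
import shutil


def bytes_to_mb(size_bytes: int) -> float:
    """Convert a byte count to MB (assumed here to be 1024 * 1024 bytes)."""
    return size_bytes / (1024 * 1024)


def command_size_mb(command: str) -> float:
    """Return the apparent size of the file a command name resolves to."""
    path = shutil.which(command)
    if path is None:
        raise FileNotFoundError(f"{command!r} not found on PATH")
    resolved = os.path.realpath(path)  # follow symlinks (assumption)
    return bytes_to_mb(os.path.getsize(resolved))
```

A result of a few hundred-thousandths of a MB corresponds to a file of only a few dozen bytes, which fits the report's note that the resolved path may be a small launcher wrapper.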

Binary size

omc 0.00005 MB · score 100.00
omx 0.00004 MB · score 100.00

oh-my-claudecode OK

Raw value 0.00005 MB
Normalized score 100.00

oh-my-codex OK

Raw value 0.00004 MB
Normalized score 100.00

Each agent receives the same prompt and works in an identical sandboxed environment. Three independent axes are measured: correctness (did the task pass?), duration (time to completion), and token usage (cost efficiency). Lower duration and fewer tokens are better, given equal correctness.

Agent leaderboard

Agent   Pass  Time    Tokens (in)  Tokens (out)
claude  5/5   1m 33s  834k         2k
codex   5/5   3m 12s  767k         8k
omc     5/5   1m 56s  880k         3k
omx     5/5   4m 39s  1.9M         10k
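As a cross-check, the Time column is consistent with summing each agent's per-task durations from the practical task cards. The sketch below redoes that arithmetic; the durations shown as "1m 24s" and "1m 0s" are taken as 84.0 s and 60.0 s, which is an approximation because the report rounds them to whole seconds.

```python
# Per-task durations in seconds, copied from the practical task cards.
durations_s = {
    "claude": [12.5, 15.4, 20.2, 20.3, 24.4],
    "codex": [27.8, 34.7, 40.6, 44.5, 44.4],
    "omc": [14.2, 21.3, 33.0, 22.3, 25.2],
    "omx": [37.0, 84.0, 54.3, 43.4, 60.0],
}


def format_duration(seconds: float) -> str:
    """Render a duration as 'Xm Ys', rounded to the nearest whole second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


totals = {agent: format_duration(sum(values)) for agent, values in durations_s.items()}
print(totals)
```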

Results by difficulty

Easy

Metric            claude  codex    omc     omx
Pass@1            100.0%  100.0%   100.0%  100.0%
Tokens/success    492     154,876  575     374,276
Duration/success  20.2s   40.6s    22.3s   54.3s
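The Duration/success values match the median of each agent's five per-task durations. This is an observation from the numbers in this report, not a documented definition of the metric; the sketch below reproduces it with the standard library.

```python
from statistics import median

# Per-task durations in seconds, copied from the practical task cards
# ("1m 24s" and "1m 0s" are taken as 84.0 and 60.0).
durations_s = {
    "claude": [12.5, 15.4, 20.2, 20.3, 24.4],
    "codex": [27.8, 34.7, 40.6, 44.5, 44.4],
    "omc": [14.2, 21.3, 33.0, 22.3, 25.2],
    "omx": [37.0, 84.0, 54.3, 43.4, 60.0],
}

# With five values per agent, the median is the third value after sorting.
medians = {agent: median(values) for agent, values in durations_s.items()}
print(medians)
```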

Per-task breakdown

single-file-bug-fix: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    12.5s     131k      131k    241
codex   PASS    27.8s     122k      101k    1k
omc     PASS    14.2s     131k      131k    246
omx     PASS    37.0s     257k      230k    1k

failing-unit-test-repair: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    15.4s     176k      176k    467
codex   PASS    34.7s     149k      132k    2k
omc     PASS    21.3s     176k      176k    471
omx     PASS    1m 24s    572k      543k    3k

config-schema-migration: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    20.2s     176k      176k    467
codex   PASS    40.6s     174k      152k    2k
omc     PASS    33.0s     221k      221k    742
omx     PASS    54.3s     359k      315k    2k

multi-file-import-repair: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    20.3s     176k      176k    542
codex   PASS    44.5s     122k      82k     1k
omc     PASS    22.3s     176k      176k    560
omx     PASS    43.4s     315k      296k    2k

validation-error-handling-patch: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    24.4s     176k      176k    692
codex   PASS    44.4s     200k      178k    2k
omc     PASS    25.2s     176k      176k    803
omx     PASS    1m 0s     359k      345k    3k

Each task is a self-contained Python fixture with a deliberate bug or missing feature. Agents receive only a task description and the list of editable files. Tests are hidden and applied only during evaluation.

Single-file bug fix

Easy · Bug Fix

Bug: The add_numbers() function uses subtraction (-) instead of addition (+).
Goal: Fix the operator so the calculator returns correct sums.
Files: calculator.py
Why this test: Tests the most basic capability: reading code, spotting a single-character bug, and making a minimal fix.
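The fixture source is not reproduced in this report, so the following is only a hypothetical reconstruction of what the bug and its one-character fix might look like. The two-argument signature and the `_buggy` suffix are assumptions.

```python
def add_numbers_buggy(a, b):
    # Hypothetical pre-fix version: subtracts instead of adding.
    return a - b


def add_numbers(a, b):
    # Fixed version: the only change is the operator.
    return a + b
```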

Failing unit test repair

Easy · Bug Fix

Bug: The slugify() function replaces spaces with underscores (_) but tests expect hyphens (-).
Goal: Change the replacement character from underscore to hyphen.
Files: text_utils.py
Why this test: Tests whether the agent can infer the expected behavior from context without seeing the test file.
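A hypothetical sketch of the repaired function follows. Only the space-to-hyphen replacement comes from the task description; the trimming and lowercasing steps are assumptions about what a typical slugify does and may not match the fixture.

```python
def slugify(text: str) -> str:
    # Trimming and lowercasing are assumed; the task only specifies
    # that spaces become hyphens rather than underscores.
    return text.strip().lower().replace(" ", "-")
```

For instance, `slugify("Hello World")` yields `hello-world` under these assumptions.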

Config schema migration

Easy · Feature

Bug: load_timeout() only reads timeout_ms but the new schema uses timeout_seconds.
Goal: Support both timeout_seconds (new) and timeout_ms (legacy) fields with backward compatibility.
Files: config_loader.py
Why this test: Tests the agent's ability to add a feature while preserving existing behavior, a common real-world pattern.
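One way the dual-field loading could look is sketched below. Several details are assumptions, since the report does not show the fixture: the config is taken to be a plain dict, the function returns seconds as a float, the new field wins when both are present, and a missing timeout raises KeyError.

```python
def load_timeout(config: dict) -> float:
    """Return the timeout in seconds, accepting the new or the legacy field."""
    if "timeout_seconds" in config:
        # New schema takes precedence (assumed).
        return float(config["timeout_seconds"])
    if "timeout_ms" in config:
        # Legacy schema: convert milliseconds to seconds.
        return config["timeout_ms"] / 1000.0
    raise KeyError("config has neither 'timeout_seconds' nor 'timeout_ms'")
```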

Multi-file import repair

Medium · Debug

Bug: Two files import from helpers.math_ops, but the module was moved to utils.math_ops.
Goal: Update imports in both app.py and report.py to use the new path.
Files: app.py, report.py
Why this test: Tests multi-file awareness: the agent must find and fix the same broken import in two separate files.
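The repair amounts to the same textual change in both files. As an illustration only, the sketch below applies it to source text with a regular expression; the imported name `add` in the usage example is hypothetical, and an agent would normally just edit the two import lines directly.

```python
import re

# Match 'from helpers.math_ops ...' or 'import helpers.math_ops ...'
# at the start of a line, leaving other mentions of the name untouched.
_OLD_IMPORT_RE = re.compile(
    r"^(\s*(?:from|import)\s+)helpers\.math_ops\b", re.MULTILINE
)


def repair_imports(source: str) -> str:
    """Rewrite imports of helpers.math_ops to point at utils.math_ops."""
    return _OLD_IMPORT_RE.sub(r"\g<1>utils.math_ops", source)
```

Applied to `from helpers.math_ops import add`, this returns `from utils.math_ops import add`.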

Validation and error handling patch

Easy · Feature

Bug: create_user() accepts any email including blank strings without validation.
Goal: Add email trimming and raise ValueError for blank emails.
Files: user_service.py
Why this test: Tests the agent's ability to add input validation logic based on a natural-language description.
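A minimal sketch of the described validation, assuming create_user() takes a name and an email and returns a dict. The signature, return shape, and error message are hypothetical; only the trimming and the ValueError for blank emails come from the task description.

```python
def create_user(name: str, email: str) -> dict:
    # Trim surrounding whitespace before validating.
    cleaned_email = email.strip()
    if not cleaned_email:
        # Blank or whitespace-only emails are rejected.
        raise ValueError("email must not be blank")
    return {"name": name, "email": cleaned_email}
```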

SWE-bench evaluates agents on real-world GitHub issues from open-source Python projects. Each agent must resolve the issue by editing the codebase so that the hidden test suite passes. Three independent axes are measured: correctness (did the patch resolve the issue?), duration (time to completion), and token usage (cost efficiency).

Agent leaderboard

Agent   Pass  Time     Tokens (in)  Tokens (out)
claude  1/1   9m 8s    4.9M         32k
omc     3/3   11m 45s  3.0M         12k

Results by difficulty

Hard

Metric            claude  omc
Pass@1            100.0%  100.0%
Tokens/success    31,899  4,061
Duration/success  9m 8s   4m 19s

Per-task breakdown

django__django-16560: all pass
Agent   Result  Duration  Total in  Cached  Output
claude  PASS    9m 8s     4.9M      4.9M    32k
omc

django__django-17087: all pass
Agent   Result  Duration  Total in  Cached  Output
claude
omc     PASS    4m 19s    1.2M      1.2M    5k

django__django-16493: all pass
Agent   Result  Duration  Total in  Cached  Output
claude
omc     PASS    6m 13s    1.2M      1.2M    4k

sympy__sympy-24213: all pass
Agent   Result  Duration  Total in  Cached  Output
claude
omc     PASS    1m 12s    514k      514k    3k