OpenBench Benchmark Report

Run ID: latest
Suites: practical, swe-bench
Agents: 4
Tasks: 20
Mode: native
Generated: 2026-04-12T09:20:56+00:00
| Environment | Value |
|---|---|
| cpu | x86_64 |
| memory_gb | 62.72 |
| os | Linux-6.8.0-106-generic-x86_64-with-glibc2.35 |
| python | 3.11.15 |
| Agent | Startup | Memory | Binary size |
|---|---|---|---|
| omc | 293.26 ms · score 76.64 | 192.21 MB · score 71.27 | 0.00005 MB · score 100.00 |
| omx | 399.67 ms · score 69.91 | 62.23 MB · score 95.76 | 0.00004 MB · score 100.00 |
| Agent | Successful tasks | Failed tasks | Total tasks |
|---|---|---|---|
| claude | 5 | 0 | 5 |
| codex | 5 | 0 | 5 |
| omc | 5 | 0 | 5 |
| omx | 5 | 0 | 5 |
Raw values remain visible alongside normalized scores. Lower values are better for startup, memory, and binary-size metrics. Binary-size currently reflects the resolved command path footprint, which may be a launcher wrapper rather than the full installation size.
Measures how quickly the agent CLI starts up and becomes responsive. Uses hyperfine to run `<command> --version` with warmup, reporting the mean execution time. Lower is better.
Measures peak memory consumption (maximum resident set size) during a single CLI invocation. Uses GNU time to record the peak RSS of the process. Lower is better.
Measures the on-disk footprint of the agent CLI entry point. Uses du to calculate the resolved command path size. Lower is better.
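The three measurements can be approximated with the Python standard library. A rough sketch only: the harness actually uses hyperfine, GNU time, and du, and here the current Python interpreter stands in for an agent CLI binary.

```python
import os
import resource
import subprocess
import sys
import time

def measure_cli(cmd_path: str = sys.executable):
    """Stdlib stand-ins for the startup, memory, and binary-size metrics.

    In the harness the command name would first be resolved on PATH
    (e.g. with shutil.which); here we default to the running interpreter.
    """
    path = os.path.realpath(cmd_path)

    # Startup: wall-clock time of one `<command> --version` run.
    # (hyperfine adds warmup runs and reports the mean of many runs.)
    start = time.perf_counter()
    subprocess.run([path, "--version"], capture_output=True, check=True)
    startup_ms = (time.perf_counter() - start) * 1000

    # Memory: max RSS of finished child processes, as GNU time reports it.
    # Note: ru_maxrss is kilobytes on Linux but bytes on macOS.
    max_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

    # Binary size: footprint of the resolved command path, which may be
    # a small launcher wrapper rather than the full installation.
    size_bytes = os.path.getsize(path)

    return startup_ms, max_rss, size_bytes
```

All three values are lower-is-better, matching the scoring convention above.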
Each agent receives the same prompt and works in an identical sandboxed environment. Three independent axes are measured: correctness (did the task pass?), duration (time to completion), and token usage (cost efficiency). Lower duration and fewer tokens are better, given equal correctness.
| Agent | Pass | Time | Tokens (in) | Tokens (out) |
|---|---|---|---|---|
| claude | 5/5 | 1m 33s | 834k | 2k |
| codex | 5/5 | 3m 12s | 767k | 8k |
| omc | 5/5 | 1m 56s | 880k | 3k |
| omx | 5/5 | 4m 39s | 1.9M | 10k |
| Metric | claude | codex | omc | omx |
|---|---|---|---|---|
| Pass@1 | 100.0% | 100.0% | 100.0% | 100.0% |
| Tokens/success | 492 | 154,876 | 575 | 374,276 |
| Duration/success | 20.2s | 40.6s | 22.3s | 54.3s |
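The report does not spell out how the per-success metrics above are derived. A minimal sketch, assuming they are simple run-level totals divided by the number of successful runs (the function name and sample numbers are illustrative, not from the harness):

```python
def per_success(total: float, successes: int) -> float:
    """Average a run-level total over successful runs; inf if none passed."""
    return total / successes if successes else float("inf")

# Illustrative numbers only, not the report's exact inputs:
durations = [12.0, 15.0, 20.0, 21.0, 25.0]   # seconds, one per task
mean_duration = per_success(sum(durations), 5)
```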
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 12.5s | 131k | 131k | 241 |
| codex | PASS | 27.8s | 122k | 101k | 1k |
| omc | PASS | 14.2s | 131k | 131k | 246 |
| omx | PASS | 37.0s | 257k | 230k | 1k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 15.4s | 176k | 176k | 467 |
| codex | PASS | 34.7s | 149k | 132k | 2k |
| omc | PASS | 21.3s | 176k | 176k | 471 |
| omx | PASS | 1m 24s | 572k | 543k | 3k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 20.2s | 176k | 176k | 467 |
| codex | PASS | 40.6s | 174k | 152k | 2k |
| omc | PASS | 33.0s | 221k | 221k | 742 |
| omx | PASS | 54.3s | 359k | 315k | 2k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 20.3s | 176k | 176k | 542 |
| codex | PASS | 44.5s | 122k | 82k | 1k |
| omc | PASS | 22.3s | 176k | 176k | 560 |
| omx | PASS | 43.4s | 315k | 296k | 2k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 24.4s | 176k | 176k | 692 |
| codex | PASS | 44.4s | 200k | 178k | 2k |
| omc | PASS | 25.2s | 176k | 176k | 803 |
| omx | PASS | 1m 0s | 359k | 345k | 3k |
Each task is a self-contained Python fixture with a deliberate bug or missing feature. Agents receive only a task description and the list of editable files. Tests are hidden and applied only during evaluation.
Easy Bug Fix
| Field | Description |
|---|---|
| Bug | The add_numbers() function uses subtraction (-) instead of addition (+). |
| Goal | Fix the operator so the calculator returns correct sums. |
| Files | calculator.py |
| Why this test | Tests the most basic capability: reading code, spotting a single-character bug, and making a minimal fix. |
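A hypothetical sketch of what this fixture and its fix might look like; the actual calculator.py contents are not shown in this report.

```python
# Buggy version shipped to the agent (sketch):
def add_numbers(a, b):
    return a - b  # deliberate bug: subtraction instead of addition

# The minimal one-character fix the hidden tests expect:
def add_numbers_fixed(a, b):
    return a + b
```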
Easy Bug Fix
| Field | Description |
|---|---|
| Bug | The slugify() function replaces spaces with underscores (_) but tests expect hyphens (-). |
| Goal | Change the replacement character from underscore to hyphen. |
| Files | text_utils.py |
| Why this test | Tests whether the agent can infer the expected behavior from context without seeing the test file. |
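A sketch of the described bug and fix; the real slugify() may also normalize punctuation, which this minimal version ignores.

```python
# Buggy version (sketch): spaces become underscores.
def slugify(text):
    return text.strip().lower().replace(" ", "_")

# Expected behavior: spaces become hyphens.
def slugify_fixed(text):
    return text.strip().lower().replace(" ", "-")
```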
Easy Feature
| Field | Description |
|---|---|
| Bug | load_timeout() only reads timeout_ms but the new schema uses timeout_seconds. |
| Goal | Support both timeout_seconds (new) and timeout_ms (legacy) fields with backward compatibility. |
| Files | config_loader.py |
| Why this test | Tests the agent's ability to add a feature while preserving existing behavior — a common real-world pattern. |
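One way the backward-compatible version could look. The return unit (seconds) and the precedence of the new field over the legacy one are assumptions; the report does not show config_loader.py.

```python
def load_timeout(config: dict) -> float:
    """Return the timeout in seconds, accepting both schema versions."""
    if "timeout_seconds" in config:           # new schema takes precedence
        return float(config["timeout_seconds"])
    if "timeout_ms" in config:                # legacy field, converted
        return config["timeout_ms"] / 1000.0
    raise KeyError("no timeout field present")
```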
Medium Debug
| Field | Description |
|---|---|
| Bug | Two files import from helpers.math_ops, but the module was moved to utils.math_ops. |
| Goal | Update imports in both app.py and report.py to use the new path. |
| Files | app.py, report.py |
| Why this test | Tests multi-file awareness: the agent must find and fix the same broken import in two separate files. |
Easy Feature
| Field | Description |
|---|---|
| Bug | create_user() accepts any email including blank strings without validation. |
| Goal | Add email trimming and raise ValueError for blank emails. |
| Files | user_service.py |
| Why this test | Tests the agent's ability to add input validation logic based on a natural-language description. |
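A minimal sketch of the expected fix, assuming create_user() returns a simple record; the real user_service.py likely does more.

```python
def create_user(email: str) -> dict:
    """Trim the email and reject blank values, per the task description."""
    email = email.strip()
    if not email:
        raise ValueError("email must not be blank")
    return {"email": email}
```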
SWE-bench evaluates agents on real-world GitHub issues from open-source Python projects. Each agent must resolve the issue by editing the codebase so that the hidden test suite passes. Three independent axes are measured: correctness (did the patch resolve the issue?), duration (time to completion), and token usage (cost efficiency).
| Agent | Pass | Time | Tokens (in) | Tokens (out) |
|---|---|---|---|---|
| claude | 1/1 | 9m 8s | 4.9M | 32k |
| omc | 3/3 | 11m 45s | 3.0M | 12k |
| Metric | claude | omc |
|---|---|---|
| Pass@1 | 100.0% | 100.0% |
| Tokens/success | 31,899 | 4,061 |
| Duration/success | 9m 8s | 4m 19s |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | PASS | 9m 8s | 4.9M | 4.9M | 32k |
| omc | — | — | — | — | — |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 4m 19s | 1.2M | 1.2M | 5k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 6m 13s | 1.2M | 1.2M | 4k |
| Agent | Result | Duration | Total in | Cached | Output |
|---|---|---|---|---|---|
| claude | — | — | — | — | — |
| omc | PASS | 1m 12s | 514k | 514k | 3k |