A Free Open-Source Coding Model From DeepReinforce Just Outscored Claude Opus 4.7 on Two Benchmarks

DeepReinforce's new Ornith-1.0 family of open-source coding models is built for autonomous agents rather than chat, and it beats Claude Opus 4.7 on two coding benchmarks, though it is aimed strictly at developers already running agent infrastructure.

An AI research lab called DeepReinforce, the team behind the earlier CUDA-L1 project and the IterX code-agent optimization loop, quietly shipped Ornith-1.0 late last week. It is not a single model but a family of open-source coding models, now live on Hugging Face in four sizes measured by parameter count: a 9-billion-parameter dense version, a 31-billion-parameter dense version, a 35-billion-parameter mixture-of-experts (MoE) version, and a 397-billion-parameter MoE flagship. Every one of them ships under an MIT license with no regional restrictions attached.

Parameters are the numerical weights a model adjusts while it learns. As a rule, the more parameters a model carries, the more capable it tends to be. A 9-billion-parameter model counts as small. It is light enough to run on a decent smartphone, yet it cannot be trusted with genuinely heavy reasoning. The 397-billion-parameter flagship is far more powerful, but it demands serious computing muscle, the sort of hardware you will not find in a consumer laptop.

What "agentic" actually means here
The lab calls Ornith "a self-improving family of open-source models specially for agentic coding tasks." That single word, agentic, carries most of the weight. The launch note put it plainly: Ornith-1.0 covers the full span of sizes, from 9B Dense and 31B Dense to 35B MoE and the 397B MoE, and claims state-of-the-art results among open-source models of similar size.

Most of the AI people deal with day to day is conversational. You type something, it answers, and the exchange is over. Agentic AI works differently. It receives a task and then takes its own actions to finish it, without a person steering every step. In a coding setting, that looks like an AI that opens files, runs the tests, works out what broke, rewrites the code, and goes around the loop again until the job is actually done.
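To make that loop concrete, here is a minimal Python sketch of the test, diagnose, rewrite cycle. Everything in it is illustrative: `run_tests`, `agent_loop` and `toy_proposer` are hypothetical names, and the "model" is a hard-coded stand-in rather than anything Ornith actually does.

```python
from typing import Callable, Optional, Tuple


def run_tests(source: str) -> Optional[str]:
    """Execute the candidate code and return an error message, or None if it passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:  # any failure becomes feedback for the next attempt
        return f"{type(exc).__name__}: {exc}"
    return None


def agent_loop(
    propose_fix: Callable[[str, str], str],
    source: str,
    max_steps: int = 5,
) -> Tuple[str, Optional[int]]:
    """Run the tests, hand failures to the proposer, and repeat within a step budget."""
    for step in range(1, max_steps + 1):
        error = run_tests(source)
        if error is None:
            return source, step  # tests are green: the job is done
        source = propose_fix(source, error)
    return source, None  # budget exhausted without a confirmed pass


def toy_proposer(source: str, error: str) -> str:
    """Hypothetical stand-in for a model call: patches the one known bug."""
    return source.replace("a - b", "a + b")


buggy = "def add(a, b):\n    return a - b\n"
fixed, steps = agent_loop(toy_proposer, buggy)
print(steps)  # 2: one failing run, one passing run
```

A real agent would replace `toy_proposer` with a model call and `run_tests` with the project's own test runner, but the control flow stays the same: no human input is needed between iterations.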

In other words, nobody has to sit at the keyboard for most of the process, and that is the entire point. It is also where the most commercially meaningful progress is landing in 2026. A model that can grind unsupervised through a 20-step development workflow is simply worth more than one that writes a tidy function when you ask.

Letting the model build its own playbook
The catch is that most large language models are still built around human input, and coding agents are no exception. Most come bolted to a human-designed harness, a fixed rulebook that dictates how the agent should organize its work: when to reach for a tool, how to react to an error, how to break a multi-step problem into pieces. Ornith takes a different route. It "treats the scaffold as a learnable object that co-evolves with the policy." Put simply, instead of borrowing someone else's playbook, it writes its own.

This happens during reinforcement learning, where every training step splits into two stages. First the model reads the task and drafts a sharpened strategy for tackling it. Then it follows that strategy to produce an actual solution. Crucially, the reward from the final outcome feeds back into both stages, so the model learns to write better strategies, not just better code. Repeat that loop thousands and then millions of times, and task-specific approaches start to surface on their own, with no engineer hand-crafting them.
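The idea of one reward updating both stages can be illustrated with a toy policy-gradient example. This is a generic REINFORCE sketch under heavy assumptions, not DeepReinforce's training code: the two "strategies", the two "solutions", the toy verifier and all names are invented for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    strategy_logits = [0.0, 0.0]                # stage 1: which strategy to draft
    solution_logits = [[0.0, 0.0], [0.0, 0.0]]  # stage 2: solution, given the strategy

    for _ in range(steps):
        # Stage 1: sample a strategy.
        p_s = softmax(strategy_logits)
        s = rng.choices(range(2), weights=p_s)[0]
        # Stage 2: sample a solution conditioned on that strategy.
        p_a = softmax(solution_logits[s])
        a = rng.choices(range(2), weights=p_a)[0]

        # Toy verifier (assumption): only strategy 1 followed by solution 1 succeeds.
        reward = 1.0 if (s == 1 and a == 1) else 0.0

        # The same outcome reward updates BOTH stages.
        # Gradient of log-probability w.r.t. logits is one_hot(choice) - probs.
        for i in range(2):
            strategy_logits[i] += lr * reward * (float(i == s) - p_s[i])
            solution_logits[s][i] += lr * reward * (float(i == a) - p_a[i])

    return softmax(strategy_logits), softmax(solution_logits[1])


p_strategy, p_solution = train()
print(p_strategy[1] > 0.9, p_solution[1] > 0.9)  # True True
```

Because the reward reaches the strategy logits as well as the solution logits, the toy policy learns which plan to pick and how to execute it, which is the behavior the two-stage description points at.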

Guarding against reward hacking
DeepReinforce treats reward hacking as a real danger. If a model is allowed to write its own training scaffold, it could in theory build one that cheats the verifier, say by touching a file so a task looks finished when no real work happened. Three layers stand in the way. The environment and the test suite are locked and kept out of the model's reach. A deterministic monitor raises a flag the moment anything tries to reach restricted paths or tamper with the verification scripts. And a frozen judge model sits above the automated verifier with veto power.
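A rough Python sketch shows how those three layers could combine into a single reward decision. The restricted directories, function names and the zero-reward policy are assumptions made for illustration; the source does not describe the actual implementation.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical locked locations: the test suite and the verification scripts.
RESTRICTED = (
    PurePosixPath("/workspace/tests"),
    PurePosixPath("/workspace/verify"),
)


def is_restricted(path: str) -> bool:
    """Deterministic check: does this path fall inside a locked directory?"""
    # Normalize first so "../" tricks cannot slip past the check.
    p = PurePosixPath(posixpath.normpath(path))
    return any(p == r or r in p.parents for r in RESTRICTED)


def monitor_flags(accessed_paths):
    """Return every accessed path that touches a restricted location."""
    return [p for p in accessed_paths if is_restricted(p)]


def final_reward(verifier_passed: bool, accessed_paths, judge_vetoed: bool) -> float:
    """Combine the layers: a monitor flag or a judge veto cancels any reward."""
    if monitor_flags(accessed_paths):
        return 0.0  # tampering attempt, regardless of the test outcome
    if judge_vetoed:
        return 0.0  # frozen judge overrides the automated verifier
    return 1.0 if verifier_passed else 0.0


honest = final_reward(True, ["/workspace/src/app.py"], judge_vetoed=False)
cheating = final_reward(True, ["/workspace/src/../tests/test_app.py"], judge_vetoed=False)
print(honest, cheating)  # 1.0 0.0
```

The point of a rule-based monitor is that it cannot be argued with: the same path always produces the same verdict, so the model has nothing to learn by probing it.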

The benchmark numbers
The 397 billion parameter flagship scores 82.4 on SWE-bench Verified. That test hands an AI a real bug pulled from an open-source GitHub repository and asks it to fix the problem without ever seeing the test suite, then scores it on the share of issues it actually resolves. That 82.4 edges past Claude Opus 4.7 at 80.8 and DeepSeek-V4-Pro at 80.6 on the very same test. On Terminal Bench 2.1, which runs 89 tasks inside containerized terminal environments spanning everything from debugging async code to closing security vulnerabilities and grades on completion rate, Ornith posts 77.5 against Claude Opus 4.7's 70.3.
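For a sense of scale, the margins can be worked out in a few lines of Python. The scores are the ones quoted above; the conversion to a task count assumes each score is a simple single-run completion percentage over 89 tasks, which the source does not confirm.

```python
# Scores as reported in the article.
scores = {
    "SWE-bench Verified": {
        "Ornith-1.0-397B": 82.4,
        "Claude Opus 4.7": 80.8,
        "DeepSeek-V4-Pro": 80.6,
    },
    "Terminal Bench 2.1": {
        "Ornith-1.0-397B": 77.5,
        "Claude Opus 4.7": 70.3,
    },
}


def margin(benchmark: str, model: str, rival: str) -> float:
    """Lead of `model` over `rival` in percentage points."""
    return round(scores[benchmark][model] - scores[benchmark][rival], 1)


def approx_tasks(score: float, total_tasks: int = 89) -> int:
    """Rough task-count equivalent (assumes a single-run completion rate)."""
    return round(score / 100 * total_tasks)


print(margin("SWE-bench Verified", "Ornith-1.0-397B", "Claude Opus 4.7"))  # 1.6
print(margin("Terminal Bench 2.1", "Ornith-1.0-397B", "Claude Opus 4.7"))  # 7.2
print(approx_tasks(77.5) - approx_tasks(70.3))  # 6
```

So the SWE-bench Verified lead is under two points, while the Terminal Bench gap works out to roughly six more tasks completed out of 89 under that assumption.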

There is a contamination worry hanging over SWE-bench. Earlier this year OpenAI argued that some models were padding their scores by memorizing benchmark answers they had seen during training. To address that, Ornith also publishes results on SWE-bench Pro, a tougher variant built on more varied, less-leaked codebases and scored the same way. There the 397 billion model lands at 62.2. That is noticeably lower than its SWE-bench Verified score, but still competitive with the field and still ahead of DeepSeek-V4-Pro.

The 9 billion model may be the more striking result. It puts up 69.4 on SWE-bench Verified, beating Gemma 4-31B's 52 and running close to Qwen 3.5-35B's 70, even though it is three to four times smaller than those rivals.
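The "three to four times smaller" figure follows directly from the headline parameter counts, as this small Python check shows. It compares total parameters only, taken from the model names quoted above.

```python
ornith_small_b = 9  # Ornith-1.0 9B, in billions of parameters

# Rival sizes in billions, taken from the model names.
rivals_b = {"Gemma 4-31B": 31, "Qwen 3.5-35B": 35}

# How many times larger each rival is than the 9B model.
ratios = {name: round(size / ornith_small_b, 2) for name, size in rivals_b.items()}
print(ratios)  # {'Gemma 4-31B': 3.44, 'Qwen 3.5-35B': 3.89}
```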

Who this is actually for
Ornith-1.0 is deliberately not a general-purpose AI, and the model's own documentation admits it may stumble on anything outside agentic coding. If you want help summarizing a document, writing a doctoral thesis, or drafting an email, this is the wrong tool. It is tuned for a narrow job: developer pipelines where an AI agent takes a task description, works inside a code repository or terminal session, and finishes multi-step work on its own. It was built for people already running agent infrastructure, not for someone still deciding whether AI is worth the trouble.

The "beats Claude" angle is genuine, but it needs framing. Every lab is now racing to win on agentic coding evals, because that is where the useful performance gaps actually show up. Ornith-1.0-397B does clear Claude Opus 4.7 on both coding benchmarks, yet Anthropic's current flagship, Claude Opus 4.8, scores higher. The comparison that really holds up is within the open-source category, at comparable parameter counts, on coding-specific agent tasks. For developers building self-hosted coding pipelines, agentic infrastructure, or similar work, the small and medium models running on edge hardware could prove genuinely useful. For the average user, though, the answer probably lies somewhere else.

What this means for you
• For developers: If you already run agentic coding pipelines, the free MIT-licensed 9B and 31B models can run on edge hardware and may genuinely speed up self-hosted development work.
• For everyday AI users: Ornith is not tuned for writing emails, summaries or essays, and its own documentation warns it may underperform there, so for general tasks you are better off sticking with a conversational assistant.

Questions & Answers

1. What is Ornith-1.0?
It is a family of open-source coding models built by DeepReinforce, made specifically for agentic coding tasks and available on Hugging Face.

2. How many models are in the family and what sizes?
There are four sizes: 9 billion, 31 billion, 35 billion MoE and a 397 billion MoE flagship. All ship under an MIT license with no regional restrictions.

3. Is it better than Claude?
The 397 billion flagship scores 82.4 on SWE-bench Verified and 77.5 on Terminal Bench 2.1, beating Claude Opus 4.7, but Anthropic's current flagship Claude Opus 4.8 scores higher.

4. How good is the 9 billion model?
It scores 69.4 on SWE-bench Verified, higher than Gemma 4-31B's 52 and close to Qwen 3.5-35B's 70, despite being three to four times smaller.

5. Can I use it for writing emails or summarizing documents?
It is not recommended. The model's own documentation says it may underperform on tasks outside agentic coding, so it is the wrong pick for those jobs.

6. Who is this model built for?
It is built for developers running self-hosted coding pipelines and agentic infrastructure, not for the average user.

https://trendkia.com/en/ai/deepreinforce-ke-nae-muphta-modala-ornith-1-0-ne-do-benchamarka-para-claude-opus-4-7-ko-pichhe-chhora-3655
TrendKia: Har trend, sabse pehle (Every trend, first).