A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.
On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.
“We’re glad we built a benchmark that’s actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I like that. It levels the playing field.”
Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.
Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a measure of how well models can deal with real-world programming problems. But while SWE-Bench relies on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
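The article doesn’t detail the organizers’ actual pipeline, but the timed-entry idea is easy to illustrate: only issues opened after the round-one submission deadline are eligible for the test set, so no submitted model could have seen them during training. A minimal sketch, assuming the public GitHub search API via the requests library (the repo and token parameters are purely illustrative):

```python
# Minimal sketch (not the K Prize's actual pipeline) of a "contamination-free"
# collection step: keep only GitHub issues created after the model-submission
# deadline, so none of the test problems existed when models were frozen.
from datetime import datetime, timezone

import requests  # assumed dependency for calling the GitHub REST API

SUBMISSION_DEADLINE = datetime(2025, 3, 12, tzinfo=timezone.utc)  # round-one cutoff


def fetch_post_deadline_issues(repo: str, token: str) -> list[dict]:
    """Return issues in `repo` created strictly after the submission deadline."""
    query = f"repo:{repo} type:issue created:>{SUBMISSION_DEADLINE.date().isoformat()}"
    response = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 100},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["items"]
```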
The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or simply the challenge of collecting new issues from GitHub, but he expects the K Prize challenge to answer the question soon.
“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”
It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available. But with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.
“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”