What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests?

This is a useful question, but it is too narrow. Software development is iterative. Requirements change and edge cases appear. Old design decisions become constraints on new work. Code that passes today can still make the next change slower and more expensive, while also increasing risk.

The gap matters more as AI raises the volume of code change. When generation gets cheap, the real question shifts from ‘can the agent produce a working patch?’ to ‘what kind of codebase does repeated agent use create over time?’

Green tests can hide a worse codebase

The paper tracks two quality signals alongside correctness. Verbosity measures redundant or duplicated code. Structural erosion measures how much of a codebase’s complexity gets trapped inside functions that are already too complex.

files need to be touched for every feature. The software still works, but becomes more difficult to change.

The paper’s code-search example illustrates the issue. At first, the system only needs to find Python code using exact text or regular expressions. Later, it needs to handle more languages, match on code structure (AST matching), and even fix problems automatically.

If the initial design is too rigid and bakes in early assumptions, it might pass the first tests but struggle to accommodate the later, more complex requirements.
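As a rough illustration of the difference, the sketch below keeps the search loop separate from the matching strategy, so a later structural matcher could be added as one more function rather than another branch inside `search`. The names `Matcher`, `exact`, `regex`, and `search` are hypothetical and are not taken from the paper or its benchmark. It assumes Python 3.9 or later.

```python
import re
from typing import Callable, Iterator

# A matcher takes source text and yields (line_number, line) hits.
Matcher = Callable[[str], Iterator[tuple[int, str]]]


def exact(needle: str) -> Matcher:
    """Match lines that contain the literal text."""
    def run(text: str) -> Iterator[tuple[int, str]]:
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield number, line
    return run


def regex(pattern: str) -> Matcher:
    """Match lines where the regular expression is found."""
    compiled = re.compile(pattern)

    def run(text: str) -> Iterator[tuple[int, str]]:
        for number, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                yield number, line
    return run


def search(text: str, matcher: Matcher) -> list[tuple[int, str]]:
    """Run any matcher; this function does not change when matchers are added."""
    return list(matcher(text))
```

A design that instead hard-codes "if exact, else regex" inside the search loop would pass the same early tests, which is exactly why the tests alone cannot tell the two apart.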

The results are clear. None of the evaluated agents solved any problem end to end. The best strict solve rate was 17.2 percent, and by the final checkpoint strict solve rates fell to 0.5 percent. Across trajectories, verbosity rose in 89.8 percent of runs and structural erosion in 80 percent.

The comparison with human-maintained code is even more useful. Against 48 maintained Python repositories, agent-generated code was 2.2 times more verbose and more structurally eroded.

When the authors tracked 20 of those repositories over time, the human code was comparatively flat while the agent code kept worsening with each iteration.

A passing suite tells you the latest version satisfied known checks. It does not tell you whether the code is becoming more fragile or more expensive to extend.

AI tools to write and maintain tests, especially functional UI automation in tools like Playwright. That work follows the same pattern as the paper: the product changes, the test has to change, the next feature adds another branch, another selector, another exception, another helper.

The paper is about coding broadly, not automation test suites specifically, but the mechanism carries over. A test suite can also become verbose and structurally weak under repeated AI-assisted edits.

A degraded test suite is harder to notice than degraded product code. The pipeline can still be green and the suite can still look larger on paper. Coverage can appear to improve.

Meanwhile, the core asset might be degrading: brittle selectors, weak checks, copied test steps, oversized helper functions, and UI tests that are hard to fix and easy to doubt. Flakiness is obvious, but tests that assert very little or run very slowly might go unnoticed for a long time.

For QA leaders, that shifts the job. Quality assurance cannot stop at validating the latest output against today’s requirements. It also has to watch whether repeated change is damaging both the product and the test system that is supposed to protect it.

Prompting will not solve this by itself

The paper also tested whether better prompts could control the drift. They helped at the start, but not for long. Quality-aware prompts lowered initial verbosity and erosion. One anti-slop prompt cut initial verbosity by about a third on GPT-5.4.

The effect on the trajectory was minimal. Cleaner starting points still degraded at roughly the same rate, and the better-looking code did not reliably improve pass rates. In some cases, the prompts increased cost.

Many organizations treat prompting as a governance layer. While this helps, it is not enough. If the workflow keeps asking an agent to extend its own code under changing requirements, the organization still needs controls outside the prompt.

ID, access rights, money, or rules.

The same rule applies to tests. Review how AI-generated test code changes after several product iterations. Watch for suites that grow faster than their signal and UI tests that absorb behavior better covered at lower levels.

Also be aware of ‘self-healing’ maintenance that subtly lowers assertion strength. A larger suite doesn’t automatically mean better control.

Quality needs to move upstream. By the time a feature reaches final validation, some of the damage may already be baked into the path the system took to get there.

QA needs a voice earlier in the loop: in design constraints, review standards, regression strategy, and the definition of acceptable change quality for both product code and test code.

Ultimately, passing tests still matters, but as AI increases the volume of code change, the more useful question is whether each successful change leaves the codebase safer to extend or more dangerous to touch.

This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.

The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/pro/perspectives-how-to-submit
