Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests?
This is a useful question, but it is too narrow. Software development is iterative. Requirements change and edge cases appear. Old design decisions become constraints on new work. Code that passes today can still make the next change slower and more expensive, while also increasing risk.
The gap matters more as AI raises the volume of code change. When generation gets cheap, the real question shifts from ‘can the agent produce a working patch?’ to ‘what kind of codebase does repeated agent use create over time?’
A recent paper, SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Orlanski et al.), gets closer to that question than most benchmark work. Instead of scoring one-shot solutions, it makes agents extend their own prior code across 20 problems and 93 checkpoints.
Each checkpoint changes the specification. The agent does not start fresh and is not given an internal design to follow. It has to live with earlier choices.
This setup is closer to real development than most benchmark suites, because real teams inherit yesterday’s shortcuts.
Green tests can hide a worse codebase
The paper tracks two quality signals alongside correctness. Verbosity measures redundant or duplicated code. Structural erosion measures how much of a codebase’s complexity gets trapped inside functions that are already too complex.
Those failure modes are familiar to every engineering manager. A system can keep passing tests while more logic gets pushed into the same large functions and more special cases get bolted on. Every feature touches more files. The software still works, but it becomes harder to change.
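As a rough illustration of the structural-erosion idea, the sketch below estimates how much of a module's branching complexity sits inside functions that are already over a complexity threshold. This is not the paper's metric definition: the branch-counting rule, the threshold of 10, and the names `function_complexities` and `erosion_share` are assumptions made for this example.

```python
import ast

# Node types counted as decision points. This is a simplified stand-in for
# cyclomatic complexity, chosen for illustration only.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def function_complexities(source: str) -> list[tuple[str, int]]:
    """Return (function name, approximate complexity) for each function in source."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Branches inside nested functions are also counted toward the
            # enclosing function, which is acceptable for a rough signal.
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            results.append((node.name, 1 + branches))
    return results


def erosion_share(source: str, threshold: int = 10) -> float:
    """Fraction of total complexity held by functions above the threshold."""
    scores = [score for _, score in function_complexities(source)]
    total = sum(scores)
    if total == 0:
        return 0.0
    heavy = sum(score for score in scores if score > threshold)
    return heavy / total
```

Tracked per commit or per checkpoint, a rising share would suggest that new logic keeps landing in functions that were already hard to read, even while the tests stay green.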
The code-search problem in the benchmark illustrates the issue well. At first, the system only needs to find Python code by exact text or regular expression. Later, it has to handle more languages, match on code structure (AST matching), and even apply automatic fixes.
If the initial design is too rigid and bakes in early assumptions, it may pass the first checkpoints but struggle to absorb the later, more complex requirements.
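To make the design point concrete, here is a minimal sketch of a search tool that keeps each matching strategy behind a small interface, so a later requirement such as AST matching would be a new class rather than another branch in one growing function. It is not taken from the paper's problem statement; `Matcher`, `ExactMatcher`, `RegexMatcher`, `MATCHERS`, and `search` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Protocol


class Matcher(Protocol):
    """Anything that can report the 1-based line numbers it matches."""

    def find(self, text: str) -> list[int]: ...


@dataclass
class ExactMatcher:
    needle: str

    def find(self, text: str) -> list[int]:
        return [n for n, line in enumerate(text.splitlines(), start=1)
                if self.needle in line]


@dataclass
class RegexMatcher:
    pattern: str

    def find(self, text: str) -> list[int]:
        compiled = re.compile(self.pattern)
        return [n for n, line in enumerate(text.splitlines(), start=1)
                if compiled.search(line)]


# New strategies register here; search() itself does not change.
MATCHERS = {"exact": ExactMatcher, "regex": RegexMatcher}


def search(kind: str, query: str, text: str) -> list[int]:
    """Dispatch to the registered matcher for this kind of query."""
    return MATCHERS[kind](query).find(text)
```

The alternative, a single `search` function with an `if kind == ...` chain, would pass the same early tests while concentrating every later requirement in one place.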
The results are clear. None of the evaluated agents solved any problem end to end. The best strict solve rate was 17.2 percent, and by the final checkpoint strict solve rates fell to 0.5 percent. Across trajectories, verbosity rose in 89.8 percent of runs and structural erosion in 80 percent.
The comparison with human-maintained code is even more useful. Against 48 maintained Python repositories, agent-generated code was 2.2 times more verbose and more structurally eroded.
When the authors tracked 20 of those repositories over time, the human code was comparatively flat while the agent code kept worsening with each iteration.
A passing suite tells you the latest version satisfied known checks. It does not tell you whether the code is becoming more fragile or more expensive to extend.
Why this matters for QA
For QA leaders, there are two key takeaways. The first is obvious: AI-built product code can degrade under repeated change even while current tests stay green. Teams may read continued output as proof that the system is healthy. In reality, they may be accumulating future regression cost at higher speed.
The second is closer to home. QA teams are now using AI tools to write and maintain tests, especially functional UI automation in tools like Playwright. That work follows the same pattern as the paper: the product changes, the test has to change, the next feature adds another branch, another selector, another exception, another helper.
The paper is about coding broadly, not automation test suites specifically, but the mechanism carries over. A test suite can also become verbose and structurally weak under repeated AI-assisted edits.
A degraded test suite is harder to notice than degraded product code. The pipeline can still be green and the suite can still look larger on paper. Coverage can appear to improve.
Meanwhile, the core asset may be degrading: brittle selectors, weak assertions, copied test steps, oversized helper functions, and UI tests that are hard to fix and easy to doubt. Flakiness is obvious, but tests that verify very little or that run very slowly can go unnoticed for a long time.
For QA leaders, that shifts the job. Quality assurance cannot stop at validating the latest output against today’s requirements. It also has to watch whether repeated change is damaging both the product and the test system that is supposed to protect it.
Prompting will not solve this by itself
The paper also tested whether better prompts could control the drift. They helped at the start, but not for long. Quality-aware prompts lowered initial verbosity and erosion. One anti-slop prompt cut initial verbosity by about a third on GPT-5.4.
The improvement did not hold. Cleaner starting points still degraded at roughly the same rate, and the better-looking code did not reliably improve pass rates. In some cases, the prompts increased cost.
Many organizations treat prompting as a governance layer. While this helps, it is not enough. If the workflow keeps asking an agent to extend its own code under changing requirements, the organization still needs controls outside the prompt.
A better way to evaluate AI-assisted development
To manage AI-assisted development well, look past quick wins. Review the code after several rounds of change, not just after the first fix. Watch for functions that keep growing in complexity and for logic that gets duplicated.
Don’t confuse success on the current feature with confidence in long-term stability. Treat maintainability as a release risk, especially for systems that handle costs, user identity, access rights, payments, or business rules.
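One way to look at several rounds of change rather than a single fix is to record a quality metric at each iteration and summarize its direction. The sketch below assumes you already have such a series (for example, a duplication ratio per merge); `Drift` and `measure_drift` are hypothetical names, and the summary is deliberately simple.

```python
from dataclasses import dataclass


@dataclass
class Drift:
    start: float        # metric value at the first iteration
    end: float          # metric value at the latest iteration
    rising_steps: int   # iterations where the metric went up
    total_steps: int    # number of iteration-to-iteration transitions

    @property
    def worsened(self) -> bool:
        """True when the metric ended higher than it started."""
        return self.end > self.start


def measure_drift(series: list[float]) -> Drift:
    """Summarize how a higher-is-worse metric moved across iterations."""
    if len(series) < 2:
        raise ValueError("need at least two iterations to measure drift")
    rising = sum(later > earlier for earlier, later in zip(series, series[1:]))
    return Drift(series[0], series[-1], rising, len(series) - 1)
```

A review could then ask not only whether the latest change passed, but whether `worsened` has been true across the last several changes to the same area.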
The same rule applies to tests. Review how AI-generated test code changes after several product iterations. Watch for suites that grow faster than their signal and UI tests that absorb behavior better covered at lower levels.
Also be aware of ‘self-healing’ maintenance that subtly lowers assertion strength. A larger suite doesn’t automatically mean better control.
Quality needs to move upstream. By the time a feature reaches final validation, some of the damage may already be baked into the path the system took to get there.
QA needs a voice earlier in the loop: in design constraints, review standards, regression strategy, and the definition of acceptable change quality for both product code and test code.
Ultimately, passing tests still matters, but as AI increases the volume of code change, the more useful question is whether each successful change leaves the codebase safer to extend or more dangerous to touch.
This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.
The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/pro/perspectives-how-to-submit




