AI-Generated Code Needs Refactoring, Say 76% of Developers

While nearly seven out of 10 developers surveyed use AI tools to write or tweak code each week, many still report hallucinations and other challenges.

May 8th, 2025 3:00pm by Lawrence E Hecht


Featured image by Mohamed Nohassi from Unsplash.

Claude, Supermaven and Cursor users are much more likely to say they had a positive experience than users of other AI-based developer tools. Users of Google Gemini, JetBrains AI and Meta’s Llama are much less likely to report having a positive experience with those technologies.

These are just some of the many findings we uncovered in “The 2025 State of Web Dev AI” report. The study also indicates that hallucinations, inaccuracies, lack of context and generally poor code quality are particularly challenging when using AI coding assistants and other AI-focused developer tools.

The report is based on more than 4,000 responses collected in February and March. Respondents were mostly web developers who had previously taken a survey about JavaScript, React and other topics conducted by Sacha Greif and the Devographics team. In addition to insights about AI adoption and challenges, these are some of the most relevant findings:

AI is an integral part of developers’ workflow. Fifty-nine percent agreed with that statement, compared to 26% who disagreed.
AI tools increase productivity. Fifty-nine percent agreed with that statement, compared to 20% who disagreed.
AI tools are regularly used. Sixty-nine percent use AI tools to generate or refactor code at least a few times a week.
Vibe coding is still in its infancy. Sixty-nine percent of respondents said that less than 25% of the code they produce is AI-generated. When asked what type of code they generate with AI, 80% cited helper functions, 57% frontend components, 51% documentation and comments, and 50% tests.
AI-generated code requires refactoring. Seventy-six percent of developers have to rewrite or refactor at least half of the outputted code before it’s ready to be used. Poor readability, variable renaming, excessive repetition and code that doesn’t work were among the reasons cited for why refactoring is necessary.

Models, Coding Assistants and IDEs

In the survey’s questions, each specific AI technology or tool was only included in one category, even though there is overlap in functionality. Unsurprisingly, 91% of respondents have used ChatGPT, with GitHub Copilot (71%), Claude (57%) and Google Gemini (55%) also seeing strong adoption.

Claude users are more likely to be satisfied than users of the other major players, with 65% feeling positive about the offering. Note that the positive/negative sentiment question was not required; if it were, that statistic would be even more impressive.

In comparison, only 37% of Google Gemini users are positive about the technology and 39% are positive about Llama. Since Gemini has seen several significant advances over the last year, perhaps sentiment will improve over the next 12 months.

After GitHub Copilot, Tabnine (17%), JetBrains AI (13%), Supermaven (10%) and Qodo (8%) are the most used coding assistants. Among these, Supermaven rates the best, with 66% of its users feeling positive about it. JetBrains AI ranks the lowest, with 28% having a favorable impression.

For AI-enabled integrated development environments (IDEs) and editors, Cursor is both the most used (33%) and most likely to have users volunteer that they feel positive about the product (55%). Zed is used by about half as many developers (17%), but only 36% of its users were positive about it.

Vercel’s v0 and Bolt are two other offerings that help with pair programming, used by 27% and 13%, respectively.

What Are the Pain Points?

Participants in the survey were asked a series of open-ended questions about the pain points associated with using different types of AI for development. Their answers were grouped into types of issues by keyword, using both automated and manual data processing. Hallucinations and inaccuracies were the most common challenges mentioned when using AI models and coding assistants. Here are two relevant quotes:

“Hallucination — even the best models require a bit of ‘babysitting,’ as they can be supremely confident they’re right.”
“I used Copilot+ when it was still paid, I didn’t continue my free trial. Copilot was constantly forcing its way into my flow, with 95% of the time being wrong suggestions.”
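
For readers curious how open-ended answers like these can be grouped by keyword, here is a minimal sketch in Python. It is an illustration only: the report does not publish its keyword lists or tooling, so the category names, the keywords, and the helper functions `tag_response` and `count_categories` are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists for illustration; the survey's real lists are not published.
# Each keyword is matched as a lowercase substring, so "hallucinat" covers
# "hallucination", "hallucinate" and "hallucinating".
CATEGORY_KEYWORDS = {
    "hallucinations": ["hallucinat", "wrong", "inaccura"],
    "context": ["context", "memory", "forget"],
    "intrusive suggestions": ["intrusive", "aggressive", "distracting", "forcing"],
}


def tag_response(text):
    """Return every category whose keywords appear in one free-text answer."""
    lowered = text.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def count_categories(responses):
    """Count how many answers mention each category (an answer can count for several)."""
    counts = Counter()
    for response in responses:
        counts.update(tag_response(response))
    return counts
```

Substring matching of this kind is crude. It misses synonyms and can mislabel answers, which is presumably why a manual pass accompanied the automated one.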

Challenges associated with context and memory limitations were mentioned most often by survey participants using AI-focused IDEs or editors. The issue ranked No. 2 for both models and coding assistants. Complaints were also lodged about intrusive suggestions when using both coding assistants and IDEs.

Here are a few relevant quotes:

“Context adjustment. After the first or second prompt, every AI model has been very hard to adjust as you refine or set more correctional parameters. It starts being more and more wrong on every answer, evolving the whole discussion to pure noise that at some point you just can’t fix anymore.”
“No deep integration into classic refactoring features. Renaming a file should cause checking all import statements. Same for moving files. Yet coding agents are not even aware of such context updates if I do that manually.”
“The way that these assistants ‘help’ you with tasks is often far too aggressive, and as someone with a learning disability, I find their constant insertion of different approaches and ideas as I’m working to be incredibly distracting.”
“Context loss is the biggest issue. It rarely happens in Cursor now, but when it does it’s super annoying. ChatGPT in the browser will just lose entire chunks of conversations, often in the middle of an existing conversation. Warp terminal is the worst for this, one accidental backspace or Ctrl-C and you’ve lost 30 minutes of back-and-forth context.”

Poor quality is the most common challenge when using these tools to generate code. For example, one respondent wrote:

“All AI code generation I’ve tried generates a big pile of unmaintainable and non-testable code that might work in the moment it is given for the specific interpretation that the AI had at that moment. I’ve never seen AI code that was even remotely close to something that could be shipped to production.”

Lawrence has generated actionable insights and reports about enterprise IT B2B markets and technology policy issues for over 25 years. He regularly works with clients to develop and analyze studies about open source ecosystems. In addition to his consulting work,...