Code generation is one of the capabilities big tech companies most often tout when praising their artificial intelligence tools. It is routinely highlighted with every new language model behind chatbots such as ChatGPT, Gemini or Claude. However, a new study casts serious doubt on how useful these tools are in programmers' day-to-day work. The curious part is that the research set out to measure how much time AI saved developers, and ended up showing how much time it cost them instead.
Researchers at the American laboratory METR (an acronym for Model Evaluation and Threat Research) assembled a team of 16 developers, each with years of experience working on open-source repositories and, on average, a million lines of code written.
They were asked to complete 246 tasks related to maintaining those repositories, such as 'fixing bugs, adding features or doing refactors that are part of their usual work'. Half of the tasks were done with whatever AI tools they chose; the most used were Cursor Pro (a code-focused editor) and Claude 3.5/3.7 Sonnet, Anthropic's most advanced models at the time of the study.
Before getting to work, the developers expected the tools to cut the time needed to complete their tasks by 24%. Even after finishing, they still believed they had been 20% faster thanks to AI. But the measured data showed the opposite: tasks done with AI assistance took 19% longer to complete.
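To make the size of that perception gap concrete, here is a minimal sketch that applies the study's three percentages to a hypothetical 100-minute task. The baseline time is an invented illustration, not METR's raw data; only the percentages come from the study.

```python
# Illustrative arithmetic only: the 100-minute baseline is a made-up example.
baseline = 100.0  # minutes to finish a task without AI (assumed)

expected = baseline * (1 - 0.24)   # forecast: 24% faster  -> 76 minutes
perceived = baseline * (1 - 0.20)  # belief after the fact: 20% faster -> 80 minutes
actual = baseline * (1 + 0.19)     # measured: 19% slower -> 119 minutes

# Gap between what developers believed and what actually happened:
gap = actual - perceived  # 39 minutes on a 100-minute task
print(expected, perceived, actual, gap)
```

In other words, on a task developers thought they finished in 80 minutes with AI's help, the clock actually showed close to two hours.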
Analysis of the work revealed that AI did reduce the time developers spent on some activities, such as writing code, testing or 'reading/searching for information'. However, those savings were eclipsed by 'the time spent writing prompts, reviewing the AI's responses and waiting for code to be generated'.
In total, the developers had to modify 56% of the code generated by the AI, and 9% of the time on AI-assisted tasks went to reviewing the code it produced.
At first glance, METR's results contradict other tests that do show productivity gains from using AI. But many of those, the researchers point out, rely on questionable metrics (such as total lines of code or number of completed tasks, which can be poor indicators) and on synthetic exercises created specifically for the test, rather than on work with pre-existing, real-world code.
The developers who took part in METR's study noted that the complexity of the repositories (on average 10 years old and containing more than a million lines of code) limited the AI's usefulness. The AI could not draw on 'important tacit knowledge or context' about the codebase, whereas the developers' 'great familiarity with (the) repositories' was key to their efficiency.
The researchers therefore conclude that current AI tools may be poorly suited to 'settings with very high quality standards, or with many implicit requirements (for example, relating to documentation, test coverage or formatting) that take humans considerable time to learn'.
Even so, the study's authors are optimistic: they believe more advanced and refined versions of these tools could bring improvements in the future. They point, for example, to reduced latency, more relevant responses, or the use of techniques such as prompt scaffolding and fine-tuning. For now, though, using AI to program still has significant limitations.