The Internet's supply of data for training AI language models could dry up

Artificial intelligence systems such as ChatGPT could soon run out of the resource that makes them increasingly intelligent: the billions of words that people have written and shared on the Internet.

A new study released Thursday by research group Epoch AI predicts that tech companies will exhaust the supply of publicly available training data for AI language models sometime between 2026 and 2032.

Tamay Besiroglu, one of the study's authors, compares the phenomenon to a “gold rush” that depletes finite natural resources, and says the field of AI could struggle to maintain its current pace of progress once reserves of human-generated writing are depleted.

In the short term, technology companies such as OpenAI, the developer of ChatGPT, and Google are scrambling to secure, and sometimes pay for, high-quality data sources to train their large AI language models. They have, for example, signed agreements to tap the steady flow of text from Reddit forums and news media outlets.

In the long term, there will not be enough new blogs, news articles and social media comments to maintain the current trajectory of AI development. That will pressure companies to tap into sensitive data now considered private, such as emails or text messages, or to rely on less reliable “synthetic data” generated by the chatbots themselves.

“There is a serious bottleneck here,” says Besiroglu. “If you start to run into those limitations on the amount of data you have, you can no longer scale your models efficiently. And expanding the models has probably been the most important way to increase their capabilities and improve the quality of their results.”

The researchers made their first projections two years ago, shortly before ChatGPT's debut, in a working paper that predicted high-quality text data would run out by 2026. Much has changed since then, including new techniques that allow AI researchers to make better use of the data they already have and sometimes to “overtrain” on the same sources multiple times.

But there are limits, and following new research, Epoch now predicts that public text data will run out sometime in the next two to eight years.

The team's latest study has been peer-reviewed and will be presented at the upcoming International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute sponsored by San Francisco-based Rethink Priorities and funded by supporters of effective altruism, a philanthropic movement that has poured money into mitigating the most serious risks of AI.

Besiroglu says AI researchers realized more than a decade ago that aggressively expanding two key ingredients—computing power and the Internet's vast data banks—could significantly improve the performance of AI systems.

According to the Epoch study, the amount of text data fed into AI language models has grown about 2.5-fold per year, while computing power has grown about fourfold per year. Facebook parent company Meta Platforms recently claimed that the largest version of its Llama 3 model, which has not yet been released, was trained on up to 15 trillion tokens, each of which can represent a fragment of a word.

But how much the data bottleneck is worth worrying about is debatable.

“I think it's important to keep in mind that we don't necessarily have to train larger and larger models,” says Nicolas Papernot, associate professor of computer engineering at the University of Toronto and researcher at the non-profit Vector Institute for Artificial Intelligence.

Papernot, who was not involved in the Epoch study, says more capable AI systems can also be built by training more specialized models on specific tasks. However, he is concerned that generative AI systems will end up being trained on the very outputs they produce, leading to a performance degradation known as “model collapse.”

Training on AI-generated data is “like what happens when you photocopy a sheet of paper and then photocopy the photocopy. Some information is lost,” says Papernot. Not only that: his research has also found that it can further entrench the errors, biases and unfairness already embedded in the information ecosystem.

If genuine human-written sentences remain a critical data source for AI, the stewards of the most coveted troves, such as websites like Reddit and Wikipedia along with news and book publishers, have been forced to think hard about how that data is used.

“It's an interesting problem that we're having natural-resource conversations about human-created data,” said Selena Deckelmann, director of product and technology at the Wikimedia Foundation, which runs Wikipedia. “I shouldn't laugh at it, but I do find it kind of amazing.”

Although some entities have tried to prevent their data from being used to train AI, often after it has already been used without compensation, Wikipedia has placed few restrictions on how AI companies use its volunteer-written articles. Still, Deckelmann says she hopes incentives will remain for people to keep contributing, especially as an avalanche of cheap, automatically generated “junk content” begins to pollute the Internet.

AI companies should “be concerned about how human-generated content continues to exist and remains accessible,” she says.

From the perspective of AI developers, the Epoch study says it is “unlikely” that paying millions of humans to generate the text AI models will need would be a cost-effective way to drive better technical performance.

As OpenAI begins work on training the next generation of its large GPT language models, CEO Sam Altman told attendees at a United Nations event last month that the company has already experimented with “the generation of a lot of synthetic data” for training.

“I think what is needed is high-quality data. There is low-quality synthetic data. There is low-quality human data,” Altman said. But he also expressed reservations about relying too heavily on synthetic data over other technical methods for improving AI models.

“It would be very strange if the best way to train a model was to generate, say, 1,000 trillion tokens of synthetic data and feed them back in,” Altman said. “Somehow, that seems inefficient.”