The efficiency of artificial intelligence depends on large amounts of data. Thanks to that data, machine learning algorithms learn to find interdependencies and patterns between data sets and apply what they have learned to any new data presented to them. They draw conclusions, establish relationships and apply them to different equations and questions. This makes the data they work with essential.
The information for training an AI can be obtained internally, for example from customer data held by organizations, or externally, from third-party sources. The former is used for very specific AI training or for specialized projects, such as in medicine or in the music or movie suggestions that certain applications make for us.
The second option is the internet: data from suppliers who collect and sell it in large quantities. Reddit, for example, began charging users for access to its API in April 2023, likely in response to the success of ChatGPT and the opportunity to generate a new revenue stream by selling its data for AI training.
Other external data sources include open data sets provided by, for example, governments, universities or scientific centres. The problem is that, although the amount of information is enormous and seems infinite, it is not. And artificial intelligence is very close to exhausting all the data available on the internet for its training.
And it is not just anyone saying this. The first to raise the alarm was Ilya Sutskever, co-founder and former chief scientist of OpenAI, who warned about the problem a few weeks ago, pointing out at a conference that “we have reached the peak of data and there will be no more.”
And now it is the turn of another heavyweight of the internet and artificial intelligence: Elon Musk. Owner of the artificial intelligence company xAI (and Twitter, SpaceX and Tesla, among others), Musk echoed Sutskever, saying that the cumulative sum of human knowledge has essentially been exhausted for AI training, and that this happened last year.
“We’ve now exhausted basically the cumulative sum of human knowledge…. in AI training,” Musk said during a livestreamed conversation with Stagwell chairman Mark Penn streamed on X late Wednesday. “That basically happened last year.”
Read more here: https://t.co/3Erb0hfurr pic.twitter.com/BGTg248YAi
— TechCrunch (@TechCrunch) January 9, 2025
For Musk, the only way to overcome this wall is to use synthetic data, where the AI “creates its own training data. With synthetic data, the AI will rate itself and go through this self-learning process,” Musk added.
Companies such as Microsoft, Meta, OpenAI and Anthropic are already using synthetic data to train AI models. The advantage is that this information is much cheaper: 80% cheaper, according to reports from the firm Gartner. The problem is that it is a closed system: when a model is trained on information created by AI itself, its limitations grow and feedback loops arise that amplify errors.