AI systems could collapse into nonsense as more of the internet gets filled with content made by artificial intelligence, researchers have warned.
Recent years have seen increased excitement about text-generating systems such as OpenAI’s ChatGPT. That excitement has led many to publish blog posts and other content created by those systems, and ever more of the internet is now produced by AI.
Many of the companies producing those systems use text taken from the internet to train them, however. That may lead to a loop in which the same AI systems being used to produce that text are then being trained on it.
That could quickly lead those AI tools to fall into gibberish and nonsense, researchers have warned in a new paper. Their warnings come amid a more general worry about the “dead internet theory”, which suggests that more and more of the web is becoming automated in what could be a vicious cycle.
It takes only a few cycles of generating content and then being trained on it for those systems to produce nonsense, according to the research.
The researchers found, for instance, that one system tested with text about medieval architecture needed only nine generations before its output was just a repetitive list of jackrabbits.
The phenomenon of AI systems being trained on datasets that were themselves created by AI, which then pollutes their output, has been referred to as “model collapse”. Researchers warn that it could become increasingly prevalent as AI systems are used more across the internet.
It happens because, as those systems produce data and are then trained on it, the less common parts of the data tend to be left out. Researcher Emily Wenger, who did not work on the study, used the example of a system trained on pictures of different dog breeds: if there are more golden retrievers in the original data, then it will pick those out, and as the process goes round the other dogs will eventually be left out entirely – before the system falls apart and just generates nonsense.
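The dog-breed example can be imitated with a toy simulation. The sketch below is not the researchers' method: it assumes a "model" that does nothing more than memorise how often each breed appears and then generates a new dataset of the same size by sampling from those frequencies. The breed counts, the function names and the number of generations are all invented for illustration.

```python
import random
from collections import Counter


def next_generation(data, rng):
    """'Train' on data (count breed frequencies), then sample a new dataset."""
    counts = Counter(data)
    breeds = list(counts)
    weights = [counts[b] for b in breeds]
    # The new dataset has the same size as the old one and is drawn
    # only from breeds the "model" has actually seen.
    return rng.choices(breeds, weights=weights, k=len(data))


def simulate(initial, generations, seed=0):
    """Return the breed counts for the original data and each later generation."""
    rng = random.Random(seed)
    data = list(initial)
    history = [Counter(data)]
    for _ in range(generations):
        data = next_generation(data, rng)
        history.append(Counter(data))
    return history


# Hypothetical starting data: mostly golden retrievers, a few rarer breeds.
initial = (
    ["golden retriever"] * 80
    + ["poodle"] * 12
    + ["dalmatian"] * 6
    + ["corgi"] * 2
)

history = simulate(initial, generations=30)
for gen, counts in enumerate(history):
    if gen % 5 == 0:
        print(gen, dict(counts))
```

Once a breed drops to zero in this toy it can never come back, because later generations only ever sample from what the previous one produced. The number of distinct breeds can therefore only stay the same or shrink. The sketch shows only this loss of rare cases, not the later stage in which a real system's output turns into nonsense.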
The same effect happens with large language models like those that power ChatGPT and Google’s Gemini, the researchers found.
That could be a problem not only because the systems eventually become useless, but also because they will gradually become less diverse in their outputs. As the data is produced and recycled, the systems may fail to reflect all of the variety of the world, and smaller groups or outlooks might be erased entirely.
The problem “must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web”, the researchers write in their paper. It might also mean that companies that have already scraped data to train their systems are at an advantage, since data taken earlier will have more genuine human output in it.
There is a range of possible solutions, including watermarking AI output so that it can be spotted by automated systems and then filtered out of training sets. But it is easy to remove those watermarks, and AI companies have been resistant to working together to use them, among other issues.
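The idea of filtering, and its weakness, can be shown with a deliberately simplified sketch. It assumes a watermark that is a literal invisible marker appended to the text. Real watermarking schemes are typically statistical patterns in word choice rather than a tag like this, so the marker and every function name here are hypothetical.

```python
# Hypothetical invisible marker made of zero-width characters (illustration only).
WATERMARK = "\u200b\u200c\u200b"


def add_watermark(text):
    """Tag generated text so automated systems can recognise it."""
    return text + WATERMARK


def is_watermarked(text):
    """Detect the marker anywhere in the text."""
    return WATERMARK in text


def filter_training_set(documents):
    """Keep only documents that do not carry the marker."""
    return [doc for doc in documents if not is_watermarked(doc)]


def strip_watermark(text):
    """Remove the marker, as someone reposting the text might do."""
    return text.replace(WATERMARK, "")


human = "Gothic cathedrals rely on flying buttresses."
generated = add_watermark("Medieval towers were built from stone.")
laundered = strip_watermark(generated)

kept = filter_training_set([human, generated, laundered])
print(len(kept))
```

In this toy, the filter drops the tagged text but keeps the copy whose marker has been stripped, which is the weakness described above.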
The study, ‘AI models collapse when trained on recursively generated data’, is published in Nature.