Beware of the risk of AI 'model collapse'

2024-09-25

From customer service to content creation, artificial intelligence (AI) has driven progress in many fields. However, an increasingly serious problem known as "model collapse" could undermine all of these achievements.

"Model collapse" was identified in a research paper published in the British journal Nature in July this year. It refers to what happens when AI-generated data is used to train successive generations of machine learning models, severely "polluting" their output. Multiple foreign media outlets have reported that this is not merely a technical issue for data scientists to worry about: if left unchecked, model collapse could have far-reaching consequences for businesses, technology, and the entire digital ecosystem. In an interview with Science and Technology Daily, Professor Xiong Deyi, head of the Natural Language Processing Laboratory at Tianjin University, explained "model collapse" from a professional perspective.

What is "model collapse"?

Most AI models, such as GPT-4, are trained on vast amounts of data, most of which comes from the Internet. Initially, this data was generated by humans and reflected the diversity and complexity of human language, behavior, and culture. AI learns from this data and uses it to generate new content.

However, when AI searches the web for new data to train the next generation of models, it is likely to absorb some of its own generated output, forming a feedback loop in which one AI's output becomes another AI's input. When generative AI is trained on its own content, its output drifts away from reality. It is like copying a file over and over: each version loses some of the original detail, and the result grows blurrier and less accurate. The New York Times reported that when AI is cut off from human-generated input, the quality and diversity of its output decline.

"The distribution of real human language data usually conforms to Zipf's law, which states that word frequency is inversely proportional to word rank," Xiong Deyi explained. "Zipf's law reveals a long-tail phenomenon in human language data, that is, a large amount of low-frequency, diverse content." He further explained that, owing to errors such as approximate sampling, this long tail of the true distribution gradually disappears from model-generated data; the distribution of the generated data converges toward one that no longer matches the true distribution, reducing diversity and leading to "model collapse."
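Xiong's description suggests a simple mechanism: a model trained on its own output keeps only what it actually sampled, so rare words vanish first. The toy simulation below (a minimal sketch of my own, not code from the article or the Nature paper; the vocabulary size, sample size, and number of generations are arbitrary assumptions) treats each "model" as the empirical distribution of the previous generation's sample and tracks how many distinct word types survive.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 10_000   # hypothetical vocabulary size (assumption)
N = 100_000      # tokens sampled per "generation" (assumption)

# Generation 0: "human" data follows Zipf's law, i.e. the frequency
# of the rank-r word is proportional to 1/r.
probs = 1.0 / np.arange(1, VOCAB + 1)
probs /= probs.sum()

for gen in range(6):
    # Sample a corpus from the current model's distribution.
    sample = rng.choice(VOCAB, size=N, p=probs)
    counts = np.bincount(sample, minlength=VOCAB)
    print(f"generation {gen}: {(counts > 0).sum()} distinct word types survive")

    # "Train" the next model on this corpus: its distribution becomes
    # the empirical one, so any word that was never sampled drops to
    # probability zero and can never come back.
    probs = counts / counts.sum()
```

Over successive generations the count of surviving word types can only shrink: probability mass piles onto the high-frequency head while the long tail that Zipf's law describes is progressively clipped, which is the loss of diversity Xiong refers to.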
Is AI cannibalizing itself?

Commenting on "model collapse," a recent article in the American magazine The Week said it amounts to AI cannibalizing itself. Xiong Deyi believes that as this phenomenon takes hold, the higher the proportion of model-generated data in subsequent rounds of training, the more information about real data the model loses, making training ever more difficult.

At first glance, "model collapse" may seem like a niche issue that only AI researchers in the laboratory need to worry about, but its impact will be profound and long-lasting. According to an article in The Atlantic, in order to develop more advanced AI products, tech giants may have to feed their programs synthetic data, that is, simulated data generated by AI systems. However, because some generative AI output is riddled with bias, false information, and nonsense, these flaws will be passed on to the next version of the AI model. Forbes reported that "model collapse" may also exacerbate bias and inequality in AI.

This does not mean all synthetic data is bad, however. The New York Times has noted that in some cases synthetic data can help AI learn, for example when the output of a large AI model is used to train a smaller one, or when the correct answer can be verified, such as the solution to a mathematical problem or the optimal strategy in games like chess and Go.

Is AI taking over the Internet?

The difficulty of training new AI models may point to an even greater challenge. Scientific American reported that AI content is taking over the Internet, with text generated by large language models flooding hundreds of websites. Compared with human-created content, AI content can be produced faster and in far greater volume. Sam Altman, CEO of OpenAI, said in February this year that the company generates about 100 billion words every day, equivalent to the text of 1 million novels, much of which ends up on the Internet.

The flood of AI content online, including bot-posted tweets, absurd images, and fake comments, has fueled a darker idea. Forbes reported that the "dead Internet theory" holds that most of the traffic, posts, and users on the Internet have already been replaced by bot- and AI-generated content, and that humans no longer determine the direction of the Internet. The notion initially circulated only on online forums, but it has recently gained wider attention.

Fortunately, experts say the "dead Internet theory" has not yet become reality. Forbes points out that the vast majority of widely circulated posts, including profound viewpoints, sharp language, insightful observations, and definitions of new things in new contexts, are not generated by AI.

Still, Xiong Deyi stressed: "With the wide application of large models, AI-synthesized data may account for a growing share of Internet data. Large amounts of low-quality AI-synthesized data will not only cause a degree of 'model collapse' in subsequent models trained on Internet data, but will also have negative effects on society, such as generated misinformation misleading some people. Therefore, AI-generated content is not only a technical issue but also a social issue, one that requires an effective response from both security governance and AI technology perspectives."

Editor: Lubaikang    Responsible editor: Chenze

Source: digitalpaper.stdaily.com
