Large-model development urgently needs high-quality "textbooks"

2024-01-15

On January 5, the American artificial intelligence company OpenAI said it was in talks with dozens of publishers on article licensing agreements to obtain content for training its AI models. On December 27, 2023, The New York Times sued OpenAI and Microsoft, accusing them of using millions of its articles to train AI models without permission. As early as March 2023, there were reports that some of the training data for Google's Bard model came from ChatGPT.

These events point to the same issue: the shortage of high-quality corpora for large models.

"For models trained from scratch, the shortage of corpora will greatly limit the development of large models," Shao Rui, a professor at the School of Computer Science and Technology at Harbin Institute of Technology (Shenzhen), said in an interview with Science and Technology Daily. "The marginal benefit of adding more corpus data to improve large-model capability is weakening, and the lack of high-quality corpora is increasingly becoming a bottleneck limiting the development of large models."

The shortage of training corpora for large models is a serious problem

The Research Report on China's Artificial Intelligence Large Model Map, released in 2023 by the New Generation Artificial Intelligence Development Research Center of the Ministry of Science and Technology, shows that China and the United States are far ahead in the number of large models released, accounting for over 80% of the global total.

Although large-model development is booming, the shortage of high-quality corpora has become a common global problem. Public information shows that large models have extremely high requirements for data supply. For example, training GPT-4 and Gemini Ultra requires approximately 4 trillion to 8 trillion words. Researchers from universities including MIT predict that by 2026, machine learning datasets may exhaust all available high-quality corpus data. The research firm Epoch AI has also stated publicly that humans may run into a training data drought as early as 2024, with high-quality training data worldwide facing depletion. OpenAI has also publicly expressed concern about data shortages.

It is worth noting that current large-model datasets are mainly in English. The shortage of Chinese-language corpora is even more severe. Gao Wen, an academician of the Chinese Academy of Engineering and director of Pengcheng Laboratory, has said publicly that Chinese-language corpora account for only 1.3% of the global 5-billion large-model training dataset.

Zhang Jian, Deputy General Manager of the Market Development Department of the Shanghai Data Exchange, has previously stated publicly that the large-model industry currently faces a shortage of corpus supply, especially in specialized vertical fields. Although shared and freely downloadable corpora are large in quantity, their quality is not high. "While pursuing larger quantities of corpora, we also need to value quality," Zhang Jian said.

High-quality corpora should have seven characteristics

So what is a high-quality corpus? When interviewed by reporters, professionals from companies and universities including Tencent, SenseTime, and Harbin Institute of Technology (Shenzhen) gave a consistent answer: a high-quality corpus should have seven characteristics: diversity, scale, legality, authenticity, coherence, impartiality, and harmlessness.

Shao Rui said that high-quality corpora should be highly diverse and have fluent sentences. Kang Zhanhui, the algorithm manager of Tencent's machine learning platform, believes that diversity is essential to ensuring a corpus's effectiveness.

Editor: Hou Wenzhe  Responsible editor: WeiZe

Source: Science and Technology Daily
