Solving AI's "high scores, low ability" problem requires upgrading the assessment mechanism

2022-06-15

Some artificial intelligence models have become addicted to chasing leaderboard rankings: they pass benchmark tests with high scores and appear to perform well, yet they still make very basic mistakes in practical applications. Recent media reports have highlighted this habit of obsessing over leaderboards while neglecting practical performance, which has produced the phenomenon of "high scores but low ability" in some AI models. So, is benchmarking necessary for AI development? And what needs to improve when benchmarks are put to practical use?

Which AI model is better? Benchmark tests do the talking

How should the performance of an AI model be measured?

"At present, the capability of an AI model depends on data, because the essence of AI is to learn from data and output an algorithmic model. To gauge AI capability, many institutions, enterprises and even individual scientists collect and design different datasets: part of the data is fed to the AI for training to obtain a model, and the other part is used to assess the model's capability. That is the benchmark," Wu Jiaji, a professor at the School of Electronic Engineering of Xi'an University of Electronic Science and Technology, said recently in an interview with Science and Technology Daily.

Wu Jiaji noted that machine learning is increasingly used in practical scenarios such as image and speech recognition, autonomous driving and medical diagnosis, so understanding how a model behaves and performs in practice is essential. High-quality estimates of robustness and uncertainty matter for many functions, especially in deep learning. To understand a model's behavior, researchers need to measure its performance against benchmarks for the target task.

In 2010, the launch of the computer vision competition based on the ImageNet dataset sparked a revolution in algorithms and data for deep learning, and benchmarking has since become an important means of measuring the performance of AI models. Marco Tulio Ribeiro, a computer scientist at Microsoft, has said that benchmarking should be one tool in the practitioner's toolbox: people use benchmarks as a stand-in for understanding a model, probing "model behavior" through benchmark datasets.

In natural language processing, for example, the GLUE benchmark has researchers train an AI model on a dataset containing thousands of sentences and test it on nine tasks, such as judging whether a sentence is grammatical, analyzing its sentiment, or deciding whether one sentence logically entails another, tasks that once baffled AI models. Researchers later raised the difficulty: some tasks require the model not only to process sentences but also to answer reading-comprehension questions about paragraphs taken from Wikipedia or news websites. Within just a year, model scores climbed easily from below 70 points to above 90, surpassing human performance.

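Conceptually, a leaderboard score of this kind reduces to running a fixed model over a held-out, labeled dataset and reporting a single aggregate number. The sketch below illustrates that loop; the keyword-rule "model" and the four-sentence test set are hypothetical stand-ins, not the actual GLUE tasks.

```python
# Minimal sketch of how a leaderboard-style benchmark score is computed.
# The "model" and the dataset below are toy stand-ins, not the real GLUE tasks.

def predict_sentiment(sentence: str) -> str:
    """Hypothetical model: a crude keyword rule standing in for a trained classifier."""
    positive_words = {"good", "great", "excellent", "love"}
    words = sentence.lower().split()
    return "positive" if any(w in positive_words for w in words) else "negative"

# Held-out benchmark split: (input, gold label) pairs the model never saw during training.
benchmark = [
    ("The movie was great and I love the cast", "positive"),
    ("A dull, plodding film with no payoff", "negative"),
    ("Excellent pacing and a good script", "positive"),
    ("Not a good film by any measure", "negative"),  # negation trips the keyword rule
]

correct = sum(predict_sentiment(text) == label for text, label in benchmark)
score = 100 * correct / len(benchmark)
print(f"Benchmark accuracy: {score:.1f} / 100")  # the single number a leaderboard reports
```

Note that the single number says nothing about which inputs failed or why, which is exactly the blind spot discussed below.
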
Wu Jiaji said: "Scientific research requires elements such as scientific questions, methods, computation and comparative testing. Artificial intelligence research is no exception: the capability of an AI algorithm must be measurable in order to verify the feasibility and effectiveness of a research method. Benchmarking is therefore necessary, so that the ability of AI algorithms can be verified fairly, rather than everyone praising their own work like 'Wang Po selling melons and boasting about them.'"

Algorithms ultimately serve practice, not the leaderboard

Some say that high scores act like a stimulant for AI models, which is why some teams chase leaderboard rankings just to post good results. A report released by Microsoft in 2020 pointed out that various state-of-the-art models, including those from Microsoft, Google and Amazon, contain many hidden errors: simply changing "what's" to "what is" in a sentence, for instance, can make a model's output change substantially. Before that, no one had realized that these well-benchmarked commercial models could behave so poorly in application.

Clearly, an AI model trained this way is like a student who is only good at taking exams: it can pass the various benchmarks set by scientists with excellent results, but it does not understand why.

"To get good results, researchers may use special software and hardware settings to tune and tailor a model so that the AI performs well on the test, but that performance does not carry over to the real world," pointed out Shang Kun, a researcher at Xi'an University of Electronic Science and Technology.

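Hidden errors of the "what's" versus "what is" kind can be surfaced with a simple invariance check: apply an edit that should not change the answer and verify that the prediction stays the same. The sketch below shows the idea in Python; model_predict is a hypothetical placeholder for whatever system is under test, and its deliberately brittle rule only mimics the failure mode described above.

```python
# Minimal sketch of an invariance check: a meaning-preserving edit
# (expanding "what's" to "what is") should not change the model's answer.

def model_predict(sentence: str) -> str:
    """Hypothetical stand-in model, deliberately brittle: it keys on the literal contraction."""
    return "question" if "what's" in sentence.lower() else "statement"

def expand_contraction(sentence: str) -> str:
    """Perturbation that preserves meaning."""
    return sentence.replace("what's", "what is").replace("What's", "What is")

test_sentences = [
    "What's the capital of France?",
    "What's the fastest way to the station?",
]

failures = []
for original in test_sentences:
    perturbed = expand_contraction(original)
    if model_predict(original) != model_predict(perturbed):
        failures.append((original, perturbed))

# A robust model should report zero failures; the brittle stand-in above fails on both.
print(f"{len(failures)} of {len(test_sentences)} invariance checks failed")
for original, perturbed in failures:
    print(f"  '{original}' -> '{perturbed}' changed the prediction")
```

Collecting the cases that fail such checks and feeding them back into the test set is, in miniature, the dynamic benchmark construction described below.
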
The smartphone field offers an analogy. When we talk about the user experience of a phone we inevitably touch on its performance, which is usually expressed as a benchmark score. Yet we often encounter phones whose scores sit at the top of the rankings but which, in actual use, drop animation frames, stutter when scrolling pages, or freeze in applications. A report from the hardware review site AnandTech once questioned this phenomenon, pointing out that a certain brand of phone switched on a "performance mode" while running benchmarks, a mode that is rarely invoked in normal use. Tricks of this kind produce high scores but do not simulate real usage, leaving the benchmark without reference value.

Shang Kun believes there are two main ways to improve benchmarks in light of these problems. The first is to add more datasets and make benchmarks harder, testing with data the model has never seen in order to judge whether it avoids overfitting. Researchers can build a platform for dynamic data collection and benchmarking: for each task, contributors submit, through crowdsourcing, examples they expect the AI model to misclassify, and the samples that successfully fool the model are added to the benchmark. If data are collected and annotated dynamically and models are trained iteratively, rather than by the traditional static approach, AI models should evolve far more substantially, Shang Kun said. The second is to narrow the gap between laboratory data and real scenes. No matter how high a model scores on a baseline test, it still has to be tested against data from real scenarios, so benchmarks should be brought closer to reality by augmenting and expanding datasets with more realistic data. The ImageNet-C dataset, for example, expands the test images with multiple types of real-world corruption applied at several severity levels, which better simulates the conditions under which data are actually processed.

AI is widely used; national standards should be established as soon as possible

According to research from the Cleanlab group at the Massachusetts Institute of Technology, more than 3 percent of the samples in 10 commonly used benchmark datasets are mislabeled, and results based on such benchmarks lose their reference value.

"If benchmarking can be called the 'imperial examination system' of artificial intelligence, then a scores-decide-everything mentality will never train a really good model. To break this pattern, on the one hand we need more comprehensive evaluation methods; on the other hand we can consider dividing and conquering, for example using multiple AI models to turn a complex problem into simple, well-determined sub-problems. A simple, well-optimized baseline model is often better than more complex methods. Google researchers have introduced a library of uncertainty baselines for common AI tasks, to better evaluate the robustness of AI applications and their ability to handle complex uncertainty," pointed out Tan Mingzhou, director of the artificial intelligence division of the Yuanwang Think Tank and chief strategy officer of Turing Robot.

Although the industry's attitude toward benchmarking is changing, benchmark research remains a minority pursuit. In one study, Google interviewed 53 AI practitioners from industry and academia, many of whom said that improving datasets is not as rewarding as designing models.

Tan Mingzhou said that research on AI application benchmarks is an inherent requirement for building a unified domestic market. AI is now used throughout the national economy and people's livelihood, which makes it all the more necessary to set standards that evaluate AI models comprehensively and effectively. One-sidedly pursuing and adopting high-scoring AI models can lead to "low-ability" behavior in complex, extreme scenarios, and inefficient training and inference can cause adverse social impact, economic loss and environmental damage.

Tan Mingzhou emphasized that research on AI application benchmarks also bears on national strategy: for important fields, it is urgent to establish our own AI benchmark standards, AI datasets and AI model evaluation criteria.

It is understood that the dvclab of Xi'an University of Electronic Science and Technology has also carried out forward-looking research on AI benchmarking. Targeting the two key issues of overall dataset quality and dynamic dataset expansion in AI application benchmarks, it is developing an online collaborative data annotation and AI model development hosting project, which it plans to open-source this year, and it is actively exploring the construction of a national AI benchmark evaluation standard system. (Xinhua News Agency)

Editor: Li Jialang    Responsible editor: Mu Mu

Source: Xinhuanet
