Why does AI show such a serious subject imbalance on the college entrance examination?

2024-07-04

How many points can a large model score on the college entrance examination? Recently, the technology innovation exchange platform Geek Park released a report evaluating large models on the new college entrance examination (Gaokao) National Volume I. Among the participating models, GPT-4o ranked first in the humanities track with 562 points. Among the eight domestic large models evaluated, ByteDance's Doubao scored 542.5 points, followed by Baidu's ERNIE Bot 4.0 with 537.5 points and Baichuan Intelligence's "Baixiaoying" with 521 points. The evaluation used the same papers as Henan Province, and the three top-scoring domestic models all reached or exceeded Henan's first-tier admission line for humanities of 521 points. GPT-4o's 562 points would rank 8,811th among Henan humanities candidates, putting it in the top 2.45%; Doubao is in the top 4.27%, close to the leading model's level.

On the humanities comprehensive test, GPT-4o scored 237 points, better than most human candidates. Among the domestic models, Doubao's comprehensive score was the highest at 224.5 points, and its history score of 82.5 points ranked first among all nine models. The geography exam contains many image-based questions; GPT-4o, with its strong image comprehension, scored the highest, but only 68 points.

In the Chinese and English evaluations, several large models scored full marks on the objective questions, but essay writing proved a weakness. Ms. Xia, a Beijing municipal-level backbone teacher who has graded Chinese essays for the national college entrance examination many times and a leading figure in the Chinese discipline in Huairou District, graded the essays for this evaluation. She believes the AI essays have clear and complete structures, sound logic, and fluent language, but lack emotion and appeal. Similarly, on the 40-point English writing section, the highest score among the large models was only 29 points, mainly because the writing was vague and lacked detail.

It is worth noting that the large models showed a serious subject imbalance in the exam: they failed the science subjects of mathematics, physics, and chemistry, with total scores below 480 points, while Henan's first-tier admission line for science is 511 points. Even the top models could not reach the top 30% of science candidates. In the math evaluation, only GPT-4o, ERNIE Bot 4.0 and Doubao scored more than 60 points (out of 150). The models can accurately apply differentiation formulas and trigonometric theorems, but struggle to score on more complex derivation and proof problems. The physics exam included a giveaway multiple-choice question: human candidates can easily pick the correct answer from the principle that "time does not run backwards", yet the large models all got it wrong.

Current large language models are essentially playing a text-continuation game: trained on massive data, they predict the next most likely word or phrase and generate coherent, complete text through continuous prediction.
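This next-token prediction loop can be illustrated with a minimal sketch. The snippet below is for illustration only and uses the small open GPT-2 model via the Hugging Face Transformers library, not any of the evaluated models; it simply appends the single most probable token at each step (greedy decoding).

```python
# Minimal sketch of autoregressive next-token prediction (greedy decoding).
# Illustrative only: GPT-2 stands in for the large models discussed above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The college entrance examination is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

for _ in range(15):
    with torch.no_grad():
        logits = model(input_ids).logits       # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()           # most probable next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```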
In a humanities exam, a poorly chosen word or a synonym does not significantly affect the model's score. "But science exams test reasoning and calculation. If a question requires five steps of reasoning and the model goes wrong at one step, the answer is completely wrong. Moreover, in the training data of large models, the humanities corpus is far larger than the science corpus," a domestic expert in large model development told a Science and Technology Daily reporter.

Recently, some domestic and foreign large models have achieved good results on evaluations built from Olympiad math problems (not live Olympiad competitions). The expert explained that when the evaluation uses public datasets the models were trained on, their accuracy is very high, but accuracy drops significantly on newer datasets. The latest college entrance examination questions have not appeared in any model's training data; they test the generalization of mathematical reasoning and calculation, and this exposes the models' shortcomings.

Professor Sui Zhifang of the Institute of Computational Linguistics at Peking University recently noted that large models' performance on standardized exams such as China's college entrance examination, the civil service examination, and the US SAT is mixed: some models do well on SAT math, but their performance is less impressive on complex reasoning or in specific knowledge domains. "Without a clear understanding of the internal mechanisms of large models, our current evaluation approach can only infer intrinsic abilities from external behavior," Sui Zhifang said, adding that more systematic evaluation syllabi, more challenging evaluation tasks, and more scientific evaluation methods should be developed in the future.

Is AI better suited to exams than humans? There is no conclusive answer yet. (Lai Xin She)

Editor: Xiong Dafei    Responsible editor: Li Xiang

Source: XinHuaNet
