OpenAI's latest model o3 demonstrates powerful reasoning ability
2024-12-27
On December 20, the U.S. artificial intelligence company OpenAI unveiled its latest AI reasoning model, o3, along with a lightweight version, o3-mini. The company claims that o3 has more advanced, human-like reasoning abilities, surpassing its predecessor o1 in areas such as writing code, math competitions, and mastering scientific knowledge at the level of a human PhD. However, according to a December 22 report on the UK's New Scientist website, although o3 represents a remarkable leap in performance, it has not yet reached the industry's long-awaited level of artificial general intelligence (AGI).

OpenAI says that when tackling more complex, multi-step problems, o3 spends more time computing before giving an answer. This improvement in reasoning ability has allowed o3 to perform well across multiple tests.

Large language models are eager to rack up scores on mathematical benchmarks, and o3 is no exception. On the 2024 American Invitational Mathematics Examination (AIME), o3 reached an accuracy of 96.7%, missing only one question. On FrontierMath, which OpenAI researchers regard as one of the most demanding benchmarks, o3 solved 25.2% of the problems. That score may seem low, but other large language models have previously "crashed collectively" here, with accuracy rates below 2%. FrontierMath is so difficult that Fields Medalist Terence Tao has said it could hold out against AI for several years. Yet o3 needed only a few minutes of thinking to solve one of these problems, whereas human mathematicians take hours or even days.

In mastering scientific knowledge, o3 also exceeds the average PhD level. On the GPQA Diamond benchmark, which measures a model's performance on PhD-level science questions spanning chemistry, physics, and biology, o3 reached an accuracy of 87.7%, surpassing the roughly 70% achieved by human PhDs and nearly 10 percentage points higher than o1.

In addition, o3's coding ability outstrips the earlier o1 series. On SWE-bench Verified, which measures an AI model's ability to solve real-world software problems, o3's accuracy is about 71.7%, more than 20 percentage points higher than o1's. On the Codeforces competitive programming platform, o3 earned a rating of 2727, roughly the level of the 175th-ranked human programmer, while o1 scored only 1891.

After showcasing o3's impressive results, OpenAI CEO Sam Altman emphasized that the arrival of o3 marks the next stage of AI development, as such models can handle complex tasks that require extensive reasoning.

The New Scientist website also reported that on the Abstraction and Reasoning Corpus for AGI (ARC-AGI) benchmark, regarded as an important measure of progress toward AGI, o3 set a new record: in the low-compute configuration, it scored 75.7%, placing it high on the public leaderboard. Because the test used to decide the prize imposes a stricter compute limit, o3's attempt there fell short. Even so, by "brute force", using roughly 172 times the official compute limit, o3 reached 87.5%, clearing the 85% threshold regarded as human level.
Commenting on o3's performance, former Google engineer and ARC-AGI creator François Chollet wrote on his blog that this is a surprising and significant leap in AI capability. But o3 has not yet achieved AGI, he argued, because it still fails on some very simple ARC-AGI tasks, revealing a fundamental difference between it and human intelligence. AGI refers to a hypothetical future system that could mimic human thinking and decision-making, possess self-awareness, and act autonomously; for now, it remains mostly the stuff of science fiction rather than reality.

Upgrading and iteration are no easy task. o3 is not only OpenAI's latest showcase but also a vivid illustration of the race among AI giants over large language models. Two years ago, OpenAI released ChatGPT, which kicked off the AI arms race. From GPT-3.5 to the more accurate and creative GPT-4, then to o1 and now o3, OpenAI has kept improving its products. Other leading AI developers are likewise using ever more advanced technology to drive iterative upgrades of their own offerings. Not long ago, Google launched a new version of its flagship Gemini model, which it says is twice as fast as the previous generation and can "think, remember, plan, and even act on users' behalf". Meta plans to launch Llama 4 next year.

The path of iteration is not smooth, however. Several leading companies, including OpenAI and Google, face the dilemma of huge costs and diminishing returns in developing new models. Development of OpenAI's GPT-5 model is reportedly progressing slowly: a six-month training run costs about $500 million in compute alone, yet the performance is only slightly better than the company's existing products. (New Society)
Editor: He Chuanning  Responsible editor: Su Suiyue
Source: Sci-Tech Daily