OpenAI o1: A New LLM Trained with Reinforcement Learning

Boonyawee Sirimaya

October 9, 2024

OpenAI's large language models (LLMs) are known for their ability to understand and generate human-like text, and they are now used in industries from customer service to content creation. With models like GPT-4o setting benchmarks in natural language processing, OpenAI continues to push the boundaries of what LLMs can achieve. The latest development is OpenAI o1, a model trained with reinforcement learning to perform complex reasoning. It marks a significant step forward: rather than answering immediately, it reasons through a problem before delivering a response.

Reinventing Reasoning

OpenAI o1, OpenAI's newest large language model, is trained with reinforcement learning to handle complex tasks. Unlike its predecessors, it takes time to “think” before answering, producing a detailed internal reasoning process before it responds to a query.

OpenAI o1 performs strongly across a range of challenging benchmarks. It ranks in the 89th percentile on competitive programming questions (Codeforces) and places among the top 500 U.S. students in a qualifier for the USA Math Olympiad (AIME). On GPQA, a benchmark of physics, biology, and chemistry problems, it exceeds the accuracy of human PhD-level experts. Work is ongoing to make OpenAI o1 as easy to use as current models, but an early version, OpenAI o1-preview, is already available in ChatGPT and to select API users.

o1 performance smoothly improves with both train-time and test-time compute — o1 AIME accuracy during training and at test time

Rigorous Evaluations

To show how much o1 improves on GPT-4o in reasoning, OpenAI tested both models on a range of human exams and machine learning benchmarks. On the majority of these reasoning-intensive tasks, o1 outperformed GPT-4o. In most tests, o1 was evaluated at its maximal test-time compute setting.

Three bar charts comparing GPT-4o, o1 preview, and o1 performance on math, coding, and science benchmarks. — o1 significantly improves over GPT-4o on challenging reasoning benchmarks.

The key to o1's success is a large-scale reinforcement learning algorithm that teaches the model to think productively. Training encourages the model to build logical chains of thought, so it approaches problems more effectively. Accuracy improves both with more reinforcement learning (train-time compute) and with more time spent "thinking" (test-time compute). This scaling behavior differs from conventional large language model pretraining and points to new ways of optimizing these systems for reasoning tasks.

On many reasoning-heavy benchmarks, OpenAI o1 performs at the level of human experts. Recent frontier models score so highly on traditional benchmarks such as MATH and GSM8K that these tests no longer differentiate between them. To push beyond these benchmarks, OpenAI tested o1 on the 2024 AIME exam, a challenging test designed for the top high school mathematicians in the U.S. While GPT-4o solved only 12% of the problems, o1 solved 74% with a single sample per problem, 83% with consensus among 64 samples, and 93% when re-ranking 1000 samples with a learned scoring function. This final result places OpenAI o1 among the top 500 students nationwide and above the cutoff for the USA Mathematical Olympiad.
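
The two multi-sample strategies mentioned above, consensus voting and picking the best of many samples with a scoring function, can be illustrated with a minimal sketch. This is not OpenAI's implementation: the function names, the toy answers, and the stand-in scorer are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Return the most frequent answer among the samples (majority vote)."""
    if not samples:
        raise ValueError("need at least one sample")
    # most_common(1) returns the answer with the highest count.
    return Counter(samples).most_common(1)[0][0]


def best_scored_answer(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the sample that a scoring function rates highest."""
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples, key=score)


# Toy data: final answers from 8 hypothetical samples for one problem.
sampled = ["204", "204", "117", "204", "096", "117", "204", "204"]
voted = consensus_answer(sampled)

# Stand-in scorer for illustration only; a real scorer would be a trained model.
toy_scores = {"204": 0.81, "117": 0.42, "096": 0.10}
best = best_scored_answer(sampled, lambda answer: toy_scores[answer])

print(voted, best)
```

Consensus needs no extra model but only works when answers can be compared for equality, as with AIME's integer answers. A scoring function can also rank free-form outputs, at the cost of training and trusting the scorer.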

In another rigorous test on GPQA diamond, a benchmark assessing expertise in subjects like chemistry, physics, and biology, OpenAI o1 outperformed PhD-level human experts. While this doesn’t suggest the model surpasses a PhD in every aspect, it highlights its ability to solve problems expected of experts in these fields. OpenAI o1 also excelled in other benchmarks, including MMLU, where it outperformed GPT-4o in 54 out of 57 subcategories. With its vision capabilities enabled, o1 achieved a 78.2% score on the MMMU benchmark, marking it as one of the first models to match human expert performance in this domain.

Chain of Thought: A New Level of Thinking

Just as humans take time to think through complex problems, OpenAI o1 engages in a "chain of thought" when tackling difficult tasks. Thanks to reinforcement learning, the model learns to refine this chain, improving how it approaches problems step by step. It can recognize errors, rethink strategies, and break down challenging questions into simpler components. This iterative thinking process allows o1 to deliver remarkably accurate and thoughtful responses.

This advanced reasoning ability represents a leap forward for large language models, as seen in examples from the o1-preview version, which demonstrates this enhanced "chain of thought" on numerous complex tasks.

Improved Coding Capabilities

To push its programming abilities further, OpenAI trained a version of o1 that competed in the 2024 International Olympiad in Informatics (IOI) under competition rules, scoring 213 points and ranking in the 49th percentile among human contestants. The model had ten hours to solve six difficult algorithmic problems. Its result owed much to its selection strategy: it generated many candidate solutions and chose which ones to submit based on test cases and a scoring function. Had submissions been chosen at random, the model would have scored only 156 points, so the selection strategy made a substantial difference.
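
The selection step described above can be sketched as follows. The article does not say how test results and the scoring function are combined, so the ordering used here (tests passed first, score as a tie-breaker), the `Candidate` fields, and the `budget` parameter are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str          # identifier for a generated program (hypothetical)
    tests_passed: int  # number of available test cases the program passed
    score: float       # value assigned by a scoring function (hypothetical)


def select_submissions(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Pick up to `budget` candidates, preferring tests passed, then score."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.tests_passed, c.score),
        reverse=True,
    )
    return ranked[:budget]


# Invented candidate pool for a single problem.
pool = [
    Candidate("sol_a", tests_passed=3, score=0.40),
    Candidate("sol_b", tests_passed=5, score=0.55),
    Candidate("sol_c", tests_passed=5, score=0.90),
    Candidate("sol_d", tests_passed=1, score=0.95),
]
chosen = select_submissions(pool, budget=2)
print([c.name for c in chosen])  # ['sol_c', 'sol_b']
```

Note that `sol_d` has the highest score but is not chosen, because in this sketch passing tests outranks the scorer's opinion. A random baseline would instead draw `budget` candidates from `pool` without looking at either signal.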

With a relaxed submission limit, the model's performance improved further. When allowed 10,000 submissions per problem, it scored 362.14, above the threshold for a gold medal. This fine-tuned model also outperformed both GPT-4o and earlier iterations in competitive programming contests on Codeforces, reaching an Elo rating of 1807, which places it in the 93rd percentile of human competitors.

The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules. — Further fine-tuning on programming competitions improves o1.

Human Preference Evaluation

Beyond exams and benchmarks, OpenAI evaluated how often people preferred o1-preview's responses over GPT-4o's in various domains. Human evaluators were shown anonymized responses from both models and asked to select the one they preferred. In categories requiring heavy reasoning, such as coding, data analysis, and math, o1-preview was strongly favored. However, it was not always the preferred choice for natural language tasks, suggesting that while o1 excels at reasoning, it may not yet be the best fit for every use case.

Horizontal bar chart showing human preference rates for o1-preview vs GPT-4o across various domains. — Human preferences comparing o1-preview vs GPT-4o
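
A preference rate of the kind shown in the chart is simply the share of pairwise comparisons that one model wins in each domain. The sketch below assumes a hypothetical vote format of `(domain, preferred_model)` pairs, and the votes are invented for illustration; they are not OpenAI's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def preference_rates(votes: Iterable[Tuple[str, str]], model: str) -> Dict[str, float]:
    """Fraction of comparisons in each domain where `model` was preferred."""
    wins: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for domain, preferred in votes:
        totals[domain] += 1
        if preferred == model:
            wins[domain] += 1
    return {domain: wins[domain] / totals[domain] for domain in totals}


# Invented votes: each tuple records which model the evaluator preferred.
votes = [
    ("coding", "o1-preview"),
    ("coding", "o1-preview"),
    ("coding", "gpt-4o"),
    ("coding", "o1-preview"),
    ("writing", "gpt-4o"),
    ("writing", "o1-preview"),
]
rates = preference_rates(votes, "o1-preview")
print(rates)  # {'coding': 0.75, 'writing': 0.5}
```

A rate near 0.5 means evaluators showed no clear preference between the two models in that domain.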

Enhanced Safety and Alignment

OpenAI has also incorporated new safety measures into o1’s reasoning capabilities. By embedding safety guidelines into the model’s thought process, the model can better adhere to human values and principles. This has resulted in substantial improvements in safety benchmarks, including preventing unwanted behavior and jailbreak scenarios. Chain-of-thought reasoning allows the model to self-regulate, making it more robust when encountering unfamiliar or unpredictable situations.

To test these advancements, OpenAI conducted rigorous safety tests and simulations. Notably, the model demonstrated reduced instances of reward hacking, a common issue where models exploit the rules of a system to achieve high performance without truly adhering to the desired behavior. These findings are detailed in the System Card, where the complete results can be found.

Hiding the Chain of Thought

OpenAI has decided to hide the raw chain of thought from users for the o1 series. Showing it would offer transparency into the model's thought process, but unfiltered access could create risks such as manipulation. Instead, OpenAI provides users with a summary of the chain of thought, balancing clarity against safety.

Conclusion

OpenAI o1 represents a leap forward in reasoning capabilities for LLMs, showing great potential across multiple domains, including science, programming, and mathematics. As OpenAI continues to iterate on this model, its reasoning and alignment capabilities are expected to further improve, unlocking new possibilities for AI applications in complex problem-solving. The o1 model series is poised to make significant contributions to both the AI field and the daily work of its users.

Consult our experts at Amity Solutions for more information on Amity Chatbots.