Microsoft’s new Orca-Math AI outperforms models 10x larger

Students and STEM researchers of the world, rejoice! Particularly if you struggled with math (as I did as a youngster, and still do compared to many of the people I write about) or are just looking to supercharge your abilities, Microsoft has your back.

Yesterday afternoon, Arindam Mitra, a senior researcher at Microsoft Research and leader of its Orca AI efforts, posted a thread on X announcing Orca-Math, a new variant of French startup Mistral AI’s Mistral 7B model that excels “in math word problems” while remaining small and inexpensive to train and run for inference. It’s part of the Microsoft Orca team’s larger quest to supercharge the capabilities of smaller-sized LLMs.

Orca-Math: doing a lot with a little

In this case, the team seems to have reached a new level of performance at a small size: besting models with 10 times more parameters (the “weights” and “biases,” or numerical settings, that tell an AI model how to form its “artificial neuron” connections between words, concepts, numbers and, in this case, mathematical operations during its training phase).

Mitra noted and posted a chart showing that Orca-Math bests most other AI large language models (LLMs) and variants in the 7-70 billion parameter range, with the exceptions of Google’s Gemini Ultra and OpenAI’s GPT-4, on the GSM8K benchmark. GSM8K, originally released by OpenAI, is a set of 8,500 mathematics word problems that each take between two and eight steps to solve and were designed by human writers to be solvable by a “bright” middle-school-aged child (up to grade 8).
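For readers who want to poke at the benchmark themselves, GSM8K is publicly available on Hugging Face. Here is a minimal sketch, assuming the Hugging Face datasets library and the public "gsm8k" dataset identifier with its standard "main" configuration; verify both on the Hub before relying on them:

```python
# Minimal sketch: inspecting the GSM8K benchmark with Hugging Face's datasets library.
# Assumes `pip install datasets` and the public "gsm8k" dataset identifier.
from datasets import load_dataset

# "main" is the standard configuration; each row pairs a word problem ("question")
# with a step-by-step solution whose final line starts with "#### " ("answer").
gsm8k = load_dataset("gsm8k", "main")

example = gsm8k["test"][0]
print(example["question"])
print(example["answer"])
```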

This is especially impressive given that Orca-Math is only a 7-billion-parameter model, yet it nearly matches the performance of what are assumed to be much larger models from OpenAI and Google. It definitively bests larger models such as MetaMath (70B) and Llemma (34B).

How the Orca-Math model was made

How did the Orca team do it? Mitra revealed that they generated a new set of 200,000 math word problems crafted by “specialized agents working together,” including student AI agents that attempted solutions and teacher AI agents that corrected their answers. This synthetic dataset follows on the heels of Anthropic using synthetic data to train Claude 3 Opus, its new GPT-4-matching-or-beating model released yesterday, and adds to the growing pile of examples showing that machine-generated data is useful for increasing the intelligence and capabilities of LLMs, easing fears of a “model collapse.”
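Mitra’s thread doesn’t include the pipeline code, so the sketch below is a hypothetical illustration of the agent pattern he describes (an author agent writing a problem, a student agent attempting it, a teacher agent correcting it), not Microsoft’s actual implementation. The chat() helper is a placeholder for whatever LLM chat-completion call you have available.

```python
# Hypothetical sketch of a student/teacher agent loop for generating synthetic
# math word problems, in the spirit of the "specialized agents working together"
# Mitra describes. This is NOT Microsoft's published pipeline.

def chat(system: str, user: str) -> str:
    """Placeholder for an LLM chat-completion call (swap in any model you have)."""
    raise NotImplementedError

def generate_training_example(seed_problem: str) -> dict:
    # An "author" agent rewrites a seed problem into a fresh variant.
    problem = chat(
        system="You write grade-school math word problems.",
        user=f"Write a new word problem inspired by, but different from: {seed_problem}",
    )
    # A "student" agent attempts a step-by-step solution.
    attempt = chat(
        system="Solve the word problem step by step and state the final answer.",
        user=problem,
    )
    # A "teacher" agent checks the attempt and supplies a corrected solution if needed.
    corrected = chat(
        system="You are a strict math teacher. Verify the solution; if it is wrong, rewrite it correctly.",
        user=f"Problem: {problem}\n\nStudent solution: {attempt}",
    )
    return {"question": problem, "answer": corrected}
```

Repeat a loop like this over enough seed problems and you end up with a corpus of machine-written, machine-checked word problems of the kind Orca-Math was trained on.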

The team used the “Kahneman-Tversky Optimization,” or KTO, method created and open-sourced by startup Contextual AI late last year. As Contextual’s Kawin Ethayarajh, Winnie Xu and Douwe Kiela described KTO in a blog post:

By studying the work of economists Kahneman & Tversky on human decision-making, we’ve designed an alignment method that does not require preferences like ‘Output A trumps output B for input X’. Instead, for an input X, we simply need to know whether an output Y is desirable or undesirable. This kind of singleton feedback is abundant: every company has customer interaction data that can be marked as desirable (e.g., sale made) or undesirable (e.g., no sale made).

Intriguingly, according to Mitra, KTO was used in this case in conjunction with the more traditional technique of supervised fine-tuning to improve the accuracy of Orca-Math’s answers to the math questions.
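Mitra’s post doesn’t spell out the exact training stack, but the practical difference KTO makes is in the shape of the feedback data: single answers flagged as desirable or undesirable rather than ranked pairs. A hedged sketch, assuming the Hugging Face trl library’s KTOTrainer (whose constructor arguments vary between versions; older releases take tokenizer= rather than processing_class=), might look like this:

```python
# Hedged sketch: KTO-style "singleton" feedback fed to trl's KTOTrainer.
# Illustrative only; this is not Orca-Math's published training recipe.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

# Unlike pairwise preference data (a chosen AND a rejected answer per prompt),
# KTO needs only one answer per row plus a desirable/undesirable flag.
rows = [
    {"prompt": "Tom has 3 apples and buys 4 more. How many does he have?",
     "completion": "3 + 4 = 7. The answer is 7.", "label": True},   # desirable
    {"prompt": "Tom has 3 apples and buys 4 more. How many does he have?",
     "completion": "3 + 4 = 8. The answer is 8.", "label": False},  # undesirable
]
train_dataset = Dataset.from_list(rows)

# A generic base checkpoint stands in here; the article only says Orca-Math
# started from Mistral 7B, not which exact checkpoint or hyperparameters were used.
base = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-sketch", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

In a pipeline combining the two techniques, the supervised fine-tuning pass would typically come first, with KTO applied afterward to push the model toward the answers marked as correct.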



A new open-source set of 200,000 math word problems

Furthermore, the Orca team at Microsoft has posted its synthetic, AI-generated set of 200,000 math word problems on Hugging Face under a permissive MIT license, allowing “everyone to explore, build, and innovate” with it, including for commercial usage. So startups and companies can make use of it. Hooray!
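If the Hugging Face listing uses the identifier microsoft/orca-math-word-problems-200k (an assumption worth double-checking on the Hub), pulling the dataset down takes only a few lines:

```python
# Sketch: loading Microsoft's synthetic Orca-Math problem set from Hugging Face.
# The dataset identifier is assumed; verify it on the Hub before relying on it.
from datasets import load_dataset

orca_math = load_dataset("microsoft/orca-math-word-problems-200k")
print(orca_math["train"][0])  # each row pairs a word problem with a worked answer
```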

The original Orca 13B, which used GPT-4 as its AI teacher, was released by Microsoft in June 2023, while a second version, Orca 2, followed in 13B and 7B sizes in November 2023, both based on Meta’s open source Llama 2 LLM. It seems that every few months, the Orca family continues to grow, and get smarter, with newer, smaller (or similarly sized) members.


