Grok-3 Sets New Benchmark in AI Performance, Outshining Competitors

Elon Musk’s AI venture, xAI, has announced that its latest large language model (LLM), Grok-3, has outperformed top AI models, including ChatGPT, Gemini, and DeepSeek, in a blind evaluation test. According to xAI’s internal analysis, Grok-3 has set a new record score on LMArena, a community-driven AI evaluation platform.

Grok-3 Achieves Record Scores in AI Evaluation

During a livestream on X (formerly Twitter) on Feb. 18, Musk and the xAI team introduced Grok-3, revealing that an early version of the model, codenamed “chocolate,” had been tested on LMArena. The platform, which ranks AI models through blind tests, recorded over a million votes from users comparing chatbot responses.

This is it: The world’s smartest AI, Grok 3, now available for free (until our servers melt).

Try Grok 3 now: https://t.co/Tj0afLoxEz

X Premium+ and SuperGrok users will have increased access to Grok 3, in addition to early access to advanced features like Voice Mode pic.twitter.com/YgKavSCiWr
— xAI (@xai) February 20, 2025

Grok-3 reportedly outperformed OpenAI’s o3-mini and o1, DeepSeek-R1, and Google’s Gemini 2.0 Flash Thinking by at least 10 points in key areas, including math, science, and coding. In addition, it led across multiple performance categories, such as:

Style control
Complex prompts and multi-turn responses
Creative writing and instruction following
Coding and mathematical problem-solving

The model reached a milestone score of 1400 on LMArena, and Musk said it continues to improve.
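
For a sense of what a rating gap on such a leaderboard means, here is a minimal sketch that assumes an Elo-style scale with the conventional 400-point logistic factor. LMArena's exact rating method is not described in this article, so the formula, the function name, and the rival's rating of 1390 are illustrative assumptions rather than the platform's actual computation.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that model A's answer is preferred over model B's,
    under an assumed Elo-style logistic model (not confirmed as LMArena's method)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# 1400 is the score cited in the article; 1390 is a hypothetical rival for illustration.
print(round(expected_win_rate(1400, 1390), 3))  # about 0.514
```

Under these assumptions, a 10-point lead translates to winning only slightly more than half of head-to-head votes, which is one reason small gaps on crowd-voted leaderboards deserve cautious reading.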

Skepticism Surrounding Grok-3’s Ranking

While xAI is celebrating its new AI model’s dominance, LMArena has not independently verified whether Grok-3’s ranking represents a true breakthrough over its competitors. Questions remain about possible external influences, such as audience demographics or biases in the voting process.

Additionally, controversy emerged within xAI when an engineer, Benjamin De Kraker, resigned on Feb. 12 after refusing to delete an X post in which he had ranked Grok-3 below OpenAI’s o1-pro, o1, and o3-mini models for coding.

The ranking currently (my opinion), for code:

ChatGPT o1-pro
o1
o3-mini
(all kind of tied)

Grok 3 (expected, tbd)

Claude 3.5 Sonnet

DeepSeek

GPT-4o

Grok 2

Gemini 2.0 Pro Series (might be higher, will probably move up)
— Benjamin De Kraker (@BenjaminDEKR) February 8, 2025

De Kraker explained that he was given an ultimatum to retract his opinion or face termination, and he ultimately chose to leave the company.

Part of me will forever be inside Grok

Way, wayyyy up inside
— Benjamin De Kraker (@BenjaminDEKR) February 16, 2025

Beyond AI benchmarks, Musk revealed xAI’s ambitious plans to integrate Grok into Tesla’s Optimus humanoid robots, aiming to send them on SpaceX’s upcoming Mars mission by the end of 2026. He highlighted that the next optimal Earth-Mars transit window falls in November 2026, presenting a critical opportunity for advancing robotic exploration.

“If all goes well, SpaceX will send Starship rockets to Mars with Optimus robots and Grok,” Musk stated, underscoring his long-term vision for AI-powered automation in space.

Disclaimer: All materials on this site are for informational purposes only. None of the material should be interpreted as investment advice. Please note that despite the nature of much of the material created and hosted on this website, HODL FM is not a financial reference resource and the opinions of authors and other contributors are their own and should not be taken as financial advice. If you require advice of this sort, HODL FM strongly recommends contacting a qualified industry professional.
