This week we’re introducing a multipart series on AI: where it is now, where it is headed, and how one of the greatest general purpose technologies humans have ever designed will affect the future of commerce, including payments, credit, and the exchange of real-world assets in digital form. We hope you enjoy this series. It is with great appreciation, respect, and admiration that we recognize the hard work of our very own Executive Advisor, Wayne Johnson III, for his fabulous research and his magical way of explaining the highly technical in a digestible format for our readers.
In this first edition, we compare and contrast Elon Musk’s Grok 4.0 large language model with its leading competitors, and make the case that it’s the most sophisticated of the lot, evidencing signs of inchoate reasoning capabilities as a function of design and data access.
Grok 4.0: xAI Advances the Frontier of AI Innovation
On July 9, 2025, xAI introduced its latest Large Language Model (LLM), Grok 4.0, which, under controlled conditions, delivered significant performance improvements, catching up to, and in some cases surpassing, the competition in both academic benchmarks and real-world applications. Grok 4.0 distinguishes itself through its integration of reinforcement learning, early adoption of tool use (especially real-time, continuous information access leveraging X.com’s platform content), substantial compute scaling, and a multi-agent architecture. The release marks xAI’s “iPhone moment,” transforming incremental updates into a paradigm shift that has matched or outperformed competitors like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude across a range of AI metrics.
Benchmark Performance Metrics
Grok 4.0 demonstrates strong performance across rigorous academic and simulation benchmarks, frequently surpassing both human experts and competing AI systems. Notably, on the Google-Proof Q&A (GPQA) benchmark, an assessment targeting PhD-level scientific reasoning, Grok 4.0 achieved an 87.5% accuracy rate without specialist tools, edging out ChatGPT at 83.3%, Gemini 2.5 Pro at 86.4%, and Claude Opus at 79.6%. The “Grok Heavy” variant further increased this to 88.9%, a notable accomplishment given the difficulty of achieving incremental gains near the upper performance limit of 100%.
The highly challenging Humanity’s Last Exam (HLE), which spans 2,500 questions across a wide range of disciplines such as physics, chemistry, mathematics, engineering, astronomy, and history, highlights the advantages of Grok 4.0’s tool-use capabilities. While most PhDs average only 5% in their own area of expertise, Grok 4.0 with tools scored 38.6% across all subjects, a substantial improvement over its baseline model. Grok Heavy (tools + agents) attained 44.4%, significantly outperforming Gemini 2.5 Pro at 26.9% and OpenAI’s o3 at 24.9%; competitors remain in the upper twenties even with tool assistance. Industry commentary has noted that Grok’s early integration of tools during training affords it greater leverage in problem-solving. Additionally, Grok 4 Heavy set a record by scoring 50.7% on a narrower, text-only agent-based sample test.
Grok 4.0 also excels in Artificial General Intelligence (AGI) assessments, such as the Abstraction and Reasoning Corpus for AGI (ARC-AGI). This metric evaluates general fluid intelligence, prioritizing abstract reasoning over pattern recognition from large datasets. Developed by François Chollet in 2019, ARC-AGI serves as a psychometric test for AI, demanding novel problem solving with minimal precedent. Grok 4.0 achieved a score of 15.9%, double that of leading competitors, drawing praise from the benchmark’s creator for its capability to learn and infer in real time.
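To give a concrete feel for the kind of problem ARC-AGI poses, here is a toy task in the benchmark’s general grid format: each task provides a few input→output example pairs, and the solver must infer the underlying transformation and apply it to a fresh input. Note that this example and its hidden rule are our own illustration, not an actual task from Chollet’s corpus.

```python
# Toy ARC-style task (our own illustration, not from the real corpus).
# Each grid is a list of rows of integer color codes (0 = background).
# The hidden rule in this toy task: mirror the grid left-to-right.

train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

def mirror(grid):
    """The transformation a solver would have to infer from the pairs."""
    return [row[::-1] for row in grid]

# Verify the rule explains every training pair, then apply it to a test input.
assert all(mirror(x) == y for x, y in train_pairs)
test_input = [[0, 5, 0], [5, 0, 0]]
prediction = mirror(test_input)  # → [[0, 5, 0], [0, 0, 5]]
```

The point of the benchmark is that a solver sees only the example pairs, never the rule itself, so success depends on fluid inference rather than memorized patterns.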
Real World Testing
In practical scenarios, Grok 4.0’s performance supports the potential for real world use cases. For example, in a simulation conducted by Andon Labs, Grok 4 managed a vending machine business and performed several key business functions in a long-term, autonomous scenario:
- Inventory Management: handled the ordering and restocking of products. This involved tracking inventory levels and making decisions about when and what to restock to meet demand.
- Pricing Strategy: adjusted prices for the items sold in the vending machine. This required analyzing market demand and balancing profitability with customer appeal.
- Supplier Coordination: managed interactions with suppliers, which involved placing orders for products and potentially negotiating terms (simulated within the Vending-Bench environment).
- Revenue and Financial Management: monitored the business’s financial performance, tracking revenue from sales (5,000 items sold) and managing costs, such as a $2 daily fee, to achieve a net worth of $4,700. This included maintaining a positive cash balance to avoid bankruptcy.
These functions collectively contributed to Grok 4’s ability to outperform Claude Opus 4 ($2,100 net worth, 1,412 items sold) and human participants ($844 net worth, 344 items sold) in the simulation, showcasing its advanced reasoning and decision-making capabilities over extended time horizons.
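The functions above can be sketched as a simple daily accounting loop. Everything here other than the $2 daily fee mentioned above is an illustrative assumption on our part (starting cash, prices, demand, and the `run_day` helper are hypothetical), not the actual Andon Labs Vending-Bench harness, which has the agent make the ordering and pricing decisions itself.

```python
# Hypothetical sketch of a Vending-Bench-style accounting loop.
# All quantities except the $2 daily fee are illustrative assumptions.

DAILY_FEE = 2.00  # fixed operating cost per simulated day (from the source)

def run_day(cash, stock, restock_qty, unit_cost, unit_price, demand):
    """Advance the simulation one day; return updated (cash, stock, sold)."""
    # Restock: spend cash to add inventory (the agent's ordering decision).
    order_cost = restock_qty * unit_cost
    if order_cost <= cash:
        cash -= order_cost
        stock += restock_qty
    # Sell: fulfill demand up to available stock (in the real benchmark,
    # the agent's pricing choices would influence demand; fixed here).
    sold = min(demand, stock)
    stock -= sold
    cash += sold * unit_price
    # Pay the fixed daily fee; a negative balance would mean bankruptcy.
    cash -= DAILY_FEE
    return cash, stock, sold

# Toy run: 30 days with constant demand and a fixed restocking policy.
cash, stock, total_sold = 500.0, 0, 0
for _ in range(30):
    cash, stock, sold = run_day(cash, stock,
                                restock_qty=20, unit_cost=1.00,
                                unit_price=2.50, demand=18)
    total_sold += sold

net_worth = cash + stock * 1.00  # value unsold stock at cost
```

Even in this stripped-down form, the loop shows why long horizons are hard: a single bad restocking or pricing policy compounds day after day against the fixed fee, which is what makes Grok 4’s sustained $4,700 result notable.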
In another evaluation, administered by a former Walmart economist, Grok provided correct responses to three out of four multipart economics questions, producing a comprehensive 1,400-word answer that surpassed those generated by other AIs. Grok also performed well on the Hexagon physics examination, a test known for confounding most AI models.
There are many other examples to choose from, but the live, in-browser, 3D-enabled physics simulations for space missions and renderings of interplanetary flight paths really caught our eye! If readers are trying to wrap their heads around the breadth and depth of Grok 4’s capabilities, so are we. And there’s so much more…