2024-W02 · 8 January 2024 · Partial
The prediction

Open-source models will surpass GPT-4 capabilities on enterprise-relevant benchmarks by September 30, 2024, driven by the Meta-Llama and DeepSeek release cadences.

Verification window: by 2024-09-30 · confidence high

Verified in
2024-W25

Open Source Will Beat GPT-4 by Q3

OpenAI dropped the GPT-4 API waitlist in July 2023. The same month, Meta released Llama 2 70B with both research and commercial permissions. That convergence marked the inflection point. Capability would no longer be the gating factor for frontier deployment. Access would be. Open source crossed the threshold from academic curiosity to production alternative that month.

We think the capability convergence completes by September 30, 2024. Open source beats GPT-4 on enterprise-relevant benchmarks. Not by a little. By enough that the marginal business case flips in favor of the open-weight model.

The prediction

We expect three developments between now and September 30.

First, Meta ships Llama 3. The 70B variant reaches GPT-4 parity on MMLU, GSM8K, and HumanEval within eight weeks of release. The 8B variant becomes the default edge choice for enterprises that need data residency and cannot wait for Mistral Large or Command R+ to ship with equivalent capabilities.

Second, DeepSeek ships its "R1-class" model. The Chinese lab has been training in relative obscurity. Its release cadence is irregular but its capability slope is steep. We expect the model to land within the Q2 window with capabilities that exceed GPT-4 on code and math while undercutting the pricing of Azure and AWS deployments by 60 percent.

Third, the enterprise frontier flips. By Q3, serious buyers deploying foundation models for the first time will default to open source. The deciding factor will shift from capability (which open source wins) to integration support (where closed models keep an advantage that erodes each quarter).
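One way to make the parity claim in the first development checkable is to fix, in advance, what "parity" means. The sketch below assumes parity means scoring within a small tolerance of the reference model on every one of the three named benchmarks. The function name, the one-point tolerance, and the scores are all hypothetical placeholders, not measured results.

```python
from typing import Mapping

# The three benchmarks named in the prediction.
BENCHMARKS = ("MMLU", "GSM8K", "HumanEval")


def reaches_parity(
    candidate: Mapping[str, float],
    reference: Mapping[str, float],
    tolerance: float = 1.0,  # assumed margin, in benchmark points
) -> bool:
    """Return True if the candidate is within `tolerance` points of the
    reference on every benchmark, or ahead of it."""
    return all(
        candidate[name] >= reference[name] - tolerance for name in BENCHMARKS
    )


# Placeholder numbers for illustration only, not real benchmark scores.
reference_scores = {"MMLU": 80.0, "GSM8K": 90.0, "HumanEval": 70.0}
candidate_scores = {"MMLU": 79.5, "GSM8K": 91.0, "HumanEval": 68.0}

print(reaches_parity(candidate_scores, reference_scores))
```

With these placeholder values the check fails, because a two-point shortfall on one benchmark exceeds the assumed tolerance. Requiring all three benchmarks to clear the bar is the strict reading of the call; a looser reading would average them.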

Why we think this

The scaling laws are bending toward open source. DeepSeek's training runs are approaching the parameter counts that defined the GPT-4 release. Meta's compute budget for Llama 3 matches what OpenAI had for GPT-4. The hardware is no longer scarce. The willingness to release is what separated the two worlds through 2023.

That willingness gap is closing. Three factors are driving the convergence.

The first is the compute-price collapse. A 70B-parameter training run that cost $20M in Q1 2023 costs $2M in Q1 2024. The H100 spot market is liquid. The eight-H100 rigs that were rare outside FAANG in 2023 are available on cloud consoles for weekend-research pricing. Open source can match frontier capability because the underlying compute got cheaper.

The second is the institutional shift. IBM openly backs the Llama roadmap. NVIDIA is shipping with DeepSeek weights pre-loaded. Microsoft is shipping Llama toolchains inside the VS Code extension marketplace. The vendors that profited from compute scarcity in 2023 are now profiting from deployment scale in 2024. Their incentive alignment points toward open distribution.

The third is the capability race itself. GPT-4 Turbo was released to paying customers in November 2023. The general-availability GPT-4 Turbo shipped in January 2024. The six-week gap between paid preview and general availability is the shortest in OpenAI history. The labs are feeling the competitive pressure. Open source is forcing shorter release cycles and narrower capability gaps.
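The compute-price figures cited for the first factor can be restated as a percentage and a multiple. The short sketch below does only that arithmetic on the two dollar amounts given above; the helper name is an illustrative assumption.

```python
def cost_change(old_cost: float, new_cost: float) -> tuple[float, float]:
    """Return (fractional reduction, old-to-new cost ratio)."""
    reduction = (old_cost - new_cost) / old_cost
    factor = old_cost / new_cost
    return reduction, factor


# Figures from the text: $20M in Q1 2023, $2M in Q1 2024.
reduction, factor = cost_change(20_000_000, 2_000_000)
print(f"{reduction:.0%} cheaper, {factor:.0f}x lower")
```

On the stated figures that is a 90 percent reduction, or a tenfold drop, in one year.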

The two caveats

We are not calling an open-source victory in modal reasoning. The 2024-W04 piece on voice interfaces will confirm that GPT-4 remains ahead on complex reasoning chains through Q2. The gap narrows but does not flip.

We are also not calling an open-source victory in hallucination resistance. The reinforcement learning that makes GPT-4 commercially viable for enterprise workflows remains a closed-system advantage. The training data for refusal behavior is not public. Open source will improve on this metric but will not surpass the closed frontier in Q3.

These two limitations define the boundary condition for our call. Outside them, open source crosses the production threshold.

What this means for the Gulf

The convergence reshapes the Gulf's AI strategy window. Through 2023 the region's structural advantages were compute (G42, PIF) and data (Smart Dubai, NEOM). Starting in Q3 2024 the structural advantages become deployment velocity and integration engineering.

MBZUAI should double down on the Llama partnership. TII should accelerate the Falcon roadmap toward open-weight releases. The labs that ship with permissive licenses gain a permanent advantage in the Gulf deployment stack. The region's buyers will not wait for bilateral NDAs to evaluate capability.

For operators building on foundation models, the decision matrix simplifies. By Q3 the base capability question is answered. Open source clears the threshold. The differentiation moves to the stack layers that convert capability into outcomes: fine-tuning quality, deployment topology, compliance wrapping, and customer integration speed.

We will grade this prediction publicly in 2024-W25 alongside our other first-quarter calls.