San Francisco-based AI startup Anthropic has released its latest family of LLMs: the Claude 3 models.
Claude 3 comes in three variants: Haiku, Sonnet, and Opus. For the less poetic among us, that translates to small, medium, and large. Claude 3 Opus is Anthropic’s most advanced model and the industry’s first that is claimed to beat OpenAI’s GPT-4 across a wide range of benchmarks.
GPT-4 has been the gold standard that AI companies have long used to compare their LLM performance. Those comparisons often used words like “approaching” or “nearly”, but Anthropic can finally claim to exceed GPT-4’s capability.
Here are the benchmark figures for Claude 3 compared to GPT-4, GPT-3.5, and Gemini Ultra and Pro.
It’s worth noting that the GPT-4 figures above are the ones that OpenAI supplied in its technical report before GPT-4 was released. The Claude 3 model card acknowledges that higher scores for GPT-4 Turbo have been reported.
Even so, the Claude 3 Opus figures are a big deal. In spite of the inevitable arguments over how the company got to these figures, Anthropic says Claude 3 Opus represents “higher intelligence than any other model available.”
Claude 3 Opus input/output API costs will run you $15 / $75 per million tokens. That’s steep compared to GPT-4 Turbo, which costs $10 / $30. Claude 3 Sonnet ($3 / $15) and Claude 3 Haiku ($0.25 / $1.25) offer excellent value given the performance figures for these smaller models.
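To put those per-million-token rates in perspective, here is a quick back-of-the-envelope calculator using the prices quoted above. The model names are shorthand labels for this sketch, not official API identifiers.

```python
# Quoted API prices as (input, output) in USD per million tokens.
PRICES = {
    "claude-3-opus": (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4-turbo": (10.00, 30.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the quoted rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 10k-token prompt with a 1k-token reply.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.4f}")
```

At those rates, the same 10k-in / 1k-out request costs roughly 60x more on Opus than on Haiku, which is why the smaller models look so attractive on price/performance.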
If you want to try Claude 3 for free you can do that on Anthropic’s claude.ai chatbot once its servers recover from the traffic rush. It’s powered by Claude 3 Sonnet, with paid Pro users getting access to Opus.
The Claude 3 models accept images as input, though they can’t generate an image for you. The benchmarks indicate that Opus is good at analyzing photos, charts, graphs, and technical diagrams.
Anthropic says the Claude 3 models are capable of accepting inputs exceeding 1 million tokens but, for most users, the context window will be limited to 200k tokens for now. That’s still a lot more than GPT-4 Turbo’s 128k context.
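For a rough sense of what those context windows hold, the sketch below converts tokens to words using the common rule of thumb of about 0.75 English words per token (a heuristic that varies by tokenizer and text, not an exact figure).

```python
# Rough heuristic: ~0.75 English words per token.
WORDS_PER_TOKEN = 0.75

def approx_words(tokens: int) -> int:
    """Convert a token count to an approximate English word count."""
    return int(tokens * WORDS_PER_TOKEN)

print(approx_words(200_000))    # Claude 3 window at launch: ~150000 words
print(approx_words(128_000))    # GPT-4 Turbo window: ~96000 words
print(approx_words(1_000_000))  # the 1M-token capability Anthropic cites
```

By this estimate, a 200k-token window fits roughly 150,000 words of prompt material, on the order of a long novel.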
A large context window is only useful when coupled with good recall, and Anthropic claims that Opus delivers “near-perfect recall, surpassing 99% accuracy.”
Something interesting happened during the “needle in a haystack” recall testing of Claude 3 Opus. When asked a question that could only be answered by spotting the inserted “needle” sentence, it indicated that it understood it was being tested. Impressive, and a little scary.
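A “needle in a haystack” test is simple to construct: bury one out-of-place sentence in a long filler document, then ask the model a question only that sentence answers. The sketch below shows the construction step; the filler text and needle are illustrative, not Anthropic’s actual test data.

```python
# Illustrative filler and "needle" sentences for a recall test.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret ingredient in the best pizza is figs. "

def build_haystack(n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return "".join(sentences)

# Build a long document with the needle buried in the middle.
doc = build_haystack(n_sentences=1000, depth=0.5)
# The prompt sent to the model would pair `doc` with a question such as
# "What is the secret ingredient in the best pizza?" and score the answer.
```

Sweeping `depth` across the document (and `n_sentences` across context lengths) is what produces the familiar recall heatmaps for these tests.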
Anthropic is a big proponent of what it calls “Constitutional AI” which aims to improve the safety and transparency of its models. With Claude 2, this pursuit of safety resulted in a lot of refusals to respond to prompts that were actually harmless.
Claude 3 is better at understanding the nuance of prompts to better decide what does and doesn’t fall foul of Anthropic’s guardrails. Claude 3 also achieves much better accuracy and reduced hallucinations compared with Claude 2.1.
Some AI pessimists claim we’re heading for an AI winter and that LLM model performance is reaching a plateau, but Anthropic disagrees. The company says that it doesn’t believe that “model intelligence is anywhere near its limits.”
It plans to bring several interesting upgrades to Claude 3 in the future, adding more advanced agentic capabilities, including Tool Use and interactive coding (REPL).
The high pricing may confine the initial market for Claude 3 Opus to niche research and professional applications. Sonnet and Haiku, with their balance of pricing and performance, are likely to see the biggest adoption for now.
Will we see a price drop from OpenAI? With OpenAI feeling the heat at the top of the benchmarks, we must be getting really close to a GPT-5 announcement.