Google has played another card with Gemini 1.5 Pro, a model that builds on the achievements of its predecessor, Gemini 1.0.
With Google Bard dead and buried, the Gemini family seems to be multiplying faster than the AI community can keep track of.
Now comes Gemini 1.5 Pro, which is more efficient than Google’s former flagship model, Gemini Ultra.
In fact, Gemini 1.5 Pro edges out Ultra in a handful of benchmark tests, but we’ll need more information for a comprehensive comparison.
Gemini 1.5 Pro offers a new Mixture-of-Experts (MoE) architecture and outperforms Gemini Pro (now called Gemini 1.0 Pro) in 87% of benchmarks.
It’s available through Google’s new paid AI plan, Google One AI Premium, usurping Gemini Pro despite Google having upgraded that model only a couple of weeks ago.
So, what’s the purpose of a model that beats 1.0 Pro but is similar to Ultra?
Aside from greater computing efficiency than Ultra and superior performance in some areas, the headline feature of Gemini 1.5 Pro is its 128,000-token context window, expandable up to 1 million tokens. That beats GPT-4 Turbo at 128,000 tokens and Claude 2.1 at 200,000.
To put a 1-million-token context window in context, it translates to roughly 700,000 words, 11 hours of audio, or 1 hour of video.
This enables the processing and interpretation of colossal data sets, including entire books. However, Google emphasizes that Gemini 1.5 Pro is still a ‘mid-size’ multimodal model designed to be scalable and versatile.
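As a back-of-the-envelope check on that 700,000-word figure, here is a tiny sketch; the ~0.7 words-per-token ratio is an assumed average for English prose, and real counts depend on the tokenizer and the content.

```python
# Rough conversion of a 1M-token context window into words of English text.
# WORDS_PER_TOKEN is an assumption (~0.7 for typical English prose),
# not a figure published by Google.
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.7

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
print(f"~{words:,.0f} words")  # ~700,000 words, i.e. several long novels
```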
Is Gemini 1.5 Pro a GPT-4 killer, then? Certainly not in brute-force performance, but it should outflank it on tasks involving very large quantities of information, as Google was keen to demonstrate.
Gemini’s applications and capabilities
Like its predecessors, Gemini 1.5 Pro’s capabilities extend across multiple modalities, from text to video and audio.
Its extended context window enables the model to process and reason about vast amounts of information, such as lengthy documents, extensive codebases, or hours of video content.
In one Google demo, Gemini 1.5 Pro understood and identified details in the 402-page transcript of Apollo 11’s mission to the moon.
Another challenge involved locating specific scenes in Buster Keaton’s “Sherlock Jr.” using descriptions and sketches, which 1.5 Pro managed despite taking up to a minute in some cases.
In another task, Gemini 1.5 Pro was challenged to translate between English and Kalamang, a language spoken by fewer than 200 people in western New Guinea.
This was especially daunting because Kalamang is not represented in the model’s training data.
Google provided the model with instructional materials in its input context, including approximately 500 pages of reference grammar, a bilingual wordlist (dictionary) with about 2,000 entries, and a set of around 400 parallel sentences.
These materials comprised around 250k tokens, fitting within the model’s extended context window.
With just the instructional materials provided, Gemini 1.5 Pro successfully translated sentences between English and Kalamang. This experiment showcased the model’s capability to absorb and apply new linguistic rules and vocabulary from the context, effectively learning a new language on the fly.
The quality of translations produced by Gemini 1.5 Pro was assessed by human experts who compared the model’s performance with that of a human language learner given the same set of materials.
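For a sense of what such a long-context experiment might look like in code, here is a minimal sketch using Google’s google-generativeai Python SDK. It assumes preview access to the model; the file names, model id, and prompt wording are illustrative assumptions, not Google’s actual setup.

```python
# Minimal sketch: stuffing reference material into a long context window
# so the model can translate a low-resource language "on the fly".
# Assumes preview access to Gemini 1.5 Pro via the google-generativeai SDK;
# file names, model id, and prompt are illustrative, not Google's actual setup.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# The in-context "textbook": reference grammar, bilingual wordlist,
# and parallel sentences (roughly 250k tokens in Google's experiment).
with open("kalamang_grammar.txt") as f:
    grammar = f.read()
with open("kalamang_wordlist.txt") as f:
    wordlist = f.read()
with open("kalamang_parallel_sentences.txt") as f:
    parallel = f.read()

prompt = (
    "Using only the reference materials below, translate the final sentence "
    "from English to Kalamang.\n\n"
    f"--- Reference grammar ---\n{grammar}\n\n"
    f"--- Bilingual wordlist ---\n{wordlist}\n\n"
    f"--- Parallel sentences ---\n{parallel}\n\n"
    "Sentence to translate: 'The children are playing by the river.'"
)

response = model.generate_content(prompt)
print(response.text)
```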
Another demo gauged the model’s ability to analyze and solve problems across a codebase of more than 100,000 lines of code.
Insights from Gemini 1.5 Pro’s research paper
Google released an accompanying research paper on Gemini 1.5, titled “Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context.”
It’s clear that Google intends to push Gemini 1.5 Pro’s extended context window, which, at its upper end of 1 million tokens, dwarfs that of other LLMs.
Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across different modalities and sets new standards in long-document QA, long-video QA, and long-context automatic speech recognition (ASR).
The paper details Gemini 1.5 Pro’s performance in various core capabilities, comparing it to the Gemini 1.0 models:
- Win-rate improvements: Gemini 1.5 Pro shows an 87.1% win rate against Gemini 1.0 Pro and a 54.8% win rate against Gemini 1.0 Ultra across multiple benchmarks, demonstrating broad improvements.
- Specific area performance: In text-related tasks, the model achieves a 100% win rate against Gemini 1.0 Pro and a 77% win rate against Gemini 1.0 Ultra. In vision-related tasks, the win rates are 77% and 46% against Gemini 1.0 Pro and Ultra, respectively. Audio tasks show a 60% win rate against Gemini 1.0 Pro and a 20% win rate against Gemini 1.0 Ultra.
Overall, Gemini 1.5 Pro is a capable model whose real differentiator is a context window far longer than its competitors’.
Is that enough to lure people away from ChatGPT? The truth is, unless you’ve got entire books to analyze, the benefits may be slim to non-existent.
How to use Gemini 1.5 Pro
Gemini 1.5 Pro is currently available in a limited preview for developers and enterprise customers.
Questions about long-term pricing and accessibility have so far gone unanswered. Google has hinted at pricing tiers that will vary based on the context window size, from the standard 128,000 tokens up to the full 1 million.
The exact cost remains under wraps, stirring speculation about the potential investment required to leverage this advanced context window.
Some have highlighted that by the time Gemini 1.5 Pro goes live for the masses, the competition will have moved on.
Google is differentiating itself with a product that only a select few early adopters can experiment with, which seems a little alienating.
The Gemini family: accessible or esoteric?
In the space of two to three months, Google has killed off Bard, swapped it for Gemini Pro, and released Ultra, Nano, and now Gemini 1.5 Pro.
Along the way, Gemini Pro (which was just ‘Gemini’?) has been renamed Gemini 1.0 Pro.
As a result of this AI splurge, DeepMind’s landing page for the Gemini family is quite frankly convoluted and crowded.
OpenAI, by contrast, pulled a slick marketing trick by keeping its models under the ‘ChatGPT’ umbrella from the start and limiting non-API users to little more than the free GPT-3.5 and the paid GPT-4.
Gemini is Google going nuclear on generative AI, but the company risks getting bogged down in an increasingly ambiguous product lineup.