Google unleashes its groundbreaking Gemini family of multi-modal models

  • Google releases its Gemini family of powerful multi-modal AI models
  • Gemini is built from the ground up to handle text, image, audio, and video
  • Google published benchmarks showing it beats GPT-4 in 30 out of 32 tests

Google has launched its Gemini family of multi-modal AI models, a dramatic play in an industry still reeling from events at OpenAI.

Gemini is a multi-modal family of models capable of processing and understanding a blend of text, images, audio, and video.

Sundar Pichai, Google’s CEO, and Demis Hassabis, CEO of Google DeepMind, have expressed high expectations for Gemini. Google plans to integrate it across its products and services, including Search, Maps, and Chrome.

Gemini boasts comprehensive multimodality, processing and interacting with text, images, video, and audio. While we’ve become used to text and image processing, audio and video break new ground, offering exciting new ways to handle rich media.

Hassabis notes, “These models just sort of understand better about the world around them.”

Pichai emphasized the model’s integration with Google products and services, stating, “One of the powerful things about this moment is you can work on one underlying technology and make it better and it immediately flows across our products.”

Gemini will take three different forms:

  • Gemini Nano: A lighter version tailored for Android devices, enabling offline and native functionalities.
  • Gemini Pro: A more advanced version, set to power numerous Google AI services, including Bard.
  • Gemini Ultra: The most powerful iteration, designed primarily for data centers and enterprise applications, scheduled for release next year.

In terms of performance, Google claims Gemini outperforms GPT-4 in 30 out of 32 benchmarks, particularly excelling in understanding and interacting with video and audio. This performance is attributed to Gemini’s design as a multisensory model from the outset.


Additionally, Google was keen to highlight Gemini’s efficiency.

Trained on Google’s own Tensor Processing Units (TPUs), it’s faster and more cost-effective than previous models. Alongside Gemini, Google is launching TPU v5p for data centers, improving the efficiency of running large-scale models.

Is Gemini the ChatGPT killer?

Google is clearly bullish about Gemini. Earlier in the year, a ‘leak’ by SemiAnalysis suggested Gemini could blow the competition out of the water, with Google rising from a peripheral member of the generative AI industry to the main character ahead of OpenAI.

Google also claims Gemini is the first model to outperform human experts on the Massive Multitask Language Understanding (MMLU) benchmark, which tests world knowledge and problem-solving abilities across 57 subjects, such as math, physics, history, law, medicine, and ethics.

Pichai says the launch of Gemini is heralding a “new era” in AI, emphasizing how Gemini will benefit from Google’s extensive product catalog.

Search engine integration is particularly interesting, as Google dominates this space and has the benefits of the world’s most comprehensive search index at its fingertips.

The release of Gemini places Google firmly in the ongoing AI race, and users will be eager to test it against GPT-4.

Gemini benchmark tests and analysis

In a blog post, Google published benchmark results showing that Gemini Ultra beats GPT-4 in the majority of tests. It also boasts advanced coding capabilities, with standout performance in coding benchmarks such as HumanEval and Natural2Code.

Here’s the benchmark data. Be aware these measures use the unreleased Gemini Ultra version. Gemini can’t be considered a ChatGPT killer until next year. And you can bet on OpenAI moving to counteract Gemini ASAP.

Text/NLP benchmark performance

General knowledge:

  • MMLU (Massive Multitask Language Understanding):
    • Gemini Ultra: 90.0% (Chain of Thought at 32 examples)
    • GPT-4: 86.4% (5-shot, reported)

Reasoning:

  • Big-Bench Hard (Diverse set of challenging tasks requiring multi-step reasoning):
    • Gemini Ultra: 83.6% (3-shot)
    • GPT-4: 83.1% (3-shot, API)
  • DROP (Reading Comprehension, F1 Score):
    • Gemini Ultra: 82.4 (Variable shots)
    • GPT-4: 80.9 (3-shot, reported)
  • HellaSwag (Commonsense reasoning for everyday tasks):
    • Gemini Ultra: 87.8% (10-shot)
    • GPT-4: 95.3% (10-shot, reported)

Math:

  • GSM8K (Basic arithmetic manipulations including Grade School math problems):
    • Gemini Ultra: 94.4% (majority at 32 examples)
    • GPT-4: 92.0% (5-shot Chain of Thought, reported)
  • MATH (Challenging math problems including algebra, geometry, pre-calculus, and others):
    • Gemini Ultra: 53.2% (4-shot)
    • GPT-4: 52.9% (4-shot, API)

Code:

  • HumanEval (Python code generation):
    • Gemini Ultra: 74.4% (0-shot, internal test)
    • GPT-4: 67.0% (0-shot, reported)
  • Natural2Code (Python code generation, new held-out dataset, HumanEval-like, not leaked on the web):
    • Gemini Ultra: 74.9% (0-shot)
    • GPT-4: 73.9% (0-shot, API)
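To make the text results easier to compare at a glance, here is a minimal Python sketch that computes the gap between the two models on each benchmark listed above. The scores are copied directly from the list; the dictionary and function names are our own, chosen for illustration. Note that the prompting setups differ between models on some tests (for example, MMLU), so these margins are not strictly like-for-like.

```python
# Scores copied from the text/NLP benchmark list above: (Gemini Ultra, GPT-4).
# Prompting setups differ on some benchmarks, so treat the margins as indicative only.
TEXT_BENCHMARKS = {
    "MMLU": (90.0, 86.4),
    "Big-Bench Hard": (83.6, 83.1),
    "DROP": (82.4, 80.9),
    "HellaSwag": (87.8, 95.3),
    "GSM8K": (94.4, 92.0),
    "MATH": (53.2, 52.9),
    "HumanEval": (74.4, 67.0),
    "Natural2Code": (74.9, 73.9),
}


def margins(scores):
    """Return {benchmark: Gemini Ultra score minus GPT-4 score}, rounded to 1 decimal."""
    return {name: round(gemini - gpt4, 1) for name, (gemini, gpt4) in scores.items()}


def count_wins(scores):
    """Count the benchmarks where the Gemini Ultra score is strictly higher."""
    return sum(1 for gemini, gpt4 in scores.values() if gemini > gpt4)


if __name__ == "__main__":
    # Print the largest Gemini Ultra lead first, the largest deficit last.
    ranked = sorted(margins(TEXT_BENCHMARKS).items(), key=lambda kv: kv[1], reverse=True)
    for name, delta in ranked:
        print(f"{name:<15} {delta:+.1f}")
    print(f"Gemini Ultra ahead on {count_wins(TEXT_BENCHMARKS)} of {len(TEXT_BENCHMARKS)}")
```

On these eight text benchmarks, Gemini Ultra leads on seven, with HellaSwag the exception and HumanEval the widest lead.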

Multimodal benchmark performance

The multimodal capabilities of Google’s Gemini AI model are also compared with OpenAI’s GPT-4V.

Image understanding and processing:

  • MMMU (Multi-discipline College-level Reasoning Problems):
    • Gemini Ultra: 59.4% (0-shot pass@1, pixel only)
    • GPT-4V: 56.8% (0-shot pass@1)
  • VQAv2 (Natural Image Understanding):
    • Gemini Ultra: 77.8% (0-shot, pixel only)
    • GPT-4V: 77.2% (0-shot)
  • TextVQA (OCR on Natural Images):
    • Gemini Ultra: 82.3% (0-shot, pixel only)
    • GPT-4V: 78.0% (0-shot)
  • DocVQA (Document Understanding):
    • Gemini Ultra: 90.9% (0-shot, pixel only)
    • GPT-4V: 88.4% (0-shot, pixel only)
  • Infographic VQA (Infographic Understanding):
    • Gemini Ultra: 80.3% (0-shot, pixel only)
    • GPT-4V: 75.1% (0-shot, pixel only)
  • MathVista (Mathematical Reasoning in Visual Contexts):
    • Gemini Ultra: 53.0% (0-shot, pixel only)
    • GPT-4V: 49.9% (0-shot)

Video processing:

  • VATEX (English Video Captioning, CIDEr Score):
    • Gemini Ultra: 62.7 (4-shot)
    • DeepMind Flamingo: 56.0 (4-shot)
  • Perception Test MCQA (Video Question Answering):
    • Gemini Ultra: 54.7% (0-shot)
    • SeViLA: 46.3% (0-shot)

Audio processing:

  • CoVoST 2 (Automatic Speech Translation, 21 Languages, BLEU Score):
    • Gemini Pro: 40.1
    • Whisper v2: 29.1
  • FLEURS (Automatic Speech Recognition, 62 Languages, Word Error Rate):
    • Gemini Pro: 7.6% (lower is better)
    • Whisper v3: 17.6%
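For readers unfamiliar with the FLEURS metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below implements that textbook definition in Python. It is an illustration only: the function name is ours, and Google’s evaluation may apply text normalization steps that are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because WER counts errors, a lower figure is better, which is why Gemini Pro’s 7.6% beats Whisper v3’s 17.6% in the list above.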

Google’s ethical commitment

In a blog post, Google emphasized its commitment to responsible and ethical AI practices.

According to Google, Gemini underwent more rigorous testing than any prior Google AI model, assessing factors including bias, toxicity, cybersecurity threats, and potential for misuse. Adversarial techniques helped surface problems early. External experts then stress-tested and ‘red-teamed’ the models to identify additional blind spots.

Google states that responsibility and safety will remain priorities amid rapid AI progress. The company says it is working with industry groups such as MLCommons, and through its own Secure AI Framework (SAIF), to establish best practices.

Google pledges continued collaboration with researchers, governments, and civil society organizations globally.

Gemini Ultra release

For now, Google is limiting access to its most powerful model iteration, Gemini Ultra, which is coming early next year.

Prior to that, select developers and experts will experiment with Ultra to provide feedback. The launch will coincide with a new cutting-edge AI model platform, or, as Google calls it, an ‘experience,’ named Bard Advanced.

Gemini for developers

Starting December 13, developers and enterprise customers will gain access to Gemini Pro through the Gemini API, available in Google AI Studio or Google Cloud Vertex AI.

Google AI Studio: A user-friendly, web-based tool, Google AI Studio is designed to help developers prototype and launch applications using an API key. This free resource is ideal for those in the initial stages of app development.

Vertex AI: A more comprehensive AI platform, Vertex AI offers fully managed services. It integrates seamlessly with Google Cloud, also offering enterprise security, privacy, and compliance with data governance regulations.

In addition to these platforms, Android developers will be able to access Gemini Nano for on-device tasks. It will be available for integration via AICore. This new system capability is set to debut in Android 14, starting with Pixel 8 Pro devices.

Google holds the aces, for now

OpenAI and Google are distinct in one big way: Google develops stacks of other tools and products in-house, including those used by billions of people every day.

We are, of course, talking about Android, Chrome, Gmail, Google Workspace, and Google Search.

OpenAI, through its alliance with Microsoft, has similar opportunities through Copilot, but that’s yet to really take off.

And if we are honest, Google probably holds sway across these product categories.

Google has pressed on in the AI race, but you can be sure this will only fuel OpenAI’s drive towards GPT-5 and AGI.

© 2023 Intelliquence Ltd. All Rights Reserved.
