Google’s I/O 2024 event kicked off on Tuesday with a string of new AI product announcements.
OpenAI may have tried to upstage Google with the release of GPT-4o on Monday, but the Google I/O 2024 keynote was full of exciting announcements.
Here’s a look at the standout AI advancements, new tools, and prototypes Google is experimenting with.
Ask Photos
Google Photos, Google’s photo storage and sharing service, will be searchable using natural language queries with Ask Photos. Users can already search for specific items or people in their photos, but Ask Photos goes further by answering questions about what the photos contain.
Google CEO Sundar Pichai showed how you could use Ask Photos to remind you what your car’s license plate number was or provide feedback on how a child’s swimming capabilities had progressed.
Powered by Gemini, Ask Photos understands context across images and can extract text, create highlight compilations, or answer queries about stored images.
With more than 6 billion images uploaded to Google Photos daily, Ask Photos will need a huge context window to be useful.
What if your photos could answer your questions? 🤔 At #GoogleIO today, we announced Ask Photos, a new Google Photos feature that does just that. Ask Photos is the new way to search your photos with the help of Gemini. #AskPhotos https://t.co/KhPeCauFAf pic.twitter.com/3MZg55SgdD
— Google Photos (@googlephotos) May 14, 2024
Gemini 1.5 Pro
Pichai announced that Gemini 1.5 Pro with a 1M token context window will be available to Gemini Advanced users. This equates to around 1,500 pages of text, hours of audio, and a full hour of video.
Developers can sign up for a waitlist to try Gemini 1.5 Pro with an impressive 2M-token context window, which will soon be generally available. Pichai says this is the next step in Google’s journey toward the ultimate goal of infinite context.
Gemini 1.5 Pro has also had a performance boost in translation, reasoning, and coding and will be truly multimodal with the ability to analyze uploaded video and audio.
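To put the context window figures in perspective, here is a rough back-of-the-envelope sketch. It takes the quoted ratio of 1M tokens to about 1,500 pages and scales it linearly to the 2M-token window. The linear scaling and the function name are our own assumptions, not figures from Google.

```python
# Figures quoted in the article: 1M tokens is roughly 1,500 pages of text.
TOKENS_1M = 1_000_000
PAGES_PER_1M = 1_500


def estimated_pages(context_tokens: int) -> float:
    """Scale the quoted tokens-to-pages ratio linearly (our assumption)."""
    return context_tokens * PAGES_PER_1M / TOKENS_1M


# Implied average number of tokens on one page of text.
tokens_per_page = TOKENS_1M / PAGES_PER_1M

print(f"~{tokens_per_page:.0f} tokens per page")
print(f"2M-token window: ~{estimated_pages(2_000_000):,.0f} pages")
```

By this estimate, the 2M-token window would hold roughly twice the text of the 1M-token window. Real documents will vary with formatting and language.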
“It nailed it.”
“This changes everything.”
“It’s a mindblowing experience.”
“I felt like I had a superpower.”
“This is going to be amazing.” Hear from developers who have been trying out Gemini 1.5 Pro with a 1 million token context window. #GoogleIO pic.twitter.com/odOfI4lvOL
— Google (@Google) May 14, 2024
Google Workspace
The expanded context and multimodal capabilities enable Gemini to be extremely useful when integrated with Google Workspace.
Users can use natural language queries to ask Gemini questions related to their emails. The demo gave an example of a parent asking for a summary of recent emails from their child’s school.
Gemini will also be able to extract highlights from and answer questions about Google Meet meetings of up to an hour.
NotebookLM – Audio Overview
Google released NotebookLM last year. It lets users upload their own notes and documents, which NotebookLM then becomes an expert on.
That makes it useful as a research guide or tutor. At I/O, Google demonstrated an experimental upgrade called Audio Overview.
Audio Overview uses the input source documents and generates an audio discussion based on the content. Users can join the conversation and use speech to query NotebookLM and steer the discussion.
NotebookLM! Love this project so much, the AI powered Arcades Project. With the multimodality of Gemini Pro 1.5, it can automatically create audio discussions of the source material you’ve added to your sources. pic.twitter.com/IhhSfj8AqR
— Dieter Bohn (@backlon) May 14, 2024
There’s no word on when Audio Overview will be rolled out, but it could be a huge help for anyone wanting a tutor or sounding board to work through a problem.
Google also announced LearnLM, a new family of models based on Gemini and fine-tuned for learning and education. LearnLM will power NotebookLM, YouTube, Search, and other educational tools to be more interactive.
The demo was very impressive, but it already seems that some of the mistakes Google made with its original Gemini release videos crept into this event.
The notebooklm demo is not real-time. I wish they had set that expectation without burying it in a footnote in the tiniest possible font. pic.twitter.com/tGN5i3fsVD
— Delip Rao e/σ (@deliprao) May 14, 2024
AI agents and Project Astra
Pichai says that AI agents powered by Gemini will soon be able to handle our mundane day-to-day tasks. Google is prototyping agents that will be able to work across platforms and browsers.
The example Pichai gave was of a user instructing Gemini to return a pair of shoes and then having the agent work through multiple emails to find the relevant details, log the return with the online store, and book the collection with a courier.
Demis Hassabis introduced Project Astra, Google’s prototype conversational AI assistant. The demo of its multimodal capabilities gave a glimpse of a future where an AI answers questions in real time based on live video and remembers details from earlier video.
Hassabis said some of these features would roll out later this year.
For a long time, we’ve been working towards a universal AI agent that can be truly helpful in everyday life. Today at #GoogleIO we showed off our latest progress towards this: Project Astra. Here’s a video of our prototype, captured in real time. pic.twitter.com/TSGDJZVslg
— Demis Hassabis (@demishassabis) May 14, 2024
Generative AI
Google gave us a peek at the image, music, and video generative AI tools it’s been working on.
Google introduced Imagen 3, its most advanced image generator. It reportedly responds more accurately to details in nuanced prompts and delivers more photorealistic images.
Hassabis said Imagen 3 is Google’s “best model yet for rendering text, which has been a challenge for image generation models.”
Today we’re introducing Imagen 3, @GoogleDeepMind’s most capable image generation model yet. It understands prompts the way people write, creates more photorealistic images and is our best model for rendering text. #GoogleIO pic.twitter.com/6bjidsz6pJ
— Google (@Google) May 14, 2024
Music AI Sandbox is an AI music generator designed to be a professional collaborative music creation tool, rather than a full track generator. This looks like a great example of how AI could be used to make good music with a human driving the creative process.
Veo is Google’s video generator that turns text, image, or video prompts into minute-long clips at 1080p. It also allows for text prompts to make video edits. Will Veo be as good as Sora?
Google will roll out its SynthID digital watermarking to text, audio, images, and video.
Trillium
Training models with all these new multimodal capabilities takes a lot of processing power. Pichai unveiled Trillium, the sixth generation of Google’s Tensor Processing Units (TPUs). Trillium delivers more than 4 times the compute of the previous TPU generation.
Trillium will be available to Google’s cloud computing customers later this year, and Google will also make NVIDIA’s Blackwell GPUs available to them in early 2025.
AI Search
Google will integrate Gemini into its search platform as it moves toward using generative AI in answering queries.
With AI Overviews, a search query results in a comprehensive answer collated from multiple online sources. This turns Google Search into more of a research assistant than a tool for simply finding a website that may contain the answer.
Gemini enables Google Search to use multistep reasoning to break down complex multipart questions and return the most relevant information from multiple sources.
Gemini’s video understanding will soon allow users to use a video to query Google Search.
This will be great for users of Google Search, but it’ll likely result in a lot less traffic for the sites from which Google gets the info.
This is Search in the Gemini era. #GoogleIO pic.twitter.com/JxldNjbqyn
— Google (@Google) May 14, 2024
And you’ll also be able to ask questions with video, right in Search. Coming soon. #GoogleIO pic.twitter.com/zFVu8yOWI1
— Google (@Google) May 14, 2024
Gemini 1.5 Flash
Google announced a lightweight, cheaper, fast model called Gemini 1.5 Flash. Google says the model is “optimized for narrower or high-frequency tasks where the speed of the model’s response time matters the most.”
Gemini 1.5 Flash will cost $0.35 per million tokens, one-twentieth of the $7 per million tokens you’d have to pay to use Gemini 1.5 Pro.
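To see what that price gap means for a real workload, here is a minimal sketch using the two per-million-token prices quoted above. It assumes a single flat rate per model and ignores any input/output or tiered pricing. The model labels and the 50-million-token workload are hypothetical, chosen only for illustration.

```python
from decimal import Decimal

# Per-million-token prices in USD, as quoted in the article.
# The dictionary keys are illustrative labels, not official API identifiers.
PRICE_PER_MILLION = {
    "gemini-1.5-flash": Decimal("0.35"),
    "gemini-1.5-pro": Decimal("7.00"),
}


def cost_usd(model: str, tokens: int) -> Decimal:
    """Estimate cost at a flat per-token rate (a simplifying assumption)."""
    return PRICE_PER_MILLION[model] * tokens / Decimal(1_000_000)


# Hypothetical workload: 50 million tokens per month.
workload_tokens = 50_000_000
flash_cost = cost_usd("gemini-1.5-flash", workload_tokens)
pro_cost = cost_usd("gemini-1.5-pro", workload_tokens)

print(f"Flash: ${flash_cost:.2f}")
print(f"Pro:   ${pro_cost:.2f}")
print(f"Pro costs {pro_cost / flash_cost:.0f}x as much as Flash")
```

Under a flat rate the ratio stays the same for any workload size, so the choice comes down to whether a task really needs the larger model.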
Each of these advancements and new products deserves a post of its own. We’ll post updates as more information becomes available or when we get to try them out ourselves.