Generate Video from Any Prompt with Gemini Omni


Made with Omni · last 24h

A wall of generations.

Click any tile to remix it.

ONER

"When the person touches the mirror, transforms into a detailed monochrome line art drawing"

transform · 0:08 · via DeepMind

ZOOM

"Make the hand-shaped hole super zoom and magnify the ground it's looking at"

reimagine · 0:10 · via DeepMind

SOUND

"When the finger touches the animal toy, play the sound the animal makes"

sound · 0:08 · via DeepMind

CLAY

"Skeuomorphism stop-motion explainer of how the brain hippocampus works"

explainer · 0:18 · via DeepMind

VOXEL

"When the person touches the mirror, the entire environment turns into 3D voxel art"

transform · 0:08 · via DeepMind

MUSIC

"The lights of the apartments start turning on in sync with the music"

reimagine · 0:08 · via DeepMind

TEXT

"26 items, one per alphabet letter. Lower-third labels written on paper. 9 frames per item at 24fps."

text · 0:11 · via DeepMind

FIELD

"Transport the violinist to the image environment, sun-drenched grassy field"

multi-turn · 0:08 · via DeepMind

PUPPET

"When the person touches the mirror, transforms into a felted stuffed puppet with googley eyes and glasses"

transform · 0:08 · via DeepMind

ANGLE

"Change the camera angle to be over the violinist's shoulder"

multi-turn · 0:08 · via DeepMind

HOLO

"When the person touches the mirror, transforms into a vintage monochrome 3D line-art hologram inside a holodeck"

transform · 0:08 · via DeepMind

TEXT

"Word by word, one at a time. Each word appears with a different animated style, in rhythm with the audio."

text · 0:09 · via DeepMind

Browse the full showcase →

Multimodal in

Bring whatever you have. Mix it freely.

Mix any of these in a single prompt.

01 · TEXT

Plain language

Describe the shot. Lean on what the model already knows.

/place  a quiet forest clearing
/light  golden hour, warm
/action a small fox approaches the camera, curious

02 · IMAGE × 5

Reference images

Up to five reference frames.

03 · VOICE

Voice reference

One voice clip. Record yourself reading a short number sequence to verify the voice is yours.

04 · VIDEO

Video clip

Remix an existing clip. Re-style, swap, transfer motion.

Beta testers say

Six early reads. One pattern.

The six-axis prompt is the thing. We declare the shot, framing, light, action, and iterate on what's actually there. Cut concept-board time by 80%.

Mira Tessier

Creative Director · Foxglove Studio

Text rendering is the unlock for me. Product hero with the SKU rendered in-frame, no After Effects pass. Three weeks of agency work in an afternoon.

Rachel Kim

Brand Lead · Northwind

I teach high-school physics. Stop-motion explainers used to take a week. With Omni I prompt the diagram once, refine in chat, ship in a class period.

Liam Patel

Educator · Klein & Co Academy

Conversational edits beat parameter tweaking. "Make the lighting warmer" just works, and the character stays the same person across cuts.

Sofia Garcia

YouTube Creator · 480k subs

Native audio is what sold me. Voice that matches lip movement, room tone, foley, all in one pass. Saved my post-prod budget twice this month.

Ethan Brooks

Indie Filmmaker · Lumen Labs

Reference any input, blend up to five. Style from a poster, motion from a clip, voice from a wav. Omni doesn't fight you, it just does the thing.

Maya Iwasaki

Brand Designer · Helio

How it works

Three steps. One studio.

From prompt to clip to edit, on one screen.

STEP 01

Compose along six axes

The prompt guide turned into fields.

/cadrage wide-angle, oner
/style cinematic, grounded
/light warm, golden hour
/place forest clearing
/action fox approaches fire

⌘↵ Generate
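For readers who script their prompts, here is a minimal Python sketch of how slash-prefixed fields like the ones above could be split into named axes and joined back into one prompt string. The helper names `parse_axes` and `compose` are hypothetical, and the parsing rules are an assumption; this is not Omni Studio's actual parser.

```python
def parse_axes(prompt: str) -> dict[str, str]:
    """Split '/axis value' lines into an {axis: value} mapping."""
    axes: dict[str, str] = {}
    for raw in prompt.splitlines():
        line = raw.strip()
        if not line.startswith("/"):
            continue  # skip blank lines and free text
        # Everything up to the first space is the axis name.
        name, _, value = line[1:].partition(" ")
        axes[name] = value.strip()
    return axes


def compose(axes: dict[str, str]) -> str:
    """Join the fields back into a single plain-language prompt string."""
    return "; ".join(f"{name}: {value}" for name, value in axes.items())


fields = parse_axes("""
/cadrage wide-angle, oner
/style cinematic, grounded
/light warm, golden hour
/place forest clearing
/action fox approaches fire
""")
```

Keeping the fields in a dictionary means a single axis, such as `fields["light"]`, can be changed without retyping the rest of the shot.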

STEP 02

Watch it render

Median 23 seconds. Live status & cost.

⏱ 0:23 to first frame

STEP 03

Refine by talking

Conversational edits keep the scene consistent.

make the lighting warmer

✓ re-rendered

add light fog

✓ keeping fox & camera path

⌘B Toggle chat

Capabilities

What Gemini Omni actually does.

Every cell is something the model produces consistently, not a one-off cherry-pick.

01 · TEXT

On-screen text rendering

Type that actually reads. Lower thirds, posters, alphabet sequences, in-frame branding.

02 · CHAT

Multi-turn editing

Generate, then iterate by conversation. The scene stays consistent across edits.

03 · INPUTS

Any reference, any format

Image, video, audio, sketch. Combine up to five inputs in a single prompt.

04 · CAMERA

Camera direction

Dolly, push-in, oner, over-the-shoulder. Plain-language framing that the model honors.

05 · AUDIO

Native voice and SFX

Diegetic sound, ambient layers, voice that matches lip movement. No separate audio pass.

06 · STYLE

Style transfer

From claymation to voxel art to hologram. The motion holds, only the surface changes.

07 · MOTION

Physics-aware motion

Marbles roll, fabric settles, water reflects. Chain reactions actually chain.

08 · CHARS

Character consistency

Same person across cuts, environments, even style swaps. Faces and outfits hold.

09 · PROOF

SynthID watermarking

Provenance you can verify. Watermark survives compression, crops, and re-encodes.

How Omni compares

Gemini Omni vs the field.

Honest read on where Omni leads, where it ties, and what it's not trying to be.

	Omni Studio (this is us)	Google · Veo 3.1	OpenAI · Sora 2	Runway · Gen-4
On-screen text	Class-leading. Lower thirds, posters, alphabet sequences hold.	Good. Short captions work.	Limited. Drifts on longer copy.	Good. Brand text decent.
Multi-turn editing	Native chat. Scene + character stay consistent.	Manual re-prompt.	Manual re-prompt.	Manual re-prompt.
Native audio	Voice + SFX + ambient in one pass.	Limited. SFX only.	Mute output.	Mute output.
Reference inputs	Image, video, audio, sketch. Up to 5 combined.	Image only.	Image, short clip.	Image, motion brush.
Output length	10 s base, chainable through chat edits.	8 s.	8-20 s tier-gated.	10 s.
Provenance	SynthID watermark, verifiable.	SynthID watermark.	C2PA metadata.	C2PA metadata.
Best for	Creators, educators, brand teams shipping production-ready video.	Filmmakers chasing pure cinematic look.	Story-driven short-form.	Motion design + VFX workflows.

Snapshot. The field moves fast; we'll refresh this table monthly.

Pricing

Same plans as Gemini.
No surprise markups.

Google's pricing, passed through. Flat seat on top.

Plus

$20/mo

Up to 200 minutes / month.

200 min / month
10s clips, 1080p, audio on
SynthID watermark
Library & templates

RECOMMENDED

Pro

$30/mo

Priority queue, unlimited edits.

1,000 min / month
Priority queue · faster render
Unlimited conversational edits
Personal API passthrough
Higher resolution presets

Ultra

$100/mo

Shared workspace for teams.

Unlimited generations
Team workspace (5 seats)
Brand kit & asset library
Priority support
Audit log & SSO

FAQ

Questions you'll probably ask.

If yours isn't here, drop us a line.

01 · What is Gemini Omni, exactly?

Gemini Omni is Google DeepMind's first any-to-any model, announced 19 May 2026 at I/O. One model, one pass: it reads text, images, audio, and video, and outputs video with native sound. It takes over from the Veo lineage and absorbs capabilities from Nano Banana (image editing) and Genie (interactive worlds). Omni Studio is our front end built on top of it, and it is not affiliated with Google. Once the official Gemini and Vertex APIs ship, we will pass them through without markup.

02 · What can I put in, and what comes out?

At launch, in: text, up to 5 reference images, a voice reference, a video clip, or sketches. Out: 10s clips, 16:9 aspect ratio, 1080p, with native audio. Image and audio outputs are on Google's roadmap and we'll surface them when they land.

03 · How does the conversational editing work?

Omni was trained for multi-turn editing, so it holds the scene together across edits. After a generation, you type things like 'make the lighting warmer' or 'swap the background' and the model re-renders, keeping characters, motion, and camera path consistent. Each edit is a new node in your library tree, so you can branch and compare.
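One way to picture the library tree is as a plain tree of edit instructions, where each branch is an alternative take. The Python sketch below is illustrative only: `EditNode` and `paths` are hypothetical names, and the real library format is not documented here.

```python
from dataclasses import dataclass, field


@dataclass
class EditNode:
    """One generation or edit; children are edits branched from it."""
    instruction: str
    children: list["EditNode"] = field(default_factory=list)

    def edit(self, instruction: str) -> "EditNode":
        """Branch a new edit off this node and return the new node."""
        child = EditNode(instruction)
        self.children.append(child)
        return child


def paths(node: EditNode, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """List every root-to-leaf chain of instructions, for side-by-side comparison."""
    here = prefix + (node.instruction,)
    if not node.children:
        return [here]
    return [p for child in node.children for p in paths(child, here)]


root = EditNode("fox approaches fire")
warm = root.edit("make the lighting warmer")
warm.edit("add light fog")
root.edit("swap the background")
```

Here `paths(root)` yields two chains, one per branch, which is the "branch and compare" idea in data-structure form.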

04 · What's SynthID, and why does it matter?

SynthID is Google's invisible watermark, baked into every Omni output. It's imperceptible to humans but verifiable through the Gemini app, Chrome, and Google Search. It is robust to re-encoding, cropping, and screen-recording. Provenance is non-optional: every clip you generate here ships signed.

05 · How do you handle voice and faces?

Voice modification is restricted at launch (Google's call) until a safer implementation lands. You can submit a voice reference, but to use your own voice as an avatar you'll first record a short number sequence (the official deepfake guard). All outputs are SynthID-watermarked, and the platform is gated 18+.

06 · When does the API ship, and how is it priced?

Google said 'in the coming weeks' on May 19. Pricing isn't public yet. Press projections sit around $0.10-0.30 / sec for video output. We'll pass Google's pricing through with no markup and bill the seat ($20-100/mo) on top. Join the API waitlist above to get keys the day it goes live.
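To put the projected range in per-clip terms, here is a small Python sketch of the arithmetic. It uses only the press-projected figures quoted above, which are not published pricing, and the function name `projected_clip_cost` is hypothetical. Seat fees are not included.

```python
# Press-projected range in US cents per second of video output (not official pricing).
PROJECTED_CENTS_PER_SEC = (10, 30)


def projected_clip_cost(seconds: int) -> tuple[float, float]:
    """Return the (low, high) projected cost in dollars for one clip."""
    low, high = PROJECTED_CENTS_PER_SEC
    return (seconds * low / 100, seconds * high / 100)


low, high = projected_clip_cost(10)
print(f"10 s clip: ${low:.2f} to ${high:.2f}")  # prints "10 s clip: $1.00 to $3.00"
```

Working in whole cents until the final division keeps the figures free of floating-point rounding noise.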

07 · Can I cancel anytime? Refunds?

Yes. Cancel from settings, no email, no friction. Unused minutes roll over for 30 days. If you cancel within 14 days of paying we refund the full month, no questions, no forms.

08 · Where is my data stored? Is it used for training?

Prompts and outputs sit in Vercel Blob storage (EU region by default, US optional). We do not use your generations for training. Google's underlying processing follows their Gemini API data terms. Zero Data Retention is available on Pro and Ultra.

Make video
from any input,
with Gemini Omni.

Write a prompt.
See what Omni does.

One studio. Four kinds of work.

Short-form creators

Brand & marketing

Explainers & education

Agencies & studios

Make something today. Three on us.
