Multimodal ADK Agent using Gemini 2.5 Flash

Hi everyone! 

I’ve been working on an ADK Agent project powered by Gemini 2.5 Flash, and I wanted to share my progress and ask for a little guidance.

What I've implemented so far:
Text-based prompts and responses are working smoothly using the ADK agent API.

Voice request and response integration is partially complete (using Whisper for speech-to-text and a TTS module for speech output).

Real-time video input and response is planned next.
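
To make the current voice path concrete, here is a minimal sketch of one voice turn. All names (`voice_turn`, `transcribe`, `respond`, `synthesize`) are hypothetical placeholders, not ADK, Whisper, or TTS APIs; the real modules would be passed in as the three callables.

```python
from typing import Callable, Tuple


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # assumed: wrapper around speech-to-text
    respond: Callable[[str], str],        # assumed: wrapper around the agent call
    synthesize: Callable[[str], bytes],   # assumed: wrapper around the TTS module
) -> Tuple[str, str, bytes]:
    """Run one voice turn: speech -> text -> agent reply -> speech."""
    user_text = transcribe(audio_in)      # audio in, transcript out
    reply_text = respond(user_text)       # transcript in, agent reply out
    reply_audio = synthesize(reply_text)  # reply text in, audio out
    return user_text, reply_text, reply_audio
```

Keeping the three stages as injected callables means the same flow can be exercised with stubs before the real speech modules are wired in.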

Objective:
I want to combine all three modalities into a single unified agent:

Accept text, voice, or video input.

Respond via text or voice, and ideally with visual/video output in real time.

All managed within a single ADK agent instance (not separate agents for each mode).
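
As a rough illustration of the "single instance" idea, the sketch below funnels every modality through one session object that tags each input and appends it to a shared history. `UnifiedSession`, `Part`, and the MIME mapping are hypothetical names and assumptions for illustration only; they are not ADK classes, and treating video as individual JPEG frames is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Assumed MIME types per modality; adjust to whatever the real pipeline emits.
MIME_BY_MODALITY = {
    "text": "text/plain",
    "audio": "audio/pcm",
    "video": "image/jpeg",  # assumption: video arrives as individual frames
}


@dataclass
class Part:
    modality: str
    mime_type: str
    data: bytes


@dataclass
class UnifiedSession:
    """One session that accepts all modalities (hypothetical, not an ADK type)."""
    history: List[Part] = field(default_factory=list)

    def submit(self, modality: str, payload: Union[str, bytes]) -> Part:
        if modality not in MIME_BY_MODALITY:
            raise ValueError(f"unsupported modality: {modality!r}")
        # Normalize text to bytes so every part has the same shape.
        data = payload.encode("utf-8") if isinstance(payload, str) else payload
        part = Part(modality, MIME_BY_MODALITY[modality], data)
        self.history.append(part)
        return part
```

The point of the single `history` list is that text, audio, and video inputs share one ordering, rather than living in separate per-modality agents.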

My question:
Is it possible to implement full multimodal communication (text, audio, video) in a single Gemini 2.5 Flash ADK agent using the current ADK tooling?
If yes, are there any references or best practices you could share to handle:

Voice input/output routing

Real-time video streaming and response

Multimodal context preservation within the same agent?
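
For the context-preservation point, one possible shape (a sketch under my own assumptions, not something ADK provides under these names) is to keep every text or transcribed-voice turn, but only a bounded window of recent video frames so the context does not grow without limit. `MultimodalContext` and `max_frames` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class MultimodalContext:
    """Shared context: full text history plus a rolling window of frames."""

    def __init__(self, max_frames: int = 3) -> None:
        # Text and voice transcripts are kept in full as (role, text) pairs.
        self.turns: List[Tuple[str, str]] = []
        # Only the newest `max_frames` frames are retained; older ones drop off.
        self.frames: Deque[bytes] = deque(maxlen=max_frames)

    def add_text(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def add_frame(self, frame: bytes) -> None:
        self.frames.append(frame)

    def snapshot(self) -> Dict[str, list]:
        """Return a copy of what would be sent along with the next request."""
        return {"turns": list(self.turns), "frames": list(self.frames)}
```

Whether a fixed frame window is the right policy depends on the use case; it is only one way to bound the video side of the context.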

Would love to hear your thoughts or experiences if you've built anything similar!


Multimodal ADK Agent using Gemini 2.5 Flash #2432

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multimodal ADK Agent using Gemini 2.5 Flash #2432

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions