Multimodal ADK Agent using Gemini 2.5 Flash #2432

@AnishKumarRose

Hi everyone!

I’ve been working on an ADK Agent project powered by Gemini 2.5 Flash, and I wanted to share my progress and ask for a little guidance.

What I've implemented so far:

- Text-based prompts and responses are working smoothly using the ADK agent API.
- Voice request and response integration is partially complete (using Whisper and TTS modules).
- Real-time video input and response are planned next.

Objective:
I want to combine all three modalities into a single unified agent that can:

- Accept text, voice, or video input.
- Respond via text or voice, and ideally with visual/video output in real time.
- Run entirely within a single ADK agent instance (not separate agents for each mode).
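To make the objective concrete, here is a minimal sketch of the shape I have in mind. It deliberately makes no ADK or Gemini calls: `Part`, `UnifiedAgent`, and the three helper functions are hypothetical names used only for illustration, and the MIME types are assumptions. The idea is simply that every input, whatever its modality, is normalized into one message type and appended to one shared history.

```python
from dataclasses import dataclass, field
from typing import Literal

Modality = Literal["text", "audio", "video"]


@dataclass
class Part:
    """One piece of user input tagged with its modality (hypothetical type)."""
    modality: Modality
    mime_type: str
    data: bytes


@dataclass
class UnifiedAgent:
    """Hypothetical stand-in for a single agent instance; this is NOT the ADK API."""
    history: list[Part] = field(default_factory=list)

    def receive(self, part: Part) -> str:
        # Every modality is appended to the same history, so later turns
        # can see earlier text, audio, and video inputs together.
        self.history.append(part)
        seen = sorted({p.modality for p in self.history})
        return (
            f"received {part.modality} ({len(part.data)} bytes); "
            f"context holds: {', '.join(seen)}"
        )


def text_part(text: str) -> Part:
    # Assumed MIME type for plain text input.
    return Part("text", "text/plain", text.encode("utf-8"))


def audio_part(pcm: bytes) -> Part:
    # Assumed MIME type for raw audio chunks.
    return Part("audio", "audio/pcm", pcm)


def video_frame_part(jpeg: bytes) -> Part:
    # Assumes video arrives as individual JPEG frames.
    return Part("video", "image/jpeg", jpeg)


agent = UnifiedAgent()
agent.receive(text_part("hello"))
agent.receive(audio_part(b"\x00\x01"))
reply = agent.receive(video_frame_part(b"\xff\xd8"))
print(reply)
```

In a real setup, `receive` would presumably forward each part to the model rather than just record it; the sketch only shows the single-instance, shared-context structure I am asking about.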

My question:
Is it possible to implement full multimodal communication (text, audio, video) in a single Gemini 2.5 Flash ADK agent using the current ADK tooling?
If yes, are there any references or best practices you could share for handling:

- Voice input/output routing
- Real-time video streaming and response
- Multimodal context preservation within the same agent

Would love to hear your thoughts or experiences if you've built anything similar!


Labels: bot triaged, question
