Hi everyone!
I’ve been working on an ADK Agent project powered by Gemini 2.5 Flash, and I wanted to share my progress and ask for a little guidance.
What I've implemented so far:
Text-based prompts and responses are working smoothly using the ADK agent API.
Voice input and output are partially integrated (using Whisper and TTS modules).
Real-time video input and response are planned next.
Objective:
I want to combine all three modalities into a single unified agent:
Accept text, voice, or video input.
Respond via text or voice, and ideally with visual/video output in real-time.
All managed within a single ADK agent instance (not separate agents for each mode).
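To make the objective concrete, here is a rough sketch of the shape I have in mind. It is plain Python and deliberately does not call any ADK or Gemini API: `UnifiedAgent`, `Turn`, and the `transcribe`, `describe_frame`, and `respond` callables are all hypothetical placeholders for whatever the real integration would provide (for example Whisper for transcription). The only point is that every modality is routed into one shared history owned by one agent object.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Turn:
    role: str       # "user" or "agent"
    modality: str   # "text", "audio", or "video"
    content: str    # text, or a text rendering of the audio/video input


@dataclass
class UnifiedAgent:
    # Hypothetical hooks: none of these are ADK APIs.
    transcribe: Callable[[bytes], str]       # audio bytes -> transcript
    describe_frame: Callable[[bytes], str]   # video frame bytes -> description
    respond: Callable[[List[Turn]], str]     # full history -> reply text
    history: List[Turn] = field(default_factory=list)

    def handle(self, modality: str, payload: Union[str, bytes]) -> str:
        """Normalize one input of any modality and answer from shared history."""
        if modality == "text":
            content = payload if isinstance(payload, str) else payload.decode("utf-8")
        elif modality == "audio":
            content = self.transcribe(payload)
        elif modality == "video":
            content = self.describe_frame(payload)
        else:
            raise ValueError(f"unsupported modality: {modality!r}")

        # Every input lands in the same history, whatever its modality.
        self.history.append(Turn("user", modality, content))
        reply = self.respond(self.history)
        self.history.append(Turn("agent", "text", reply))
        return reply
```

In this sketch audio and video are reduced to text before reaching the model, which is a simplification on my part; a natively multimodal setup would presumably pass the raw audio and frames through instead.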
My question:
Is it possible to implement full multimodal communication (text, audio, video) in a single Gemini 2.5 Flash ADK agent using the current ADK tooling?
If so, are there any references or best practices you could share for handling:
Voice input/output routing
Real-time video streaming and response
Multimodal context preservation within the same agent
I'd love to hear your thoughts or experiences if you've built anything similar!