feat(tts): implement Gemini TTS API #1879

Open
nuthalapativarun wants to merge 1 commit into agentscope-ai:main from nuthalapativarun:feat/1680-gemini-tts-model

@nuthalapativarun

Contributor

PR Title Format

feat(tts): implement Gemini TTS API

AgentScope Version

2.0.1

Description

This PR adds a Gemini TTS implementation to the new TTS module, following the existing DashScopeTTSModel as a reference.

Changes made:

  • Added GeminiTTSModel under src/agentscope/tts/_gemini/, subclassing TTSModelBase and reusing the existing GeminiCredential.
  • The model calls the Gemini generateContent API with responseModalities: ["AUDIO"], using the same lazy-imported google.genai client pattern as GeminiChatModel.
  • Audio is returned as raw 24kHz/mono/16-bit PCM, wrapped into a self-contained WAV DataBlock with media_type="audio/wav".
  • Supports non-streaming, non-realtime synthesis only, which the issue calls out as acceptable.
  • Added model card YAMLs for gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts, including the full list of supported prebuilt voices.
  • Exported GeminiTTSModel from agentscope.tts.
  • Added tests/tts_gemini_test.py mirroring tests/tts_dashscope_test.py, skipping cleanly if google-genai is not installed.
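
The WAV wrapping step described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the helper name `pcm_to_wav` and the constants are hypothetical, and the format values (24 kHz, mono, 16-bit) are taken from the description.

```python
import io
import wave

# Format values from the PR description; names here are hypothetical.
SAMPLE_RATE_HZ = 24_000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a self-contained WAV (RIFF) container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    # The wave module does not close a file object it did not open itself,
    # so the buffer is still readable here.
    return buffer.getvalue()
```

The result starts with a RIFF header, so it can be played or stored without any out-of-band knowledge of the sample rate, which is presumably why the block is labelled `audio/wav` rather than raw PCM.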

How to test:

pytest tests/tts_gemini_test.py -v
pytest tests -k tts
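
For readers unfamiliar with the "skip cleanly" behaviour mentioned above, here is one way an optional-dependency skip could look. It is a sketch using only the standard library, not the contents of tests/tts_gemini_test.py; the helper `has_module` and the test class are hypothetical. With pytest, `pytest.importorskip` serves the same purpose.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "google") is missing.
        return False


@unittest.skipUnless(has_module("google.genai"), "google-genai is not installed")
class GeminiSDKAvailabilityTest(unittest.TestCase):
    """Hypothetical test case that only runs when the SDK is present."""

    def test_client_class_exists(self) -> None:
        import google.genai

        self.assertTrue(hasattr(google.genai, "Client"))
```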

Closes #1680

Checklist

  • An issue has been created for this PR
  • I have read CONTRIBUTING.md
  • Docstrings are in Google style
  • Related documentation has been updated (agentscope-ai/docs)
  • Code is ready for review

@qbc2016 qbc2016 left a comment

Member

Thank you for the PR. Please see the inline comments.

@@ -0,0 +1,8 @@
# -*- coding: utf-8 -*-

Member

Same as the OpenAI PR: GeminiTTSModel needs to be registered in src/agentscope/credential/_gemini.py.
Without this, GeminiCredential.list_tts_models() returns an empty list.

`None`):
The TTS parameters (voice, etc.). When ``None``, the default
parameters will be used.
stream (`bool`, defaults to `False`):

Member

Gemini TTS API does support streaming output. Please refer to https://ai.google.dev/gemini-api/docs/speech-generation#streaming
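
If streaming is added, the PCM chunks could be written into the WAV container as they arrive instead of being wrapped once at the end. The sketch below assumes the chunks are available as an iterable of raw PCM `bytes` (with empty or missing payloads possible); how those chunks are obtained from the SDK is deliberately left out, and `stream_to_wav` is a hypothetical name.

```python
import io
import wave
from typing import Iterable, Optional


def stream_to_wav(
    chunks: Iterable[Optional[bytes]],
    sample_rate_hz: int = 24_000,
) -> bytes:
    """Append streamed 16-bit mono PCM chunks to a single WAV payload."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate_hz)
        for chunk in chunks:
            if chunk:  # skip empty or missing payloads
                wav_file.writeframes(chunk)
    return buffer.getvalue()
```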

inline_data = getattr(part, "inline_data", None)
if inline_data and inline_data.data:
    data = inline_data.data
    if isinstance(data, str):

Member

The isinstance(data, str) check for inline_data.data is good defensive coding, but a brief comment explaining why both str and bytes are possible (SDK version differences?) would be helpful for maintainability.
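
One plausible reading, not confirmed in this thread, is that a `str` payload is base64-encoded audio (the form used in JSON responses) while `bytes` is already decoded. Under that assumption, the normalization and the requested comment might look like the following; `to_audio_bytes` is a hypothetical helper name.

```python
import base64
from typing import Union


def to_audio_bytes(data: Union[str, bytes]) -> bytes:
    """Return raw audio bytes regardless of how the payload was delivered."""
    # Assumption: a str payload is base64 text (as in a JSON response body),
    # whereas bytes has already been decoded by the client library.
    if isinstance(data, str):
        return base64.b64decode(data)
    return bytes(data)
```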

_DEFAULT_MEDIA_TYPE = "audio/wav"


def _extract_usage(

Member

Other TTS implementations use _parse_usage for the equivalent function. Using _extract_usage here is fine, but for consistency across the TTS module, consider renaming to _parse_usage.
