feat: Add SiliconFlow TTS API support with custom base URL and model selection

This commit adds comprehensive support for using SiliconFlow's TTS API as an alternative to OpenAI, including:

Features:
- Configurable API base URL (Settings > API Base URL)
- TTS model selection dropdown (CosyVoice2-0.5B, OpenAI compatible models)
- Dynamic voice options based on selected model
- Editable voice dropdown (Combobox) supporting custom voice IDs
- Automatic voice formatting for SiliconFlow (model:voice format)
- Debug logging for troubleshooting API calls
- Warning for incorrect base URL format

Changes:
- utils/settings_manager.py: Added api_base_url and tts_model settings
- utils/text_to_mic.py:
  - Added get_available_tts_models() for model options
  - Added get_siliconflow_voices() for SiliconFlow voices
  - Added change_api_base_url() method with validation
  - Added TTS model dropdown in GUI
  - Converted voice dropdown to Combobox for typing support
  - Added on_voice_exit() for validation
  - Updated API call to use selected model and formatted voice
- text-to-mic-cli.py: Added OPENAI_API_BASE_URL and OPENAI_TTS_MODEL env var support
- Readme.md: Updated documentation with SiliconFlow usage instructions

Supported Models:
- FunAudioLLM/CosyVoice2-0.5B (SiliconFlow - multi-language, emotional)
- tts-1, tts-1-hd (OpenAI compatible)
- gpt-4o-mini-tts (OpenAI default)

SiliconFlow Voices (CosyVoice2-0.5B):
- Male: alex, benjamin, charles, david
- Female: anna, bella, claire, diana
- Custom voices via voice ID entry

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Xin Wang
2026-01-27 17:12:34 +08:00
parent 20358adafb
commit 92d20e59e9
4 changed files with 365 additions and 30 deletions

Readme.md

@@ -65,12 +65,29 @@ https://platform.openai.com/docs/quickstart/account-setup
 6. You can change the API key at any time under the 'Settings' menu.
+7. (Optional) You can also configure a custom API Base URL under 'Settings > API Base URL' to use compatible API endpoints other than OpenAI. For example, to use SiliconFlow's API, set the base URL to `https://api.siliconflow.cn/v1` (Note: use just the base URL, NOT the full endpoint path). Leave empty to use OpenAI's default endpoint.
+8. (Optional) You can select different TTS models from the "TTS Model" dropdown. When using SiliconFlow, the CosyVoice2-0.5B model will be available with 8 built-in voices (alex, anna, bella, benjamin, charles, claire, david, diana). The voice options will update automatically based on the selected model.
+9. (Optional) The Voice dropdown supports both selecting from the list and typing custom voice IDs. Click on the voice field to type a custom voice ID (e.g., for SiliconFlow custom voices like `speech:your-voice-name:xxxx`). This is useful if you've uploaded custom voice samples to SiliconFlow.
 This tool was brought to you by Scorchsoft - We build custom apps to your requirements. Please contact us if you have a requirement for a custom app project.
 ## Advanced Tips
-### 1. ChatGPT AI Manipulation
+### 1. Custom Voices with SiliconFlow
+When using SiliconFlow's API, you can upload your own voice samples and use them by entering the custom voice ID in the Voice dropdown. To upload a custom voice:
+1. Upload your voice sample to SiliconFlow (see their documentation)
+2. You'll receive a voice ID like: `speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd`
+3. Click on the Voice dropdown and type/paste this custom voice ID
+4. The app will use this custom voice for TTS
+For more information on uploading custom voices, see: [SiliconFlow Text-to-Speech Documentation](https://docs.siliconflow.cn/en/userguide/capabilities/text-to-speech)
+### 2. ChatGPT AI Manipulation
 If you go to "Settings > ChatGPT Manipulation" then you can turn this on and pick which model to use.
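Steps 8 and 9 and the custom-voice tip above use two kinds of voice value with SiliconFlow: built-in CosyVoice2 names such as `alex`, and uploaded voice IDs such as `speech:your-voice-name:...`. The speech call in `utils/text_to_mic.py` prefixes the model name onto whatever voice is selected whenever CosyVoice2 is used with a SiliconFlow URL, so a pasted custom ID would be prefixed too. The sketch below shows one way to leave such IDs alone. It assumes, as the README's example suggests, that any value already containing `:` is either a custom URI or already in `model:voice` form. `format_voice` is a hypothetical name, not part of the project.

```python
def format_voice(model: str, voice: str, base_url: str = "") -> str:
    """Return the `voice` value to send for the given model and endpoint.

    Sketch only: assumes SiliconFlow built-in voices need a "model:voice"
    prefix, while uploaded voice URIs (e.g. "speech:name:...") are sent as-is.
    """
    voice = voice.strip()
    uses_cosyvoice = "CosyVoice2" in model and "siliconflow" in base_url.lower()
    if not uses_cosyvoice:
        return voice  # OpenAI voices such as "fable" or "nova" are unchanged
    if ":" in voice:
        return voice  # custom "speech:..." URI, or already "model:voice"
    return f"{model}:{voice}"
```

For example, `format_voice("FunAudioLLM/CosyVoice2-0.5B", "alex", "https://api.siliconflow.cn/v1")` returns `FunAudioLLM/CosyVoice2-0.5B:alex`, which is the format shown in the CLI usage text.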
@@ -104,6 +121,22 @@ run the executable or "python text-to-mic.py"
 https://vb-audio.com/Cable/
 ## 2) ensure the OpenAI API key is specified in the .env file
+You can also optionally set `OPENAI_API_BASE_URL` in the .env file to use a compatible API endpoint other than OpenAI. For example, to use SiliconFlow's API:
+```
+OPENAI_API_KEY=your_api_key_here
+OPENAI_API_BASE_URL=https://api.siliconflow.cn/v1
+OPENAI_TTS_MODEL=FunAudioLLM/CosyVoice2-0.5B
+```
+**Important:** Use just the base URL (e.g., `https://api.siliconflow.cn/v1`), NOT the full endpoint path (don't add `/audio/speech`).
+Leave `OPENAI_API_BASE_URL` empty to use OpenAI's default endpoint.
+Available TTS models:
+- `tts-1` (OpenAI standard, default)
+- `tts-1-hd` (OpenAI high quality)
+- `gpt-4o-mini-tts` (OpenAI)
+- `FunAudioLLM/CosyVoice2-0.5B` (SiliconFlow - multi-language, emotional TTS)
 This sets up a virtual microphone that we can use to send text to speech audio to. Then, when you join a meeting, such as a google meeting, you can select this virtual cable to hear the audio being sent on the channel.
 ## 3) Run the script:
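The `.env` settings above fall back in two ways. An empty `OPENAI_API_BASE_URL` means OpenAI's default endpoint, and an unset `OPENAI_TTS_MODEL` means `tts-1`. The sketch below resolves those settings from any mapping, such as `os.environ` after `load_dotenv()`, so it can be checked without a real `.env` file. `TTSConfig` and `resolve_tts_config` are hypothetical names. Unlike `os.getenv('OPENAI_TTS_MODEL', 'tts-1')`, it also treats an empty model value as unset. It also rejects a base URL ending in `/audio/speech`, which the README warns against but the CLI does not check.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TTSConfig:
    api_key: Optional[str]
    base_url: Optional[str]  # None means "use the SDK's default endpoint"
    model: str


def resolve_tts_config(env: Mapping[str, str]) -> TTSConfig:
    """Read the README's TTS variables from a mapping such as os.environ."""
    base_url = env.get("OPENAI_API_BASE_URL", "").strip() or None
    if base_url and base_url.rstrip("/").endswith("/audio/speech"):
        # The README asks for the base URL only, not the endpoint path.
        raise ValueError(f"Base URL should not include /audio/speech: {base_url}")
    model = env.get("OPENAI_TTS_MODEL", "").strip() or "tts-1"
    return TTSConfig(api_key=env.get("OPENAI_API_KEY"), base_url=base_url, model=model)
```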

text-to-mic-cli.py

@@ -10,7 +10,14 @@ import os
 load_dotenv()
 # Set up your OpenAI API key from the environment variable
-client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
+api_key = os.getenv('OPENAI_API_KEY')
+api_base_url = os.getenv('OPENAI_API_BASE_URL', '').strip()
+# Create client with custom base URL if provided
+if api_base_url:
+    client = OpenAI(api_key=api_key, base_url=api_base_url)
+else:
+    client = OpenAI(api_key=api_key)
 def list_audio_devices():
     p = pyaudio.PyAudio()
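The `if`/`else` above constructs the client in two places that differ only in `base_url`. The sketch below builds the keyword arguments once and adds `base_url` only when one is configured. `client_kwargs` is a hypothetical helper. In the `openai` Python SDK (v1), omitting `base_url` lets the client fall back to its own default. That default also consults the SDK's `OPENAI_BASE_URL` variable, which is separate from this project's `OPENAI_API_BASE_URL`.

```python
from typing import Any, Dict, Optional


def client_kwargs(api_key: Optional[str], base_url: Optional[str] = None) -> Dict[str, Any]:
    """Keyword arguments for OpenAI(...); base_url is included only when non-empty."""
    kwargs: Dict[str, Any] = {"api_key": api_key}
    if base_url and base_url.strip():
        kwargs["base_url"] = base_url.strip()
    return kwargs


# Usage with the openai SDK (not executed here):
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs(api_key, api_base_url))
```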
@@ -81,9 +88,13 @@ def play_audio_multiplexed(file_paths, device_indices):
     p.terminate()
-def stream_audio_to_virtual_mic(text, voice="fable", device_index=None, device_index_2=None):
+def stream_audio_to_virtual_mic(text, voice="fable", model=None, device_index=None, device_index_2=None):
+    # Get model from environment variable or use default
+    if model is None:
+        model = os.getenv('OPENAI_TTS_MODEL', 'tts-1')
     response = client.audio.speech.create(
-        model="tts-1",
+        model=model,
         voice=voice,
         input=text,
         response_format='wav'
@@ -114,10 +125,27 @@ if __name__ == "__main__":
     if arglen < 2:
         print("Usage: python script.py 'text to convert'")
+        print("Environment variables:")
+        print(" OPENAI_API_KEY - Your API key (required)")
+        print(" OPENAI_API_BASE_URL - Custom API base URL (optional)")
+        print(" OPENAI_TTS_MODEL - TTS model to use (default: tts-1)")
+        print("")
+        print("Example models:")
+        print(" - tts-1 (OpenAI standard)")
+        print(" - tts-1-hd (OpenAI high quality)")
+        print(" - gpt-4o-mini-tts (OpenAI)")
+        print(" - FunAudioLLM/CosyVoice2-0.5B (SiliconFlow)")
+        print("")
+        print("For SiliconFlow voices with CosyVoice2:")
+        print(" The voice will be auto-formatted as: FunAudioLLM/CosyVoice2-0.5B:alex")
         sys.exit(1)
     print(f"arg count {arglen}")
+    # Get TTS model from environment
+    tts_model = os.getenv('OPENAI_TTS_MODEL', 'tts-1')
+    print(f"Using TTS model: {tts_model}")
     if arglen == 4:
         device_index = int(sys.argv[2])
         device_index_2 = int(sys.argv[3])
@@ -129,5 +157,5 @@ if __name__ == "__main__":
         device_index = int(input("Enter the device index: "))
         device_index_2 = None
-    stream_audio_to_virtual_mic(sys.argv[1], voice="fable", device_index=device_index,device_index_2=device_index_2)
+    stream_audio_to_virtual_mic(sys.argv[1], voice="fable", model=tts_model, device_index=device_index,device_index_2=device_index_2)
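The CLI reads the text and up to two device indices positionally and always passes `voice="fable"`. Its new usage text mentions CosyVoice2 auto-formatting, but `stream_audio_to_virtual_mic` forwards `voice` unchanged. The sketch below writes the same positional handling with `argparse` and adds hypothetical `--voice` and `--model` options (not part of this commit) whose defaults match the current behaviour. Prompting for a device index when none is given is left to the caller, as in the original script.

```python
import argparse
import os
from typing import List, Optional


def parse_cli_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Sketch of the text-to-mic CLI arguments using argparse."""
    parser = argparse.ArgumentParser(
        description="Send text-to-speech audio to a virtual microphone.")
    parser.add_argument("text", help="text to convert")
    parser.add_argument("device_index", nargs="?", type=int,
                        help="output device index (prompted for if omitted)")
    parser.add_argument("device_index_2", nargs="?", type=int,
                        help="optional second output device index")
    # Hypothetical options; defaults mirror the hard-coded voice and env-based model.
    parser.add_argument("--voice", default="fable")
    parser.add_argument("--model", default=os.getenv("OPENAI_TTS_MODEL", "tts-1"))
    return parser.parse_args(argv)
```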

utils/settings_manager.py

@@ -37,7 +37,9 @@ class SettingsManager:
             "play_last_audio": ["ctrl", "shift", "8"],
             "cancel_operation": ["ctrl", "shift", "1"]
         },
-        "max_tokens": 750
+        "max_tokens": 750,
+        "api_base_url": "",
+        "tts_model": ""
     }
     @classmethod
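Adding `api_base_url` and `tts_model` to the defaults only helps if a settings file saved by an older version picks them up. The sketch below merges saved values over the defaults one level deep, so that nested dictionaries such as the hotkey map keep any new default entries. `DEFAULT_SETTINGS` here is an abbreviated stand-in, and its `"hotkeys"` key name is hypothetical. `SettingsManager`'s actual loading code is not shown in this diff.

```python
import copy
from typing import Any, Dict

# Abbreviated, hypothetical stand-in for SettingsManager's defaults.
DEFAULT_SETTINGS: Dict[str, Any] = {
    "hotkeys": {"cancel_operation": ["ctrl", "shift", "1"]},
    "max_tokens": 750,
    "api_base_url": "",
    "tts_model": "",
}


def merge_settings(saved: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay saved settings on a copy of the defaults (nested dicts merged one level)."""
    merged = copy.deepcopy(DEFAULT_SETTINGS)
    for key, value in saved.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # keep default sub-keys the file lacks
        else:
            merged[key] = value
    return merged
```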

utils/text_to_mic.py

@@ -121,9 +121,20 @@ class TextToMic(tk.Tk):
         # Get API key using APIKeyManager
         self.api_key = APIKeyManager.get_api_key(self)
         self.has_api_key = bool(self.api_key)
+        # Initialize settings before loading them
+        self.api_base_url = ""
         if self.has_api_key:
-            self.client = OpenAI(api_key=self.api_key)
+            # Load settings to get custom base URL
+            settings = self.load_settings()
+            self.api_base_url = settings.get("api_base_url", "").strip()
+            # Create OpenAI client with custom base URL if provided
+            if self.api_base_url:
+                self.client = OpenAI(api_key=self.api_key, base_url=self.api_base_url)
+            else:
+                self.client = OpenAI(api_key=self.api_key)
         # Initializing device index variables before they are used
         self.device_index = tk.StringVar(self)
@@ -247,8 +258,9 @@ class TextToMic(tk.Tk):
         settings_menu = Menu(self.menubar, tearoff=0)
         self.menubar.add_cascade(label="Settings", menu=settings_menu)
         settings_menu.add_command(label="API Key", command=self.change_api_key)
+        settings_menu.add_command(label="API Base URL", command=self.change_api_base_url)
         settings_menu.add_command(label="AI Copyediting", command=self.show_ai_editor_settings)
-        settings_menu.add_command(label="Keyboard Shortcuts", command=self.show_hotkey_settings)
+        settings_menu.add_command(label="Keyboard Shortcuts", command=self.show_hotkey_settings)
         settings_menu.add_command(label="Manage Tones", command=self.show_tone_presets_manager)
         settings_menu.add_separator()
@@ -287,7 +299,89 @@ class TextToMic(tk.Tk):
         new_key = APIKeyManager.change_api_key(self)
         if new_key:
             self.api_key = new_key
-            self.client = OpenAI(api_key=self.api_key)
+            # Recreate client with base URL if set
+            if self.api_base_url:
+                self.client = OpenAI(api_key=self.api_key, base_url=self.api_base_url)
+            else:
+                self.client = OpenAI(api_key=self.api_key)
+    def change_api_base_url(self):
+        """Change the API base URL."""
+        from tkinter import simpledialog
+        # Show current URL in the prompt
+        current_url = self.api_base_url if self.api_base_url else "OpenAI Default"
+        prompt = f"Current API Base URL: {current_url}\n\nEnter custom API Base URL (leave empty to use OpenAI default):\n\nNote: For SiliconFlow, use: https://api.siliconflow.cn/v1"
+        new_url = simpledialog.askstring("API Base URL", prompt, parent=self)
+        if new_url is not None: # User didn't cancel
+            new_url = new_url.strip()
+            # Warn if user included /audio/speech in the URL
+            if new_url and "/audio/speech" in new_url:
+                if not messagebox.askyesno("Incorrect Base URL",
+                        f"The base URL should not include '/audio/speech'.\n\n"
+                        f"You entered: {new_url}\n\n"
+                        f"Did you mean: {new_url.replace('/audio/speech', '')}\n\n"
+                        f"Click Yes to correct it, or No to use as-is.",
+                        parent=self):
+                    # User said No, keep as-is
+                    pass
+                else:
+                    # User said Yes, correct it
+                    new_url = new_url.replace('/audio/speech', '')
+            # Update settings
+            SettingsManager.update_settings({"api_base_url": new_url})
+            # Update instance variable
+            self.api_base_url = new_url
+            # Recreate client with new base URL
+            if self.api_key:
+                if self.api_base_url:
+                    self.client = OpenAI(api_key=self.api_key, base_url=self.api_base_url)
+                else:
+                    self.client = OpenAI(api_key=self.api_key)
+            # Update TTS model options based on new base URL
+            self.update_tts_model_options()
+            # Show confirmation
+            if new_url:
+                messagebox.showinfo("API Base URL Updated", f"API Base URL has been set to:\n{new_url}\n\nTTS model options have been updated.")
+            else:
+                messagebox.showinfo("API Base URL Reset", "API Base URL has been reset to OpenAI default.\n\nTTS model options have been updated.")
+    def update_tts_model_options(self):
+        """Update TTS model dropdown options based on current API base URL."""
+        if hasattr(self, 'tts_menu') and hasattr(self, 'tts_model_var'):
+            available_models = self.get_available_tts_models()
+            model_options = [model[1] for model in available_models]
+            model_ids = [model[0] for model in available_models]
+            # Store the new model IDs
+            self.tts_model_ids = model_ids
+            # Update the dropdown menu
+            self.tts_menu['menu'].delete(0, 'end')
+            for display_name in model_options:
+                self.tts_menu['menu'].add_command(label=display_name, command=tk._setit(self.tts_model_var, display_name, self.on_tts_model_change))
+            # Set default based on API base URL
+            if self.api_base_url and "siliconflow" in self.api_base_url.lower():
+                default_model = "FunAudioLLM/CosyVoice2-0.5B"
+            else:
+                default_model = "gpt-4o-mini-tts"
+            # Find and set the display name for the default model
+            for i, model_id in enumerate(model_ids):
+                if model_id == default_model:
+                    self.tts_model_var.set(model_options[i])
+                    break
+            # Trigger model change to update voices
+            self.on_tts_model_change()
     def get_audio_file_path(self, filename):
         if platform.system() == 'Darwin': # Check if the OS is macOS
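`change_api_base_url` corrects the URL with `str.replace('/audio/speech', '')`, which removes that substring wherever it appears. The stricter sketch below strips it only as a trailing path segment, together with surrounding whitespace and trailing slashes. `normalize_base_url` is a hypothetical name, and `str.removesuffix` requires Python 3.9 or later.

```python
def normalize_base_url(url: str) -> str:
    """Return a base URL without a trailing '/audio/speech' or '/'; '' means default."""
    url = url.strip().rstrip("/")
    url = url.removesuffix("/audio/speech")  # only a trailing endpoint path is removed
    return url.rstrip("/")
```

A dialog could then offer the correction only when `normalize_base_url(new_url)` differs from `new_url.strip()`.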
@@ -403,41 +497,87 @@ class TextToMic(tk.Tk):
         # Set fixed width for all labels
         label_width = 35 # Adjust this value as needed for your UI
+        # Initialize TTS model selection
+        available_tts_models = self.get_available_tts_models()
+        tts_model_options = [model[1] for model in available_tts_models] # Use display names
+        tts_model_ids = [model[0] for model in available_tts_models] # Store model IDs
+        # Get saved TTS model or use default
+        settings = self.load_settings()
+        saved_tts_model = settings.get("tts_model", "")
+        if not saved_tts_model:
+            # Default based on API base URL
+            if self.api_base_url and "siliconflow" in self.api_base_url.lower():
+                saved_tts_model = "FunAudioLLM/CosyVoice2-0.5B"
+            else:
+                saved_tts_model = "gpt-4o-mini-tts"
+        # Find the display name for the saved model
+        default_tts_model_display = tts_model_options[0]
+        for i, model_id in enumerate(tts_model_ids):
+            if model_id == saved_tts_model:
+                default_tts_model_display = tts_model_options[i]
+                break
+        self.tts_model_var = tk.StringVar(value=default_tts_model_display)
+        self.tts_model_ids = tts_model_ids # Store for later lookup
+        # TTS Model selection dropdown
+        tts_label = ttk.Label(voice_frame, text="TTS Model:", width=label_width)
+        tts_label.grid(column=0, row=0, sticky=tk.W, pady=(0, 5))
+        tts_menu = ttk.OptionMenu(voice_frame, self.tts_model_var, self.tts_model_var.get(), *tts_model_options, command=self.on_tts_model_change)
+        tts_menu.grid(column=1, row=0, sticky="ew", pady=(0, 5))
+        tts_menu.config(width=dropdown_width, style='Compact.TMenubutton')
+        self.tts_menu = tts_menu # Store reference for later updates
         # Initialize voice selection
         self.available_voices = self.get_available_voices()
         # Determine default voice based on whether API key is available
         default_voice = "fable" if self.has_api_key else self.available_voices[0] if self.available_voices else "[System] Default"
         self.voice_var = tk.StringVar(value=default_voice)
         voice_label = ttk.Label(voice_frame, text="Voice:", width=label_width)
         voice_label.grid(column=0, row=1, sticky=tk.W, pady=(0, 5))
-        voice_menu = ttk.OptionMenu(voice_frame, self.voice_var, self.voice_var.get(), *self.available_voices, command=self.on_voice_change)
+        # Use Combobox instead of OptionMenu to allow both selection and typing
+        voice_menu = ttk.Combobox(voice_frame, textvariable=self.voice_var, values=self.available_voices, state="readonly", width=30)
         voice_menu.grid(column=1, row=1, sticky="ew", pady=(0, 5))
         voice_menu.config(width=dropdown_width, style='Compact.TMenubutton')
+        voice_menu.bind('<<ComboboxSelected>>', lambda e: self.on_voice_change())
+        # Allow typing by switching to normal state on focus, readonly on unfocus
+        voice_menu.bind('<FocusIn>', lambda e: voice_menu.config(state="normal"))
+        voice_menu.bind('<FocusOut>', lambda e: self.on_voice_exit(voice_menu))
+        self.voice_menu = voice_menu # Store reference for later updates
+        # Add hint label for custom voices
+        voice_hint = ttk.Label(voice_frame,
+            text="💡 Click to edit or type custom voice ID",
+            font=("Arial", 7, "italic"),
+            foreground="gray")
+        voice_hint.grid(column=1, row=2, sticky="w", pady=(0, 5))
         # Tone selection with warning for basic version
         self.tone_var = tk.StringVar(value=self.current_tone_name)
         tone_options = ["None"] + list(self.tone_presets.keys())
         tone_label = ttk.Label(voice_frame, text="Tone Preset:", width=label_width)
-        tone_label.grid(column=0, row=2, sticky=tk.W, pady=(0, 5))
+        tone_label.grid(column=0, row=3, sticky=tk.W, pady=(0, 5))
         self.tone_menu = ttk.OptionMenu(voice_frame, self.tone_var, self.tone_var.get(), *tone_options, command=self.on_tone_change)
-        self.tone_menu.grid(column=1, row=2, sticky="ew", pady=(0, 5))
+        self.tone_menu.grid(column=1, row=3, sticky="ew", pady=(0, 5))
         self.tone_menu.config(width=dropdown_width, style='Compact.TMenubutton')
         # Check if we should disable tone menu based on voice type
         if self.voice_var.get().startswith("[System]"):
             self.tone_menu.state(['disabled'])
             self.tone_var.set("None")
         # Add warning label for basic version
         if not self.has_api_key:
-            warning_label = ttk.Label(voice_frame,
-                text="⚠️ Basic Version - Add API Key in Settings for full features",
+            warning_label = ttk.Label(voice_frame,
+                text="⚠️ Basic Version - Add API Key in Settings for full features",
                 foreground="orange",
                 font=("Arial", 8, "italic"))
-            warning_label.grid(column=0, row=3, columnspan=2, sticky=tk.W, pady=(5, 0))
+            warning_label.grid(column=0, row=4, columnspan=2, sticky=tk.W, pady=(5, 0))
         # Separator between Voice Settings and Device Settings
         separator = ttk.Separator(main_frame, orient='horizontal')
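The model dropdown stores display names, so the code maps them back to model IDs with a linear search in several places: `update_tts_model_options`, the widget setup above, the speech call, and `on_tts_model_change`. The sketch below builds that mapping once with two dictionaries and keeps the same SiliconFlow-versus-OpenAI default rule. `ModelCatalog` and `default_model_for` are hypothetical names. The sample list in the usage check is the SiliconFlow list from `get_available_tts_models`.

```python
from typing import Dict, List, Tuple


class ModelCatalog:
    """Bidirectional lookup between TTS model IDs and dropdown display names."""

    def __init__(self, models: List[Tuple[str, str]]):
        self.models = models
        self.id_to_name: Dict[str, str] = dict(models)
        self.name_to_id: Dict[str, str] = {name: model_id for model_id, name in models}

    def display_names(self) -> List[str]:
        return [name for _, name in self.models]

    def id_for(self, display_name: str, fallback: str) -> str:
        return self.name_to_id.get(display_name, fallback)

    def name_for(self, model_id: str) -> str:
        # Fall back to the first entry, as the widget setup does.
        return self.id_to_name.get(model_id, self.models[0][1])


def default_model_for(base_url: str) -> str:
    """Same default rule as the diff: CosyVoice2 for SiliconFlow, else gpt-4o-mini-tts."""
    if base_url and "siliconflow" in base_url.lower():
        return "FunAudioLLM/CosyVoice2-0.5B"
    return "gpt-4o-mini-tts"
```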
@@ -864,9 +1004,34 @@ class TextToMic(tk.Tk):
             return
         try:
+            # Get the selected TTS model
+            selected_tts_model_display = self.tts_model_var.get()
+            selected_tts_model = "gpt-4o-mini-tts" # Default
+            # Find the model ID from display name
+            available_models = self.get_available_tts_models()
+            for model_id, display_name in available_models:
+                if display_name == selected_tts_model_display:
+                    selected_tts_model = model_id
+                    break
+            print(f"[DEBUG] Selected TTS model display: {selected_tts_model_display}")
+            print(f"[DEBUG] Using TTS model ID: {selected_tts_model}")
+            print(f"[DEBUG] Selected voice: {selected_voice}")
+            # For SiliconFlow CosyVoice2-0.5B model, format voice as model:voice
+            # Example: FunAudioLLM/CosyVoice2-0.5B:alex
+            voice_to_use = selected_voice
+            if "CosyVoice2" in selected_tts_model and self.api_base_url and "siliconflow" in self.api_base_url.lower():
+                voice_to_use = f"{selected_tts_model}:{selected_voice}"
+                print(f"[DEBUG] Formatted voice for CosyVoice2: {voice_to_use}")
+            print(f"[DEBUG] API call - Model: {selected_tts_model}, Voice: {voice_to_use}")
+            print(f"[DEBUG] API Base URL: {self.api_base_url if self.api_base_url else 'OpenAI Default'}")
             response = self.client.audio.speech.create(
-                model="gpt-4o-mini-tts",
-                voice=selected_voice,
+                model=selected_tts_model,
+                voice=voice_to_use,
                 input=text,
                 instructions=tone_instructions,
                 response_format='wav'
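The call now passes the selected model and formatted voice but still sends `instructions=tone_instructions` for every model. OpenAI's API reference states that `instructions` does not work with `tts-1` or `tts-1-hd`. Whether SiliconFlow accepts the parameter is not covered by this diff. The sketch below builds the request arguments and includes `instructions` only when there is text to send and the model is not one of those two. `speech_request_kwargs` is hypothetical, and the resulting dictionary would be passed as `client.audio.speech.create(**kwargs)`.

```python
from typing import Any, Dict, Optional

# Per OpenAI's API reference, `instructions` does not work with these models.
MODELS_WITHOUT_INSTRUCTIONS = {"tts-1", "tts-1-hd"}


def speech_request_kwargs(model: str, voice: str, text: str,
                          instructions: Optional[str] = None) -> Dict[str, Any]:
    """Arguments for client.audio.speech.create(**kwargs)."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",
    }
    # Omit instructions when empty or when the model is known not to support them.
    if instructions and model not in MODELS_WITHOUT_INSTRUCTIONS:
        kwargs["instructions"] = instructions
    return kwargs
```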
@@ -1573,13 +1738,13 @@ class TextToMic(tk.Tk):
         if self.has_api_key:
             # Add OpenAI voices
             voices.extend(['alloy', 'ash', 'ballad', 'coral', 'echo', 'fable', 'onyx', 'nova', 'sage', 'shimmer'])
         # Add system voices with [System] prefix
         try:
             if hasattr(self, 'system_voices') and self.system_voices:
                 for voice in self.system_voices:
                     voices.append(f"[System] {voice.name}")
             # If no system voices were found, add a default system voice
             if not voices:
                 voices.append("[System] Default")
@@ -1588,14 +1753,81 @@ class TextToMic(tk.Tk):
         # Ensure we have at least one voice option
         if not voices:
             voices.append("[System] Default")
         return voices
+    def get_available_tts_models(self):
+        """Get list of available TTS models based on the current API base URL."""
+        # Check if using SiliconFlow
+        is_siliconflow = self.api_base_url and "siliconflow" in self.api_base_url.lower()
+        if is_siliconflow:
+            # SiliconFlow TTS models
+            return [
+                ("FunAudioLLM/CosyVoice2-0.5B", "CosyVoice2-0.5B (Multi-language, Emotional)"),
+                ("tts-1", "TTS-1 (OpenAI Compatible)"),
+                ("tts-1-hd", "TTS-1 HD (OpenAI Compatible)")
+            ]
+        else:
+            # OpenAI TTS models
+            return [
+                ("gpt-4o-mini-tts", "GPT-4o Mini TTS (Recommended)"),
+                ("tts-1", "TTS-1 (Standard)"),
+                ("tts-1-hd", "TTS-1 HD (High Quality)")
+            ]
+    def get_siliconflow_voices(self):
+        """Get SiliconFlow-specific voices for CosyVoice2-0.5B model."""
+        return [
+            'alex', 'anna', 'bella', 'benjamin', 'charles',
+            'claire', 'david', 'diana'
+        ]
+    def update_available_voices(self):
+        """Update available voices based on selected TTS model."""
+        tts_model_display = self.tts_model_var.get() if hasattr(self, 'tts_model_var') else ""
+        # Get the actual model ID from display name
+        tts_model_id = None
+        available_models = self.get_available_tts_models()
+        for model_id, display_name in available_models:
+            if display_name == tts_model_display:
+                tts_model_id = model_id
+                break
+        # If using CosyVoice2-0.5B with SiliconFlow, use SiliconFlow voices
+        if tts_model_id and "CosyVoice2" in tts_model_id and self.api_base_url and "siliconflow" in self.api_base_url.lower():
+            voices = self.get_siliconflow_voices()
+            print(f"[DEBUG] Using SiliconFlow CosyVoice voices: {voices}")
+            # Also add system voices
+            if hasattr(self, 'system_voices') and self.system_voices:
+                for voice in self.system_voices:
+                    voices.append(f"[System] {voice.name}")
+            if not voices:
+                voices.append("[System] Default")
+        else:
+            voices = self.get_available_voices()
+            print(f"[DEBUG] Using standard voices")
+        # Update the voice dropdown (now using Combobox)
+        if hasattr(self, 'voice_menu'):
+            current_voice = self.voice_var.get()
+            # Update the combobox values
+            self.voice_menu['values'] = voices
+            # Set default if current voice not in list (unless it's a custom voice)
+            if current_voice not in voices and not (current_voice and not current_voice.startswith("[System]")):
+                self.voice_var.set(voices[0] if voices else "")
+                print(f"[DEBUG] Voice changed to: {voices[0] if voices else ''}")
+            else:
+                print(f"[DEBUG] Voice kept as: {current_voice}")
     def on_voice_change(self, *args):
         """Handle voice selection change."""
         selected_voice = self.voice_var.get()
         is_system_voice = selected_voice.startswith("[System]")
         # Update tone menu state based on voice type
         if is_system_voice:
             self.tone_menu.state(['disabled'])
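`update_available_voices` keeps the current voice whenever it is a non-empty, non-`[System]` value, so that typed custom IDs survive a model change. As a side effect, an OpenAI voice such as `fable` is also kept after switching to CosyVoice2, where it is not one of the listed voices. The sketch below expresses the keep-or-reset decision as a pure function. It keeps listed voices and values that look like SiliconFlow custom IDs (assumed to start with `speech:`) and otherwise falls back to the first option. `choose_voice` and its `custom_prefix` parameter are hypothetical. A caller would pass `custom_prefix=""` for OpenAI models.

```python
from typing import Sequence


def choose_voice(current: str, options: Sequence[str], custom_prefix: str = "speech:") -> str:
    """Keep the current voice if it is listed or looks like a custom voice ID;
    otherwise return the first option (or '' when there are none)."""
    current = current.strip()
    if current in options:
        return current
    if custom_prefix and current.startswith(custom_prefix):
        return current
    return options[0] if options else ""
```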
@@ -1603,6 +1835,46 @@ class TextToMic(tk.Tk):
         else:
             self.tone_menu.state(['!disabled'])
+    def on_voice_exit(self, combobox):
+        """Handle voice combobox focus out - validate and update state."""
+        entered_voice = self.voice_var.get().strip()
+        # If empty, set to first available voice
+        if not entered_voice:
+            if hasattr(self, 'voice_menu'):
+                values = self.voice_menu['values']
+                if values:
+                    self.voice_var.set(values[0])
+                    self.on_voice_change()
+        # Switch back to readonly state
+        combobox.config(state="readonly")
+        # Trigger voice change to update tone menu
+        self.on_voice_change()
+    def on_tts_model_change(self, *args):
+        """Handle TTS model selection change."""
+        selected_model_display = self.tts_model_var.get()
+        # Find the model ID from display name
+        model_id = None
+        available_models = self.get_available_tts_models()
+        for model_id_val, display_name in available_models:
+            if display_name == selected_model_display:
+                model_id = model_id_val
+                break
+        # Save the selected model to settings
+        if model_id:
+            SettingsManager.update_settings({"tts_model": model_id})
+            print(f"[DEBUG] TTS model changed to: {model_id}") # Debug logging
+        else:
+            print(f"[DEBUG] Warning: Could not find model ID for display name: {selected_model_display}")
+        # Update available voices based on the selected model
+        self.update_available_voices()
     def update_window_size(self):
         """Update window size based on current banner and presets state."""
         # Calculate a width that preserves the current width if it's larger than default