Text-to-Speech

Convert text to speech using AI voices

Convert text to natural-sounding speech using the latest AI voices. Sim's Text-to-Speech (TTS) tools let you generate audio from written text in dozens of languages, with a choice of expressive voices, formats, and advanced controls like speed, style, emotion, and more.

Supported Providers & Models:

  • OpenAI Text-to-Speech (OpenAI):
    OpenAI's TTS API offers realistic voices through the tts-1, tts-1-hd, and gpt-4o-mini-tts models. Available voices include alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, and verse. It supports multiple audio formats (mp3, opus, aac, flac, wav, pcm), adjustable speed, and streaming synthesis.

  • Deepgram Aura (Deepgram Inc.):
    Deepgram’s Aura provides expressive English and multilingual AI voices, optimized for conversational clarity, low latency, and customization. Available models include aura-asteria-en, aura-luna-en, and others. It supports multiple encoding formats (linear16, mp3, opus, aac, flac) and tuning of speed, sample rate, and style.

  • ElevenLabs Text-to-Speech (ElevenLabs):
    ElevenLabs leads in lifelike, emotionally rich TTS, offering dozens of voices in 29+ languages and the ability to clone custom voices. Models support voice design, speech synthesis, and direct API access, with advanced controls for style, emotion, stability, and similarity. Suitable for audiobooks, content creation, accessibility, and more.

  • Cartesia TTS (Cartesia):
    Cartesia offers high-quality, fast, and secure text-to-speech with a focus on privacy and flexible deployment. It provides instant streaming, real-time synthesis, and supports multiple international voices and accents, accessible through a simple API.

  • Google Cloud Text-to-Speech (Google Cloud):
    Google uses DeepMind WaveNet and Neural2 models to power high-fidelity voices in 50+ languages and variants. Features include voice selection, pitch, speaking rate, volume control, SSML tags, and access to both standard and studio-grade premium voices. Widely used for accessibility, IVR, and media.

  • Microsoft Azure Speech (Microsoft Azure):
    Azure provides over 400 neural voices across 140+ languages and locales, with unique voice customization, style, emotion, role, and real-time controls. Offers SSML support for pronunciation, intonation, and more. Ideal for global, enterprise, or creative TTS needs.

  • PlayHT (PlayHT):
    PlayHT specializes in realistic voice synthesis, voice cloning, and instant streaming playback with 800+ voices in over 100 languages. Features include emotion, pitch and speed controls, multi-voice audio, and custom voice creation via the API or online studio.

How to Choose:
Pick your provider and model by prioritizing languages, supported voice types, desired formats (mp3, wav, etc.), control granularity (speed, emotion, etc.), and specialized features (voice cloning, accent, streaming). For creative, accessibility, or developer use cases, ensure compatibility with your application's requirements and compare costs.
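
As a rough illustration of filtering by output format, the sketch below looks up which tools list a given format. The format sets are copied from the provider descriptions and tool parameters on this page; the `SUPPORTED_FORMATS` table and the `tools_supporting` helper are hypothetical names for this example, not part of Sim.

```python
# Formats as listed on this page, lowercased for comparison.
# Tools whose formats are not enumerated here are left out.
SUPPORTED_FORMATS = {
    "tts_openai": {"mp3", "opus", "aac", "flac", "wav", "pcm"},
    "tts_deepgram": {"linear16", "mp3", "opus", "aac", "flac"},
    "tts_google": {"linear16", "mp3", "ogg_opus", "mulaw", "alaw"},
    "tts_playht": {"mp3", "wav", "ogg", "flac", "mulaw"},
}


def tools_supporting(fmt: str) -> list[str]:
    """Return the tools (sorted by name) that list the given audio format."""
    wanted = fmt.lower()
    return sorted(
        tool for tool, formats in SUPPORTED_FORMATS.items() if wanted in formats
    )


print(tools_supporting("flac"))
```

The same lookup pattern could be extended with languages or feature flags, but those values would need to come from each provider's current documentation.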

Visit each provider’s official site for up-to-date capabilities, pricing, and documentation.

Usage Instructions

Generate natural-sounding speech from text using state-of-the-art AI voices from OpenAI, Deepgram, ElevenLabs, Cartesia, Google Cloud, Azure, and PlayHT. Supports multiple voices, languages, and audio formats.

Tools

tts_openai

Convert text to speech using OpenAI TTS models

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | OpenAI API key |
| model | string | No | OpenAI TTS model identifier (e.g., "tts-1", "tts-1-hd", "gpt-4o-mini-tts") |
| voice | string | No | OpenAI voice identifier (e.g., "alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer") |
| responseFormat | string | No | Audio format (mp3, opus, aac, flac, wav, pcm) |
| speed | number | No | Speech speed multiplier from 0.25 to 4.0 (e.g., 0.5 for slower, 1.0 for normal, 2.0 for faster) |
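
The following is a minimal sketch of assembling and checking the tts_openai inputs before handing them to the tool. The parameter names, formats, and speed range come from the table above; the `build_openai_tts_input` helper is a hypothetical name, and how the resulting dictionary is passed to Sim is not shown here.

```python
OPENAI_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_openai_tts_input(text, api_key, model=None, voice=None,
                           response_format=None, speed=None) -> dict:
    """Validate the documented constraints and return the input dictionary."""
    if not text:
        raise ValueError("text is required")
    if not api_key:
        raise ValueError("apiKey is required")
    if response_format is not None and response_format not in OPENAI_FORMATS:
        raise ValueError(f"unsupported responseFormat: {response_format}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    params = {"text": text, "apiKey": api_key}
    optional = {
        "model": model,
        "voice": voice,
        "responseFormat": response_format,
        "speed": speed,
    }
    # Leave out optional values that were not provided.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


print(build_openai_tts_input("Hello, welcome to our service!", "sk-placeholder",
                             model="tts-1", voice="alloy", speed=1.25))
```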

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |
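
Every TTS tool on this page returns the same output fields, so downstream code can treat the result uniformly. The sketch below assumes the output arrives as a plain dictionary keyed by the field names above; the `TtsResult` class, the `parse_tts_result` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TtsResult:
    audio_url: str
    duration: float        # seconds
    character_count: int
    format: str
    provider: str
    audio_file: Any = None  # file object; its exact shape is not described here


def parse_tts_result(raw: dict) -> TtsResult:
    """Map the documented output fields onto a typed record."""
    return TtsResult(
        audio_url=raw["audioUrl"],
        duration=float(raw["duration"]),
        character_count=int(raw["characterCount"]),
        format=raw["format"],
        provider=raw["provider"],
        audio_file=raw.get("audioFile"),
    )


# Made-up sample values, for illustration only.
sample = {
    "audioUrl": "https://example.com/audio.mp3",
    "duration": 2.5,
    "characterCount": 30,
    "format": "mp3",
    "provider": "openai",
}
result = parse_tts_result(sample)
print(result.provider, result.character_count / result.duration)
```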

tts_deepgram

Convert text to speech using Deepgram Aura

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | Deepgram API key |
| model | string | No | Deepgram model/voice identifier (e.g., "aura-asteria-en", "aura-luna-en", "aura-2-luna-en") |
| voice | string | No | Deepgram voice identifier, alternative to model param (e.g., "aura-asteria-en", "aura-orion-en") |
| encoding | string | No | Audio encoding (linear16, mp3, opus, aac, flac) |
| sampleRate | number | No | Sample rate (8000, 16000, 24000, 48000) |
| bitRate | number | No | Bit rate for compressed formats |
| container | string | No | Container format (none, wav, ogg) |
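
Because `voice` is described as an alternative to `model`, a caller may want to guard against passing two different identifiers. The sketch below does that and checks the enumerated values from the table. The `build_deepgram_tts_input` helper is hypothetical, and rejecting conflicting `model` and `voice` values is a choice made for this example, since the page does not say which one takes precedence.

```python
DEEPGRAM_ENCODINGS = {"linear16", "mp3", "opus", "aac", "flac"}
DEEPGRAM_SAMPLE_RATES = {8000, 16000, 24000, 48000}
DEEPGRAM_CONTAINERS = {"none", "wav", "ogg"}


def build_deepgram_tts_input(text, api_key, model=None, voice=None,
                             encoding=None, sample_rate=None, container=None) -> dict:
    """Return tts_deepgram inputs after checking the documented value sets."""
    if model and voice and model != voice:
        # Precedence is not documented, so this sketch refuses to guess.
        raise ValueError("pass either model or voice, not two different values")
    if encoding is not None and encoding not in DEEPGRAM_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    if sample_rate is not None and sample_rate not in DEEPGRAM_SAMPLE_RATES:
        raise ValueError(f"unsupported sampleRate: {sample_rate}")
    if container is not None and container not in DEEPGRAM_CONTAINERS:
        raise ValueError(f"unsupported container: {container}")

    params = {"text": text, "apiKey": api_key}
    if model:
        params["model"] = model
    elif voice:
        params["voice"] = voice
    optional = {"encoding": encoding, "sampleRate": sample_rate, "container": container}
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


print(build_deepgram_tts_input("Hello!", "dg-placeholder",
                               voice="aura-orion-en", encoding="linear16",
                               sample_rate=24000, container="wav"))
```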

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_elevenlabs

Convert text to speech using ElevenLabs voices

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| voiceId | string | Yes | ElevenLabs voice identifier (e.g., "21m00Tcm4TlvDq8ikWAM", "AZnzlk1XvdvUeBnXmlld") |
| apiKey | string | Yes | ElevenLabs API key |
| modelId | string | No | ElevenLabs model identifier (e.g., "eleven_turbo_v2_5", "eleven_flash_v2_5", "eleven_multilingual_v2") |
| stability | number | No | Voice stability (0.0 to 1.0, default: 0.5) |
| similarityBoost | number | No | Similarity boost (0.0 to 1.0, default: 0.8) |
| style | number | No | Style exaggeration (0.0 to 1.0) |
| useSpeakerBoost | boolean | No | Use speaker boost (default: true) |
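
The voice-tuning parameters share a 0.0 to 1.0 range and have documented defaults, which makes them easy to collect in one place. A minimal sketch follows; the `elevenlabs_voice_settings` helper is a hypothetical name, and the defaults mirror the table above.

```python
def elevenlabs_voice_settings(stability=0.5, similarity_boost=0.8,
                              style=None, use_speaker_boost=True) -> dict:
    """Return the tuning inputs, rejecting values outside 0.0 to 1.0."""
    settings = {"stability": stability, "similarityBoost": similarity_boost}
    if style is not None:
        settings["style"] = style
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    settings["useSpeakerBoost"] = use_speaker_boost
    return settings


# Merge the tuning values with the required inputs.
tool_input = {
    "text": "Hello, welcome to our service!",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "apiKey": "el-placeholder",
    **elevenlabs_voice_settings(stability=0.3, style=0.6),
}
print(tool_input)
```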

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_cartesia

Convert text to speech using Cartesia Sonic (ultra-low latency)

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | Cartesia API key |
| modelId | string | No | Cartesia model identifier (e.g., "sonic", "sonic-2", "sonic-3", "sonic-multilingual") |
| voice | string | No | Cartesia voice identifier or embedding (e.g., "a0e99841-438c-4a64-b679-ae501e7d6091") |
| language | string | No | Language code for speech synthesis (e.g., "en", "es", "fr", "de", "it", "pt") |
| outputFormat | json | No | Output format configuration (container, encoding, sampleRate) |
| speed | number | No | Speech speed multiplier (e.g., 0.5 for slower, 1.0 for normal, 2.0 for faster) |
| emotion | array | No | Emotion tags for Sonic-3 (e.g., ['positivity:high']) |
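
`outputFormat` is the only structured input here, so a short sketch of its shape may help. The three keys (container, encoding, sampleRate) come from the table above; the specific values used below ("wav", "pcm_s16le", 44100) are assumptions for illustration and should be checked against Cartesia's documentation. The `build_cartesia_tts_input` helper is hypothetical.

```python
import json

OUTPUT_FORMAT_KEYS = {"container", "encoding", "sampleRate"}


def build_cartesia_tts_input(text, api_key, model_id=None, voice=None,
                             language=None, output_format=None, emotion=None) -> dict:
    """Assemble tts_cartesia inputs, rejecting unknown outputFormat keys."""
    params = {"text": text, "apiKey": api_key}
    optional = {"modelId": model_id, "voice": voice, "language": language}
    params.update({k: v for k, v in optional.items() if v is not None})
    if output_format is not None:
        unknown = set(output_format) - OUTPUT_FORMAT_KEYS
        if unknown:
            raise ValueError(f"unknown outputFormat keys: {sorted(unknown)}")
        params["outputFormat"] = dict(output_format)
    if emotion is not None:
        params["emotion"] = list(emotion)
    return params


tool_input = build_cartesia_tts_input(
    "Hello, welcome to our service!",
    "cartesia-placeholder",
    model_id="sonic-3",
    language="en",
    # Assumed example values; only the key names come from this page.
    output_format={"container": "wav", "encoding": "pcm_s16le", "sampleRate": 44100},
    emotion=["positivity:high"],
)
print(json.dumps(tool_input["outputFormat"]))
```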

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_google

Convert text to speech using Google Cloud Text-to-Speech

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | Google Cloud API key |
| voiceId | string | No | Google Cloud voice identifier (e.g., "en-US-Neural2-A", "en-US-Wavenet-D", "en-GB-Neural2-B") |
| languageCode | string | Yes | BCP-47 language code for speech synthesis (e.g., "en-US", "es-ES", "fr-FR", "de-DE") |
| gender | string | No | Voice gender (MALE, FEMALE, NEUTRAL) |
| audioEncoding | string | No | Audio encoding (LINEAR16, MP3, OGG_OPUS, MULAW, ALAW) |
| speakingRate | number | No | Speaking rate multiplier from 0.25 to 2.0 (e.g., 0.5 for slower, 1.0 for normal, 1.5 for faster) |
| pitch | number | No | Voice pitch (-20.0 to 20.0, default: 0.0) |
| volumeGainDb | number | No | Volume gain in dB (-96.0 to 16.0) |
| sampleRateHertz | number | No | Sample rate in Hz |
| effectsProfileId | array | No | Effects profile (e.g., ['headphone-class-device']) |
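
Three of the numeric inputs have documented bounds, which lends itself to a table-driven check. The sketch below reports every out-of-range value instead of stopping at the first one. The ranges are taken from the table above; `GOOGLE_RANGES` and `out_of_range` are hypothetical names.

```python
# (low, high) bounds as documented for tts_google.
GOOGLE_RANGES = {
    "speakingRate": (0.25, 2.0),
    "pitch": (-20.0, 20.0),
    "volumeGainDb": (-96.0, 16.0),
}


def out_of_range(params: dict) -> list[str]:
    """Return one message per numeric input that violates its documented range."""
    problems = []
    for name, (low, high) in GOOGLE_RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            problems.append(f"{name}={value} is outside {low} to {high}")
    return problems


tool_input = {
    "text": "Hello, welcome to our service!",
    "apiKey": "google-placeholder",
    "languageCode": "en-US",
    "voiceId": "en-US-Neural2-A",
    "speakingRate": 3.0,   # too fast: the documented maximum is 2.0
    "pitch": 0.0,
}
print(out_of_range(tool_input))
```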

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_azure

Convert text to speech using Azure Cognitive Services

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | Azure Speech Services API key |
| voiceId | string | No | Azure voice identifier (e.g., "en-US-JennyNeural", "en-US-GuyNeural", "en-GB-SoniaNeural") |
| region | string | No | Azure region (e.g., eastus, westus, westeurope) |
| outputFormat | string | No | Output audio format |
| rate | string | No | Speaking rate (e.g., +10%, -20%, 1.5) |
| pitch | string | No | Voice pitch (e.g., +5Hz, -2st, low) |
| style | string | No | Speaking style (e.g., cheerful, sad, angry - neural voices only) |
| styleDegree | number | No | Style intensity (0.01 to 2.0) |
| role | string | No | Role (e.g., Girl, Boy, YoungAdultFemale) |
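
The rate, pitch, style, styleDegree, and role inputs correspond to concepts in Azure's SSML dialect (`prosody` and `mstts:express-as`). The sketch below shows one way such values could be expressed as SSML. It is an illustration only: whether the tts_azure tool builds its request this way is an assumption, and the `azure_ssml` helper is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def azure_ssml(text, voice_id, lang="en-US", rate=None, pitch=None,
               style=None, style_degree=None, role=None) -> str:
    """Build an SSML string from the speaking-style inputs (illustrative only)."""
    body = escape(text)  # escape &, <, > in the spoken text

    prosody = [(k, v) for k, v in (("rate", rate), ("pitch", pitch)) if v is not None]
    if prosody:
        attrs = " ".join(f"{k}={quoteattr(str(v))}" for k, v in prosody)
        body = f"<prosody {attrs}>{body}</prosody>"

    express = []
    if style is not None:
        express.append(f"style={quoteattr(style)}")
    if style_degree is not None:
        if not 0.01 <= style_degree <= 2.0:
            raise ValueError("styleDegree must be between 0.01 and 2.0")
        express.append(f"styledegree={quoteattr(str(style_degree))}")
    if role is not None:
        express.append(f"role={quoteattr(role)}")
    if express:
        body = f"<mstts:express-as {' '.join(express)}>{body}</mstts:express-as>"

    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_id)}>{body}</voice></speak>"
    )


print(azure_ssml("Hello, welcome to our service!", "en-US-JennyNeural",
                 rate="+10%", pitch="+5Hz", style="cheerful", style_degree=1.5))
```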

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |

tts_playht

Convert text to speech using PlayHT (voice cloning)

Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The text content to convert to speech (e.g., "Hello, welcome to our service!") |
| apiKey | string | Yes | PlayHT API key (AUTHORIZATION header) |
| userId | string | Yes | PlayHT user ID (X-USER-ID header) |
| voice | string | No | PlayHT voice identifier or manifest URL (e.g., "s3://voice-cloning-zero-shot/...") |
| quality | string | No | Quality level (draft, standard, premium) |
| outputFormat | string | No | Output format (mp3, wav, ogg, flac, mulaw) |
| speed | number | No | Speech speed multiplier from 0.5 to 2.0 (e.g., 0.5 for slower, 1.0 for normal, 1.5 for faster) |
| temperature | number | No | Creativity/randomness (0.0 to 2.0) |
| voiceGuidance | number | No | Voice stability (1.0 to 6.0) |
| textGuidance | number | No | Text adherence (1.0 to 6.0) |
| sampleRate | number | No | Sample rate (8000, 16000, 22050, 24000, 44100, 48000) |
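
PlayHT is the only tool here that requires two credentials, `apiKey` and `userId`. One way to keep both out of workflow definitions is to read them from the environment, as sketched below. The environment variable names `PLAYHT_API_KEY` and `PLAYHT_USER_ID` and the `playht_input_from_env` helper are hypothetical choices for this example.

```python
import os
from typing import Mapping, Optional


def playht_input_from_env(text: str,
                          env: Optional[Mapping[str, str]] = None,
                          **options) -> dict:
    """Build tts_playht inputs, taking both credentials from environment variables."""
    env = os.environ if env is None else env
    required = ("PLAYHT_API_KEY", "PLAYHT_USER_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "text": text,
        "apiKey": env["PLAYHT_API_KEY"],
        "userId": env["PLAYHT_USER_ID"],
        **options,  # optional inputs such as quality, outputFormat, speed
    }


# A fake environment keeps the example runnable without real credentials.
fake_env = {"PLAYHT_API_KEY": "key-placeholder", "PLAYHT_USER_ID": "user-placeholder"}
print(playht_input_from_env("Hello!", env=fake_env, quality="premium", speed=1.0))
```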

Output

| Parameter | Type | Description |
| --- | --- | --- |
| audioUrl | string | URL to the generated audio file |
| audioFile | file | Generated audio file object |
| duration | number | Audio duration in seconds |
| characterCount | number | Number of characters processed |
| format | string | Audio format |
| provider | string | TTS provider used |
