OpenAI’s Realtime API offers six voices. Each has a distinct character. Pick one that matches your brand’s tone.
alloy
Neutral, balanced, slightly warm. The default. Works for everything. Gender-ambiguous.
echo
Male-coded. Crisp, confident, slightly higher pitch than onyx. Good for tech, finance, professional services.
fable
Male-coded with a soft British lilt. Storyteller vibe. Good for content-heavy sites, long responses, hospitality with an upscale feel.
onyx
Male-coded, deeper. Authoritative, serious. Good for law firms, healthcare, luxury brands.
nova
Female-coded. Warm, upbeat, enthusiastic. The “front-of-house” voice. Restaurants, retail, travel.
shimmer
Female-coded. Calm, mature, measured. Good for professional services, healthcare, finance.
Dashboard → Voice → Voice → pick one. Changes take effect on the next session (existing sessions keep their voice).
Each industry template picks a sensible default:
| Template | Default voice |
|---|---|
| Restaurant | nova |
| Real estate | shimmer |
| Law firm | onyx |
| E-commerce | nova |
| Healthcare | shimmer |
| Professional services | echo |
| Customer support | alloy |
Override freely.
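The template defaults and the override rule can be pictured as a lookup with an optional override. This is a minimal sketch, not the product's implementation: `TEMPLATE_DEFAULT_VOICE`, `resolveVoice`, and the template keys are hypothetical names chosen for illustration. Only the template-to-voice pairings come from the table above.

```typescript
// Hypothetical sketch: identifiers and key spellings are illustrative assumptions.
// Only the template-to-voice pairings are taken from the table.
type Voice = "alloy" | "echo" | "fable" | "onyx" | "nova" | "shimmer";

type Template =
  | "restaurant"
  | "real_estate"
  | "law_firm"
  | "ecommerce"
  | "healthcare"
  | "professional_services"
  | "customer_support";

const TEMPLATE_DEFAULT_VOICE: Record<Template, Voice> = {
  restaurant: "nova",
  real_estate: "shimmer",
  law_firm: "onyx",
  ecommerce: "nova",
  healthcare: "shimmer",
  professional_services: "echo",
  customer_support: "alloy",
};

// An explicit override always wins; otherwise fall back to the template default.
function resolveVoice(template: Template, override?: Voice): Voice {
  return override ?? TEMPLATE_DEFAULT_VOICE[template];
}
```

For example, `resolveVoice("law_firm")` yields `"onyx"`, while `resolveVoice("law_firm", "fable")` yields `"fable"`.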
Custom and cloned voices are not supported today. OpenAI’s Realtime API only exposes these six voices. Voice cloning is a separate product on OpenAI’s roadmap; we’ll integrate it when it’s available in the Realtime endpoint.
If you need a specific voice (branded spokesperson, celebrity licensing), build a custom integration via a different TTS provider and route the audio through a webhook. Contact sales@spelo.ai for Enterprise solutions.
The Realtime voices don’t expose pitch or speed controls. OpenAI chose the defaults deliberately; they are each voice’s “native” prosody.
If you want slower delivery, write custom_instructions like:
Speak calmly and at a slightly slower pace than normal. Include small pauses between ideas. Acknowledge complex questions before answering.
The model adjusts its cadence accordingly.
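If you maintain instructions in code before pasting them into custom_instructions, the pacing guidance can be kept as a separate paragraph that is appended only when needed. The sketch below assumes nothing about the dashboard's internals: `withPacing`, `Pace`, and `SLOWER_PACING` are hypothetical names, and the pacing text is the example from this section.

```typescript
// Hypothetical helper: `withPacing`, `Pace`, and `SLOWER_PACING` are illustrative
// names, not part of any documented API.
type Pace = "default" | "slower";

const SLOWER_PACING =
  "Speak calmly and at a slightly slower pace than normal. " +
  "Include small pauses between ideas. " +
  "Acknowledge complex questions before answering.";

// Returns the text to paste into custom_instructions.
function withPacing(baseInstructions: string, pace: Pace): string {
  const base = baseInstructions.trim();
  if (pace === "default") {
    return base;
  }
  // Keep pacing guidance in its own paragraph so it is easy to find and remove later.
  return base.length > 0 ? `${base}\n\n${SLOWER_PACING}` : SLOWER_PACING;
}
```

Keeping the pacing text separate from the rest of the instructions makes it simple to compare the same prompt with and without the slower delivery.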
All six voices are clear to screen readers and conventional audio processors. Captions are auto-generated from the WebRTC data-channel transcript stream; turn on the captions toggle in the dashboard to show them in the widget.
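For readers curious how captions can be assembled from a transcript stream, here is a rough sketch of a reducer over data-channel messages. It rests on assumptions this guide does not confirm: that each message is a JSON object with a `type` string, that partial text arrives in a `delta` field on events whose type ends in `transcript.delta`, and that a final event ending in `transcript.done` may carry the full `transcript`. `CaptionState` and `applyTranscriptMessage` are hypothetical names.

```typescript
// Assumed message shape (not confirmed by this guide):
//   { type: "...transcript.delta", delta: "partial text" }
//   { type: "...transcript.done", transcript: "full text" }
interface CaptionState {
  current: string; // caption line still being spoken
  finished: string[]; // completed caption lines
}

function applyTranscriptMessage(state: CaptionState, raw: string): CaptionState {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return state; // ignore payloads that are not JSON
  }
  if (typeof msg !== "object" || msg === null) {
    return state;
  }
  const event = msg as { type?: unknown; delta?: unknown; transcript?: unknown };
  if (typeof event.type !== "string") {
    return state;
  }
  if (/transcript\.delta$/.test(event.type) && typeof event.delta === "string") {
    // Partial text: extend the line currently on screen.
    return { current: state.current + event.delta, finished: state.finished };
  }
  if (/transcript\.done$/.test(event.type)) {
    // Prefer the final transcript if present, else keep what was accumulated.
    const line = typeof event.transcript === "string" ? event.transcript : state.current;
    return { current: "", finished: [...state.finished, line] };
  }
  return state; // unrelated event types leave captions untouched
}
```

In a browser, a function like this would typically be called from the data channel's `message` handler with the event's string payload, and the widget would render `finished` followed by `current`.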
Dashboard → Voice → Preview voice → type a sentence → cycle through all six. Takes ~30 seconds to get a feel for which fits.
Voice choice does not affect latency. All six voices have equivalent time to first audio.