
Voices

OpenAI’s Realtime API offers six voices. Each has a distinct character. Pick one that matches your brand’s tone.

The six voices

alloy

Neutral, balanced, slightly warm. The default. Works for everything. Gender-ambiguous.

echo

Male-coded. Crisp, confident, slightly higher pitch than onyx. Good for tech, finance, professional services.

fable

Male-coded with a soft British lilt. Storyteller vibe. Good for content-heavy sites, long responses, hospitality with an upscale feel.

onyx

Male-coded, deeper. Authoritative, serious. Good for law firms, healthcare, luxury brands.

nova

Female-coded. Warm, upbeat, enthusiastic. The “front-of-house” voice. Restaurants, retail, travel.

shimmer

Female-coded. Calm, mature, measured. Good for professional services, healthcare, finance.

Selection

Dashboard → Voice → Voice → pick one. Changes take effect on the next session (existing sessions keep their voice).

Default by template

Each industry template picks a sensible default:

Template                Default voice
Restaurant              nova
Real estate             shimmer
Law firm                onyx
E-commerce              nova
Healthcare              shimmer
Professional services   echo
Customer support        alloy

Override freely.
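
The defaults and the override rule can be sketched as a lookup with an optional override. This is illustrative only: the template keys, the `resolveVoice` name, and the idea of passing the override as a string are assumptions, not part of the product's API.

```typescript
// The six voices listed on this page.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"] as const;
type Voice = (typeof VOICES)[number];

// Hypothetical identifiers for the industry templates.
type Template =
  | "restaurant"
  | "real_estate"
  | "law_firm"
  | "ecommerce"
  | "healthcare"
  | "professional_services"
  | "customer_support";

// Default voice per template, as documented on this page.
const TEMPLATE_DEFAULTS: Record<Template, Voice> = {
  restaurant: "nova",
  real_estate: "shimmer",
  law_firm: "onyx",
  ecommerce: "nova",
  healthcare: "shimmer",
  professional_services: "echo",
  customer_support: "alloy",
};

// Type guard: is the string one of the six known voices?
function isVoice(value: string): value is Voice {
  return (VOICES as readonly string[]).indexOf(value) !== -1;
}

// Use the override when one is given, otherwise the template default.
// An unknown voice name is rejected rather than silently ignored.
function resolveVoice(template: Template, override?: string): Voice {
  if (override === undefined) return TEMPLATE_DEFAULTS[template];
  if (!isVoice(override)) throw new Error("Unknown voice: " + override);
  return override;
}
```

For example, `resolveVoice("law_firm")` yields `onyx`, while `resolveVoice("law_firm", "fable")` yields `fable`.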

Can I upload my own voice?

Not today. OpenAI’s Realtime API only exposes these six voices. Voice cloning is a separate product on OpenAI’s roadmap; we’ll integrate it when it’s available in the Realtime endpoint.

If you need a specific voice (branded spokesperson, celebrity licensing), build a custom integration via a different TTS provider and route the audio through a webhook. Contact sales@spelo.ai for Enterprise solutions.

Pitch, speed, prosody

The Realtime voices don’t expose pitch or speed controls. OpenAI chose the defaults deliberately; they are each voice’s “native” prosody.

If you want slower delivery, write custom_instructions like:

Speak calmly and at a slightly slower pace than normal. Include small pauses
between ideas. Acknowledge complex questions before answering.

The model adjusts its cadence accordingly.
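
If you manage instructions in code rather than by hand, a small helper can append the pacing guidance to whatever instructions you already have. This is a minimal sketch: the `AgentSettings` shape and the `withPacing` helper are hypothetical, and only the `custom_instructions` name comes from this page.

```typescript
// Hypothetical settings shape; only `custom_instructions` is named on this page.
interface AgentSettings {
  voice: string;
  custom_instructions: string;
}

// The pacing guidance quoted above, as a single string.
const SLOWER_PACING =
  "Speak calmly and at a slightly slower pace than normal. Include small pauses " +
  "between ideas. Acknowledge complex questions before answering.";

// Returns settings with the pacing guidance appended.
// If the guidance is already present, the settings are returned unchanged.
function withPacing(
  settings: AgentSettings,
  pacing: string = SLOWER_PACING
): AgentSettings {
  const base = settings.custom_instructions.trim();
  if (base.indexOf(pacing) !== -1) return settings;
  const combined = base.length > 0 ? base + "\n\n" + pacing : pacing;
  return { voice: settings.voice, custom_instructions: combined };
}
```

Keeping the pacing text separate from the rest of the instructions makes it easy to try different wordings without touching the main prompt.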

Accessibility

All six voices are clear to screen readers and conventional audio processors. Captions are auto-generated from the WebRTC data-channel transcript stream; turn on the captions toggle in the dashboard to show them in the widget.
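
To illustrate how captions can be built from a transcript stream, here is a minimal sketch of a reducer that folds data-channel messages into caption lines. It assumes the messages are JSON events named `response.audio_transcript.delta` (with a `delta` string) and `response.audio_transcript.done` (with a `transcript` string); those names and the `CaptionState` shape are assumptions for illustration, not a description of how the widget is implemented.

```typescript
// Caption state: the line currently being spoken plus the finished lines.
interface CaptionState {
  current: string;
  lines: string[];
}

// Assumed event names; adjust them to whatever the transcript stream emits.
const DELTA_EVENT = "response.audio_transcript.delta";
const DONE_EVENT = "response.audio_transcript.done";

// Folds one raw data-channel message into the caption state.
// Malformed or unrelated messages leave the state untouched.
function applyTranscriptEvent(state: CaptionState, raw: string): CaptionState {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return state;
  }
  if (typeof parsed !== "object" || parsed === null) return state;
  const event = parsed as Record<string, unknown>;

  if (event.type === DELTA_EVENT && typeof event.delta === "string") {
    // Partial text: extend the line in progress.
    return { current: state.current + event.delta, lines: state.lines };
  }
  if (event.type === DONE_EVENT) {
    // Final text: prefer the full transcript, fall back to the accumulated deltas.
    const finalText =
      typeof event.transcript === "string" ? event.transcript : state.current;
    return { current: "", lines: state.lines.concat([finalText]) };
  }
  return state;
}
```

In a browser, each `message` event on the data channel would pass its `data` string through `applyTranscriptEvent`, and the widget would render `lines` followed by `current`.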

Sampling voices

Dashboard → Voice → Preview voice → type a sentence → cycle through all six. It takes about 30 seconds to get a feel for which one fits.

Latency

Voice choice does not affect latency. All six voices have equivalent time to first audio.
