Overview
Omi’s transcription system provides real-time speech-to-text conversion with speaker identification, multiple language support, and seamless integration with the conversation processing pipeline.

Quick Start
Connect to the /v4/listen WebSocket with your user token and start streaming audio. Transcripts arrive in real-time as JSON.

WebSocket Endpoint
Endpoint URL
/v4/listen (relative to the Omi backend host)
Query Parameters

uid (required)
Type: string
User ID obtained from Firebase authentication. Required for all connections.

language
Type: string | Default: 'en'
Language code for transcription. Supports:
- Standard codes: 'en', 'es', 'fr', 'de', 'ja', 'zh', etc.
- Multi-language: 'multi' for automatic language detection (uses Soniox)

sample_rate
Type: integer | Default: 8000
Audio sample rate in Hz. Common values: 8000, 16000, 44100, 48000.

codec
Type: string | Default: 'pcm8'
Audio codec. Supported options:
- pcm8 - 8-bit PCM (default)
- pcm16 - 16-bit PCM
- opus - Opus codec (16kHz)
- opus_fs320 - Opus with 320 frame size
- aac - AAC codec
- lc3 - LC3 codec
- lc3_fs1030 - LC3 with 1030 frame size

channels
Type: integer | Default: 1
Number of audio channels. Use 1 for mono, 2 for stereo.

include_speech_profile
Type: boolean | Default: true
Enable speaker identification using the user’s stored speech profile. When enabled, the system uses a dual-socket architecture for improved speaker detection.

conversation_timeout
Type: integer | Default: 120 | Range: 2-14400
Seconds of silence before the conversation is automatically processed. After this timeout, the conversation is saved and LLM processing begins.

stt_service
Type: string | Optional
Explicitly specify the STT service. Options: deepgram, soniox, speechmatics. If not specified, the system selects a provider based on language.

custom_stt
Type: string | Default: 'disabled'
Enable custom STT mode. When set to 'enabled', the backend accepts app-provided transcripts instead of using STT services. Useful for apps with their own transcription.

source
Type: string | Optional
Conversation source identifier. Examples: 'omi', 'openglass', 'phone'
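Putting these parameters together, here is a minimal client sketch. It assumes a Python client using the websockets package; the backend host placeholder and the audio source are illustrative, and authentication details beyond the uid parameter depend on your deployment.

```python
# Minimal sketch: open /v4/listen with query parameters, stream audio bytes,
# and print transcript segments as they arrive. The host below is a placeholder.
import asyncio
import json
from urllib.parse import urlencode

import websockets

BACKEND = "wss://your-omi-backend.example.com"  # placeholder host

async def stream(uid: str, audio_chunks):
    params = urlencode({
        "uid": uid,                    # Firebase user ID (required)
        "language": "en",
        "sample_rate": 16000,
        "codec": "pcm16",
        "channels": 1,
        "include_speech_profile": "true",
    })
    async with websockets.connect(f"{BACKEND}/v4/listen?{params}") as ws:
        async def send_audio():
            for chunk in audio_chunks:          # raw bytes in the declared codec
                await ws.send(chunk)

        async def receive_segments():
            async for message in ws:
                if isinstance(message, str):    # transcript segments arrive as JSON
                    print(json.loads(message))

        await asyncio.gather(send_audio(), receive_segments())
```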
Audio Codecs
The system supports multiple audio codecs with automatic decoding:

| Codec | Sample Rate | Description | Use Case |
|---|---|---|---|
| pcm8 | 8kHz | 8-bit PCM | Default, low bandwidth |
| pcm16 | 16kHz | 16-bit PCM | Better quality |
| opus | 16kHz | Opus encoded | Efficient compression |
| opus_fs320 | 16kHz | Opus 320 frame | Alternative frame size |
| aac | Variable | AAC encoded | iOS compatibility |
| lc3 | Variable | LC3 codec | Bluetooth audio |
| lc3_fs1030 | Variable | LC3 1030 frame | Alternative LC3 |
All audio is internally converted to 16-bit linear PCM before being sent to STT providers.
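For example, Opus input could be decoded to linear PCM roughly like this (a sketch assuming the opuslib bindings; the actual decoding in backend/routers/transcribe.py may differ):

```python
# Sketch only: decode Opus frames to 16-bit linear PCM before forwarding to STT.
# Assumes the `opuslib` package; 320 samples per frame matches 20 ms at 16 kHz.
import opuslib

SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_SIZE = 320  # samples per Opus frame

decoder = opuslib.Decoder(SAMPLE_RATE, CHANNELS)

def opus_to_linear16(opus_frame: bytes) -> bytes:
    """Return little-endian 16-bit PCM for one Opus frame."""
    return decoder.decode(opus_frame, FRAME_SIZE)
```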
STT Service Selection
The system automatically selects the best STT provider based on language:

Provider Capabilities
| Provider | Languages | Model | Best For |
|---|---|---|---|
| Deepgram Nova-3 | 30+ | nova-3 | Primary English, major languages |
| Deepgram Nova-2 | 40+ | nova-2-general | Broader language support |
| Soniox | 95+ | Real-time | Multi-language, auto-detection |
| Speechmatics | 50+ | Real-time | Additional coverage |
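The selection rules are not spelled out here, but a simplified sketch of language-based routing looks like this (the language set and fallback order are assumptions, not the backend's exact logic):

```python
# Illustrative sketch of language-based STT routing; the real rules live in
# backend/routers/transcribe.py and may differ.
DEEPGRAM_LANGUAGES = {"en", "es", "fr", "de", "ja", "zh"}  # assumed subset

def select_stt_provider(language: str, stt_service: str | None = None) -> str:
    if stt_service:                      # explicit override via the stt_service parameter
        return stt_service
    if language == "multi":              # automatic language detection
        return "soniox"
    if language in DEEPGRAM_LANGUAGES:   # major languages go to Deepgram
        return "deepgram"
    return "soniox"                      # broad multilingual fallback (assumed)
```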
Deepgram Configuration
When using Deepgram, the following options are configured:

| Option | Value | Purpose |
|---|---|---|
| punctuate | true | Automatic punctuation insertion |
| no_delay | true | Minimize latency for real-time feedback |
| endpointing | 300 | 300ms silence to detect sentence boundaries |
| interim_results | false | Only return final transcripts |
| smart_format | true | Format numbers, dates, currencies |
| profanity_filter | false | Keep all words unfiltered |
| diarize | true | Enable speaker identification |
| filler_words | false | Remove “um”, “uh”, etc. |
| encoding | linear16 | 16-bit PCM encoding |
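Expressed as Deepgram streaming query parameters, the same configuration looks roughly like this (the parameter names are Deepgram's; the assembled URL and sample rate are illustrative, and the real integration lives in backend/utils/stt/streaming.py):

```python
# Sketch: the table above expressed as Deepgram streaming query parameters.
from urllib.parse import urlencode

deepgram_options = {
    "model": "nova-3",
    "punctuate": "true",
    "no_delay": "true",
    "endpointing": 300,
    "interim_results": "false",
    "smart_format": "true",
    "profanity_filter": "false",
    "diarize": "true",
    "filler_words": "false",
    "encoding": "linear16",
    "sample_rate": 16000,   # matches the PCM produced by the internal conversion
}

DEEPGRAM_URL = "wss://api.deepgram.com/v1/listen?" + urlencode(deepgram_options)
```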
External Custom STT Service
Build your own transcription/diarization WebSocket service that integrates with Omi.

Your Service Receives
| Message | Format | Description |
|---|---|---|
| Audio frames | Binary | Raw audio bytes (codec configured by app, typically opus 16kHz) |
{"type": "CloseStream"} | JSON | End of audio stream |
Your Service Sends
Format: JSON object with a segments array
Segment Fields
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Transcribed text |
| speaker | string | No | Speaker label (SPEAKER_00, SPEAKER_01, etc.) |
| start | float | No | Start time in seconds |
| end | float | No | End time in seconds |
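A minimal service skeleton under these rules might look like the following. It assumes a recent release of the Python websockets package (single-argument connection handler); transcribe_buffer is a hypothetical stand-in for your own transcription and diarization, and a production service would typically stream segments back incrementally rather than once at the end.

```python
# Minimal custom STT service sketch: accept binary audio frames, stop on
# {"type": "CloseStream"}, and reply with a segments payload in the format above.
import asyncio
import json

import websockets

async def handle(ws):
    audio = bytearray()
    async for message in ws:
        if isinstance(message, bytes):                     # raw audio frame
            audio.extend(message)
            continue
        if json.loads(message).get("type") == "CloseStream":
            break

    # transcribe_buffer is hypothetical; it should return a list of dicts like
    # [{"text": "...", "speaker": "SPEAKER_00", "start": 0.0, "end": 2.1}, ...]
    segments = transcribe_buffer(bytes(audio))
    await ws.send(json.dumps({"segments": segments}))

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()                             # run forever

if __name__ == "__main__":
    asyncio.run(main())
```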
Requirements
Speech Profile & Dual-Socket Architecture
When a user has a speech profile, the system uses a sophisticated dual-socket architecture for improved speaker identification.
How It Works
Speech Profile Benefits
- User Identification: Audio from the first ~30 seconds trains speaker recognition
- Speaker Attribution: System identifies which segments are from the device owner
- Improved Accuracy: Better speaker diarization in multi-person conversations
Transcription Flow
Connection Established
WebSocket connection accepted, user validated, STT provider selected based on language.
STT Processing
Decoded audio sent to Deepgram/Soniox. Provider returns word-level transcripts with speaker IDs.
Segment Creation
Words grouped into segments. Same-speaker consecutive words merged. Timing adjusted for speech profile offset.
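A simplified sketch of that grouping step (the word dictionaries and offset handling are illustrative, not the backend's exact code):

```python
# Illustrative sketch: merge consecutive same-speaker words into segments and
# shift timings by the speech-profile offset. Word dicts are assumed to carry
# "word", "speaker", "start", and "end" keys.
def words_to_segments(words: list[dict], profile_offset: float = 0.0) -> list[dict]:
    segments: list[dict] = []
    for w in words:
        start = w["start"] - profile_offset   # remove the prepended profile duration
        end = w["end"] - profile_offset
        if segments and segments[-1]["speaker"] == w["speaker"]:
            segments[-1]["text"] += " " + w["word"]
            segments[-1]["end"] = end
        else:
            segments.append({
                "speaker": w["speaker"],
                "text": w["word"],
                "start": start,
                "end": end,
            })
    return segments
```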
Message Formats
Incoming Messages (App → Backend)
- Audio Data
- Speaker Assignment
- Custom Transcript
- Image Chunk
Format: Binary. Raw audio bytes encoded according to the codec parameter. Sent continuously during recording.
Keep-alive: Messages of 2 bytes or less are treated as heartbeat pings.

Outgoing Messages (Backend → App)
- Transcript Segments
- Service Status
- Speaker Suggestion
- Conversation Created
- Translations
Format: JSON Array. Real-time transcript segments as they’re detected:
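For illustration, a payload might look like this (values are made up; the fields are defined in the table below):

```json
[
  {
    "id": "3f2b6c1e-8d1a-4c9e-9f1a-2b7c4d5e6f70",
    "text": "Hey, how are you doing?",
    "speaker": "SPEAKER_00",
    "speaker_id": 0,
    "is_user": true,
    "person_id": null,
    "start": 0.0,
    "end": 2.1
  }
]
```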
Transcript Segment Model
Each transcript segment contains:

| Field | Type | Description |
|---|---|---|
| id | string | Unique UUID for the segment |
| text | string | Transcribed text content |
| speaker | string | Speaker label ("SPEAKER_00", "SPEAKER_01", etc.) |
| speaker_id | integer | Numeric speaker ID (0, 1, 2…) |
| is_user | boolean | true if spoken by device owner |
| person_id | string? | UUID of identified person (if matched) |
| start | float | Start time in seconds |
| end | float | End time in seconds |
| speech_profile_processed | boolean | Whether speech profile was used for identification |
| stt_provider | string? | Name of STT provider used |
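As a rough Python mirror of these fields (the canonical model is defined in backend/models/transcript_segment.py and may differ):

```python
# Sketch of the segment fields as a dataclass; the real model is a Pydantic
# model in backend/models/transcript_segment.py.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptSegment:
    id: str                              # unique UUID for the segment
    text: str                            # transcribed text content
    speaker: str                         # "SPEAKER_00", "SPEAKER_01", ...
    speaker_id: int                      # numeric speaker ID
    is_user: bool                        # True if spoken by the device owner
    start: float                         # start time in seconds
    end: float                           # end time in seconds
    speech_profile_processed: bool = False
    person_id: Optional[str] = None      # UUID of identified person, if matched
    stt_provider: Optional[str] = None   # name of the STT provider used
```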
Connection Lifecycle
Lifecycle Events
Open
- WebSocket accepted
- User authentication verified
- Language/STT service selected
- STT connections initialized (with retry logic)
- Speech profile loaded in background
- Heartbeat task started (10s interval)
Stream
- Audio received and decoded
- Sent to STT provider(s)
- Results collected in buffers
- Processed every 600ms
- Segments sent to client
- Speaker suggestions generated
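A minimal sketch of that buffering behaviour (names are illustrative; the send call assumes a FastAPI/Starlette WebSocket):

```python
# Illustrative sketch of the Stream phase: drain buffered STT results every
# 600 ms and push finished segments to the client as a JSON array.
import asyncio

async def stream_results(websocket, segment_buffer: list):
    while True:
        await asyncio.sleep(0.6)              # results are processed every 600 ms
        if segment_buffer:
            batch = list(segment_buffer)
            segment_buffer.clear()
            await websocket.send_json(batch)  # transcript segments go out to the app
```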
Error Handling & Retry Logic
The system includes robust error handling:

| Error Type | Handling |
|---|---|
| STT Connection Failed | Exponential backoff retry (1s → 32s, 3 attempts) |
| Provider Error | Automatic fallback to next provider |
| Decode Error | Log and skip corrupted audio chunk |
| WebSocket Error | Clean close with appropriate code |
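The STT connection retry can be sketched like this (the connect callable, exception type, and attempt count are illustrative; delays double from 1s and cap at 32s):

```python
# Sketch of exponential-backoff retry for STT connections.
import asyncio

async def connect_with_retry(connect, max_attempts: int = 3):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise                      # give up; caller falls back to another provider
            await asyncio.sleep(delay)
            delay = min(delay * 2, 32.0)   # 1s -> 2s -> 4s ... capped at 32s
```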
Key File Locations
| Component | Path |
|---|---|
| WebSocket Handler | backend/routers/transcribe.py |
| Deepgram Integration | backend/utils/stt/streaming.py |
| Soniox Integration | backend/utils/stt/streaming.py |
| Audio Decoding | backend/routers/transcribe.py |
| Speech Profile | backend/utils/stt/speech_profile.py |
| VAD (Voice Activity) | backend/utils/stt/vad.py |
| Transcript Model | backend/models/transcript_segment.py |