Overview
Omi’s transcription system provides real-time speech-to-text conversion with speaker identification, multiple language support, and seamless integration with the conversation processing pipeline.

Quick Start
Connect to the /v4/listen WebSocket with your user token and start streaming audio. Transcripts arrive in real time as JSON.

WebSocket Endpoint
Endpoint URL

wss://<your-backend-host>/v4/listen
Query Parameters
uid (required)
Type: string
User ID obtained from Firebase authentication. Required for all connections.

language
Type: string | Default: 'en'
Language code for transcription. Supports:
- Standard codes: 'en', 'es', 'fr', 'de', 'ja', 'zh', etc.
- Multi-language: 'multi' for automatic language detection (uses Soniox)

sample_rate
Type: integer | Default: 8000
Audio sample rate in Hz. Common values: 8000, 16000, 44100, 48000.

codec
Type: string | Default: 'pcm8'
Audio codec. Supported options:
- pcm8 - 8-bit PCM (default)
- pcm16 - 16-bit PCM
- opus - Opus codec (16kHz)
- opus_fs320 - Opus with 320 frame size
- aac - AAC codec
- lc3 - LC3 codec
- lc3_fs1030 - LC3 with 1030 frame size

channels
Type: integer | Default: 1
Number of audio channels. Use 1 for mono, 2 for stereo.

include_speech_profile
Type: boolean | Default: true
Enable speaker identification using the user’s stored speech profile. When enabled, the system uses a dual-socket architecture for improved speaker detection.

conversation_timeout
Type: integer | Default: 120 | Range: 2-14400
Seconds of silence before the conversation is automatically processed. After this timeout, the conversation is saved and LLM processing begins.

stt_service
Type: string | Optional
Explicitly specify STT service. Options: deepgram, soniox, speechmatics. If not specified, the system selects based on language.

custom_stt
Type: string | Default: 'disabled'
Enable custom STT mode. When set to 'enabled', the backend accepts app-provided transcripts instead of using STT services. Useful for apps with their own transcription.

source
Type: string | Optional
Conversation source identifier. Examples: 'omi', 'openglass', 'phone'
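As a quick-start sketch, the Python snippet below opens a session using a few of the query parameters above. The backend host, uid, and pacing are placeholders, and the websockets library is just one way to connect; adapt it to your deployment.

```python
import asyncio
import json
import websockets  # pip install websockets

BACKEND_HOST = "your-backend-host"   # placeholder, not a real Omi hostname
UID = "firebase-user-id"             # placeholder user ID from Firebase auth

async def stream_audio(chunks):
    """Stream raw audio chunks and print transcript segments as they arrive."""
    url = (
        f"wss://{BACKEND_HOST}/v4/listen"
        f"?uid={UID}&language=en&sample_rate=16000&codec=pcm16&channels=1"
    )
    async with websockets.connect(url) as ws:

        async def sender():
            for chunk in chunks:          # raw 16-bit PCM audio bytes
                await ws.send(chunk)      # audio goes up as binary frames
                await asyncio.sleep(0.1)  # pace roughly like live capture

        async def receiver():
            async for message in ws:      # runs until the server closes the socket
                if isinstance(message, str):
                    print(json.loads(message))  # transcript segments arrive as JSON

        await asyncio.gather(sender(), receiver())
```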
Audio Codecs
The system supports multiple audio codecs with automatic decoding:

| Codec | Sample Rate | Description | Use Case |
|---|---|---|---|
| pcm8 | 8kHz | 8-bit PCM | Default, low bandwidth |
| pcm16 | 16kHz | 16-bit PCM | Better quality |
| opus | 16kHz | Opus encoded | Efficient compression |
| opus_fs320 | 16kHz | Opus 320 frame | Alternative frame size |
| aac | Variable | AAC encoded | iOS compatibility |
| lc3 | Variable | LC3 codec | Bluetooth audio |
| lc3_fs1030 | Variable | LC3 1030 frame | Alternative LC3 |
All audio is internally converted to 16-bit linear PCM before being sent to STT providers.
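As an illustration of that normalization (not the backend's actual decode path), widening a pcm8 stream to 16-bit linear PCM is a simple sample-width conversion; the sketch below assumes signed 8-bit samples and leaves Opus/AAC/LC3 to real codec decoders.

```python
import numpy as np

def pcm8_to_pcm16(chunk: bytes) -> bytes:
    """Widen signed 8-bit PCM samples to 16-bit linear PCM.

    Illustrative only: assumes signed 8-bit input; Opus/AAC/LC3 streams
    need a proper codec decoder before this step.
    """
    samples = np.frombuffer(chunk, dtype=np.int8).astype(np.int16)
    return (samples << 8).tobytes()  # scale samples into the 16-bit range
```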
STT Service Selection
The system automatically selects the best STT provider based on language:

Provider Capabilities
| Provider | Languages | Model | Best For |
|---|---|---|---|
| Deepgram Nova-3 | 30+ | nova-3 | Primary English, major languages |
| Deepgram Nova-2 | 40+ | nova-2-general | Broader language support |
| Soniox | 95+ | Real-time | Multi-language, auto-detection |
| Speechmatics | 50+ | Real-time | Additional coverage |
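The routing could be pictured roughly like the sketch below; the exact rules live in the backend, so treat the language checks and fallbacks here as illustrative assumptions rather than the real selection logic.

```python
def pick_stt_service(language: str, requested: str | None = None) -> str:
    """Illustrative provider routing; the backend's actual rules may differ."""
    if requested in ("deepgram", "soniox", "speechmatics"):
        return requested                      # explicit stt_service parameter wins
    if language == "multi":
        return "soniox"                       # automatic language detection
    if language in ("en", "es", "fr", "de", "ja", "zh"):
        return "deepgram"                     # major languages via Nova models
    return "soniox"                           # broad-coverage fallback (assumption)
```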
Deepgram Configuration
When using Deepgram, the following options are configured:

| Option | Value | Purpose |
|---|---|---|
| punctuate | true | Automatic punctuation insertion |
| no_delay | true | Minimize latency for real-time feedback |
| endpointing | 300 | 300ms silence to detect sentence boundaries |
| interim_results | false | Only return final transcripts |
| smart_format | true | Format numbers, dates, currencies |
| profanity_filter | false | Keep all words unfiltered |
| diarize | true | Enable speaker identification |
| filler_words | false | Remove “um”, “uh”, etc. |
| encoding | linear16 | 16-bit PCM encoding |
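These options correspond to Deepgram's live-streaming query parameters. The dict below is a sketch of how such a configuration might be assembled, not a copy of the backend's code; model, language, sample_rate, and channels are added here for completeness.

```python
# Options mirroring the table above, expressed as Deepgram live-streaming
# query parameters (sketch only; the backend may build these differently).
deepgram_options = {
    "model": "nova-3",
    "language": "en",
    "punctuate": "true",
    "no_delay": "true",
    "endpointing": "300",
    "interim_results": "false",
    "smart_format": "true",
    "profanity_filter": "false",
    "diarize": "true",
    "filler_words": "false",
    "encoding": "linear16",
    "sample_rate": "16000",
    "channels": "1",
}
# These would be URL-encoded onto Deepgram's wss://api.deepgram.com/v1/listen endpoint.
```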
Speech Profile & Dual-Socket Architecture
When a user has a speech profile, the system uses a dual-socket architecture for improved speaker identification.
How It Works
Speech Profile Benefits
- User Identification: Audio from the first ~30 seconds trains speaker recognition
- Speaker Attribution: System identifies which segments are from the device owner
- Improved Accuracy: Better speaker diarization in multi-person conversations
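These benefits come from the dual-socket setup: presumably one of the two STT sockets is primed with the stored profile audio, so its raw timestamps are offset by the profile's duration. The hypothetical helper below sketches the timing adjustment mentioned in the flow that follows; it is an assumption about the mechanism, not the backend's actual code.

```python
def adjust_for_profile_offset(segments, profile_duration_sec: float):
    """Shift segment timing back by the duration of prepended speech-profile audio.

    Hypothetical helper: assumes the STT provider saw the profile audio first,
    so raw timestamps are offset by its length.
    """
    adjusted = []
    for seg in segments:
        start = seg["start"] - profile_duration_sec
        end = seg["end"] - profile_duration_sec
        if end <= 0:
            continue  # segment belongs to the profile audio itself; drop it
        adjusted.append({**seg, "start": max(start, 0.0), "end": end})
    return adjusted
```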
Transcription Flow
Connection Established
WebSocket connection accepted, user validated, STT provider selected based on language.
Audio Streaming
App sends binary audio chunks. Backend decodes based on codec parameter.
STT Processing
Decoded audio sent to Deepgram/Soniox. Provider returns word-level transcripts with speaker IDs.
Segment Creation
Words grouped into segments. Same-speaker consecutive words merged. Timing adjusted for speech profile offset.
Real-time Delivery
JSON segments streamed back to app immediately. UI updates as user speaks.
Conversation Lifecycle
Background task monitors silence. After conversation_timeout, the conversation is processed and saved.
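The Segment Creation step can be pictured roughly as follows; the field names and pause threshold are hypothetical, and the backend's actual merging logic may differ.

```python
def words_to_segments(words, gap_threshold: float = 0.8):
    """Group word-level STT results into segments, merging consecutive words
    from the same speaker (illustrative sketch; field names are hypothetical)."""
    segments = []
    for word in words:  # each word: {"text", "speaker", "start", "end"}
        last = segments[-1] if segments else None
        if (
            last
            and last["speaker"] == word["speaker"]
            and word["start"] - last["end"] <= gap_threshold
        ):
            last["text"] += " " + word["text"]   # same speaker, short gap: keep merging
            last["end"] = word["end"]
        else:
            segments.append(dict(word))          # new speaker or long pause: new segment
    return segments
```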
Message Formats

Incoming Messages (App → Backend)
- Audio Data
- Speaker Assignment
- Custom Transcript
- Image Chunk

Audio Data
Format: Binary. Raw audio bytes encoded according to the codec parameter. Sent continuously during recording.
Keep-alive: Messages of 2 bytes or less are treated as heartbeat pings.

Outgoing Messages (Backend → App)
- Transcript Segments
- Service Status
- Speaker Suggestion
- Conversation Created
- Translations

Transcript Segments
Format: JSON Array. Real-time transcript segments as they’re detected:
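For illustration, a Transcript Segments message might look like the example below; the values are invented, and the wire payload may include only a subset of the fields documented in the model that follows.

```json
[
  {
    "id": "7c9f1a2e-0000-0000-0000-000000000000",
    "text": "Hey, can you send me the notes from yesterday?",
    "speaker": "SPEAKER_00",
    "speaker_id": 0,
    "is_user": true,
    "person_id": null,
    "start": 12.4,
    "end": 15.1
  }
]
```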
Transcript Segment Model
Each transcript segment contains:

| Field | Type | Description |
|---|---|---|
| id | string | Unique UUID for the segment |
| text | string | Transcribed text content |
| speaker | string | Speaker label ("SPEAKER_00", "SPEAKER_01", etc.) |
| speaker_id | integer | Numeric speaker ID (0, 1, 2…) |
| is_user | boolean | true if spoken by device owner |
| person_id | string? | UUID of identified person (if matched) |
| start | float | Start time in seconds |
| end | float | End time in seconds |
| speech_profile_processed | boolean | Whether speech profile was used for identification |
| stt_provider | string? | Name of STT provider used |
Connection Lifecycle
Lifecycle Events
Open
- WebSocket accepted
- User authentication verified
- Language/STT service selected
- STT connections initialized (with retry logic)
- Speech profile loaded in background
- Heartbeat task started (10s interval)
Stream
- Audio received and decoded
- Sent to STT provider(s)
- Results collected in buffers
- Processed every 600ms (see the sketch after this section)
- Segments sent to client
- Speaker suggestions generated
Close
- Usage statistics recorded
- All STT sockets closed
- Client WebSocket closed (code 1000/1001)
- Buffers and collections cleared
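The Stream phase's buffered delivery could be sketched as a periodic flush task like the one below; it assumes a FastAPI/Starlette WebSocket and asyncio, and is illustrative rather than the backend's actual implementation.

```python
import asyncio

async def flush_transcripts(ws, buffer: list, interval: float = 0.6):
    """Periodically drain buffered STT results and push them to the client.

    Hypothetical sketch of the Stream phase: results accumulate in a shared
    buffer and are flushed to the app roughly every 600ms.
    """
    while True:
        await asyncio.sleep(interval)
        if not buffer:
            continue
        segments = list(buffer)
        buffer.clear()
        await ws.send_json(segments)  # assumes a FastAPI/Starlette WebSocket
```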
Error Handling & Retry Logic
The system includes robust error handling:

| Error Type | Handling |
|---|---|
| STT Connection Failed | Exponential backoff retry (1s → 32s, 3 attempts) |
| Provider Error | Automatic fallback to next provider |
| Decode Error | Log and skip corrupted audio chunk |
| WebSocket Error | Clean close with appropriate code |
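An exponential backoff of that shape can be sketched as below; the attempt count, cap, and jitter are illustrative, and the real retry code may differ.

```python
import asyncio
import random

async def connect_with_backoff(connect, attempts: int = 3,
                               base: float = 1.0, cap: float = 32.0):
    """Retry an async STT connection with exponential backoff (sketch only)."""
    for attempt in range(attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                                     # out of retries, surface the error
            delay = min(base * (2 ** attempt), cap)       # 1s, 2s, 4s, ... capped at 32s
            await asyncio.sleep(delay + random.random())  # small jitter to avoid thundering herd
```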
Key File Locations
| Component | Path |
|---|---|
| WebSocket Handler | backend/routers/transcribe.py |
| Deepgram Integration | backend/utils/stt/streaming.py |
| Soniox Integration | backend/utils/stt/streaming.py |
| Audio Decoding | backend/routers/transcribe.py |
| Speech Profile | backend/utils/stt/speech_profile.py |
| VAD (Voice Activity) | backend/utils/stt/vad.py |
| Transcript Model | backend/models/transcript_segment.py |