# Real-time Transcription

> A comprehensive guide to Omi's real-time audio transcription system, covering WebSocket connections, STT providers, speaker diarization, message formats, and building external custom STT services.

## Overview

Omi's transcription system provides **real-time speech-to-text** conversion with speaker identification, multiple language support, and seamless integration with the conversation processing pipeline.

```mermaid
flowchart LR
    subgraph Client["📱 Omi App"]
        Audio[Audio Capture]
    end
    subgraph Backend["🖥️ Backend"]
        WS["/v4/listen<br/>WebSocket"]
        Decode[Audio Decoder]
    end
    subgraph STT["🎧 STT Providers"]
        DG[Deepgram Nova-3]
    end
    Audio -->|Binary stream| WS
    WS --> Decode
    Decode --> DG
    DG -->|Transcript| WS
    WS -->|JSON segments| Audio
```

Connect to the `/v4/listen` WebSocket with your user token and start streaming audio. Transcripts arrive in real time as JSON.

Read through for complete endpoint details, configuration options, and message formats.

* Multiple STT providers with automatic fallback
* Speech profile for user identification
* Dual-socket architecture for speaker training
* [External Custom STT](#external-custom-stt-service) for your own transcription service

## WebSocket Endpoint

WebSocket connections require Firebase authentication. The `uid` parameter must be a valid user ID obtained through Firebase Auth.

### Endpoint URL

```
wss://api.omi.me/v4/listen?uid={uid}&language={lang}&sample_rate={rate}&codec={codec}
```

### Query Parameters

**`uid`** | **Type:** `string`

User ID obtained from Firebase authentication. Required for all connections.

**`language`** | **Type:** `string` | **Default:** `'en'`

Language code for transcription. Supports:

* Standard codes: `'en'`, `'es'`, `'fr'`, `'de'`, `'ja'`, `'zh'`, etc.
* Multi-language: `'multi'` for automatic language detection

**`sample_rate`** | **Type:** `integer` | **Default:** `8000`

Audio sample rate in Hz. Common values: `8000`, `16000`, `44100`, `48000`

**`codec`** | **Type:** `string` | **Default:** `'pcm8'`

Audio codec. Supported options:

* `pcm8` - 8-bit PCM (default)
* `pcm16` - 16-bit PCM
* `opus` - Opus codec (16kHz)
* `opus_fs320` - Opus with 320 frame size
* `aac` - AAC codec
* `lc3` - LC3 codec
* `lc3_fs1030` - LC3 with 1030 frame size

**Type:** `integer` | **Default:** `1`

Number of audio channels. Use `1` for mono, `2` for stereo.

**Type:** `boolean` | **Default:** `true`

Enable speaker identification using the user's stored speech profile. When enabled, the system extracts a speaker embedding from the user's speech profile and uses it to identify the user's voice via biometric matching.
**`conversation_timeout`** | **Type:** `integer` | **Default:** `120` | **Range:** `2-14400`

Seconds of silence before the conversation is automatically processed. After this timeout, the conversation is saved and LLM processing begins.

**Type:** `string` | **Optional**

Explicitly specify the STT service. Options: `deepgram`. If not specified, Deepgram is used.

**`custom_stt`** | **Type:** `string` | **Default:** `'disabled'`

Enable custom STT mode. When set to `'enabled'`, the backend accepts app-provided transcripts instead of using STT services. Useful for apps with their own transcription.

**Type:** `string` | **Optional**

Conversation source identifier. Examples: `'omi'`, `'openglass'`, `'phone'`

## Audio Codecs

The system supports multiple audio codecs with automatic decoding:

| Codec        | Sample Rate | Description    | Use Case               |
| ------------ | ----------- | -------------- | ---------------------- |
| `pcm8`       | 8kHz        | 8-bit PCM      | Default, low bandwidth |
| `pcm16`      | 16kHz       | 16-bit PCM     | Better quality         |
| `opus`       | 16kHz       | Opus encoded   | Efficient compression  |
| `opus_fs320` | 16kHz       | Opus 320 frame | Alternative frame size |
| `aac`        | Variable    | AAC encoded    | iOS compatibility      |
| `lc3`        | Variable    | LC3 codec      | Bluetooth audio        |
| `lc3_fs1030` | Variable    | LC3 1030 frame | Alternative LC3        |

All audio is internally converted to 16-bit linear PCM before being sent to STT providers.
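As a minimal sketch, the endpoint URL from the WebSocket Endpoint section can be assembled with the Python standard library. The helper name `build_listen_url` is hypothetical. Only the four query keys that appear in the documented URL (`uid`, `language`, `sample_rate`, `codec`) are used, with the defaults listed under Query Parameters.

```python
from urllib.parse import urlencode

# Base endpoint as documented in the Endpoint URL section.
BASE_URL = "wss://api.omi.me/v4/listen"


def build_listen_url(uid: str, language: str = "en",
                     sample_rate: int = 8000, codec: str = "pcm8") -> str:
    """Hypothetical helper: assemble the /v4/listen URL from its query parameters."""
    if not uid:
        raise ValueError("uid is required for all connections")
    params = {
        "uid": uid,
        "language": language,
        "sample_rate": sample_rate,
        "codec": codec,
    }
    # urlencode percent-escapes values, so unusual characters in uid stay safe.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_listen_url("user-123", language="multi", sample_rate=16000, codec="opus"))
# → wss://api.omi.me/v4/listen?uid=user-123&language=multi&sample_rate=16000&codec=opus
```

Keeping the parameters in a dictionary makes it straightforward to add further optional keys once their exact names are confirmed against the backend.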
## STT Service Selection

The system uses Deepgram for all transcription:

```mermaid
flowchart TD
    Start[Incoming Audio] --> Lang{Language?}
    Lang -->|Supported| DG3[Deepgram Nova-3]
    Lang -->|Unsupported| Fallback[Fallback to English Nova-3]
```

### Provider Capabilities

| Provider            | Languages | Model    | Best For                |
| ------------------- | --------- | -------- | ----------------------- |
| **Deepgram Nova-3** | 50+       | `nova-3` | All supported languages |

## Deepgram Configuration

When using Deepgram, the following options are configured:

| Option             | Value      | Purpose                                     |
| ------------------ | ---------- | ------------------------------------------- |
| `punctuate`        | `true`     | Automatic punctuation insertion             |
| `no_delay`         | `true`     | Minimize latency for real-time feedback     |
| `endpointing`      | `300`      | 300ms silence to detect sentence boundaries |
| `interim_results`  | `false`    | Only return final transcripts               |
| `smart_format`     | `true`     | Format numbers, dates, currencies           |
| `profanity_filter` | `false`    | Keep all words unfiltered                   |
| `diarize`          | `true`     | Enable speaker identification               |
| `filler_words`     | `false`    | Remove "um", "uh", etc.                     |
| `encoding`         | `linear16` | 16-bit PCM encoding                         |

## External Custom STT Service

Build your own transcription/diarization WebSocket service that integrates with Omi.
```mermaid
flowchart LR
    subgraph App["📱 Omi App"]
        Capture[Audio Capture]
    end
    subgraph Custom["🎧 Your STT Service"]
        WS[WebSocket Server]
    end
    subgraph Backend["🖥️ Omi Backend"]
        API["/v4/listen"]
    end
    Capture -->|Binary audio| WS
    WS -->|JSON transcripts| Capture
    Capture -->|suggested_transcript| API
```

### Your Service Receives

| Message                   | Format | Description                                                       |
| ------------------------- | ------ | ----------------------------------------------------------------- |
| Audio frames              | Binary | Raw audio bytes (codec configured by app, typically `opus` 16kHz) |
| `{"type": "CloseStream"}` | JSON   | End of audio stream                                               |

### Your Service Sends

**Format:** JSON object with `segments` array

```json
{
  "segments": [
    {
      "text": "Hello, how are you?",
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 1.5
    },
    {
      "text": "I'm doing great, thanks!",
      "speaker": "SPEAKER_01",
      "start": 1.6,
      "end": 3.2
    }
  ]
}
```

### Segment Fields

| Field     | Type     | Required | Description                                      |
| --------- | -------- | -------- | ------------------------------------------------ |
| `text`    | `string` | Yes      | Transcribed text                                 |
| `speaker` | `string` | No       | Speaker label (`SPEAKER_00`, `SPEAKER_01`, etc.) |
| `start`   | `float`  | No       | Start time in seconds                            |
| `end`     | `float`  | No       | End time in seconds                              |

### Requirements

* Response **must be an object** with a `segments` key. Raw arrays `[{...}]` will fail.
* Either omit the `type` field or set it to `"Results"`. Messages with any other `type` value are ignored.
* Connection closes after **90 seconds** of inactivity.

## Speech Profile & Speaker Embedding

When a user has a speech profile, the system uses speaker embedding comparison to identify the user's voice in real time.
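This page does not say how the two embeddings are compared. As a rough sketch only, assuming the comparison is a cosine similarity checked against a threshold, the decision could look like the code below. The metric, the `0.7` threshold, and the function names are all assumptions made for illustration, not the backend's actual implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, so it matches nothing
    return dot / (norm_a * norm_b)


def is_user_voice(user_embedding: Sequence[float],
                  speaker_embedding: Sequence[float],
                  threshold: float = 0.7) -> bool:
    """Hypothetical decision rule: similarity at or above the assumed threshold."""
    return cosine_similarity(user_embedding, speaker_embedding) >= threshold
```

A result of `True` would correspond to a segment being marked `is_user: true`. In practice the threshold has to be tuned against real speech data.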
### How It Works

```mermaid
sequenceDiagram
    participant App as 📱 Omi App
    participant Backend as 🖥️ Backend
    participant DG as 🎧 Deepgram
    participant Embed as 🧠 Embedding API

    Note over Backend: User has speech profile
    App->>Backend: Connect WebSocket
    Backend->>DG: Create single socket
    Backend->>Embed: Extract user embedding from profile WAV

    loop Audio streaming
        App->>Backend: Audio chunk
        Backend->>DG: Forward to Deepgram
        DG-->>Backend: Transcript with speaker IDs
    end

    Note over Backend: New speaker detected (2s+ audio)
    Backend->>Embed: Extract speaker embedding from audio
    Embed-->>Backend: Compare with user embedding
    Backend-->>App: Segments (is_user: true/false)
```

### Speech Profile Benefits

1. **User Identification**: Speaker embedding comparison identifies the device owner by voice biometrics
2. **No Startup Delay**: Transcription begins immediately (no profile audio prepending)
3. **Single Socket**: One Deepgram connection per session (reduced API costs)

## Transcription Flow

1. WebSocket connection accepted, user validated, STT provider selected based on language.
2. App sends binary audio chunks. Backend decodes based on codec parameter.
3. Decoded audio sent to Deepgram. Provider returns word-level transcripts with speaker IDs.
4. Words grouped into segments. Same-speaker consecutive words merged. Timing adjusted for speech profile offset.
5. JSON segments streamed back to app immediately. UI updates as user speaks.
6. Background task monitors silence. After `conversation_timeout`, conversation is processed and saved.

## Message Formats

### Incoming Messages (App → Backend)

**Format:** Binary

Raw audio bytes encoded according to the `codec` parameter. Sent continuously during recording.

```
[Binary audio chunk - varies by codec]
```

**Keep-alive:** Messages of 2 bytes or less are treated as heartbeat pings.
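A receiver applying the keep-alive rule above might separate heartbeats from audio as sketched here. The function name `classify_binary_frame` and the returned labels are hypothetical, and the sketch assumes each binary WebSocket message arrives as a single `bytes` object.

```python
# Per the keep-alive rule: messages of 2 bytes or less are heartbeat pings.
HEARTBEAT_MAX_BYTES = 2


def classify_binary_frame(data: bytes) -> str:
    """Return 'heartbeat' for keep-alive pings and 'audio' for everything else."""
    if len(data) <= HEARTBEAT_MAX_BYTES:
        return "heartbeat"
    return "audio"


# Only frames classified as audio would be passed on to the decoder.
frames = [b"\x00", b"\x01\x02", b"\x01\x02\x03\x04"]
audio_frames = [f for f in frames if classify_binary_frame(f) == "audio"]
```

Filtering heartbeats before decoding keeps tiny keep-alive payloads from being interpreted as corrupted audio chunks.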
**Format:** JSON

Assign a known person to detected speakers:

```json
{
  "type": "speaker_assigned",
  "speaker_id": 1,
  "person_id": "person-uuid-here",
  "person_name": "John",
  "segment_ids": ["seg-uuid-1", "seg-uuid-2"]
}
```

**Format:** JSON

When `custom_stt=enabled`, apps can provide their own transcripts:

```json
{
  "type": "suggested_transcript",
  "segments": [
    {
      "text": "Hello there",
      "speaker": "SPEAKER_00",
      "speaker_id": 0,
      "start": 0.0,
      "end": 1.5,
      "is_user": true,
      "person_id": "known-person-uuid-or-null"
    }
  ],
  "stt_provider": "custom-provider-name"
}
```

See [External Custom STT Service](#external-custom-stt-service) for building your own transcription service.

**Format:** JSON

For OpenGlass and visual captures:

```json
{
  "type": "image_chunk",
  "id": "temp-image-id",
  "index": 0,
  "total": 3,
  "data": "base64-encoded-chunk"
}
```

### Outgoing Messages (Backend → App)

**Format:** JSON Array

Real-time transcript segments as they're detected:

```json
[
  {
    "id": "uuid-string",
    "text": "Hello there",
    "speaker": "SPEAKER_00",
    "speaker_id": 0,
    "is_user": true,
    "person_id": null,
    "start": 0.0,
    "end": 1.5,
    "speech_profile_processed": true,
    "stt_provider": "deepgram"
  }
]
```

**Format:** JSON

Connection and service status updates:

```json
{
  "type": "service_status",
  "status": "ready",
  "status_text": "Service Ready"
}
```

**Format:** JSON

System suggests a known person for a detected speaker:

```json
{
  "type": "speaker_label_suggestion",
  "speaker_id": 1,
  "person_id": "person-uuid",
  "person_name": "John",
  "segment_id": "segment-uuid"
}
```

**Format:** JSON

Sent when conversation timeout triggers processing:

```json
{
  "type": "memory_created",
  "memory": {
    "id": "conversation-uuid",
    "structured": {
      "title": "Meeting Discussion",
      "overview": "..."
    }
  },
  "messages": []
}
```

**Format:** JSON

When translation is enabled:

```json
{
  "type": "translation",
  "segments": [
    {
      "id": "segment-uuid",
      "translations": [
        {"lang": "es", "text": "Hola ahí"}
      ]
    }
  ]
}
```

## Transcript Segment Model

Each transcript segment contains:

| Field                      | Type      | Description                                          |
| -------------------------- | --------- | ---------------------------------------------------- |
| `id`                       | `string`  | Unique UUID for the segment                          |
| `text`                     | `string`  | Transcribed text content                             |
| `speaker`                  | `string`  | Speaker label (`"SPEAKER_00"`, `"SPEAKER_01"`, etc.) |
| `speaker_id`               | `integer` | Numeric speaker ID (0, 1, 2...)                      |
| `is_user`                  | `boolean` | `true` if spoken by device owner                     |
| `person_id`                | `string?` | UUID of identified person (if matched)               |
| `start`                    | `float`   | Start time in seconds                                |
| `end`                      | `float`   | End time in seconds                                  |
| `speech_profile_processed` | `boolean` | Whether speech profile was used for identification   |
| `stt_provider`             | `string?` | Name of STT provider used                            |

## Connection Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Connecting: WebSocket request
    Connecting --> Authenticating: Connection accepted
    Authenticating --> Ready: User validated
    Authenticating --> Closed: Auth failed
    Ready --> Streaming: Audio received
    Streaming --> Streaming: More audio
    Streaming --> Processing: Silence timeout
    Processing --> Streaming: New audio
    Processing --> Closed: Session complete
    Ready --> Closed: Client disconnect
    Streaming --> Closed: Client disconnect

    note right of Processing
        Conversation saved
        LLM extracts structure
        Memories extracted
    end note
```

### Lifecycle Events

On connect:

1. WebSocket accepted
2. User authentication verified
3. Language/STT service selected
4. STT connections initialized (with retry logic)
5. Speech profile loaded in background
6. Heartbeat task started (10s interval)

While streaming:

1. Audio received and decoded
2. Sent to STT provider(s)
3. Results collected in buffers
4. Processed every 600ms
5. Segments sent to client
6. Speaker suggestions generated

On close:

1. Usage statistics recorded
2. All STT sockets closed
3. Client WebSocket closed (code 1000/1001)
4. Buffers and collections cleared

## Error Handling & Retry Logic

The system handles the following error types:

| Error Type                | Handling                                         |
| ------------------------- | ------------------------------------------------ |
| **STT Connection Failed** | Exponential backoff retry (1s → 32s, 3 attempts) |
| **Provider Error**        | Exponential backoff retry with Deepgram          |
| **Decode Error**          | Log and skip corrupted audio chunk               |
| **WebSocket Error**       | Clean close with appropriate code                |

If the STT provider fails after retries, the connection will be closed with an error message. The app should handle reconnection.

## Key File Locations

| Component            | Path                                   |
| -------------------- | -------------------------------------- |
| WebSocket Handler    | `backend/routers/transcribe.py`        |
| Deepgram Integration | `backend/utils/stt/streaming.py`       |
| Audio Decoding       | `backend/routers/transcribe.py`        |
| Speech Profile       | `backend/utils/stt/speech_profile.py`  |
| VAD (Voice Activity) | `backend/utils/stt/vad.py`             |
| Transcript Model     | `backend/models/transcript_segment.py` |

## Related Documentation

* Complete backend architecture overview
* How conversations and memories are stored
* How the AI chat system uses transcriptions
* Environment setup and configuration