
Overview

Omi’s transcription system provides real-time speech-to-text conversion with speaker identification, multiple language support, and seamless integration with the conversation processing pipeline.
Connect to the /v4/listen WebSocket with your Firebase user ID and start streaming audio. Transcripts arrive in real time as JSON.

WebSocket Endpoint

WebSocket connections require Firebase authentication. The uid parameter must be a valid user ID obtained through Firebase Auth.

Endpoint URL

wss://api.omi.me/v4/listen?uid={uid}&language={lang}&sample_rate={rate}&codec={codec}
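A minimal client sketch, assuming the `websockets` package and a local 16 kHz mono PCM16 file (audio.raw); the file name, chunk size, and pacing are illustrative:

```python
# Minimal sketch: stream raw PCM16 audio and print transcript segments as
# they arrive. File name, chunk size, and pacing are illustrative choices.
import asyncio
import json

import websockets

UID = "your-firebase-uid"  # obtained from Firebase Auth
URL = (
    "wss://api.omi.me/v4/listen"
    f"?uid={UID}&language=en&sample_rate=16000&codec=pcm16"
)

async def stream_audio():
    async with websockets.connect(URL) as ws:

        async def send_audio():
            with open("audio.raw", "rb") as f:
                while chunk := f.read(3200):      # ~100 ms of 16 kHz PCM16
                    await ws.send(chunk)          # binary frame
                    await asyncio.sleep(0.1)      # pace roughly in real time

        async def receive_segments():
            async for message in ws:              # backend sends JSON arrays
                for segment in json.loads(message):
                    print(segment["speaker"], segment["text"])

        await asyncio.gather(send_audio(), receive_segments())

asyncio.run(stream_audio())
```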

Query Parameters

uid (string, required)
User ID obtained from Firebase authentication. Required for all connections.

language (string, default 'en')
Language code for transcription. Supports:
  • Standard codes: 'en', 'es', 'fr', 'de', 'ja', 'zh', etc.
  • Multi-language: 'multi' for automatic language detection (uses Soniox)

sample_rate (integer, default 8000)
Audio sample rate in Hz. Common values: 8000, 16000, 44100, 48000.

codec (string, default 'pcm8')
Audio codec. Supported options:
  • pcm8 - 8-bit PCM (default)
  • pcm16 - 16-bit PCM
  • opus - Opus codec (16kHz)
  • opus_fs320 - Opus with 320 frame size
  • aac - AAC codec
  • lc3 - LC3 codec
  • lc3_fs1030 - LC3 with 1030 frame size

Channels (integer, default 1)
Number of audio channels. Use 1 for mono, 2 for stereo.

Speech profile (boolean, default true)
Enable speaker identification using the user's stored speech profile. When enabled, the system uses a dual-socket architecture for improved speaker detection.

conversation_timeout (integer, default 120, range 2-14400)
Seconds of silence before the conversation is automatically processed. After this timeout, the conversation is saved and LLM processing begins.

STT service (string, optional)
Explicitly specify the STT service. Options: deepgram, soniox, speechmatics. If not specified, the system selects a provider based on language.

Custom STT mode (string, default 'disabled')
When set to 'enabled', the backend accepts app-provided transcripts instead of using STT services. Useful for apps with their own transcription.

Source (string, optional)
Conversation source identifier. Examples: 'omi', 'openglass', 'phone'.

Audio Codecs

The system supports multiple audio codecs with automatic decoding:
| Codec | Sample Rate | Description | Use Case |
|---|---|---|---|
| pcm8 | 8kHz | 8-bit PCM | Default, low bandwidth |
| pcm16 | 16kHz | 16-bit PCM | Better quality |
| opus | 16kHz | Opus encoded | Efficient compression |
| opus_fs320 | 16kHz | Opus, 320 frame size | Alternative frame size |
| aac | Variable | AAC encoded | iOS compatibility |
| lc3 | Variable | LC3 codec | Bluetooth audio |
| lc3_fs1030 | Variable | LC3, 1030 frame size | Alternative LC3 |
All audio is internally converted to 16-bit linear PCM before being sent to STT providers.
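As a rough illustration (not the backend's actual code), widening 8-bit PCM to 16-bit linear PCM could look like the following, assuming unsigned 8-bit samples:

```python
# Illustrative only: widen unsigned 8-bit PCM samples to 16-bit linear PCM.
import numpy as np

def pcm8_to_pcm16(chunk: bytes) -> bytes:
    samples = np.frombuffer(chunk, dtype=np.uint8).astype(np.int16)
    samples = (samples - 128) << 8   # center around zero, scale to 16-bit range
    return samples.tobytes()
```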

STT Service Selection

The system automatically selects the best STT provider based on language:

Provider Capabilities

| Provider | Languages | Model | Best For |
|---|---|---|---|
| Deepgram Nova-3 | 30+ | nova-3 | Primary English, major languages |
| Deepgram Nova-2 | 40+ | nova-2-general | Broader language support |
| Soniox | 95+ | Real-time | Multi-language, auto-detection |
| Speechmatics | 50+ | Real-time | Additional coverage |
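A simplified sketch of language-based selection (the language sets below are illustrative examples, not the backend's actual routing lists):

```python
# Simplified, illustrative routing sketch; the language sets are examples,
# not the backend's actual lists.
def select_stt_provider(language: str, stt_service: str | None = None) -> str:
    if stt_service:                    # explicit override via query parameter
        return stt_service
    if language == "multi":            # automatic language detection
        return "soniox"
    if language in {"en", "es", "fr", "de", "ja"}:
        return "deepgram"              # nova-3 / nova-2 depending on language
    return "speechmatics"              # additional coverage
```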

Deepgram Configuration

When using Deepgram, the following options are configured:
| Option | Value | Purpose |
|---|---|---|
| punctuate | true | Automatic punctuation insertion |
| no_delay | true | Minimize latency for real-time feedback |
| endpointing | 300 | 300ms of silence to detect sentence boundaries |
| interim_results | false | Only return final transcripts |
| smart_format | true | Format numbers, dates, currencies |
| profanity_filter | false | Keep all words unfiltered |
| diarize | true | Enable speaker identification |
| filler_words | false | Remove "um", "uh", etc. |
| encoding | linear16 | 16-bit PCM encoding |
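Expressed as the option set a live-transcription request would carry, the configuration above looks roughly like this; the per-connection values (model, language, sample_rate, channels) shown here are examples:

```python
# Deepgram live-transcription options from the table above; per-connection
# values (model, language, sample_rate, channels) are illustrative.
DEEPGRAM_LIVE_OPTIONS = {
    "model": "nova-3",          # or nova-2-general, depending on language
    "language": "en",
    "encoding": "linear16",
    "sample_rate": 16000,
    "channels": 1,
    "punctuate": True,
    "no_delay": True,
    "endpointing": 300,         # ms of silence marking a sentence boundary
    "interim_results": False,
    "smart_format": True,
    "profanity_filter": False,
    "diarize": True,
    "filler_words": False,
}
```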

Speech Profile & Dual-Socket Architecture

When a user has a speech profile, the system uses a sophisticated dual-socket architecture for improved speaker identification.

How It Works
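A conceptual sketch of the idea, assuming the stored profile audio is prepended on one of the two STT connections so the owner maps to a known speaker slot, while live audio is fanned out to both; segment timings from the primed connection are later shifted back by the profile's duration. All names and constants below are illustrative, not the backend's actual code:

```python
# Conceptual sketch only, not the backend's actual implementation.
BYTES_PER_SECOND = 16000 * 2   # 16 kHz, 16-bit mono PCM (assumed)

async def stream_with_speech_profile(live_audio, profile_audio, stt_connect):
    primed = await stt_connect()   # receives profile audio, then live audio
    clean = await stt_connect()    # receives live audio only

    offset = len(profile_audio) / BYTES_PER_SECOND  # subtract from primed timings
    await primed.send(profile_audio)

    async for chunk in live_audio:
        await primed.send(chunk)
        await clean.send(chunk)
    return offset
```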

Speech Profile Benefits

  1. User Identification: Audio from the first ~30 seconds trains speaker recognition
  2. Speaker Attribution: System identifies which segments are from the device owner
  3. Improved Accuracy: Better speaker diarization in multi-person conversations

Transcription Flow

Connection Established

WebSocket connection accepted, user validated, STT provider selected based on language.

Audio Streaming

App sends binary audio chunks. Backend decodes based on codec parameter.

STT Processing

Decoded audio sent to Deepgram/Soniox. Provider returns word-level transcripts with speaker IDs.

Segment Creation

Words grouped into segments. Same-speaker consecutive words merged. Timing adjusted for speech profile offset.
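An illustrative sketch of that grouping step, assuming word-level results carry text, speaker, start, and end fields:

```python
# Illustrative only: merge consecutive same-speaker words into segments.
import uuid

def words_to_segments(words: list[dict]) -> list[dict]:
    segments: list[dict] = []
    for word in words:
        if segments and segments[-1]["speaker"] == word["speaker"]:
            segments[-1]["text"] += " " + word["text"]
            segments[-1]["end"] = word["end"]    # extend the running segment
        else:
            segments.append({
                "id": str(uuid.uuid4()),
                "speaker": word["speaker"],
                "text": word["text"],
                "start": word["start"],
                "end": word["end"],
            })
    return segments
```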

Real-time Delivery

JSON segments streamed back to app immediately. UI updates as user speaks.

Conversation Lifecycle

Background task monitors silence. After conversation_timeout, conversation is processed and saved.

Message Formats

Incoming Messages (App → Backend)

Format: Binary. Raw audio bytes encoded according to the codec parameter. Sent continuously during recording.
[Binary audio chunk - varies by codec]
Keep-alive: Messages of 2 bytes or less are treated as heartbeat pings.
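For example, a client that pauses recording could keep the connection open with a tiny periodic frame (a sketch; the interval is an assumption):

```python
# Sketch of a client-side keep-alive: frames of 2 bytes or less are treated
# as heartbeat pings rather than audio.
import asyncio

async def keep_alive(ws, interval: float = 10.0):
    while True:
        await ws.send(b"\x00")    # <= 2 bytes, interpreted as a ping
        await asyncio.sleep(interval)
```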

Outgoing Messages (Backend → App)

Format: JSON array. Real-time transcript segments as they're detected:
```json
[
  {
    "id": "uuid-string",
    "text": "Hello there",
    "speaker": "SPEAKER_00",
    "speaker_id": 0,
    "is_user": true,
    "person_id": null,
    "start": 0.0,
    "end": 1.5,
    "speech_profile_processed": true,
    "stt_provider": "deepgram"
  }
]
```

Transcript Segment Model

Each transcript segment contains:
| Field | Type | Description |
|---|---|---|
| id | string | Unique UUID for the segment |
| text | string | Transcribed text content |
| speaker | string | Speaker label ("SPEAKER_00", "SPEAKER_01", etc.) |
| speaker_id | integer | Numeric speaker ID (0, 1, 2, ...) |
| is_user | boolean | true if spoken by the device owner |
| person_id | string? | UUID of identified person (if matched) |
| start | float | Start time in seconds |
| end | float | End time in seconds |
| speech_profile_processed | boolean | Whether the speech profile was used for identification |
| stt_provider | string? | Name of the STT provider used |
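For client-side parsing, a convenience model mirroring this table might look like the following (illustrative; the backend's own model lives in backend/models/transcript_segment.py):

```python
# Client-side convenience model mirroring the fields above (illustrative).
from dataclasses import dataclass
from typing import Optional

@dataclass
class TranscriptSegment:
    id: str
    text: str
    speaker: str                  # "SPEAKER_00", "SPEAKER_01", ...
    speaker_id: int
    is_user: bool
    person_id: Optional[str]      # UUID of a matched person, if any
    start: float                  # seconds
    end: float                    # seconds
    speech_profile_processed: bool
    stt_provider: Optional[str]

def parse_segments(payload: list[dict]) -> list[TranscriptSegment]:
    return [TranscriptSegment(**item) for item in payload]
```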

Connection Lifecycle

Lifecycle Events

Open

  1. WebSocket accepted
  2. User authentication verified
  3. Language/STT service selected
  4. STT connections initialized (with retry logic)
  5. Speech profile loaded in background
  6. Heartbeat task started (10s interval)

Stream

  1. Audio received and decoded
  2. Sent to STT provider(s)
  3. Results collected in buffers
  4. Processed every 600ms
  5. Segments sent to client
  6. Speaker suggestions generated
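The buffering and 600ms flush in the stream phase could be sketched as follows (illustrative; assumes a FastAPI-style WebSocket with send_json):

```python
# Illustrative flush loop: drain the shared result buffer roughly every 600 ms.
import asyncio

async def flush_segments(ws, buffer: list, interval: float = 0.6):
    while True:
        await asyncio.sleep(interval)
        if buffer:
            batch, buffer[:] = list(buffer), []   # copy then clear the shared buffer
            await ws.send_json(batch)             # FastAPI WebSocket.send_json
```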

Close

  1. Usage statistics recorded
  2. All STT sockets closed
  3. Client WebSocket closed (code 1000/1001)
  4. Buffers and collections cleared

Error Handling & Retry Logic

The system includes robust error handling:
| Error Type | Handling |
|---|---|
| STT Connection Failed | Exponential backoff retry (1s → 32s, 3 attempts) |
| Provider Error | Automatic fallback to the next provider |
| Decode Error | Log and skip the corrupted audio chunk |
| WebSocket Error | Clean close with appropriate code |
If all STT providers fail after retries, the connection will be closed with an error message. The app should handle reconnection.
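A sketch of the backoff behaviour described above (the connect callable stands in for whatever opens the STT socket):

```python
# Illustrative retry helper: exponential backoff with a capped delay.
import asyncio

async def connect_with_retry(connect, attempts: int = 3,
                             base_delay: float = 1.0, max_delay: float = 32.0):
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await connect()
        except Exception:
            if attempt == attempts:
                raise              # caller falls back to another provider or closes
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
```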

Key File Locations

| Component | Path |
|---|---|
| WebSocket Handler | backend/routers/transcribe.py |
| Deepgram Integration | backend/utils/stt/streaming.py |
| Soniox Integration | backend/utils/stt/streaming.py |
| Audio Decoding | backend/routers/transcribe.py |
| Speech Profile | backend/utils/stt/speech_profile.py |
| VAD (Voice Activity) | backend/utils/stt/vad.py |
| Transcript Model | backend/models/transcript_segment.py |