
Realtime Transcription


📡 Audio Streaming

  1. The Omi App initiates a real-time audio stream to the backend.
  2. Audio data is sent via WebSocket to the /listen endpoint.
  3. Audio can be in Opus or Linear16 encoding, depending on device settings.
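The app-side streaming loop can be sketched as follows. This is a hypothetical client-side sketch: the chunk duration, pacing, and the WebSocket object are assumptions for illustration, not Omi's actual client code.

```python
import asyncio

CHUNK_MS = 100           # assumed chunk duration
SAMPLE_RATE = 16000      # Linear16 at 16 kHz (8 kHz is also supported)
BYTES_PER_SAMPLE = 2     # 16-bit PCM

def chunk_pcm(pcm: bytes, chunk_ms: int = CHUNK_MS) -> list[bytes]:
    """Split a Linear16 buffer into fixed-duration chunks for streaming."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]

async def stream_audio(ws, pcm: bytes) -> None:
    """Send chunks over an open WebSocket to /listen, paced at real time."""
    # `ws` is any object with an async send(); a real client would first
    # connect to wss://<backend>/listen with a WebSocket library.
    for chunk in chunk_pcm(pcm):
        await ws.send(chunk)
        await asyncio.sleep(CHUNK_MS / 1000)
```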

🔌 WebSocket Handling

/listen Endpoint

  • Located in routers/transcribe.py
  • The websocket_endpoint function sets up the connection
  • It then calls _websocket_util, which manages the connection

_websocket_util Function

  • Accepts the WebSocket connection
  • Checks for a user speech profile
    • If one exists, sends the profile audio to Deepgram first
    • Uses utils/other/storage.py to retrieve the profile from Google Cloud Storage
  • Creates asynchronous tasks:
    • receive_audio: receives audio chunks and forwards them to Deepgram
    • send_heartbeat: sends periodic messages to keep the connection alive

🔊 Deepgram Integration

process_audio_dg Function

  • Located in utils/stt/streaming.py
  • Initializes Deepgram client using DEEPGRAM_API_KEY
  • Defines on_message callback for handling transcripts
  • Starts live transcription stream with Deepgram
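The control flow above can be sketched with a stub connection object so it runs standalone. The real function opens a live stream via the Deepgram SDK, whose exact classes vary by SDK version; every name below is illustrative.

```python
import os

class StubLiveConnection:
    """Stand-in for a Deepgram live-transcription connection."""
    def __init__(self, api_key: str, options: dict):
        self.api_key = api_key
        self.options = options
        self._handlers = []

    def on_transcript(self, callback):
        self._handlers.append(callback)   # register transcript handler

    def emit(self, result):               # the real SDK invokes handlers on data
        for handler in self._handlers:
            handler(result)

def process_audio_dg(options: dict, on_message) -> StubLiveConnection:
    """Mirror the setup steps: read the key, register on_message, start."""
    api_key = os.environ.get("DEEPGRAM_API_KEY", "")  # required in production
    connection = StubLiveConnection(api_key, options)
    connection.on_transcript(on_message)
    return connection
```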

Deepgram Configuration

| Option | Value | Description |
| --- | --- | --- |
| language | Variable | Audio language |
| sample_rate | 8000 or 16000 Hz | Audio sample rate |
| codec | Opus or Linear16 | Audio codec |
| channels | Variable | Number of audio channels |
| punctuate | True | Automatic punctuation |
| no_delay | True | Low-latency transcription |
| endpointing | 100 | End-of-utterance detection (ms of silence) |
| interim_results | False | Only final transcripts sent |
| smart_format | True | Enhanced transcript formatting |
| profanity_filter | False | No profanity filtering |
| diarize | True | Speaker identification |
| filler_words | False | Filler words omitted from transcripts |
| multichannel | channels > 1 | Enabled when audio has multiple channels |
| model | 'nova-2-general' | Deepgram model selection |
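As a Python sketch, the table corresponds to an options payload roughly like the following. The helper name and exact shape are illustrative; note that Deepgram's wire parameter for the codec is `encoding`.

```python
def deepgram_options(language: str, sample_rate: int,
                     codec: str, channels: int) -> dict:
    """Assemble Deepgram live-transcription options per the table above."""
    return {
        "language": language,        # varies per user/device
        "sample_rate": sample_rate,  # 8000 or 16000 Hz
        "encoding": codec,           # "opus" or "linear16"
        "channels": channels,
        "punctuate": True,
        "no_delay": True,
        "endpointing": 100,
        "interim_results": False,
        "smart_format": True,
        "profanity_filter": False,
        "diarize": True,
        "filler_words": False,
        "multichannel": channels > 1,
        "model": "nova-2-general",
    }
```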

🔄 Transcript Processing

  1. Deepgram processes the audio and triggers the on_message callback
  2. on_message receives the raw transcript data
  3. The callback formats the transcript data:
    • Groups words into speaker segments
    • Creates a list of segment dictionaries
  4. Formatted segments are sent back to the Omi App via WebSocket
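A simplified version of the grouping step looks like this. The word dictionaries mimic Deepgram's word-level output with diarize=True; the actual callback in utils/stt/streaming.py differs in detail, and is_user/person_id are filled in later via speech-profile matching.

```python
def words_to_segments(words: list[dict]) -> list[dict]:
    """Fold consecutive words from the same speaker into one segment."""
    segments = []
    for w in words:
        speaker = f"SPEAKER_{w.get('speaker', 0):02d}"
        if segments and segments[-1]["speaker"] == speaker:
            # Same speaker: extend the current segment
            segments[-1]["text"] += " " + w["punctuated_word"]
            segments[-1]["end"] = w["end"]
        else:
            # Speaker change: start a new segment
            segments.append({
                "speaker": speaker,
                "start": w["start"],
                "end": w["end"],
                "text": w["punctuated_word"],
                "is_user": False,    # set later via speech-profile matching
                "person_id": None,
            })
    return segments
```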

Segment Dictionary Structure

| Field | Description |
| --- | --- |
| speaker | Speaker label (e.g., "SPEAKER_00") |
| start | Segment start time (seconds) |
| end | Segment end time (seconds) |
| text | Combined, punctuated text |
| is_user | Boolean indicating if the segment is from the user |
| person_id | ID of the matched person from user profiles (if applicable) |

🔑 Key Considerations

  • Real-time, low-latency transcription
  • Speaker diarization accuracy may vary
  • Audio encoding choice (Opus vs. Linear16) may affect performance
  • Deepgram model selection based on specific needs
  • Implement proper error handling in on_message
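For the last point, one common pattern is to wrap the callback so a malformed payload is logged rather than crashing the live stream. This is an illustrative sketch, not Omi's actual error handling.

```python
import logging

def safe_callback(handler):
    """Wrap a transcript handler so exceptions are logged, not raised."""
    def wrapped(result):
        try:
            return handler(result)
        except Exception:
            logging.exception("on_message failed; dropping this result")
            return None
    return wrapped
```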

This overview describes Omi's real-time transcription pipeline end to end; the same flow can be adapted when integrating an alternative transcription service.