Understanding the Omi Ecosystem

Omi is a multimodal AI assistant designed to understand and interact with users in a way that’s both intelligent and human-centered. The backend plays a crucial role in this by:
  • Processing and analyzing data: Converting audio to text, extracting meaning, and creating structured information from user interactions.
  • Storing conversations and extracting memories: Building a rich knowledge base of user conversations and extracted facts that Omi can draw upon to provide context and insights.
  • Facilitating intelligent chat: Understanding user requests, retrieving relevant conversations, and generating personalized responses using an agentic AI system.
  • Integrating with external services: Extending Omi’s capabilities and connecting it to other tools and platforms.
This deep dive will walk you through the core elements of Omi’s backend, providing a clear roadmap for developers and enthusiasts alike to understand its inner workings.
Jump to the Quick Reference table to find what you need fast.

System Architecture

Quick Reference

Need to…                      Go to
Store a conversation          database/conversations.py
Query similar conversations   database/vector_db.py
Process LLM calls             utils/llm/ directory
Handle real-time audio        routers/transcribe.py
Manage caching                database/redis_db.py
Understand chat system        Chat System Architecture
Learn data models             Storing Conversations

The Flow of Information

Let’s trace the journey of a typical interaction with Omi, focusing on how audio recordings are transformed into stored conversations:

User Initiates Recording

The user starts a recording session using the Omi app, capturing a conversation or their thoughts.

WebSocket Connection

The Omi app establishes a real-time connection with the backend at the /v4/listen endpoint in routers/transcribe.py.

Audio Streaming

The app streams audio data continuously through the WebSocket to the backend.

Deepgram Processing

The backend forwards the audio to the Deepgram API for real-time speech-to-text conversion.

Live Feedback

Transcription results stream back through the backend to the app, displaying words as the user speaks.

Conversation Creation

During the WebSocket connection, the backend creates an “in_progress” conversation stub in Firestore. As audio streams, transcript segments are continuously added to Firestore in real-time. When recording ends, the app sends a POST request to /v1/conversations with an empty body ({}). The backend retrieves the in-progress conversation from Firestore and processes it.
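For illustration, a minimal client-side sketch of this manual trigger might look like the following; the base URL and bearer-token auth scheme are assumptions, while the endpoint and empty body match the flow described above.
import requests  # third-party HTTP client, used purely for illustration

BACKEND_URL = "https://your-omi-backend.example.com"  # assumption: your own deployment URL

def trigger_conversation_processing(id_token: str) -> dict:
    """Ask the backend to process the user's in-progress conversation."""
    response = requests.post(
        f"{BACKEND_URL}/v1/conversations",
        headers={"Authorization": f"Bearer {id_token}"},  # Firebase ID token (assumed auth scheme)
        json={},  # empty body: the backend pulls the in-progress conversation from Firestore
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # the processed conversation document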

LLM Processing

The process_conversation function uses OpenAI to extract:
  • Title & Overview - Summarizes the conversation
  • Action Items - Tasks and to-dos mentioned
  • Events - Calendar-worthy moments
  • Memories - Facts about the user

Storage

  • Firestore: Stores the full conversation document with transcript segments and metadata
  • Pinecone: Stores the vector embedding for semantic search
  • Redis: Caches frequently accessed data (speech profile durations, enabled apps, user names) for performance
  • Google Cloud Storage: Stores binary files (speech profile audio, conversation recordings, photos)
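A rough sketch of how this fan-out might be orchestrated after processing is shown below; the module paths come from the sections that follow, but the exact function signatures are assumptions.
# Hedged sketch, not the actual implementation: fan one processed conversation
# out to Firestore (document), Pinecone (embedding), and Redis (cached metadata).
# Binary files (audio, photos) would go to Google Cloud Storage via utils/other/storage.py.
from database.conversations import upsert_conversation  # Firestore writes
from database.vector_db import upsert_vector            # Pinecone writes
from database.redis_db import cache_user_name           # Redis caching helpers

def store_processed_conversation(uid: str, conversation, embedding: list, user_name: str):
    upsert_conversation(uid, conversation.dict())    # full document: transcript + structured data
    upsert_vector(uid, conversation.id, embedding)   # vector for semantic search
    cache_user_name(uid, user_name)                  # small, hot metadata stays in Redis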

What Gets Extracted

Field          Description
title          A short, descriptive title
overview       Concise summary of main points
category       Work, personal, etc.
action_items   Tasks or to-dos mentioned
events         Calendar-worthy events
memories       Facts about the user (stored separately)
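These fields map naturally onto a Pydantic model. Below is a simplified sketch of what the structured payload might look like, assuming the fields in the table above; the real model lives in the backend’s models package and may differ.
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class ActionItem(BaseModel):
    description: str
    completed: bool = False

class Event(BaseModel):
    title: str
    start: datetime
    duration_minutes: Optional[int] = None

class Structured(BaseModel):
    title: str                      # short, descriptive title
    overview: str                   # concise summary of main points
    category: str                   # work, personal, etc.
    action_items: List[ActionItem]  # tasks or to-dos mentioned
    events: List[Event]             # calendar-worthy events
    # memories (facts about the user) are extracted and stored separately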
Use the Quick Reference table above to jump directly to the component you need. Each section below includes key functions and usage patterns.

Core Components

Now that you understand the general flow, let’s dive deeper into the key modules and services that power Omi’s backend.

1. database/conversations.py: The Conversation Guardian

This module is responsible for managing the interaction with Firebase Firestore, Omi’s main database for storing conversations and related data.
Key Functions:
  • upsert_conversation: Creates or updates a conversation document in Firestore, ensuring efficient storage and handling of updates.
  • get_conversation: Retrieves a specific conversation by its ID.
  • get_conversations: Fetches a list of conversations for a user, allowing for filtering, pagination, and optional inclusion of discarded conversations.
  • Photo Functions: Handles the storage and retrieval of photos associated with conversations.
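A hedged usage sketch of these helpers is shown below; the keyword arguments are assumptions based on the descriptions above.
# Illustrative only: how a router might call the conversation database helpers.
from database.conversations import get_conversation, get_conversations, upsert_conversation

def list_recent_conversations(uid: str):
    # Fetch a paginated list for the user, excluding discarded conversations
    return get_conversations(uid, limit=25, offset=0, include_discarded=False)

def save_conversation(uid: str, conversation_data: dict):
    # Create or update a single conversation document in Firestore
    upsert_conversation(uid, conversation_data)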
Firestore Structure: Each conversation is stored as a document in Firestore with the following fields:
from datetime import datetime
from typing import Dict, List, Optional

from pydantic import BaseModel

# ConversationSource, ConversationStatus, Structured, TranscriptSegment, Geolocation,
# ConversationPhoto, and AppResult are defined alongside this model in the models package.


class Conversation(BaseModel):
    id: str  # Unique ID
    created_at: datetime  # Creation timestamp
    started_at: Optional[datetime]
    finished_at: Optional[datetime]

    source: Optional[ConversationSource]  # omi, phone, desktop, openglass, etc.
    language: Optional[str]
    status: ConversationStatus  # in_progress, processing, completed, failed

    structured: Structured  # Contains extracted title, overview, action items, etc.
    transcript_segments: List[TranscriptSegment]
    geolocation: Optional[Geolocation]
    photos: List[ConversationPhoto]

    apps_results: List[AppResult]
    external_data: Optional[Dict]

    discarded: bool
    deleted: bool
    visibility: str  # private, shared, public

    # See StoringConversations.mdx for complete field reference

Conversation Lifecycle

Processing Triggers:
  • Manual: App sends POST /v1/conversations with empty body to trigger immediate processing
  • Automatic: Backend automatically processes conversations after a timeout period (conversation_timeout parameter, default 120 seconds of silence)
  • Both paths use the same process_conversation() function to extract structure, memories, and embeddings

2. database/vector_db.py: The Embedding Expert

This module manages the interaction with Pinecone, a vector database used to store and query conversation embeddings.
Key Functions:
  • upsert_vector: Adds or updates a conversation embedding in Pinecone.
  • upsert_vectors: Efficiently adds or updates multiple embeddings.
  • query_vectors: Performs similarity search to find conversations relevant to a user query.
  • delete_vector: Removes a conversation embedding.
Pinecone’s Role: Its specialized vector search capabilities are essential for:
  • Contextual Retrieval: Finding conversations that are semantically related to a user’s request, even if they don’t share exact keywords.
  • Efficient Search: Quickly retrieving relevant conversations from a large collection.
  • Scalability: Handling the growing number of conversation embeddings as the user creates more conversations.
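In practice, contextual retrieval means embedding the user’s query with the same model used for conversations and asking Pinecone for the nearest vectors. A hedged sketch, assuming the query_vectors signature and using the OpenAI embeddings API named later in this document:
from openai import OpenAI
from database.vector_db import query_vectors  # signature assumed from the description above

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def find_related_conversations(uid: str, query: str, top_k: int = 5):
    # 1. Embed the query with the same model used for conversation embeddings
    embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # 2. Ask Pinecone for the closest conversation vectors belonging to this user
    return query_vectors(uid, embedding, k=top_k)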

3. utils/llm/ Directory: The AI Maestro

This directory contains modules where the power of OpenAI’s LLMs is harnessed for a wide range of tasks. It’s the core of Omi’s intelligence!
Key Files:
  • clients.py: LLM client configurations and embedding models
  • chat.py: Chat-related prompts and processing
  • conversation_processing.py: Conversation analysis and structuring
Key Functionalities:
  • Conversation Processing:
    • Determines if a conversation should be discarded.
    • Extracts structured information from transcripts (title, overview, categories, etc.).
    • Runs apps on conversation data.
  • External Integration Processing:
    • Creates structured summaries from photos and descriptions (OpenGlass).
    • Processes data from external sources to generate conversations.
  • Chat and Retrieval:
    • Generates initial chat messages.
    • Analyzes chat conversations to determine if context is needed.
    • Extracts relevant topics and dates from chat history.
    • Retrieves and summarizes relevant conversation content for chat responses.
  • Emotional Processing:
    • Analyzes conversation transcripts for user emotions.
    • Generates emotionally aware responses based on context and user facts.
  • Fact Extraction: Identifies and extracts new facts (memories) about the user from conversation transcripts.
OpenAI Integration:
  • The LLM modules leverage OpenAI’s models (GPT-4o and others) for language understanding, generation, and reasoning.
  • text-embedding-3-large is used to generate vector embeddings for conversations and user queries.
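As a concrete illustration of the kind of call these modules make, here is a hedged sketch of a “should this conversation be discarded?” check using the OpenAI SDK directly. The real prompts, client setup, and output parsing live in utils/llm/; everything beyond the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def should_discard(transcript: str) -> bool:
    # Ask the model for a strict yes/no judgement on whether the transcript
    # contains anything worth keeping.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer only 'yes' or 'no'."},
            {"role": "user", "content": f"Is this transcript meaningless filler that should be discarded?\n\n{transcript}"},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")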
Why this is Essential:
  • The Brain of Omi: These modules enable Omi’s core AI capabilities, including natural language understanding, content generation, and context-aware interactions.
  • Conversation Enhancement: They enrich raw data by extracting meaning and creating structured information.
  • Personalized Responses: They help Omi provide responses tailored to individual users, incorporating their unique facts, conversations, and emotional states.
  • Extensibility: The app system and integration with external services make Omi highly versatile.
For detailed chat system architecture including LangGraph routing and the agentic tool system, see Chat System Architecture.

4. utils/other/storage.py: The Cloud Storage Manager

This module handles interactions with Google Cloud Storage (GCS), specifically for managing user speech profiles.
Key Functions:
  • upload_profile_audio(file_path: str, uid: str):
    • Uploads a user’s speech profile audio recording to the GCS bucket specified by the BUCKET_SPEECH_PROFILES environment variable.
    • Organizes audio files within the bucket using the user’s ID (uid).
    • Returns the public URL of the uploaded file.
  • get_profile_audio_if_exists(uid: str) -> str:
    • Checks if a speech profile already exists for a given user ID in the GCS bucket.
    • Downloads the speech profile audio to a local temporary file if it exists and returns the file path.
    • Returns None if the profile does not exist.
Usage:
  • The upload_profile_audio function is called when a user uploads a new speech profile recording through the /v3/upload-audio endpoint (defined in routers/speech_profile.py).
  • The get_profile_audio_if_exists function is used to retrieve a user’s speech profile when needed, for example, during speaker identification in real-time transcription or post-processing.
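A hedged sketch of what these two helpers might look like with the google-cloud-storage client; the bucket name comes from BUCKET_SPEECH_PROFILES as described above, while the object layout and temporary path are assumptions.
import os
from google.cloud import storage

# Hedged sketch; the real implementation lives in utils/other/storage.py.
_client = storage.Client()
_bucket = _client.bucket(os.environ['BUCKET_SPEECH_PROFILES'])

def upload_profile_audio(file_path: str, uid: str) -> str:
    blob = _bucket.blob(f'{uid}/speech_profile.wav')  # per-user object path (assumption)
    blob.upload_from_filename(file_path)
    return blob.public_url

def get_profile_audio_if_exists(uid: str):
    blob = _bucket.blob(f'{uid}/speech_profile.wav')
    if not blob.exists():
        return None
    local_path = f'/tmp/{uid}_speech_profile.wav'  # temporary download location (assumption)
    blob.download_to_filename(local_path)
    return local_path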

5. database/redis_db.py: The Data Speedster

Redis is optional for local development. The backend will work without it, but features such as speech profile and app preference caching will be disabled.
Redis is an in-memory data store known for its speed and efficiency. The database/redis_db.py module handles Omi’s interactions with Redis, which is primarily used for caching, managing user settings, and storing speech profile metadata.
Data Stored and Retrieved from Redis:
  • User Speech Profile Metadata:
    • Storage: When a user uploads a speech profile, the audio file is stored in Google Cloud Storage, while the duration is cached in Redis for quick access.
    • Retrieval: During real-time transcription or post-processing, the speech profile duration is retrieved from Redis cache, while the actual audio file is loaded from Google Cloud Storage when needed for speaker identification.
  • Enabled Apps:
    • Storage: A set of app IDs is stored for each user, representing the apps they have enabled.
    • Retrieval: When processing a conversation or handling a chat request, the backend checks Redis to see which apps are enabled for the user.
  • App Reviews:
    • Storage: Reviews for each app (score, review text, date) are stored in Redis, organized by app ID and user ID.
    • Retrieval: When displaying app information, the backend retrieves reviews from Redis.
  • Cached User Names:
    • Storage: User names are cached in Redis to avoid repeated lookups from Firebase.
    • Retrieval: The backend first checks Redis for a user’s name before querying Firestore, improving performance.
Key Functions:
Function                                   Purpose
set_speech_profile_duration                Cache speech profile duration (audio stored in GCS)
get_speech_profile_duration                Retrieve cached speech profile duration
set_user_has_soniox_speech_profile         Mark that user has Soniox speech profile
get_user_has_soniox_speech_profile         Check if user has Soniox speech profile
enable_app, disable_app                    Manage app enable/disable states
get_enabled_apps                           Get user’s enabled apps
get_app_reviews                            Retrieve reviews for an app
cache_user_name, get_cached_user_name      Cache and retrieve user names
Storage Separation: Redis is used for caching metadata (durations, flags, enabled apps) to improve performance. The actual speech profile audio files are stored in Google Cloud Storage via utils/other/storage.py. This separation ensures Redis remains fast and lightweight while GCS handles binary file storage.
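For example, the user-name cache can be a thin wrapper around plain Redis string keys. A hedged sketch with redis-py; the environment variable names, key format, and TTL are assumptions.
import os
import redis

# Hedged sketch; the real helpers live in database/redis_db.py.
r = redis.Redis(
    host=os.getenv('REDIS_DB_HOST', 'localhost'),
    port=int(os.getenv('REDIS_DB_PORT', '6379')),
    password=os.getenv('REDIS_DB_PASSWORD'),
    decode_responses=True,
)

def cache_user_name(uid: str, name: str, ttl_seconds: int = 86400):
    r.set(f'users:{uid}:name', name, ex=ttl_seconds)  # expire after a day (assumption)

def get_cached_user_name(uid: str):
    return r.get(f'users:{uid}:name')  # None on a cache miss -> fall back to Firestore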
Why Redis is Important:
  • Performance: Caching data in Redis significantly improves the backend’s speed, as frequently accessed data can be retrieved from memory very quickly.
  • User Data Management: Redis provides a flexible and efficient way to manage user-specific metadata, such as app preferences, speech profile durations, and enabled apps.
  • Real-time Features: The low-latency nature of Redis makes it ideal for supporting real-time features like live transcription and instant app interactions.
  • Scalability: As the number of users grows, Redis helps maintain performance by reducing the load on primary databases (Firestore) and storage systems (GCS).

6. routers/transcribe.py: The Real-Time Transcription Engine

This module is the powerhouse behind Omi’s real-time transcription capabilities, allowing the app to convert spoken audio into text as the user is speaking. It leverages WebSockets for bidirectional communication with the Omi app and multiple STT services (Deepgram, Soniox, Speechmatics) for accurate and efficient transcription.

1. WebSocket Communication

WebSocket connections require proper authentication. The uid parameter must be a valid Firebase user ID, and connections without proper auth will be rejected.
  • /v4/listen Endpoint: The Omi app initiates a WebSocket connection with the backend at the /v4/listen endpoint, which is defined in the websocket_endpoint function of routers/transcribe.py.
  • Bidirectional Communication: WebSockets enable a two-way communication channel, allowing:
    • The Omi app to stream audio data to the backend continuously.
    • The backend to send back transcribed text segments as they become available from the STT service.
  • Real-Time Feedback: This constant back-and-forth ensures that users see their words being transcribed in real-time, creating a more interactive and engaging experience.

2. STT Service Integration

Omi supports multiple Speech-to-Text (STT) services with automatic selection based on language support:
  • Deepgram (Nova-2, Nova-3 models) - Primary service for most languages
  • Soniox (stt-rt-preview model) - Used for specific language combinations
  • Speechmatics - Additional fallback option
The backend automatically selects the appropriate STT service using get_stt_service_for_language() in utils/stt/streaming.py, which considers language support and the configured service priority order (set via the STT_SERVICE_MODELS environment variable).
Processing Functions:
  • process_audio_dg() - Manages interaction with Deepgram API (found in utils/stt/streaming.py)
  • process_audio_soniox() - Manages interaction with Soniox API
  • process_audio_speechmatics() - Manages interaction with Speechmatics API
The audio chunks streamed from the Omi app are sent to the selected STT service API for transcription. The service’s speech recognition models process the audio and return text results in real-time.
Deepgram Options Configuration: The process_audio_dg function configures several Deepgram options (a configuration sketch follows this list):
  • punctuate: Automatically adds punctuation to transcribed text for readability. Essential for creating natural-looking transcripts.
  • no_delay: Minimizes latency for real-time feedback. This is essential for live transcription where users expect to see words appear as they speak.
  • language: Sets the language for transcription (e.g., ‘en’, ‘es’, ‘fr’). The user can specify their preferred language when starting a recording.
  • interim_results: Controls whether to send interim (partial) transcription results or only final results. Set to False in production for cleaner output.
  • diarize: Enables speaker diarization, identifying different speakers in the audio. Critical for multi-person conversations to attribute text to the correct speaker.
  • encoding / sample_rate / channels: Audio format settings for Deepgram compatibility. These must match the audio format being sent from the app.
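A hedged sketch of how these options might be assembled with the Deepgram Python SDK; the model name and audio format values are assumptions, while the option names mirror the list above.
from deepgram import LiveOptions

# Hedged sketch; the real configuration is built inside process_audio_dg()
# in utils/stt/streaming.py.
options = LiveOptions(
    model='nova-2',          # assumption: Nova-2/Nova-3 models are named above
    language='en',           # user-selected language
    punctuate=True,          # readable, natural-looking transcripts
    no_delay=True,           # minimize latency for live feedback
    interim_results=False,   # only final results in production
    diarize=True,            # attribute text to the correct speaker
    encoding='linear16',     # assumption: must match the app's audio format
    sample_rate=16000,
    channels=1,
)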

3. Transcription Flow

Here is a numbered breakdown of the flow:
  1. App Streams Audio: The Omi app captures audio from the user’s device and continuously sends chunks of audio data through the WebSocket to the backend’s /v4/listen endpoint.
  2. Backend Receives and Selects STT Service: The backend’s websocket_endpoint function receives the audio chunks and selects the appropriate STT service (Deepgram, Soniox, or Speechmatics) based on the language using get_stt_service_for_language().
  3. Backend Forwards to STT: The backend immediately forwards audio chunks to the selected STT service using the corresponding processing function (process_audio_dg, process_audio_soniox, or process_audio_speechmatics).
  4. STT Processes: The STT service’s speech recognition models transcribe the audio data in real-time.
  5. Results Sent Back: The STT service sends the transcribed text segments back to the backend as they become available.
  6. Backend Relays to App: The backend immediately sends these transcription results back to the Omi app over the WebSocket connection.
  7. App Displays Transcript: The Omi app updates the user interface with the newly transcribed text, providing instant feedback.
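Steps 5 and 6 are typically implemented as a callback on the STT socket that forwards each batch of transcript segments straight to the app. A minimal hedged sketch; the payload shape is an assumption.
import json
from fastapi import WebSocket

# Hedged sketch of steps 5-6: relay STT results to the app as they arrive.
async def relay_transcript_segments(websocket: WebSocket, segments: list):
    # Each segment might carry text, a speaker label, and timing information.
    await websocket.send_text(json.dumps(segments))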

4. Key Considerations

  • Speaker Identification: The code uses the STT service’s speaker diarization feature (available in Deepgram, Soniox, and Speechmatics) to identify different speakers in the audio. This information is included in the transcription results, allowing the app to display who said what.
  • User Speech Profile Integration: If a user has uploaded a speech profile, the backend can use this information (duration cached in Redis, audio file stored in Google Cloud Storage) to improve the accuracy of speaker identification.
  • Latency Management: Real-time transcription requires careful attention to latency to ensure a seamless user experience. The no_delay option in Deepgram and the efficient handling of data in the backend are essential for minimizing delays.
  • Error Handling: The code includes error handling mechanisms to gracefully handle any issues that may occur during the WebSocket connection or STT service transcription process.

5. Example Code Snippet (Simplified):

from fastapi import APIRouter, WebSocket

from utils.stt.streaming import process_audio_dg  # Deepgram streaming helper

# ... other imports ...

router = APIRouter()


@router.websocket("/v4/listen")
async def websocket_endpoint(websocket: WebSocket, uid: str, language: str = 'en'):
    # ... other query parameters omitted for brevity ...
    await websocket.accept()  # Accept the WebSocket connection

    # Start STT transcription (service selected automatically based on language)
    transcript_socket = await process_audio_dg(uid, websocket, language)
    # Note: Could also be process_audio_soniox() or process_audio_speechmatics()
    # depending on language support

    # Receive audio chunks from the app and forward them to the STT socket
    async for data in websocket.iter_bytes():
        transcript_socket.send(data)
        # ... other logic for speaker identification, error handling, etc. ...