Understanding the Omi Ecosystem
Omi is a multimodal AI assistant designed to understand and interact with users in a way that’s both intelligent and human-centered. The backend plays a crucial role in this by:
- Processing and analyzing data: Converting audio to text, extracting meaning, and creating structured information from user interactions.
- Storing conversations and extracting memories: Building a rich knowledge base of user conversations and extracted facts that Omi can draw upon to provide context and insights.
- Facilitating intelligent chat: Understanding user requests, retrieving relevant conversations, and generating personalized responses using an agentic AI system.
- Integrating with external services: Extending Omi’s capabilities and connecting it to other tools and platforms.
Jump to the Quick Reference table to find what you need fast.
System Architecture
Quick Reference
- Conversations DB: Firestore CRUD operations for storing and retrieving conversations
- Vector Search: Pinecone embeddings for semantic similarity search
- LLM Processing: OpenAI integrations for structure extraction and chat
- Cloud Storage: Google Cloud Storage for audio files
- Redis Cache: High-speed caching for profiles and preferences
- Transcription: Real-time speech-to-text with multiple STT services
| Need to… | Go to |
|---|---|
| Store a conversation | database/conversations.py |
| Query similar conversations | database/vector_db.py |
| Process LLM calls | utils/llm/ directory |
| Handle real-time audio | routers/transcribe.py |
| Manage caching | database/redis_db.py |
| Understand chat system | Chat System Architecture |
| Learn data models | Storing Conversations |
The Flow of Information
Let’s trace the journey of a typical interaction with Omi, focusing on how audio recordings are transformed into stored conversations.

User Initiates Recording
The user starts a recording session using the Omi app, capturing a conversation or their thoughts.
WebSocket Connection
The Omi app establishes a real-time connection with the backend at the /v4/listen endpoint in routers/transcribe.py.

Audio Streaming
The app streams audio data continuously through the WebSocket to the backend.
Deepgram Processing
The backend forwards audio to Deepgram API for real-time speech-to-text conversion.
Live Feedback
Transcription results stream back through the backend to the app, displaying words as the user speaks.
Conversation Creation
During the WebSocket connection, the backend creates an “in_progress” conversation stub in Firestore. As audio streams, transcript segments are continuously added to Firestore in real-time. When recording ends, the app sends a POST request to /v1/conversations with an empty body ({}). The backend retrieves the in-progress conversation from Firestore and processes it.

LLM Processing

The process_conversation function uses OpenAI to extract:
- Title & Overview - Summarizes the conversation
- Action Items - Tasks and to-dos mentioned
- Events - Calendar-worthy moments
- Memories - Facts about the user
Storage
- Firestore: Stores the full conversation document with transcript segments and metadata
- Pinecone: Stores the vector embedding for semantic search
- Redis: Caches frequently accessed data (speech profile durations, enabled apps, user names) for performance
- Google Cloud Storage: Stores binary files (speech profile audio, conversation recordings, photos)
What Gets Extracted
| Field | Description |
|---|---|
| title | A short, descriptive title |
| overview | Concise summary of main points |
| category | Work, personal, etc. |
| action_items | Tasks or to-dos mentioned |
| events | Calendar-worthy events |
| memories | Facts about the user (stored separately) |
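To make this output concrete, here is a minimal sketch of what the extracted structure could look like as Pydantic models. The field names follow the table above; the actual classes, enums, and defaults in the Omi codebase may differ.

```python
# Illustrative sketch only; the real models in the Omi backend may differ.
from datetime import datetime
from enum import Enum
from typing import List

from pydantic import BaseModel, Field


class Category(str, Enum):
    work = "work"
    personal = "personal"
    other = "other"


class ActionItem(BaseModel):
    description: str


class Event(BaseModel):
    title: str
    start: datetime
    duration_minutes: int = 30


class Structured(BaseModel):
    title: str = Field(description="A short, descriptive title")
    overview: str = Field(description="Concise summary of main points")
    category: Category = Category.other
    action_items: List[ActionItem] = []
    events: List[Event] = []
```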
Core Components
Now that you understand the general flow, let’s dive deeper into the key modules and services that power Omi’s backend.

1. database/conversations.py: The Conversation Guardian
This module is responsible for managing the interaction with Firebase Firestore, Omi’s main database for storing conversations and related data.
Key Functions:
- upsert_conversation: Creates or updates a conversation document in Firestore, ensuring efficient storage and handling of updates.
- get_conversation: Retrieves a specific conversation by its ID.
- get_conversations: Fetches a list of conversations for a user, allowing for filtering, pagination, and optional inclusion of discarded conversations.
- Photo Functions: Handle the storage and retrieval of photos associated with conversations.
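As a rough illustration of the underlying Firestore access pattern (not the exact code in database/conversations.py), an upsert and a fetch might look like the sketch below, assuming the google-cloud-firestore client and a users/{uid}/conversations subcollection layout:

```python
# Minimal sketch of Firestore CRUD for conversations; collection names are assumptions.
from google.cloud import firestore

db = firestore.Client()


def upsert_conversation(uid: str, conversation: dict) -> None:
    # merge=True makes this an upsert: create the document if missing, update it if present.
    doc_ref = (
        db.collection("users").document(uid)
        .collection("conversations").document(conversation["id"])
    )
    doc_ref.set(conversation, merge=True)


def get_conversation(uid: str, conversation_id: str) -> dict | None:
    snapshot = (
        db.collection("users").document(uid)
        .collection("conversations").document(conversation_id)
        .get()
    )
    return snapshot.to_dict() if snapshot.exists else None
```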
Conversation Lifecycle
Processing Triggers:
- Manual: The app sends POST /v1/conversations with an empty body to trigger immediate processing.
- Automatic: The backend automatically processes conversations after a timeout period (the conversation_timeout parameter, default 120 seconds of silence).
- Both paths use the same process_conversation() function to extract structure, memories, and embeddings.
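As an illustration of the manual path, the trigger is simply an empty POST to the endpoint. The sketch below uses requests; the backend URL and bearer-token auth header are placeholders, not confirmed details:

```python
# Illustrative client call for the manual processing trigger; auth details are assumptions.
import requests

BACKEND_URL = "https://your-backend.example.com"  # placeholder

resp = requests.post(
    f"{BACKEND_URL}/v1/conversations",
    json={},  # empty body: the backend picks up the in-progress conversation itself
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
)
resp.raise_for_status()
print(resp.json())
```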
2. database/vector_db.py: The Embedding Expert
This module manages the interaction with Pinecone, a vector database used to store and query conversation embeddings.
Key Functions:
- upsert_vector: Adds or updates a conversation embedding in Pinecone.
- upsert_vectors: Efficiently adds or updates multiple embeddings.
- query_vectors: Performs similarity search to find conversations relevant to a user query.
- delete_vector: Removes a conversation embedding.
- Contextual Retrieval: Finding conversations that are semantically related to a user’s request, even if they don’t share exact keywords.
- Efficient Search: Quickly retrieving relevant conversations from a large collection.
- Scalability: Handling the growing number of conversation embeddings as the user creates more conversations.
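To picture how these functions might wrap the Pinecone client, here is a minimal sketch; the index name, ID scheme, and metadata fields are assumptions:

```python
# Sketch of Pinecone upsert/query helpers; index name and metadata layout are assumptions.
import os

from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("omi-conversations")  # hypothetical index name


def upsert_vector(uid: str, conversation_id: str, embedding: list[float]) -> None:
    index.upsert(vectors=[{
        "id": f"{uid}-{conversation_id}",
        "values": embedding,
        "metadata": {"uid": uid, "conversation_id": conversation_id},
    }])


def query_vectors(uid: str, query_embedding: list[float], k: int = 5) -> list[str]:
    result = index.query(
        vector=query_embedding,
        top_k=k,
        filter={"uid": {"$eq": uid}},  # only search this user's conversations
        include_metadata=True,
    )
    return [match.metadata["conversation_id"] for match in result.matches]
```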
3. utils/llm/ Directory: The AI Maestro
This directory contains modules where the power of OpenAI’s LLMs is harnessed for a wide range of tasks. It’s the core of Omi’s intelligence!
Key Files:
- clients.py: LLM client configurations and embedding models
- chat.py: Chat-related prompts and processing
- conversation_processing.py: Conversation analysis and structuring
- Conversation Processing:
- Determines if a conversation should be discarded.
- Extracts structured information from transcripts (title, overview, categories, etc.).
- Runs apps on conversation data.
- External Integration Processing:
- Creates structured summaries from photos and descriptions (OpenGlass).
- Processes data from external sources to generate conversations.
- Chat and Retrieval:
- Generates initial chat messages.
- Analyzes chat conversations to determine if context is needed.
- Extracts relevant topics and dates from chat history.
- Retrieves and summarizes relevant conversation content for chat responses.
- Emotional Processing:
- Analyzes conversation transcripts for user emotions.
- Generates emotionally aware responses based on context and user facts.
- Fact Extraction: Identifies and extracts new facts (memories) about the user from conversation transcripts.
- The LLM modules leverage OpenAI’s models (GPT-4o and others) for language understanding, generation, and reasoning.
- text-embedding-3-large is used to generate vector embeddings for conversations and user queries.
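For instance, generating an embedding for a conversation (or a chat query) with the OpenAI Python client looks roughly like this:

```python
# Sketch: generating a conversation embedding with text-embedding-3-large.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return response.data[0].embedding


vector = embed_text("Discussed the Q3 roadmap and agreed to ship the beta by June.")
print(len(vector))  # 3072 dimensions for text-embedding-3-large
```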
- The Brain of Omi: These modules enable Omi’s core AI capabilities, including natural language understanding, content generation, and context-aware interactions.
- Conversation Enhancement: They enrich raw data by extracting meaning and creating structured information.
- Personalized Responses: They help Omi provide responses tailored to individual users, incorporating their unique facts, conversations, and emotional states.
- Extensibility: The app system and integration with external services make Omi highly versatile.
For detailed chat system architecture including LangGraph routing and the agentic tool system, see Chat System Architecture.
4. utils/other/storage.py: The Cloud Storage Manager
This module handles interactions with Google Cloud Storage (GCS), specifically for managing user speech profiles.
Key Functions:
- upload_profile_audio(file_path: str, uid: str):
  - Uploads a user’s speech profile audio recording to the GCS bucket specified by the BUCKET_SPEECH_PROFILES environment variable.
  - Organizes audio files within the bucket using the user’s ID (uid).
  - Returns the public URL of the uploaded file.
- get_profile_audio_if_exists(uid: str) -> str:
  - Checks if a speech profile already exists for a given user ID in the GCS bucket.
  - Downloads the speech profile audio to a local temporary file if it exists and returns the file path.
  - Returns None if the profile does not exist.
- The upload_profile_audio function is called when a user uploads a new speech profile recording through the /v3/upload-audio endpoint (defined in routers/speech_profile.py).
- The get_profile_audio_if_exists function is used to retrieve a user’s speech profile when needed, for example during speaker identification in real-time transcription or post-processing.
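A rough sketch of what these two helpers do with the google-cloud-storage client; the blob naming scheme and temporary file path are assumptions:

```python
# Sketch of the GCS speech-profile helpers; blob naming is an assumption.
import os

from google.cloud import storage

client = storage.Client()
bucket = client.bucket(os.environ["BUCKET_SPEECH_PROFILES"])


def upload_profile_audio(file_path: str, uid: str) -> str:
    blob = bucket.blob(f"{uid}/speech_profile.wav")  # organized by user ID
    blob.upload_from_filename(file_path)
    return blob.public_url


def get_profile_audio_if_exists(uid: str) -> str | None:
    blob = bucket.blob(f"{uid}/speech_profile.wav")
    if not blob.exists():
        return None
    local_path = f"/tmp/{uid}_speech_profile.wav"
    blob.download_to_filename(local_path)
    return local_path
```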
5. database/redis_db.py: The Data Speedster
Redis is optional for local development. The backend will work without it, but features like speech profiles and app preferences caching will be disabled.
The database/redis_db.py module handles Omi’s interactions with Redis, which is primarily used for caching, managing user settings, and storing user speech profile metadata.
Data Stored and Retrieved from Redis:
- User Speech Profile Metadata:
- Storage: When a user uploads a speech profile, the audio file is stored in Google Cloud Storage, while the duration is cached in Redis for quick access.
- Retrieval: During real-time transcription or post-processing, the speech profile duration is retrieved from Redis cache, while the actual audio file is loaded from Google Cloud Storage when needed for speaker identification.
- Enabled Apps:
- Storage: A set of app IDs is stored for each user, representing the apps they have enabled.
- Retrieval: When processing a conversation or handling a chat request, the backend checks Redis to see which apps are enabled for the user.
- App Reviews:
- Storage: Reviews for each app (score, review text, date) are stored in Redis, organized by app ID and user ID.
- Retrieval: When displaying app information, the backend retrieves reviews from Redis.
- Cached User Names:
- Storage: User names are cached in Redis to avoid repeated lookups from Firebase.
- Retrieval: The backend first checks Redis for a user’s name before querying Firestore, improving performance.
| Function | Purpose |
|---|---|
| set_speech_profile_duration | Cache speech profile duration (audio stored in GCS) |
| get_speech_profile_duration | Retrieve cached speech profile duration |
| set_user_has_soniox_speech_profile | Mark that user has Soniox speech profile |
| get_user_has_soniox_speech_profile | Check if user has Soniox speech profile |
| enable_app, disable_app | Manage app enable/disable states |
| get_enabled_apps | Get user’s enabled apps |
| get_app_reviews | Retrieve reviews for an app |
| cache_user_name, get_cached_user_name | Cache and retrieve user names |
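A simplified sketch of the caching pattern these functions follow, using redis-py; the key names, TTL, and connection settings are assumptions:

```python
# Sketch of the Redis caching pattern; key names, TTLs, and env vars are assumptions.
import os

import redis

r = redis.Redis(
    host=os.getenv("REDIS_DB_HOST", "localhost"),  # env var name is illustrative
    port=6379,
    decode_responses=True,
)


def cache_user_name(uid: str, name: str, ttl_seconds: int = 86400) -> None:
    r.set(f"users:{uid}:name", name, ex=ttl_seconds)


def get_cached_user_name(uid: str) -> str | None:
    return r.get(f"users:{uid}:name")


def enable_app(uid: str, app_id: str) -> None:
    r.sadd(f"users:{uid}:enabled_apps", app_id)


def get_enabled_apps(uid: str) -> set[str]:
    return r.smembers(f"users:{uid}:enabled_apps")
```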
Storage Separation: Redis is used for caching metadata (durations, flags, enabled apps) to improve performance. The actual speech profile audio files are stored in Google Cloud Storage via utils/other/storage.py. This separation ensures Redis remains fast and lightweight while GCS handles binary file storage.
- Performance: Caching data in Redis significantly improves the backend’s speed, as frequently accessed data can be retrieved from memory very quickly.
- User Data Management: Redis provides a flexible and efficient way to manage user-specific metadata, such as app preferences, speech profile durations, and enabled apps.
- Real-time Features: The low-latency nature of Redis makes it ideal for supporting real-time features like live transcription and instant app interactions.
- Scalability: As the number of users grows, Redis helps maintain performance by reducing the load on primary databases (Firestore) and storage systems (GCS).
6. routers/transcribe.py: The Real-Time Transcription Engine
This module is the powerhouse behind Omi’s real-time transcription capabilities, allowing the app to convert spoken audio into text as the user is speaking. It leverages WebSockets for bidirectional
communication with the Omi app and multiple STT services (Deepgram, Soniox, Speechmatics) for accurate and efficient transcription.
1. WebSocket Communication
- /v4/listen Endpoint: The Omi app initiates a WebSocket connection with the backend at the /v4/listen endpoint, which is defined in the websocket_endpoint function of routers/transcribe.py.
- Bidirectional Communication: WebSockets enable a two-way communication channel, allowing:
  - The Omi app to stream audio data to the backend continuously.
  - The backend to send back transcribed text segments as they become available from Deepgram.
- Real-Time Feedback: This constant back-and-forth ensures that users see their words being transcribed in real-time, creating a more interactive and engaging experience.
2. STT Service Integration
Omi supports multiple Speech-to-Text (STT) services with automatic selection based on language support:
- Deepgram (Nova-2, Nova-3 models) - Primary service for most languages
- Soniox (stt-rt-preview model) - Used for specific language combinations
- Speechmatics - Additional fallback option

The service is selected by get_stt_service_for_language() in utils/stt/streaming.py, which considers language support and the configured service priority order (set via the STT_SERVICE_MODELS environment variable).
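The selection logic can be pictured roughly as follows; this is a hypothetical simplification, not the actual implementation of get_stt_service_for_language, and the language tables and default priority string are illustrative:

```python
# Hypothetical simplification of STT service selection based on language support.
import os

# Illustrative language coverage; the real tables live in utils/stt/streaming.py.
DEEPGRAM_LANGUAGES = {"en", "es", "fr", "de", "pt"}
SONIOX_LANGUAGES = {"multi"}


def get_stt_service_for_language(language: str) -> str:
    # STT_SERVICE_MODELS defines the priority order of services to try (default assumed here).
    priority = os.getenv("STT_SERVICE_MODELS", "deepgram,soniox,speechmatics").split(",")
    for service in priority:
        if service == "deepgram" and language in DEEPGRAM_LANGUAGES:
            return "deepgram"
        if service == "soniox" and language in SONIOX_LANGUAGES:
            return "soniox"
        if service == "speechmatics":
            return "speechmatics"  # fallback option
    return "deepgram"
```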
Processing Functions:
- process_audio_dg() - Manages interaction with the Deepgram API (found in utils/stt/streaming.py)
- process_audio_soniox() - Manages interaction with the Soniox API
- process_audio_speechmatics() - Manages interaction with the Speechmatics API
The process_audio_dg function configures various Deepgram options:

| Option | Purpose |
|---|---|
| punctuate | Automatically adds punctuation to transcribed text for readability. Essential for creating natural-looking transcripts. |
| no_delay | Minimizes latency for real-time feedback. Essential for live transcription where users expect to see words appear as they speak. |
| language | Sets the language for transcription (e.g., ‘en’, ‘es’, ‘fr’). The user can specify their preferred language when starting a recording. |
| interim_results | Controls whether to send interim (partial) transcription results or only final results. Set to False in production for cleaner output. |
| diarize | Enables speaker diarization (identifying different speakers in the audio). Critical for multi-person conversations to attribute text to the correct speaker. |
| encoding & sample_rate | Audio format settings for Deepgram compatibility. These must match the audio format being sent from the app. |
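These options map onto the parameters sent to Deepgram’s streaming API. A simplified sketch of the configuration, with illustrative values for the model, encoding, and sample rate:

```python
# Illustrative Deepgram streaming configuration; values mirror the options described above.
deepgram_options = {
    "model": "nova-2",          # or nova-3, depending on language support
    "punctuate": True,          # readable, natural-looking transcripts
    "no_delay": True,           # minimize latency for live feedback
    "language": "en",           # user-selected language
    "interim_results": False,   # only final results in production
    "diarize": True,            # identify different speakers
    "encoding": "linear16",     # must match the audio the app sends (assumed value)
    "sample_rate": 16000,       # assumed value
}
```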
3. Transcription Flow
The numbered breakdown of this flow:
1. App Streams Audio: The Omi app captures audio from the user’s device and continuously sends chunks of audio data through the WebSocket to the backend’s /v4/listen endpoint.
2. Backend Receives and Selects STT Service: The backend’s websocket_endpoint function receives the audio chunks and selects the appropriate STT service (Deepgram, Soniox, or Speechmatics) based on the language using get_stt_service_for_language().
3. Backend Forwards to STT: The backend immediately forwards audio chunks to the selected STT service using the corresponding processing function (process_audio_dg, process_audio_soniox, or process_audio_speechmatics).
4. STT Processes: The STT service’s speech recognition models transcribe the audio data in real-time.
5. Results Sent Back: The STT service sends the transcribed text segments back to the backend as they become available.
6. Backend Relays to App: The backend immediately sends these transcription results back to the Omi app over the WebSocket connection.
7. App Displays Transcript: The Omi app updates the user interface with the newly transcribed text, providing instant feedback.
4. Key Considerations
- Speaker Identification: The code uses the STT service’s speaker diarization feature (available in Deepgram, Soniox, and Speechmatics) to identify different speakers in the audio. This information is included in the transcription results, allowing the app to display who said what.
- User Speech Profile Integration: If a user has uploaded a speech profile, the backend can use this information (duration cached in Redis, audio file stored in Google Cloud Storage) to improve the accuracy of speaker identification.
- Latency Management: Real-time transcription requires careful attention to latency to ensure a seamless user experience. The no_delay option in Deepgram and the efficient handling of data in the backend are essential for minimizing delays.
- Error Handling: The code includes error handling mechanisms to gracefully handle any issues that may occur during the WebSocket connection or STT service transcription process.
5. Example Code Snippet (Simplified):
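The snippet itself is not reproduced here, so the following is a minimal sketch of the idea using FastAPI; SttSession is an illustrative stand-in for the real STT integrations (process_audio_dg, process_audio_soniox, process_audio_speechmatics), not the actual code:

```python
# Simplified sketch of the /v4/listen WebSocket flow; helper names are placeholders.
from typing import Awaitable, Callable

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

router = APIRouter()


class SttSession:
    """Placeholder for a streaming STT connection (Deepgram, Soniox, or Speechmatics)."""

    def __init__(self, on_transcript: Callable[[dict], Awaitable[None]]):
        self.on_transcript = on_transcript

    async def send(self, audio_chunk: bytes) -> None:
        # The real implementation forwards the chunk to the STT service; transcript
        # segments come back asynchronously and are passed to on_transcript.
        pass

    async def close(self) -> None:
        pass


@router.websocket("/v4/listen")
async def websocket_endpoint(websocket: WebSocket, language: str = "en"):
    await websocket.accept()

    async def on_transcript(segment: dict) -> None:
        # Relay each transcript segment back to the app as soon as it arrives.
        await websocket.send_json(segment)

    stt_session = SttSession(on_transcript)  # service chosen per language in the real code

    try:
        while True:
            audio_chunk = await websocket.receive_bytes()  # audio streamed from the app
            await stt_session.send(audio_chunk)            # forward to the STT service
    except WebSocketDisconnect:
        await stt_session.close()
```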
Related Documentation
For more detailed information on specific subsystems:
- Storing Conversations & Memories: Complete data model and storage architecture for conversations and extracted memories
- Chat System Architecture: LangGraph routing, agentic tools, and vector search deep dive
- Transcription Details: Detailed transcription pipeline and Deepgram configuration
- Backend Setup: Environment setup, dependencies, and local development guide