Understanding the Omi Ecosystem 🗺️
Omi is a multimodal AI assistant designed to understand and interact with users in a way that’s both intelligent and human-centered. The backend plays a crucial role in this by:
- Processing and analyzing data: Converting audio to text, extracting meaning, and creating structured information from user interactions.
- Storing and managing memories: Building a rich knowledge base of user experiences that Omi can draw upon to provide context and insights.
- Facilitating intelligent conversations: Understanding user requests, retrieving relevant information, and generating personalized responses.
- Integrating with external services: Extending Omi’s capabilities and connecting it to other tools and platforms.
System Architecture

The Flow of Information: From User Interaction to Memory 🌊
Let’s trace the journey of a typical interaction with Omi, focusing on how audio recordings are transformed into lasting memories:
A. User Initiates a Recording 🎤
- Recording Audio: The user starts a recording session using the Omi app, capturing a conversation or their thoughts.
B. Real-Time Transcription with Deepgram 🎧
- WebSocket Connection: The Omi app establishes a real-time connection with the backend using WebSockets (at the /listen endpoint in routers/transcribe.py).
- Streaming Audio: The app streams audio data continuously through the WebSocket to the backend.
- Deepgram Processing: The backend forwards the audio data to the Deepgram API for real-time speech-to-text conversion.
- Transcription Results: As Deepgram transcribes the audio, it sends results back to the backend.
- Live Feedback: The backend relays these transcription results back to the Omi app, allowing for live transcription display as the user is speaking.
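To make this flow concrete, here is a hypothetical client-side sketch of streaming audio to the /listen endpoint with the Python websockets library; the query parameters, host, and message format are assumptions, not the Omi app’s actual protocol.

```python
# Hypothetical client-side sketch; query parameters and message format are
# assumptions, not the Omi app's actual protocol.
import asyncio
import websockets


async def stream_audio(audio_chunks, uri):
    async with websockets.connect(uri) as ws:

        async def send_audio():
            for chunk in audio_chunks:        # raw audio frames from the microphone
                await ws.send(chunk)

        async def receive_transcripts():
            async for message in ws:          # transcript segments pushed by the backend
                print("live transcript:", message)

        await asyncio.gather(send_audio(), receive_transcripts())


# Example (placeholder host and parameters):
# asyncio.run(stream_audio(chunks, "wss://<backend-host>/listen?uid=<uid>&language=en"))
```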
C. Creating a Lasting Memory 💾
- API Request to /v1/memories: When the conversation session ends, the Omi app sends a POST request to the /v1/memories endpoint in routers/memories.py.
- Data Formatting: The request includes information about the start and end time of the recording, the language, optional geolocation data, and the transcribed text segments from Deepgram.
- Memory Creation (routers/memories.py): The create_memory function in this file receives the request and performs basic validation on the data.
- Processing the Memory (utils/memories/process_memory.py):
  - The create_memory function delegates the core memory processing logic to the process_memory function. This function is where the real magic happens!
  - Structure Extraction: OpenAI’s powerful large language model (LLM) is used to analyze the transcript and extract key information, creating a structured representation of the memory (a rough sketch of this step follows the list below). This includes:
    - title: A short, descriptive title.
    - overview: A concise summary of the main points.
    - category: A relevant category to organize memories (work, personal, etc.).
    - action_items: Any tasks or to-dos mentioned.
    - events: Events that might need to be added to a calendar.
  - Embedding Generation: The LLM is also used to create a vector embedding of the memory, capturing its semantic meaning for later retrieval.
  - App Execution: If the user has enabled any apps, relevant apps are run to enrich the memory with additional insights, external actions, or other context-specific information.
  - Storage in Firestore: The fully processed memory, including the transcript, structured data, app results, and other metadata, is stored in Firebase Firestore (a NoSQL database) for persistence.
  - Embedding Storage in Pinecone: The memory embedding is sent to Pinecone, a vector database, to enable fast and efficient similarity searches later.
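As a rough illustration of the structure-extraction step referenced above, here is a hypothetical sketch using a Pydantic schema with LangChain’s ChatOpenAI; the actual prompts and schema in process_memory and utils/llm.py may differ.

```python
# Hypothetical sketch; the real prompt and schema in utils/memories/process_memory.py
# and utils/llm.py may differ.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Structured(BaseModel):
    title: str = Field(description="A short, descriptive title")
    overview: str = Field(description="A concise summary of the main points")
    category: str = Field(description="e.g. work, personal")
    action_items: list[str] = Field(default_factory=list)
    events: list[str] = Field(default_factory=list)


def extract_structure(transcript: str) -> Structured:
    # with_structured_output constrains the model to return data matching the schema
    llm = ChatOpenAI(model="gpt-4o").with_structured_output(Structured)
    return llm.invoke(
        "Extract a title, overview, category, action items, and calendar events "
        f"from this conversation transcript:\n\n{transcript}"
    )
```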
D. Enhancing the Memory (Optional)
- Post-Processing: The user can optionally trigger post-processing of the memory to improve the quality of the transcript. This involves:
- Sending the audio to a more accurate transcription service (like WhisperX through a FAL.ai function).
- Updating the memory in Firestore with the new transcript.
- Re-generating the embedding to reflect the updated content.
The Core Components: A Closer Look 🔎
Now that you understand the general flow, let’s dive deeper into the key modules and services that power Omi’s backend.
1. database/memories.py: The Memory Guardian 🛡️
This module is responsible for managing the interaction with Firebase Firestore, Omi’s main database for storing memories and related data.
Key Functions:
- upsert_memory: Creates or updates a memory document in Firestore, ensuring efficient storage and handling of updates.
- get_memory: Retrieves a specific memory by its ID.
- get_memories: Fetches a list of memories for a user, allowing for filtering, pagination, and optional inclusion of discarded memories.
- OpenGlass Functions: Handles the storage and retrieval of photos associated with memories created through OpenGlass.
- Post-Processing Functions: Manages the storage of data related to transcript post-processing (status, model used, alternative transcription segments).
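As a rough illustration, an upsert against Firestore with the google-cloud-firestore client might look like the sketch below; the collection layout (a memories subcollection under each user document) is an assumption, not necessarily the actual schema used by database/memories.py.

```python
# Hypothetical sketch; the collection layout (users/{uid}/memories/{memory_id})
# is an assumption, not necessarily the actual schema.
from google.cloud import firestore

db = firestore.Client()


def upsert_memory(uid: str, memory_id: str, memory_data: dict) -> None:
    # set(..., merge=True) creates the document or merges fields into an existing one
    (
        db.collection("users").document(uid)
        .collection("memories").document(memory_id)
        .set(memory_data, merge=True)
    )


def get_memory(uid: str, memory_id: str) -> dict | None:
    snapshot = (
        db.collection("users").document(uid)
        .collection("memories").document(memory_id)
        .get()
    )
    return snapshot.to_dict() if snapshot.exists else None
```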
2. database/vector_db.py: The Embedding Expert 🌲
This module manages the interaction with Pinecone, a vector database used to store and query memory embeddings.
Key Functions:
- upsert_vector: Adds or updates a memory embedding in Pinecone.
- upsert_vectors: Efficiently adds or updates multiple embeddings.
- query_vectors: Performs similarity search to find memories relevant to a user query.
- delete_vector: Removes a memory embedding.
Why Pinecone is Important:
- Contextual Retrieval: Finding memories that are semantically related to a user’s request, even if they don’t share exact keywords.
- Efficient Search: Quickly retrieving relevant memories from a large collection.
- Scalability: Handling the growing number of memory embeddings as the user creates more memories.
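For illustration, a similarity query with the Pinecone client might look roughly like this; the index name and the uid metadata filter are assumptions, not taken from the code.

```python
# Hypothetical sketch; the index name and the uid metadata filter are assumptions.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("memories")


def query_vectors(uid: str, query_embedding: list[float], k: int = 5) -> list[str]:
    # Restrict the similarity search to this user's memories and return matching IDs
    result = index.query(
        vector=query_embedding,
        top_k=k,
        filter={"uid": {"$eq": uid}},
        include_metadata=False,
    )
    return [match.id for match in result.matches]
```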
3. utils/llm.py: The AI Maestro 🧠
This module is where the power of OpenAI’s LLMs is harnessed for a wide range of tasks. It’s the core of Omi’s intelligence!
Key Functionalities:
- Memory Processing:
- Determines if a conversation should be discarded.
- Extracts structured information from transcripts (title, overview, categories, etc.).
- Runs apps on memory data.
- Handles post-processing of transcripts to improve accuracy.
- OpenGlass and External Integration Processing:
- Creates structured summaries from photos and descriptions (OpenGlass).
- Processes data from external sources (like ScreenPipe) to generate memories.
- Chat and Retrieval:
- Generates initial chat messages.
- Analyzes chat conversations to determine if context is needed.
- Extracts relevant topics and dates from chat history.
- Retrieves and summarizes relevant memory content for chat responses.
- Emotional Processing:
- Analyzes conversation transcripts for user emotions.
- Generates emotionally aware responses based on context and user facts.
- Fact Extraction: Identifies and extracts new facts about the user from conversation transcripts.
- llm.py leverages OpenAI models through LangChain’s ChatOpenAI class (specifically gpt-4o in the code, but you can use other models) for language understanding, generation, and reasoning.
- It uses the OpenAIEmbeddings model to generate vector embeddings for memories and user queries.
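For instance, generating an embedding for a processed memory might look roughly like this; the specific embedding model name is an assumption, not taken from the code.

```python
# Hypothetical sketch; the embedding model name is an assumption.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def embed_memory(memory_text: str) -> list[float]:
    # Produces the dense vector that later gets upserted into Pinecone
    return embeddings.embed_query(memory_text)
```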
Why llm.py is Essential:
- The Brain of Omi: This module enables Omi’s core AI capabilities, including natural language understanding, content generation, and context-aware interactions.
- Memory Enhancement: It enriches raw data by extracting meaning and creating structured information.
- Personalized Responses: It helps Omi provide responses that are tailored to individual users, incorporating their unique facts, memories, and even emotional states.
- Extensibility: The app system and integration with external services make Omi highly versatile.
4. utils/other/storage.py: The Cloud Storage Manager ☁️
This module handles interactions with Google Cloud Storage (GCS), specifically for managing user speech profiles.
Key Functions:
- upload_profile_audio(file_path: str, uid: str):
  - Uploads a user’s speech profile audio recording to the GCS bucket specified by the BUCKET_SPEECH_PROFILES environment variable.
  - Organizes audio files within the bucket using the user’s ID (uid).
  - Returns the public URL of the uploaded file.
- get_profile_audio_if_exists(uid: str) -> str:
  - Checks if a speech profile already exists for a given user ID in the GCS bucket.
  - Downloads the speech profile audio to a local temporary file if it exists and returns the file path.
  - Returns None if the profile does not exist.
- The upload_profile_audio function is called when a user uploads a new speech profile recording through the /v3/upload-audio endpoint (defined in routers/speech_profile.py).
- The get_profile_audio_if_exists function is used to retrieve a user’s speech profile when needed, for example, during speaker identification in real-time transcription or post-processing.
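A rough sketch of the upload path using the google-cloud-storage client follows; the object naming scheme inside the bucket is an assumption.

```python
# Hypothetical sketch; the object naming scheme inside the bucket is an assumption.
import os
from google.cloud import storage

storage_client = storage.Client()
speech_profiles_bucket = os.environ["BUCKET_SPEECH_PROFILES"]


def upload_profile_audio(file_path: str, uid: str) -> str:
    bucket = storage_client.bucket(speech_profiles_bucket)
    blob = bucket.blob(f"{uid}/speech_profile.wav")  # organized by user ID
    blob.upload_from_filename(file_path)
    return blob.public_url  # public URL of the uploaded file
```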
5. database/redis_db.py: The Data Speedster 🚀
Redis is an in-memory data store known for its speed and efficiency. The database/redis_db.py module handles Omi’s interactions with Redis, which is primarily used for caching, managing user settings, and storing user speech profiles.
Data Stored and Retrieved from Redis:
- User Speech Profiles:
- Storage: When a user uploads a speech profile, the raw audio data, along with its duration, is stored in Redis.
- Retrieval: During real-time transcription or post-processing, the user’s speech profile is retrieved from Redis to aid in speaker identification.
- Enabled Apps:
- Storage: A set of app IDs is stored for each user, representing the apps they have enabled.
- Retrieval: When processing a memory or handling a chat request, the backend checks Redis to see which apps are enabled for the user.
- App Reviews:
- Storage: Reviews for each app (score, review text, date) are stored in Redis, organized by app ID and user ID.
- Retrieval: When displaying app information, the backend retrieves reviews from Redis.
- Cached User Names:
- Storage: User names are cached in Redis to avoid repeated lookups from Firebase.
- Retrieval: The backend first checks Redis for a user’s name before querying Firestore, improving performance.
Key Functions:
- store_user_speech_profile, get_user_speech_profile: For storing and retrieving speech profiles.
- store_user_speech_profile_duration, get_user_speech_profile_duration: For managing speech profile durations.
- enable_app, disable_app, get_enabled_apps: For handling app enable/disable states.
- get_app_reviews: Retrieves reviews for an app.
- cache_user_name, get_cached_user_name: For caching user names.
Why Redis is Important:
- Performance: Caching data in Redis significantly improves the backend’s speed, as frequently accessed data can be retrieved from memory very quickly.
- User Data Management: Redis provides a flexible and efficient way to manage user-specific data, such as app preferences and speech profiles.
- Real-time Features: The low-latency nature of Redis makes it ideal for supporting real-time features like live transcription and instant app interactions.
- Scalability: As the number of users grows, Redis helps maintain performance by reducing the load on primary databases.
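As a small illustration, the user-name caching pattern with redis-py might look like the sketch below; the key names and the 24-hour TTL are assumptions.

```python
# Hypothetical sketch; key names and the 24-hour TTL are assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def cache_user_name(uid: str, name: str, ttl_seconds: int = 86400) -> None:
    r.set(f"users:{uid}:name", name, ex=ttl_seconds)


def get_cached_user_name(uid: str) -> str | None:
    # Returns None on a cache miss, in which case the caller falls back to Firestore
    return r.get(f"users:{uid}:name")
```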
6. routers/transcribe.py: The Real-Time Transcription Engine 🎙️
This module is the powerhouse behind Omi’s real-time transcription capabilities, allowing the app to convert spoken audio into text as the user is speaking. It leverages WebSockets for bidirectional
communication with the Omi app and Deepgram’s speech-to-text API for accurate and efficient transcription.
1. WebSocket Communication: The Lifeline of Real-Time Interactions 🔌
- /listen Endpoint: The Omi app initiates a WebSocket connection with the backend at the /listen endpoint, which is defined in the websocket_endpoint function of routers/transcribe.py.
- Bidirectional Communication: WebSockets enable a two-way communication channel, allowing:
- The Omi app to stream audio data to the backend continuously.
- The backend to send back transcribed text segments as they become available from Deepgram.
- Real-Time Feedback: This constant back-and-forth ensures that users see their words being transcribed in real-time, creating a more interactive and engaging experience.
2. Deepgram Integration: Converting Speech to Text with Precision 🎧➡️📝
- process_audio_dg Function: The process_audio_dg function (found in utils/stt/streaming.py) manages the interaction with Deepgram.
- Deepgram API: The audio chunks streamed from the Omi app are sent to the Deepgram API for transcription. Deepgram’s sophisticated speech recognition models process the audio and return text results.
- Options Configuration: The process_audio_dg function configures various Deepgram options, including:
  - punctuate: Automatically adds punctuation to the transcribed text.
  - no_delay: Minimizes latency for real-time feedback.
  - language: Sets the language for transcription.
  - interim_results: (Set to False in the code) Controls whether to send interim (partial) transcription results or only final results.
  - diarize: Enables speaker diarization (identifying different speakers in the audio).
  - encoding, sample_rate: Sets audio encoding and sample rate for compatibility with Deepgram.
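For illustration, these options roughly correspond to a Deepgram LiveOptions object like the following sketch; exact values such as the encoding and sample rate are assumptions.

```python
# Hypothetical sketch using the deepgram-sdk LiveOptions class; exact values
# (encoding, sample rate) are assumptions.
from deepgram import LiveOptions

options = LiveOptions(
    punctuate=True,         # add punctuation to the transcript
    no_delay=True,          # minimize latency for real-time feedback
    language="en",
    interim_results=False,  # only final results, as in the code
    diarize=True,           # identify different speakers
    encoding="linear16",
    sample_rate=16000,
)
```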
3. Transcription Flow: A Step-by-Step Breakdown 🌊
- App Streams Audio: The Omi app captures audio from the user’s device and continuously sends chunks of audio data through the WebSocket to the backend’s
/listen
endpoint. - Backend Receives and Forwards: The backend’s
websocket_endpoint
function receives the audio chunks and immediately forwards them to Deepgram using theprocess_audio_dg
function. - Deepgram Processes: Deepgram’s speech recognition models transcribe the audio data in real-time.
- Results Sent Back: Deepgram sends the transcribed text segments back to the backend as they become available.
- Backend Relays to App: The backend immediately sends these transcription results back to the Omi app over the WebSocket connection.
- App Displays Transcript: The Omi app updates the user interface with the newly transcribed text, providing instant feedback.
4. Key Considerations
- Speaker Identification: The code uses Deepgram’s speaker diarization feature to identify different speakers in the audio. This information is included in the transcription results, allowing the app to display who said what.
- User Speech Profile Integration: If a user has uploaded a speech profile, the backend can use this information (retrieved from Redis or Google Cloud Storage) to improve the accuracy of speaker identification.
- Latency Management: Real-time transcription requires careful attention to latency to ensure a seamless user experience. The no_delay option in Deepgram and the efficient handling of data in the backend are essential for minimizing delays.
- Error Handling: The code includes error handling mechanisms to gracefully handle any issues that may occur during the WebSocket connection or Deepgram transcription process.
5. Example Code Snippet (Simplified):
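Below is a hypothetical, heavily simplified sketch of the flow: the real websocket_endpoint in routers/transcribe.py also handles speech profiles, session state, and error handling, and the Deepgram helper here is only a stand-in for process_audio_dg, whose actual interface may differ.

```python
# Hypothetical, heavily simplified sketch; the real websocket_endpoint in
# routers/transcribe.py also handles speech profiles, session state, and errors.
import asyncio

from fastapi import APIRouter, WebSocket

router = APIRouter()


async def start_deepgram_stream(on_segment, language: str):
    """Stand-in for process_audio_dg in utils/stt/streaming.py (assumed interface).

    Returns an object with a .send(bytes) method and pushes transcribed segments
    to the on_segment callback as Deepgram produces them.
    """
    raise NotImplementedError  # replaced by the real Deepgram integration


@router.websocket("/listen")
async def websocket_endpoint(websocket: WebSocket, uid: str, language: str = "en"):
    await websocket.accept()
    segments: asyncio.Queue = asyncio.Queue()

    # Deepgram pushes transcribed segments into the queue as they become available
    dg_socket = await start_deepgram_stream(segments.put, language)

    async def forward_audio():
        while True:
            chunk = await websocket.receive_bytes()  # audio streamed from the Omi app
            dg_socket.send(chunk)                    # forward to Deepgram

    async def relay_transcripts():
        while True:
            segment = await segments.get()
            await websocket.send_json(segment)       # live feedback to the app

    await asyncio.gather(forward_audio(), relay_transcripts())
```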
Other Important Components 🧩
- routers/transcribe.py: Manages real-time audio transcription using Deepgram, sending the transcribed text back to the Omi app for display.
- routers/workflow.py, routers/screenpipe.py: Define API endpoints for external integrations to trigger memory creation.
Contributing 🤝
We welcome contributions from the open source community! Whether it’s improving documentation, adding new features, or reporting bugs, your input is valuable. Check out our Contribution Guide for more information.
Support 🆘
If you’re stuck, have questions, or just want to chat about Omi:
- GitHub Issues: 🐛 For bug reports and feature requests
- Community Forum: 💬 Join our community forum for discussions and questions
- Documentation: 📚 Check out our full documentation for in-depth guides