Overview

Omi allows you to stream audio bytes from your DevKit directly to your backend or any external service. This enables custom audio processing like:

  • Custom Speech Recognition: use your own ASR models instead of Omi's default transcription
  • Voice Activity Detection: implement custom VAD logic for specialized use cases
  • Audio Analysis: extract features, spectrograms, or embeddings in real time
  • Cloud Storage: store raw audio for later processing or compliance

Technical Specifications

  • HTTP Method: POST
  • Content-Type: application/octet-stream
  • Audio Format: Raw PCM16 (16-bit signed, little-endian)
  • Bytes per Sample: 2
  • Sample Rate: 16,000 Hz (DevKit1 v1.0.4+, DevKit2) or 8,000 Hz (DevKit1 v1.0.2)
  • Channels: Mono (1 channel)
The sample rate is passed as a query parameter so your endpoint can handle different device versions.
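As a quick sanity check on these numbers, you can work out the duration of a received payload directly from its byte count. This is a small illustrative snippet, not part of any Omi SDK:

def chunk_duration_seconds(audio_bytes: bytes, sample_rate: int) -> float:
    """PCM16 mono is 2 bytes per sample, so duration = bytes / (2 * sample_rate)."""
    return len(audio_bytes) / (2 * sample_rate)

# A 10-second chunk at 16,000 Hz should be 16000 * 2 * 10 = 320,000 bytes.
print(chunk_duration_seconds(b"\x00" * 320_000, 16000))  # -> 10.0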

Setup Guide

Create Your Endpoint

Create a webhook that accepts POST requests with binary audio data. Request format:
POST /your-endpoint?sample_rate=16000&uid=user123
Content-Type: application/octet-stream
Body: [raw PCM16 audio bytes]
Your endpoint should:
  • Accept application/octet-stream content type
  • Read sample_rate and uid from query parameters
  • Process the raw bytes (buffer, save, or analyze)
  • Return 200 OK quickly to avoid timeouts
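Here is a minimal sketch of such an endpoint using FastAPI (the framework is an assumption, chosen because the later examples in this guide use the same @app.post style). Heavy processing should happen off the request path so the handler can return quickly:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/your-endpoint")
async def receive_audio(request: Request, sample_rate: int, uid: str):
    # The raw PCM16 bytes arrive as the request body
    audio_bytes = await request.body()

    # TODO: buffer, save, or analyze the bytes here (ideally asynchronously)
    print(f"Received {len(audio_bytes)} bytes from {uid} at {sample_rate} Hz")

    # Return quickly to avoid timeouts
    return {"status": "ok"}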

Configure in Omi App

  1. Open the Omi App
  2. Go to Settings → Developer Mode
  3. Scroll to Realtime audio bytes
  4. Enter your webhook URL
  5. Set the Every x seconds field (e.g., 10 for 10-second chunks)

Test Your Integration

Start speaking while wearing your Omi device. Audio bytes should arrive at your webhook at the configured interval.
Use webhook.site to verify data is arriving before implementing your processing logic.
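You can also exercise your endpoint before the device does by posting a chunk of silence yourself. A small sketch using the requests library (the URL and uid are placeholders):

import requests

# One second of PCM16 silence at 16 kHz: 16,000 samples * 2 bytes
silence = b"\x00" * 16000 * 2

resp = requests.post(
    "https://your-endpoint.example.com/audio",
    params={"sample_rate": 16000, "uid": "test-user"},
    headers={"Content-Type": "application/octet-stream"},
    data=silence,
)
print(resp.status_code, resp.text)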

Working with Audio Bytes

Converting to WAV

The received bytes are raw PCM16 audio. To create a playable WAV file, prepend a WAV header:
import wave
import io

def create_wav(audio_bytes: bytes, sample_rate: int) -> bytes:
    """Convert raw PCM16 bytes to WAV format."""
    buffer = io.BytesIO()

    with wave.open(buffer, 'wb') as wav_file:
        wav_file.setnchannels(1)  # Mono
        wav_file.setsampwidth(2)  # 16-bit = 2 bytes
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(audio_bytes)

    buffer.seek(0)
    return buffer.read()
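As a quick usage check (chunk here is a placeholder for one received payload), the result can be written straight to disk and played back:

wav_bytes = create_wav(chunk, sample_rate=16000)
with open("chunk.wav", "wb") as f:
    f.write(wav_bytes)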

Accumulating Chunks

If you need continuous audio (not chunked), accumulate bytes across requests:
from collections import defaultdict

from fastapi import FastAPI, Request

app = FastAPI()

# Store audio by session
audio_buffers = defaultdict(bytes)

@app.post("/audio")
async def receive_audio(request: Request, uid: str, sample_rate: int):
    audio_bytes = await request.body()

    # Accumulate audio for this user
    audio_buffers[uid] += audio_bytes

    # Process when you have enough audio (e.g., 60 seconds)
    if len(audio_buffers[uid]) >= sample_rate * 2 * 60:  # 2 bytes per sample
        process_audio(audio_buffers[uid], sample_rate)
        audio_buffers[uid] = bytes()

    return {"status": "ok"}

Example: Save to Google Cloud Storage

A complete example that saves audio files to Google Cloud Storage.

Create GCS Bucket

Follow the Saving Audio Guide steps 1-5 to create a bucket with proper permissions.

Fork the Example Repository

Clone and Deploy

Clone the repository and deploy to your preferred cloud provider (GCP, AWS, DigitalOcean) or run locally with ngrok. The repository includes a Dockerfile for easy deployment.

Set Environment Variables

Configure these environment variables during deployment:
  • GOOGLE_APPLICATION_CREDENTIALS_JSON: GCP service account credentials (base64 encoded)
  • GCS_BUCKET_NAME: your GCS bucket name

Configure Omi App

Set the endpoint in Developer Settings → Realtime audio bytes:
https://your-deployment-url.com/audio

Verify

Audio files should now appear in your GCS bucket every X seconds (based on your configured interval).
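If you want to see the general shape of such a service, here is a minimal sketch (not the repository's actual code) that decodes the base64-encoded credentials, wraps each incoming chunk in a WAV header using the create_wav helper above (assumed to be in the same module), and uploads it to the bucket:

import base64
import json
import os
import time

from fastapi import FastAPI, Request
from google.cloud import storage
from google.oauth2 import service_account

# Build a GCS client from the base64-encoded service account JSON
info = json.loads(base64.b64decode(os.environ["GOOGLE_APPLICATION_CREDENTIALS_JSON"]))
credentials = service_account.Credentials.from_service_account_info(info)
client = storage.Client(credentials=credentials, project=info["project_id"])
bucket = client.bucket(os.environ["GCS_BUCKET_NAME"])

app = FastAPI()

@app.post("/audio")
async def receive_audio(request: Request, uid: str, sample_rate: int):
    audio_bytes = await request.body()

    # One WAV object per chunk, grouped by user
    blob = bucket.blob(f"{uid}/{int(time.time())}.wav")
    blob.upload_from_string(create_wav(audio_bytes, sample_rate), content_type="audio/wav")

    return {"status": "ok"}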

Processing Ideas

Feed audio to your own ASR models for specialized vocabulary or languages:
import tempfile

import whisper
from fastapi import FastAPI, Request

app = FastAPI()
model = whisper.load_model("base")

@app.post("/audio")
async def transcribe(request: Request, sample_rate: int):
    audio_bytes = await request.body()
    wav_data = create_wav(audio_bytes, sample_rate)

    # Save temporarily and transcribe
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        f.write(wav_data)
        f.flush()
        result = model.transcribe(f.name)

    return {"text": result["text"]}

Detect speech vs. silence for custom endpointing:
import webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3

def detect_speech(audio_bytes: bytes, sample_rate: int) -> bool:
    # webrtcvad needs 10, 20, or 30ms frames
    frame_duration = 30  # ms
    frame_size = int(sample_rate * frame_duration / 1000) * 2

    speech_frames = 0
    total_frames = 0

    for i in range(0, len(audio_bytes), frame_size):
        frame = audio_bytes[i:i + frame_size]
        if len(frame) == frame_size:
            if vad.is_speech(frame, sample_rate):
                speech_frames += 1
            total_frames += 1

    return speech_frames / total_frames > 0.5 if total_frames else False
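For example, you might drop silent chunks before doing any expensive work (process_audio is a hypothetical downstream step):

@app.post("/audio")
async def receive_audio(request: Request, sample_rate: int):
    audio_bytes = await request.body()
    # Only process chunks that contain speech
    if detect_speech(audio_bytes, sample_rate):
        process_audio(audio_bytes, sample_rate)
    return {"status": "ok"}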

Extract embeddings for speaker identification or audio similarity:
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

def get_embedding(wav_path: str):
    return classifier.encode_batch(
        classifier.load_audio(wav_path)
    )
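To compare two speakers you could, for instance, take the cosine similarity of their embeddings. A sketch building on get_embedding above; the threshold is illustrative, not a recommended value:

import torch

def same_speaker(wav_a: str, wav_b: str, threshold: float = 0.7) -> bool:
    emb_a = get_embedding(wav_a).squeeze()
    emb_b = get_embedding(wav_b).squeeze()
    # Cosine similarity between the two 1-D embedding vectors
    similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
    return similarity.item() > threshold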

Analyze emotional tone from audio features:
import librosa
import numpy as np

def extract_features(audio_bytes: bytes, sample_rate: int):
    # Convert bytes to numpy array
    audio = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32)
    audio = audio / 32768.0  # Normalize

    # Extract features
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
    energy = librosa.feature.rms(y=audio)

    return {
        "mfcc_mean": mfccs.mean(axis=1).tolist(),
        "energy_mean": float(energy.mean()),
    }

Best Practices

Respond Quickly

Return 200 OK immediately and process asynchronously. Slow responses may cause timeouts.
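One way to do this with FastAPI (assuming the same setup as the earlier examples; process_audio is a placeholder for your own logic) is to hand the heavy work to a background task and return right away:

from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()

def process_audio(audio_bytes: bytes, sample_rate: int) -> None:
    ...  # transcription, analysis, upload, etc.

@app.post("/audio")
async def receive_audio(
    request: Request, uid: str, sample_rate: int, background_tasks: BackgroundTasks
):
    audio_bytes = await request.body()
    # Queue the slow work and respond immediately
    background_tasks.add_task(process_audio, audio_bytes, sample_rate)
    return {"status": "ok"}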

Handle Missing Data

Network issues may cause gaps. Design your processing to handle incomplete audio.

Buffer Appropriately

Choose the chunk interval based on your use case: larger chunks mean fewer requests but higher latency. At 16,000 Hz, PCM16 mono audio arrives at about 32 KB per second, so a 10-second chunk is roughly 320 KB per request.

Monitor Usage

Audio streaming generates significant data: continuous 16 kHz PCM16 audio is roughly 2.7 GB per user per day. Monitor storage and bandwidth costs.

Audio data is sensitive. Ensure your endpoint is secured with HTTPS and implement appropriate access controls.