> ## Documentation Index > Fetch the complete documentation index at: https://docs.omi.me/llms.txt > Use this file to discover all available pages before exploring further. # Real-Time Audio Streaming > Stream raw audio bytes from your Omi device to any backend for custom speech processing, VAD, or audio analysis. ## Overview Omi allows you to stream audio bytes from your DevKit directly to your backend or any external service. This enables custom audio processing like: Use your own ASR models instead of Omi's default transcription Implement custom VAD logic for specialized use cases Extract features, spectrograms, or embeddings in real-time Store raw audio for later processing or compliance ```mermaid theme={null} flowchart LR subgraph Device["Omi Device"] M[Microphone] --> A[Audio Buffer] end A -->|Every X seconds| B[Omi Backend] B -->|POST octet-stream| W[Your Webhook] W --> P[Your Processing] ``` *** ## Technical Specifications | Specification | Value | | -------------------- | ----------------------------------------------------------------- | | **HTTP Method** | POST | | **Content-Type** | `application/octet-stream` | | **Audio Format** | Raw PCM16 (16-bit signed, little-endian) | | **Bytes per Sample** | 2 | | **Sample Rate** | 16,000 Hz (DevKit1 v1.0.4+, DevKit2) or 8,000 Hz (DevKit1 v1.0.2) | | **Channels** | Mono (1 channel) | The sample rate is passed as a query parameter so your endpoint can handle different device versions. *** ## Setup Guide Create a webhook that accepts POST requests with binary audio data. **Request format:** ``` POST /your-endpoint?sample_rate=16000&uid=user123 Content-Type: application/octet-stream Body: [raw PCM16 audio bytes] ``` Your endpoint should: * Accept `application/octet-stream` content type * Read `sample_rate` and `uid` from query parameters * Process the raw bytes (buffer, save, or analyze) * Return 200 OK quickly to avoid timeouts 1. Open the Omi App 2. Go to **Settings** → **Developer Mode** 3. Scroll to **Realtime audio bytes** 4. Enter your webhook URL 5. Set the **Every x seconds** field (e.g., `10` for 10-second chunks) Start speaking while wearing your Omi device. Audio bytes should arrive at your webhook at the configured interval. Use [webhook.site](https://webhook.site) to verify data is arriving before implementing your processing logic. *** ## Working with Audio Bytes ### Converting to WAV The received bytes are raw PCM16 audio. To create a playable WAV file, prepend a WAV header: ```python theme={null} import struct import wave import io def create_wav(audio_bytes: bytes, sample_rate: int) -> bytes: """Convert raw PCM16 bytes to WAV format.""" buffer = io.BytesIO() with wave.open(buffer, 'wb') as wav_file: wav_file.setnchannels(1) # Mono wav_file.setsampwidth(2) # 16-bit = 2 bytes wav_file.setframerate(sample_rate) wav_file.writeframes(audio_bytes) buffer.seek(0) return buffer.read() ``` ### Accumulating Chunks If you need continuous audio (not chunked), accumulate bytes across requests: ```python theme={null} from collections import defaultdict # Store audio by session audio_buffers = defaultdict(bytes) @app.post("/audio") async def receive_audio(request: Request, uid: str, sample_rate: int): audio_bytes = await request.body() # Accumulate audio for this user audio_buffers[uid] += audio_bytes # Process when you have enough audio (e.g., 60 seconds) if len(audio_buffers[uid]) >= sample_rate * 2 * 60: # 2 bytes per sample process_audio(audio_buffers[uid], sample_rate) audio_buffers[uid] = bytes() return {"status": "ok"} ``` *** ## Example: Save to Google Cloud Storage A complete example that saves audio files to Google Cloud Storage. Follow the [Saving Audio Guide](/doc/developer/savingaudio) steps 1-5 to create a bucket with proper permissions. Fork [github.com/mdmohsin7/omi-audio-streaming](https://github.com/mdmohsin7/omi-audio-streaming) Clone the repository and deploy to your preferred cloud provider (GCP, AWS, DigitalOcean) or run locally with ngrok. The repository includes a Dockerfile for easy deployment. Configure these environment variables during deployment: | Variable | Description | | ------------------------------------- | ------------------------------------------------ | | `GOOGLE_APPLICATION_CREDENTIALS_JSON` | GCP service account credentials (base64 encoded) | | `GCS_BUCKET_NAME` | Your GCS bucket name | Set the endpoint in Developer Settings → Realtime audio bytes: ``` https://your-deployment-url.com/audio ``` Audio files should now appear in your GCS bucket every X seconds (based on your configured interval). *** ## Processing Ideas Feed audio to your own ASR models for specialized vocabulary or languages: ```python theme={null} import whisper model = whisper.load_model("base") @app.post("/audio") async def transcribe(request: Request, sample_rate: int): audio_bytes = await request.body() wav_data = create_wav(audio_bytes, sample_rate) # Save temporarily and transcribe with tempfile.NamedTemporaryFile(suffix=".wav") as f: f.write(wav_data) f.flush() result = model.transcribe(f.name) return {"text": result["text"]} ``` Detect speech vs. silence for custom endpointing: ```python theme={null} import webrtcvad vad = webrtcvad.Vad(3) # Aggressiveness 0-3 def detect_speech(audio_bytes: bytes, sample_rate: int) -> bool: # webrtcvad needs 10, 20, or 30ms frames frame_duration = 30 # ms frame_size = int(sample_rate * frame_duration / 1000) * 2 speech_frames = 0 total_frames = 0 for i in range(0, len(audio_bytes), frame_size): frame = audio_bytes[i:i + frame_size] if len(frame) == frame_size: if vad.is_speech(frame, sample_rate): speech_frames += 1 total_frames += 1 return speech_frames / total_frames > 0.5 if total_frames else False ``` Extract embeddings for speaker identification or audio similarity: ```python theme={null} from speechbrain.pretrained import EncoderClassifier classifier = EncoderClassifier.from_hparams( source="speechbrain/spkrec-ecapa-voxceleb" ) def get_embedding(wav_path: str): return classifier.encode_batch( classifier.load_audio(wav_path) ) ``` Analyze emotional tone from audio features: ```python theme={null} import librosa import numpy as np def extract_features(audio_bytes: bytes, sample_rate: int): # Convert bytes to numpy array audio = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) audio = audio / 32768.0 # Normalize # Extract features mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13) energy = librosa.feature.rms(y=audio) return { "mfcc_mean": mfccs.mean(axis=1).tolist(), "energy_mean": float(energy.mean()), } ``` *** ## Best Practices Return 200 OK immediately, process async. Slow responses may cause timeouts. Network issues may cause gaps. Design your processing to handle incomplete audio. Choose chunk interval based on your use case. Larger chunks = fewer requests but higher latency. Audio streaming generates significant data. Monitor storage and bandwidth costs. Audio data is sensitive. Ensure your endpoint is secured with HTTPS and implement appropriate access controls. *** ## Related Documentation Overview of webhook-based integrations Guide to storing audio in cloud storage How Omi's built-in transcription works Overview of all Omi app types