VoiceLive Architecture Clarification

Architecture Overview

The VoiceLive WebSocket proxy architecture works as follows:

┌─────────────────┐         WebSocket          ┌──────────────────┐         VoiceLive SDK         ┌─────────────────────┐
│   Browser       │ ◀────────────────────────▶│   Backend        │ ◀──────────────────────────▶│  Azure VoiceLive    │
│  (SWA Frontend) │   wss://backend/voicelive/ │  (Container App) │   Direct Connection        │  (gpt-realtime)     │
│                 │                             │                  │                             │                     │
│ VoiceChat.tsx   │                             │ voice.py         │                             │ Cognitive Services  │
└─────────────────┘                             └──────────────────┘                             └─────────────────────┘
       │                                                 │                                                    │
       │ Audio chunks                                    │                                                    │
       │ Transcripts                                      │ Audio + Events                                    │
       │                                                 │                                                    │
       │                                                 │ (Automatic Memory Persistence)                     │
       │                                                 ▼                                                    │
       │                                         ┌──────────────────┐                                        │
       └────────────────────────────────────────▶│   Zep Memory     │                                        │
                                                 │   (Episodic)     │                                        │
                                                 └──────────────────┘                                        │

Key Points

1. SWA is the Frontend (Browser)

The Static Web App (SWA) is the frontend code that runs in the user’s browser:

Location: https://engram.work (deployed to Azure Static Web Apps)
Code: React/TypeScript application (frontend/ directory)
Component: VoiceChat.tsx handles voice interaction
Connection: Browser connects to backend via WebSocket

2. Backend is the Proxy

The Backend (Container App) acts as a WebSocket proxy:

Location: staging-env-api.gentleriver-dd0de193.eastus2.azurecontainerapps.io
Endpoint: /api/v1/voice/voicelive/{session_id}
Function:
- Receives WebSocket connection from browser
- Connects to Azure VoiceLive using SDK
- Proxies audio and events between browser and Azure
- Automatically persists transcripts to Zep memory
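As a small illustration of the endpoint above, the helper below builds the URL a client would open for one session. This is a sketch only: the function name and the example host are hypothetical, and the `wss://` scheme is taken from the Visual Flow diagram in this document.

```python
from urllib.parse import quote


def voicelive_ws_url(backend_host: str, session_id: str) -> str:
    """Build the WebSocket URL for one voice session (hypothetical helper)."""
    # Percent-encode the session id so it cannot add extra path segments.
    safe_id = quote(session_id, safe="")
    return f"wss://{backend_host}/api/v1/voice/voicelive/{safe_id}"
```

For example, `voicelive_ws_url("backend.example", "abc-123")` yields a URL ending in `/api/v1/voice/voicelive/abc-123`.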

3. Flow Direction

Audio Flow:

User speaks → Browser captures audio (microphone)
Browser → Sends audio chunks to Backend via WebSocket
Backend → Forwards audio to Azure VoiceLive
Azure VoiceLive → Processes audio, generates response
Azure VoiceLive → Sends audio + transcripts back to Backend
Backend → Forwards audio + transcripts to Browser
Browser → Plays audio and displays transcripts
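The two directions of the audio flow can be sketched as a pair of concurrent pumps. This is not the code in voice.py: the `Receive` and `Send` callables are assumed abstractions standing in for the browser WebSocket and the VoiceLive SDK connection, whose real interfaces are not shown in this document.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

# Assumed abstractions: "receive" returns the next message, or None on close.
Receive = Callable[[], Awaitable[Optional[bytes]]]
Send = Callable[[bytes], Awaitable[None]]


async def pump(receive: Receive, send: Send) -> int:
    """Forward messages one way until the source closes; return the count."""
    forwarded = 0
    while (message := await receive()) is not None:
        await send(message)
        forwarded += 1
    return forwarded


async def relay(browser_recv: Receive, browser_send: Send,
                azure_recv: Receive, azure_send: Send) -> Tuple[int, int]:
    """Run browser -> Azure and Azure -> browser forwarding concurrently."""
    upstream, downstream = await asyncio.gather(
        pump(browser_recv, azure_send),   # audio chunks from the user
        pump(azure_recv, browser_send),   # audio + transcripts back
    )
    return upstream, downstream
```

A production proxy would likely also cancel the opposite direction as soon as one side disconnects; this sketch simply waits for both sources to finish.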

Memory Flow:

Backend → Extracts transcripts from VoiceLive events
Backend → Stores transcripts in EnterpriseContext
Backend → Persists to Zep Memory (async, non-blocking)
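The memory flow above might look roughly like the sketch below. Everything named here is an assumption for illustration: the event type strings are modeled on realtime-style transcript events and may differ in the VoiceLive SDK, `SessionContext` is a stand-in for EnterpriseContext, and `persist` is a placeholder for whatever call writes to Zep.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable, Coroutine, List, Optional, Tuple

# Assumed event types mapped to speaker roles; real names depend on the SDK.
TRANSCRIPT_EVENTS = {
    "conversation.item.input_audio_transcription.completed": "user",
    "response.audio_transcript.done": "assistant",
}

Persist = Callable[[str, str, str], Coroutine[Any, Any, None]]


@dataclass
class SessionContext:
    """Hypothetical stand-in for EnterpriseContext: the turns seen so far."""
    session_id: str
    turns: List[Tuple[str, str]] = field(default_factory=list)


def extract_transcript(event: dict) -> Optional[Tuple[str, str]]:
    """Return (role, text) for transcript events, or None for anything else."""
    role = TRANSCRIPT_EVENTS.get(event.get("type", ""))
    text = (event.get("transcript") or "").strip()
    return (role, text) if role and text else None


def record_event(ctx: SessionContext, event: dict,
                 persist: Persist) -> Optional[asyncio.Task]:
    """Store a transcript in the context and schedule persistence."""
    turn = extract_transcript(event)
    if turn is None:
        return None
    ctx.turns.append(turn)
    # Schedule the write without awaiting it, so audio forwarding is not
    # blocked. The caller should keep the returned task referenced.
    return asyncio.create_task(persist(ctx.session_id, *turn))
```

`record_event` must be called from inside a running event loop, since `asyncio.create_task` needs one. Returning the task lets the proxy await or cancel pending writes when the session ends.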

Why Proxy Through Backend?

Benefits

1. Works with Unified Endpoints
   - Unified endpoints (services.ai.azure.com) don’t support REST token endpoints
   - WebSocket proxy bypasses this limitation
2. Automatic Memory Integration
   - Backend automatically extracts transcripts
   - Persists to Zep without frontend code
   - Memory enrichment happens automatically
3. Simplified Frontend
   - No need to manage Azure tokens
   - No need to handle VoiceLive SDK complexity
   - Just connect to backend WebSocket
4. Better Error Handling
   - Backend can retry connections
   - Backend can handle reconnection logic
   - Centralized error handling
5. Security
   - API keys stay on backend
   - No exposure to browser
   - Backend handles authentication
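To make the retry point concrete, here is one way a backend could retry its upstream connection with exponential backoff. This is a generic sketch, not the project's actual logic: `connect` is a hypothetical callable, and the attempt count and delays are arbitrary defaults.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def connect_with_retry(
    connect: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without real delays.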

Alternative Architecture (Not Used)

Direct Browser-to-Azure (Token Endpoint)

Browser → Get Token from Backend → Connect Directly to Azure

Why Not Used:

❌ Token endpoint doesn’t work with unified endpoints
❌ Requires direct OpenAI endpoint (openai.azure.com)
❌ Frontend must manage tokens
❌ No automatic memory persistence
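The endpoint constraint above can be expressed as a simple host check. This is an illustrative sketch based only on the two domains named in this document; the function name and the example resource names are hypothetical.

```python
from urllib.parse import urlparse


def supports_token_flow(endpoint: str) -> bool:
    """True only for direct OpenAI endpoints (*.openai.azure.com)."""
    host = (urlparse(endpoint).hostname or "").lower()
    # Unified endpoints (*.services.ai.azure.com) fall through to False,
    # which is the case where the backend proxy is needed.
    return host.endswith(".openai.azure.com")
```

With this check, a unified endpoint such as `https://myres.services.ai.azure.com` is classified as needing the backend proxy.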

Summary

Voice traffic does NOT pass through the backend in order to reach the SWA.

Correct understanding:

SWA = Frontend (runs in browser)
Backend = WebSocket proxy (runs in Container App)
Flow: Browser (SWA) → Backend → Azure VoiceLive

The backend is a proxy between the browser and Azure, not between Azure and the SWA. The SWA IS the browser frontend.

Visual Flow

User's Browser (SWA Frontend)
    │
    │ WebSocket: wss://backend/api/v1/voice/voicelive/{session_id}
    │
    ▼
Backend Container App (Proxy)
    │
    │ VoiceLive SDK Connection
    │
    ▼
Azure VoiceLive (gpt-realtime)
    │
    │ Transcripts (extracted and persisted by the Backend, async)
    │
    ▼
Zep Memory (Episodic Storage)

The backend sits between the browser and Azure, proxying the connection and handling memory persistence automatically.