# VoiceLive Architecture Clarification

## Architecture Overview
The VoiceLive WebSocket proxy architecture works as follows:
```
┌─────────────────┐         WebSocket          ┌──────────────────┐      VoiceLive SDK      ┌─────────────────────┐
│     Browser     │ ◀────────────────────────▶ │     Backend      │ ◀─────────────────────▶ │   Azure VoiceLive   │
│ (SWA Frontend)  │  wss://backend/voicelive/  │ (Container App)  │    Direct Connection    │   (gpt-realtime)    │
│                 │                            │                  │                         │                     │
│  VoiceChat.tsx  │                            │     voice.py     │                         │ Cognitive Services  │
└─────────────────┘                            └──────────────────┘                         └─────────────────────┘
                                                        │
                                                        │ Audio chunks + Transcripts
                                                        │ (Automatic Memory Persistence)
                                                        ▼
                                               ┌──────────────────┐
                                               │    Zep Memory    │
                                               │    (Episodic)    │
                                               └──────────────────┘
```
## Key Points

### 1. SWA is the Frontend (Browser)
The Static Web App (SWA) is the frontend code that runs in the user's browser:

- Location: `https://engram.work` (deployed to Azure Static Web Apps)
- Code: React/TypeScript application (`frontend/` directory)
- Component: `VoiceChat.tsx` handles voice interaction
- Connection: Browser connects to the backend via WebSocket
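For illustration, a minimal Python client can exercise the same backend endpoint outside the browser (the real client is `VoiceChat.tsx`; the binary framing here is an assumption, not the documented protocol):

```python
# Hypothetical smoke-test client for the backend proxy endpoint; the real
# client is VoiceChat.tsx in the browser. The message framing is an assumption.
import asyncio
import websockets  # pip install websockets

BACKEND = "wss://staging-env-api.gentleriver-dd0de193.eastus2.azurecontainerapps.io"

async def main(session_id: str) -> None:
    url = f"{BACKEND}/api/v1/voice/voicelive/{session_id}"
    async with websockets.connect(url) as ws:
        # Send one fake audio chunk (~100 ms of 16 kHz, 16-bit mono silence) ...
        await ws.send(b"\x00" * 3200)
        # ... then read a few messages back (audio bytes or JSON events).
        for _ in range(3):
            message = await ws.recv()
            print(type(message), message[:60])

asyncio.run(main("demo-session"))
```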
### 2. Backend is the Proxy
The Backend (Container App) acts as a WebSocket proxy:
- Location: `staging-env-api.gentleriver-dd0de193.eastus2.azurecontainerapps.io`
- Endpoint: `/api/v1/voice/voicelive/{session_id}`
- Function (see the sketch below):
  - Receives the WebSocket connection from the browser
  - Connects to Azure VoiceLive using the SDK
  - Proxies audio and events between the browser and Azure
  - Automatically persists transcripts to Zep memory
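For orientation, here is a minimal sketch of the proxy shape, assuming a FastAPI backend (the actual `voice.py` is not reproduced here). `connect_voicelive()` and the two forwarding coroutines are hypothetical helpers; the forwarders are sketched under Flow Direction below:

```python
# Sketch only, not the actual voice.py. FastAPI is assumed;
# connect_voicelive(), forward_browser_to_azure(), and
# forward_azure_to_browser() are hypothetical helpers.
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/api/v1/voice/voicelive/{session_id}")
async def voicelive_proxy(browser_ws: WebSocket, session_id: str):
    await browser_ws.accept()            # accept the browser's WebSocket
    azure = await connect_voicelive()    # open the Azure VoiceLive SDK session
    await asyncio.gather(                # proxy both directions concurrently
        forward_browser_to_azure(browser_ws, azure),
        forward_azure_to_browser(azure, browser_ws, session_id),
    )
```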
### 3. Flow Direction
Audio Flow:
- User speaks → Browser captures audio (microphone)
- Browser → Sends audio chunks to Backend via WebSocket
- Backend → Forwards audio to Azure VoiceLive
- Azure VoiceLive → Processes audio, generates response
- Azure VoiceLive → Sends audio + transcripts back to Backend
- Backend → Forwards audio + transcripts to Browser
- Browser → Plays audio and displays transcripts
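Continuing the sketch above, the two forwarding coroutines map directly onto this flow; the `azure.send_audio()` / `azure.events()` surface is an assumption standing in for the real VoiceLive SDK:

```python
# Hypothetical forwarding coroutines; send_audio() / events() stand in for
# whatever the real VoiceLive SDK session exposes.
async def forward_browser_to_azure(browser_ws, azure):
    while True:
        chunk = await browser_ws.receive_bytes()    # mic audio from the browser
        await azure.send_audio(chunk)               # forward to Azure VoiceLive

async def forward_azure_to_browser(azure, browser_ws, session_id):
    async for event in azure.events():              # responses from Azure
        if event.type == "audio":
            await browser_ws.send_bytes(event.data)    # browser plays this back
        elif event.type == "transcript":
            await browser_ws.send_json(                # browser displays this
                {"type": "transcript", "text": event.text}
            )
```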
Memory Flow:
- Backend → Extracts transcripts from VoiceLive events
- Backend → Stores transcripts in EnterpriseContext
- Backend → Persists to Zep Memory (async, non-blocking)
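As a sketch of the non-blocking persistence step, assuming the zep-cloud Python client (the exact `Message` fields and role handling are assumptions):

```python
# Sketch of async, fire-and-forget persistence to Zep; client setup and
# Message fields follow the zep-cloud Python SDK but are assumptions here.
import asyncio
from zep_cloud.client import AsyncZep
from zep_cloud.types import Message

zep = AsyncZep(api_key="...")  # key lives on the backend only

async def persist_transcript(session_id: str, role: str, text: str) -> None:
    await zep.memory.add(
        session_id=session_id,
        messages=[Message(role_type=role, content=text)],
    )

# In the proxy's event loop, schedule rather than await, so persistence
# never blocks the audio path:
#   asyncio.create_task(persist_transcript(session_id, "user", transcript))
```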
## Why Proxy Through Backend?

### Benefits
- Works with Unified Endpoints
  - Unified endpoints (`services.ai.azure.com`) don't support REST token endpoints
  - The WebSocket proxy bypasses this limitation
- Automatic Memory Integration
  - Backend automatically extracts transcripts
  - Persists to Zep without frontend code
  - Memory enrichment happens automatically
- Simplified Frontend
  - No need to manage Azure tokens
  - No need to handle VoiceLive SDK complexity
  - Just connect to the backend WebSocket
- Better Error Handling (see the retry sketch after this list)
  - Backend can retry connections
  - Backend can handle reconnection logic
  - Centralized error handling
- Security
  - API keys stay on the backend
  - No exposure to the browser
  - Backend handles authentication
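To illustrate the retry point above, a backend-side reconnect wrapper might look like this (a sketch; `connect_voicelive()` is the same hypothetical helper as in the proxy sketch):

```python
# Hypothetical exponential-backoff wrapper around the Azure connection.
import asyncio

async def connect_with_retry(max_attempts: int = 5):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return await connect_voicelive()  # hypothetical SDK helper
        except ConnectionError:
            if attempt == max_attempts:
                raise                         # give up after the last attempt
            await asyncio.sleep(delay)        # back off before retrying
            delay = min(delay * 2, 30.0)      # cap the backoff at 30 s
```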
## Alternative Architecture (Not Used)

### Direct Browser-to-Azure (Token Endpoint)

Browser → Get Token from Backend → Connect Directly to Azure

Why Not Used:

- ❌ Token endpoint doesn't work with unified endpoints
- ❌ Requires a direct OpenAI endpoint (`openai.azure.com`)
- ❌ Frontend must manage tokens
- ❌ No automatic memory persistence
## Summary

Voice traffic does NOT proxy through the backend to get to the SWA.
Correct understanding:
- SWA = Frontend (runs in browser)
- Backend = WebSocket proxy (runs in Container App)
- Flow: Browser (SWA) → Backend → Azure VoiceLive
The backend is a proxy between the browser and Azure, not between Azure and the SWA. The SWA IS the browser frontend.
## Visual Flow

```
User's Browser (SWA Frontend)
        │
        │ WebSocket: wss://backend/api/v1/voice/voicelive/{session_id}
        │
        ▼
Backend Container App (Proxy)
        │
        │ VoiceLive SDK Connection
        │
        ▼
Azure VoiceLive (gpt-realtime)
        │
        │ Transcripts (async)
        │
        ▼
Zep Memory (Episodic Storage)
```
The backend sits between the browser and Azure, proxying the connection and handling memory persistence automatically.