# Voice & Chat Integration Guide
Engram provides two integrated communication channels for interacting with AI agents:
| Channel | Endpoint | Model | Auth | Use Case |
|---|---|---|---|---|
| Chat | API Gateway | gpt-5.1-chat | API Key | Text-based conversations |
| Voice | Azure AI Services | gpt-realtime | DefaultAzureCredential | Real-time voice interactions |
## Security Model (NIST AI RMF Compliant)
| Level | Environment | Chat Auth | VoiceLive Auth |
|---|---|---|---|
| 1-2 | POC/Staging | API Key (APIM) | Azure CLI / DefaultAzureCredential |
| 3-5 | Production/Enterprise | API Key (APIM) | Managed Identity (DefaultAzureCredential) |
`DefaultAzureCredential` automatically selects the best credential for each environment (see the token sketch after this list):
- Local Dev: Azure CLI (`az login`)
- Azure Container Apps: Managed Identity
- AKS: Workload Identity
- VMs: System-assigned Managed Identity
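For example, a minimal token-acquisition sketch with `@azure/identity`; the `cognitiveservices.azure.com` scope is an assumption for Azure AI Services, so confirm it against your resource:

```javascript
// Minimal sketch: DefaultAzureCredential picks Azure CLI, Managed Identity,
// etc. automatically, as described in the list above.
import { DefaultAzureCredential } from '@azure/identity';

const credential = new DefaultAzureCredential();
const accessToken = await credential.getToken(
  'https://cognitiveservices.azure.com/.default' // assumed scope for Azure AI Services
);
// accessToken.token holds the bearer token; accessToken.expiresOnTimestamp its expiry.
```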
## Chat Integration (API Gateway)

### Configuration
The chat system uses an OpenAI-compatible API gateway for text-based agent interactions.
```bash
# Environment Variables
AZURE_AI_ENDPOINT=https://zimax-gw.azure-api.net/zimax/openai/v1
AZURE_AI_KEY=<your-api-key>
AZURE_AI_DEPLOYMENT=gpt-5.1-chat
```
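Because the gateway is OpenAI-compatible, it can also be exercised directly with a standard OpenAI client. A minimal sketch from Node using the `openai` package (the prompt is illustrative):

```javascript
// Call the OpenAI-compatible gateway using the environment variables above.
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: process.env.AZURE_AI_ENDPOINT,
  apiKey: process.env.AZURE_AI_KEY,
});

const completion = await client.chat.completions.create({
  model: process.env.AZURE_AI_DEPLOYMENT, // gpt-5.1-chat
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(completion.choices[0].message.content);
```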
### API Endpoints

#### REST Chat (Single Turn)

```http
POST /api/v1/chat
Content-Type: application/json

{
  "content": "What are the key requirements for this project?",
  "agent_id": "elena",
  "session_id": "optional-session-id"
}
```
Response:
```json
{
  "message_id": "uuid",
  "content": "Based on my analysis...",
  "agent_id": "elena",
  "agent_name": "Dr. Elena Vasquez",
  "timestamp": "2025-12-15T10:30:00Z",
  "session_id": "session-123"
}
```
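For illustration, a client-side sketch of the same call; the localhost URL matches the local development setup later in this guide, and it is assumed that sending the returned `session_id` on the next request continues the conversation:

```javascript
// Single-turn chat request against the REST endpoint above.
async function askAgent(content, agentId, sessionId) {
  const res = await fetch('http://localhost:8082/api/v1/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ content, agent_id: agentId, session_id: sessionId }),
  });
  const reply = await res.json();
  // Assumption: reusing reply.session_id keeps follow-up turns in the same session.
  return reply;
}
```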
#### WebSocket Chat (Streaming)

```javascript
// Connect to streaming chat
const ws = new WebSocket('ws://localhost:8082/api/v1/chat/ws/session-123');

// Send message
ws.send(JSON.stringify({
  type: 'message',
  content: 'Analyze project risks',
  agent_id: 'marcus'
}));

// Receive streaming response
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'typing':
      // Agent is typing
      break;
    case 'chunk':
      // Append text chunk: data.content
      break;
    case 'complete':
      // Response finished: data.message_id
      break;
  }
};
```
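To reassemble the stream, the `chunk` payloads can simply be concatenated until `complete` arrives; a small sketch that would replace the handler above:

```javascript
// Accumulate streaming chunks into the final message text.
let buffer = '';
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'chunk') buffer += data.content;
  if (data.type === 'complete') {
    console.log(`Message ${data.message_id}:`, buffer);
    buffer = '';
  }
};
```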
### Agents
| Agent | ID | Expertise |
|---|---|---|
| Elena | elena | Business Analysis, Requirements Engineering |
| Marcus | marcus | Project Management, Risk Assessment |
## VoiceLive Integration (Real-time Voice)

### Configuration

VoiceLive provides real-time speech-to-speech conversations using Azure AI Services with NIST AI RMF-compliant authentication.
```bash
# Environment Variables
AZURE_VOICELIVE_ENDPOINT=https://zimax.services.ai.azure.com
AZURE_VOICELIVE_MODEL=gpt-realtime
AZURE_VOICELIVE_VOICE=en-US-Ava:DragonHDLatestNeural
MARCUS_VOICELIVE_VOICE=en-US-GuyNeural

# Optional: API key for POC/staging (not needed if using Azure CLI or Managed Identity)
# AZURE_VOICELIVE_KEY=<optional-for-poc-only>
```
### Authentication

VoiceLive uses `DefaultAzureCredential`, which automatically selects the appropriate credential:
| Environment | Credential Used |
|---|---|
| Local Development | Azure CLI (az login) |
| Azure Container Apps | Managed Identity |
| Azure Kubernetes Service | Workload Identity |
| Azure VMs | System Managed Identity |
| GitHub Actions | Federated Identity |
**Enterprise (Levels 3-5):** Managed Identity is required; no API keys in production.
### API Endpoints

#### Voice Configuration

```http
GET /api/v1/voice/config/{agent_id}
```
Response:
```json
{
  "agent_id": "elena",
  "voice_name": "en-US-Ava:DragonHDLatestNeural",
  "model": "gpt-realtime",
  "endpoint_configured": true
}
```
#### Voice Status

```http
GET /api/v1/voice/status
```
Response:
```json
{
  "voicelive_configured": true,
  "endpoint": "https://zimax.services.ai.azure.com...",
  "model": "gpt-realtime",
  "active_sessions": 0,
  "agents": {
    "elena": {"voice": "en-US-Ava:DragonHDLatestNeural"},
    "marcus": {"voice": "en-US-GuyNeural"}
  }
}
```
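A client can check this endpoint before opening the audio socket; a minimal sketch (the fallback behavior is illustrative):

```javascript
// Verify VoiceLive is configured before starting a voice session.
fetch('http://localhost:8082/api/v1/voice/status')
  .then((r) => r.json())
  .then((status) => {
    if (!status.voicelive_configured) {
      console.warn('VoiceLive not configured; fall back to text chat.');
    }
  });
```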
#### VoiceLive WebSocket (Real-time Audio)

```javascript
// Connect to VoiceLive
const ws = new WebSocket('ws://localhost:8082/api/v1/voice/voicelive/session-123');

ws.onopen = () => {
  // Set agent
  ws.send(JSON.stringify({ type: 'agent', agent_id: 'elena' }));
};

// Send audio (PCM16, 24kHz, mono)
function sendAudio(pcm16Base64) {
  ws.send(JSON.stringify({
    type: 'audio',
    data: pcm16Base64
  }));
}

// Switch agent mid-conversation
function switchAgent(agentId) {
  ws.send(JSON.stringify({
    type: 'agent',
    agent_id: agentId // 'elena' or 'marcus'
  }));
}

// Cancel current response
function cancelResponse() {
  ws.send(JSON.stringify({ type: 'cancel' }));
}

// Handle responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'transcription':
      // data.status: 'listening' | 'processing' | 'complete'
      // data.text: transcribed text (when complete)
      break;
    case 'audio':
      // data.data: base64 audio response
      // data.format: 'audio/pcm16'
      playAudio(data.data);
      break;
    case 'agent_switched':
      // data.agent_id: new active agent
      break;
    case 'error':
      // data.message: error description
      break;
  }
};
```
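The handler above calls `playAudio`, which is not defined in this guide. A minimal sketch of one possible implementation, assuming the 24 kHz mono PCM16 format described below; note it schedules each chunk immediately, so a production client would queue chunks to avoid overlap:

```javascript
// Decode a base64 PCM16 chunk and play it through the Web Audio API.
const playbackCtx = new AudioContext({ sampleRate: 24000 });

function playAudio(base64Pcm16) {
  const bytes = Uint8Array.from(atob(base64Pcm16), (c) => c.charCodeAt(0));
  const pcm16 = new Int16Array(bytes.buffer);
  const buffer = playbackCtx.createBuffer(1, pcm16.length, 24000); // mono, 24 kHz
  const channel = buffer.getChannelData(0);
  for (let i = 0; i < pcm16.length; i++) {
    channel[i] = pcm16[i] / 0x8000; // scale int16 to [-1, 1)
  }
  const src = playbackCtx.createBufferSource();
  src.buffer = buffer;
  src.connect(playbackCtx.destination);
  src.start(); // naive: plays immediately rather than queueing
}
```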
### Audio Format Requirements
| Property | Value |
|---|---|
| Format | PCM16 (signed 16-bit) |
| Sample Rate | 24,000 Hz (required for low latency) |
| Channels | Mono (1 channel) |
| Encoding | Base64 |
| Chunk Size | 1200 samples (50ms) |
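Producing audio in this format from the browser takes some glue code. A sketch using `getUserMedia` and a `ScriptProcessorNode` (deprecated but compact; an `AudioWorklet` is the modern replacement), buffering mic samples into exact 1200-sample chunks; browser support for `AudioContext({ sampleRate: 24000 })` is assumed:

```javascript
// Capture the microphone and stream 50 ms base64 PCM16 chunks over the socket.
async function startMicrophone(ws) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 24000 }); // resample to 24 kHz
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1); // mono in/out
  let pending = new Float32Array(0);

  processor.onaudioprocess = (e) => {
    const input = e.inputBuffer.getChannelData(0);
    // Append the new samples to the pending buffer.
    const merged = new Float32Array(pending.length + input.length);
    merged.set(pending);
    merged.set(input, pending.length);
    pending = merged;

    // Emit fixed 1200-sample (50 ms) chunks, per the table above.
    while (pending.length >= 1200) {
      const chunk = pending.subarray(0, 1200);
      pending = pending.slice(1200);
      const pcm16 = new Int16Array(1200);
      for (let i = 0; i < 1200; i++) {
        const s = Math.max(-1, Math.min(1, chunk[i])); // clamp before conversion
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      const b64 = btoa(String.fromCharCode(...new Uint8Array(pcm16.buffer)));
      ws.send(JSON.stringify({ type: 'audio', data: b64 }));
    }
  };

  source.connect(processor);
  processor.connect(ctx.destination); // keeps the processor running
}
```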
### Memory Enrichment
The Voice system integrates with memory in two directions:
#### 1. Pre-Session Context (Read)
Upon connection, the system retrieves up to 20 relevant facts from the user’s Zep memory graph and injects them into the agent’s system instructions. This gives the agent awareness of the user’s context (role, current projects, past conversations).
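For illustration only, a hedged sketch of this read path; `fetchUserFacts` and the prompt format are hypothetical stand-ins, not the actual Zep integration:

```javascript
// Hypothetical pre-session enrichment: fold up to 20 facts into the
// agent's system instructions. fetchUserFacts() is an illustrative helper,
// not a real API in this codebase.
async function buildSystemInstructions(userId, baseInstructions) {
  const facts = await fetchUserFacts(userId, { limit: 20 });
  const context = facts.map((f) => `- ${f}`).join('\n');
  return `${baseInstructions}\n\nKnown user context:\n${context}`;
}
```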
#### 2. Post-Session Persistence (Write) — v2 Architecture

> [!NOTE]
> VoiceLive v2 decouples audio streaming from memory enrichment. See the VoiceLive Configuration SOP for details.
In the evolved architecture:
```javascript
// Browser: On receiving final transcript
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'transcription' && data.status === 'complete') {
    // Async fire-and-forget to memory
    fetch('/api/v1/memory/enrich', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        text: data.text,
        session_id: sessionId,
        speaker: data.speaker, // 'user' or 'assistant'
        agent_id: currentAgentId
      })
    });
  }
};
```
**Key principle:** Audio playback happens immediately. Memory enrichment flows behind it asynchronously. If memory persistence fails, the voice experience is unaffected.
### Voice Personalities
| Agent | Voice | Personality |
|---|---|---|
| Elena | en-US-Ava:DragonHDLatestNeural | Warm, measured, professional |
| Marcus | en-US-GuyNeural | Confident, energetic, direct |
## Frontend Integration

### VoiceChat Component
The frontend includes a ready-to-use VoiceChat component:
```jsx
import VoiceChat from './components/VoiceChat';

function App() {
  return (
    <VoiceChat
      agentId="elena"
      onMessage={(msg) => console.log('Voice message:', msg)}
      onVisemes={(visemes) => { /* Lip-sync data */ }}
      onStatusChange={(status) => console.log('Status:', status)}
    />
  );
}
```
### Push-to-Talk Usage
- Hold the voice button to speak
- Release to send audio
- Agent responds with synthesized speech
- Switch agents anytime during conversation
## Local Development

### Start Backend

```bash
cd backend

# Set environment variables (or use .env file)
export AZURE_AI_ENDPOINT=https://zimax-gw.azure-api.net/zimax/openai/v1
export AZURE_AI_KEY=<your-key>
export AZURE_AI_DEPLOYMENT=gpt-5.1-chat
export AZURE_VOICELIVE_ENDPOINT=https://zimax.services.ai.azure.com
export AZURE_VOICELIVE_KEY=<your-key>
export AZURE_VOICELIVE_MODEL=gpt-realtime

# Start server
uvicorn backend.api.main:app --host 0.0.0.0 --port 8082 --reload
```
### Start Frontend

```bash
cd frontend
export VITE_API_URL=http://localhost:8082
export VITE_WS_URL=ws://localhost:8082
npm run dev
```
### Test Chat

```bash
curl -X POST http://localhost:8082/api/v1/chat \
  -H "Content-Type: application/json" \
  -d '{"content": "Hello Elena!", "agent_id": "elena"}'
```
### Test Voice Status

```bash
curl http://localhost:8082/api/v1/voice/status
```
## Troubleshooting

### Chat Issues
| Issue | Solution |
|---|---|
| 401 Unauthorized | Verify `AZURE_AI_KEY` is set correctly |
| Model not found | Verify `AZURE_AI_DEPLOYMENT=gpt-5.1-chat` |
| Temperature error | The gateway may not support a custom temperature; omit the parameter |
### Voice Issues
| Issue | Solution |
|---|---|
| “VoiceLive not configured” | Set `AZURE_VOICELIVE_ENDPOINT` and authenticate (`az login`, Managed Identity, or `AZURE_VOICELIVE_KEY` for POC) |
| Connection failed | Verify the endpoint supports real-time audio |
| No audio response | Check microphone permissions in browser |
### WebSocket Connection

```javascript
// Debug WebSocket connection
ws.onerror = (error) => console.error('WebSocket error:', error);
ws.onclose = (event) => console.log('WebSocket closed:', event.code, event.reason);
```
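If connections drop repeatedly, a reconnect wrapper with backoff helps separate transient network problems from configuration errors; a generic sketch, not part of the shipped client:

```javascript
// Reconnect with capped exponential backoff.
function connectWithRetry(url, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => { attempt = 0; }; // reset backoff once connected
  ws.onclose = () => {
    const delay = Math.min(30000, 1000 * 2 ** attempt); // 1s, 2s, ... capped at 30s
    setTimeout(() => connectWithRetry(url, attempt + 1), delay);
  };
  return ws;
}
```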
## Related Documentation
- Architecture - Platform design and context schema
- Agents - Elena and Marcus agent details
- Local Testing Guide - Development setup