# Voice Transcription Configuration & Troubleshooting SOP

Last Updated: December 21, 2025
Status: Production-Ready
Maintainer: Engram Platform Team

## Overview

Engram provides two transcription capabilities:

- Live Transcription (VoiceLive): Real-time speech-to-text during voice conversations
- Dictation (Chat): One-shot audio transcription for text input

Both use Azure Cognitive Services, but with different flows and purposes.

## Architecture

### Live Transcription (VoiceLive)
```
┌─────────────┐   24kHz PCM16   ┌──────────────┐     Events      ┌─────────────────────────┐
│   Browser   │ ───────────────▶│   Backend    │ ◀──────────────▶│      gpt-realtime       │
│  Microphone │                 │  (voice.py)  │                 │  (transcription events) │
└─────────────┘                 └──────┬───────┘                 └─────────────────────────┘
                                       │
                                       ▼
                                ┌──────────────┐
                                │   Frontend   │
                                │ (transcript) │
                                └──────────────┘
```
Events used:

- `CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA` - Partial user speech
- `CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED` - Final user speech
- `RESPONSE_AUDIO_TRANSCRIPT_DELTA` - Partial assistant speech
- `RESPONSE_AUDIO_TRANSCRIPT_DONE` - Final assistant speech
### Dictation (Chat Input)
```
┌─────────────┐    WAV/WebM    ┌──────────────┐    API Call    ┌─────────────────────────┐
│   Browser   │ ──────────────▶│   Backend    │ ──────────────▶│  Azure Speech Services  │
│  Recording  │                │  (chat.py)   │                │   (whisper or speech)   │
└─────────────┘                └──────────────┘                └─────────────────────────┘
```
## Prerequisites

### For VoiceLive Transcription

Same as the VoiceLive SOP - requires:

- `AZURE_VOICELIVE_ENDPOINT`
- `AZURE_VOICELIVE_KEY`
- `AZURE_VOICELIVE_MODEL` (gpt-realtime)
### For Dictation

Requires Azure Speech Services or OpenAI Whisper:

| Variable | Description | Example |
|---|---|---|
| `AZURE_SPEECH_KEY` | Azure Speech Services key | (from Key Vault) |
| `AZURE_SPEECH_REGION` | Azure region | `eastus2` |
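As a minimal sketch of how these variables might be consumed at startup (assuming the `azure-cognitiveservices-speech` SDK; the actual wiring lives in `chat.py`):

```python
import os

import azure.cognitiveservices.speech as speechsdk

def build_speech_config() -> speechsdk.SpeechConfig:
    """Hypothetical helper: build a SpeechConfig from the environment.

    Raises KeyError at startup if either variable is missing, so
    misconfiguration surfaces before the first transcription call.
    """
    key = os.environ["AZURE_SPEECH_KEY"]
    region = os.environ["AZURE_SPEECH_REGION"]
    return speechsdk.SpeechConfig(subscription=key, region=region)
```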
## Configuration

### VoiceLive Transcription

VoiceLive transcription is automatic when using the voice interface; no additional configuration is needed beyond the VoiceLive setup.
Transcription events are mapped to UI messages:
```
# Backend to Frontend message mapping
{
  "type": "transcription",
  "speaker": "user" | "assistant",
  "status": "listening" | "processing" | "complete",
  "text": "transcribed text..."
}
```
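As an illustrative sketch (not the exact `voice.py` implementation) of how the VoiceLive events listed under Architecture might be translated into these messages, assuming an async handler holding a FastAPI `WebSocket` to the frontend:

```python
from fastapi import WebSocket

# Hypothetical event-type → (speaker, status) table; the event names
# follow the constants listed in the Architecture section.
EVENT_TO_MESSAGE = {
    "CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA": ("user", "processing"),
    "CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED": ("user", "complete"),
    "RESPONSE_AUDIO_TRANSCRIPT_DELTA": ("assistant", "processing"),
    "RESPONSE_AUDIO_TRANSCRIPT_DONE": ("assistant", "complete"),
}

async def forward_transcription(event_type: str, text: str, ws: WebSocket) -> None:
    """Translate a VoiceLive transcription event into a UI message."""
    mapping = EVENT_TO_MESSAGE.get(event_type)
    if mapping is None:
        return  # Not a transcription event; nothing to forward.
    speaker, status = mapping
    await ws.send_json({
        "type": "transcription",
        "speaker": speaker,
        "status": status,
        "text": text,
    })
```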
### Dictation (Chat Input)
The dictation button in ChatPanel captures audio and sends it for transcription:
```typescript
// Frontend: ChatPanel.tsx
const handleDictation = async () => {
  // 1. Start recording
  // 2. Stop after silence or button press
  // 3. Send audio to /api/v1/chat/transcribe
  // 4. Insert transcribed text into input
};
```
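On the backend side, a minimal sketch of what the `/api/v1/chat/transcribe` handler could look like (assuming FastAPI; `transcribe_audio` is a hypothetical helper standing in for the Azure Speech / Whisper call in `chat.py`):

```python
from fastapi import APIRouter, HTTPException, Request

router = APIRouter()

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # Illustrative guardrail, not a documented limit.

async def transcribe_audio(audio: bytes) -> str:
    """Hypothetical helper wrapping Azure Speech Services / Whisper."""
    raise NotImplementedError

@router.post("/api/v1/chat/transcribe")
async def transcribe(request: Request) -> dict:
    """Accept raw WAV/WebM bytes and return the transcribed text."""
    audio = await request.body()
    if not audio:
        raise HTTPException(status_code=400, detail="Empty audio payload")
    if len(audio) > MAX_AUDIO_BYTES:
        raise HTTPException(status_code=413, detail="Audio payload too large")
    return {"text": await transcribe_audio(audio)}
```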
## Audio Format Requirements

### VoiceLive (Real-time)
| Parameter | Value |
|---|---|
| Format | PCM16 (raw) |
| Sample Rate | 24,000 Hz |
| Channels | Mono |
| Bit Depth | 16-bit |
> [!IMPORTANT]
> The browser records at 48kHz. The frontend must downsample to 24kHz before sending to the backend. The backend expects 24kHz PCM16.
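A quick server-side sanity check follows from the table above: 24,000 samples/s × 2 bytes/sample × 1 channel = 48,000 bytes per second. A hedged sketch of such a check (the helper and tolerance are illustrative, not from `voice.py`):

```python
SAMPLE_RATE_HZ = 24_000   # Required sample rate
BYTES_PER_SAMPLE = 2      # PCM16
CHANNELS = 1              # Mono

EXPECTED_BYTES_PER_SEC = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS  # 48,000

def looks_like_24khz_pcm16(chunk: bytes, duration_ms: float) -> bool:
    """Rough check: does the chunk size match 24kHz mono PCM16?"""
    expected = EXPECTED_BYTES_PER_SEC * (duration_ms / 1000.0)
    return abs(len(chunk) - expected) <= 0.1 * expected  # 10% jitter tolerance
```

A stream that was never downsampled from 48kHz shows roughly double the expected byte rate, which makes this a cheap early warning for the sample-rate issue covered under Troubleshooting.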
### Dictation (Batch)
| Parameter | Value |
|---|---|
| Format | WAV or WebM |
| Sample Rate | 16,000 Hz (optimal) |
| Channels | Mono |
| Max Duration | 60 seconds |
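A minimal sketch for validating a WAV payload against these limits using only the standard library (illustrative; the production checks in `chat.py` may differ):

```python
import io
import wave

MAX_DURATION_S = 60

def validate_wav(audio: bytes) -> None:
    """Raise ValueError if a WAV payload violates the dictation limits."""
    with wave.open(io.BytesIO(audio), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("Expected mono audio")
        duration_s = wav.getnframes() / wav.getframerate()
        if duration_s > MAX_DURATION_S:
            raise ValueError(f"Audio too long: {duration_s:.1f}s (max {MAX_DURATION_S}s)")
        # 16kHz is optimal rather than mandatory, so an off-rate file
        # is worth logging but not necessarily rejecting.
```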
## Troubleshooting

### Issue: No Transcription Appearing
Symptom: User speaks but no text appears in the transcript area.
Possible Causes:
- Microphone not accessible
- Audio not being sent to backend
- VoiceLive not processing audio
Solution:
- Check browser microphone permissions
- Verify WebSocket connection is established
- Check backend logs for transcription events:

```bash
az containerapp logs show --name engram-api --resource-group engram --tail 100 | grep -i transcript
```
### Issue: Partial Transcription Only
Symptom: Only partial text appears, never finalizes.
Root Cause: The `CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED` event is not being received.
Solution:
- Check VoiceLive voice activity detection (VAD) settings
- Adjust the silence threshold in the backend:

```python
turn_detection=ServerVad(
    threshold=0.6,  # Lower = more sensitive
    prefix_padding_ms=300,
    silence_duration_ms=800,  # Longer = waits more before finalizing
)
```
### Issue: Wrong Sample Rate Error
Symptom: Audio sounds distorted or connection fails.
Root Cause: Frontend sending 48kHz audio instead of 24kHz.
Solution: Verify the frontend downsampling in VoiceChat.tsx:

```typescript
// Must downsample from 48kHz to 24kHz; sketch of a 2:1 decimation
// (the real VoiceChat.tsx helper may low-pass filter first to avoid aliasing)
const downsample = (d: Float32Array, from: number, to: number) => d.filter((_, i) => i % (from / to) === 0);
const downsampledData = downsample(audioData, 48000, 24000);
```
### Issue: Dictation Button Not Working
Symptom: Clicking microphone button in chat does nothing.
Root Cause: Event handler not bound correctly.
Solution:
- Check browser console for errors
- Verify transcription API endpoint exists
- Check CORS configuration
## Verification Procedures

### 1. Verify VoiceLive Transcription

```bash
# Check voice status
curl -s https://engram.work/api/v1/voice/status | jq '.voicelive_configured'
# Should return: true
```
### 2. Test Transcription Events (Local)

Run the backend with debug logging:

```bash
LOG_LEVEL=DEBUG python -m uvicorn backend.api.main:app
```

Look for transcription events in the logs:

```
VoiceLive Event: CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA
VoiceLive Event: CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED
```
### 3. Verify Dictation Endpoint

```bash
# Test transcription endpoint (with a valid audio file)
curl -X POST https://engram.work/api/v1/chat/transcribe \
  -H "Content-Type: audio/wav" \
  --data-binary @test.wav
```
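The same check from Python, handy in an automated smoke test (a sketch; the JSON response shape is an assumption to confirm against `chat.py`):

```python
import requests

with open("test.wav", "rb") as f:
    resp = requests.post(
        "https://engram.work/api/v1/chat/transcribe",
        headers={"Content-Type": "audio/wav"},
        data=f,
    )
resp.raise_for_status()
print(resp.json())  # Assumed to contain the transcribed text
```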
## Memory Integration
Transcriptions are automatically stored in Zep memory:
```python
# User speech → memory
voice_context.episodic.add_turn(
    Turn(
        role=MessageRole.USER,
        content=final_text,
        agent_id=None,
    )
)

# Assistant speech → memory
voice_context.episodic.add_turn(
    Turn(
        role=MessageRole.ASSISTANT,
        content=final_text,
        agent_id=session.get("agent_id", "elena"),
    )
)
```
## Code References

| File | Purpose |
|---|---|
| `backend/api/routers/voice.py` | VoiceLive transcription handling |
| `backend/api/routers/chat.py` | Dictation endpoint |
| `frontend/src/components/VoiceChat/VoiceChat.tsx` | Voice UI and audio processing |
| `frontend/src/components/ChatPanel/ChatPanel.tsx` | Dictation button handler |
## Changelog
| Date | Change |
|---|---|
| 2025-12-21 | Initial SOP created |
| 2025-12-21 | Documented 24kHz sample rate requirement |
| 2025-12-21 | Added memory integration section |