Storage Strategy & Scale: Postgres/pgvector for Zep

Overview

Engram’s self-hosted Zep deployment uses PostgreSQL with the pgvector extension for storing:

  • Episodic memory (conversation sessions, message transcripts)
  • Semantic memory (knowledge graph facts, entities, relationships)
  • Vector embeddings (for similarity search)

This document outlines the storage strategy, tuning, retention, backups, and RPO/RTO targets per environment.

Postgres Configuration per Environment

Staging (POC)

  • SKU: Standard_B1ms (Burstable, 1 vCore, 2GB RAM)
  • Version: PostgreSQL 16
  • Storage: 32GB
  • High Availability: Disabled
  • Backup Retention: 7 days
  • Geo-redundant Backup: Disabled
  • Cost: Minimal (server can be stopped when idle to avoid compute charges)

Dev/Test

  • SKU: Standard_B1ms (Burstable)
  • Version: PostgreSQL 16
  • Storage: 32GB
  • High Availability: Disabled
  • Backup Retention: 7 days
  • Geo-redundant Backup: Disabled

UAT

  • SKU: Standard_D2s_v3 (General Purpose, 2 vCores, 8GB RAM)
  • Version: PostgreSQL 16
  • Storage: 64GB
  • High Availability: Zone-redundant
  • Backup Retention: 14 days
  • Geo-redundant Backup: Enabled
  • PITR: Point-in-time restore enabled

Production

  • SKU: Standard_D2s_v3 or higher (General Purpose, HA-capable)
  • Version: PostgreSQL 16
  • Storage: 128GB+ (scales with data volume)
  • High Availability: Zone-redundant
  • Backup Retention: 35 days
  • Geo-redundant Backup: Enabled
  • PITR: Point-in-time restore enabled

pgvector Extension Setup

Enable pgvector on Azure Postgres Flexible Server

  1. Add extension to allowed list:
    • Azure Portal → Postgres server → Server parameters
    • Search for azure.extensions
    • Add vector to the comma-separated list
    • Save
  2. Create extension in Zep database:
    -- Connect to 'zep' database
    \c zep
    
    -- Create pgvector extension
    CREATE EXTENSION IF NOT EXISTS vector;
    
    -- Verify
    SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
    
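A quick smoke test confirms the vector operators work; the literal values below are arbitrary:

-- Cosine distance between two 3-dimensional vectors (should return a float)
SELECT '[1,2,3]'::vector <=> '[3,2,1]'::vector AS cosine_distance;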

pgvector Index Strategy

Zep uses vector similarity search. Recommended indexes:

  1. HNSW (Hierarchical Navigable Small World) - Best default: higher recall and faster queries, at the cost of slower index builds and more memory
    CREATE INDEX ON zep_embeddings 
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
    
  2. IVFFlat - Alternative for smaller datasets or when memory is constrained; create the index after the table is populated so the list centroids are representative
    CREATE INDEX ON zep_embeddings 
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
    

Recommendation: Start with HNSW for UAT/Prod; use IVFFlat for Dev/Test if memory is constrained. Query-time tuning for both index types is sketched below.
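At query time, the distance operator must match the index opclass (<=> for cosine). A minimal sketch; zep_embeddings and its columns follow the index examples above and may differ from Zep's actual schema:

-- $1 is the query embedding (same dimensionality as the column)
SELECT id, embedding <=> $1::vector AS cosine_distance
FROM zep_embeddings
ORDER BY embedding <=> $1::vector
LIMIT 10;

-- HNSW recall/speed trade-off (per session; pgvector default is 40)
SET hnsw.ef_search = 100;

-- IVFFlat equivalent (pgvector default is 1)
SET ivfflat.probes = 10;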

Connection Pooling

  • Recommendation: Use PgBouncer (self-managed, or the built-in PgBouncer on Azure Postgres Flexible Server) for production workloads
  • Pool size: 20-50 connections per app instance (backend/worker)
  • Idle timeout: 10 minutes
  • Max lifetime: 1 hour
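To right-size the pool, current connection usage can be checked directly against the server limit; a minimal sketch:

-- Connections by state (idle connections are what pooling reclaims)
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

-- Server-wide ceiling the pools must stay under
SHOW max_connections;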

Vacuum & Maintenance

Autovacuum Configuration

Postgres autovacuum should run regularly to reclaim dead tuples, limit table and index bloat, and keep planner statistics fresh:

-- Check autovacuum status
SELECT schemaname, relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public';

Manual Vacuum (if needed)

-- Analyze table statistics
ANALYZE zep_facts;

-- Vacuum (reclaim space, update statistics)
VACUUM ANALYZE zep_facts;

Automation: Azure Postgres Flexible Server runs autovacuum automatically. Monitor via Azure Monitor metrics, with pg_stat_statements enabled for detailed query statistics.
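If autovacuum falls behind on high-churn tables, per-table thresholds can be lowered; a sketch, reusing the zep_embeddings name assumed above:

-- Vacuum after ~5% of rows change instead of the 20% default
ALTER TABLE zep_embeddings SET (autovacuum_vacuum_scale_factor = 0.05);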

Retention & Archival

Episodic Memory (Sessions)

  • Active retention: 90 days (configurable per tenant)
  • Archive policy: Move sessions older than 90 days to cold storage (Blob) with metadata preserved
  • Deletion: GDPR-compliant deletion on request (purge from Postgres + Blob)

Semantic Memory (Facts)

  • Active retention: Indefinite (facts don’t expire unless explicitly invalidated)
  • Versioning: Track valid_from / valid_to timestamps for temporal queries
  • Cleanup: Periodic job to remove facts with valid_to < now() (optional, depends on retention policy; sketched below)
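A sketch of that cleanup job; zep_facts and the valid_to column follow the naming used elsewhere in this document and should be checked against the deployed schema:

-- Remove facts invalidated more than 90 days ago (window is illustrative)
DELETE FROM zep_facts
WHERE valid_to IS NOT NULL
  AND valid_to < now() - interval '90 days';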

Backup Strategy

RPO (Recovery Point Objective)

  • Staging/Dev/Test: 24 hours (daily backups)
  • UAT: 1 hour (hourly snapshots + continuous log backup)
  • Production: 5 minutes (continuous log backup with PITR)

RTO (Recovery Time Objective)

  • Staging/Dev/Test: 4 hours (restore from backup snapshot)
  • UAT: 1 hour (restore from PITR)
  • Production: 15 minutes (restore from PITR + HA failover)

Backup Types

  1. Automated Backups (Azure Postgres Flexible Server)
    • Full backups: Daily
    • Log backups: Continuous (for PITR)
    • Retention: Per environment (see above)
  2. Manual Snapshots (before major schema migrations)
    az postgres flexible-server backup create \
      --resource-group engram-rg \
      --server-name <env>-db \
      --backup-name pre-migration-$(date +%Y%m%d)
    
  3. Export/Import (for cross-environment promotion)
    # Export Zep database
    pg_dump -h <source-host> -U cogadmin -d zep -F c -f zep-backup.dump
    
    # Import to target
    pg_restore -h <target-host> -U cogadmin -d zep zep-backup.dump
    

Encryption

  • Encryption at rest: Enabled by default (Azure Managed Keys)
  • Encryption in transit: TLS 1.2+ enforced by the server; clients must set sslmode=require (or stricter) in the DSN
  • Key rotation: Azure handles key rotation automatically (can use customer-managed keys for prod)
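Whether a given session is actually using TLS can be verified from pg_stat_ssl; a minimal check:

-- Returns ssl = true and the negotiated protocol (e.g. TLSv1.3) for this session
SELECT ssl, version FROM pg_stat_ssl WHERE pid = pg_backend_pid();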

Monitoring

Key Metrics

  • Connection pool utilization: active_connections / max_connections
  • Query performance: pg_stat_statements for slow queries
  • Index usage: pg_stat_user_indexes for index hit rates
  • Vacuum activity: pg_stat_progress_vacuum for long-running vacuums
  • Storage growth: pg_database_size('zep') for database size trends
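The catalog views named above can be queried directly; a few sketches (ordering and limits are illustrative):

-- Storage growth
SELECT pg_size_pretty(pg_database_size('zep'));

-- Slowest queries by mean execution time (requires pg_stat_statements)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 5;

-- Index usage: rarely-scanned indexes are candidates for review
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;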

Alerts

  • High connection count: >80% of max_connections
  • Slow queries: >5s execution time
  • Storage >90%: Approaching storage limit
  • Backup failures: Automated backup job failures

Scaling

Vertical Scaling

  • Increase SKU tier (B-series → D-series) for more CPU/RAM
  • Increase storage size (32GB → 64GB → 128GB)

Horizontal Scaling (Future)

  • Read replicas for read-heavy workloads (query-only access patterns)
  • Sharding (by tenant_id or date range) for very large datasets
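If tenant-based sharding is ever needed, native hash partitioning is one option; a purely illustrative sketch (Zep does not partition its tables out of the box, and these table names are hypothetical):

-- Hypothetical: hash-partition session data by tenant
CREATE TABLE sessions_by_tenant (
    tenant_id  uuid NOT NULL,
    session_id uuid NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, session_id)
) PARTITION BY HASH (tenant_id);

CREATE TABLE sessions_by_tenant_p0 PARTITION OF sessions_by_tenant
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...repeat for remainders 1-3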

Disaster Recovery

  1. Primary failure: HA failover (automatic for UAT/Prod with zone redundancy)
  2. Region failure: Geo-restore from geo-redundant backups (RTO: 1-4 hours)
  3. Data corruption: Restore from PITR to point before corruption

Data Plane Split

  • System of Record: Raw artifacts stored in Azure Blob Storage (see docs/connectors.md)
  • System of Recall: Episodic + semantic memory in Zep/Postgres (this document)

The Postgres instance stores Zep’s memory data (recall plane). Raw documents and parsed artifacts are stored separately in Blob Storage (record plane) for provenance, reprocessing, and compliance.