# Zep Enterprise Deployment Strategy for Engram Platform
Executive Summary
This document outlines Zimax Networks LC's strategy for deploying Zep OSS in customer Kubernetes environments (dev/test/UAT/prod). Zep is central to Engram's success as the Memory layer, providing episodic memory (conversation history) and semantic memory (Graphiti knowledge graph). This plan addresses enterprise requirements including PostgreSQL/pgvector configuration, customer-managed key encryption, Graphiti knowledge graph management, and NIST AI RMF compliance.
**Key Decision**: Zep OSS (self-hosted) for customer environments to enable:
- Full control over data encryption with customer-managed keys
- Custom PostgreSQL/pgvector configuration and sizing
- Data residency in customer-controlled infrastructure
- Cost optimization through right-sizing (vs. consumption-based Cloud pricing)
- Integration with customer's existing PostgreSQL infrastructure
Table of Contents
- [1. Zep OSS vs Zep Cloud: Enterprise Comparison](#section1)
- [2. PostgreSQL/pgvector Configuration: Critical Sizing Strategy](#section2)
- [3. Data Encryption: Customer-Managed Keys with PostgreSQL TDE](#section3)
- [4. Kubernetes Deployment with Helm Charts](#section4)
- [5. NIST AI RMF Compliance Integration](#section5)
- [6. Operational Responsibilities](#section6)
- [7. Implementation Roadmap](#section7)
- [8. Risk Mitigation](#section8)
- [9. Success Metrics](#section9)
- [10. Next Steps](#section10)
1. Zep OSS vs Zep Cloud: Enterprise Comparison
Feature Comparison Matrix
Feature
Zep OSS (Self-Hosted)
Zep Cloud (Managed)
Engram Requirement
**Deployment**
Self-hosted on customer infrastructure
Fully managed by Zep
✅ Required - Customer K8s cluster
**PostgreSQL Control**
Full control over PostgreSQL/pgvector config
Managed database instances
✅ Required - Customer-managed PostgreSQL
**Data Residency**
Customer-controlled infrastructure
Zep-managed infrastructure
✅ Required - Customer data residency
**Encryption Keys**
Customer-managed keys (BYOK)
Zep-managed or BYOK option
✅ **CRITICAL** - Customer-managed keys required
**Graphiti Configuration**
Full control over knowledge graph settings
Managed configuration
✅ Required - Custom Graphiti tuning
**RBAC/SSO**
Custom implementation
Built-in (Enterprise)
✅ Required - Customer SSO integration
**Compliance**
Customer-managed compliance
SOC 2 Type II, HIPAA included
✅ Required - Customer-specific compliance
**Cost Model**
Infrastructure costs only
Consumption-based pricing
✅ Required - Predictable costs
**BYOM (Bring Your Own Model)**
Full control
Supported
✅ Required - Customer LLM accounts
**Multi-Tenancy**
Custom implementation
Tenant isolation
✅ Required - Customer-defined isolation
**On-Call Support**
Zimax Networks LC team
Zep support team
✅ Zimax Networks LC provides support
**PostgreSQL Sizing**
Customer-controlled sizing
Managed sizing
✅ **CRITICAL** - Right-sizing for cost optimization
**pgvector Configuration**
Full control over vector dimensions/indexes
Managed configuration
✅ Required - Custom vector tuning
Decision Rationale for OSS
**Why Zep OSS for Customer Environments:**
- **Customer-Managed PostgreSQL**: Engram uses PostgreSQL with pgvector extension. Customers require full control over database sizing, backups, and configuration to meet their infrastructure standards.
- **Data Residency**: Customer data (conversation history, knowledge graphs, facts) must remain in customer-controlled infrastructure. Zep Cloud stores data in Zep-managed infrastructure.
- **Cost Predictability**: For enterprise customers with predictable workloads, OSS on customer infrastructure provides cost predictability vs. consumption-based Cloud pricing.
- **PostgreSQL Integration**: Many customers already have PostgreSQL infrastructure. Zep OSS allows integration with existing databases, reducing operational overhead.
- **Custom Graphiti Tuning**: Graphiti knowledge graph requires tuning for customer-specific use cases. OSS provides full control over graph configuration, entity extraction, and relationship modeling.
- **Encryption Control**: Customer-managed keys (BYOK) are required for security assessments. OSS allows full control over encryption via PostgreSQL Transparent Data Encryption (TDE) and application-level encryption.
**Staging POC Exception**: Current ACA deployment is acceptable for testing only. Production requires Kubernetes deployment with customer-managed PostgreSQL.
2. PostgreSQL/pgvector Configuration: Critical Sizing Strategy
The Challenge
PostgreSQL with pgvector is the foundation of Zep's memory system. Unlike Temporal's immutable history shard count, PostgreSQL configuration can be changed, but **vector dimension and index type decisions are difficult to reverse**:
- Vector Dimensions: Once set, changing dimensions requires re-indexing all vectors (expensive operation)
- Index Type: HNSW vs. IVFFlat index choice affects query performance and cannot be easily swapped
- Connection Pooling: Must be configured correctly upfront for production workloads
- Autovacuum Tuning: Critical for maintaining vector index health
PostgreSQL Sizing Formula
``` PostgreSQL Capacity = (Episodic Memory + Semantic Memory) × Growth Factor
Where:
- Episodic Memory = Sessions × Avg Messages/Session × Avg Message Size
- Semantic Memory = Facts × Avg Fact Size + Graph Relationships × Overhead
- Growth Factor = 2.0-3.0 (for growth and retention) ```
Engram Platform PostgreSQL Sizing
Environment
Expected Load
Recommended SKU
Rationale
**Dev**
100 sessions, 1K facts
B1ms (1 vCore, 2GB)
Minimal, cost-optimized
**Test**
500 sessions, 10K facts
B2s (2 vCore, 4GB)
Testing load scenarios
**UAT**
2K sessions, 100K facts
D2s_v3 (2 vCore, 8GB)
Production-like load
**Production**
10K+ sessions, 1M+ facts
D4s_v3 (4 vCore, 16GB) or higher
Enterprise scale with headroom
**Formula Applied (Production)**:
``` 10,000 sessions × 50 messages/session × 500 bytes = 250 MB episodic 1,000,000 facts × 1 KB/fact = 1 GB semantic Total: ~1.25 GB data × 3.0 (growth + indexes) = 3.75 GB Recommended: D4s_v3 (16GB RAM) for headroom and performance ```
pgvector Configuration Strategy
1. Vector Dimension Selection
**Decision Point**: Vector dimensions are set at table creation and difficult to change.
**Engram Standard**: 1536 dimensions (OpenAI text-embedding-3-small compatibility)
2. Index Type Selection
Index Type
Build Time
Query Performance
Update Cost
Use Case
**HNSW**
Slow (hours for large datasets)
Excellent (sub-10ms)
High (rebuild required)
Production, read-heavy
**IVFFlat**
Fast (minutes)
Good (10-50ms)
Medium (periodic rebuild)
Development, write-heavy
**Engram Recommendation**: **HNSW for production** (better query performance, acceptable build time)
3. Connection Pooling Configuration
**Critical for Production**: PostgreSQL connection limits must be managed via connection pooling.
4. Autovacuum Tuning
**Critical for Vector Index Health**: Autovacuum maintains index statistics and prevents bloat.
3. Data Encryption: Customer-Managed Keys with PostgreSQL TDE
NIST AI RMF Alignment
Per [NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework), data encryption requirements:
**Govern Function**:
- Establish encryption policies for AI memory data (conversation history, knowledge graphs)
- Define key management responsibilities
**Map Function**:
- Identify sensitive data in memory (PII in conversations, PHI in facts, business secrets)
- Map encryption requirements to memory types (episodic vs. semantic)
**Measure Function**:
- Verify encryption at rest and in transit
- Audit key rotation and access
**Manage Function**:
- Implement customer-managed key rotation
- Monitor encryption compliance
Zep Encryption Architecture
┌─────────────────────────────────────────────────────────┐ │ Engram Platform │ ├─────────────────────────────────────────────────────────┤ │ │ │ Memory Data (Plaintext) │ │ - Episodic: Conversation history │ │ - Semantic: Facts, entities, relationships │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Application-Level │ ← Optional: Field-level encryption│ │ │ Encryption (Zep) │ - PII/PHI fields only │ │ └──────────────────┘ - Uses customer KMS │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ PostgreSQL TDE │ ← Database-level encryption │ │ │ (Transparent) │ - All data at rest │ │ └──────────────────┘ - Customer-managed keys │ │ │ (Azure Key Vault) │ │ ▼ │ │ Encrypted Data (PostgreSQL Storage) │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Azure Key Vault │ ← Customer-managed encryption keys│ │ │ (Customer KMS) │ - Full control over keys │ │ └──────────────────┘ - Key rotation capability │ └─────────────────────────────────────────────────────────┘
Implementation Plan
Phase 1: PostgreSQL Transparent Data Encryption (TDE)
**Azure PostgreSQL Flexible Server TDE**:
- Built-in TDE using Azure Key Vault
- Customer-managed keys (CMK) supported
- Zero application changes required
Phase 2: Application-Level Encryption (Optional)
**For PII/PHI Fields**: Additional encryption for sensitive fields within memory data.
Phase 3: Key Rotation Strategy
**NIST AI RMF Requirement**: Regular key rotation
**Implementation**:
- **PostgreSQL TDE Key Rotation**: Azure PostgreSQL supports online key rotation
- **Application Key Rotation**: Gradual re-encryption of sensitive fields
- **Dual-Key Support**: Maintain current + previous key during rotation
- **Automated Rotation**: Scheduled rotation (e.g., quarterly)
4. Kubernetes Deployment with Helm Charts
Current State (Staging POC)
**Current**: Zep deployed in Azure Container Apps (ACA)
- `zep-server` Container App
- PostgreSQL backend (Azure Database for PostgreSQL Flexible Server)
- pgvector extension enabled
**Limitation**: ACA doesn't support advanced K8s features needed for enterprise:
- Custom PostgreSQL configuration
- Advanced networking (service mesh)
- Pod security policies
- Resource quotas and limits
Target State (Customer Environments)
**Deployment**: Zep OSS via Helm Charts (custom or community)
Kubernetes Cluster (Customer's) ├── Zep Namespace │ ├── Zep Server (Deployment) │ │ ├── API Server (3 replicas) │ │ ├── Memory Service (2 replicas) │ │ └── Graphiti Service (2 replicas) │ ├── PgBouncer (Optional, for connection pooling) │ └── Monitoring (Prometheus exporters) ├── PostgreSQL (Customer's managed database) │ ├── Azure Database for PostgreSQL Flexible Server │ │ └── pgvector extension enabled │ └── OR Customer's on-premises PostgreSQL └── Monitoring (Prometheus/Grafana)
5. NIST AI RMF Compliance Integration
Framework Mapping
NIST AI RMF Function
Zep Implementation
Engram Controls
**Govern**
Memory retention policies, data classification
Customer-defined governance
**Map**
Memory type classification (episodic vs. semantic)
Sensitivity tagging in metadata
**Measure**
Memory quality metrics, retrieval hit rates
Cost, performance, security metrics
**Manage**
Memory deletion, GDPR compliance, key rotation
Data lifecycle management
Data Encryption Controls (NIST AI RMF)
**Control ID**: AI-SEC-01 (Data Encryption)
**Implementation**:
- ✅ **Encryption at Rest**: All memory data encrypted via PostgreSQL TDE
- ✅ **Encryption in Transit**: TLS 1.3 for all PostgreSQL and Zep API communications
- ✅ **Key Management**: Customer-managed keys via Azure Key Vault
- ✅ **Key Rotation**: Automated quarterly rotation
- ✅ **Field-Level Encryption**: Optional encryption for PII/PHI fields
6. Operational Responsibilities
Zimax Networks LC Support Model
**For Customer Environments (dev/test/UAT/prod)**:
Responsibility
Zimax Networks LC
Customer
Zep deployment
✅ Helm chart deployment
Infrastructure provisioning
PostgreSQL sizing
✅ Assessment & recommendation
Approval & provisioning
pgvector configuration
✅ Setup & optimization
Database access
Graphiti tuning
✅ Configuration & optimization
Use case requirements
Monitoring & alerting
✅ Setup & maintenance
Alert response
Updates & patches
✅ Planning & execution
Maintenance windows
Troubleshooting
✅ 24/7 support
Issue reporting
Compliance documentation
✅ Preparation
Audit participation
**Dedicated Resources Required**:
- **Zep SME**: Deep expertise in OSS deployment and Graphiti
- **PostgreSQL DBA**: pgvector optimization, performance tuning
- **K8s Engineer**: Helm charts, deployment automation
- **Security Engineer**: Encryption, compliance, NIST AI RMF
- **SRE**: Monitoring, alerting, incident response
7. Implementation Roadmap
Phase 1: Foundation (Months 1-2)
- Research and document PostgreSQL/pgvector sizing methodology
- Build PostgreSQL sizing calculator tool
- Create Helm chart for Zep OSS
- Design Graphiti knowledge graph architecture
Phase 2: Encryption Implementation (Months 2-3)
- Configure PostgreSQL TDE with customer-managed keys
- Implement application-level encryption for PII/PHI
- Create key rotation workflow
- Test encryption end-to-end
Phase 3: Helm Chart Deployment (Months 3-4)
- Customize Zep Helm chart for enterprise
- Create deployment automation
- Test in dev environment
- Document deployment procedures
Phase 4: Compliance & Documentation (Months 4-5)
- Map NIST AI RMF controls to implementation
- Create security assessment documentation
- Prepare audit evidence
- Train support team
Phase 5: Production Deployment (Months 5-6)
- Deploy to customer dev environment
- Validate PostgreSQL sizing
- Test encryption with customer keys
- Gradual rollout to test/UAT/prod
8. Risk Mitigation
Risk: PostgreSQL Under-Sizing
Mitigation
- Conservative sizing (3x growth factor)
- Monitoring with early warning alerts
- Azure allows online SKU scaling (minimal downtime)
Risk: Vector Dimension Mismatch
Mitigation
- Standardize on 1536 dimensions (OpenAI compatibility)
- Document dimension requirements upfront
- Re-indexing procedure documented (requires downtime)
Risk: Encryption Key Compromise
Mitigation
- Key rotation workflow
- Dual-key support during rotation
- Audit logging of all key access
- PostgreSQL TDE with customer-managed keys
Risk: Graphiti Performance Issues
Mitigation
- Performance testing in UAT
- Graphiti configuration tuning guide
- Monitoring and alerting on query latency
- Index optimization recommendations
9. Success Metrics
Metric
Target
Measurement
PostgreSQL connection pool utilization
< 70%
Prometheus metrics
Vector search latency p95
< 100ms
Application metrics
Memory retrieval hit rate
> 90%
Zep metrics
Encryption coverage
100% of memory data
Audit logs
Key rotation success rate
100%
Rotation workflow logs
Deployment time
< 2 hours
Deployment automation
Support response time
< 1 hour (P1)
Incident tracking
10. Next Steps
- **Approve this strategy** for customer environment deployment
- **Allocate resources** for Zep SME, PostgreSQL DBA, K8s engineer, security engineer
- **Begin Phase 1** implementation (PostgreSQL sizing calculator, Helm chart research)
- **Engage with customer** database team for PostgreSQL integration requirements
- **Schedule security assessment** preparation timeline
References
- [Zep Documentation](https://docs.getzep.com/)
- [Graphiti Knowledge Graph](https://github.com/getzep/graphiti)
- [PostgreSQL pgvector Extension](https://github.com/pgvector/pgvector)
- [Azure Database for PostgreSQL TDE](https://learn.microsoft.com/azure/postgresql/flexible-server/concepts-data-encryption)
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)