# Temporal Enterprise Deployment Strategy for Engram Platform
Executive Summary
This document outlines Zimax Networks LC's strategy for deploying Temporal OSS in customer Kubernetes environments (dev/test/UAT/prod). Temporal is central to Engram's success as the durable execution "Spine" layer, requiring dedicated resources and expertise. This plan addresses enterprise requirements including immutable history shard configuration, customer-managed key encryption, and NIST AI RMF compliance.
**Key Decision**: Temporal OSS (not Cloud) for customer environments to enable:
- Full control over data encryption with customer-managed keys
- Custom RBAC and SSO integration
- Compliance with customer-specific security requirements
- Cost optimization through right-sizing (vs. consumption-based Cloud pricing)
Table of Contents
- [1. Temporal OSS vs Temporal Cloud: Enterprise Comparison](#section1)
- [2. History Shard Count: Immutable Configuration Strategy](#section2)
- [3. Data Encryption: Customer-Managed Keys with Codec Server](#section3)
- [4. Kubernetes Deployment with Helm Charts](#section4)
- [5. NIST AI RMF Compliance Integration](#section5)
- [6. Operational Responsibilities](#section6)
- [7. Implementation Roadmap](#section7)
- [8. Risk Mitigation](#section8)
- [9. Success Metrics](#section9)
- [10. Next Steps](#section10)
1. Temporal OSS vs Temporal Cloud: Enterprise Comparison
Feature Comparison Matrix
Feature
Temporal OSS (Self-Hosted)
Temporal Cloud
Engram Requirement
**RBAC**
Custom plugin required
Built-in
✅ Required - Custom integration with Entra ID
**SSO/SAML/SCIM**
Custom implementation
Included in Enterprise
✅ Required - Customer-specific SSO
**Compliance**
Internal implementation (SOC 2, HIPAA)
SOC 2 Type II, HIPAA included
✅ Required - Customer-specific compliance
**Private Link**
Manual setup
Managed
✅ Required - Customer VNet integration
**Service Accounts/API Keys**
Custom implementation
Managed
✅ Required - Customer-managed keys
**Audit Logging**
Custom implementation
Centralized, built-in
✅ Required - Customer audit requirements
**Cloud Availability**
Manual configuration (AWS/GCP/Azure)
Seamless integrations
✅ Required - Customer's cloud choice
**MRN Support**
Complex setup with Replicated Namespaces
Fully managed
✅ Required - Multi-region for enterprise
**On-Call Support**
Zimax Networks LC team
24x7 Temporal support
✅ Zimax Networks LC provides support
**Data Encryption**
Custom Data Converter + Codec Server
Managed (limited customization)
✅ **CRITICAL** - Customer-managed keys required
**History Shard Count**
Immutable after creation
Managed by Temporal
✅ **CRITICAL** - Must plan correctly upfront
Decision Rationale for OSS
**Why Temporal OSS for Customer Environments:**
- **Customer-Managed Keys (CMK)**: Required for security assessments. Temporal Cloud's encryption is managed; OSS allows full control via Data Converter + Codec Server.
- **Custom Compliance**: Customers may require specific compliance frameworks (beyond SOC 2/HIPAA). OSS enables custom audit logging, data residency, and compliance controls.
- **Cost Predictability**: For enterprise customers with predictable workloads, OSS on customer infrastructure provides cost predictability vs. consumption-based Cloud pricing ($25/1M actions + $200/month base).
- **Integration Requirements**: Custom RBAC, SSO, and audit logging require OSS flexibility.
- **Data Residency**: Customer data must remain in customer-controlled infrastructure.
**Staging POC Exception**: Current ACA deployment is acceptable for testing only. Production requires Kubernetes deployment.
2. History Shard Count: Immutable Configuration Strategy
The Problem
Temporal's history shard count **cannot be changed after cluster creation**. This requires:
- Over-provisioning for peak demand (wasteful)
- Under-provisioning risks (bottlenecks)
- Careful upfront planning
Shard Count Sizing Formula
``` Shard Count = (Peak Workflow Throughput / Shard Capacity) × Safety Factor
Where:
- Peak Workflow Throughput = Max workflows/second expected
- Shard Capacity = ~1,000-2,000 workflows/second per shard (conservative)
- Safety Factor = 1.5-2.0 (for growth and spikes) ```
Engram Platform Shard Sizing
Environment
Expected Peak Throughput
Recommended Shards
Rationale
**Dev**
10 workflows/sec
4 shards
Minimal, cost-optimized
**Test**
50 workflows/sec
8 shards
Testing load scenarios
**UAT**
200 workflows/sec
16 shards
Production-like load
**Production**
1,000+ workflows/sec
32-64 shards
Enterprise scale with headroom
**Formula Applied (Production)**:
``` 1,000 workflows/sec ÷ 1,500 workflows/shard = 0.67 shards 0.67 × 2.0 (safety factor) = 1.34 → Round to 32 shards (power of 2) For enterprise with growth: 64 shards recommended ```
Management Strategy
1. Pre-Deployment Assessment
**Workload Analysis Tool** (to be built):
``` # tools/shard-calculator.py def calculate_shard_count(
peak_workflows_per_second: int,
avg_workflow_duration_seconds: int,
peak_concurrent_workflows: int,
growth_factor: float = 2.0 ) -> int:
"""
Calculate recommended shard count with growth projection.
Returns: Power-of-2 shard count (4, 8, 16, 32, 64, 128)
"""
base_shards = (peak_workflows_per_second / 1500) * growth_factor
return 2 ** math.ceil(math.log2(base_shards)) ```
2. Helm Chart Configuration
**values-production.yaml**:
``` server: config:
numHistoryShards: 64 # Immutable - set correctly upfront
persistence:
default:
historyShardCount: 64 ```
3. Monitoring and Alerting
**Metrics to Track**:
- `temporal_history_shard_utilization` - Alert if > 80%
- `temporal_workflow_throughput_per_shard` - Alert if > 1,500/sec
- `temporal_shard_hotspots` - Identify uneven distribution
**Alert Thresholds**:
- **Warning**: Any shard > 70% utilization
- **Critical**: Any shard > 90% utilization (indicates need for migration)
4. Migration Strategy (If Under-Provisioned)
If shard count is insufficient:
- **Create new cluster** with correct shard count
- **Replicate namespaces** to new cluster (MRN)
- **Gradual migration** of workflows
- **Decommission** old cluster
**Cost**: Significant operational overhead. Prevention via proper sizing is critical.
3. Data Encryption: Customer-Managed Keys with Codec Server
NIST AI RMF Alignment
Per [NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework), data encryption requirements:
**Govern Function**:
- Establish encryption policies for AI workflow data
- Define key management responsibilities
**Map Function**:
- Identify sensitive data in workflow payloads (PII, PHI, business secrets)
- Map encryption requirements to workflow types
**Measure Function**:
- Verify encryption at rest and in transit
- Audit key rotation and access
**Manage Function**:
- Implement customer-managed key rotation
- Monitor encryption compliance
Temporal Encryption Architecture
┌─────────────────────────────────────────────────────────┐ │ Engram Platform │ ├─────────────────────────────────────────────────────────┤ │ │ │ Workflow Payload (Plaintext) │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Data Converter │ ← Customer-managed encryption │ │ │ (Client-side) │ - Uses customer's KMS │ │ └──────────────────┘ - Azure Key Vault / AWS KMS │ │ │ / GCP KMS │ │ ▼ │ │ Encrypted Payload (Stored in Temporal) │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ Codec Server │ ← Decryption service │ │ │ (Separate Pod) │ - Isolated from Temporal │ │ └──────────────────┘ - Customer-managed keys only │ │ │ │ │ ▼ │ │ Decrypted Payload (For workflow execution) │ └─────────────────────────────────────────────────────────┘
Implementation Plan
Phase 1: Data Converter (Client-Side Encryption)
**Location**: `backend/workflows/codec/`
**Files to Create**:
- `data_converter.py` - Custom Data Converter with encryption
- `encryption_codec.py` - Payload encryption/decryption
- `key_management.py` - Integration with customer KMS
Phase 2: Codec Server (Server-Side Decryption)
**Why Codec Server?**
- Temporal server cannot decrypt (doesn't have keys)
- Codec Server runs in customer's K8s cluster with key access
- Isolated security boundary
Phase 3: Key Rotation Strategy
**NIST AI RMF Requirement**: Regular key rotation
**Implementation**:
- **Dual-Key Support**: Maintain current + previous key during rotation
- **Gradual Migration**: Re-encrypt workflows with new key over time
- **Key Versioning**: Track key versions in payload metadata
- **Automated Rotation**: Scheduled rotation (e.g., quarterly)
4. Kubernetes Deployment with Helm Charts
Current State (Staging POC)
**Current**: Temporal deployed in Azure Container Apps (ACA)
- `temporal-server` Container App
- `temporal-ui` Container App
- PostgreSQL backend
**Limitation**: ACA doesn't support advanced K8s features needed for enterprise:
- Custom RBAC integration
- Pod security policies
- Network policies
- Service mesh integration
Target State (Customer Environments)
**Deployment**: Temporal OSS via [Helm Charts](https://github.com/temporalio/helm-charts)
Kubernetes Cluster (Customer's) ├── Temporal Namespace │ ├── Temporal Server (StatefulSet) │ │ ├── Frontend (3 replicas) │ │ ├── History (3 replicas) │ │ ├── Matching (3 replicas) │ │ └── Worker (2 replicas) │ ├── Temporal Admin Tools │ ├── Temporal Web UI │ └── Codec Server (Separate deployment) ├── PostgreSQL (Customer's managed database) └── Monitoring (Prometheus/Grafana)
5. NIST AI RMF Compliance Integration
Framework Mapping
NIST AI RMF Function
Temporal Implementation
Engram Controls
**Govern**
Temporal namespace policies
Customer-defined governance
**Map**
Workflow type classification
Sensitivity tagging in metadata
**Measure**
Temporal metrics + Evidence Telemetry
Cost, performance, security metrics
**Manage**
Workflow signals, human-in-the-loop
Approval workflows, incident response
Data Encryption Controls (NIST AI RMF)
**Control ID**: AI-SEC-01 (Data Encryption)
**Implementation**:
- ✅ **Encryption at Rest**: All workflow history encrypted via Data Converter
- ✅ **Encryption in Transit**: TLS 1.3 for all Temporal communications
- ✅ **Key Management**: Customer-managed keys via Codec Server
- ✅ **Key Rotation**: Automated quarterly rotation
- ✅ **Access Control**: RBAC + SSO integration
6. Operational Responsibilities
Zimax Networks LC Support Model
**For Customer Environments (dev/test/UAT/prod)**:
Responsibility
Zimax Networks LC
Customer
Temporal deployment
✅ Helm chart deployment
Infrastructure provisioning
History shard sizing
✅ Assessment & recommendation
Approval
Codec Server deployment
✅ Implementation
Key management
Monitoring & alerting
✅ Setup & maintenance
Alert response
Updates & patches
✅ Planning & execution
Maintenance windows
Troubleshooting
✅ 24/7 support
Issue reporting
Compliance documentation
✅ Preparation
Audit participation
**Dedicated Resources Required**:
- **Temporal SME**: Deep expertise in OSS deployment
- **K8s Engineer**: Helm charts, deployment automation
- **Security Engineer**: Encryption, compliance, NIST AI RMF
- **SRE**: Monitoring, alerting, incident response
7. Implementation Roadmap
Phase 1: Foundation (Months 1-2)
- Research and document history shard sizing methodology
- Build shard calculator tool
- Create Helm chart customizations
- Design Codec Server architecture
Phase 2: Encryption Implementation (Months 2-3)
- Implement Data Converter with customer KMS integration
- Build Codec Server (separate K8s deployment)
- Create key rotation workflow
- Test encryption end-to-end
Phase 3: Helm Chart Deployment (Months 3-4)
- Customize Temporal Helm chart for enterprise
- Create deployment automation
- Test in dev environment
- Document deployment procedures
Phase 4: Compliance & Documentation (Months 4-5)
- Map NIST AI RMF controls to implementation
- Create security assessment documentation
- Prepare audit evidence
- Train support team
Phase 5: Production Deployment (Months 5-6)
- Deploy to customer dev environment
- Validate shard count sizing
- Test encryption with customer keys
- Gradual rollout to test/UAT/prod
8. Risk Mitigation
Risk: History Shard Under-Provisioning
Mitigation
- Conservative sizing (2x safety factor)
- Monitoring with early warning alerts
- Migration plan documented upfront
Risk: Encryption Key Compromise
Mitigation
- Key rotation workflow
- Dual-key support during rotation
- Audit logging of all key access
- Isolated Codec Server (separate security boundary)
Risk: Operational Complexity
Mitigation
- Comprehensive documentation
- Automated deployment (Helm charts)
- Dedicated support team
- Training and runbooks
9. Success Metrics
Metric
Target
Measurement
History shard utilization
< 70%
Prometheus metrics
Encryption coverage
100% of sensitive workflows
Audit logs
Key rotation success rate
100%
Rotation workflow logs
Deployment time
< 2 hours
Deployment automation
Support response time
< 1 hour (P1)
Incident tracking
10. Next Steps
- **Approve this strategy** for customer environment deployment
- **Allocate resources** for Temporal SME, K8s engineer, security engineer
- **Begin Phase 1** implementation (shard calculator, Helm chart research)
- **Engage with customer** security team for KMS integration requirements
- **Schedule security assessment** preparation timeline
References
- [Temporal Helm Charts](https://github.com/temporalio/helm-charts)
- [Temporal Data Converter Documentation](https://docs.temporal.io/concepts/what-is-a-data-converter)
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
- [Temporal History Shard Configuration](https://docs.temporal.io/server/production-deployment-guide)