    # Temporal Enterprise Deployment Strategy for Engram Platform

Executive Summary

        This document outlines Zimax Networks LC's strategy for deploying Temporal OSS in customer Kubernetes environments (dev/test/UAT/prod). Temporal is central to Engram's success as the durable execution "Spine" layer, requiring dedicated resources and expertise. This plan addresses enterprise requirements including immutable history shard configuration, customer-managed key encryption, and NIST AI RMF compliance.

        **Key Decision**: Temporal OSS (not Cloud) for customer environments to enable:

            - Full control over data encryption with customer-managed keys
            - Custom RBAC and SSO integration
            - Compliance with customer-specific security requirements
            - Cost optimization through right-sizing (vs. consumption-based Cloud pricing)

Table of Contents

            - [1. Temporal OSS vs Temporal Cloud: Enterprise Comparison](#section1)
            - [2. History Shard Count: Immutable Configuration Strategy](#section2)
            - [3. Data Encryption: Customer-Managed Keys with Codec Server](#section3)
            - [4. Kubernetes Deployment with Helm Charts](#section4)
            - [5. NIST AI RMF Compliance Integration](#section5)
            - [6. Operational Responsibilities](#section6)
            - [7. Implementation Roadmap](#section7)
            - [8. Risk Mitigation](#section8)
            - [9. Success Metrics](#section9)
            - [10. Next Steps](#section10)

1. Temporal OSS vs Temporal Cloud: Enterprise Comparison

Feature Comparison Matrix

| Feature | Temporal OSS (Self-Hosted) | Temporal Cloud | Engram Requirement |
| --- | --- | --- | --- |
| **RBAC** | Custom plugin required | Built-in | ✅ Required - Custom integration with Entra ID |
| **SSO/SAML/SCIM** | Custom implementation | Included in Enterprise | ✅ Required - Customer-specific SSO |
| **Compliance** | Internal implementation (SOC 2, HIPAA) | SOC 2 Type II, HIPAA included | ✅ Required - Customer-specific compliance |
| **Private Link** | Manual setup | Managed | ✅ Required - Customer VNet integration |
| **Service Accounts/API Keys** | Custom implementation | Managed | ✅ Required - Customer-managed keys |
| **Audit Logging** | Custom implementation | Centralized, built-in | ✅ Required - Customer audit requirements |
| **Cloud Availability** | Manual configuration (AWS/GCP/Azure) | Seamless integrations | ✅ Required - Customer's cloud choice |
| **MRN Support** | Complex setup with Replicated Namespaces | Fully managed | ✅ Required - Multi-region for enterprise |
| **On-Call Support** | Zimax Networks LC team | 24x7 Temporal support | ✅ Zimax Networks LC provides support |
| **Data Encryption** | Custom Data Converter + Codec Server | Managed (limited customization) | ✅ **CRITICAL** - Customer-managed keys required |
| **History Shard Count** | Immutable after creation | Managed by Temporal | ✅ **CRITICAL** - Must plan correctly upfront |

Decision Rationale for OSS

        **Why Temporal OSS for Customer Environments:**


        
            - **Customer-Managed Keys (CMK)**: Required for security assessments. Temporal Cloud's encryption is managed; OSS allows full control via Data Converter + Codec Server.
            - **Custom Compliance**: Customers may require specific compliance frameworks (beyond SOC 2/HIPAA). OSS enables custom audit logging, data residency, and compliance controls.
            - **Cost Predictability**: For enterprise customers with predictable workloads, OSS on customer infrastructure provides cost predictability vs. consumption-based Cloud pricing ($25/1M actions + $200/month base).
            - **Integration Requirements**: Custom RBAC, SSO, and audit logging require OSS flexibility.
            - **Data Residency**: Customer data must remain in customer-controlled infrastructure.
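
The consumption-based pricing quoted in the cost bullet above can be turned into a rough monthly estimate. The sketch below is illustrative only: it uses the two figures stated in this document ($25 per 1M actions, $200/month base), the function name `estimate_cloud_monthly_cost` is hypothetical, and actual Temporal Cloud pricing should be confirmed against a current quote.

```python
def estimate_cloud_monthly_cost(
    actions_per_month: int,
    price_per_million_actions: float = 25.0,  # figure quoted in this document
    base_fee: float = 200.0,                  # figure quoted in this document
) -> float:
    """Consumption-based estimate: flat base fee plus a per-action charge."""
    if actions_per_month < 0:
        raise ValueError("actions_per_month must be non-negative")
    return base_fee + price_per_million_actions * (actions_per_month / 1_000_000)


# Cost grows linearly with action volume, which is why predictable,
# high-volume workloads may favor fixed self-hosted capacity.
for actions in (10_000_000, 100_000_000, 1_000_000_000):
    cost = estimate_cloud_monthly_cost(actions)
    print(f"{actions:>13,} actions/month -> ${cost:,.2f}")
```

The comparison only holds once the self-hosted infrastructure and staffing costs are estimated on the same monthly basis.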
        
        **Staging POC Exception**: Current ACA deployment is acceptable for testing only. Production requires Kubernetes deployment.

2. History Shard Count: Immutable Configuration Strategy

The Problem

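        Temporal's history shard count **cannot be changed after cluster creation**. The value chosen at deployment is therefore a one-time trade-off:

            - Over-provisioning for peak demand wastes resources
            - Under-provisioning creates throughput bottlenecks that can only be fixed by migrating to a new cluster
            - Careful upfront planning is required to balance the two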

Shard Count Sizing Formula

```
Shard Count = (Peak Workflow Throughput / Shard Capacity) × Safety Factor

Where:
  Peak Workflow Throughput = Max workflows/second expected
  Shard Capacity = ~1,000-2,000 workflows/second per shard (conservative)
  Safety Factor = 1.5-2.0 (for growth and spikes)
```

Engram Platform Shard Sizing

| Environment | Expected Peak Throughput | Recommended Shards | Rationale |
| --- | --- | --- | --- |
| **Dev** | 10 workflows/sec | 4 shards | Minimal, cost-optimized |
| **Test** | 50 workflows/sec | 8 shards | Testing load scenarios |
| **UAT** | 200 workflows/sec | 16 shards | Production-like load |
| **Production** | 1,000+ workflows/sec | 32-64 shards | Enterprise scale with headroom |
                
        **Formula Applied (Production)**:

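```
1,000 workflows/sec ÷ 1,500 workflows/sec per shard ≈ 0.67 shards
0.67 × 2.0 (safety factor) ≈ 1.33 → rounds up to 2 shards (next power of 2)
```

        Throughput alone therefore sets only a lower bound. Because the shard count cannot be changed after cluster creation, production is provisioned well above that bound: 32 shards as a baseline, and 64 shards recommended for enterprise deployments with expected growth.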
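
To see how much headroom the recommended counts carry, the sizing formula can be applied to every environment in the table. This is a minimal sketch under the document's own assumptions (1,500 workflows/sec per shard, safety factor 2.0); the helper names are hypothetical, and Production uses the lower end of its 32-64 range.

```python
import math

SHARD_CAPACITY = 1500   # workflows/sec per shard, value used in this document
SAFETY_FACTOR = 2.0     # upper end of the 1.5-2.0 range

# (expected peak workflows/sec, recommended shards) from the sizing table.
ENVIRONMENTS = {
    "Dev": (10, 4),
    "Test": (50, 8),
    "UAT": (200, 16),
    "Production": (1000, 32),
}


def formula_shards(peak_workflows_per_second: float) -> float:
    """Raw result of the sizing formula, before rounding."""
    return peak_workflows_per_second / SHARD_CAPACITY * SAFETY_FACTOR


def next_power_of_two(value: float) -> int:
    """Smallest power of two that is >= value (minimum 1)."""
    return 1 if value <= 1 else 2 ** math.ceil(math.log2(value))


for env, (peak, recommended) in ENVIRONMENTS.items():
    raw = formula_shards(peak)
    minimum = next_power_of_two(raw)
    print(f"{env:<10} raw={raw:.2f} formula_minimum={minimum} "
          f"recommended={recommended} headroom={recommended // minimum}x")
```

In every environment the recommended count is several times the throughput-derived minimum, which reflects the cost of being unable to resize later.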

Management Strategy

1. Pre-Deployment Assessment

        **Workload Analysis Tool** (to be built):


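```
# tools/shard-calculator.py
import math


def calculate_shard_count(
    peak_workflows_per_second: int,
    avg_workflow_duration_seconds: int,
    peak_concurrent_workflows: int,
    growth_factor: float = 2.0,
) -> int:
    """
    Calculate recommended shard count with growth projection.

    Returns: Power-of-2 shard count (4, 8, 16, 32, 64, 128)
    """
    # Duration and concurrency are accepted for the planned workload analysis
    # but are not yet part of the throughput-based formula.
    base_shards = (peak_workflows_per_second / 1500) * growth_factor
    # Clamp to at least 4 so log2 is defined and the documented minimum holds.
    return 2 ** math.ceil(math.log2(max(base_shards, 4)))
```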

2. Helm Chart Configuration

        **values-production.yaml**:


```
server:
  config:
    numHistoryShards: 64  # Immutable - set correctly upfront
    persistence:
      default:
        historyShardCount: 64
```
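
Because the value is immutable, a deployment pipeline could refuse any release whose values change it. The sketch below assumes the Helm values have already been parsed into dictionaries and that the shard count lives at `server.config.numHistoryShards` as shown above; both function names are hypothetical.

```python
def get_num_history_shards(values: dict) -> int:
    """Read server.config.numHistoryShards from parsed Helm values."""
    try:
        shards = values["server"]["config"]["numHistoryShards"]
    except (KeyError, TypeError) as exc:
        raise ValueError("server.config.numHistoryShards is missing") from exc
    if isinstance(shards, bool) or not isinstance(shards, int) or shards < 1:
        raise ValueError(f"invalid numHistoryShards: {shards!r}")
    return shards


def check_shard_count_unchanged(deployed: dict, proposed: dict) -> None:
    """Raise if a proposed release would alter the immutable shard count."""
    current = get_num_history_shards(deployed)
    new = get_num_history_shards(proposed)
    if current != new:
        raise ValueError(
            f"numHistoryShards is immutable: deployed={current}, proposed={new}"
        )


deployed_values = {"server": {"config": {"numHistoryShards": 64}}}
proposed_values = {"server": {"config": {"numHistoryShards": 64}}}
check_shard_count_unchanged(deployed_values, proposed_values)  # passes silently
```

Running such a check before `helm upgrade` turns an accidental edit into a failed pipeline step instead of a misconfigured cluster.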

3. Monitoring and Alerting

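        **Metrics to Track** (names are illustrative and must be mapped to the metrics the deployed Temporal version actually exports):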

            - `temporal_history_shard_utilization` - Alert if > 80%
            - `temporal_workflow_throughput_per_shard` - Alert if > 1,500/sec
            - `temporal_shard_hotspots` - Identify uneven distribution
        
        **Alert Thresholds**:

            - **Warning**: Any shard > 70% utilization
            - **Critical**: Any shard > 90% utilization (indicates need for migration)
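
The alert thresholds above can be expressed as a small classification step. This is a sketch only: it assumes per-shard utilization is already available as a fraction between 0 and 1, and how that value is derived from Temporal metrics is left open. The function names are hypothetical.

```python
from typing import Dict, List

WARNING_THRESHOLD = 0.70    # any shard above 70% utilization
CRITICAL_THRESHOLD = 0.90   # any shard above 90% utilization


def classify_shard(utilization: float) -> str:
    """Map one shard's utilization (0.0-1.0) to an alert level."""
    if utilization > CRITICAL_THRESHOLD:
        return "critical"
    if utilization > WARNING_THRESHOLD:
        return "warning"
    return "ok"


def summarize(utilization_by_shard: Dict[int, float]) -> Dict[str, List[int]]:
    """Group shard IDs by alert level."""
    summary: Dict[str, List[int]] = {"ok": [], "warning": [], "critical": []}
    for shard_id, utilization in sorted(utilization_by_shard.items()):
        summary[classify_shard(utilization)].append(shard_id)
    return summary


sample = {1: 0.35, 2: 0.72, 3: 0.95, 4: 0.70}
print(summarize(sample))
```

A non-empty `critical` list is the signal described above that a migration to a larger cluster should be planned.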

4. Migration Strategy (If Under-Provisioned)

        If shard count is insufficient:


        
            - **Create new cluster** with correct shard count
            - **Replicate namespaces** to new cluster (MRN)
            - **Gradual migration** of workflows
            - **Decommission** old cluster
        
        **Cost**: Significant operational overhead. Prevention via proper sizing is critical.

3. Data Encryption: Customer-Managed Keys with Codec Server

NIST AI RMF Alignment

        Per [NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework), data encryption requirements:

        **Govern Function**:

            - Establish encryption policies for AI workflow data
            - Define key management responsibilities
        
        **Map Function**:

            - Identify sensitive data in workflow payloads (PII, PHI, business secrets)
            - Map encryption requirements to workflow types
        
        **Measure Function**:

            - Verify encryption at rest and in transit
            - Audit key rotation and access
        
        **Manage Function**:

            - Implement customer-managed key rotation
            - Monitor encryption compliance

Temporal Encryption Architecture

        ┌──────────────────────────────────────────────────────────┐
        │                     Engram Platform                      │
        ├──────────────────────────────────────────────────────────┤
        │                                                          │
        │  Workflow Payload (Plaintext)                            │
        │         │                                                │
        │         ▼                                                │
        │  ┌──────────────────┐                                    │
        │  │ Data Converter   │ ← Customer-managed encryption      │
        │  │ (Client-side)    │   - Uses customer's KMS            │
        │  └──────────────────┘   - Azure Key Vault / AWS KMS      │
        │         │                 / GCP KMS                      │
        │         ▼                                                │
        │  Encrypted Payload (Stored in Temporal)                  │
        │         │                                                │
        │         ▼                                                │
        │  ┌──────────────────┐                                    │
        │  │ Codec Server     │ ← Decryption service               │
        │  │ (Separate Pod)   │   - Isolated from Temporal         │
        │  └──────────────────┘   - Customer-managed keys only     │
        │         │                                                │
        │         ▼                                                │
        │  Decrypted Payload (For workflow execution)              │
        └──────────────────────────────────────────────────────────┘

Implementation Plan

Phase 1: Data Converter (Client-Side Encryption)

        **Location**: `backend/workflows/codec/`

        **Files to Create**:

            - `data_converter.py` - Custom Data Converter with encryption
            - `encryption_codec.py` - Payload encryption/decryption
            - `key_management.py` - Integration with customer KMS

Phase 2: Codec Server (Server-Side Decryption)

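        **Why Codec Server?**

            - Temporal server cannot decrypt (doesn't have keys)
            - Workers decrypt payloads locally through the Data Converter; the Codec Server exposes the same decoding so that the Temporal Web UI and CLI can display payloads to authorized users
            - Codec Server runs in customer's K8s cluster with key access
            - Isolated security boundary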

Phase 3: Key Rotation Strategy

        **NIST AI RMF Requirement**: Regular key rotation

        **Implementation**:

            - **Dual-Key Support**: Maintain current + previous key during rotation
            - **Gradual Migration**: Re-encrypt workflows with new key over time
            - **Key Versioning**: Track key versions in payload metadata
            - **Automated Rotation**: Scheduled rotation (e.g., quarterly)
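
The dual-key and key-versioning points above can be sketched as a small keyring. Everything here is an assumption for illustration: the `RotatingKeyring` and `Envelope` names are hypothetical, the cipher is injected as a pair of callables that would wrap the customer's KMS in practice, and the XOR function used in the demo is a placeholder that provides no security.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Cipher = Callable[[bytes, bytes], bytes]  # (key, data) -> transformed data


@dataclass(frozen=True)
class Envelope:
    key_version: str   # recorded in payload metadata
    ciphertext: bytes


class RotatingKeyring:
    """Encrypts with the current key; decrypts with current or previous key."""

    def __init__(self, encrypt: Cipher, decrypt: Cipher, version: str, key: bytes):
        self._encrypt, self._decrypt = encrypt, decrypt
        self._keys: Dict[str, bytes] = {version: key}
        self._current = version
        self._previous: Optional[str] = None

    def rotate(self, new_version: str, new_key: bytes) -> None:
        if new_version in self._keys:
            raise ValueError(f"key version {new_version!r} already exists")
        if self._previous is not None:
            del self._keys[self._previous]  # retire the key before the previous one
        self._previous, self._current = self._current, new_version
        self._keys[new_version] = new_key

    def seal(self, plaintext: bytes) -> Envelope:
        key = self._keys[self._current]
        return Envelope(self._current, self._encrypt(key, plaintext))

    def open(self, envelope: Envelope) -> bytes:
        if envelope.key_version not in self._keys:
            raise KeyError(f"key version {envelope.key_version!r} is retired")
        return self._decrypt(self._keys[envelope.key_version], envelope.ciphertext)

    def needs_reencryption(self, envelope: Envelope) -> bool:
        return envelope.key_version != self._current


def xor_placeholder(key: bytes, data: bytes) -> bytes:
    """PLACEHOLDER ONLY, not encryption: stands in for a KMS-backed cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


ring = RotatingKeyring(xor_placeholder, xor_placeholder, "v1", b"key-one")
old = ring.seal(b"workflow input")
ring.rotate("v2", b"key-two")
print(ring.open(old), ring.needs_reencryption(old))
```

Payloads for which `needs_reencryption` returns true are the candidates for the gradual migration step; once a second rotation retires their key, they can no longer be opened.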

4. Kubernetes Deployment with Helm Charts

Current State (Staging POC)

        **Current**: Temporal deployed in Azure Container Apps (ACA)

            - `temporal-server` Container App
            - `temporal-ui` Container App
            - PostgreSQL backend
        
        **Limitation**: ACA doesn't support advanced K8s features needed for enterprise:

            - Custom RBAC integration
            - Pod security policies
            - Network policies
            - Service mesh integration

Target State (Customer Environments)

        **Deployment**: Temporal OSS via [Helm Charts](https://github.com/temporalio/helm-charts)

        Kubernetes Cluster (Customer's)
        ├── Temporal Namespace
        │   ├── Temporal Server (StatefulSet)
        │   │   ├── Frontend (3 replicas)
        │   │   ├── History (3 replicas)
        │   │   ├── Matching (3 replicas)
        │   │   └── Worker (2 replicas)
        │   ├── Temporal Admin Tools
        │   ├── Temporal Web UI
        │   └── Codec Server (Separate deployment)
        ├── PostgreSQL (Customer's managed database)
        └── Monitoring (Prometheus/Grafana)

5. NIST AI RMF Compliance Integration

Framework Mapping

| NIST AI RMF Function | Temporal Implementation | Engram Controls |
| --- | --- | --- |
| **Govern** | Temporal namespace policies | Customer-defined governance |
| **Map** | Workflow type classification | Sensitivity tagging in metadata |
| **Measure** | Temporal metrics + Evidence Telemetry | Cost, performance, security metrics |
| **Manage** | Workflow signals, human-in-the-loop | Approval workflows, incident response |

Data Encryption Controls (NIST AI RMF)

        **Control ID**: AI-SEC-01 (Data Encryption)

        **Implementation**:

            - ✅ **Encryption at Rest**: All workflow payloads stored in history are encrypted via Data Converter
            - ✅ **Encryption in Transit**: TLS 1.3 for all Temporal communications
            - ✅ **Key Management**: Customer-managed keys via Codec Server
            - ✅ **Key Rotation**: Automated quarterly rotation
            - ✅ **Access Control**: RBAC + SSO integration

6. Operational Responsibilities

Zimax Networks LC Support Model

        **For Customer Environments (dev/test/UAT/prod)**:

| Responsibility | Zimax Networks LC | Customer |
| --- | --- | --- |
| Temporal deployment | ✅ Helm chart deployment | Infrastructure provisioning |
| History shard sizing | ✅ Assessment & recommendation | Approval |
| Codec Server deployment | ✅ Implementation | Key management |
| Monitoring & alerting | ✅ Setup & maintenance | Alert response |
| Updates & patches | ✅ Planning & execution | Maintenance windows |
| Troubleshooting | ✅ 24/7 support | Issue reporting |
| Compliance documentation | ✅ Preparation | Audit participation |
                
        **Dedicated Resources Required**:

            - **Temporal SME**: Deep expertise in OSS deployment
            - **K8s Engineer**: Helm charts, deployment automation
            - **Security Engineer**: Encryption, compliance, NIST AI RMF
            - **SRE**: Monitoring, alerting, incident response

7. Implementation Roadmap

Phase 1: Foundation (Months 1-2)

                - Research and document history shard sizing methodology
                - Build shard calculator tool
                - Create Helm chart customizations
                - Design Codec Server architecture

Phase 2: Encryption Implementation (Months 2-3)

                - Implement Data Converter with customer KMS integration
                - Build Codec Server (separate K8s deployment)
                - Create key rotation workflow
                - Test encryption end-to-end

Phase 3: Helm Chart Deployment (Months 3-4)

                - Customize Temporal Helm chart for enterprise
                - Create deployment automation
                - Test in dev environment
                - Document deployment procedures

Phase 4: Compliance & Documentation (Months 4-5)

                - Map NIST AI RMF controls to implementation
                - Create security assessment documentation
                - Prepare audit evidence
                - Train support team

Phase 5: Production Deployment (Months 5-6)

                - Deploy to customer dev environment
                - Validate shard count sizing
                - Test encryption with customer keys
                - Gradual rollout to test/UAT/prod

8. Risk Mitigation

Risk: History Shard Under-Provisioning

Mitigation

                    - Conservative sizing (2x safety factor)
                    - Monitoring with early warning alerts
                    - Migration plan documented upfront

Risk: Encryption Key Compromise

Mitigation

                    - Key rotation workflow
                    - Dual-key support during rotation
                    - Audit logging of all key access
                    - Isolated Codec Server (separate security boundary)

Risk: Operational Complexity

Mitigation

                    - Comprehensive documentation
                    - Automated deployment (Helm charts)
                    - Dedicated support team
                    - Training and runbooks

9. Success Metrics

| Metric | Target | Measurement |
| --- | --- | --- |
| History shard utilization | < 70% | Prometheus metrics |
| Encryption coverage | 100% of sensitive workflows | Audit logs |
| Key rotation success rate | 100% | Rotation workflow logs |
| Deployment time | < 2 hours | Deployment automation |
| Support response time | < 1 hour (P1) | Incident tracking |

10. Next Steps

            - **Approve this strategy** for customer environment deployment
            - **Allocate resources** for Temporal SME, K8s engineer, security engineer
            - **Begin Phase 1** implementation (shard calculator, Helm chart research)
            - **Engage with customer** security team for KMS integration requirements
            - **Schedule security assessment** preparation timeline

References

            - [Temporal Helm Charts](https://github.com/temporalio/helm-charts)
            - [Temporal Data Converter Documentation](https://docs.temporal.io/concepts/what-is-a-data-converter)
            - [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
            - [Temporal History Shard Configuration](https://docs.temporal.io/server/production-deployment-guide)
