FinOps Strategy

Engram is designed with cost-conscious architecture as a first-class concern. This guide details our approach to minimizing and tracking costs while maintaining enterprise-grade capabilities.

Cost Philosophy

“Pay only for what you use, when you use it.”

Design Principles

Scale-to-Zero: All compute scales to zero when idle
Right-Sizing: Match resources to actual workload
Caching: Reduce redundant API calls
Monitoring: Track and alert on cost anomalies

Infrastructure Cost Breakdown

Azure Container Apps (Scale-to-Zero)

Service	Min Replicas	Max Replicas	vCPU	Memory	Est. Cost/Month
API	0	10	0.5	1GB	$0-15*
Worker	0	5	0.5	1GB	$0-10*
Frontend	Static	Static	N/A	N/A	$9

Scales to zero when no requests

Database & Storage

Service	Tier	Configuration	Est. Cost/Month
PostgreSQL	B1ms	1 vCore, 2GB RAM	$13
Storage Account	Standard	LRS, Hot tier	$1
Key Vault	Standard	< 10K operations	$0.03/10K ops

AI Services (Foundry)

Service	Model	Pricing	Notes
Azure AI (Foundry)	gpt-4o	$5/$15 per 1M tokens	Input/Output
Azure AI (Foundry)	gpt-4o-mini	$0.15/$0.60 per 1M tokens	Cost-efficient

Estimated Monthly Costs

Scenario	Description	Est. Cost
Idle	No active users	~$23
Light	100 conversations/day	~$50-80
Medium	1,000 conversations/day	~$200-400
Heavy	10,000 conversations/day	~$1,500-3,000

Cost Optimization Strategies

1. Model Selection

# Route simple queries to cheaper models
if query_complexity == "simple":
    model = "gpt-4o-mini"  # 30x cheaper
else:
    model = "gpt-4o"

Impact: 60-80% cost reduction on simple queries

2. Memory Caching

# Cache frequently accessed context
@lru_cache(maxsize=1000)
async def get_user_context(user_id: str):
    return await memory_client.get_context(user_id)

Impact: Reduce memory API calls by 50%+

3. Response Streaming

# Stream responses instead of waiting for full completion
async for chunk in agent.stream(message):
    yield chunk

Impact: Better UX, similar cost, faster perceived response

4. Conversation Summarization

# Summarize long conversations to reduce context size
if len(context.episodic.recent_turns) > 10:
    context.episodic.summary = await summarize(context.episodic.recent_turns)
    context.episodic.recent_turns = context.episodic.recent_turns[-3:]

Impact: 40% token reduction on long conversations

Budget Controls

Azure Budget Alerts

resource budget 'Microsoft.Consumption/budgets@2021-10-01' = {
  name: 'engram-monthly-budget'
  properties: {
    category: 'Cost'
    amount: 500
    timeGrain: 'Monthly'
    notifications: {
      '50percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 50
        contactEmails: ['alerts@company.com']
      }
      '80percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        contactEmails: ['alerts@company.com']
      }
      '100percent': {
        enabled: true
        operator: 'GreaterThan'
        threshold: 100
        contactEmails: ['alerts@company.com']
      }
    }
  }
}

Application-Level Limits

# Rate limiting per user
RATE_LIMITS = {
    Role.VIEWER: 10,      # 10 requests/minute
    Role.ANALYST: 30,     # 30 requests/minute
    Role.MANAGER: 60,     # 60 requests/minute
    Role.ADMIN: 120,      # 120 requests/minute
}

# Token limits per request
MAX_TOKENS_PER_REQUEST = 4096
MAX_TOKENS_PER_USER_PER_DAY = 100000

Cost Tracking

Metrics Dashboard

Metric	Description	Alert Threshold
`tokens_used_total`	Total tokens consumed	> 1M/day
`api_calls_total`	External API calls	> 10K/day
`workflow_executions`	Temporal workflows	> 5K/day

Daily Cost Report

-- Query for daily cost breakdown
SELECT 
  DATE(timestamp) as date,
  agent_id,
  SUM(tokens_input) as input_tokens,
  SUM(tokens_output) as output_tokens,
  SUM(tokens_input * 0.000005 + tokens_output * 0.000015) as estimated_cost
FROM agent_executions
WHERE timestamp > NOW() - INTERVAL '30 days'
GROUP BY DATE(timestamp), agent_id
ORDER BY date DESC;

Cost Attribution

# Tag all resources with cost center
context.operational.metadata = {
    "cost_center": user.department,
    "project": context.operational.workflow_id,
    "agent": context.operational.active_agent,
}

Scaling Recommendations

Development Environment

# Minimal resources for dev
api:
  replicas: 0-1
  cpu: 0.25
  memory: 512MB
  
database:
  tier: B1ms  # ~$13/month

Production Environment

# Scaled for production
api:
  replicas: 0-10
  cpu: 1.0
  memory: 2GB
  auto_scale:
    min: 0
    max: 10
    target_cpu: 70%
    
database:
  tier: GP_Gen5_2  # ~$150/month
  read_replicas: 1

Reserved Capacity

For predictable workloads, consider:

Service	On-Demand	1-Year Reserved	3-Year Reserved
PostgreSQL	$0.034/hr	$0.022/hr (35% off)	$0.017/hr (50% off)
Container Apps	Pay-per-use	N/A	N/A
Azure AI (Foundry)	Standard	PTU (Provisioned)	PTU

When to Use Reserved Capacity

PostgreSQL: Always (predictable baseline)
Azure AI PTU: When > 100K tokens/hour sustained
Container Apps: Never (scale-to-zero is better)

Cost Optimization Checklist

Infrastructure

Enable scale-to-zero on all container apps
Use B1ms PostgreSQL SKU for staging/dev (cost-optimized)
Consider reserved capacity for PostgreSQL (35-50% savings)
Tag all resources for cost attribution

Application-Level

Cache Evidence Telemetry (60s TTL, reduces Monitor API calls by ~95%)
Paginate BAU artifacts (reduces memory queries by 50-80%)
Use gpt-4o-mini for simple queries (30x cheaper)
Implement conversation summarization (40% token reduction)
Batch Golden Thread runs for bulk operations (30-40% workflow overhead reduction)

Monitoring & Alerts

Set up budget alerts at 50%, 80%, 100%
Monitor daily token usage
Review monthly cost reports
Track Evidence Telemetry cache hit rate (target > 80%)
Alert on Golden Thread run failures (indicates waste)