Document Ingestion Strategy: Enterprise Guide

Engram Context Engineering Platform
Aligns with GTM Research Paper v2 Pilot Requirements

Executive Summary

The Engram Document Ingestion Strategy enables enterprises to ingest documents from 5 source types into a unified context engine, making content searchable via tri-search (keyword, vector, graph). This architecture directly addresses the GTM pilot requirements from the Context Engineering Research Paper.


GTM Alignment: 5 Pilot Source Types

| Source Type | GTM Requirement | Engram Implementation | Status |
|-------------|-----------------|-----------------------|--------|
| Policies | Governance docs, procedures | Unstructured PDF/DOCX → Tri-index | ✅ Implemented |
| Tickets | Issue trackers, case records | API connectors → Session episodes | 🔶 Planned |
| PDFs | Dense documents, reports | Unstructured partition → Chunking | ✅ Implemented |
| Wikis | Knowledge bases, Confluence | Wiki connector → doc-wiki-* sessions | ✅ Implemented |
| Code | Source files, configs | Git integration → Code-aware chunking | 🔶 Planned |

Antigravity Ingestion Router (CtxEco-Lib)

Context Ecology Architecture — Documents classified by Truth Value, routed to CtxGraph

Antigravity Router Architecture

The Antigravity Router replaces the generic Unstructured-for-everything approach with intelligent routing based on document type:

| Class | Name | Engine | Decay Rate | Use Case |
|-------|------|--------|------------|----------|
| A | Immutable Truth | IBM Docling | 0.01 | Safety manuals, specs, ISO docs |
| B | Ephemeral Chatter | Unstructured.io | 0.80 | Email, PPT, meeting notes |
| C | Operational | Pandas | 0.99 | CSV logs, JSON telemetry |
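
A minimal routing sketch in Python, assuming documents arrive with a declared type tag. The class table above is from this document, but the dispatch function and the type-to-class mapping below are illustrative assumptions, not the production router:

from dataclasses import dataclass

@dataclass(frozen=True)
class ContextClass:
    name: str
    engine: str        # processing engine per the table above
    decay_rate: float  # how quickly facts from this class lose weight

CLASSES = {
    "A": ContextClass("Immutable Truth", "docling", 0.01),
    "B": ContextClass("Ephemeral Chatter", "unstructured", 0.80),
    "C": ContextClass("Operational", "pandas", 0.99),
}

# Hypothetical mapping from declared document type to Truth Value class.
CLASS_BY_DOC_TYPE = {
    "safety_manual": "A", "spec": "A", "iso_doc": "A",
    "email": "B", "deck": "B", "meeting_notes": "B",
    "csv_log": "C", "telemetry": "C",
}

def route(doc_type: str) -> ContextClass:
    """Pick engine and decay rate; unknown types default to ephemeral."""
    return CLASSES[CLASS_BY_DOC_TYPE.get(doc_type, "B")]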

NIST SP 800-60 Classification

Every ingested document receives sensitivity classification:

  • 🔴 High Impact — PII, safety-critical, credentials
  • 🟡 Moderate Impact — Business confidential, specifications
  • 🟢 Low Impact — General communications
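
A sketch of how the sensitivity tag might be computed at ingest time. The Impact levels follow NIST SP 800-60, but the keyword heuristics below are placeholder assumptions, not the production classifier:

from enum import Enum

class Impact(Enum):
    HIGH = "high"          # PII, safety-critical, credentials
    MODERATE = "moderate"  # business confidential, specifications
    LOW = "low"            # general communications

# Placeholder markers; a real classifier would map content to the
# NIST SP 800-60 information types rather than scan for keywords.
HIGH_MARKERS = ("ssn", "password", "api key", "safety-critical")
MODERATE_MARKERS = ("confidential", "internal only", "specification")

def classify_sensitivity(text: str) -> Impact:
    lowered = text.lower()
    if any(marker in lowered for marker in HIGH_MARKERS):
        return Impact.HIGH
    if any(marker in lowered for marker in MODERATE_MARKERS):
        return Impact.MODERATE
    return Impact.LOW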

16 Enterprise Connectors

| Category | Connectors |
|----------|------------|
| Cloud Storage | S3, Azure Blob, GCS |
| Collaboration | SharePoint, Drive, OneDrive, Confluence |
| Ticketing | ServiceNow, Jira, GitHub |
| Messaging | Slack, Teams, Email |
| Data | Database, Webhook, Local |

See: Antigravity Router Story


Tri-Search Architecture

Every ingested document is indexed into three search layers, enabling comprehensive retrieval via Reciprocal Rank Fusion (RRF):

1. Keyword Search (BM25/Lexical)

  • Storage: Zep sessions with messages
  • Session Pattern: doc-upload-{uuid}
  • Enables: Exact phrase matching, acronym lookup

2. Vector Search (Semantic)

  • Storage: memory_embeddings table (pgvector)
  • Embedding Model: text-embedding-3-small (1536 dims)
  • Enables: Conceptual similarity, paraphrase matching

3. Graph Search (CtxGraph / OpenContextGraph)

  • Storage: CtxGraph (Zep facts per user)
  • Enables: Entity relationships, temporal facts, multi-hop reasoning (via OpenContextGraph API)
┌─────────────────────────────────────────────────────────────┐
│                       DOCUMENT UPLOAD                       │
│                   (PDF, DOCX, TXT, etc.)                    │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                   UNSTRUCTURED PROCESSOR                    │
│  • partition()     - Layout-aware extraction                │
│  • chunk_by_title  - Semantic chunking (1000 char max)      │
│  • Metadata        - Page numbers, filetype, provenance     │
└─────────────────────────┬───────────────────────────────────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
            ▼             ▼             ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   KEYWORD    │ │    VECTOR    │ │    GRAPH     │
│    LAYER     │ │    LAYER     │ │    LAYER     │
├──────────────┤ ├──────────────┤ ├──────────────┤
│ add_memory() │ │ store_with_  │ │ add_fact()   │
│ to session   │ │ embedding()  │ │ to user      │
│              │ │              │ │              │
│ Zep Sessions │ │ pgvector     │ │ Zep Facts    │
│ doc-upload-* │ │ memory_      │ │ Knowledge    │
│              │ │ embeddings   │ │ Graph        │
└──────────────┘ └──────────────┘ └──────────────┘
            │             │             │
            └─────────────┼─────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                        HYBRID SEARCH                        │
│                Reciprocal Rank Fusion (RRF)                 │
│       search_memory() combines all three result sets        │
└─────────────────────────────────────────────────────────────┘
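
Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) across the three ranked result lists, so items that rank well in several layers rise to the top. A minimal sketch, assuming each layer returns an ordered list of document IDs (k = 60 is the conventional constant; the exact fusion inside search_memory() may differ):

from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists from the keyword, vector, and graph layers."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = rrf_fuse([keyword_hits, vector_hits, graph_hits])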

Technical Implementation

Core Components

| Component | File | Purpose |
|-----------|------|---------|
| DocumentProcessor | backend/etl/processor.py | Unstructured partition + chunking |
| IngestionService | backend/etl/ingestion_service.py | Tri-indexing orchestration |
| ZepMemoryClient | backend/memory/client.py | Keyword + graph storage |
| VectorStore | backend/memory/vector_store.py | Embedding storage + search |
| EmbeddingClient | backend/memory/embedding_client.py | Azure OpenAI embeddings |
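
The DocumentProcessor stage can be sketched with Unstructured's public partition and chunk_by_title APIs. This is illustrative; the production code in backend/etl/processor.py may differ in details:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

def process_document(path: str):
    """Layout-aware extraction, then semantic chunking (1000 char max)."""
    elements = partition(filename=path)
    chunks = chunk_by_title(elements, max_characters=1000)
    for chunk in chunks:
        yield {
            "text": chunk.text,
            # Provenance metadata carried into all three indexes.
            "page_number": chunk.metadata.page_number,
            "filetype": chunk.metadata.filetype,
        }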

API Endpoint

POST /api/v1/etl/ingest
Content-Type: multipart/form-data

Parameters:
  file: <document binary>
  
Response:
{
  "success": true,
  "filename": "policy-v2.pdf",
  "chunks_processed": 42,
  "session_id": "doc-upload-a1b2c3d4",
  "document_id": "doc-e5f6g7h8i9j0"
}
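
A hypothetical client call against this endpoint (host and port are assumptions; the requests library stands in for whatever HTTP client the UI uses):

import requests

with open("policy-v2.pdf", "rb") as fh:
    resp = requests.post(
        "http://localhost:8000/api/v1/etl/ingest",  # base URL is an assumption
        files={"file": ("policy-v2.pdf", fh, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json()["session_id"])  # e.g. "doc-upload-a1b2c3d4"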

Ingestion Flow (Background Task)

async def index_chunks_tri(chunks, user_id, filename, doc_id, session_id):
    # Create the session that backs keyword (BM25) search.
    await memory_client.get_or_create_session(
        session_id=session_id,
        user_id=user_id,
        metadata={"title": filename, "document_id": doc_id},
    )

    for chunk_index, chunk in enumerate(chunks):
        # Provenance carried by every layer; enables grounded citations.
        provenance = {
            "document_id": doc_id,
            "chunk_index": chunk_index,
            "filename": filename,
        }

        # Graph layer: store the chunk as a user-scoped fact.
        await memory_client.add_fact(user_id, fact=chunk.text, metadata=provenance)

        # Vector layer: embed and persist to memory_embeddings (pgvector).
        await store_with_embedding(session_id, content=chunk.text, source_type="document")

        # Keyword layer: append the chunk as a session message
        # (message shape is illustrative).
        await memory_client.add_memory(
            session_id,
            messages=[{"role": "system", "content": chunk.text, "metadata": provenance}],
        )

Context Layer Integration

Document ingestion populates Layer 3: Semantic Knowledge of the 4-Layer Enterprise Context Schema:

| Layer | Populated By | Document Role |
|-------|--------------|---------------|
| Layer 1: Security | Azure Entra ID | User permissions filter document access |
| Layer 2: Episodic | Chat/Voice | Conversation about documents |
| Layer 3: Semantic | Document Ingestion | Facts and entities from documents |
| Layer 4: Operational | Temporal | Durable ingestion workflows |

Pilot Success Criteria (GTM Alignment)

From the GTM Research Paper, document ingestion supports:

Retrieval Metrics

  • Hit@1/Hit@3: Tri-search improves first-result accuracy
  • No-answer reduction: Graph relationships surface related content

Faithfulness Metrics

  • Grounded responses: Facts include provenance metadata
  • Citation correctness: document_id + chunk_index enable citations

Governance Metrics

  • Audit logs: Ingestion logged with user_id, filename, timestamp
  • Tenant isolation: User-scoped facts prevent cross-tenant leakage

Connector Roadmap

Phase 1: File Upload (Implemented)

  • Manual upload via API/UI
  • Supported: PDF, DOCX, TXT, MD
  • Processing: Unstructured.io

Phase 2: Enterprise Connectors (Planned)

| Connector | Source Type | Priority |
|-----------|-------------|----------|
| SharePoint | Policies, wikis | High |
| Confluence | Wikis | High |
| GitHub | Code | Medium |
| ServiceNow | Tickets | Medium |
| S3/Azure Blob | All | High |

Phase 3: Real-time Sync

  • Webhook-based incremental updates
  • Bi-temporal tracking (valid time + transaction time), sketched below
  • Contradiction detection for updated facts
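
Bi-temporal facts might be modeled along these lines. This is a sketch of the standard valid-time/transaction-time pattern, not the planned schema:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BitemporalFact:
    fact: str
    valid_from: datetime               # when the fact became true in the world
    valid_to: Optional[datetime]       # None = still true
    recorded_at: datetime              # when the system learned it
    superseded_at: Optional[datetime] = None  # set on contradiction detection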
