Ingestion Pipeline

The ingestion pipeline transforms raw documents into searchable vector embeddings stored in PostgreSQL with pgvector.


Pipeline Overview

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant Parser
    participant LLM
    participant EmbeddingAPI
    participant DB
    participant Storage

    User->>Frontend: Upload file(s)
    Frontend->>Backend: POST /documents/upload
    Backend->>Storage: Store original file
    Backend->>DB: Create notebook_file_job (status: pending)
    Backend->>Backend: Start background task

    Note over Backend: Phase 1 — Parsing
    Backend->>Parser: Send file to Docling or Mistral OCR
    Parser-->>Backend: Extracted text + structure

    Note over Backend: Phase 2 — Cleaning
    Backend->>Backend: Text normalization + dedup check

    Note over Backend: Phase 3 — Chunking
    Backend->>Backend: Recursive / Hybrid / Agentic chunking

    Note over Backend: Phase 4a — Image Description (optional)
    alt enable_multimodal_processing = true
        Backend->>LLM: Vision API: describe base64 images
        LLM-->>Backend: Text descriptions replace image blobs
    end

    Note over Backend: Phase 4b — Context Augmentation (optional)
    alt enable_contextual_retrieval = true
        Backend->>LLM: Enrich chunks with contextual metadata
        LLM-->>Backend: Enhanced chunk text
    end

    Note over Backend: Phase 5 — Embedding
    Backend->>EmbeddingAPI: Embed chunks (OpenAI / OpenRouter / Ollama)
    EmbeddingAPI-->>Backend: Vector embeddings

    Note over Backend: Phase 6 — Storage
    Backend->>DB: Insert into documents (content + embedding + metadata)
    Backend->>DB: Insert into document_records (file-level record)
    Backend->>DB: Insert into contextual_retrieval_table (if context aug.)
    Backend->>DB: Update notebook_file_job (status: completed)

Phase 1: Parsing

Two parsers are available:

| Parser | Strengths | Speed |
| --- | --- | --- |
| Docling Parser | Structural extraction, table detection, native DOCX support | 0.01–0.22 s (DOCX), 4–182 s (PDF) |
| Mistral OCR | Scanned documents, image-heavy PDFs; also handles DOCX | ~2–5 s for any document |

When to Use Mistral OCR

Mistral OCR extracts roughly 4× more text than Docling from image-heavy PDFs (e.g., 995 vs. 258 characters for a 3 MB flyer). Use it for scanned or image-heavy documents.


Phase 2: Text Cleaning

  • Unicode normalization
  • Whitespace consolidation
  • Deduplication check against existing document records
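
The cleaning steps above can be sketched as follows. Function names, the choice of NFKC normalization, and SHA-256 hashing for the dedup check are illustrative assumptions, not the pipeline's exact implementation:

```python
import hashlib
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize Unicode and consolidate whitespace, keeping paragraph breaks."""
    text = unicodedata.normalize("NFKC", raw)   # e.g. ligature 'ﬁ' -> 'fi'
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap blank lines at one
    return text.strip()

def content_hash(text: str) -> str:
    """Stable fingerprint compared against existing document records."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

A file whose cleaned-text hash matches an existing record can be skipped before any LLM or embedding cost is incurred.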

Phase 3: Chunking

| Strategy | Description | Best For |
| --- | --- | --- |
| Recursive Chunking | Splits on paragraph → sentence → character boundaries | Most documents (default) |
| Docling Hybrid Chunker | Uses Docling's structural analysis to split at section boundaries | Structured documents with clear headings |
| Agentic Chunking | LLM-guided chunking that considers semantic meaning | Complex documents (falls back to Recursive if the LLM fails) |

Parameters:

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| chunk_size | 1000 | 100–10000 | Smaller = more precise citations; larger = more context |
| chunk_overlap | 200 | 0–chunk_size | Ensures important sentences aren't split across chunk boundaries |
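
The interaction of the two parameters can be sketched as a fixed-size sliding window. This is a simplification: the real recursive strategy prefers paragraph and sentence boundaries before falling back to characters, and the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Sliding-window split: consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With `chunk_size=4, chunk_overlap=2`, the string `abcdefghij` yields `abcd`, `cdef`, `efgh`, `ghij`: each boundary sentence appears whole in at least one chunk.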

Phase 4a: Image Description (Optional)

When enable_multimodal_processing = true:

  1. Scans chunks for base64-encoded image blobs
  2. Sends each image to a vision LLM (low detail, ~85 tokens/image)
  3. Replaces the base64 blob with an AI-generated text description
  4. Controlled by asyncio.Semaphore for concurrency + exponential backoff for rate limits

Phase 4b: Context Augmentation (Optional)

When enable_contextual_retrieval = true:

  1. Each chunk is sent to the LLM with the full document context
  2. The LLM generates a contextual summary
  3. The summary is prepended to the chunk: # Context\n{summary}\n\n---\n\n# Content\n{original}
  4. Both the original and enhanced chunks are stored in contextual_retrieval_table
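
Step 3's format can be expressed as a one-line helper (the function name is illustrative; the template itself is the one shown above):

```python
def augment_chunk(original: str, summary: str) -> str:
    """Prepend the LLM-generated context to the chunk in the stored format."""
    return f"# Context\n{summary}\n\n---\n\n# Content\n{original}"
```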

Phase 5: Embedding

Chunks are embedded using the notebook's configured embedding model:

| Provider | Models | Batch Size |
| --- | --- | --- |
| OpenRouter | 21+ models (e.g., openai/text-embedding-3-small) | SDK native |
| OpenAI Direct | text-embedding-3-small (1536d) | SDK native |
| Ollama | nomic-embed-text (768d) | 50 per batch |

The embedding model is immutable — set at notebook creation and stored in notebook_settings.embedding_model.
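
For Ollama, where the pipeline caps requests at 50 chunks per call, the batching is a simple slice loop (function name is illustrative):

```python
from typing import Iterator

def batch_chunks(chunks: list[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Yield successive slices of at most batch_size chunks per embedding call."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]
```

120 chunks therefore become three requests of 50, 50, and 20 chunks; the OpenAI and OpenRouter SDKs handle this batching natively.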


Phase 6: Storage

Each chunk is inserted into the documents table with:

| Column | Content |
| --- | --- |
| content | The chunk text |
| embedding | Vector embedding (VECTOR type) |
| metadata | JSONB: {notebook_id, file_id, file_title, chunk_index, loc, source, ...} |
| fts | Auto-generated tsvector for full-text search |

Each ingested file also gets one document_records entry tracking file-level state.
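
A per-chunk insert might be assembled as below. The column names and metadata keys come from the table above; the SQL shape, function name, and use of pgvector's `[x,y,...]` text literal are assumptions about the implementation. The fts column is database-generated, so it is not supplied:

```python
import json

def build_document_row(chunk: str, embedding: list[float], *, notebook_id: str,
                       file_id: str, file_title: str, chunk_index: int):
    """Return (sql, params) for inserting one chunk into the documents table."""
    metadata = {
        "notebook_id": notebook_id,
        "file_id": file_id,
        "file_title": file_title,
        "chunk_index": chunk_index,
    }
    # pgvector accepts a '[0.1,0.2,...]' literal cast to the VECTOR type.
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    sql = ("INSERT INTO documents (content, embedding, metadata) "
           "VALUES (%s, %s::vector, %s::jsonb)")
    return sql, (chunk, vector_literal, json.dumps(metadata))
```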


Re-Ingestion

The re-ingestion flow provides atomic cleanup before re-processing:

  1. Status guard — Rejects files currently Processing or reprocessing (409)
  2. Cleanup — Deletes from 6 tables: documents, document_records, contextual_retrieval_table, raw_data_table, query_cache, notebook_file_jobs
  3. Verification — Confirms 4 tables are clean
  4. Re-pipeline — Runs the full ingestion pipeline with current settings

Storage Provider 'None'

If storage provider is "none", re-ingestion is blocked (409) because original files are not persisted.
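
Both 409 conditions, the status guard and the missing-originals case, can be sketched as a single pre-flight check (names are illustrative; the real API layer maps the failure to an HTTP 409 response):

```python
BLOCKED_STATUSES = {"processing", "reprocessing"}

def check_reingest_allowed(job_status: str, storage_provider: str) -> None:
    """Raise when re-ingestion cannot proceed; callers translate this to HTTP 409."""
    if job_status.lower() in BLOCKED_STATUSES:
        raise ValueError("409: file is currently being processed")
    if storage_provider == "none":
        raise ValueError("409: original file not persisted (storage provider 'none')")
```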


Triggering via API

# Ingest one or more files into a notebook
curl -X POST http://localhost:8000/api/notebooks/{notebook_id}/documents/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{"file_id": "uuid", "file_name": "doc.pdf", "file_path": "nb/fid/doc.pdf"}],
    "settings": {
      "parser": "Docling Parser",
      "chunking_strategy": "Recursive Chunking",
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    "notebook_name": "My Notebook"
  }'

# Re-ingest a single file with updated settings
curl -X POST http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/reingest \
  -H "Content-Type: application/json" \
  -d '{"settings": {"parser": "Mistral OCR", "chunk_size": 600}}'

# Check a file's current ingestion stage
curl http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/stage