Ingestion Pipeline

The ingestion pipeline transforms raw documents into searchable vector embeddings stored in PostgreSQL with pgvector.


Pipeline Overview

sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant Parser
    participant LLM
    participant EmbeddingAPI
    participant DB
    participant Storage

    User->>Frontend: Upload file(s)
    Frontend->>Backend: POST /documents/upload
    Backend->>Storage: Store original file
    Backend->>DB: Create notebook_file_job (status: pending)
    Backend->>Backend: Start background task

    Note over Backend: Phase 1 — Parsing
    Backend->>Parser: Send file to Docling or Mistral OCR
    Parser-->>Backend: Extracted text + structure

    Note over Backend: Phase 2 — Cleaning
    Backend->>Backend: Text normalization + dedup check

    Note over Backend: Phase 3 — Chunking
    Backend->>Backend: Recursive / Hybrid / Agentic chunking

    Note over Backend: Phase 4a — Image Description (optional)
    alt enable_multimodal_processing = true
        Backend->>LLM: Vision API: describe base64 images
        LLM-->>Backend: Text descriptions replace image blobs
    end

    Note over Backend: Phase 4b — Context Augmentation (optional)
    alt enable_contextual_retrieval = true
        Backend->>LLM: Enrich chunks with contextual metadata
        LLM-->>Backend: Enhanced chunk text
    end

    Note over Backend: Phase 5 — Embedding
    Backend->>EmbeddingAPI: Embed chunks (OpenAI / OpenRouter / Ollama)
    EmbeddingAPI-->>Backend: Vector embeddings

    Note over Backend: Phase 6 — Storage
    Backend->>DB: Insert into documents (content + embedding + metadata)
    Backend->>DB: Insert into document_records (file-level record)
    Backend->>DB: Insert into contextual_retrieval_table (if context aug.)
    Backend->>DB: Update notebook_file_job (status: completed)

Phase 1: Parsing

Two parsers are available:

| Parser | Strengths | Speed |
| --- | --- | --- |
| Docling Parser | Structural extraction, table detection, native DOCX support | 0.01–0.22 s (DOCX), 4–182 s (PDF) |
| Mistral OCR | Scanned documents, image-heavy PDFs; also handles DOCX | ~2–5 s for any document |

When to Use Mistral OCR

Mistral OCR extracts roughly 4× more text than Docling from image-heavy PDFs (e.g., 995 vs. 258 characters for a 3 MB flyer). Use it for scanned or image-heavy documents.


Phase 2: Text Cleaning

  • Unicode normalization
  • Whitespace consolidation
  • Deduplication check against existing document records
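
The cleaning steps above can be sketched as follows. Function names, the choice of NFKC normalization, and SHA-256 hashing for the dedup check are illustrative assumptions, not the pipeline's exact implementation:

```python
import hashlib
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize Unicode and consolidate whitespace, keeping paragraph breaks."""
    text = unicodedata.normalize("NFKC", raw)   # e.g. ligature 'ﬁ' -> 'fi'
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap blank lines at one
    return text.strip()

def content_hash(text: str) -> str:
    """Stable fingerprint compared against existing document records."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

A file whose cleaned-text hash matches an existing record can be skipped before any LLM or embedding cost is incurred.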

Phase 3: Chunking

| Strategy | Description | Best For |
| --- | --- | --- |
| Recursive Chunking | Splits on paragraph → sentence → character boundaries | Most documents (default) |
| Docling Hybrid Chunker | Uses Docling's structural analysis to split at section boundaries | Structured documents with clear headings |
| Agentic Chunking | LLM-guided chunking that considers semantic meaning | Complex documents (falls back to Recursive if the LLM fails) |

Parameters:

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| chunk_size | 1000 | 100–10000 | Smaller = more precise citations; larger = more context |
| chunk_overlap | 200 | 0–chunk_size | Ensures important sentences aren't split across chunk boundaries |
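
The interaction of the two parameters can be sketched as a fixed-size sliding window. This is a simplification: the real recursive strategy prefers paragraph and sentence boundaries before falling back to characters, and the function name is illustrative:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Sliding-window split: consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With `chunk_size=4, chunk_overlap=2`, the string `abcdefghij` yields `abcd`, `cdef`, `efgh`, `ghij`: each boundary sentence appears whole in at least one chunk.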

Phase 4a: Image Description (Optional)

When enable_multimodal_processing = true:

  1. Scans chunks for base64-encoded image blobs
  2. Sends each image to a vision LLM (low detail, ~85 tokens/image)
  3. Replaces the base64 blob with an AI-generated text description
  4. Controlled by asyncio.Semaphore for concurrency + exponential backoff for rate limits

Phase 4b: Context Augmentation (Optional)

When enable_contextual_retrieval = true:

  1. Each chunk is sent to the LLM with the full document context
  2. The LLM generates a contextual summary
  3. The summary is prepended to the chunk: # Context\n{summary}\n\n---\n\n# Content\n{original}
  4. Both the original and enhanced chunks are stored in contextual_retrieval_table
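
Step 3's format can be expressed as a one-line helper (the function name is illustrative; the template itself is the one shown above):

```python
def augment_chunk(original: str, summary: str) -> str:
    """Prepend the LLM-generated context to the chunk in the stored format."""
    return f"# Context\n{summary}\n\n---\n\n# Content\n{original}"
```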

Phase 5: Embedding

Chunks are embedded using the notebook's configured embedding model:

| Provider | Models | Batch Size |
| --- | --- | --- |
| OpenRouter | 21+ models (e.g., openai/text-embedding-3-small) | SDK native |
| OpenAI Direct | text-embedding-3-small (1536d) | SDK native |
| Ollama | nomic-embed-text (768d) | 50 per batch |

The embedding model is immutable — set at notebook creation and stored in notebook_settings.embedding_model.
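
For Ollama, where the pipeline caps requests at 50 chunks per call, the batching is a simple slice loop (function name is illustrative):

```python
from typing import Iterator

def batch_chunks(chunks: list[str], batch_size: int = 50) -> Iterator[list[str]]:
    """Yield successive slices of at most batch_size chunks per embedding call."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]
```

120 chunks therefore become three requests of 50, 50, and 20 chunks; the OpenAI and OpenRouter SDKs handle this batching natively.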


Phase 6: Storage

Each chunk is inserted into the documents table with:

| Column | Content |
| --- | --- |
| content | The chunk text |
| embedding | Vector embedding (VECTOR type) |
| metadata | JSONB: {notebook_id, file_id, file_title, chunk_index, loc, source, ...} |
| fts | Auto-generated tsvector for full-text search |

Each ingested file also gets one document_records entry tracking file-level state.
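
A per-chunk insert might be assembled as below. The column names and metadata keys come from the table above; the SQL shape, function name, and use of pgvector's `[x,y,...]` text literal are assumptions about the implementation. The fts column is database-generated, so it is not supplied:

```python
import json

def build_document_row(chunk: str, embedding: list[float], *, notebook_id: str,
                       file_id: str, file_title: str, chunk_index: int):
    """Return (sql, params) for inserting one chunk into the documents table."""
    metadata = {
        "notebook_id": notebook_id,
        "file_id": file_id,
        "file_title": file_title,
        "chunk_index": chunk_index,
    }
    # pgvector accepts a '[0.1,0.2,...]' literal cast to the VECTOR type.
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    sql = ("INSERT INTO documents (content, embedding, metadata) "
           "VALUES (%s, %s::vector, %s::jsonb)")
    return sql, (chunk, vector_literal, json.dumps(metadata))
```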


Re-Ingestion

The re-ingestion flow provides atomic cleanup before re-processing:

  1. Status guard — Rejects files currently Processing or reprocessing (409)
  2. Cleanup — Deletes from 6 tables: documents, document_records, contextual_retrieval_table, raw_data_table, query_cache, notebook_file_jobs
  3. Verification — Confirms 4 tables are clean
  4. Re-pipeline — Runs the full ingestion pipeline with current settings

Storage Provider 'None'

If storage provider is "none", re-ingestion is blocked (409) because original files are not persisted.
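
Both 409 conditions, the status guard and the missing-originals case, can be sketched as a single pre-flight check (names are illustrative; the real API layer maps the failure to an HTTP 409 response):

```python
BLOCKED_STATUSES = {"processing", "reprocessing"}

def check_reingest_allowed(job_status: str, storage_provider: str) -> None:
    """Raise when re-ingestion cannot proceed; callers translate this to HTTP 409."""
    if job_status.lower() in BLOCKED_STATUSES:
        raise ValueError("409: file is currently being processed")
    if storage_provider == "none":
        raise ValueError("409: original file not persisted (storage provider 'none')")
```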


Triggering via API

# Ingest one or more files into a notebook
curl -X POST http://localhost:8000/api/notebooks/{notebook_id}/documents/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{"file_id": "uuid", "file_name": "doc.pdf", "file_path": "nb/fid/doc.pdf"}],
    "settings": {
      "parser": "Docling Parser",
      "chunking_strategy": "Recursive Chunking",
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    "notebook_name": "My Notebook"
  }'

# Re-ingest a single file with updated settings
curl -X POST http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/reingest \
  -H "Content-Type: application/json" \
  -d '{"settings": {"parser": "Mistral OCR", "chunk_size": 600}}'

# Check a file's current ingestion stage
curl http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/stage