Ingestion Pipeline¶
The ingestion pipeline transforms raw documents into searchable vector embeddings stored in PostgreSQL with pgvector.
Pipeline Overview¶
```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant Backend
    participant Parser
    participant LLM
    participant EmbeddingAPI
    participant DB
    participant Storage
    User->>Frontend: Upload file(s)
    Frontend->>Backend: POST /documents/upload
    Backend->>Storage: Store original file
    Backend->>DB: Create notebook_file_job (status: pending)
    Backend->>Backend: Start background task
    Note over Backend: Phase 1 — Parsing
    Backend->>Parser: Send file to Docling or Mistral OCR
    Parser-->>Backend: Extracted text + structure
    Note over Backend: Phase 2 — Cleaning
    Backend->>Backend: Text normalization + dedup check
    Note over Backend: Phase 3 — Chunking
    Backend->>Backend: Recursive / Hybrid / Agentic chunking
    Note over Backend: Phase 4a — Image Description (optional)
    alt enable_multimodal_processing = true
        Backend->>LLM: Vision API: describe base64 images
        LLM-->>Backend: Text descriptions replace image blobs
    end
    Note over Backend: Phase 4b — Context Augmentation (optional)
    alt enable_contextual_retrieval = true
        Backend->>LLM: Enrich chunks with contextual metadata
        LLM-->>Backend: Enhanced chunk text
    end
    Note over Backend: Phase 5 — Embedding
    Backend->>EmbeddingAPI: Embed chunks (OpenAI / OpenRouter / Ollama)
    EmbeddingAPI-->>Backend: Vector embeddings
    Note over Backend: Phase 6 — Storage
    Backend->>DB: Insert into documents (content + embedding + metadata)
    Backend->>DB: Insert into document_records (file-level record)
    Backend->>DB: Insert into contextual_retrieval_table (if context aug.)
    Backend->>DB: Update notebook_file_job (status: completed)
```

Phase 1: Parsing¶
Two parsers are available:
| Parser | Strengths | Speed |
|---|---|---|
| Docling Parser | Structural extraction, table detection, DOCX native support | 0.01–0.22s (DOCX), 4–182s (PDF) |
| Mistral OCR | Scanned documents, image-heavy PDFs, handles DOCX too | ~2–5s for any document |
When to Use Mistral OCR
Mistral OCR extracts 4x more text than Docling from image-heavy PDFs (e.g., 995 vs 258 chars for a 3MB flyer). Use it for scanned documents.
Phase 2: Text Cleaning¶
- Unicode normalization
- Whitespace consolidation
- Deduplication check against existing document records
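The cleaning steps above can be sketched in a few lines. This is an illustrative sketch, not the pipeline's actual code: the function names and the SHA-256 fingerprint used for the dedup check are assumptions, but Unicode normalization via NFKC and whitespace collapsing are standard ways to implement the steps listed.

```python
import hashlib
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Phase 2 sketch: Unicode normalization + whitespace consolidation."""
    text = unicodedata.normalize("NFKC", raw)   # e.g. NBSP -> regular space
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

def content_hash(text: str) -> str:
    """Stable fingerprint for the dedup check (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Dedup: a new file whose cleaned text hashes to an existing record is skipped.
existing_hashes = {content_hash(clean_text("Hello world"))}
candidate = clean_text("Hello\u00a0 world")     # NBSP + double space
is_duplicate = content_hash(candidate) in existing_hashes
```

Because hashing runs on the cleaned text, cosmetic differences (non-breaking spaces, extra whitespace) do not defeat the duplicate check.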
Phase 3: Chunking¶
| Strategy | Description | Best For |
|---|---|---|
| Recursive Chunking | Splits on paragraph → sentence → character boundaries | Most documents (default) |
| Docling Hybrid Chunker | Uses Docling's structural analysis to split at section boundaries | Structured documents with clear headings |
| Agentic Chunking | LLM-guided chunking that considers semantic meaning | Complex documents (falls back to Recursive if LLM fails) |
Parameters:
| Parameter | Default | Range | Effect |
|---|---|---|---|
| chunk_size | 1000 | 100–10000 | Smaller = more precise citations, larger = more context |
| chunk_overlap | 200 | 0–chunk_size | Ensures important sentences aren't split across chunks |
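To make the interaction of the two parameters concrete, here is a character-level sketch of overlapping windows. The real Recursive Chunking strategy prefers paragraph and sentence boundaries before falling back to raw characters, so treat this only as an illustration of how chunk_size and chunk_overlap relate.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000,
                       chunk_overlap: int = 200) -> list[str]:
    """Sliding-window sketch: consecutive chunks share `chunk_overlap` chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With chunk_size=1000 and chunk_overlap=200, each window starts 800 characters after the previous one, so a sentence cut at a window edge is still fully contained in the neighboring chunk.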
Phase 4a: Image Description (Optional)¶
When enable_multimodal_processing = true:
- Scans chunks for base64-encoded image blobs
- Sends each image to a vision LLM (low detail, ~85 tokens/image)
- Replaces the base64 blob with an AI-generated text description
- Controlled by `asyncio.Semaphore` for concurrency + exponential backoff for rate limits
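The concurrency control described above can be sketched as follows. The function names, the RateLimitError type, and the retry count are assumptions; what the sketch shows is the stated pattern: a shared `asyncio.Semaphore` bounding concurrent vision calls, with exponential backoff on rate-limit errors.

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit error (name assumed)."""

async def describe_image(call, image_b64: str, sem: asyncio.Semaphore,
                         max_retries: int = 5) -> str:
    async with sem:                              # cap concurrent vision calls
        for attempt in range(max_retries):
            try:
                return await call(image_b64)     # hypothetical vision API call
            except RateLimitError:
                # exponential backoff with jitter before retrying
                await asyncio.sleep(min(2 ** attempt, 30) + random.random())
        raise RuntimeError("rate-limited on every attempt")

async def describe_all(call, images: list[str], concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(describe_image(call, img, sem) for img in images))
```

The semaphore keeps the number of in-flight requests bounded regardless of how many images a document contains, while backoff spaces out retries when the provider pushes back.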
Phase 4b: Context Augmentation (Optional)¶
When enable_contextual_retrieval = true:
- Each chunk is sent to the LLM with the full document context
- The LLM generates a contextual summary
- The summary is prepended to the chunk: `# Context\n{summary}\n\n---\n\n# Content\n{original}`
- Both the original and enhanced chunks are stored in contextual_retrieval_table
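The prepend step is just string assembly following the template shown above; a minimal sketch (function name assumed):

```python
def augment_chunk(summary: str, original: str) -> str:
    """Prepend the LLM-generated contextual summary to the original chunk."""
    return f"# Context\n{summary}\n\n---\n\n# Content\n{original}"
```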
Phase 5: Embedding¶
Chunks are embedded using the notebook's configured embedding model:
| Provider | Models | Batch Size |
|---|---|---|
| OpenRouter | 21+ models (e.g., openai/text-embedding-3-small) | SDK native |
| OpenAI Direct | text-embedding-3-small (1536d) | SDK native |
| Ollama | nomic-embed-text (768d) | 50 per batch |
The embedding model is immutable — set at notebook creation and stored in notebook_settings.embedding_model.
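For providers without native batching (the table lists Ollama at 50 chunks per batch), the loop is straightforward. This is a sketch with assumed names; `embed_fn` stands in for whatever provider client the notebook is configured with.

```python
def embed_chunks(embed_fn, chunks: list[str], batch_size: int = 50) -> list[list[float]]:
    """Embed chunks in fixed-size batches (50 mirrors the Ollama setting above)."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors
```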
Phase 6: Storage¶
Each chunk is inserted into the documents table with:
| Column | Content |
|---|---|
| content | The chunk text |
| embedding | Vector embedding (VECTOR type) |
| metadata | JSONB: {notebook_id, file_id, file_title, chunk_index, loc, source, ...} |
| fts | Auto-generated tsvector for full-text search |
Alongside the chunk rows, a document_records entry tracks each file at the file level.
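The shape of one chunk insert can be sketched as below. This is an assumed illustration of the row described in the table, not the pipeline's actual query: the exact column list and the psycopg-style `%s` placeholders are assumptions, and `fts` is omitted because Postgres generates it. pgvector accepts vectors in `[x, y, ...]` text form, which is why the embedding is serialized that way.

```python
import json

def chunk_insert(chunk_text: str, embedding: list[float],
                 notebook_id: str, file_id: str,
                 file_title: str, chunk_index: int):
    """Build SQL + params for one `documents` row (column names from the table above)."""
    metadata = {
        "notebook_id": notebook_id,
        "file_id": file_id,
        "file_title": file_title,
        "chunk_index": chunk_index,
    }
    sql = ("INSERT INTO documents (content, embedding, metadata) "
           "VALUES (%s, %s::vector, %s::jsonb)")
    params = (chunk_text, str(embedding), json.dumps(metadata))
    return sql, params
```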
Re-Ingestion¶
The re-ingestion flow provides atomic cleanup before re-processing:
- Status guard — Rejects files currently processing or reprocessing (409)
- Cleanup — Deletes from 6 tables: documents, document_records, contextual_retrieval_table, raw_data_table, query_cache, notebook_file_jobs
- Verification — Confirms 4 tables are clean
- Re-pipeline — Runs the full ingestion pipeline with current settings
Storage Provider 'None'
If storage provider is "none", re-ingestion is blocked (409) because original files are not persisted.
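The two 409 guards above (in-flight status and the "none" storage provider) can be sketched as a single pre-check. Function and status names here are assumptions for illustration; the actual endpoint logic may differ.

```python
IN_FLIGHT = {"processing", "reprocessing"}

def reingest_allowed(job_status: str, storage_provider: str) -> tuple[bool, int]:
    """Return (allowed, http_status) per the re-ingestion guards described above."""
    if job_status.lower() in IN_FLIGHT:
        return False, 409          # file is already being processed
    if storage_provider == "none":
        return False, 409          # original file was never persisted
    return True, 200
```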
Triggering via API¶
```bash
curl -X POST http://localhost:8000/api/notebooks/{notebook_id}/documents/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{"file_id": "uuid", "file_name": "doc.pdf", "file_path": "nb/fid/doc.pdf"}],
    "settings": {
      "parser": "Docling Parser",
      "chunking_strategy": "Recursive Chunking",
      "chunk_size": 1000,
      "chunk_overlap": 200
    },
    "notebook_name": "My Notebook"
  }'
```