Documents¶
Documents are files uploaded to a notebook for parsing, chunking, and vector embedding. The ingestion pipeline runs as a background task after upload.
Base path: /api/notebooks/{notebook_id}/documents
Supported file types: PDF, DOCX, DOC, MD, TXT, CSV, XLSX, XLS
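Clients can pre-check extensions against this list before uploading and skip files the server would reject. A minimal client-side sketch (the helper name and constant are illustrative, not part of the API):

```python
from pathlib import Path

# Mirrors the supported-type list above; keep in sync with the server.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".md", ".txt", ".csv", ".xlsx", ".xls"}

def is_supported(filename: str) -> bool:
    """Return True if the file's extension is accepted by the upload endpoint."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```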
POST /api/notebooks/{notebook_id}/documents/upload¶
Upload one or more files to storage. Each file is stored at documents/{notebook_id}/{file_id}/{safe_name}. Returns metadata ready to be passed to the /ingest endpoint.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | multipart/form-data |
Body:
Multipart form data with a `files` field containing one or more files.
Status: 200 OK
| Code | Cause |
|---|---|
| 400 | No files provided |
| 401 | Invalid or missing token |
| 403 | Non-admin user |
| 502 | Storage upload failed |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

with open("handbook.pdf", "rb") as f:
    response = httpx.post(
        f"http://localhost:8000/api/notebooks/{notebook_id}/documents/upload",
        headers={"Authorization": f"Bearer {token}"},
        files={"files": ("handbook.pdf", f, "application/pdf")},
    )

uploaded = response.json()["data"]
print(f"Uploaded {len(uploaded)} files")
```
GET /api/notebooks/{notebook_id}/documents/sources¶
List available files from storage for this notebook.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
POST /api/notebooks/{notebook_id}/documents/ingest¶
Trigger the ingestion pipeline for uploaded files. Creates job records synchronously, then runs parsing, chunking, and embedding as a background task.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | application/json |
Body:
```json
{
  "files": [
    {
      "file_id": "f1a2b3c4-...",
      "file_name": "handbook.pdf",
      "file_path": "notebook-id/f1a2b3c4/handbook.pdf"
    }
  ],
  "settings": {
    "parser": "Docling Parser",
    "chunking_strategy": "Recursive Chunking",
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "enable_contextual_retrieval": false,
    "enable_multimodal_processing": false
  },
  "notebook_name": "Customer Support KB",
  "inference_provider": "openrouter",
  "inference_model": "openai/gpt-4o-mini",
  "inference_temperature": 0.4
}
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| files | array | Yes | -- | Files to ingest (from upload response) |
| files[].file_id | string | Yes | -- | File UUID |
| files[].file_name | string | Yes | -- | Original file name |
| files[].file_path | string | Yes | -- | Storage path |
| settings | object | No | defaults | Ingestion configuration |
| settings.parser | string | No | "Docling Parser" | "Docling Parser" or "Mistral OCR" |
| settings.chunking_strategy | string | No | "Recursive Chunking" | "Recursive Chunking" or "Agentic Chunking" |
| settings.chunk_size | integer | No | 1000 | Target chunk size in characters |
| settings.chunk_overlap | integer | No | 200 | Overlap between chunks |
| settings.enable_contextual_retrieval | boolean | No | false | Enable context augmentation |
| settings.enable_multimodal_processing | boolean | No | false | Enable image description |
| notebook_name | string | No | -- | Notebook title (for metadata) |
| inference_provider | string | No | -- | LLM provider for context augmentation |
| inference_model | string | No | -- | LLM model for context augmentation |
| inference_temperature | float | No | -- | Temperature for context augmentation |
Status: 200 OK
| Code | Cause |
|---|---|
| 400 | No files provided |
| 401 | Invalid or missing token |
| 403 | Non-admin user |
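Since every settings field is optional, a client can merge its overrides onto the documented defaults before calling the endpoint. A sketch under that assumption (the helper is hypothetical, not part of any SDK):

```python
# Defaults as documented in the settings table above.
DEFAULT_SETTINGS = {
    "parser": "Docling Parser",
    "chunking_strategy": "Recursive Chunking",
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "enable_contextual_retrieval": False,
    "enable_multimodal_processing": False,
}

def build_ingest_body(files, settings=None, **extra):
    """Build an /ingest request body, filling unset settings with defaults.

    `files` is the list returned by the upload endpoint; `extra` passes
    through optional top-level fields such as notebook_name.
    """
    return {
        "files": files,
        "settings": {**DEFAULT_SETTINGS, **(settings or {})},
        **extra,
    }
```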
```bash
curl -X POST http://localhost:8000/api/notebooks/$NOTEBOOK_ID/documents/ingest \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "files": [{"file_id": "f1a2b3c4", "file_name": "handbook.pdf", "file_path": "nb/f1a2b3c4/handbook.pdf"}],
    "settings": {"parser": "Docling Parser", "chunking_strategy": "Recursive Chunking"},
    "notebook_name": "Customer Support KB"
  }'
```
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

response = httpx.post(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/ingest",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "files": [
            {
                "file_id": "f1a2b3c4",
                "file_name": "handbook.pdf",
                "file_path": "nb/f1a2b3c4/handbook.pdf",
            }
        ],
        "settings": {
            "parser": "Docling Parser",
            "chunking_strategy": "Recursive Chunking",
        },
        "notebook_name": "Customer Support KB",
    },
)

jobs = response.json()["data"]["jobs"]
print(f"Started {len(jobs)} ingestion jobs")
```
POST /api/notebooks/{notebook_id}/documents/{file_id}/reingest¶
Re-ingest a document: performs atomic cleanup of all old data, then re-runs the pipeline. Rejects files that are currently being processed.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | application/json |
Body:
```json
{
  "settings": {
    "parser": "Mistral OCR",
    "chunking_strategy": "Recursive Chunking",
    "chunk_size": 800
  }
}
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| settings | object | No | -- | New ingestion settings (same schema as ingest) |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
| 409 | File currently processing, or storage provider is "none" |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
file_id = "f1a2b3c4-..."

response = httpx.post(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/reingest",
    headers={"Authorization": f"Bearer {token}"},
    json={"settings": {"parser": "Mistral OCR"}},
)
print(response.json()["data"])
```
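Because the endpoint answers 409 while a file is still mid-pipeline, a client may simply retry after a delay. A transport-agnostic sketch, where `do_reingest` is any callable that performs the POST and raises `ConflictError` on a 409 (both names are hypothetical):

```python
import time

class ConflictError(Exception):
    """Raised by do_reingest when the server answers 409."""

def reingest_with_retry(do_reingest, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Retry do_reingest() with exponential backoff while the file is busy."""
    for attempt in range(attempts):
        try:
            return do_reingest()
        except ConflictError:
            if attempt == attempts - 1:
                raise  # still busy after the last attempt
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```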
POST /api/notebooks/{notebook_id}/documents/reingest-batch¶
Batch re-ingest multiple files. Each file is cleaned up and re-ingested independently -- one file's failure does not block others.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | application/json |
Body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| file_ids | array | Yes | -- | List of file IDs to re-ingest |
| settings | object | No | -- | Ingestion settings (applied to all) |
Status: 200 OK
```json
{
  "success": true,
  "data": {
    "results": [
      {
        "file_id": "f1a2b3c4-...",
        "job_id": "j1a2b3c4-...",
        "status": "reprocessing",
        "cleanup_summary": { "documents_deleted": 45 }
      },
      {
        "file_id": "f5e6d7c8-...",
        "status": "failed",
        "error": "File is currently being processed"
      }
    ],
    "total": 2,
    "succeeded": 1,
    "failed": 1
  }
}
```
| Code | Cause |
|---|---|
| 400 | No file_ids provided |
| 401 | Invalid or missing token |
| 403 | Non-admin user |
| 409 | Storage provider is "none" |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

response = httpx.post(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/reingest-batch",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "file_ids": ["f1a2b3c4", "f5e6d7c8"],
        "settings": {"parser": "Docling Parser"},
    },
)

result = response.json()["data"]
print(f"Succeeded: {result['succeeded']}, Failed: {result['failed']}")
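Since each file in the response carries its own `status`, a small helper can split out failures for a follow-up attempt (the function is illustrative, not part of any SDK):

```python
def split_batch_results(data):
    """Partition a reingest-batch response's data object into
    succeeded and failed file-ID lists."""
    ok = [r["file_id"] for r in data["results"] if r["status"] != "failed"]
    bad = [r["file_id"] for r in data["results"] if r["status"] == "failed"]
    return ok, bad
```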
GET /api/notebooks/{notebook_id}/documents/¶
List all documents with their current status.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
DELETE /api/notebooks/{notebook_id}/documents/{file_id}¶
Delete a document and all related data (vectors, records, enhanced chunks, raw data, cache entries).
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
POST /api/notebooks/{notebook_id}/documents/delete-batch¶
Delete multiple documents in a single request.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | application/json |
Body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| file_ids | array | Yes | -- | List of file IDs to delete |
| Code | Cause |
|---|---|
| 400 | No file_ids provided |
| 401 | Invalid or missing token |
| 403 | Non-admin user |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

response = httpx.post(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/delete-batch",
    headers={"Authorization": f"Bearer {token}"},
    json={"file_ids": ["f1a2b3c4", "f5e6d7c8"]},
)
print(f"Deleted: {response.json()['data']['deleted']}")
```
GET /api/notebooks/{notebook_id}/documents/settings¶
Get the ingestion settings used for a specific file.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_id | string | Yes | File ID to get settings for |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
GET /api/notebooks/{notebook_id}/documents/context-state¶
Get the contextual retrieval state for the notebook -- shows which files have been through context augmentation.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
GET /api/notebooks/{notebook_id}/documents/{file_id}/stage¶
Get the current workflow stage for a file (useful for tracking ingestion progress).
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
| 404 | File job not found |
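The stage endpoint lends itself to polling until ingestion settles. A transport-agnostic sketch: `fetch_stage` is any callable returning the current stage string, and the terminal stage names used here are assumptions, not documented API values:

```python
import time

def wait_for_stage(fetch_stage, done=("completed",), failed=("error",),
                   interval=2.0, timeout=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_stage() until it reports a terminal stage or timeout elapses."""
    deadline = clock() + timeout
    while clock() < deadline:
        stage = fetch_stage()
        if stage in done:
            return stage
        if stage in failed:
            raise RuntimeError(f"ingestion failed at stage {stage!r}")
        sleep(interval)
    raise TimeoutError("ingestion did not reach a terminal stage in time")
```

Injecting `sleep` and `clock` keeps the loop easy to test and to cancel from a caller.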
GET /api/notebooks/{notebook_id}/documents/errors¶
List all files with ingestion errors in this notebook.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

response = httpx.get(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/errors",
    headers={"Authorization": f"Bearer {token}"},
)
errors = response.json()["data"]
for err in errors:
    print(f"{err['file_name']}: {err['error_message']}")
```
POST /api/notebooks/{notebook_id}/documents/{file_id}/mark-error¶
Manually mark a file as errored with a custom error message and stage.
Auth: Admin
Headers:
| Header | Value |
|---|---|
| Authorization | Bearer <token> |
| Content-Type | application/json |
Body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| error_message | string | Yes | -- | Error description |
| error_stage | string | Yes | -- | Pipeline stage where error occurred |
Status: 200 OK
| Code | Cause |
|---|---|
| 401 | Invalid or missing token |
| 403 | Non-admin user |
```python
import httpx

notebook_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
file_id = "f1a2b3c4-..."

response = httpx.post(
    f"http://localhost:8000/api/notebooks/{notebook_id}/documents/{file_id}/mark-error",
    headers={"Authorization": f"Bearer {token}"},
    json={"error_message": "Manual abort", "error_stage": "upload"},
)
print(response.json()["data"])
```