metadata
title: Docsifer
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Convert documents into clean, LLM-ready Markdown.
📚 Docsifer
Convert documents into clean, LLM-ready Markdown.
PDF · Word · PowerPoint · Excel · HTML · Audio · Image · CSV · JSON · ZIP
Features
- Multi-format — PDF, Office, audio (Whisper), images (vision), HTML, CSV, JSON, ZIP, and more — powered by MarkItDown.
- Optional LLM — bring your own OpenAI-compatible key for OCR, transcription and structured layout extraction.
- Production-grade — bounded concurrency, per-IP fairness, memory watchdog, disk cleanup, circuit breaker.
- Hardened — SSRF guard, path-traversal sanitation, body-size limits, security headers.
- Observable — JSON logs with request id, /v1/healthz, /v1/readyz, /v1/stats.
- Privacy-first — files are processed in-memory and discarded immediately.
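To illustrate the SSRF guard listed above, a URL check might reject non-HTTP schemes and any host that resolves to a non-public address. This is a minimal sketch under stated assumptions, not Docsifer's actual code: `is_public_ip` and `is_url_allowed` are hypothetical names, and the caller is assumed to pass in the addresses it has already resolved for the host.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return True only for globally routable IPv4/IPv6 addresses."""
    return ipaddress.ip_address(address).is_global


def is_url_allowed(url: str, resolved_ips: list[str], allow_private: bool = False) -> bool:
    """Hypothetical guard: allow only http(s) URLs whose host resolves publicly.

    `resolved_ips` is supplied by the caller (assumption) so that the same
    addresses can be checked here and then used for the actual connection.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if allow_private:
        # Mirrors the idea of DOCSIFER_URL_ALLOW_PRIVATE_NETWORKS=true.
        return True
    if not resolved_ips:
        return False
    return all(is_public_ip(ip) for ip in resolved_ips)
```

A real guard would also need to connect to the address it checked, otherwise a second DNS lookup could return a different, private address.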
Quickstart
# Local
make install
cp .env.example .env
make run
# Docker
docker build -t docsifer .
docker run --rm -p 7860:7860 --env-file .env docsifer
Open http://localhost:7860 for the UI or http://localhost:7860/docs for the interactive API docs.
API
| Method | Path | Description |
|---|---|---|
| POST | /v1/convert | Convert a file or URL to Markdown |
| GET | /v1/stats | Usage analytics snapshot |
| GET | /v1/healthz | Liveness probe |
| GET | /v1/readyz | Readiness probe |
Examples
Basic conversion:
curl -X POST http://localhost:7860/v1/convert \
-F "file=@document.pdf"
With LLM enhancement:
curl -X POST http://localhost:7860/v1/convert \
-F "file=@page.html" \
-F 'openai={"api_key":"sk-...","model":"gpt-4o-mini"}'
Convert a URL:
curl -X POST http://localhost:7860/v1/convert \
-F "url=https://example.com/article"
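The same requests can be built from Python. The sketch below uses only the standard library and reuses the form field names from the curl examples (`file`, `url`, `openai`); the helper name `build_convert_request` is hypothetical, and the response format is not described here, so the example stops at constructing the request.

```python
import json
import urllib.request
import uuid


def build_convert_request(base_url, *, url=None, file_name=None,
                          file_bytes=None, openai=None):
    """Build a multipart/form-data POST for /v1/convert (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    parts = []

    def add_field(name, value):
        # One plain-text form field, equivalent to curl's -F "name=value".
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )

    if url is not None:
        add_field("url", url)
    if openai is not None:
        # The curl example sends the LLM options as a JSON string.
        add_field("openai", json.dumps(openai))
    if file_bytes is not None:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/v1/convert",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=60)` against a running instance.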
Configuration
All settings are environment-driven (prefix DOCSIFER_). See .env.example for the full list. Common knobs:
| Variable | Default | Purpose |
|---|---|---|
| DOCSIFER_MAX_UPLOAD_BYTES | 10MB | Hard upload limit |
| DOCSIFER_MAX_CONCURRENT_CONVERSIONS | 2 | Global parallelism |
| DOCSIFER_MAX_QUEUE_DEPTH | 10 | Reject with 503 when exceeded |
| DOCSIFER_MAX_PER_IP_CONCURRENT | 1 | Per-IP fairness |
| DOCSIFER_REQUEST_TIMEOUT_SEC | 55 | Conversion timeout |
| DOCSIFER_REDIS_URL / _TOKEN | local | Upstash Redis for analytics |
| DOCSIFER_URL_ALLOW_PRIVATE_NETWORKS | false | Keep disabled to block SSRF |
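As a rough picture of what "environment-driven" means here, the sketch below maps `DOCSIFER_`-prefixed variables onto typed fields with the defaults from the table. It uses a plain dataclass for illustration, whereas the project itself lists Pydantic settings in config.py; the names `Settings` and `load_settings` are hypothetical, and reading "10MB" as 10 MiB is an assumption.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    max_upload_bytes: int = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: float = 55.0
    url_allow_private_networks: bool = False


def load_settings(environ=None, prefix="DOCSIFER_"):
    """Read prefixed variables from `environ`, falling back to the defaults."""
    env = os.environ if environ is None else environ
    overrides = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is None:
            continue
        kind = type(field.default)
        if kind is bool:
            # Accept a few common spellings of "true".
            overrides[field.name] = raw.strip().lower() in ("1", "true", "yes", "on")
        else:
            overrides[field.name] = kind(raw)
    return Settings(**overrides)
```

For example, `load_settings({"DOCSIFER_MAX_QUEUE_DEPTH": "25"})` changes only the queue depth and leaves every other field at its default.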
Architecture
docsifer/
├── api/ FastAPI layer — routes, schemas, middleware
├── core/ Pure logic — converter, MIME, tokenizer, LLM cache
├── analytics/ Lifespan-managed analytics (Upstash + in-memory)
├── safety/ Anti-crash primitives (gate, limiter, watchdog, breaker)
├── ui/ Optional Gradio playground
├── config.py Pydantic settings
├── exceptions.py
├── logging_config.py
└── main.py App factory + lifespan
See ARCHITECTURE.md for the full design notes.
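For readers who want an intuition before opening the design notes, the gate and limiter under safety/ could combine a global concurrency cap, a bounded queue, and a per-IP cap roughly as follows. This is a sketch of the general technique, not the project's implementation: `Gate`, `Overloaded`, and `admit` are hypothetical names, and the defaults are taken from the configuration table.

```python
import asyncio
from collections import Counter
from contextlib import asynccontextmanager


class Overloaded(Exception):
    """Signals that a request should be rejected, for example with HTTP 503."""


class Gate:
    def __init__(self, max_concurrent=2, max_queue_depth=10, max_per_ip=1):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue_depth = max_queue_depth
        self._max_per_ip = max_per_ip
        self._waiting = 0          # requests currently queued for a slot
        self._per_ip = Counter()   # admitted or queued requests per client IP

    @asynccontextmanager
    async def admit(self, client_ip):
        # Per-IP fairness: one client cannot occupy every slot.
        if self._per_ip[client_ip] >= self._max_per_ip:
            raise Overloaded("per-IP limit reached")
        # Bounded queue: fail fast instead of piling up waiters.
        if self._slots.locked() and self._waiting >= self._max_queue_depth:
            raise Overloaded("queue is full")
        self._per_ip[client_ip] += 1
        self._waiting += 1
        try:
            try:
                await self._slots.acquire()
            finally:
                self._waiting -= 1
            try:
                yield
            finally:
                self._slots.release()
        finally:
            self._per_ip[client_ip] -= 1
```

A request handler would wrap the conversion in `async with gate.admit(client_ip):` and translate `Overloaded` into an error response.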
Development
make install # runtime + dev deps
make lint # ruff
make format # ruff format
make type # mypy
make test # pytest
make cov # pytest with coverage
License
MIT © Lam Hieu — built on top of the wonderful MarkItDown.