title: Docsifer
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Convert documents into clean, LLM-ready Markdown.

📚 Docsifer

Convert documents into clean, LLM-ready Markdown.

PDF · Word · PowerPoint · Excel · HTML · Audio · Image · CSV · JSON · ZIP

Features

Multi-format — PDF, Office, audio (Whisper), images (vision), HTML, CSV, JSON, ZIP, and more — powered by MarkItDown.
Optional LLM — bring your own OpenAI-compatible key for OCR, transcription and structured layout extraction.
Production-grade — bounded concurrency, per-IP fairness, memory watchdog, disk cleanup, circuit breaker.
Hardened — SSRF guard, path-traversal sanitation, body-size limits, security headers.
Observable — JSON logs with request id, /v1/healthz, /v1/readyz, /v1/stats.
Privacy-first — files are processed in-memory and discarded immediately.
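To make the "SSRF guard" item concrete, here is a minimal sketch of the kind of check such a guard typically performs before fetching a user-supplied URL. It is an illustration under stated assumptions, not Docsifer's actual code: the function name `is_url_allowed` and its `allow_private` parameter are hypothetical, chosen to echo the `DOCSIFER_URL_ALLOW_PRIVATE_NETWORKS` setting.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_url_allowed(url: str, allow_private: bool = False) -> bool:
    """Return True if `url` is http(s) and resolves only to public addresses.

    Hypothetical helper for illustration; not part of Docsifer's API.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. an unclosed IPv6 bracket
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if allow_private:
        return True
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except (socket.gaierror, UnicodeError):
        return False  # unresolvable hosts are rejected rather than fetched
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        if not addr.is_global:
            return False  # loopback, private, link-local, reserved, ...
    return True
```

A check like this only covers the first hop. A production guard would also need to validate redirects and pin the resolved address for the actual fetch, since DNS answers can change between the check and the request.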

Quickstart

# Local
make install
cp .env.example .env
make run

# Docker
docker build -t docsifer .
docker run --rm -p 7860:7860 --env-file .env docsifer

Open http://localhost:7860 for the UI or http://localhost:7860/docs for the API.

API

Method	Path	Description
POST	`/v1/convert`	Convert a file or URL to Markdown
GET	`/v1/stats`	Usage analytics snapshot
GET	`/v1/healthz`	Liveness probe
GET	`/v1/readyz`	Readiness probe
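When scripting against a freshly started container, it can help to wait for the readiness probe before sending work. The sketch below polls `/v1/readyz` using only the standard library. It assumes the endpoint answers with HTTP 200 once the service is ready, which is the usual convention but is not stated in this README. The names `http_status` and `wait_until_ready` are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(
    base_url: str,
    attempts: int = 10,
    delay_sec: float = 1.0,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Poll /v1/readyz until it returns 200 or the attempts run out."""
    url = base_url.rstrip("/") + "/v1/readyz"
    for attempt in range(attempts):
        if fetch(url) == 200:  # assumption: 200 means "ready"
            return True
        if attempt < attempts - 1:
            time.sleep(delay_sec)
    return False
```

For a local instance, `wait_until_ready("http://localhost:7860")` returns `True` once the probe succeeds. The `fetch` parameter exists so the polling logic can be exercised without a running server.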

Examples

Basic conversion:

curl -X POST http://localhost:7860/v1/convert \
     -F "file=@document.pdf"

With LLM enhancement:

curl -X POST http://localhost:7860/v1/convert \
     -F "file=@page.html" \
     -F 'openai={"api_key":"sk-...","model":"gpt-4o-mini"}'

Convert a URL:

curl -X POST http://localhost:7860/v1/convert \
     -F "url=https://example.com/article"
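The same requests can be issued from Python. The sketch below builds the multipart form by hand with the standard library, mirroring the `file`, `url`, and `openai` form fields used in the curl examples above. The helper name `build_convert_request` is hypothetical, and the file part is sent as `application/octet-stream`, on the assumption that the server detects the format from the filename or content.

```python
import json
import urllib.request
import uuid
from typing import Optional


def build_convert_request(
    base_url: str,
    *,
    url: Optional[str] = None,
    file_name: Optional[str] = None,
    file_bytes: Optional[bytes] = None,
    openai: Optional[dict] = None,
) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/convert without sending it."""
    boundary = uuid.uuid4().hex
    parts = []

    def add_field(name: str, value: str) -> None:
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )

    if url is not None:
        add_field("url", url)
    if openai is not None:
        add_field("openai", json.dumps(openai))  # JSON string, as in the curl example
    if file_name is not None and file_bytes is not None:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/convert",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` and read the response body. The response schema is not described in this README, so check http://localhost:7860/docs before parsing it.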

Configuration

All settings are environment-driven (prefix DOCSIFER_). See .env.example for the full list. Common knobs:

Variable	Default	Purpose
`DOCSIFER_MAX_UPLOAD_BYTES`	`10MB`	Hard upload limit
`DOCSIFER_MAX_CONCURRENT_CONVERSIONS`	`2`	Global parallelism
`DOCSIFER_MAX_QUEUE_DEPTH`	`10`	Maximum queued requests; further requests are rejected with 503
`DOCSIFER_MAX_PER_IP_CONCURRENT`	`1`	Per-IP fairness
`DOCSIFER_REQUEST_TIMEOUT_SEC`	`55`	Conversion timeout
`DOCSIFER_REDIS_URL` / `_TOKEN`	local	Upstash Redis for analytics
`DOCSIFER_URL_ALLOW_PRIVATE_NETWORKS`	`false`	Allow URLs that resolve to private networks; keep `false` to block SSRF
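The three concurrency knobs interact, so a small model may help when tuning them. The sketch below is one plausible reading of the table, not Docsifer's implementation: the `AdmissionGate` class is hypothetical, and it assumes that a client over its per-IP limit is rejected outright rather than queued.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AdmissionGate:
    """Hypothetical admission control; not Docsifer's actual implementation."""

    max_concurrent: int = 2    # DOCSIFER_MAX_CONCURRENT_CONVERSIONS
    max_queue_depth: int = 10  # DOCSIFER_MAX_QUEUE_DEPTH
    max_per_ip: int = 1        # DOCSIFER_MAX_PER_IP_CONCURRENT
    running: int = 0
    queued: int = 0
    per_ip: Counter = field(default_factory=Counter)

    def admit(self, ip: str) -> str:
        """Return 'run', 'queue', or 'reject' for a new request from `ip`."""
        if self.per_ip[ip] >= self.max_per_ip:
            return "reject"  # assumption: over-limit clients are not queued
        if self.running < self.max_concurrent:
            self.running += 1
            decision = "run"
        elif self.queued < self.max_queue_depth:
            self.queued += 1
            decision = "queue"
        else:
            return "reject"  # queue full; the table documents a 503 here
        self.per_ip[ip] += 1
        return decision

    def finish(self, ip: str) -> None:
        """Release a running slot and promote one queued request, if any."""
        self.per_ip[ip] -= 1
        if self.queued > 0:
            self.queued -= 1  # a waiting request takes over the freed slot
        else:
            self.running -= 1
```

With the defaults, at most 2 conversions run and 10 wait, so the thirteenth simultaneous request from distinct clients would be turned away under this model.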

Architecture

docsifer/
├── api/         FastAPI layer — routes, schemas, middleware
├── core/        Pure logic — converter, MIME, tokenizer, LLM cache
├── analytics/   Lifespan-managed analytics (Upstash + in-memory)
├── safety/      Anti-crash primitives (gate, limiter, watchdog, breaker)
├── ui/          Optional Gradio playground
├── config.py    Pydantic settings
├── exceptions.py
├── logging_config.py
└── main.py      App factory + lifespan

See ARCHITECTURE.md for the full design notes.

Development

make install   # runtime + dev deps
make lint      # ruff
make format    # ruff format
make type      # mypy
make test      # pytest
make cov       # pytest with coverage