# Architecture
This document explains how the Docsifer codebase is structured and why.
## Goals
1. **Correctness**: eliminate the bugs identified in the audit (SSRF,
   path traversal, broken cookies, race conditions, deprecated FastAPI APIs).
2. **Production-grade safety**: never crash a free-tier container under load.
3. **Operability**: config-driven, observable, well-tested.
4. **Performance**: minimize redundant I/O, reuse expensive clients.
## Module map
```
docsifer/
├── api/                      # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py         # request-id, body limit, security headers
│   ├── schemas.py            # Pydantic request/response models
│   └── v1/
│       ├── convert.py        # POST /v1/convert
│       ├── stats.py          # GET /v1/stats
│       └── health.py         # GET /v1/healthz, /v1/readyz
├── core/                     # Pure business logic
│   ├── service.py            # DocsiferService
│   ├── llm_registry.py       # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py       # selectolax-based HTML scrub
│   ├── tokenizer.py          # tiktoken wrapper with fallbacks
│   ├── mime.py               # safe MIME detection
│   └── url_guard.py          # SSRF protection
├── analytics/                # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py            # ISO 8601 week numbering
│   └── store.py              # AnalyticsStore protocol + Upstash & in-memory
├── safety/                   # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/                       # Gradio UI (optional)
│   └── gradio_app.py
├── config.py                 # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py                   # FastAPI app factory + lifespan
```
## Request flow (POST /v1/convert)
```
Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id        → X-Request-ID header + ContextVar
  ├─ body_limit        → 413 if Content-Length > MAX
  └─ security_headers  → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON forms via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
  │
  ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │    ├─ no api_key → cached `MarkItDown()`
  │    └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path       → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)
```
The route returns an `ORJSONResponse`; large markdown payloads are gzipped by
the `GZipMiddleware`.
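
The `body_limit` step in the flow above can be illustrated with a dependency-free ASGI sketch. This is not the code in `middleware.py`; the class name `BodyLimitMiddleware` and the `max_bytes` parameter are hypothetical, and the sketch only inspects the declared `Content-Length`, as the diagram describes.

```python
import asyncio

MAX_BODY_BYTES = 10 * 1024 * 1024  # assumption: mirrors max_upload_bytes


class BodyLimitMiddleware:
    """Hypothetical pure-ASGI middleware: reject oversized requests early."""

    def __init__(self, app, max_bytes=MAX_BODY_BYTES):
        self.app = app
        self.max_bytes = max_bytes

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            headers = dict(scope.get("headers", []))
            raw = headers.get(b"content-length")
            # Only the declared length is checked; the body is never read here.
            if raw is not None and raw.isdigit() and int(raw) > self.max_bytes:
                await send({
                    "type": "http.response.start",
                    "status": 413,
                    "headers": [(b"content-type", b"text/plain")],
                })
                await send({"type": "http.response.body",
                            "body": b"Payload too large"})
                return
        # Within the limit (or not HTTP): hand over to the wrapped app.
        await self.app(scope, receive, send)
```

A request sent without a `Content-Length` header (for example, chunked transfer encoding) passes this check, so a real deployment would likely also count bytes while streaming the upload.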
## Lifespan
`docsifer.main._lifespan` owns the lifecycle of every singleton:
- `DocsiferService`: built once, has a bounded `ThreadPoolExecutor`.
- `AnalyticsService`: loads totals from Upstash, kicks off the
  background sync loop, **flushes pending counters on shutdown**.
- `ConversionGate`, `PerIPLimiter`, `ResourceGuard`: no I/O, just config.
- `disk_cleanup_loop`: periodic temp-file sweeper.
- `memory_watchdog_loop`: optional, sends SIGTERM on RSS overrun.
All background tasks share an `asyncio.Event` so shutdown is deterministic.
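
A minimal sketch of the shared-event pattern, using only `asyncio` so it runs without FastAPI. The `periodic` helper, the `log` list, and the short intervals are illustrative assumptions, not the contents of `docsifer.main`.

```python
import asyncio
from contextlib import asynccontextmanager


async def periodic(name, interval, stop, log):
    """Run until `stop` is set; wake immediately on shutdown."""
    while not stop.is_set():
        log.append(f"{name}: tick")
        try:
            # Sleep for `interval`, but return early if shutdown is signalled.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed without a shutdown signal
    log.append(f"{name}: stopped")


@asynccontextmanager
async def lifespan(log):
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 0.01, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 0.01, stop, log)),
    ]
    try:
        yield
    finally:
        stop.set()                    # one signal reaches every loop
        await asyncio.gather(*tasks)  # wait for each loop to exit cleanly
        log.append("final flush")     # stand-in for the analytics flush
```

Because every loop waits on the same event instead of a bare `sleep`, shutdown does not have to wait out a full interval or cancel tasks mid-operation.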
## Concurrency model
- One global `asyncio.Semaphore` (in `ConversionGate`) bounds in-flight
conversions. Anything beyond `max_concurrent + max_queue` is rejected with
`503 + Retry-After`.
- `PerIPLimiter` enforces fairness so a single IP cannot monopolize the gate.
- `ResourceGuard` runs ahead of every conversion to short-circuit OOM.
- All synchronous work happens in a dedicated `ThreadPoolExecutor` to keep
the event loop responsive.
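
The gate's admit-or-reject behaviour could look roughly like the sketch below. It assumes a simple counter of admitted requests next to the semaphore; the `GateFull` exception and the `demo` coroutine are hypothetical names, and the real `ConversionGate` may be structured differently.

```python
import asyncio


class GateFull(Exception):
    """Would be mapped to 503 + Retry-After by the HTTP layer."""


class ConversionGate:
    def __init__(self, max_concurrent=2, max_queue=10):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # running + waiting

    async def __aenter__(self):
        if self._admitted >= self._capacity:
            raise GateFull  # reject instead of queueing without bound
        self._admitted += 1
        try:
            await self._sem.acquire()
        except BaseException:
            self._admitted -= 1  # cancelled while waiting in the queue
            raise
        return self

    async def __aexit__(self, *exc):
        self._sem.release()
        self._admitted -= 1


async def demo():
    # One running slot plus one queue slot; the third caller is rejected.
    gate = ConversionGate(max_concurrent=1, max_queue=1)
    release = asyncio.Event()
    results = []

    async def job(name):
        try:
            async with gate:
                await release.wait()
                results.append((name, "done"))
        except GateFull:
            results.append((name, "rejected"))

    tasks = [asyncio.create_task(job(n)) for n in "abc"]
    await asyncio.sleep(0)  # let all three jobs reach the gate
    release.set()
    await asyncio.gather(*tasks)
    return results
```

Counting waiters separately from the semaphore is what bounds the queue: a bare `Semaphore` would let an unlimited number of requests pile up waiting.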
## Analytics
- Increments are stored in an in-process `pending` counter and applied to a
  `totals` counter atomically. The lock-free `_snapshot` dict is what
  `/v1/stats` returns, so readers are never blocked by writers.
- The background sync loop pipelines `HINCRBY` operations into Upstash so a
full flush is a single HTTP round-trip when the server supports pipelines.
- Failures keep `pending` intact for the next attempt; on shutdown the
service performs one final flush.
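
The counter bookkeeping can be sketched as follows. This is an assumption-laden illustration: it uses a `threading.Lock` and a caller-supplied `send` callable in place of the real Upstash pipeline, and the class name `AnalyticsCounters` is hypothetical.

```python
import threading
from collections import Counter


class AnalyticsCounters:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = Counter()  # deltas not yet pushed to the store
        self._totals = Counter()   # running totals
        self._snapshot = {}        # replaced wholesale, never mutated

    def increment(self, key, n=1):
        with self._lock:
            self._pending[key] += n
            self._totals[key] += n
            self._snapshot = dict(self._totals)  # publish a fresh copy

    def snapshot(self):
        # A single attribute read: readers take no lock.
        return self._snapshot

    def flush(self, send):
        """Push pending deltas via `send`; restore them if the push fails."""
        with self._lock:
            batch = dict(self._pending)
            self._pending.clear()
        if not batch:
            return True
        try:
            send(batch)  # stand-in for one pipelined HINCRBY request
            return True
        except Exception:
            with self._lock:
                self._pending.update(batch)  # keep deltas for the next try
            return False
```

Readers always see a complete dict because writers swap in a new object rather than editing the published one.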
## Security posture
- **SSRF**: every URL goes through `validate_url()`, which resolves the host
  and rejects private/loopback/link-local/multicast addresses by default.
- **Path traversal**: `Path(filename).name` strips any directory component.
- **Body limit**: enforced by middleware before the body is consumed.
- **Extension allowlist**: server-side allowlist independent of the UI.
- **CORS**: `allow_credentials` is automatically disabled when origins
  contain `*` (which would otherwise be invalid per the spec).
- **Error responses**: never echo internal exception messages; only the
  `public_message` of the typed exception is returned.
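
The SSRF and path-traversal checks might be sketched with the standard library as below. The signature of `validate_url`, the `UnsafeURL` exception, and the `safe_filename` helper are assumptions for illustration; the project's `url_guard.py` may apply additional rules.

```python
import ipaddress
import socket
from pathlib import Path
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical error type for rejected URLs."""


def validate_url(url):
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("only absolute http(s) URLs are allowed")
    try:
        infos = socket.getaddrinfo(parts.hostname, None,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise UnsafeURL("host does not resolve") from exc
    for info in infos:
        # Drop any IPv6 zone suffix ("%eth0") before parsing the address.
        addr = ipaddress.ip_address(info[4][0].split("%")[0])
        if (addr.is_private or addr.is_loopback or addr.is_link_local
                or addr.is_multicast or addr.is_reserved
                or addr.is_unspecified):
            raise UnsafeURL("resolves to a non-public address")
    return url


def safe_filename(filename):
    # Treat backslashes as separators too, so Windows-style paths are
    # stripped on POSIX hosts as well (an extra step assumed here).
    name = Path(filename.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError("invalid filename")
    return name
```

Checking every resolved address matters because a hostname can map to several records. A fetcher that resolves the name again after validation could still be steered to a different address, so pinning the validated IP for the actual request is a common complement.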
## Free-tier defaults
The defaults in `config.py` target a 2 vCPU / 16 GB RAM container (Hugging
Face Spaces basic):
- `max_upload_bytes = 10 MB`
- `max_concurrent_conversions = 2`
- `max_queue_depth = 10`
- `max_per_ip_concurrent = 1`
- `request_timeout_sec = 55` (just under HF's 60 s gateway)
- `analytics_sync_interval_sec = 1800`
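
For reference, the defaults above can be written as a small settings object. The project uses pydantic-settings; this sketch substitutes a standard-library dataclass so it runs without dependencies. The `DOCSIFER_` environment prefix and the reading of "10 MB" as 10 MiB are assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    max_upload_bytes: int = 10 * 1024 * 1024  # assumption: 10 MB as MiB
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55             # just under the 60 s gateway
    analytics_sync_interval_sec: int = 1800

    @classmethod
    def from_env(cls, env=None, prefix="DOCSIFER_"):
        """Override defaults from variables such as DOCSIFER_MAX_QUEUE_DEPTH."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                overrides[f.name] = int(raw)
        return cls(**overrides)
```

With these values, at most `2 + 10 = 12` conversions are admitted at once, and any single IP holds at most one of those slots.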