| # Architecture |
|
|
| This document explains how the Docsifer codebase is structured and why. |
|
|
| ## Goals |
|
|
| 1. **Correctness**: eliminate the bugs identified in the audit (SSRF, |
| path traversal, broken cookies, race conditions, deprecated FastAPI APIs). |
| 2. **Production-grade safety**: never crash a free-tier container under load. |
| 3. **Operability**: config-driven, observable, well-tested. |
| 4. **Performance**: minimize redundant I/O, reuse expensive clients. |
|
|
| ## Module map |
|
|
| ``` |
| docsifer/ |
| ├── api/                        # HTTP layer |
| │   ├── deps.py |
| │   ├── error_handlers.py |
| │   ├── middleware.py           # request-id, body limit, security headers |
| │   ├── schemas.py              # Pydantic request/response models |
| │   └── v1/ |
| │       ├── convert.py          # POST /v1/convert |
| │       ├── stats.py            # GET /v1/stats |
| │       └── health.py           # GET /v1/healthz, /v1/readyz |
| ├── core/                       # Pure business logic |
| │   ├── service.py              # DocsiferService |
| │   ├── llm_registry.py         # TTL-cached MarkItDown(LLM) instances |
| │   ├── html_cleaner.py         # selectolax-based HTML scrub |
| │   ├── tokenizer.py            # tiktoken wrapper with fallbacks |
| │   ├── mime.py                 # safe MIME detection |
| │   └── url_guard.py            # SSRF protection |
| ├── analytics/                  # Lifespan-managed analytics |
| │   ├── service.py |
| │   ├── periods.py              # ISO 8601 week numbering |
| │   └── store.py                # AnalyticsStore protocol + Upstash & in-memory |
| ├── safety/                     # Anti-crash primitives (Section N of the audit) |
| │   ├── conversion_gate.py |
| │   ├── per_ip_limiter.py |
| │   ├── resource_guard.py |
| │   ├── circuit_breaker.py |
| │   ├── memory_watchdog.py |
| │   └── disk_cleanup.py |
| ├── ui/                         # Gradio UI (optional) |
| │   └── gradio_app.py |
| ├── config.py                   # pydantic-settings |
| ├── exceptions.py |
| ├── logging_config.py |
| └── main.py                     # FastAPI app factory + lifespan |
| ``` |
|
|
| ## Request flow (POST /v1/convert) |
|
|
| ``` |
| Client |
|   │ multipart/form-data |
|   ▼ |
| [ middleware ] |
|   ├─ request_id → X-Request-ID header + ContextVar |
|   ├─ body_limit → 413 if Content-Length > MAX |
|   └─ security_headers → X-Content-Type-Options, X-Frame-Options, … |
|   │ |
|   ▼ |
| [ route handler convert.py ] |
|   ├─ parse JSON forms via Pydantic (422 on errors) |
|   ├─ acquire PerIPLimiter slot (429 on overflow) |
|   ├─ acquire ConversionGate slot (503 when full) |
|   ├─ stream upload to disk in worker thread |
|   ├─ ResourceGuard checks RAM/disk |
|   ├─ asyncio.wait_for() guards against runaway conversions |
|   └─ DocsiferService.convert_file(...) |
|   │ |
|   ▼ |
| [ DocsiferService (core) ] |
|   ├─ choose converter: |
|   │   ├─ no api_key → cached `MarkItDown()` |
|   │   └─ otherwise → LLMRegistry.get(LLMConfig) |
|   ├─ HTML path → clean in-memory → MarkItDown.convert_stream() |
|   └─ everything else → normalize_extension() → MarkItDown.convert(path) |
| ``` |
|
|
| The route returns an `ORJSONResponse`; large markdown payloads are gzipped by |
| the `GZipMiddleware`. |
|
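The timeout step in the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's handler: `run_with_timeout`, `slow_convert`, and `ConversionTimeout` are hypothetical names, and only the use of `asyncio.wait_for()` around work running in a thread pool comes from the diagram.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


class ConversionTimeout(Exception):
    """Hypothetical typed error raised when a conversion exceeds its budget."""


def slow_convert(path: str, delay: float) -> str:
    # Stand-in for a blocking conversion call (assumption for the sketch).
    time.sleep(delay)
    return f"# converted {path}"


async def run_with_timeout(
    executor: ThreadPoolExecutor, path: str, delay: float, timeout_sec: float
) -> str:
    loop = asyncio.get_running_loop()
    # Blocking work runs in the pool so the event loop stays responsive.
    future = loop.run_in_executor(executor, slow_convert, path, delay)
    try:
        return await asyncio.wait_for(future, timeout=timeout_sec)
    except asyncio.TimeoutError:
        # The caller gets an answer in time, but a thread that is already
        # running cannot be interrupted and finishes in the background.
        raise ConversionTimeout(f"conversion exceeded {timeout_sec}s") from None
```

Because the worker thread keeps running after a timeout, a bounded pool matters: a few runaway conversions can otherwise hold every worker until they finish.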
|
| ## Lifespan |
|
|
| `docsifer.main._lifespan` owns the lifecycle of every singleton: |
|
|
| - `DocsiferService`: built once, has a bounded `ThreadPoolExecutor`. |
| - `AnalyticsService`: loads totals from Upstash, kicks off the |
| background sync loop, **flushes pending counters on shutdown**. |
| - `ConversionGate`, `PerIPLimiter`, `ResourceGuard`: no I/O, just config. |
| - `disk_cleanup_loop`: periodic temp-file sweeper. |
| - `memory_watchdog_loop`: optional, sends SIGTERM on RSS overrun. |
|
|
| All background tasks share an `asyncio.Event` so shutdown is deterministic. |
|
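A minimal sketch of the shared-event shutdown pattern described above, assuming a simplified `lifespan` that takes a log list instead of the FastAPI app. The `periodic` helper, the task names, and the intervals are illustrative, not the real loops.

```python
import asyncio
import contextlib


async def periodic(name: str, interval_sec: float, stop: asyncio.Event, log: list) -> None:
    """Run until `stop` is set, waking early when shutdown is requested."""
    while not stop.is_set():
        log.append(f"{name}:tick")  # stand-in for the real periodic work
        try:
            # Sleep for one interval, but return immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_sec)
        except asyncio.TimeoutError:
            continue
    log.append(f"{name}:stopped")


@contextlib.asynccontextmanager
async def lifespan(log: list):
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 0.01, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 0.01, stop, log)),
    ]
    try:
        yield
    finally:
        # One event stops every loop; gather() waits until each has exited.
        stop.set()
        await asyncio.gather(*tasks)


async def demo() -> list:
    log: list = []
    async with lifespan(log):
        await asyncio.sleep(0.05)  # the application would serve requests here
    return log
```

Waiting on the event instead of calling `asyncio.sleep()` lets each loop exit as soon as shutdown starts rather than at the end of its current interval.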
|
| ## Concurrency model |
|
|
| - One global `asyncio.Semaphore` (in `ConversionGate`) bounds in-flight |
| conversions. Anything beyond `max_concurrent + max_queue` is rejected with |
| `503 + Retry-After`. |
| - `PerIPLimiter` enforces fairness so a single IP cannot monopolize the gate. |
| - `ResourceGuard` runs ahead of every conversion to short-circuit OOM. |
| - All synchronous work happens in a dedicated `ThreadPoolExecutor` to keep |
| the event loop responsive. |
|
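The admission rule in the first bullet can be sketched like this. It is a hedged illustration rather than the actual `ConversionGate`: the `slot()` context manager, the `GateFull` exception, and the `retry_after_sec` parameter are assumptions. Only the semaphore and the `max_concurrent + max_queue` bound come from the text.

```python
import asyncio
import contextlib


class GateFull(Exception):
    """Hypothetical error; an HTTP layer could map it to 503 + Retry-After."""

    def __init__(self, retry_after_sec: int) -> None:
        super().__init__("conversion gate is full")
        self.retry_after_sec = retry_after_sec


class ConversionGate:
    def __init__(self, max_concurrent: int, max_queue: int, retry_after_sec: int = 5) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # running plus waiting callers
        self._retry_after_sec = retry_after_sec

    @contextlib.asynccontextmanager
    async def slot(self):
        # There is no await between the check and the increment, so the
        # admission decision cannot interleave with another task on this loop.
        if self._admitted >= self._capacity:
            raise GateFull(self._retry_after_sec)
        self._admitted += 1
        try:
            async with self._sem:  # waits here while all workers are busy
                yield
        finally:
            self._admitted -= 1


async def demo() -> dict:
    gate = ConversionGate(max_concurrent=1, max_queue=1)
    results = {}

    async def job(name: str) -> None:
        try:
            async with gate.slot():
                await asyncio.sleep(0.02)  # stand-in for a conversion
                results[name] = "ok"
        except GateFull as exc:
            results[name] = f"rejected, retry after {exc.retry_after_sec}s"

    await asyncio.gather(job("a"), job("b"), job("c"))
    return results
```

With one worker and one queue slot, the third concurrent caller is rejected immediately instead of waiting, which keeps queueing latency bounded.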
|
| ## Analytics |
|
|
| - Increments are stored in an in-process `pending` counter and applied to a |
| `totals` counter atomically. The lock-free `_snapshot` dict is what |
| `/v1/stats` returns, so readers are never blocked by writers. |
| - The background sync loop pipelines `HINCRBY` operations into Upstash so a |
| full flush is a single HTTP round-trip when the server supports pipelines. |
| - Failures keep `pending` intact for the next attempt; on shutdown the |
| service performs one final flush. |
|
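The pending/totals/snapshot split can be sketched as below. This is a simplified, synchronous stand-in under stated assumptions: the class name `AnalyticsCounters` and the `send` callable are hypothetical, and the real service flushes to Upstash from an async loop instead of calling a plain function.

```python
import threading
from collections import Counter
from types import MappingProxyType


class AnalyticsCounters:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending = Counter()  # increments not yet persisted
        self._totals = Counter()   # everything counted so far
        self._snapshot = MappingProxyType({})

    def increment(self, key: str, amount: int = 1) -> None:
        with self._lock:
            self._pending[key] += amount
            self._totals[key] += amount
            # Publish a fresh read-only copy; readers only load the reference.
            self._snapshot = MappingProxyType(dict(self._totals))

    def snapshot(self):
        return self._snapshot  # no lock taken on the read path

    def flush(self, send) -> bool:
        """Send pending counts via `send(batch)`; keep them if it raises."""
        with self._lock:
            batch = dict(self._pending)
        if not batch:
            return True
        try:
            send(batch)
        except Exception:
            return False  # pending is untouched, so the next flush retries
        with self._lock:
            # Subtract only what was sent; increments that arrived during
            # the send stay pending.
            self._pending.subtract(batch)
            self._pending = +self._pending  # drop entries that reached zero
        return True
```

Subtracting the sent batch instead of clearing `pending` avoids losing increments recorded while the network call was in flight.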
|
| ## Security posture |
|
|
| - **SSRF**: every URL goes through `validate_url()` which resolves the host |
| and rejects private/loopback/link-local/multicast addresses by default. |
| - **Path traversal**: `Path(filename).name` strips any directory component. |
| - **Body limit**: enforced by middleware before the body is consumed. |
| - **Extension allowlist**: server-side allowlist independent of the UI. |
| - **CORS**: `allow_credentials` is automatically disabled when origins |
| contain `*` (which would otherwise be invalid per the spec). |
| - **Error responses**: never echo internal exception messages; only the |
| `public_message` of the typed exception is returned. |
|
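The SSRF and path-traversal checks above might look roughly like the sketch below. Only the names `validate_url()` and `Path(filename).name` come from this document. The body of `validate_url`, the `UnsafeURL` exception, and the `is_blocked_address` and `safe_filename` helpers are assumptions made for illustration.

```python
import ipaddress
import socket
from pathlib import Path
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical error for URLs that must not be fetched."""


def is_blocked_address(ip_text: str) -> bool:
    # Drop an IPv6 zone id such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_multicast


def validate_url(url: str) -> str:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("only absolute http(s) URLs are allowed")
    try:
        infos = socket.getaddrinfo(
            parts.hostname, parts.port or 443, type=socket.SOCK_STREAM
        )
    except socket.gaierror as exc:
        raise UnsafeURL("host could not be resolved") from exc
    # Reject if any resolved address is internal, not only the first one.
    for info in infos:
        if is_blocked_address(info[4][0]):
            raise UnsafeURL("host resolves to a blocked address")
    return url


def safe_filename(filename: str) -> str:
    name = Path(filename).name  # drops every directory component
    if name in ("", ".", ".."):
        raise ValueError("filename has no usable name component")
    return name
```

A check like this resolves the host once. If the HTTP client resolves it again when connecting, the answer may differ, so a stricter setup would connect to the already-validated address.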
|
| ## Free-tier defaults |
|
|
| The defaults in `config.py` target a 2 vCPU / 16 GB RAM container (Hugging |
| Face Spaces basic): |
|
|
| - `max_upload_bytes = 10 MB` |
| - `max_concurrent_conversions = 2` |
| - `max_queue_depth = 10` |
| - `max_per_ip_concurrent = 1` |
| - `request_timeout_sec = 55` (just under HF's 60 s gateway) |
| - `analytics_sync_interval_sec = 1800` |
|
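For orientation, the defaults above could be expressed as the settings object sketched below. The project uses pydantic-settings; this stand-in uses a standard-library dataclass so it runs without extra dependencies. The `DOCSIFER_` environment prefix, the `from_env` helper, and the reading of "10 MB" as 10 × 1024 × 1024 bytes are assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    max_upload_bytes: int = 10 * 1024 * 1024  # "10 MB"; binary megabytes assumed
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55  # just under the 60 s gateway limit
    analytics_sync_interval_sec: int = 1800

    @classmethod
    def from_env(cls, env=None, prefix: str = "DOCSIFER_") -> "Settings":
        """Build settings, overriding defaults from PREFIX + FIELD_NAME variables."""
        env = os.environ if env is None else env
        overrides = {}
        for field in fields(cls):
            raw = env.get(prefix + field.name.upper())
            if raw is not None:
                overrides[field.name] = int(raw)  # every field here is an int
        return cls(**overrides)
```

With these defaults the gate admits at most `2 + 10 = 12` requests at a time (running plus queued), and a deployment with more memory could raise the limits through the environment without a code change.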
|