Architecture
This document explains how the Docsifer codebase is structured and why.
Goals
- Correctness: eliminate the bugs identified in the audit (SSRF, path traversal, broken cookies, race conditions, deprecated FastAPI APIs).
- Production-grade safety: never crash a free-tier container under load.
- Operability: config-driven, observable, well tested.
- Performance: minimize redundant I/O, reuse expensive clients.
Module map
docsifer/
├── api/                     # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py        # request-id, body limit, security headers
│   ├── schemas.py           # Pydantic request/response models
│   └── v1/
│       ├── convert.py       # POST /v1/convert
│       ├── stats.py         # GET /v1/stats
│       └── health.py        # GET /v1/healthz, /v1/readyz
├── core/                    # Pure business logic
│   ├── service.py           # DocsiferService
│   ├── llm_registry.py      # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py      # selectolax-based HTML scrub
│   ├── tokenizer.py         # tiktoken wrapper with fallbacks
│   ├── mime.py              # safe MIME detection
│   └── url_guard.py         # SSRF protection
├── analytics/               # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py           # ISO 8601 week numbering
│   └── store.py             # AnalyticsStore protocol + Upstash & in-memory
├── safety/                  # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/                      # Gradio UI (optional)
│   └── gradio_app.py
├── config.py                # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py                  # FastAPI app factory + lifespan
Request flow (POST /v1/convert)
Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id        → X-Request-ID header + ContextVar
  ├─ body_limit        → 413 if Content-Length > MAX
  └─ security_headers  → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON forms via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
  │
  ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │    ├─ no api_key → cached `MarkItDown()`
  │    └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)
The route returns an ORJSONResponse; large markdown payloads are gzipped by
the GZipMiddleware.
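
The timeout step in the flow above can be illustrated with a minimal sketch. It assumes the blocking conversion runs in a worker thread and is awaited through `asyncio.wait_for()`. The names `ConversionTimeout`, `slow_convert`, and `convert_with_timeout` are hypothetical and not taken from the codebase.

```python
import asyncio
import time


class ConversionTimeout(Exception):
    """Hypothetical typed error; a handler could map it to a timeout response."""


def slow_convert(seconds: float) -> str:
    # Stand-in for a blocking conversion call (an assumption for this sketch).
    time.sleep(seconds)
    return "# converted"


async def convert_with_timeout(seconds: float, timeout: float) -> str:
    """Run the blocking conversion off the event loop and bound its duration."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_convert, seconds), timeout
        )
    except asyncio.TimeoutError:
        # The request is released here, but the worker thread itself
        # keeps running until slow_convert() returns.
        raise ConversionTimeout(f"conversion exceeded {timeout}s") from None


if __name__ == "__main__":
    print(asyncio.run(convert_with_timeout(0.0, 1.0)))
```

One caveat worth keeping in mind: `asyncio.wait_for()` only stops waiting; it cannot interrupt a thread. A bounded executor and the gate are therefore still needed to cap how many runaway conversions can pile up.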
Lifespan
docsifer.main._lifespan owns the lifecycle of every singleton:
- `DocsiferService`: built once, has a bounded `ThreadPoolExecutor`.
- `AnalyticsService`: loads totals from Upstash, kicks off the background sync loop, flushes pending counters on shutdown.
- `ConversionGate`, `PerIPLimiter`, `ResourceGuard`: no I/O, just config.
- `disk_cleanup_loop`: periodic temp-file sweeper.
- `memory_watchdog_loop`: optional, sends SIGTERM on RSS overrun.
All background tasks share an asyncio.Event so shutdown is deterministic.
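
A minimal sketch of the shared-event pattern, assuming each background loop waits on the event instead of sleeping unconditionally. The helper `periodic`, the `log` list, and the intervals are illustrative assumptions; the real `_lifespan` may be structured differently.

```python
import asyncio
import contextlib


async def periodic(name: str, interval: float, stop: asyncio.Event, log: list) -> None:
    """Do one unit of work per interval until `stop` is set."""
    while not stop.is_set():
        log.append(f"{name}: tick")
        # Wait for the interval, but wake immediately on shutdown.
        with contextlib.suppress(asyncio.TimeoutError):
            await asyncio.wait_for(stop.wait(), timeout=interval)
    log.append(f"{name}: stopped")


@contextlib.asynccontextmanager
async def lifespan(log: list):
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 60.0, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 1800.0, stop, log)),
    ]
    try:
        yield
    finally:
        stop.set()                    # one signal reaches every loop
        await asyncio.gather(*tasks)  # shutdown completes only when all exit


async def main() -> list:
    log: list = []
    async with lifespan(log):
        await asyncio.sleep(0.01)     # the application would serve requests here
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because every loop exits through the same event, no task needs to be cancelled mid-operation, which is what makes the shutdown order predictable.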
Concurrency model
- One global
asyncio.Semaphore(inConversionGate) bounds in-flight conversions. Anything beyondmax_concurrent + max_queueis rejected with503 + Retry-After. PerIPLimiterenforces fairness so a single IP cannot monopolize the gate.ResourceGuardruns ahead of every conversion to short-circuit OOM.- All synchronous work happens in a dedicated
ThreadPoolExecutorto keep the event loop responsive.
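
The gate behaviour described above could look roughly like the sketch below: a semaphore limits running conversions, and a plain counter caps running plus queued requests. `GateFull`, `slot()`, and `retry_after` are hypothetical names, and the actual `ConversionGate` may track its queue differently.

```python
import asyncio
import contextlib


class GateFull(Exception):
    """Hypothetical error a handler would translate to 503 + Retry-After."""

    def __init__(self, retry_after: int) -> None:
        super().__init__("conversion gate is full")
        self.retry_after = retry_after


class ConversionGate:
    def __init__(self, max_concurrent: int, max_queue: int, retry_after: int = 5) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0            # running + waiting requests
        self._retry_after = retry_after

    @contextlib.asynccontextmanager
    async def slot(self):
        # Reject before queueing so waiters cannot accumulate without bound.
        if self._admitted >= self._capacity:
            raise GateFull(self._retry_after)
        self._admitted += 1
        try:
            async with self._sem:     # waits here while all slots are busy
                yield
        finally:
            self._admitted -= 1


async def demo() -> list:
    gate = ConversionGate(max_concurrent=1, max_queue=1)
    release = asyncio.Event()
    results: list = []

    async def job(name: str) -> None:
        try:
            async with gate.slot():
                await release.wait()
                results.append(f"{name}: done")
        except GateFull as exc:
            results.append(f"{name}: 503 retry-after={exc.retry_after}")

    tasks = [asyncio.create_task(job(n)) for n in ("a", "b", "c")]
    await asyncio.sleep(0)            # let all three attempt admission
    release.set()
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the first request runs, the second waits in the queue, and the third is rejected immediately. The check and the increment happen without an intervening `await`, so no extra lock is needed on a single event loop.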
Analytics
- Increments are stored in an in-process `pending` counter and applied to a `totals` counter atomically. The lock-free `_snapshot` dict is what `/v1/stats` returns, so readers are never blocked by writers.
- The background sync loop pipelines `HINCRBY` operations into Upstash so a full flush is a single HTTP round-trip when the server supports pipelines.
- Failures keep `pending` intact for the next attempt; on shutdown the service performs one final flush.
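
The counter scheme above can be sketched as follows. This is an illustration under assumptions, not the service's actual code: it uses a `threading.Lock` for writers, publishes the snapshot by swapping in a fresh dict, and takes the remote call as a `send` callable in place of the Upstash pipeline. The class name `AnalyticsCounters` is hypothetical.

```python
import threading
from collections import Counter


class AnalyticsCounters:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending = Counter()   # not yet flushed to the remote store
        self._totals = Counter()    # everything counted so far
        self._snapshot = {}         # replaced wholesale, never mutated in place

    def incr(self, key: str, n: int = 1) -> None:
        with self._lock:
            self._pending[key] += n
            self._totals[key] += n
            # Publish a new dict; readers holding the old one are unaffected.
            self._snapshot = dict(self._totals)

    def snapshot(self) -> dict:
        # No lock: reading a single attribute reference is enough.
        return self._snapshot

    def flush(self, send) -> bool:
        """Send pending counts; subtract them only after `send` succeeds."""
        with self._lock:
            batch = dict(self._pending)
        if not batch:
            return True
        try:
            send(batch)             # e.g. one pipelined request of HINCRBY calls
        except Exception:
            return False            # pending is untouched, retried next time
        with self._lock:
            self._pending.subtract(batch)
            self._pending = +self._pending  # drop keys that reached zero
        return True
```

Subtracting the sent batch, rather than clearing the counter, preserves any increments that arrived while the request was in flight. The sketch assumes a single sync loop; concurrent flushes would need additional coordination.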
Security posture
- SSRF: every URL goes through `validate_url()`, which resolves the host and rejects private/loopback/link-local/multicast addresses by default.
- Path traversal: `Path(filename).name` strips any directory component.
- Body limit: enforced by middleware before the body is consumed.
- Extension allowlist: server-side allowlist independent of the UI.
- CORS: `allow_credentials` is automatically disabled when origins contain `*` (which would otherwise be invalid per the spec).
- Error responses: never echo internal exception messages; only the `public_message` of the typed exception is returned.
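
As an illustration of the SSRF and path-traversal checks above, here is a hedged sketch built on the standard library only. The signature of `validate_url()` shown here, the injectable `resolve` parameter, and the names `UnsafeURL`, `default_resolve`, `is_public_ip`, and `safe_filename` are assumptions for the example; the project's own functions may differ.

```python
import ipaddress
import socket
from pathlib import Path
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical typed error for rejected URLs."""


def default_resolve(host: str) -> list:
    # Collect every address the host resolves to (strip IPv6 scope ids).
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0].split("%")[0] for info in infos})


def is_public_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped  # judge ::ffff:a.b.c.d by its IPv4 form
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )


def validate_url(url: str, resolve=default_resolve) -> list:
    """Return the resolved addresses, or raise UnsafeURL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("only http(s) URLs with a host are allowed")
    ips = list(resolve(parts.hostname))
    # Every resolved address must be public, not just the first one.
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise UnsafeURL("host resolves to a non-public address")
    return ips


def safe_filename(filename: str) -> str:
    """Keep only the final path component of a client-supplied name."""
    name = Path(filename).name
    if name in ("", ".", ".."):
        # Extra guard added for this sketch: Path("..").name is "..".
        raise ValueError("filename has no usable final component")
    return name
```

Two limitations apply to this sketch. A validation-time lookup does not by itself prevent DNS rebinding, so the fetch should connect to one of the returned addresses. On POSIX, `Path` does not treat backslashes as separators, so Windows-style names would need separate handling.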
Free-tier defaults
The defaults in config.py target a 2 vCPU / 16 GB RAM container (Hugging
Face Spaces basic):
- `max_upload_bytes = 10 MB`
- `max_concurrent_conversions = 2`
- `max_queue_depth = 10`
- `max_per_ip_concurrent = 1`
- `request_timeout_sec = 55` (just under HF's 60 s gateway)
- `analytics_sync_interval_sec = 1800`
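
To show how these defaults and environment overrides might fit together, the sketch below uses a frozen dataclass as a standard-library stand-in for the pydantic-settings model in `config.py`. The `DOCSIFER_` prefix, the `load_settings` helper, and the reading of "10 MB" as 10 × 1024 × 1024 bytes are assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    max_upload_bytes: int = 10 * 1024 * 1024  # "10 MB", assumed binary megabytes
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55             # just under the 60 s gateway
    analytics_sync_interval_sec: int = 1800


def load_settings(env=None, prefix: str = "DOCSIFER_") -> Settings:
    """Build Settings, letting PREFIX + FIELD_NAME variables override defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is not None:
            overrides[field.name] = int(raw)  # all fields here are integers
    return Settings(**overrides)


if __name__ == "__main__":
    settings = load_settings()
    admitted = settings.max_concurrent_conversions + settings.max_queue_depth
    print(f"at most {admitted} requests admitted at once")
```

With the defaults listed above, the gate admits at most 2 running plus 10 queued conversions, so 12 requests in total before new ones are rejected.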