Architecture

This document explains how the Docsifer codebase is structured and why.

Goals

  1. Correctness: eliminate the bugs identified in the audit (SSRF, path traversal, broken cookies, race conditions, deprecated FastAPI APIs).
  2. Production-grade safety: never crash a free-tier container under load.
  3. Operability: config-driven, observable, well-tested.
  4. Performance: minimize redundant I/O, reuse expensive clients.

Module map

docsifer/
├── api/              # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py        # request-id, body limit, security headers
│   ├── schemas.py           # Pydantic request/response models
│   └── v1/
│       ├── convert.py       # POST /v1/convert
│       ├── stats.py         # GET  /v1/stats
│       └── health.py        # GET  /v1/healthz, /v1/readyz
├── core/             # Pure business logic
│   ├── service.py           # DocsiferService
│   ├── llm_registry.py      # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py      # selectolax-based HTML scrub
│   ├── tokenizer.py         # tiktoken wrapper with fallbacks
│   ├── mime.py              # safe MIME detection
│   └── url_guard.py         # SSRF protection
├── analytics/        # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py           # ISO 8601 week numbering
│   └── store.py             # AnalyticsStore protocol + Upstash & in-memory
├── safety/           # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/               # Gradio UI (optional)
│   └── gradio_app.py
├── config.py         # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py           # FastAPI app factory + lifespan

Request flow (POST /v1/convert)

Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id          → X-Request-ID header + ContextVar
  ├─ body_limit          → 413 if Content-Length > MAX
  └─ security_headers    → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON forms via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
       │
       ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │     ├─ no api_key → cached `MarkItDown()`
  │     └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path  → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)

The route returns an ORJSONResponse; large markdown payloads are gzipped by the GZipMiddleware.
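
The worker-thread and timeout steps in the flow above can be sketched as follows. This is a minimal illustration, not the project's handler: the helper name run_with_timeout and its parameters are assumptions introduced here, and only standard-library calls are used.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


async def run_with_timeout(
    executor: ThreadPoolExecutor,
    fn: Callable[..., Any],
    *args: Any,
    timeout: float,
) -> Any:
    """Run blocking `fn` in a worker thread, giving up after `timeout` seconds.

    Hypothetical helper: the name and signature are illustrative only.
    """
    loop = asyncio.get_running_loop()
    # The blocking call runs off the event loop, so other requests keep flowing.
    future = loop.run_in_executor(executor, fn, *args)
    # Raises asyncio.TimeoutError when the deadline passes.
    return await asyncio.wait_for(future, timeout=timeout)
```

One caveat of this pattern: when the deadline passes, the awaiting coroutine is released, but a thread that is already running cannot be interrupted and continues until the blocking call returns. A bounded executor keeps such stragglers from piling up.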

Lifespan

docsifer.main._lifespan owns the lifecycle of every singleton:

  • DocsiferService: built once, has a bounded ThreadPoolExecutor.
  • AnalyticsService: loads totals from Upstash, kicks off the background sync loop, flushes pending counters on shutdown.
  • ConversionGate, PerIPLimiter, ResourceGuard: no I/O, just config.
  • disk_cleanup_loop: periodic temp-file sweeper.
  • memory_watchdog_loop: optional, sends SIGTERM on RSS overrun.

All background tasks share an asyncio.Event so shutdown is deterministic.
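
A shared stop event of this kind might drive each background loop roughly as shown below. The function name periodic and its parameters are hypothetical; the sketch only illustrates how waiting on the event instead of sleeping lets a loop exit promptly at shutdown.

```python
import asyncio
from typing import Awaitable, Callable


async def periodic(
    stop: asyncio.Event,
    interval: float,
    tick: Callable[[], Awaitable[None]],
) -> None:
    """Call `tick` every `interval` seconds until `stop` is set.

    Hypothetical helper, not taken from the Docsifer codebase.
    """
    while not stop.is_set():
        await tick()
        try:
            # Sleep for `interval`, but wake immediately if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # Interval elapsed without a stop request: run the next tick.
```

On shutdown, the lifespan handler would set the event once and then await every task, so no loop is left sleeping through a long interval.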

Concurrency model

  • One global asyncio.Semaphore (in ConversionGate) bounds in-flight conversions. Anything beyond max_concurrent + max_queue is rejected with 503 + Retry-After.
  • PerIPLimiter enforces fairness so a single IP cannot monopolize the gate.
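  • ResourceGuard runs ahead of every conversion and checks RAM and disk headroom, so a request that would push the container toward an out-of-memory kill is stopped before work begins.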
  • All synchronous work happens in a dedicated ThreadPoolExecutor to keep the event loop responsive.
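
The admission rule in the first bullet can be sketched as a semaphore plus a counter of admitted requests. The class below is an illustration written for this document, not the contents of conversion_gate.py; the GateFull exception and the slot() method are assumed names.

```python
import asyncio
from contextlib import asynccontextmanager


class GateFull(Exception):
    """Raised when running plus queued conversions would exceed capacity."""


class ConversionGate:
    """Illustrative gate: `max_concurrent` run at once, `max_queue` may wait."""

    def __init__(self, max_concurrent: int, max_queue: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # requests currently running or waiting

    @asynccontextmanager
    async def slot(self):
        # Reject up front instead of letting the wait queue grow without bound.
        if self._admitted >= self._capacity:
            raise GateFull()
        self._admitted += 1
        try:
            async with self._sem:  # waits here while all run slots are busy
                yield
        finally:
            self._admitted -= 1
```

A route handler would wrap the conversion in `async with gate.slot():` and translate GateFull into a 503 response with a Retry-After header. The counter needs no lock because it is only touched from the event loop thread and no await occurs between the check and the increment.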

Analytics

  • Increments are stored in an in-process pending counter and applied to a totals counter atomically. The lock-free _snapshot dict is what /v1/stats returns, so readers are never blocked by writers.
  • The background sync loop pipelines HINCRBY operations into Upstash so a full flush is a single HTTP round-trip when the server supports pipelines.
  • Failures keep pending intact for the next attempt; on shutdown the service performs one final flush.
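
The pending/totals/snapshot split described above might look like the sketch below. It is an assumption-laden illustration: the class name PendingCounters, the push callback, and the use of a threading.Lock are choices made here, and the real service talks to Upstash rather than to a callback.

```python
import threading
from collections import Counter
from typing import Callable, Dict


class PendingCounters:
    """Illustrative counters: writers lock, readers get an immutable-by-convention copy."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Counter = Counter()   # not yet pushed to the remote store
        self._totals: Counter = Counter()    # everything seen so far
        self._snapshot: Dict[str, int] = {}  # what readers are handed

    def increment(self, key: str, n: int = 1) -> None:
        with self._lock:
            self._pending[key] += n
            self._totals[key] += n
            # Rebinding the attribute is one reference swap, so a reader sees
            # either the old dict or the new one, never a half-updated dict.
            self._snapshot = dict(self._totals)

    def snapshot(self) -> Dict[str, int]:
        return self._snapshot  # no lock taken on the read path

    def flush(self, push: Callable[[Dict[str, int]], None]) -> bool:
        with self._lock:
            batch = dict(self._pending)
            self._pending.clear()
        if not batch:
            return True
        try:
            push(batch)  # e.g. one pipelined request carrying every increment
        except Exception:
            with self._lock:
                self._pending.update(batch)  # Counter.update adds the counts back
            return False
        return True
```

Taking the batch out under the lock and restoring it on failure means increments that arrive while a push is in flight are never lost or double-counted.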

Security posture

  • SSRF: every URL goes through validate_url(), which resolves the host and rejects private/loopback/link-local/multicast addresses by default.
  • Path traversal: Path(filename).name strips any directory component.
  • Body limit: enforced by middleware before the body is consumed.
  • Extension allowlist: server-side allowlist independent of the UI.
  • CORS: allow_credentials is automatically disabled when origins contain * (which would otherwise be invalid per the spec).
  • Error responses: never echo internal exception messages; only the public_message of the typed exception is returned.
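
The SSRF and path-traversal checks can be illustrated with the standard library alone. The functions below are a sketch under stated assumptions, not the code in url_guard.py: check_url, is_blocked_address and safe_filename are names invented for this example, and the real validate_url() may apply further rules.

```python
import ipaddress
import socket
from pathlib import Path
from typing import List
from urllib.parse import urlsplit


def is_blocked_address(ip: str) -> bool:
    """True for the address classes the document lists as rejected."""
    addr = ipaddress.ip_address(ip.split("%", 1)[0])  # drop any IPv6 zone id
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
    )


def check_url(url: str) -> List[str]:
    """Resolve the host and raise ValueError if any address is blocked.

    Hypothetical helper; returns the resolved addresses on success.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("unsupported URL")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise ValueError("cannot resolve host") from exc
    addresses = [info[4][0] for info in infos]
    # Every resolved address must pass, not just the first one.
    if any(is_blocked_address(ip) for ip in addresses):
        raise ValueError("blocked address")
    return addresses


def safe_filename(filename: str) -> str:
    """Keep only the final path component of an uploaded file name."""
    return Path(filename).name
```

A check of this kind validates the addresses at resolution time only. If the HTTP client resolves the host again when it connects, the answer can differ, so connecting to one of the already validated addresses is the safer design.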

Free-tier defaults

The defaults in config.py target a 2 vCPU / 16 GB RAM container (Hugging Face Spaces basic):

  • max_upload_bytes = 10 MB
  • max_concurrent_conversions = 2
  • max_queue_depth = 10
  • max_per_ip_concurrent = 1
  • request_timeout_sec = 55 (just under HF's 60 s gateway)
  • analytics_sync_interval_sec = 1800
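
For reference, the defaults above can be written down as a small settings object. The project's config.py uses pydantic-settings; the frozen dataclass below is a standard-library stand-in so the sketch runs without extra dependencies. The DOCSIFER_ environment prefix and the reading of 10 MB as 10 * 1024 * 1024 bytes are assumptions.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Stand-in for the pydantic-settings model; field names follow the list above."""

    max_upload_bytes: int = 10 * 1024 * 1024  # assumed binary megabytes
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55
    analytics_sync_interval_sec: int = 1800

    @classmethod
    def from_env(
        cls,
        env: Optional[Mapping[str, str]] = None,
        prefix: str = "DOCSIFER_",  # hypothetical prefix
    ) -> "Settings":
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                overrides[f.name] = int(raw)
        return cls(**overrides)
```

With these values, at most 2 conversions run at once and 10 more may wait, so the thirteenth simultaneous request is the first one rejected.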