docsifer / ARCHITECTURE.md

lamhieu

refactor(core): overhaul architecture for better performance, efficiency, and maintainability

41788c4 3 days ago

Architecture

This document explains how the Docsifer codebase is structured and why.

Goals

Correctness — eliminate the bugs identified in the audit (SSRF, path-traversal, broken cookies, race conditions, deprecated FastAPI APIs).
Production-grade safety — never crash a free-tier container under load.
Operability — config driven, observable, well-tested.
Performance — minimize redundant I/O, reuse expensive clients.

Module map

docsifer/
├── api/              # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py        # request-id, body limit, security headers
│   ├── schemas.py           # Pydantic request/response models
│   └── v1/
│       ├── convert.py       # POST /v1/convert
│       ├── stats.py         # GET  /v1/stats
│       └── health.py        # GET  /v1/healthz, /v1/readyz
├── core/             # Pure business logic
│   ├── service.py           # DocsiferService
│   ├── llm_registry.py      # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py      # selectolax-based HTML scrub
│   ├── tokenizer.py         # tiktoken wrapper with fallbacks
│   ├── mime.py              # safe MIME detection
│   └── url_guard.py         # SSRF protection
├── analytics/        # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py           # ISO 8601 week numbering
│   └── store.py             # AnalyticsStore protocol + Upstash & in-memory
├── safety/           # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/               # Gradio UI (optional)
│   └── gradio_app.py
├── config.py         # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py           # FastAPI app factory + lifespan

Request flow (POST /v1/convert)

Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id          → X-Request-ID header + ContextVar
  ├─ body_limit          → 413 if Content-Length > MAX
  └─ security_headers    → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON-encoded form fields via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
       │
       ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │     ├─ no api_key → cached `MarkItDown()`
  │     └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path  → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)

The route returns an ORJSONResponse; large markdown payloads are gzipped by the GZipMiddleware.
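
The timeout step in the flow above can be sketched as follows. This is a minimal illustration and not the project's handler: `slow_convert`, `convert_with_timeout`, and `ConversionTimeout` are hypothetical names, and `asyncio.to_thread` stands in for the service's own executor.

```python
import asyncio
import time


class ConversionTimeout(Exception):
    """Hypothetical typed error; the real project defines its own exceptions."""


def slow_convert(path: str, delay: float) -> str:
    # Stand-in for a blocking conversion call.
    time.sleep(delay)
    return f"# converted {path}"


async def convert_with_timeout(path: str, delay: float, timeout: float) -> str:
    # Run the blocking call off the event loop and stop waiting after `timeout` seconds.
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_convert, path, delay), timeout
        )
    except asyncio.TimeoutError:
        raise ConversionTimeout(path) from None
```

Note that `asyncio.wait_for` only stops waiting: a worker thread that has already started keeps running until the blocking call returns, which is one reason to bound the thread pool as well.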

Lifespan

docsifer.main._lifespan owns the lifecycle of every singleton:

DocsiferService — built once, has a bounded ThreadPoolExecutor.
AnalyticsService — loads totals from Upstash, kicks off the background sync loop, flushes pending counters on shutdown.
ConversionGate, PerIPLimiter, ResourceGuard — no I/O, just config.
disk_cleanup_loop — periodic temp-file sweeper.
memory_watchdog_loop — optional, sends SIGTERM on RSS overrun.

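All background tasks share a single asyncio.Event as their stop signal, so shutdown is deterministic: setting the event wakes every loop, and the lifespan then awaits each task before exiting.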
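
A shared stop event of this kind might look like the sketch below. It assumes nothing about the real `_lifespan` beyond what is described here: `periodic`, `lifespan`, and the `log` list are illustrative, and in FastAPI the lifespan function would receive the app instance instead.

```python
import asyncio
from contextlib import asynccontextmanager


async def periodic(name: str, interval: float, stop: asyncio.Event, log: list) -> None:
    """Do one unit of work per interval until `stop` is set."""
    while not stop.is_set():
        log.append(f"{name}:tick")
        try:
            # Sleep for `interval`, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    log.append(f"{name}:stopped")


@asynccontextmanager
async def lifespan(log: list):
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 0.01, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 0.01, stop, log)),
    ]
    try:
        yield
    finally:
        stop.set()                    # signal every loop at once
        await asyncio.gather(*tasks)  # wait until each loop has exited


async def demo() -> list:
    log: list = []
    async with lifespan(log):
        await asyncio.sleep(0.03)
    return log
```

Because each loop checks the event before doing more work, no task ticks again once shutdown has started.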

Concurrency model

One global asyncio.Semaphore (in ConversionGate) bounds in-flight conversions. Anything beyond max_concurrent + max_queue is rejected with 503 + Retry-After.
PerIPLimiter enforces fairness so a single IP cannot monopolize the gate.
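ResourceGuard runs ahead of every conversion and rejects the request early when free RAM or disk is too low, rather than risking an out-of-memory kill mid-conversion.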
All synchronous work happens in a dedicated ThreadPoolExecutor to keep the event loop responsive.
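
The gate described above can be approximated with a semaphore plus an admission counter. This is a sketch under assumptions, not the project's `ConversionGate`: the `GateFull` exception and the async context manager interface are hypothetical, and the 503 mapping is left to the caller.

```python
import asyncio


class GateFull(Exception):
    """Raised when running plus queued requests reach the configured bound."""


class ConversionGate:
    def __init__(self, max_concurrent: int, max_queue: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # requests running or waiting for a slot

    async def __aenter__(self) -> "ConversionGate":
        if self._admitted >= self._capacity:
            raise GateFull()  # caller turns this into 503 + Retry-After
        self._admitted += 1
        try:
            await self._sem.acquire()
        except BaseException:
            self._admitted -= 1  # cancelled while queued
            raise
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._sem.release()
        self._admitted -= 1


async def demo() -> list:
    gate = ConversionGate(max_concurrent=1, max_queue=1)
    results = []

    async def job(name: str) -> None:
        try:
            async with gate:
                await asyncio.sleep(0.01)
                results.append((name, 200))
        except GateFull:
            results.append((name, 503))

    await asyncio.gather(job("a"), job("b"), job("c"))
    return results
```

With one slot and one queue place, the third simultaneous request is rejected immediately instead of waiting, which keeps latency bounded under load.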

Analytics

Increments are stored in an in-process pending counter and applied to a totals counter atomically. The lock-free _snapshot dict is what /v1/stats returns — readers are never blocked by writers.
The background sync loop pipelines HINCRBY operations into Upstash so a full flush is a single HTTP round-trip when the server supports pipelines.
Failures keep pending intact for the next attempt; on shutdown the service performs one final flush.
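
The pending/totals/snapshot split can be illustrated as below. This is a simplified sketch: it uses a `threading.Lock` for writers and an injected `send` callable in place of the Upstash pipeline, both of which are assumptions; only the overall shape follows the description above.

```python
import threading
from collections import Counter
from types import MappingProxyType


class AnalyticsCounters:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending = Counter()  # not yet sent to the remote store
        self._totals = Counter()   # everything counted so far
        self._snapshot = MappingProxyType({})

    def increment(self, key: str, n: int = 1) -> None:
        with self._lock:
            self._pending[key] += n
            self._totals[key] += n
            # Publish a fresh read-only copy; readers of the old one are unaffected.
            self._snapshot = MappingProxyType(dict(self._totals))

    def snapshot(self):
        return self._snapshot  # plain attribute read, no lock taken

    def flush(self, send) -> bool:
        with self._lock:
            batch = dict(self._pending)
        if not batch:
            return True
        try:
            send(batch)  # e.g. one pipelined request
        except Exception:
            return False  # pending is left intact for the next attempt
        with self._lock:
            self._pending.subtract(batch)
            self._pending = +self._pending  # drop entries that reached zero
        return True
```

Subtracting the sent batch, rather than clearing the counter, preserves increments that arrive while the flush is in flight.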

Security posture

SSRF — every URL goes through validate_url() which resolves the host and rejects private/loopback/link-local/multicast addresses by default.
Path traversal — Path(filename).name strips any directory component.
Body limit — enforced by middleware before the body is consumed.
Extension allowlist — server-side allowlist independent of the UI.
CORS — allow_credentials is automatically disabled when origins contain * (which would otherwise be invalid per the spec).
Error responses — never echo internal exception messages; only the public_message of the typed exception is returned.
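
The SSRF and path-traversal checks can be sketched with the standard library alone. The signatures here are assumptions: the real `validate_url()` may differ, and `UnsafeURL`, `is_public_ip`, `safe_filename`, and the injectable `resolve` parameter are hypothetical. The backslash normalization is an addition for illustration, not something the source describes.

```python
import ipaddress
import socket
from pathlib import PurePosixPath
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical error type for rejected URLs."""


def is_public_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip.split("%")[0])  # drop any IPv6 zone id
    return not (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_multicast or addr.is_reserved or addr.is_unspecified
    )


def validate_url(url: str, resolve=socket.getaddrinfo) -> str:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("unsupported URL")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Reject the URL if any resolved address is non-public.
    for info in resolve(parts.hostname, port, type=socket.SOCK_STREAM):
        if not is_public_ip(info[4][0]):
            raise UnsafeURL("address not allowed")
    return url


def safe_filename(name: str) -> str:
    # Keep only the final path component, treating backslashes as separators.
    base = PurePosixPath(name.replace("\\", "/")).name
    if base in ("", ".."):
        raise ValueError("invalid filename")
    return base
```

A check like this validates the addresses seen at resolution time only; a fetcher that resolves the host again later would need to connect to the validated address to stay safe against DNS changes.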

Free-tier defaults

The defaults in config.py target a 2 vCPU / 16 GB RAM container (Hugging Face Spaces basic):

max_upload_bytes = 10 MB
max_concurrent_conversions = 2
max_queue_depth = 10
max_per_ip_concurrent = 1
request_timeout_sec = 55 (just under HF's 60 s gateway timeout)
analytics_sync_interval_sec = 1800
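
For illustration, the defaults above could be expressed as a settings object with environment overrides. The project uses pydantic-settings; this sketch substitutes a standard-library dataclass so it runs without dependencies. The `DOCSIFER_` prefix and `from_env` helper are hypothetical, and 10 MB is assumed to mean 10 * 1024 * 1024 bytes.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    max_upload_bytes: int = 10 * 1024 * 1024  # assumed binary megabytes
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55
    analytics_sync_interval_sec: int = 1800

    @classmethod
    def from_env(cls, env=None, prefix: str = "DOCSIFER_") -> "Settings":
        # Override any default with PREFIX + FIELD_NAME from the environment.
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                overrides[f.name] = int(raw)
        return cls(**overrides)
```

With these values, at most 2 conversions run and 10 more wait, so a thirteenth simultaneous request would be the first to be rejected.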