How Pruna Optimizes Models: Tracing the Smash Function and DAG Execution
Introduction: What Actually Happens During a Pruna Smash?
Modern optimization libraries often expose a single API call: you pass in a model, select a few optimizations, and receive a faster version of the same model. What is less well understood is the orchestration layer that decides how those optimizations are validated, ordered, and finally wrapped into a deployable runtime artifact. This article explores how that pipeline works in practice.
Pruna’s smash() function is a good example of this abstraction in practice. Under the surface, the process behaves like an optimization pipeline driven by configuration, dependency ordering, and algorithm composition. Quantization, compilation, caching, kernels, and other transformations are applied to the model in a structured flow.
The diagram below illustrates a view of that pipeline.
The workflow separates into three clearly scoped stages:
- The user defines a SmashConfig, which serves as the optimization strategy for the entire run.
- Once smash() is invoked, Pruna validates the requested algorithms and determines a valid execution order for them.
- The smash function then applies the corresponding optimizations via the internal AlgorithmRegistry and wraps the transformed model in the PrunaModel interface.
This structure becomes especially important once multiple optimizations interact with each other. Certain transformations may need to happen before compilation, others may modify runtime behavior, and some combinations can alter determinism, memory layout, or kernel execution paths.
From SmashConfig to Execution Pipeline
At the center of this process is determine_algorithm_order(), which constructs a directed acyclic graph (DAG) representing the relationships between the selected optimization algorithms.
An important detail to observe is that Pruna also performs cross-compatibility validation before execution. This prevents incompatible combinations from silently producing unstable or undefined behavior later in the pipeline. By explicitly modeling dependencies, compatibility constraints, and execution ordering, the framework can coordinate multiple optimizations in a structured and reproducible manner.
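The idea of validating compatibility and then deriving an order from a DAG can be sketched with the Python standard library. This is a minimal illustration, not Pruna's implementation: the `RUNS_AFTER` and `INCOMPATIBLE` tables, the `plan_order` helper, and the `other_quantizer` name are all hypothetical and exist only for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical constraints for illustration; these are not Pruna's real tables.
# Each key must run after every algorithm listed in its value set.
RUNS_AFTER = {
    "hqq": set(),
    "torch_compile": {"hqq"},  # compile the already-quantized model
}

# Hypothetical pairs that must not be combined in a single run.
INCOMPATIBLE = {frozenset({"hqq", "other_quantizer"})}


def plan_order(selected):
    """Validate the selection, then return one valid execution order."""
    chosen = set(selected)
    for pair in INCOMPATIBLE:
        if pair <= chosen:
            raise ValueError(f"incompatible algorithms: {sorted(pair)}")
    # Keep only the edges between algorithms that were actually selected.
    graph = {name: RUNS_AFTER.get(name, set()) & chosen for name in chosen}
    try:
        # static_order() yields every node after all of its predecessors.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("ordering constraints contain a cycle") from exc


order = plan_order(["torch_compile", "hqq"])
print(order)
```

Failing fast on an incompatible pair, before any transformation touches the model, is what keeps a bad combination from surfacing later as unstable behavior.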
Tracing a Smashed Model Through Optimization Passes
After understanding how Pruna constructs and schedules its optimization pipeline, the next step is to inspect how those optimizations behave in practice.
To explore this, we compare two inference pipelines built around the same model, Llama-3.2-1B-Instruct, and the same optimization strategy:
- A manually assembled HQQ + torch.compile runtime
- A Pruna smashed runtime using the same underlying optimization strategy (HQQ + torch.compile)
We first process the raw output values, then benchmark the manually assembled baseline runtime against the Pruna-optimized (“smashed”) runtime across a set of visualizations. Because both setups rely on the same underlying optimization primitives, differences should primarily reflect how those optimizations are coordinated and executed during inference. The resulting benchmarks revealed several interesting differences between the manually assembled runtime and the smashed pipeline.
Memory Stability
The memory stability measurements further reinforce the execution differences between the smashed runtime and the manually assembled HQQ + torch.compile pipeline.
The memory stability envelope reveals a clear divergence in scaling behavior between the two runtimes as generation length increases. Pruna maintains a consistently lower peak memory footprint across all sequence lengths and exhibits a much tighter stability band. In contrast, the HQQ baseline shows progressively steeper memory growth and wider variability at longer generations.
- As generation length increases, the HQQ pipeline exhibits near-linear memory growth, rising from roughly 1.09 GB to nearly 1.49 GB at 4096 tokens.
- Pruna scales substantially more efficiently, increasing only modestly from approximately 1.09 GB to 1.19 GB over the same range.
KV Cache Growth
The KV cache growth measurements exhibit behavior that differs from that of the latency and stability benchmarks. Unlike throughput and prefill execution, both pipelines exhibit similar KV cache scaling characteristics across all tested generation lengths.
As generation length increases, the fixed runtime memory overhead associated with model initialization, graph preparation, and execution buffers becomes increasingly amortized across a larger number of generated tokens. As a result, the effective memory growth per generated token decreases progressively during longer decode workloads.
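The amortization effect can be made concrete with a small calculation. The sketch below assumes a simple model in which total memory is a fixed overhead plus a constant marginal cost per token. The two constants are hypothetical values chosen only to show the shape of the curve; they are not measurements from the benchmark.

```python
def memory_per_token_mb(fixed_mb, marginal_mb_per_token, tokens):
    """Average memory attributed to each token when a fixed cost is shared."""
    return (fixed_mb + marginal_mb_per_token * tokens) / tokens


# Hypothetical values, not benchmark measurements.
FIXED_MB = 64.0      # one-time overhead: initialization, graph preparation, buffers
MARGINAL_MB = 0.03   # assumed cache growth per generated token

lengths = (64, 256, 1024, 4096)
per_token = [memory_per_token_mb(FIXED_MB, MARGINAL_MB, n) for n in lengths]

for n, value in zip(lengths, per_token):
    print(f"{n:>5} tokens -> {value:.4f} MB per token")
```

Under this model the per-token figure falls toward the marginal cost as the fixed term is spread over more tokens, which matches the trend described above.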
This is an important observation because it isolates where the runtime differences are not occurring. The large throughput and latency improvements observed in the smashed runtime do not appear to originate from fundamentally different KV cache allocation strategies.
Peak Memory
The peak GPU memory measurements reveal an interesting contrast between the two pipelines. Both runtimes settle at the same baseline after warm-up, but they diverge as generation length grows.
Both runtimes show a similar warm-up pattern at the very start. At 64 tokens, peak memory is high, around 1.32 GB for HQQ and 1.13 GB for Pruna, because the first generation pays the one-time cost of weight materialization, compilation caches, and initial workspace allocations. By the time generation reaches 128 tokens, both pipelines settle into their steady-state baseline near 1.09 GB, which is the realistic floor for decode-time memory usage.
The HQQ + torch.compile pipeline scales peak memory roughly linearly with generation length. From 128 tokens up to 4096 tokens, peak memory climbs from ~1.09 GB to ~1.49 GB, an increase of nearly 400 MB over the decode window. The memory curve is essentially a straight line, which suggests that allocation cost is proportional to sequence length rather than amortized across it.
The Pruna smashed pipeline behaves very differently. After the same 128-token settle point, peak memory grows only marginally, reaching ~1.19 GB at 4096 tokens. That is roughly a 100 MB increase across the entire decode window, about 4× less memory growth than the manually assembled pipeline over the same range. The curve is nearly flat, which suggests the orchestration layer is reusing pre-allocated buffers and keeping the compiled graph's memory plan stable.
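The growth figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the approximate readings stated in the text and assumes 1 GB is counted as 1000 MB; the dictionary keys and helper names are chosen for illustration.

```python
# Approximate peak-memory readings quoted in the text, in GB:
# (value at the 128-token settle point, value at 4096 tokens).
SETTLE_TOKENS, END_TOKENS = 128, 4096
readings_gb = {
    "hqq_manual": (1.09, 1.49),
    "pruna_smashed": (1.09, 1.19),
}


def growth_mb(start_gb, end_gb):
    """Memory growth over the decode window, assuming 1 GB = 1000 MB."""
    return (end_gb - start_gb) * 1000.0


def slope_kb_per_token(start_gb, end_gb):
    """Average growth per generated token, in KB (1 MB = 1000 KB)."""
    return growth_mb(start_gb, end_gb) * 1000.0 / (END_TOKENS - SETTLE_TOKENS)


growth = {name: growth_mb(*vals) for name, vals in readings_gb.items()}
ratio = growth["hqq_manual"] / growth["pruna_smashed"]

for name, vals in readings_gb.items():
    print(f"{name}: {growth[name]:.0f} MB, {slope_kb_per_token(*vals):.1f} KB/token")
print(f"growth ratio: {ratio:.1f}x")
```

With these rounded inputs the manual pipeline grows by about 101 KB per generated token and the smashed pipeline by about 25 KB, which is where the roughly 4× figure comes from.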
Aggregate Benchmarks
The aggregate benchmark combines results from all decode workloads into a single view. This matters because it smooths out individual prompt-length fluctuations and reveals three important properties:
- Performance: How fast the system runs on average
- Stability: How consistent the runtime remains across executions
- Efficiency: How much memory and compute overhead is required
Since both pipelines use the same underlying optimizations (HQQ quantization and torch.compile), we can attribute the differences to how the runtimes are orchestrated.
Prefill Latency: Time to First Token
Prefill latency represents the cost of processing the prompt before token generation begins. In production systems, this largely determines time to first token (TTFT) and directly affects perceived responsiveness.
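One way to separate time to first token from decode speed is to time a token stream and record when the first token arrives. The sketch below is not the harness used for this article: `measure_stream` and `fake_generator` are hypothetical helpers, and the generator only sleeps to imitate a model.

```python
import time


def measure_stream(token_stream):
    """Return (ttft_seconds, decode_tokens_per_second) for an iterator of tokens."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in token_stream:
        if first_at is None:
            first_at = time.perf_counter()  # first token marks the end of prefill
        count += 1
    end = time.perf_counter()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    decode_time = end - first_at
    # Throughput counts only the tokens produced after the first one.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps


def fake_generator(prefill_s, per_token_s, n_tokens):
    """Stand-in for a model: sleep for prefill, then once per decode step."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield i


ttft, tps = measure_stream(fake_generator(0.02, 0.002, 20))
print(f"TTFT: {ttft * 1000:.1f} ms, decode: {tps:.1f} tokens/sec")
```

On a GPU, kernel launches are asynchronous, so a real harness would also need to synchronize the device before each clock reading for the timings to be meaningful.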
| Metric (seconds) | Pruna | Base |
|---|---|---|
| Mean | 0.022 | 2.123 |
| Std | 0.006 | 7.888 |
| Max | 0.032 | 29.529 |
Pruna consistently completes prefill in roughly 22 milliseconds, while the base runtime averages over 2 seconds. Pruna's standard deviation is only 6 milliseconds while the base runtime varies by almost 8 seconds. This suggests the manual pipeline is repeatedly triggering expensive graph recompilation steps, while Pruna keeps the compiled graphs stable and reusable across runs.
Decode Throughput: Generation Speed
Decode throughput (tokens/sec) measures how many output tokens can be generated each second once generation has started.
| Metric (tokens/sec) | Pruna | Base |
|---|---|---|
| Mean | 82.58 | 92.23 |
| Std | 0.006 | 99.28 |
| Max | 0.48 | 26.38 |
At first glance, the base runtime appears slightly faster. However, the variance tells a different story. Pruna’s throughput varies by less than 1 token/sec, whereas the base runtime fluctuates by over 26 tokens/sec. The base runtime reaches higher peak throughput, but Pruna provides significantly more predictable generation performance.
Decode Latency: Cost Per Generated Token
Decode latency measures the average time required to generate each individual token.
| Metric (milliseconds) | Pruna | Base |
|---|---|---|
| Mean | 12.110 | 133.753 |
| Median | 12.111 | 10.072 |
| Std | 0.071 | 462.771 |
Pruna maintains a mean per-token decode latency of 12.110 milliseconds with almost no variation, delivering consistent behavior. The base runtime averages 133.753 milliseconds per token with a massive 462.771 milliseconds standard deviation, even though its median is only 10.072 milliseconds. This indicates severe long-tail latency spikes, likely caused by recompilation during execution.
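A large gap between mean and median is the typical signature of a long tail. The sketch below uses hypothetical latencies, not the benchmark data, to show how a handful of multi-second stalls can inflate the mean and standard deviation while leaving the median untouched.

```python
import statistics

# Hypothetical per-token latencies in milliseconds (not benchmark data):
# 97 ordinary steps at 10 ms and 3 stalls at 4000 ms.
samples = [10.0] * 97 + [4000.0] * 3

mean = statistics.fmean(samples)
median = statistics.median(samples)
std = statistics.stdev(samples)
p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile

print(f"mean={mean:.1f} ms  median={median:.1f} ms  std={std:.1f} ms  p99={p99:.1f} ms")
```

Here only 3% of the steps are slow, yet the mean lands more than ten times above the median. Reporting the median or a high percentile alongside the mean makes this kind of behavior visible.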
Memory Usage
Peak memory reflects the maximum GPU memory consumed during execution.
| Metric | Pruna | Base |
|---|---|---|
| Peak Memory | 1.120 GB | 1.213 GB |
| Peak Memory Std | 0.052 GB | 0.142 GB |
| Memory / Token | 0.006 MB | 0.071 MB |
The base runtime consumes nearly 12× more memory per generated token. This becomes increasingly important at longer sequence lengths, where inefficient memory growth can significantly limit a system's scalability.
Why Pruna Performs Better
The results show that Pruna’s advantage does not come from applying fundamentally different optimization techniques. Both pipelines rely on the same quantization and compilation stack.
The improvement comes from stable runtime orchestration:
- Compiled graphs are reused instead of repeatedly rebuilt
- Execution order remains deterministic
- Memory planning stays stable across decode steps
As a result, Pruna achieves better memory efficiency and scalability with much lower latency variance. The overall effect is a runtime that behaves far more predictably under sustained workloads.
Conclusion
At first glance, the smash pipeline in Pruna may appear to be little more than a higher-level orchestration layer around existing optimization primitives such as quantization and torch.compile.
The Directed Acyclic Graph (DAG) system does more than simply determine the order of optimization passes. By coordinating compatibility constraints, execution ordering, runtime preparation, and graph execution behavior, the smash pipeline appears to materially influence the realized inference characteristics of the final runtime.
As optimization stacks become increasingly layered and heterogeneous, performance is no longer determined solely by individual techniques such as quantization or compilation. The orchestration layer coordinating those components increasingly becomes part of the optimization itself.
Acknowledgements
I would like to acknowledge the Pruna AI team for building the framework and for the insightful documentation. Special thanks to David Berenstein for his advice on structuring this article and for helping to shape the direction of the analysis.







