DKM: Differentiable K-Means Clustering Layer for Neural Network Compression
PyTorch implementation of the ICLR 2022 paper by Cho et al.
Paper (arXiv:2108.12659) | ICLR 2022
Overview
DKM casts k-means weight clustering as a differentiable attention problem, enabling joint optimization of DNN parameters and clustering centroids through standard backpropagation. Unlike prior weight-clustering methods that rely on hard assignments and approximated gradients, DKM uses soft attention-based assignment that is fully differentiable.
Key Innovation
Traditional: weights → hard k-means assignment → fixed centroids (not differentiable)
DKM: weights → attention-based soft assignment → differentiable centroids
The DKM layer:
- Computes a distance matrix D between weights W and centroids C
- Applies a row-wise softmax with temperature τ to the negated distances to get the attention matrix A = softmax(-D/τ), so nearer centroids receive larger attention
- Updates centroids: c_j = Σ_i(a_ij × w_i) / Σ_i(a_ij)
- Iterates until convergence
- Returns compressed weights: W̃ = A × C
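The steps above can be sketched in a few lines of dependency-free Python for the scalar (dim=1) case. This is only an illustration, not the repository's `DKMLayer`: the function names `attention_rows` and `soft_kmeans` are hypothetical, absolute difference is assumed as the distance, and the defaults `eps=1e-4` and `max_iter=5` simply mirror the convergence settings listed under Training Protocol.

```python
import math


def attention_rows(weights, centroids, tau):
    """Softmax over negated distances, one row per weight (each row sums to 1)."""
    rows = []
    for w in weights:
        logits = [-abs(w - c) / tau for c in centroids]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def soft_kmeans(weights, centroids, tau, eps=1e-4, max_iter=5):
    """Iterate attention -> centroid update, then return (centroids, compressed weights)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        rows = attention_rows(weights, centroids, tau)
        updated = []
        for j, old in enumerate(centroids):
            mass = sum(row[j] for row in rows)
            weighted = sum(row[j] * w for row, w in zip(rows, weights))
            # Keep the old centroid if it attracts no attention mass at all.
            updated.append(weighted / mass if mass > 0 else old)
        shift = sum(abs(n - o) for n, o in zip(updated, centroids))
        centroids = updated
        if shift <= eps:
            break
    rows = attention_rows(weights, centroids, tau)
    # Compressed weight = attention-weighted combination of centroids.
    compressed = [sum(a * c for a, c in zip(row, centroids)) for row in rows]
    return centroids, compressed


centroids, compressed = soft_kmeans([-1.0, -0.9, 1.0, 1.1], [-0.5, 0.5], tau=0.01)
```

With a small temperature the attention is close to one-hot, so the two centroids settle near the means of the two obvious groups. A real layer would do the same with tensor operations so that gradients reach both the weights and the centroids.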
Paper Results
| Model | Config | Top-1 Acc (%) | Size (MB) | Compression |
|---|---|---|---|---|
| ResNet50 | cv:6/6, fc:6/4 | 74.5 | 3.32 | 29.4× |
| MobileNet-v1 | cv:4/4, fc:4/2 | 63.9 | 0.72 | 22.4× |
| MobileNet-v2 | cv:2/1, fc:4/4 | 68.0 | 0.84 | 15.8× |
| DistilBERT | - | -1.1% acc drop | - | 11.8× |
Installation
git clone https://huggingface.co/syedmohaiminulhoque/dkm-compression
cd dkm-compression
pip install torch torchvision
Quick Start
import torch
import torch.nn as nn
import torchvision
from dkm import compress_model
from dkm.utils import print_compression_summary
# Load any pre-trained model
model = torchvision.models.resnet18(weights="DEFAULT")
# Compress with DKM (2-bit clustering)
compressor = compress_model(
    model,
    bits=2,                # k = 2^bits = 4 clusters
    dim=1,                 # scalar clustering (dim=1) or multi-dim
    tau=2e-5,              # temperature (controls softness of assignment)
    skip_first_last=True,  # skip first/last layers (per paper protocol)
)
# Print compression statistics
info = compressor.get_compression_info()
print_compression_summary(info)
# Train with standard PyTorch loop (paper: SGD, lr=0.008, momentum=0.9)
optimizer = torch.optim.SGD(compressor.parameters(), lr=0.008, momentum=0.9)
criterion = nn.CrossEntropyLoss()
compressor.train()
# `dataloader` is assumed to be a DataLoader yielding (images, labels) batches
for images, labels in dataloader:
    optimizer.zero_grad()
    outputs = compressor(images)
    loss = criterion(outputs, labels)
    loss.backward()  # Gradients flow through DKM attention layers
    optimizer.step()
# Snap to nearest centroids for inference
compressor.snap_weights()
# Export compressed model (codebook + assignments)
export = compressor.export_compressed()
torch.save(export, "compressed_model.pt")
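Snapping, as used at the end of the loop above, replaces each soft-clustered weight with a single centroid for inference. How `snap_weights()` does this internally is not shown here. The sketch below only illustrates the general idea of nearest-centroid hard assignment for scalar weights, using a hypothetical helper `snap_to_codebook`.

```python
def snap_to_codebook(weights, codebook):
    """Hard-assign each scalar weight to its nearest centroid.

    Returns (indices, snapped): the cluster index per weight and the
    centroid value that replaces it.
    """
    indices = [
        min(range(len(codebook)), key=lambda j: abs(w - codebook[j]))
        for w in weights
    ]
    snapped = [codebook[i] for i in indices]
    return indices, snapped


indices, snapped = snap_to_codebook([0.1, 0.9, -0.2], [0.0, 1.0])
```

After snapping, only the indices and the codebook need to be stored, which is what makes the exported model small.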
Multi-Dimensional Clustering (Section 3.3)
DKM supports multi-dimensional weight clustering for higher compression:
# Paper notation: "bits/dim" e.g., "4/4" means 4 bits, 4 dimensions
# Effective bits-per-weight = bits / dim
# Configuration cv:6/8, fc:6/4 (as in Table 3 of the paper)
compressor = compress_model(
    model,
    bits=6,
    conv_config={"bits": 6, "dim": 8},  # 6 bits, 8 dims → 0.75 bpw
    fc_config={"bits": 6, "dim": 4},    # 6 bits, 4 dims → 1.5 bpw
    tau=2e-5,
)
| Config | Clusters | Dim | Effective BPW |
|---|---|---|---|
| 3-bit | 8 | 1 | 3.0 |
| 2-bit | 4 | 1 | 2.0 |
| 1-bit | 2 | 1 | 1.0 |
| 4/4 | 16 | 4 | 1.0 |
| 8/8 | 256 | 8 | 1.0 |
| 4/8 | 16 | 8 | 0.5 |
| 8/16 | 256 | 16 | 0.5 |
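The table rows follow from two relations already stated in this README: the number of clusters is 2^bits, and the effective bits-per-weight is bits / dim. A small helper (hypothetical name `describe_config`, not part of the package) makes the arithmetic explicit:

```python
def describe_config(bits, dim=1):
    """Return (clusters, effective bits-per-weight) for a bits/dim configuration."""
    clusters = 2 ** bits
    effective_bpw = bits / dim
    return clusters, effective_bpw


for bits, dim in [(3, 1), (4, 4), (4, 8), (8, 16)]:
    clusters, bpw = describe_config(bits, dim)
    print(f"{bits}/{dim}: {clusters} clusters, {bpw} bpw")
```

Note that 4/4 and 8/8 have the same effective bits-per-weight but very different codebook sizes (16 versus 256 centroids).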
Temperature τ Guidelines (Appendix B)
The temperature controls the softness of cluster assignment:
- Smaller τ → harder assignment (near one-hot), closer to standard k-means
- Larger τ → softer assignment, more gradient flow, better for hard compression tasks
| Model | 3-bit | 2-bit | 1-bit | 4/4 | 8/8 |
|---|---|---|---|---|---|
| ResNet18 | 8e-6 | 2e-5 | 5e-5 | 5e-5 | 8e-5 |
| ResNet50 | 8e-6 | 2e-5 | 5e-5 | 4e-5 | OOM |
| MobileNet-v1 | 5e-5 | 1e-4 | 3e-4 | 1e-4 | 1e-4 |
| MobileNet-v2 | 5e-5 | 1e-4 | 1.5e-4 | 1e-4 | 1e-4 |
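To see why τ matters, it can help to look at the attention a single weight assigns to two centroids at different temperatures. The snippet is a toy illustration with a hypothetical helper `soft_assignment` and absolute difference assumed as the distance. The τ values used here are chosen for readability and are unrelated to the tuned values in the table.

```python
import math


def soft_assignment(w, centroids, tau):
    """Softmax over negated distances from one weight to each centroid."""
    logits = [-abs(w - c) / tau for c in centroids]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


soft = soft_assignment(0.3, [0.0, 1.0], tau=1.0)    # noticeably shared between both centroids
hard = soft_assignment(0.3, [0.0, 1.0], tau=0.01)   # almost all mass on the nearer centroid
```

As τ shrinks, the assignment approaches the one-hot choice of standard k-means, and the gradient reaching the non-selected centroids shrinks with it.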
Architecture
dkm/
├── __init__.py      # Package exports
├── dkm_layer.py     # Core DKM layer (Section 3.2-3.3)
├── compressor.py    # Model wrapper with DKM layers (Section 4)
└── utils.py         # Compression analysis utilities
tests/
└── test_dkm.py      # 16 comprehensive test groups (all passing)
train.py             # Full training pipeline (CIFAR-10 demo)
Core Components
- DKMLayer: The differentiable k-means clustering layer. Implements the iterative attention-based clustering from Fig. 2 of the paper, with k-means++ initialization, warm start across batches, and convergence checking.
- DKMCompressor: Wraps any PyTorch model by inserting DKM layers via forward pre-hooks. Handles per-layer configuration (different bits/dim for conv vs fc), the paper's protocol for small layers (<10K params → 8-bit), and first/last layer skipping.
- compress_model: High-level API matching the paper's notation (cv:bits/dim, fc:bits/dim).
Training Protocol (Section 4)
Following the paper exactly:
- Optimizer: SGD with momentum 0.9
- Learning rate: 0.008 (fixed, no per-layer tuning)
- Loss: Original task loss (no regularizers or modifications)
- Epochs: 200 for ImageNet, varies for GLUE
- Batch size: 128 per GPU (paper used 8× V100)
- Convergence: ε = 1e-4, max 5 DKM iterations per layer
- Small layers: Layers with <10,000 parameters get 8-bit clustering
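The small-layer rule in the last bullet amounts to a per-layer override of the requested bit width. A minimal sketch of that decision, with a hypothetical function `choose_bits` (the actual logic inside `DKMCompressor` may differ, for example in how it treats `dim` for small layers):

```python
def choose_bits(num_params, requested_bits, small_threshold=10_000, small_bits=8):
    """Use 8-bit clustering for layers with fewer than `small_threshold` parameters."""
    if num_params < small_threshold:
        return small_bits
    return requested_bits


# A small layer falls back to 8 bits, while a large layer keeps the requested width.
small_layer_bits = choose_bits(5_000, requested_bits=2)
large_layer_bits = choose_bits(2_359_296, requested_bits=2)
```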
Compressed Model Format
After training, export_compressed() returns:
- state_dict: Standard PyTorch state dict (with snapped weights)
- codebooks: Per-layer centroid tensors (k × d float32)
- assignments: Per-layer cluster index tensors (N/d integers, b bits each)
- layer_configs: Per-layer DKM configuration
The actual compressed size is the sum over compressed layers of (codebook_bits + assignment_bits), plus the size of the uncompressed parameters.
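The size accounting can be worked through from the shapes listed above: a codebook holds k × d float32 values with k = 2^bits, and there are N/d assignments of b bits each. The helpers below are a sketch under those assumptions. The names are hypothetical, uncompressed parameters are assumed to be stored as float32, and rounding N/d up when N is not divisible by d is an assumption rather than something this README specifies.

```python
import math


def layer_compressed_bits(num_weights, bits, dim, float_bits=32):
    """Codebook bits plus assignment bits for one clustered layer."""
    k = 2 ** bits
    codebook_bits = k * dim * float_bits
    num_groups = math.ceil(num_weights / dim)  # assumption: pad the last group
    assignment_bits = num_groups * bits
    return codebook_bits + assignment_bits


def model_compressed_bytes(layers, uncompressed_params, float_bits=32):
    """Total size in bytes; `layers` is a list of (num_weights, bits, dim) tuples."""
    total_bits = sum(layer_compressed_bits(n, b, d, float_bits) for n, b, d in layers)
    total_bits += uncompressed_params * float_bits
    return total_bits / 8


# Example: a 1024-weight layer at 2 bits, dim 1: 4*1*32 + 1024*2 = 2176 bits.
example_bits = layer_compressed_bits(1024, bits=2, dim=1)
```

For multi-dimensional configurations the codebook term grows quickly (2^bits × dim floats), so it can dominate the total for small layers.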
Tests
All 16 test groups pass, covering:
- Shape preservation (train & eval)
- Distance matrix correctness
- Attention matrix properties (row-sum=1, temperature effect)
- Centroid convergence to cluster means
- Gradient flow (differentiability, the key paper contribution)
- Multi-dimensional clustering
- Iterative convergence
- Full compressor pipeline
- Weight snapping for inference
- Model export
- Multi-step training stability
- Paper configurations (Table 1)
- K-means++ initialization
- Warm start across batches
- Numerical stability (large/small/uniform weights)
- ResNet-like model compression
python tests/test_dkm.py
Citation
@inproceedings{cho2022dkm,
title={DKM: Differentiable k-Means Clustering Layer for Neural Network Compression},
author={Cho, Minsik and Alizadeh-Vahid, Keivan and Adya, Saurabh and Rastegari, Mohammad},
booktitle={International Conference on Learning Representations (ICLR)},
year={2022},
url={https://openreview.net/forum?id=J_F_qqCE3Z5}
}
License
This is a research implementation. The original paper is by Apple Research (Cho et al., ICLR 2022).