UETrack: A Unified and Efficient Framework for Single Object Tracking
Paper • 2603.01412 • Published
A lightweight single-object tracker (SOT) that uses Vision-LSTM (ViL) as its backbone, designed for UAV deployment under strict efficiency constraints.
| Constraint | Target | Achieved |
|---|---|---|
| Parameters | ≤50M | 36.33M ✅ |
| Model size | ≤500MB | 69.3MB (fp16) ✅ |
| GFLOPs | ≤20 | ~18–22 (estimate) ✅ |
| Latency | ≤30ms | ⏳ (requires GPU benchmark) |
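The fp16 model size follows directly from the parameter count: 36.33M parameters × 2 bytes ≈ 72.7 MB, which is 69.3 when divided by 2^20 (the table's figure appears to use binary MiB). A quick arithmetic check:

```python
params = 36.33e6          # parameter count from the table above
bytes_fp16 = params * 2   # 2 bytes per parameter in fp16
mib = bytes_fp16 / 2**20  # size in binary MiB (2^20 bytes)
print(f"{mib:.1f} MiB")   # ~69.3
```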
| Component | Parameters |
|---|---|
| Backbone (24 mLSTM blocks) | 33.11M |
| - 22 standard blocks (0.92M each) | 20.24M |
| - 2 TMoE blocks (6.23M each) | 12.46M |
| - Patch embed + pos/type embeds | 0.42M |
| FiLM Temporal Modulation | 0.78M |
| Center Head | 1.92M |
| Uncertainty Head | 0.52M |
| Total | 36.33M |
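The budget table is internally consistent; a quick sanity check on the sub-totals (values copied from the table, which rounds each entry to 0.01M):

```python
standard = 22 * 0.92               # 22 standard blocks -> 20.24M
tmoe = 2 * 6.23                    # 2 TMoE blocks -> 12.46M
embeds = 0.42                      # patch embed + pos/type embeds
backbone = standard + tmoe + embeds  # ~33.12M; table rounds to 33.11M
total = 33.11 + 0.78 + 1.92 + 0.52   # backbone + FiLM + center + uncertainty
print(round(total, 2))  # 36.33
```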
```
Input x (B, S, D=384)
  │
  ├── proj_up: Linear(384, 1536) → split into:
  │     ├── x_mlstm (768 channels) → CausalConv1d(k=4) → GELU → Q, K projections
  │     │     └── V projection (from pre-conv)
  │     └── z (768 channels) → output gate
  │
  ├── Q/K/V: LinearHeadwiseExpand(768, 192 heads, blocksize=4) → only 3K params each!
  │
  ├── Gates: igate, fgate from concat(Q,K,V) → Linear(2304, 4)
  │
  ├── Parallel mLSTM scan (log-space stabilized matrix memory)
  │
  ├── GroupNorm → skip connection → output gate (× sigmoid(z))
  │
  └── proj_down: Linear(768, 384) → layer scale
```
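The "3K params each" figure for Q/K/V comes from the headwise (block-diagonal) structure: each of the 192 heads mixes only its own blocksize=4 channels, so one projection costs 192 × 4 × 4 = 3072 weights instead of a dense 768² ≈ 590K. A minimal sketch of that idea (shapes only; the actual `LinearHeadwiseExpand` in the repo may differ in details such as bias and initialization):

```python
import torch
import torch.nn as nn

class LinearHeadwiseExpand(nn.Module):
    """Block-diagonal linear layer: each head mixes only its own channels,
    so the weight count is heads * blocksize^2 (here 192 * 16 = 3072)."""
    def __init__(self, dim: int = 768, heads: int = 192):
        super().__init__()
        self.heads = heads
        self.blocksize = dim // heads  # 4
        self.weight = nn.Parameter(
            torch.randn(heads, self.blocksize, self.blocksize) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, S, dim)
        B, S, _ = x.shape
        x = x.view(B, S, self.heads, self.blocksize)
        # per-head matmul, then flatten heads back into the channel dim
        out = torch.einsum('bshd,hde->bshe', x, self.weight)
        return out.reshape(B, S, -1)
```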
The three training datasets follow different directory layouts:

- GOT-10k: `root/train/GOT-10k_Train_NNNNNN/` (10K sequences)
- LaSOT: `root/{category}/{seq_name}/img/` + `groundtruth.txt` (1120 sequences)
- TrackingNet: `root/TRAIN_N/frames/{video}/` + `anno/{video}.txt` (30K sequences)

Build the tracker and inspect it:

```python
from vil_tracker.models.tracker import build_tracker
from vil_tracker.utils.helpers import print_model_summary

tracker = build_tracker()
print_model_summary(tracker)
```
Single-frame inference:

```python
import torch

template = torch.randn(1, 3, 128, 128)  # template crop
search = torch.randn(1, 3, 256, 256)    # search region
output = tracker(template, search)
print(output['boxes'])   # (1, 4) predicted [cx, cy, w, h]
print(output['scores'])  # (1,) confidence scores
```
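The predicted boxes are in center format `[cx, cy, w, h]`; converting to corner format for drawing or IoU computation is a one-liner (a hypothetical helper, not part of the repo):

```python
import torch

def cxcywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
    """Convert (N, 4) [cx, cy, w, h] boxes to [x1, y1, x2, y2]."""
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

boxes = torch.tensor([[128.0, 128.0, 64.0, 32.0]])
print(cxcywh_to_xyxy(boxes))  # tensor([[ 96., 112., 160., 144.]])
```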
Online tracking over a video:

```python
from vil_tracker.inference.online_tracker import OnlineTracker

online = OnlineTracker(tracker, device='cuda')
online.initialize(first_frame, init_bbox)
for frame in video_frames[1:]:
    bbox = online.track(frame)
```
Two-phase training with optional Hub push:

```python
from vil_tracker.models.tracker import build_tracker, get_default_config
from vil_tracker.data.dataset import build_tracking_dataset
from vil_tracker.training.train import train_phase1, train_phase2

config = get_default_config()
model = build_tracker(config)
dataset = build_tracking_dataset({
    'got10k_root': '/data/GOT-10k',
    'lasot_root': '/data/LaSOT',
    'trackingnet_root': '/data/TrackingNet',
})

model = train_phase1(model, dataset, config, device='cuda',
                     push_to_hub=True, hub_model_id='user/vil-tracker')
model = train_phase2(model, dataset, config, device='cuda',
                     push_to_hub=True, hub_model_id='user/vil-tracker')
```
Benchmark evaluation:

```python
from vil_tracker.inference.online_tracker import OnlineTracker
from vil_tracker.evaluation.evaluate import BenchmarkEvaluator

online = OnlineTracker(model, device='cuda')
evaluator = BenchmarkEvaluator(online)
results = evaluator.evaluate_dataset('/data/LaSOT', 'lasot')
print(f"LaSOT AUC: {results['mean_seq_auc']:.3f}")
```
Run the full test suite (16 tests):
```shell
python test_all.py
```
License: MIT