UETrack: A Unified and Efficient Framework for Single Object Tracking
Paper • 2603.01412 • Published
A lightweight single-object tracker (SOT) that uses Vision-LSTM (ViL) as its backbone, designed for UAV deployment under strict efficiency constraints.
| Constraint | Target | Achieved |
|---|---|---|
| Parameters | ≤50M | 36.33M ✅ |
| Model size | ≤500MB | 69.3MB (fp16) ✅ |
| GFLOPs | ≤20 | ~18–22 (estimate) ✅ |
| Latency | ≤30ms | ⏳ (requires GPU benchmark) |
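The fp16 model size follows directly from the parameter count: 36.33M parameters × 2 bytes ≈ 72.7 MB, which is 69.3 when divided by 2^20 (the table's figure appears to use binary MiB). A quick arithmetic check:

```python
params = 36.33e6          # parameter count from the table above
bytes_fp16 = params * 2   # 2 bytes per parameter in fp16
mib = bytes_fp16 / 2**20  # size in binary MiB (2^20 bytes)
print(f"{mib:.1f} MiB")   # ~69.3
```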
| Component | Parameters |
|---|---|
| Backbone (24 mLSTM blocks) | 33.11M |
| - 22 standard blocks (0.92M each) | 20.24M |
| - 2 TMoE blocks (6.23M each) | 12.46M |
| - Patch embed + pos/type embeds | 0.42M |
| FiLM Temporal Modulation | 0.78M |
| Center Head | 1.92M |
| Uncertainty Head | 0.52M |
| Total | 36.33M |
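The budget table is internally consistent; a quick sanity check on the sub-totals (values copied from the table, which rounds each entry to 0.01M):

```python
standard = 22 * 0.92               # 22 standard blocks -> 20.24M
tmoe = 2 * 6.23                    # 2 TMoE blocks -> 12.46M
embeds = 0.42                      # patch embed + pos/type embeds
backbone = standard + tmoe + embeds  # ~33.12M; table rounds to 33.11M
total = 33.11 + 0.78 + 1.92 + 0.52   # backbone + FiLM + center + uncertainty
print(round(total, 2))  # 36.33
```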
```
Input x (B, S, D=384)
  │
  ├── proj_up: Linear(384, 1536) → split into:
  │     ├── x_mlstm (768 channels) → CausalConv1d(k=4) → GELU → Q, K projections
  │     │     └── V projection (from pre-conv)
  │     └── z (768 channels) → output gate
  │
  ├── Q/K/V: LinearHeadwiseExpand(768, 192 heads, blocksize=4) → only 3K params each!
  │
  ├── Gates: igate, fgate from concat(Q,K,V) → Linear(2304, 4)
  │
  ├── Parallel mLSTM scan (log-space stabilized matrix memory)
  │
  ├── GroupNorm → skip connection → output gate (× sigmoid(z))
  │
  └── proj_down: Linear(768, 384) → layer scale
```
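The "3K params each" figure for Q/K/V comes from the headwise (block-diagonal) structure: each of the 192 heads mixes only its own blocksize=4 channels, so one projection costs 192 × 4 × 4 = 3072 weights instead of a dense 768² ≈ 590K. A minimal sketch of that idea (shapes only; the actual `LinearHeadwiseExpand` in the repo may differ in details such as bias and initialization):

```python
import torch
import torch.nn as nn

class LinearHeadwiseExpand(nn.Module):
    """Block-diagonal linear layer: each head mixes only its own channels,
    so the weight count is heads * blocksize^2 (here 192 * 16 = 3072)."""
    def __init__(self, dim: int = 768, heads: int = 192):
        super().__init__()
        self.heads = heads
        self.blocksize = dim // heads  # 4
        self.weight = nn.Parameter(
            torch.randn(heads, self.blocksize, self.blocksize) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, S, dim)
        B, S, _ = x.shape
        x = x.view(B, S, self.heads, self.blocksize)
        # per-head matmul, then flatten heads back into the channel dim
        out = torch.einsum('bshd,hde->bshe', x, self.weight)
        return out.reshape(B, S, -1)
```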
The three training datasets follow different directory layouts:

- GOT-10k: `root/train/GOT-10k_Train_NNNNNN/` (10K sequences)
- LaSOT: `root/{category}/{seq_name}/img/` + `groundtruth.txt` (1120 sequences)
- TrackingNet: `root/TRAIN_N/frames/{video}/` + `anno/{video}.txt` (30K sequences)

Build the tracker and inspect it:

```python
from vil_tracker.models.tracker import build_tracker
from vil_tracker.utils.helpers import print_model_summary

tracker = build_tracker()
print_model_summary(tracker)
```
Single-frame inference:

```python
import torch

template = torch.randn(1, 3, 128, 128)  # template crop
search = torch.randn(1, 3, 256, 256)    # search region
output = tracker(template, search)
print(output['boxes'])   # (1, 4) predicted [cx, cy, w, h]
print(output['scores'])  # (1,) confidence scores
```
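The predicted boxes are in center format `[cx, cy, w, h]`; converting to corner format for drawing or IoU computation is a one-liner (a hypothetical helper, not part of the repo):

```python
import torch

def cxcywh_to_xyxy(boxes: torch.Tensor) -> torch.Tensor:
    """Convert (N, 4) [cx, cy, w, h] boxes to [x1, y1, x2, y2]."""
    cx, cy, w, h = boxes.unbind(-1)
    return torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

boxes = torch.tensor([[128.0, 128.0, 64.0, 32.0]])
print(cxcywh_to_xyxy(boxes))  # tensor([[ 96., 112., 160., 144.]])
```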
Online tracking over a video:

```python
from vil_tracker.inference.online_tracker import OnlineTracker

online = OnlineTracker(tracker, device='cuda')
online.initialize(first_frame, init_bbox)
for frame in video_frames[1:]:
    bbox = online.track(frame)
```
Two-phase training with optional Hub push:

```python
from vil_tracker.models.tracker import build_tracker, get_default_config
from vil_tracker.data.dataset import build_tracking_dataset
from vil_tracker.training.train import train_phase1, train_phase2

config = get_default_config()
model = build_tracker(config)
dataset = build_tracking_dataset({
    'got10k_root': '/data/GOT-10k',
    'lasot_root': '/data/LaSOT',
    'trackingnet_root': '/data/TrackingNet',
})

model = train_phase1(model, dataset, config, device='cuda',
                     push_to_hub=True, hub_model_id='user/vil-tracker')
model = train_phase2(model, dataset, config, device='cuda',
                     push_to_hub=True, hub_model_id='user/vil-tracker')
```
Benchmark evaluation:

```python
from vil_tracker.inference.online_tracker import OnlineTracker
from vil_tracker.evaluation.evaluate import BenchmarkEvaluator

online = OnlineTracker(model, device='cuda')
evaluator = BenchmarkEvaluator(online)
results = evaluator.evaluate_dataset('/data/LaSOT', 'lasot')
print(f"LaSOT AUC: {results['mean_seq_auc']:.3f}")
```
Run the full test suite (16 tests):
```shell
python test_all.py
```
License: MIT