Filter-Tank – Machine Fault Recognition
A deep learning system that listens to factory machine audio recordings and classifies them into 6 categories across 3 machine types, each in either a normal or abnormal state. Built from scratch using SE-ResNet on log-mel spectrograms.
Overview
Filter-Tank is a complete machine learning pipeline for predictive maintenance. Given a raw .wav audio recording of a factory machine, the system automatically detects whether the machine is operating normally or has developed a fault – and identifies which machine type it belongs to.
The model is a custom SE-ResNet (Squeeze-and-Excitation ResNet) trained entirely from scratch with no pretrained weights, designed specifically for 1-channel log-mel spectrogram input.
Classes
| Label | Description |
|---|---|
| 0 | Machine 1 – Normal |
| 1 | Machine 1 – Abnormal |
| 2 | Machine 2 – Normal |
| 3 | Machine 2 – Abnormal |
| 4 | Machine 3 – Normal |
| 5 | Machine 3 – Abnormal |
Preprocessing Pipeline
Every audio file passes through a multi-stage preprocessing pipeline before reaching the model. All steps run on CPU, and only the processing + prediction time is measured by the inference timer (file reading is excluded).
1. Resampling
All audio is resampled to a fixed sample rate of 16,000 Hz to ensure consistency across recordings made with different microphones or recording equipment.
2. Noise Reduction
Non-stationary background noise is removed using the
noisereduce library with full noise reduction strength
(prop_decrease=1.0). This handles real-world factory
environments where background noise varies significantly
between recordings.
3. Silence Trimming
Leading and trailing silence is removed using librosa's trim function (top_db=20). This ensures the model focuses only on the actual machine sound rather than quiet gaps at the start or end of a recording.
4. Fixed-Length Normalization
All recordings are normalized to exactly 11 seconds. Files longer than 11 seconds are truncated from the end. Files shorter than 11 seconds are zero-padded at the end. This gives the model a consistent input size regardless of the original recording length.
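This step needs no audio library; a NumPy sketch (the helper name is illustrative):

```python
import numpy as np

SR = 16_000
TARGET_LEN = 11 * SR  # exactly 11 seconds at 16 kHz = 176,000 samples

def fix_length(y: np.ndarray) -> np.ndarray:
    """Truncate from the end, or zero-pad at the end, to TARGET_LEN samples."""
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    return np.pad(y, (0, TARGET_LEN - len(y)))

short = fix_length(np.ones(5 * SR, dtype=np.float32))    # 5 s -> padded to 11 s
long = fix_length(np.ones(20 * SR, dtype=np.float32))    # 20 s -> truncated to 11 s
```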
5. Log-Mel Spectrogram
The waveform is converted into a 2D log-mel spectrogram using the following settings:
- Mel bands: 128
- FFT window size: 1024
- Hop length: 512
- Power: 2.0 (power spectrogram)
- Amplitude converted to dB scale (top_db=80)
This transforms the raw audio signal into a visual time-frequency representation that the convolutional model can process effectively.
6. CMVN Normalization
Cepstral Mean and Variance Normalization is applied per sample β each spectrogram is normalized to have zero mean and unit variance along the time axis. This handles volume variations and differences in microphone sensitivity across recordings.
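A minimal NumPy sketch of per-sample CMVN (the small `eps` guard against constant bins is an assumption, not stated in the pipeline description):

```python
import numpy as np

def cmvn(spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample CMVN: zero mean, unit variance for each mel bin across time.

    spec has shape (n_mels, n_frames); statistics are computed along the
    time axis (axis=1) independently for every frequency bin.
    """
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)

spec = np.random.default_rng(0).normal(5.0, 3.0, size=(128, 344))
norm = cmvn(spec)
```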
Model Architecture
SE-ResNet (Squeeze-and-Excitation ResNet)
The model follows a standard ResNet structure enhanced with Squeeze-and-Excitation (SE) attention blocks at every residual stage.
Stem: A 7x7 convolution (stride 2) followed by batch normalization, ReLU, and max pooling reduces the input resolution before the residual stages.
4 Residual Stages:
- Stage 1: 3 SE-Residual blocks, 64 channels
- Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
- Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
- Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)
SE Attention Block: Each residual block includes a Squeeze-and-Excitation module that performs global average pooling, passes the result through two fully-connected layers with a bottleneck (reduction=16), and produces per-channel attention weights via sigmoid. This lets the model focus on the most informative frequency channels for each input.
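A sketch of such an SE module in PyTorch (the standalone-module packaging and layer names are illustrative; in the project it is wired inside each residual block):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling -> bottleneck MLP -> sigmoid gates."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze to (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck (reduction=16)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-weight channels by their attention scores

x = torch.randn(2, 64, 16, 22)  # (batch, channels, freq, time)
out = SEBlock(64)(x)
```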
Head: Global Average Pooling → Dropout (0.3) → Fully Connected layer → 6-class output.
Weight Initialization:
- Conv layers: Kaiming Normal (fan_out, relu)
- BatchNorm: weight=1, bias=0
- Linear layers: Xavier Uniform
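The scheme above as a PyTorch `apply`-style function (the tiny demo model is a stand-in for the full network):

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    """Apply the initialization scheme listed above via model.apply(init_weights)."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)

model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.BatchNorm2d(8), nn.Linear(8, 6))
model.apply(init_weights)
```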
Total Parameters: ~11 million
Training Details
Dataset Split
The dataset is divided using stratified splitting to ensure balanced class representation across all splits:
- Training set: 80%
- Validation set: 10%
- Test set: 10%
Stratification is done by machine type and condition combined, so each split has proportional representation of all 6 classes.
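One way to realize this split with scikit-learn's `train_test_split` (the dummy labels and the two-stage 80/10/10 carve-up are illustrative; the project's exact split code may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical labels: 6 classes = 3 machine types x {normal, abnormal}
labels = np.repeat(np.arange(6), 100)  # 600 dummy samples, 100 per class
indices = np.arange(len(labels))

# first carve out 80% train, then split the remaining 20% evenly into val/test
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=labels[rest_idx], random_state=42
)
```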
Class Imbalance Handling
A WeightedRandomSampler is used during training to oversample underrepresented classes, ensuring the model sees a balanced distribution of all 6 classes per epoch regardless of the original dataset distribution.
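A sketch of the sampler setup with a hypothetical imbalanced label vector:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# hypothetical imbalanced labels (class 5 heavily underrepresented)
labels = np.array([0] * 500 + [1] * 300 + [2] * 100 + [3] * 60 + [4] * 30 + [5] * 10)

class_counts = np.bincount(labels, minlength=6)
sample_weights = 1.0 / class_counts[labels]  # rare classes get larger draw weights

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),  # one "epoch" worth of draws
    replacement=True,         # required so rare samples can repeat
)
```

The sampler is passed to the training `DataLoader` (`DataLoader(dataset, batch_size=64, sampler=sampler)`); with it, each epoch draws roughly equal numbers from all 6 classes.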
Data Augmentation
Two augmentation strategies are applied during training:
SpecAugment (online, per batch): Applied directly to the spectrogram tensors during training. Two frequency masks (freq_mask_param=20) and two time masks (time_mask_param=40) are applied randomly, forcing the model to be robust to missing frequency bands and time segments.
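A from-scratch sketch of these two mask types in PyTorch (torchaudio's `FrequencyMasking`/`TimeMasking` transforms do the same job; here the masks are shared across the batch for brevity):

```python
import torch

def spec_augment(spec, n_freq_masks=2, freq_param=20, n_time_masks=2, time_param=40):
    """Zero out random frequency bands and time segments of a (B, 1, F, T) batch.

    Mask widths are drawn uniformly from [0, param), per SpecAugment.
    """
    spec = spec.clone()
    _, _, n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        width = int(torch.randint(0, freq_param, (1,)))
        start = int(torch.randint(0, max(1, n_freq - width), (1,)))
        spec[:, :, start:start + width, :] = 0.0  # mask a frequency band
    for _ in range(n_time_masks):
        width = int(torch.randint(0, time_param, (1,)))
        start = int(torch.randint(0, max(1, n_time - width), (1,)))
        spec[:, :, :, start:start + width] = 0.0  # mask a time segment
    return spec

batch = torch.randn(4, 1, 128, 344)
augmented = spec_augment(batch)
```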
Mixup (online, per batch): Pairs of training samples are blended together with a random interpolation weight drawn from a Beta distribution (alpha=0.4). Both the input spectrograms and their labels are mixed, which acts as a strong regularizer and improves generalization.
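A minimal mixup sketch in PyTorch (the one-hot soft-label formulation shown here is one common variant; implementations often instead keep both label indices and mix the two losses):

```python
import numpy as np
import torch

def mixup(x, y, alpha: float = 0.4, num_classes: int = 6):
    """Blend a batch with a shuffled copy of itself; labels become soft targets."""
    lam = float(np.random.beta(alpha, alpha))  # interpolation weight in (0, 1)
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
    mixed_y = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return mixed_x, mixed_y

x = torch.randn(8, 1, 128, 344)
y = torch.randint(0, 6, (8,))
mx, my = mixup(x, y)
```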
Loss Function
Cross-Entropy Loss with label smoothing (0.1). Label smoothing prevents overconfident predictions and improves calibration.
Optimizer & Scheduler
- Optimizer: AdamW (weight decay=1e-4)
- Scheduler: OneCycleLR with cosine annealing
- Max LR: 3e-3
- Warmup: 10% of total steps
- Gradient clipping: max norm = 1.0
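These settings translate to PyTorch as follows (the step counts and stand-in model are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 6)  # stand-in for the SE-ResNet
steps_per_epoch, epochs = 100, 60

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=steps_per_epoch * epochs,
    pct_start=0.1,          # 10% of total steps spent warming up
    anneal_strategy="cos",  # cosine annealing after the peak
)

# per training step: clip gradients, then step optimizer and scheduler, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step(); scheduler.step()
```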
Mixed Precision Training
All forward and backward passes use torch.amp autocast with float16 precision, reducing memory usage and speeding up training on GPU.
Multi-GPU Support
The model supports DataParallel training across multiple GPUs automatically. The best model state is always saved from the unwrapped module to ensure compatibility during single-GPU inference.
Early Stopping
Training stops automatically if validation accuracy does not improve for 12 consecutive epochs (patience=12). The best model checkpoint is saved based on validation accuracy.
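A self-contained sketch of the patience logic (a small patience value is used for illustration; the project uses 12):

```python
class EarlyStopping:
    """Stop when validation accuracy fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 12):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc: float) -> bool:
        """Record one epoch's validation accuracy; return True to stop training."""
        if val_acc > self.best:
            self.best = val_acc   # new best: the checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
history = [0.70, 0.75, 0.74, 0.74, 0.73]
stops = [stopper.step(acc) for acc in history]  # stops on the 3rd bad epoch
```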
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Max LR | 3e-3 |
| LR Schedule | OneCycleLR (cosine annealing) |
| Weight Decay | 1e-4 |
| Max Epochs | 60 |
| Early Stopping | Patience = 12 |
| Batch Size | 64 |
| Label Smoothing | 0.1 |
| Mixup Alpha | 0.4 |
| Mixed Precision | float16 (AMP) |
| Dropout | 0.3 |
Inference
During inference, audio files are processed strictly one-by-one in naturally sorted order (1.wav, 2.wav, ...). The preprocessing pipeline runs on each file individually, and only the processing + prediction time is measured (I/O reading is excluded from the timer).
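Natural sorting differs from plain lexicographic order (which would place `10.wav` before `2.wav`); a minimal key function:

```python
import re

def natural_key(name: str):
    """Sort '2.wav' before '10.wav' by comparing digit runs numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

files = ["10.wav", "2.wav", "1.wav", "11.wav", "3.wav"]
ordered = sorted(files, key=natural_key)
```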
Two output files are produced:
- `results.txt` – one predicted class label (0–5) per line
- `time.txt` – processing time per file in seconds (rounded to 3 decimal places)
Requirements
- Python 3.8+
- PyTorch
- torchaudio
- librosa
- noisereduce
- numpy
- soundfile
- scikit-learn
Limitations
- Trained only on 3 specific machine types; may not generalize to unseen machine types out of the box
- Performance may degrade with extremely noisy environments beyond the training distribution
- Fixed 11-second input window; very short recordings are zero-padded which may affect accuracy
Team
Cairo University – Faculty of Engineering, Computer Engineering Department. Pattern Recognition and Neural Networks – Spring 2026.