Filter-Tank: Machine Fault Recognition

A deep learning system that listens to factory machine audio recordings and classifies them into 6 categories across 3 machine types, each in either a normal or abnormal state. Built from scratch using SE-ResNet on log-mel spectrograms.


Overview

Filter-Tank is a complete machine learning pipeline for predictive maintenance. Given a raw .wav audio recording of a factory machine, the system automatically detects whether the machine is operating normally or has developed a fault, and identifies which machine type produced the recording.

The model is a custom SE-ResNet (Squeeze-and-Excitation ResNet) trained entirely from scratch with no pretrained weights, designed specifically for 1-channel log-mel spectrogram input.


Classes

| Label | Description |
|-------|-------------|
| 0 | Machine 1 - Normal |
| 1 | Machine 1 - Abnormal |
| 2 | Machine 2 - Normal |
| 3 | Machine 2 - Abnormal |
| 4 | Machine 3 - Normal |
| 5 | Machine 3 - Abnormal |

Preprocessing Pipeline

Every audio file passes through a multi-stage preprocessing pipeline before reaching the model. All steps run on CPU, and only processing + prediction time is measured (file reading is excluded from the inference timer; see Inference below).

1. Resampling

All audio is resampled to a fixed sample rate of 16,000 Hz to ensure consistency across recordings made with different microphones or recording equipment.
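
For illustration, a minimal librosa sketch of this step (the file name is a placeholder; the card does not publish the actual loading code):

```python
import librosa

TARGET_SR = 16_000  # fixed sample rate used throughout the pipeline

# librosa resamples during load when sr is given; mono=True collapses stereo
y, sr = librosa.load("machine.wav", sr=TARGET_SR, mono=True)
```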

2. Noise Reduction

Non-stationary background noise is removed using the noisereduce library with full noise reduction strength (prop_decrease=1.0). This handles real-world factory environments where background noise varies significantly between recordings.
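
A sketch of how this step might look with noisereduce, continuing from the waveform `y` above (stationary=False selects the non-stationary algorithm and is the library default, shown explicitly here):

```python
import noisereduce as nr

# full reduction strength, as described above
y = nr.reduce_noise(y=y, sr=TARGET_SR, stationary=False, prop_decrease=1.0)
```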

3. Silence Trimming

Leading and trailing silence is removed using librosa's trim function (top_db=20). This ensures the model focuses only on the actual machine sound rather than quiet gaps at the start or end of a recording.
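
The corresponding librosa call, again as a sketch:

```python
import librosa

# drop leading/trailing segments more than 20 dB below the peak level
y, _ = librosa.effects.trim(y, top_db=20)
```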

4. Fixed-Length Normalization

All recordings are normalized to exactly 11 seconds. Files longer than 11 seconds are truncated from the end. Files shorter than 11 seconds are zero-padded at the end. This gives the model a consistent input size regardless of the original recording length.
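
A sketch of the pad-or-truncate logic (11 s at 16 kHz is 176,000 samples):

```python
import numpy as np

FIXED_LEN = 11 * TARGET_SR  # 176,000 samples

if len(y) >= FIXED_LEN:
    y = y[:FIXED_LEN]                       # truncate the trailing excess
else:
    y = np.pad(y, (0, FIXED_LEN - len(y)))  # zero-pad at the end
```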

5. Log-Mel Spectrogram

The waveform is converted into a 2D log-mel spectrogram using the following settings:

  • Mel bands: 128
  • FFT window size: 1024
  • Hop length: 512
  • Power: 2.0 (power spectrogram)
  • Power converted to dB scale (top_db=80)

This transforms the raw audio signal into a visual time-frequency representation that the convolutional model can process effectively.
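
With librosa, the settings above map onto a sketch like this (ref=np.max is an assumption; the card does not state the dB reference):

```python
import numpy as np
import librosa

mel = librosa.feature.melspectrogram(
    y=y, sr=TARGET_SR,
    n_fft=1024, hop_length=512, n_mels=128, power=2.0,
)
# clip the dynamic range to 80 dB below the chosen reference
log_mel = librosa.power_to_db(mel, ref=np.max, top_db=80)  # shape: (128, frames)
```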

6. CMVN Normalization

Cepstral Mean and Variance Normalization is applied per sample β€” each spectrogram is normalized to have zero mean and unit variance along the time axis. This handles volume variations and differences in microphone sensitivity across recordings.
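
As a sketch, with the epsilon an assumed guard against silent bands:

```python
# per-sample CMVN: zero mean / unit variance per mel band across time
mean = log_mel.mean(axis=1, keepdims=True)
std = log_mel.std(axis=1, keepdims=True)
log_mel = (log_mel - mean) / (std + 1e-8)
```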


Model Architecture

SE-ResNet (Squeeze-and-Excitation ResNet)

The model follows a standard ResNet structure enhanced with Squeeze-and-Excitation (SE) attention blocks at every residual stage.

Stem: A 7x7 convolution (stride 2) followed by batch normalization, ReLU, and max pooling reduces the input resolution before the residual stages.

4 Residual Stages:

  • Stage 1: 3 SE-Residual blocks, 64 channels
  • Stage 2: 4 SE-Residual blocks, 128 channels (stride 2)
  • Stage 3: 6 SE-Residual blocks, 256 channels (stride 2)
  • Stage 4: 3 SE-Residual blocks, 512 channels (stride 2)

SE Attention Block: Each residual block includes a Squeeze-and-Excitation module that performs global average pooling, passes the result through two fully-connected layers with a bottleneck (reduction=16), and produces per-channel attention weights via sigmoid. This lets the model focus on the most informative frequency channels for each input.
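
A minimal PyTorch sketch of such an SE module (names are illustrative; the exact implementation is not published in this card):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel attention weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)  # excite: channel weights
        return x * w                     # re-scale the feature maps
```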

Head: Global Average Pooling β†’ Dropout (0.3) β†’ Fully Connected layer β†’ 6-class output.

Weight Initialization:

  • Conv layers: Kaiming Normal (fan_out, relu)
  • BatchNorm: weight=1, bias=0
  • Linear layers: Xavier Uniform
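
This scheme maps onto a sketch like the following, applied via `model.apply(init_weights)`:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.constant_(module.weight, 1.0)
        nn.init.constant_(module.bias, 0.0)
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
```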

Total Parameters: ~11 million


Training Details

Dataset Split

The dataset is divided using stratified splitting to ensure balanced class representation across all splits:

  • Training set: 80%
  • Validation set: 10%
  • Test set: 10%

Stratification is done by machine type and condition combined, so each split has proportional representation of all 6 classes.
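
A scikit-learn sketch of this split (`files` and `labels` are placeholders, and `random_state=42` is an assumption; labels 0-5 already encode machine type + condition, so stratifying on them matches the description above):

```python
from sklearn.model_selection import train_test_split

train_files, rest_files, train_y, rest_y = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42)
val_files, test_files, val_y, test_y = train_test_split(
    rest_files, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```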

Class Imbalance Handling

A WeightedRandomSampler is used during training to oversample underrepresented classes, ensuring the model sees a balanced distribution of all 6 classes per epoch regardless of the original dataset distribution.
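
A sketch of the sampler setup (`train_dataset` is a placeholder for the actual Dataset):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

class_counts = np.bincount(train_y, minlength=6)
sample_weights = 1.0 / class_counts[train_y]    # rarer classes drawn more often
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(train_y),
    replacement=True,
)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
```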

Data Augmentation

Two augmentation strategies are applied during training:

SpecAugment (online, per batch): Applied directly to the spectrogram tensors during training. Two frequency masks (freq_mask_param=20) and two time masks (time_mask_param=40) are applied randomly, forcing the model to be robust to missing frequency bands and time segments.
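
With torchaudio this could look like the sketch below (each call masks a random width up to the given parameter):

```python
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=20)
time_mask = T.TimeMasking(time_mask_param=40)

def spec_augment(batch):
    # batch: (B, 1, n_mels, time); two masks of each kind, as described above
    for _ in range(2):
        batch = freq_mask(batch)
        batch = time_mask(batch)
    return batch
```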

Mixup (online, per batch): Pairs of training samples are blended together with a random interpolation weight drawn from a Beta distribution (alpha=0.4). Both the input spectrograms and their labels are mixed, which acts as a strong regularizer and improves generalization.
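
A standard mixup sketch (the mixed loss pairs with the criterion defined in the next section):

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.4):
    lam = np.random.beta(alpha, alpha)                # interpolation weight
    perm = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[perm]         # blend pairs of samples
    return mixed_x, y, y[perm], lam

# loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
```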

Loss Function

Cross-Entropy Loss with label smoothing (0.1). Label smoothing prevents overconfident predictions and improves calibration.
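
In PyTorch this is a one-liner:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```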

Optimizer & Scheduler

  • Optimizer: AdamW (weight decay=1e-4)
  • Scheduler: OneCycleLR with cosine annealing
    • Max LR: 3e-3
    • Warmup: 10% of total steps
  • Gradient clipping: max norm = 1.0
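
A sketch of this setup (`model` and `max_epochs` are placeholders; `loader` comes from the sampler sketch above, and clipping appears in the training-step sketch below):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    total_steps=len(loader) * max_epochs,
    pct_start=0.1,           # 10% warmup
    anneal_strategy="cos",   # cosine annealing
)
```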

Mixed Precision Training

All forward and backward passes use torch.amp autocast with float16 precision, reducing memory usage and speeding up training on GPU.
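
A sketch of one training step combining AMP, gradient clipping, and the per-batch scheduler (the `torch.amp` namespace requires a recent PyTorch):

```python
import torch

scaler = torch.amp.GradScaler("cuda")

for specs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(specs.cuda())
        loss = criterion(logits, targets.cuda())
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)       # so clipping sees unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                 # OneCycleLR steps once per batch
```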

Multi-GPU Support

The model supports DataParallel training across multiple GPUs automatically. The best model state is always saved from the unwrapped module to ensure compatibility during single-GPU inference.
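
A sketch of the wrap-and-save pattern ("best_model.pt" is a placeholder file name):

```python
import torch

if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

# save the unwrapped module so the checkpoint loads on a single GPU
state = model.module.state_dict() if hasattr(model, "module") else model.state_dict()
torch.save(state, "best_model.pt")
```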

Early Stopping

Training stops automatically if validation accuracy does not improve for 12 consecutive epochs (patience=12). The best model checkpoint is saved based on validation accuracy.
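
In outline (`validate` and `save_checkpoint` are hypothetical helpers):

```python
best_acc, bad_epochs, patience = 0.0, 0, 12

for epoch in range(max_epochs):
    val_acc = validate(model)        # returns validation accuracy
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
        save_checkpoint(model)       # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                    # 12 epochs without improvement
```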

| Setting | Value |
|---------|-------|
| Optimizer | AdamW |
| Max LR | 3e-3 |
| LR Schedule | OneCycleLR (cosine annealing) |
| Weight Decay | 1e-4 |
| Max Epochs | 60 |
| Early Stopping | Patience = 12 |
| Batch Size | 64 |
| Label Smoothing | 0.1 |
| Mixup Alpha | 0.4 |
| Mixed Precision | float16 (AMP) |
| Dropout | 0.3 |

Inference

During inference, audio files are processed strictly one-by-one in naturally sorted order (1.wav, 2.wav, ...). The preprocessing pipeline runs on each file individually, and only the processing + prediction time is measured (I/O reading is excluded from the timer).

Two output files are produced:

  • results.txt - one predicted class label (0-5) per line
  • time.txt - processing time per file in seconds (rounded to 3 decimal places)
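
A sketch of the inference loop ("test_audio" is a placeholder directory; `preprocess` and `predict` stand in for the pipeline and model described above):

```python
import re
import time
from pathlib import Path

import librosa
import soundfile as sf

def natural_key(p: Path):
    # "2.wav" sorts before "10.wav"
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", p.name)]

results, timings = [], []
for wav in sorted(Path("test_audio").glob("*.wav"), key=natural_key):
    y, sr = sf.read(wav)                 # file I/O: excluded from the timer
    start = time.perf_counter()
    y = librosa.resample(y, orig_sr=sr, target_sr=16_000)
    pred = predict(preprocess(y))        # processing + prediction: timed
    timings.append(round(time.perf_counter() - start, 3))
    results.append(pred)

Path("results.txt").write_text("\n".join(map(str, results)) + "\n")
Path("time.txt").write_text("\n".join(f"{t:.3f}" for t in timings) + "\n")
```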

Requirements

  • Python 3.8+
  • PyTorch
  • torchaudio
  • librosa
  • noisereduce
  • numpy
  • soundfile
  • scikit-learn

Limitations

  • Trained only on 3 specific machine types; may not generalize to unseen machine types out of the box
  • Performance may degrade with extremely noisy environments beyond the training distribution
  • Fixed 11-second input window; very short recordings are zero-padded, which may affect accuracy

Team

Cairo University, Faculty of Engineering, Computer Engineering Department. Pattern Recognition and Neural Networks, Spring 2026.
