---
language:
- en
license: mit
tags:
- tabular
- classification
- scikit-learn
- ensemble-learning
- breast-cancer-detection
- medical-imaging
datasets:
- uci-wdbc
metrics:
- accuracy
- precision
- recall
- f1
- roc_auc
pipeline_tag: tabular-classification
---

# Breast Cancer Detection Ensemble Pipeline

A scikit-learn machine learning pipeline built around a **soft-voting ensemble classifier**. The model is trained on clinical data to distinguish malignant from benign tumors, with an emphasis on high sensitivity (recall) to limit false negatives in diagnostic screening.

This repository follows the methodology discussed in *"Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018)* and extends that baseline to a five-model ensemble in which every base estimator has its own scaling step.

---

# Model Description

The model uses a **soft-voting architecture** that averages the predicted class probabilities of five diverse base estimators. Each classifier is wrapped in its own preprocessing pipeline with a `StandardScaler`, so the scaling statistics are learned only from the data the pipeline is fit on, which keeps preprocessing leakage-free.

## Component Estimators

1. **Random Forest Classifier**
   - 72 estimators
   - Balanced class weights

2. **k-Nearest Neighbors (kNN)**
   - Euclidean distance metric
   - `k = 5`

3. **Gaussian Naive Bayes**
   - Probabilistic baseline classifier

4. **Support Vector Classifier (SVC)**
   - `rbf` kernel
   - Probability estimation enabled

5. **Logistic Regression**
   - Regularized linear classifier
   - Balanced class weights
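
The five components listed above could be assembled roughly as follows. This is a minimal sketch, not the author's training code: the seed, `max_iter`, and every hyperparameter not listed above are assumptions left at or near scikit-learn defaults, "balanced" for logistic regression is read as `class_weight="balanced"`, and scikit-learn's bundled copy of WDBC (`load_breast_cancer`) stands in for the training data. The helper name `build_ensemble` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumption: the model card does not state a seed


def build_ensemble(seed: int = SEED) -> VotingClassifier:
    """Build a soft-voting ensemble of five individually scaled estimators."""
    estimators = [
        ("rf", RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=seed)),
        ("knn", KNeighborsClassifier(n_neighbors=5, metric="euclidean")),
        ("nb", GaussianNB()),
        ("svc", SVC(kernel="rbf", probability=True, random_state=seed)),
        ("lr", LogisticRegression(
            class_weight="balanced", max_iter=1000, random_state=seed)),
    ]
    # Each estimator gets its own StandardScaler, so scaling statistics are
    # learned only from the data passed to fit().
    scaled = [
        (name, make_pipeline(StandardScaler(), est)) for name, est in estimators
    ]
    # voting="soft" averages the predicted class probabilities.
    return VotingClassifier(estimators=scaled, voting="soft")


X, y = load_breast_cancer(return_X_y=True)
ensemble = build_ensemble().fit(X, y)
proba = ensemble.predict_proba(X[:3])  # shape (3, 2): one row per sample
```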

---

# Dataset & Training Architecture

- **Dataset Source:** Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
- **Instances:** 569 samples
  - 357 benign
  - 212 malignant
- **Features:** 30 real-valued features computed from digitized fine needle aspirate (FNA) images
- **Split Strategy:** Stratified train-test split
  - Training: 398 samples
  - Testing: 171 samples

The pipeline uses:
- `StratifiedKFold` cross-validation
- Leakage-free preprocessing (scalers are fit inside each pipeline)
- Automated scaling pipelines
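
A stratified 70/30 split reproduces the 398/171 sample counts listed above. The sketch below assumes scikit-learn's bundled copy of WDBC, a seed of 42, five folds, and a single logistic regression pipeline as a stand-in model; none of these details are stated in this model card. Passing the whole pipeline to `cross_val_score` means the scaler is refit on the training part of every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In scikit-learn's copy of WDBC, label 0 is malignant and label 1 is benign.
X, y = load_breast_cancer(return_X_y=True)

# Stratify on y so both subsets keep the overall class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 398 171

# Stand-in model (assumption); the scaler lives inside the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds over the training set only; the test set stays untouched.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="recall")
print(scores.mean())
```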

---

# Performance Metrics

Evaluation prioritizes **Recall (Sensitivity)** to reduce false negatives while maintaining strong overall classification accuracy.

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| **Ensemble (Soft Voting)** | **0.9766** | **0.9725** | **0.9907** | **0.9815** | **0.9972** |
| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |

> **Note:** Results may vary slightly depending on package versions and random seeds.
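
As a worked check, the ensemble row can be reproduced from a single set of confusion counts over the 171 test samples. The counts below are inferred from the table, not reported by the author. They only fit if benign (label 1 in the inference example) is the positive class, since 0.9907 corresponds to 106 of 107 samples and a stratified test split contains 107 benign and 64 malignant cases. Under that reading, recall on the malignant class would have to be computed separately, as the last line does.

```python
# Inferred confusion counts with benign as the positive class (assumption).
tp, fn = 106, 1   # benign predicted benign / benign predicted malignant
fp, tn = 3, 61    # malignant predicted benign / malignant predicted malignant

total = tp + fn + fp + tn            # 171 test samples
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9766 0.9725 0.9907 0.9815

# Recall on the malignant class under the same inferred counts.
malignant_recall = tn / (tn + fp)
print(round(malignant_recall, 4))    # 0.9531
```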

---

# Installation

## Dependencies

```text
scikit-learn>=1.0
numpy
pandas
joblib
huggingface_hub
```

Install dependencies:

```bash
pip install scikit-learn numpy pandas joblib huggingface_hub
```

---

# Dynamic Inference Example

You can download and run the trained pipeline directly from the Hugging Face Hub.

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the serialized pipeline from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
    filename="ensemble_soft_voting.pkl",
)

# Load the fitted pipeline
pipeline = joblib.load(model_path)

# One example row with the 30 WDBC features
sample_data = [[
    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08
]]

# Reuse the training column names if the pipeline recorded them
feature_names = getattr(pipeline, "feature_names_in_", None)

input_df = pd.DataFrame(sample_data, columns=feature_names)

# Predict the class label and the per-class probabilities
prediction = pipeline.predict(input_df)
probabilities = pipeline.predict_proba(input_df)[0]

# Label 1 is treated as benign, anything else as malignant
diagnosis = (
    "Benign (Low Risk)"
    if prediction[0] == 1
    else "Malignant (High Risk)"
)

print(f"Diagnostic Assessment: {diagnosis}")

print(
    "Class probabilities -> "
    f"Malignant: {probabilities[0]:.4f} | "
    f"Benign: {probabilities[1]:.4f}"
)
```
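
The example above hard-codes probability index 0 as malignant and index 1 as benign. A slightly more defensive variant reads the column order from `classes_` instead. The sketch below uses a stand-in pipeline trained on scikit-learn's bundled WDBC copy, not the downloaded file. The `LABELS` mapping and the `labelled_probabilities` helper are hypothetical names, and the mapping assumes the 0 = malignant, 1 = benign encoding implied by the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed label encoding (matches scikit-learn's copy of WDBC).
LABELS = {0: "Malignant", 1: "Benign"}

# Stand-in for the downloaded pipeline, so the sketch runs offline.
X, y = load_breast_cancer(return_X_y=True)
stand_in = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
stand_in.fit(X, y)


def labelled_probabilities(model, rows):
    """Return one {label: probability} dict per input row."""
    probs = model.predict_proba(rows)
    # classes_ gives the label behind each predict_proba column.
    return [
        {LABELS[int(cls)]: float(p) for cls, p in zip(model.classes_, row)}
        for row in probs
    ]


result = labelled_probabilities(stand_in, X[:2])
for row in result:
    print(row)
```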

---

# Repository Structure

```text
.
├── ensemble_soft_voting.pkl
├── training_pipeline.ipynb
├── requirements.txt
└── README.md
```

---

# Limitations & Intended Use

This model is intended strictly for:
- Academic research
- Educational purposes
- Machine learning experimentation
- Pipeline prototyping

It is **NOT** approved for:
- Clinical deployment
- Medical diagnosis
- Real-world healthcare decision-making

All diagnostic decisions must be made by qualified medical professionals using certified medical systems.

---

# Citations

### Research Reference

```bibtex
@article{street1993nuclear,
  title={Nuclear feature extraction for breast tumor diagnosis},
  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
  journal={IS&T/SPIE Biomedical Imaging},
  year={1993}
}
```

### Dataset Reference

- UCI Machine Learning Repository
- Breast Cancer Wisconsin (Diagnostic) Dataset

---

# Acknowledgements

Special thanks to:
- UCI Machine Learning Repository
- Scikit-learn contributors
- Hugging Face Hub
- Open-source ML research community

---

# Model Author

**Sachini Praboda Nethranjali**
Electronic and Computer Science Undergraduate
University of Kelaniya, Sri Lanka