	---
	language:
	- en
	license: mit
	tags:
	- tabular
	- classification
	- scikit-learn
	- ensemble-learning
	- breast-cancer-detection
	- medical-imaging
	datasets:
	- uci-wdbc
	metrics:
	- accuracy
	- precision
	- recall
	- f1
	- roc_auc
	pipeline_tag: tabular-classification
	---

	# 🎗️ Breast Cancer Detection Ensemble Pipeline

	A machine learning pipeline built around a soft-voting ensemble classifier. The model is trained on clinical data to distinguish between malignant and benign tumors, and its evaluation emphasizes recall (sensitivity) in order to limit false negatives in diagnostic screening.

	The approach follows the methodology discussed in "Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018) and extends that baseline to a five-model ensemble in which every estimator is scaled inside its own pipeline.

	---

	# 📊 Model Description

	The model is a soft-voting ensemble: it averages the predicted class probabilities of five diverse base estimators and predicts the class with the highest average. Each base classifier sits in its own preprocessing pipeline with a `StandardScaler`, so scaling statistics are learned from the training data only and no information leaks from held-out samples.
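
As a rough illustration of what soft voting computes, the sketch below averages per-class probabilities by hand using only the standard library. The five probability pairs are hypothetical values chosen for the example, not outputs of the published model, and equal weighting is assumed (the default for scikit-learn's `VotingClassifier`).

```python
from statistics import fmean


def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models and pick the top class.

    probabilities_per_model: one [p_class0, p_class1, ...] list per model.
    Returns (averaged_probabilities, predicted_class_index).
    """
    n_classes = len(probabilities_per_model[0])
    averaged = [
        fmean(model_probs[c] for model_probs in probabilities_per_model)
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted


# Hypothetical outputs of five base estimators for a single sample
# (class 0 = malignant, class 1 = benign, as in the inference example).
example_probs = [
    [0.30, 0.70],
    [0.40, 0.60],
    [0.55, 0.45],  # this model alone would predict class 0
    [0.20, 0.80],
    [0.35, 0.65],
]

averaged, predicted = soft_vote(example_probs)
print(averaged, predicted)
```

The third model leans toward class 0 on its own, but the averaged probabilities still favor class 1, which is the behavior that distinguishes soft voting from relying on any single estimator.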

	## Component Estimators

	1. Random Forest Classifier
	   - 72 estimators
	   - Balanced class weights

	2. k-Nearest Neighbors (kNN)
	   - Euclidean distance metric
	   - `k = 5`

	3. Gaussian Naive Bayes
	   - Probabilistic baseline classifier

	4. Support Vector Classifier (SVC)
	   - `rbf` kernel
	   - Probability estimation enabled

	5. Logistic Regression
	   - Regularized linear classifier
	   - Balanced class distributions
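
One way the five estimators listed above could be assembled with scikit-learn is sketched here. It is a reconstruction from the listed settings, not the author's training code: the seed, `max_iter=1000`, the estimator names, and reading "balanced class distributions" as `class_weight="balanced"` are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumption: the seed used for the published model is not stated


def scaled(estimator):
    """Wrap an estimator so the scaler is fitted only on data passed to fit()."""
    return make_pipeline(StandardScaler(), estimator)


def build_ensemble():
    """Return an unfitted soft-voting ensemble of five scaled base estimators."""
    estimators = [
        ("rf", scaled(RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=SEED))),
        ("knn", scaled(KNeighborsClassifier(n_neighbors=5, metric="euclidean"))),
        ("nb", scaled(GaussianNB())),
        # probability=True is required so SVC exposes predict_proba for soft voting.
        ("svc", scaled(SVC(kernel="rbf", probability=True, random_state=SEED))),
        # class_weight="balanced" and max_iter are assumed settings.
        ("lr", scaled(LogisticRegression(class_weight="balanced", max_iter=1000))),
    ]
    return VotingClassifier(estimators=estimators, voting="soft")


ensemble = build_ensemble()
print([name for name, _ in ensemble.estimators])
```

Giving each estimator its own scaler is redundant for models that do not depend on feature scale, such as Random Forest, but it keeps every branch of the ensemble self-contained.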

	---

	# 📈 Dataset & Training Architecture

	- Dataset Source: Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
	- Instances: 569 samples
	  - 357 Benign
	  - 212 Malignant
	- Features: 30 real-valued clinical features extracted from digitized fine needle aspirate (FNA) images
	- Split Strategy: Stratified train-test split (approximately 70/30)
	  - Training: 398 samples
	  - Testing: 171 samples

	The pipeline uses:
	- `StratifiedKFold` cross-validation
	- Leakage-free preprocessing
	- Automated scaling pipelines
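
The split and cross-validation setup described above might look like the following sketch. Several details are assumptions rather than facts from this card: loading the data through scikit-learn's bundled `load_breast_cancer` copy of WDBC, `test_size=0.3`, `random_state=42`, five folds, ROC-AUC scoring, and a single Logistic Regression pipeline standing in for the full ensemble.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of WDBC: 569 samples, 30 features.
X, y = load_breast_cancer(return_X_y=True)

# Stratified hold-out split; a 0.3 test fraction yields 398 / 171 samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))

# The scaler lives inside the pipeline, so each CV fold fits it on that
# fold's training portion only; the validation portion never influences it.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(scores.mean())
```

Cross-validation runs on the training portion only, which leaves the 171 test samples untouched until the final evaluation.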

	---

	# ⚡ Performance Metrics

	Evaluation prioritizes Recall (Sensitivity) to reduce false negatives while maintaining strong overall classification accuracy.

	| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
	|---|---|---|---|---|---|
	| Ensemble (Soft Voting) | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9972 |
	| Random Forest | 0.9649 | 0.9633 | 0.9813 | 0.9722 | 0.9936 |
	| kNN | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9877 |
	| Support Vector Machine | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9974 |
	| Logistic Regression | 0.9766 | 0.9725 | 0.9907 | 0.9815 | 0.9969 |
	| Naive Bayes | 0.9591 | 0.9545 | 0.9813 | 0.9677 | 0.9892 |

	> Note: Results may vary slightly depending on package versions and random seeds.
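
The figures in the table can be cross-checked from confusion-matrix counts. The counts used below are not stated in this card. They are inferred from the rounded ensemble row and the 171-sample test set, and they reproduce that row only when benign (label 1 in the inference example) is treated as the positive class. Treat them as an assumption until checked against the training notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Inferred (not published) counts, with benign as the positive class.
benign_positive = binary_metrics(tp=106, fp=3, fn=1, tn=61)

# The same predictions viewed with malignant as the positive class:
# true positives and true negatives swap, as do the two error types.
malignant_positive = binary_metrics(tp=61, fp=1, fn=3, tn=106)

for name, metrics in [("benign+", benign_positive), ("malignant+", malignant_positive)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

If the inferred counts are right, the Recall column reports benign-class recall, and malignant-class recall for the ensemble would be 61/64, or about 0.953. With scikit-learn, that quantity can be requested directly by passing `pos_label=0` to `recall_score`.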

	---

	# 💻 Installation

	## Dependencies

	```text
	scikit-learn>=1.0
	numpy
	pandas
	joblib
	huggingface_hub
	```

	Install dependencies:

	```bash
	pip install scikit-learn numpy pandas joblib huggingface_hub
	```

	---

	# 🚀 Dynamic Inference Example

	You can download the trained pipeline directly from the Hugging Face Hub and run it locally.

	```python
	import joblib
	import pandas as pd
	from huggingface_hub import hf_hub_download

	# Download the serialized pipeline from the Hub.
	model_path = hf_hub_download(
	    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
	    filename="ensemble_soft_voting.pkl",
	)

	# Load the pipeline. joblib relies on pickle, so only load files you trust.
	pipeline = joblib.load(model_path)

	# Example input: one sample with the 30 WDBC features.
	sample_data = [[
	    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
	    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
	    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08,
	]]

	# Reuse the training column names when the pipeline recorded them;
	# otherwise fall back to default integer column labels.
	feature_names = getattr(pipeline, "feature_names_in_", None)
	input_df = pd.DataFrame(sample_data, columns=feature_names)

	# Predict the class label and the per-class probabilities.
	prediction = pipeline.predict(input_df)
	probabilities = pipeline.predict_proba(input_df)[0]

	# Label convention used here: 0 = malignant, 1 = benign.
	diagnosis = (
	    "Benign (Low Risk)"
	    if prediction[0] == 1
	    else "Malignant (High Risk)"
	)

	print(f"Diagnostic Assessment: {diagnosis}")

	print(
	    "Class Probabilities -> "
	    f"Malignant: {probabilities[0]:.4f} | "
	    f"Benign: {probabilities[1]:.4f}"
	)
	```

	---

	# 📂 Repository Structure

	```text
	.
	├── ensemble_soft_voting.pkl
	├── training_pipeline.ipynb
	├── requirements.txt
	└── README.md
	```

	---

	# ⚠️ Limitations & Intended Use

	This model is developed strictly for:
	- Academic research
	- Educational purposes
	- Machine learning experimentation
	- Pipeline prototyping

	It is NOT approved for:
	- Clinical deployment
	- Medical diagnosis
	- Real-world healthcare decision-making

	All diagnostic decisions must be made by qualified medical professionals using certified medical systems.

	---

	# 📜 Citations

	### Research Reference

	```bibtex
	@article{street1993nuclear,
	  title={Nuclear feature extraction for breast tumor diagnosis},
	  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
	  journal={IS&T/SPIE Biomedical Imaging},
	  year={1993}
	}
	```

	### Dataset Reference

	- UCI Machine Learning Repository
	- Breast Cancer Wisconsin (Diagnostic) Dataset

	---

	# 🤝 Acknowledgements

	Special thanks to:
	- UCI Machine Learning Repository
	- Scikit-learn contributors
	- Hugging Face Hub
	- Open-source ML research community

	---

	# 🧠 Model Author

	Sachini Praboda Nethranjali
	Electronic and Computer Science Undergraduate
	University of Kelaniya, Sri Lanka