FredZhang7/malicious-website-features-2.4M
How to use FredZhang7/malphish-eater-v1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="FredZhang7/malphish-eater-v1")

# Or load the model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malphish-eater-v1")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malphish-eater-v1")
```

It's very important to note that this model is not production-ready.
The classification task for v1 is split into two stages:
I applied 5-fold cross-validation (cv=5) on the training dataset to search for the best hyperparameters. Here's the parameter grid passed to sklearn's GridSearchCV:
```python
params = {
    # every value must be a list for GridSearchCV, even fixed settings
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000]
}
```
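To get a sense of how expensive this search is, the grid above expands to every combination of the listed values, and each candidate is fit once per fold under cv=5. A quick stdlib-only sketch of the count:

```python
from itertools import product

# The hyperparameter grid from the model card (fixed settings omitted)
grid = {
    'boosting_type': ['gbdt', 'dart'],
    'num_leaves': [15, 23, 31, 63],
    'learning_rate': [0.001, 0.002, 0.01, 0.02],
    'feature_fraction': [0.5, 0.6, 0.7, 0.9],
    'early_stopping_rounds': [10, 20],
    'num_boost_round': [500, 750, 800, 900, 1000, 1250, 2000],
}

# Every combination of the listed values is one candidate model
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(candidates))      # 2 * 4 * 4 * 4 * 2 * 7 = 1792 candidates
print(len(candidates) * 5)  # 8960 fits in total under cv=5
```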
To reproduce the model with 98.4% accuracy, you can follow the data analysis on the dataset page to filter out the unimportant features. Then train a LightGBM model using the hyperparameters best suited to this task:
```python
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'early_stopping_rounds': 10,
    'num_boost_round': 800
}
```
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/malware-phisher")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/malware-phisher")
```
```bash
pip install lightgbm
```

```python
import lightgbm as lgb

# Load the pretrained LightGBM booster from file
booster = lgb.Booster(model_file="phishing_model_combined_0.984_train.txt")
```