Gemma 4 E4B Fine-Tuned with Unsloth QLoRA
Competition: The Gemma 4 Good Hackathon on Kaggle
Tracks: Unsloth ($10K prize) + Impact Tracks
Framework: Unsloth (2x faster fine-tuning)
Base Model: google/gemma-4-e4b-it (4B effective params, instruction-tuned)
Highlights
- 99.6% training loss reduction: from 2.916 (baseline) to 0.0115 (final)
- 5 epochs of QLoRA fine-tuning on 10,000 high-quality samples
- Only 2.29% of parameters trained (146.8M / 6.4B) via rank-stabilized LoRA
- 12 hours total training on a single NVIDIA L4 GPU (24GB)
How to Use
With Unsloth (Recommended)
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "bradduy/Any2AnyModels",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

messages = [
    {"role": "user", "content": "Explain how renewable energy helps developing communities"}
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
With Transformers + PEFT
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Passing load_in_4bit directly is deprecated; use a BitsAndBytesConfig instead
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e4b-it",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "bradduy/Any2AnyModels")
tokenizer = AutoTokenizer.from_pretrained("bradduy/Any2AnyModels")
Training Details
Method
We used Unsloth's QLoRA implementation with rank-stabilized LoRA (RSLoRA) for parameter-efficient fine-tuning. The most impactful finding was that multi-epoch training reduces loss dramatically with each additional pass over the data.
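The difference between standard LoRA and RSLoRA comes down to how the low-rank update is scaled: standard LoRA multiplies the update BA by alpha/r, while RSLoRA uses alpha/sqrt(r), which keeps the update magnitude stable as rank grows. A minimal sketch (the function name `lora_scale` is ours, not an Unsloth or PEFT API):

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling factor applied to the LoRA update BA.

    Standard LoRA scales by alpha/r, so doubling the rank halves the
    effective update; RSLoRA scales by alpha/sqrt(r), which decays more
    slowly and stabilizes training at higher ranks.
    """
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# With this run's settings (r=64, alpha=64):
print(lora_scale(64, 64))                   # standard LoRA: 1.0
print(lora_scale(64, 64, use_rslora=True))  # RSLoRA: 8.0
```

This is why rank and alpha interact differently under RSLoRA: at r=64 with alpha=64, the RSLoRA update is scaled 8x more strongly than the standard LoRA update would be.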
Configuration
| Parameter | Value |
|---|---|
| Base Model | google/gemma-4-e4b-it (4B params) |
| Quantization | 4-bit QLoRA via bitsandbytes |
| LoRA Rank | 64 |
| LoRA Alpha | 64 |
| RSLoRA | Enabled (rank-stabilized scaling) |
| Learning Rate | 7e-5 |
| LR Scheduler | Cosine |
| Epochs | 5 |
| Dataset Size | 10,000 samples |
| Effective Batch Size | 8 (batch size 1 × 8 gradient accumulation steps) |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Total Steps | 6,250 |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
| Seed | 3407 |
| Response Masking | train_on_responses_only enabled |
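The step counts in the table follow from the dataset size and effective batch: 10,000 samples / 8 = 1,250 steps per epoch, times 5 epochs = 6,250 total steps. A close approximation of the cosine-with-warmup schedule used here (a sketch of the standard linear-warmup + cosine-decay formula, not the exact library internals):

```python
import math

PEAK_LR = 7e-5   # learning rate from the table
WARMUP = 50      # linear warmup steps
TOTAL = 6_250    # 10,000 samples / effective batch 8 * 5 epochs

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP  # linear warmup from 0
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0

print(lr_at(50))    # peak: 7e-05
print(lr_at(3150))  # midpoint of decay: ~3.5e-05
```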
Dataset
- Source: mlabonne/FineTome-100k
- Samples Used: 10,000 (first 10k)
- Format: Multi-turn chat conversations
- Chat Template: Gemma 4 native (role: "model", not "assistant")
- Masking: Only model responses contribute to loss (instruction tokens masked)
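The idea behind response-only masking can be sketched in a few lines: tokens belonging to user turns get the label -100 (which PyTorch's cross-entropy loss ignores), so gradients flow only through the model's responses. This is an illustrative reimplementation of the concept, not Unsloth's actual `train_on_responses_only` code; the `<user>`/`<model>` markers are placeholders, not real Gemma tokens:

```python
IGNORE_INDEX = -100  # label ignored by PyTorch's cross-entropy loss

def mask_instruction_tokens(tokens, user_marker="<user>", model_marker="<model>"):
    """Return labels where only tokens inside model turns are kept;
    user-turn tokens and the turn markers themselves are masked."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == user_marker:
            in_response = False
        elif tok == model_marker:
            in_response = True
        labels.append(tok if in_response and tok != model_marker else IGNORE_INDEX)
    return labels

print(mask_instruction_tokens(["<user>", "hi", "<model>", "hello", "!"]))
# [-100, -100, -100, 'hello', '!']
```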
Hardware
- GPU: NVIDIA L4 (24GB VRAM)
- RAM: 32GB
- Training Time: ~12 hours (with checkpoint resume)
- GPU Memory Used: ~14.8GB during training
Experiment Journey
We ran 8 systematic experiments to find the optimal configuration:
| Exp | LoRA r | Epochs | Samples | LR | Train Loss | Key Finding |
|---|---|---|---|---|---|---|
| 01 | 16 | 0.13 | 3k | 2e-4 | 2.916 | Baseline |
| 02 | 32 | 0.24 | 5k | 2e-4 | 1.725 | Higher rank helps (+41%) |
| 03 | 64+RSLoRA | 0.20 | 10k | 2e-4 | 1.460 | RSLoRA + more data (+50%) |
| 04 | 64+RSLoRA | 0.40 | 20k | 1e-4 | ~1.05 | Lower LR improves convergence |
| 05 | 128+RSLoRA | 0.40 | 20k | 5e-5 | 1.134 | r=128 slower than r=64 |
| 06 | 64+RSLoRA | 3 | 10k | 1e-4 | ~0.30 | Multi-epoch is transformative |
| 07 | 128+RSLoRA | 3 | 10k | 1e-4 | ~0.59 | r=64 > r=128 for multi-epoch |
| 08 | 64+RSLoRA | 5 | 10k | 7e-5 | 0.0115 | 5 epochs = 99.6% reduction |
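The percentages quoted in the table can be checked directly against the Exp 01 baseline loss of 2.916 (the helper `reduction_pct` is ours, for illustration only):

```python
BASELINE = 2.916  # Exp 01 training loss

def reduction_pct(loss: float, base: float = BASELINE) -> float:
    """Relative loss reduction versus the baseline, in percent."""
    return round(100 * (base - loss) / base, 1)

print(reduction_pct(1.725))   # Exp 02: 40.8 (the table's "+41%")
print(reduction_pct(1.460))   # Exp 03: 49.9 (the table's "+50%")
print(reduction_pct(0.0115))  # Exp 08: 99.6 (the headline number)
```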
The Multi-Epoch Discovery
The single most impactful finding: each additional epoch delivers a dramatic, consistent loss reduction:
Epoch 1: loss ~0.90 (learning the patterns)
Epoch 2: loss ~0.60 (reinforcing knowledge)
Epoch 3: loss ~0.30 (deep memorization)
Epoch 4: loss ~0.10 (fine polishing)
Epoch 5: loss ~0.01 (near-perfect fitting)
This pattern was consistent across experiments 06, 07, and 08. The loss drops happen at each epoch boundary as the model sees the training data again.
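Computing the drop factor at each epoch boundary from the approximate losses above shows the reduction actually accelerating over the run:

```python
# Approximate end-of-epoch training losses from the run above
epoch_losses = [0.90, 0.60, 0.30, 0.10, 0.01]

# Factor by which the loss shrinks at each epoch boundary
drop_factors = [round(a / b, 1) for a, b in zip(epoch_losses, epoch_losses[1:])]
print(drop_factors)  # [1.5, 2.0, 3.0, 10.0]
```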
Other Key Insights
- r=64 with RSLoRA is the sweet spot: r=128 converges slower and provides no benefit in multi-epoch settings
- Lower LR (7e-5) stabilizes long training: the higher LR (2e-4) causes instability after epoch 2
- train_on_responses_only is essential: it masks user/system tokens so the model only learns from responses
- Checkpoint saving every 250 steps: long CUDA runs crash from memory fragmentation; resuming from checkpoints solved this
- 10k high-quality samples beat 20k samples for multi-epoch training: quality over quantity when doing multiple passes
Training Pipeline
Built entirely with Unsloth:
from unsloth import FastModel
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import get_chat_template, train_on_responses_only
# 1. Load 4-bit quantized model
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=2048, load_in_4bit=True,
)

# 2. Apply LoRA adapters (r=64, RSLoRA)
model = FastModel.get_peft_model(model,
    finetune_vision_layers=False, finetune_language_layers=True,
    finetune_attention_modules=True, finetune_mlp_modules=True,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    random_state=3407, use_rslora=True,
)

# 3. Setup Gemma 4 chat template
tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")

# 4. Train with response-only masking
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1, gradient_accumulation_steps=8,
        learning_rate=7e-5, num_train_epochs=5, lr_scheduler_type="cosine",
        warmup_steps=50, weight_decay=0.01, optim="adamw_8bit",
        save_strategy="steps", save_steps=250, save_total_limit=3,
    ),
)
trainer = train_on_responses_only(trainer,
    instruction_part="<start_of_turn>user\n", response_part="<start_of_turn>model\n",
)
trainer.train()
Reproduce Training
git clone https://github.com/bradduy/Any2AnyModels
cd Any2AnyModels
pip install unsloth
python scripts/train.py \
--model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit \
--load-4bit --lora-rank 64 --use-rslora \
--dataset mlabonne/FineTome-100k --max-samples 10000 \
--num-epochs 5 --learning-rate 7e-5 --grad-accum 8 \
--weight-decay 0.01 --warmup-steps 50 --scheduler cosine \
--save-steps 250 --save-total-limit 3
Limitations
- Fine-tuned on English-only data (FineTome-100k)
- Optimized for instruction following, not domain-specific tasks
- 4B parameter model: larger models (26B, 31B) would perform better but require more VRAM
- Training loss ≠ downstream task performance; the model should be evaluated on specific benchmarks
Acknowledgments
- Google DeepMind for the Gemma 4 model family
- Unsloth for making QLoRA fine-tuning 2x faster and memory-efficient
- Kaggle for hosting the Gemma 4 Good Hackathon
- mlabonne for the FineTome-100k dataset
License
Apache 2.0 (same as Gemma 4)