Gemma 4 E4B Fine-Tuned with Unsloth QLoRA
Competition: The Gemma 4 Good Hackathon on Kaggle
Tracks: Unsloth ($10K prize) + Impact Tracks
Framework: Unsloth (2x faster fine-tuning)
Base Model: google/gemma-4-e4b-it (4B effective params, instruction-tuned)
Highlights
- 99.6% training loss reduction: from 2.916 (baseline) to 0.0115 (final)
- 5 epochs of QLoRA fine-tuning on 10,000 high-quality samples
- Only 2.29% of parameters trained (146.8M / 6.4B) via rank-stabilized LoRA
- 12 hours total training on a single NVIDIA L4 GPU (24GB)
How to Use
With Unsloth (Recommended)
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "bradduy/Any2AnyModels",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

messages = [
    {"role": "user", "content": "Explain how renewable energy helps developing communities"}
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
With Transformers + PEFT
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Passing load_in_4bit directly is deprecated; use a BitsAndBytesConfig instead
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e4b-it",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "bradduy/Any2AnyModels")
tokenizer = AutoTokenizer.from_pretrained("bradduy/Any2AnyModels")
Training Details
Method
We used Unsloth's QLoRA implementation with rank-stabilized LoRA (RSLoRA) for parameter-efficient fine-tuning. The most impactful finding was that multi-epoch training reduces loss dramatically with each additional pass over the data.
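The difference between standard LoRA and RSLoRA comes down to how the low-rank update is scaled: standard LoRA multiplies the update BA by alpha/r, while RSLoRA uses alpha/sqrt(r), which keeps the update magnitude stable as rank grows. A minimal sketch (the function name `lora_scale` is ours, not an Unsloth or PEFT API):

```python
import math

def lora_scale(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling factor applied to the LoRA update BA.

    Standard LoRA scales by alpha/r, so doubling the rank halves the
    effective update; RSLoRA scales by alpha/sqrt(r), which decays more
    slowly and stabilizes training at higher ranks.
    """
    return alpha / math.sqrt(r) if use_rslora else alpha / r

# With this run's settings (r=64, alpha=64):
print(lora_scale(64, 64))                   # standard LoRA: 1.0
print(lora_scale(64, 64, use_rslora=True))  # RSLoRA: 8.0
```

This is why rank and alpha interact differently under RSLoRA: at r=64 with alpha=64, the RSLoRA update is scaled 8x more strongly than the standard LoRA update would be.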
Configuration
| Parameter | Value |
|---|---|
| Base Model | google/gemma-4-e4b-it (4B params) |
| Quantization | 4-bit QLoRA via bitsandbytes |
| LoRA Rank | 64 |
| LoRA Alpha | 64 |
| RSLoRA | Enabled (rank-stabilized scaling) |
| Learning Rate | 7e-5 |
| LR Scheduler | Cosine |
| Epochs | 5 |
| Dataset Size | 10,000 samples |
| Effective Batch Size | 8 (batch size 1 × 8 gradient accumulation steps) |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Total Steps | 6,250 |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
| Seed | 3407 |
| Response Masking | train_on_responses_only enabled |
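The step counts in the table follow from the dataset size and effective batch: 10,000 samples / 8 = 1,250 steps per epoch, times 5 epochs = 6,250 total steps. A close approximation of the cosine-with-warmup schedule used here (a sketch of the standard linear-warmup + cosine-decay formula, not the exact library internals):

```python
import math

PEAK_LR = 7e-5   # learning rate from the table
WARMUP = 50      # linear warmup steps
TOTAL = 6_250    # 10,000 samples / effective batch 8 * 5 epochs

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP  # linear warmup from 0
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0

print(lr_at(50))    # peak: 7e-05
print(lr_at(3150))  # midpoint of decay: ~3.5e-05
```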
Dataset
- Source: mlabonne/FineTome-100k
- Samples Used: 10,000 (first 10k)
- Format: Multi-turn chat conversations
- Chat Template: Gemma 4 native (role: "model", not "assistant")
- Masking: Only model responses contribute to loss (instruction tokens masked)
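The idea behind response-only masking can be sketched in a few lines: tokens belonging to user turns get the label -100 (which PyTorch's cross-entropy loss ignores), so gradients flow only through the model's responses. This is an illustrative reimplementation of the concept, not Unsloth's actual `train_on_responses_only` code; the `<user>`/`<model>` markers are placeholders, not real Gemma tokens:

```python
IGNORE_INDEX = -100  # label ignored by PyTorch's cross-entropy loss

def mask_instruction_tokens(tokens, user_marker="<user>", model_marker="<model>"):
    """Return labels where only tokens inside model turns are kept;
    user-turn tokens and the turn markers themselves are masked."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == user_marker:
            in_response = False
        elif tok == model_marker:
            in_response = True
        labels.append(tok if in_response and tok != model_marker else IGNORE_INDEX)
    return labels

print(mask_instruction_tokens(["<user>", "hi", "<model>", "hello", "!"]))
# [-100, -100, -100, 'hello', '!']
```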
Hardware
- GPU: NVIDIA L4 (24GB VRAM)
- RAM: 32GB
- Training Time: ~12 hours (with checkpoint resume)
- GPU Memory Used: ~14.8GB during training
Experiment Journey
We ran 8 systematic experiments to find the optimal configuration:
| Exp | LoRA r | Epochs | Samples | LR | Train Loss | Key Finding |
|---|---|---|---|---|---|---|
| 01 | 16 | 0.13 | 3k | 2e-4 | 2.916 | Baseline |
| 02 | 32 | 0.24 | 5k | 2e-4 | 1.725 | Higher rank helps (+41%) |
| 03 | 64+RSLoRA | 0.20 | 10k | 2e-4 | 1.460 | RSLoRA + more data (+50%) |
| 04 | 64+RSLoRA | 0.40 | 20k | 1e-4 | ~1.05 | Lower LR improves convergence |
| 05 | 128+RSLoRA | 0.40 | 20k | 5e-5 | 1.134 | r=128 slower than r=64 |
| 06 | 64+RSLoRA | 3 | 10k | 1e-4 | ~0.30 | Multi-epoch is transformative |
| 07 | 128+RSLoRA | 3 | 10k | 1e-4 | ~0.59 | r=64 > r=128 for multi-epoch |
| 08 | 64+RSLoRA | 5 | 10k | 7e-5 | 0.0115 | 5 epochs = 99.6% reduction |
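The percentages quoted in the table can be checked directly against the Exp 01 baseline loss of 2.916 (the helper `reduction_pct` is ours, for illustration only):

```python
BASELINE = 2.916  # Exp 01 training loss

def reduction_pct(loss: float, base: float = BASELINE) -> float:
    """Relative loss reduction versus the baseline, in percent."""
    return round(100 * (base - loss) / base, 1)

print(reduction_pct(1.725))   # Exp 02: 40.8 (the table's "+41%")
print(reduction_pct(1.460))   # Exp 03: 49.9 (the table's "+50%")
print(reduction_pct(0.0115))  # Exp 08: 99.6 (the headline number)
```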
The Multi-Epoch Discovery
The single most impactful finding: each additional epoch delivers a dramatic, consistent loss reduction:
Epoch 1: loss ~0.90 (learning the patterns)
Epoch 2: loss ~0.60 (reinforcing knowledge)
Epoch 3: loss ~0.30 (deep memorization)
Epoch 4: loss ~0.10 (fine polishing)
Epoch 5: loss ~0.01 (near-perfect fitting)
This pattern was consistent across experiments 06, 07, and 08. The loss drops happen at each epoch boundary as the model sees the training data again.
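Computing the drop factor at each epoch boundary from the approximate losses above shows the reduction actually accelerating over the run:

```python
# Approximate end-of-epoch training losses from the run above
epoch_losses = [0.90, 0.60, 0.30, 0.10, 0.01]

# Factor by which the loss shrinks at each epoch boundary
drop_factors = [round(a / b, 1) for a, b in zip(epoch_losses, epoch_losses[1:])]
print(drop_factors)  # [1.5, 2.0, 3.0, 10.0]
```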
Other Key Insights
- r=64 with RSLoRA is the sweet spot: r=128 converges slower and provides no benefit in multi-epoch settings
- Lower LR (7e-5) stabilizes long training: the higher LR (2e-4) causes instability after epoch 2
- train_on_responses_only is essential: it masks user/system tokens so the model only learns from responses
- Checkpoint saving every 250 steps: long CUDA runs crash from memory fragmentation; resuming from checkpoints solved this
- 10k high-quality samples beat 20k samples for multi-epoch training: quality over quantity when doing multiple passes
Training Pipeline
Built entirely with Unsloth:
from unsloth import FastModel
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import get_chat_template, train_on_responses_only
# 1. Load 4-bit quantized model
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=2048, load_in_4bit=True,
)

# 2. Apply LoRA adapters (r=64, RSLoRA)
model = FastModel.get_peft_model(model,
    finetune_vision_layers=False, finetune_language_layers=True,
    finetune_attention_modules=True, finetune_mlp_modules=True,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    random_state=3407, use_rslora=True,
)

# 3. Setup Gemma 4 chat template
tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")

# 4. Train with response-only masking
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1, gradient_accumulation_steps=8,
        learning_rate=7e-5, num_train_epochs=5, lr_scheduler_type="cosine",
        warmup_steps=50, weight_decay=0.01, optim="adamw_8bit",
        save_strategy="steps", save_steps=250, save_total_limit=3,
    ),
)
trainer = train_on_responses_only(trainer,
    instruction_part="<start_of_turn>user\n", response_part="<start_of_turn>model\n",
)
trainer.train()
Reproduce Training
git clone https://github.com/bradduy/Any2AnyModels
cd Any2AnyModels
pip install unsloth
python scripts/train.py \
--model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit \
--load-4bit --lora-rank 64 --use-rslora \
--dataset mlabonne/FineTome-100k --max-samples 10000 \
--num-epochs 5 --learning-rate 7e-5 --grad-accum 8 \
--weight-decay 0.01 --warmup-steps 50 --scheduler cosine \
--save-steps 250 --save-total-limit 3
Limitations
- Fine-tuned on English-only data (FineTome-100k)
- Optimized for instruction following, not domain-specific tasks
- 4B parameter model: larger models (26B, 31B) would perform better but require more VRAM
- Training loss ≠ downstream task performance; the model should be evaluated on specific benchmarks
Acknowledgments
- Google DeepMind for the Gemma 4 model family
- Unsloth for making QLoRA fine-tuning 2x faster and memory-efficient
- Kaggle for hosting the Gemma 4 Good Hackathon
- mlabonne for the FineTome-100k dataset
License
Apache 2.0 (same as Gemma 4)