Gemma 4 E4B Fine-Tuned with Unsloth QLoRA

Competition: The Gemma 4 Good Hackathon on Kaggle
Tracks: Unsloth ($10K prize) + Impact Tracks
Framework: Unsloth (2x faster fine-tuning)
Base Model: google/gemma-4-e4b-it (4B effective params, instruction-tuned)

Highlights

  • 99.6% training loss reduction: from 2.916 (baseline) to 0.0115 (final)
  • 5 epochs of QLoRA fine-tuning on 10,000 high-quality samples
  • Only 2.29% of parameters trained (146.8M / 6.4B) via rank-stabilized LoRA
  • 12 hours total training on a single NVIDIA L4 GPU (24GB)
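
These headline figures are easy to sanity-check; the snippet below simply recomputes them from the raw numbers quoted above:

```python
# Sanity-check the headline numbers (values taken from the card above).
trainable, total = 146.8e6, 6.4e9
loss_start, loss_end = 2.916, 0.0115

pct_trained = 100 * trainable / total
loss_reduction = 100 * (1 - loss_end / loss_start)

print(f"{pct_trained:.2f}% of parameters trained")  # ~2.29%
print(f"{loss_reduction:.1f}% loss reduction")      # ~99.6%
```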

How to Use

With Unsloth (Recommended)

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "bradduy/Any2AnyModels",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

messages = [
    {"role": "user", "content": "Explain how renewable energy helps developing communities"}
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

With Transformers + PEFT

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e4b-it",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "bradduy/Any2AnyModels")
tokenizer = AutoTokenizer.from_pretrained("bradduy/Any2AnyModels")

Training Details

Method

We used Unsloth's QLoRA implementation with rank-stabilized LoRA (RSLoRA) for parameter-efficient fine-tuning. The most impactful finding was that multi-epoch training reduces the loss dramatically with each additional pass over the data.
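
RSLoRA changes only the adapter scaling: standard LoRA multiplies the low-rank update by alpha / r, while rank-stabilized LoRA uses alpha / sqrt(r), so higher ranks are not implicitly down-weighted. A minimal illustration with this run's r=64, alpha=64 (illustrative arithmetic only, not Unsloth's internal code):

```python
import math

# Compare LoRA scaling factors for this run's hyperparameters (r=64, alpha=64).
# Standard LoRA scales the adapter update by alpha / r; rank-stabilized LoRA
# (RSLoRA) uses alpha / sqrt(r), keeping the update magnitude stable as r grows.
r, alpha = 64, 64

standard_scale = alpha / r           # 1.0 -> shrinks as r increases
rslora_scale = alpha / math.sqrt(r)  # 8.0 -> stable across ranks

print(standard_scale, rslora_scale)
```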

Configuration

Parameter | Value
--- | ---
Base Model | google/gemma-4-e4b-it (4B params)
Quantization | 4-bit QLoRA via bitsandbytes
LoRA Rank | 64
LoRA Alpha | 64
RSLoRA | Enabled (rank-stabilized scaling)
Learning Rate | 7e-5
LR Scheduler | Cosine
Epochs | 5
Dataset Size | 10,000 samples
Effective Batch Size | 8 (1 × 8 gradient accumulation)
Weight Decay | 0.01
Warmup Steps | 50
Total Steps | 6,250
Max Seq Length | 2048
Optimizer | AdamW 8-bit
Seed | 3407
Response Masking | train_on_responses_only enabled
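
The total step count in the table follows directly from the other hyperparameters:

```python
# Derive the total step count from the other hyperparameters in the table.
samples = 10_000
per_device_batch = 1
grad_accum = 8
epochs = 5

effective_batch = per_device_batch * grad_accum  # 8
steps_per_epoch = samples // effective_batch     # 1,250
total_steps = steps_per_epoch * epochs           # 6,250
print(total_steps)
```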

Dataset

  • Source: mlabonne/FineTome-100k
  • Samples Used: 10,000 (first 10k)
  • Format: Multi-turn chat conversations
  • Chat Template: Gemma 4 native (role: "model", not "assistant")
  • Masking: Only model responses contribute to loss (instruction tokens masked)
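
A sketch of what response-only masking means in practice (a simplified illustration, not Unsloth's implementation): tokens belonging to user or system turns get the label -100, which cross-entropy loss ignores, so only model-response tokens contribute to the gradient.

```python
# Simplified illustration of response-only masking (not Unsloth's code).
# PyTorch's cross-entropy ignores targets equal to -100, so masked tokens
# contribute nothing to the loss.
IGNORE_INDEX = -100

def mask_labels(token_ids, roles):
    """roles[i] is 'user' or 'model' for token_ids[i]."""
    return [tok if role == "model" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

tokens = [5, 6, 7, 8, 9]
roles = ["user", "user", "model", "model", "model"]
print(mask_labels(tokens, roles))  # [-100, -100, 7, 8, 9]
```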

Hardware

  • GPU: NVIDIA L4 (24GB VRAM)
  • RAM: 32GB
  • Training Time: ~12 hours (with checkpoint resume)
  • GPU Memory Used: ~14.8GB during training

Experiment Journey

We ran 8 systematic experiments to find the optimal configuration:

Exp | LoRA r | Epochs | Samples | LR | Train Loss | Key Finding
--- | --- | --- | --- | --- | --- | ---
01 | 16 | 0.13 | 3k | 2e-4 | 2.916 | Baseline
02 | 32 | 0.24 | 5k | 2e-4 | 1.725 | Higher rank helps (41% below baseline)
03 | 64+RSLoRA | 0.20 | 10k | 2e-4 | 1.460 | RSLoRA + more data (50% below baseline)
04 | 64+RSLoRA | 0.40 | 20k | 1e-4 | ~1.05 | Lower LR improves convergence
05 | 128+RSLoRA | 0.40 | 20k | 5e-5 | 1.134 | r=128 converges slower than r=64
06 | 64+RSLoRA | 3 | 10k | 1e-4 | ~0.30 | Multi-epoch is transformative
07 | 128+RSLoRA | 3 | 10k | 1e-4 | ~0.59 | r=64 beats r=128 for multi-epoch
08 | 64+RSLoRA | 5 | 10k | 7e-5 | 0.0115 | 5 epochs: 99.6% reduction
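
The percentage gains quoted for experiments 02 and 03 are relative to the experiment 01 baseline:

```python
# Verify the relative improvements quoted in the experiment table,
# measured against the experiment 01 baseline loss of 2.916.
baseline = 2.916
exp02, exp03 = 1.725, 1.460

print(f"Exp 02: {100 * (1 - exp02 / baseline):.0f}% below baseline")  # ~41%
print(f"Exp 03: {100 * (1 - exp03 / baseline):.0f}% below baseline")  # ~50%
```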

The Multi-Epoch Discovery

The single most impactful finding: each additional epoch delivers a dramatic, consistent loss reduction:

Epoch 1: loss ~0.90  (learning the patterns)
Epoch 2: loss ~0.60  (reinforcing knowledge)
Epoch 3: loss ~0.30  (deep memorization)
Epoch 4: loss ~0.10  (fine polishing)
Epoch 5: loss ~0.01  (near-perfect fitting)

This pattern was consistent across experiments 06, 07, and 08. The loss drops happen at each epoch boundary as the model sees the training data again.

Other Key Insights

  1. r=64 with RSLoRA is the sweet spot: r=128 converges slower and provides no benefit in multi-epoch settings
  2. A lower LR (7e-5) stabilizes long training; a higher LR (2e-4) causes instability after epoch 2
  3. train_on_responses_only is essential: it masks user/system tokens so the model learns only from responses
  4. Checkpoint saving every 250 steps matters: long CUDA runs can crash from memory fragmentation, and resuming from checkpoints solved this
  5. 10k high-quality samples beat 20k samples for multi-epoch training: quality over quantity when doing multiple passes
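
For insight 4, resuming works because HF-style trainers write numbered checkpoint-N directories. A hypothetical helper (not part of the repo) that picks the newest one:

```python
# Hypothetical helper (not part of the repo): find the newest HF-style
# checkpoint directory so a crashed run can resume from it.
def latest_checkpoint(dirnames):
    steps = [int(d.split("-")[1]) for d in dirnames
             if d.startswith("checkpoint-")]
    return f"checkpoint-{max(steps)}" if steps else None

print(latest_checkpoint(["checkpoint-250", "checkpoint-500", "logs"]))
# -> checkpoint-500
```

In practice, `trainer.train(resume_from_checkpoint=True)` does this automatically by restarting from the newest checkpoint in the output directory.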

Training Pipeline

Built entirely with Unsloth:

from unsloth import FastModel
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import get_chat_template, train_on_responses_only

# 1. Load 4-bit quantized model
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=2048, load_in_4bit=True,
)

# 2. Apply LoRA adapters (r=64, RSLoRA)
model = FastModel.get_peft_model(model,
    finetune_vision_layers=False, finetune_language_layers=True,
    finetune_attention_modules=True, finetune_mlp_modules=True,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    random_state=3407, use_rslora=True,
)

# 3. Setup Gemma 4 chat template
tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")

# 4. Train with response-only masking
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1, gradient_accumulation_steps=8,
        learning_rate=7e-5, num_train_epochs=5, lr_scheduler_type="cosine",
        warmup_steps=50, weight_decay=0.01, optim="adamw_8bit",
        save_strategy="steps", save_steps=250, save_total_limit=3,
    ),
)
trainer = train_on_responses_only(trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
trainer.train()
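
The pipeline above assumes a prepared `dataset` variable. A minimal sketch of that step, assuming FineTome-100k's ShareGPT-style records (turns with "from"/"value" keys) need remapping to the "user"/"model" role names the Gemma template expects:

```python
# Hypothetical preparation for the `dataset` variable used above. FineTome-100k
# stores ShareGPT-style turns; Gemma's chat template expects "user"/"model".
ROLE_MAP = {"human": "user", "gpt": "model", "system": "system"}

def to_gemma_messages(example):
    return {"messages": [
        {"role": ROLE_MAP[t["from"]], "content": t["value"]}
        for t in example["conversations"]
    ]}

sample = {"conversations": [{"from": "human", "value": "Hi"},
                            {"from": "gpt", "value": "Hello!"}]}
print(to_gemma_messages(sample))
# In the real pipeline, this would be applied via the datasets library:
# dataset = load_dataset("mlabonne/FineTome-100k",
#                        split="train[:10000]").map(to_gemma_messages)
```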

Reproduce Training

git clone https://github.com/bradduy/Any2AnyModels
cd Any2AnyModels
pip install unsloth

python scripts/train.py \
  --model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit \
  --load-4bit --lora-rank 64 --use-rslora \
  --dataset mlabonne/FineTome-100k --max-samples 10000 \
  --num-epochs 5 --learning-rate 7e-5 --grad-accum 8 \
  --weight-decay 0.01 --warmup-steps 50 --scheduler cosine \
  --save-steps 250 --save-total-limit 3

Limitations

  • Fine-tuned on English-only data (FineTome-100k)
  • Optimized for instruction following, not domain-specific tasks
  • 4B parameter model; larger models (26B, 31B) would perform better but require more VRAM
  • Training loss ≠ downstream task performance; the model should be evaluated on specific benchmarks

Acknowledgments

  • Google DeepMind for the Gemma 4 model family
  • Unsloth for making QLoRA fine-tuning 2x faster and more memory-efficient
  • Kaggle for hosting the Gemma 4 Good Hackathon
  • mlabonne for the FineTome-100k dataset

License

Apache 2.0 (same as Gemma 4)
