Gouda-50M-SFT

I've always wanted to train a large language model from scratch. Most tutorials and walkthroughs start with the original transformer paper from 2017, "Attention Is All You Need". However, I wanted to learn the newer advancements in LLMs, such as Grouped Query Attention, SwiGLU, RoPE, and tied embeddings. And that is what I did! I coded the model architecture from scratch in raw PyTorch (no Hugging Face), and came up with a model which I named Gouda. Gouda-50M-SFT is a Llama-style transformer trained from scratch on only 1 billion pretraining tokens (spanning books, Wikipedia, GitHub, and Reddit) and 1 billion SFT tokens.

Gouda-50M-SFT is a decoder-only language model with roughly 50 million parameters (54.2 million by exact count). It adopts modern architectural components including Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU feed-forward layers, RMSNorm, and tied input/output embeddings, and has a 1,024-token context length.

The model is designed as an efficient small-scale research model suitable for experimentation, educational purposes, lightweight inference, and studying scaling behavior in modern language models.


Model Details

  • Architecture: Decoder-only Transformer
  • Parameters: 54.2 million
  • Context Length: 1,024 tokens
  • Vocabulary Size: 32,000
  • Tokenizer: unsloth/llama-2-7b
  • Embedding Dimension: 512
  • Transformer Layers: 12
  • Attention Heads: 16
  • Key/Value Heads: 8 (Grouped Query Attention)
  • Feed-Forward Dimension: 1536 (SwiGLU)
  • Dropout: 0.1
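
As a rough sanity check, the parameter count can be estimated from the hyperparameters above. This is only a back-of-the-envelope sketch: it assumes no bias terms, two RMSNorm weight vectors per layer plus one final RMSNorm, and an LM head that shares the embedding matrix. The variable names are made up for illustration and are not taken from the actual Gouda code.

```python
# Hyperparameters copied from the Model Details list.
VOCAB = 32_000
DIM = 512
LAYERS = 12
HEADS = 16
KV_HEADS = 8
FFN_DIM = 1536

HEAD_DIM = DIM // HEADS  # 32 dimensions per attention head

# Assumptions (not confirmed by the model card): no biases, one final RMSNorm,
# and the LM head reuses the embedding matrix, so it adds no parameters.
embedding = VOCAB * DIM

attention = DIM * (HEADS * HEAD_DIM)           # query projection
attention += 2 * DIM * (KV_HEADS * HEAD_DIM)   # key and value projections (fewer heads)
attention += (HEADS * HEAD_DIM) * DIM          # output projection

feed_forward = 3 * DIM * FFN_DIM               # SwiGLU: gate, up, and down matrices
norms = 2 * DIM                                # two RMSNorm weight vectors per block

per_layer = attention + feed_forward + norms
total = embedding + LAYERS * per_layer + DIM   # + final RMSNorm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the estimate comes out at about 54.1 million, close to the reported 54.2 million. The small gap presumably comes from details this sketch does not model.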

Architecture

Gouda-50M-SFT follows a modern Llama-style decoder architecture:

  • Rotary Position Embeddings (RoPE)
  • Grouped Query Attention (GQA)
  • RMSNorm pre-normalization
  • SwiGLU feed-forward networks
  • Tied token embeddings and language modeling head

Each transformer block consists of:

  1. RMSNorm
  2. Multi-head grouped-query self-attention
  3. Residual connection
  4. RMSNorm
  5. SwiGLU feed-forward network
  6. Residual connection
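
The six steps above can be sketched in PyTorch roughly as follows. This is a minimal illustration and not the actual Gouda implementation: the class and function names are hypothetical, the RoPE base of 10000 and the rotate-half layout are assumptions, biases are assumed absent, and dropout is left out for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Divide by the root-mean-square over the feature dimension, then rescale.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotate-half RoPE; base is an assumed value.
    _, _, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32, device=x.device) / d))
    angles = torch.outer(torch.arange(t, dtype=torch.float32, device=x.device), inv_freq)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)  # (seq, head_dim)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin


class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=512, n_heads=16, n_kv_heads=8):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # Each key/value head is shared by n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim=512, hidden=1536):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit followed by the down projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))


class Block(nn.Module):
    def __init__(self, dim=512, n_heads=16, n_kv_heads=8, hidden=1536):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))  # steps 1-3
        x = x + self.ffn(self.ffn_norm(x))    # steps 4-6
        return x


block = Block()
y = block(torch.randn(2, 16, 512))  # (batch, sequence, embedding) in, same shape out
print(y.shape)
```

Because normalization is applied before each sub-layer (pre-normalization), the residual stream itself is never normalized inside the block.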

The architecture is intentionally lightweight while retaining many of the design choices used in contemporary large language models.


Training

Training Datasets

Pretrain:

  • openbmb/Ultra-FineWeb
  • thomwolf/github-dataset
  • applied-ai-018/pretraining_v1-omega_books
  • wikimedia/wikipedia
  • tensorshield/reddit_dataset_157

SFT:

  • HuggingFaceH4/no_robots
  • mlabonne/ultrachat_200k_sft
  • nvidia/Nemotron-SFT-Instruction-Following-Chat-v2

Optimization

  • Optimizer: AdamW
  • Learning Rate: 1e-4 peak
  • Minimum Learning Rate: 1e-5
  • Learning Rate Schedule: Cosine decay with linear warmup
  • Warmup Steps: 2,000
  • Weight Decay: 0.1
  • Gradient Clipping: 1.0
  • Adam β1: 0.9
  • Adam β2: 0.95
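
The learning rate schedule described above can be written out as a small function. This is a sketch under assumptions: the function name is hypothetical, the cosine decay is assumed to run from the end of warmup to the 50K base-training steps, and the exact warmup indexing used in training is not known.

```python
import math

PEAK_LR = 1e-4
MIN_LR = 1e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 50_000  # assumed decay horizon: the base-model step count


def lr_at(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from near zero up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 1_999, 2_000, 26_000, 50_000):
    print(step, f"{lr_at(step):.2e}")
```

With these settings the rate reaches the 1e-4 peak at the end of warmup, passes through 5.5e-5 halfway through the decay, and ends at the 1e-5 minimum.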

Training Configuration

  • Sequence Length: 1024
  • Micro Batch Size: 12
  • Gradient Accumulation Steps: 8
  • Effective Batch Size: 96 sequences
  • Training Steps: 50,000 for the base model, 27,500 for SFT
  • GPU: RTX 5080
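
For reference, the effective batch size follows directly from the micro batch size and the number of gradient accumulation steps. The short sketch below also works out the tokens processed per optimizer step, assuming every sequence is filled to the full 1024 tokens (an assumption, since the card does not say how sequences are packed or padded).

```python
SEQ_LEN = 1024
MICRO_BATCH = 12
GRAD_ACCUM = 8

# Gradients from 8 micro batches are summed before each optimizer update.
sequences_per_step = MICRO_BATCH * GRAD_ACCUM

# Upper bound on tokens per update, assuming fully packed sequences.
tokens_per_step = sequences_per_step * SEQ_LEN

print(f"{sequences_per_step} sequences, {tokens_per_step:,} tokens per optimizer step")
```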

Limitations

As a 50M parameter language model, Gouda-50M-SFT has limited reasoning ability, factual knowledge, and long-context capabilities compared to larger models. Outputs may contain inaccuracies, hallucinations, or incomplete information.

Performance varies substantially depending on prompt format and task complexity.


Usage

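from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("linkanjarad/Gouda-50M-SFT")
model = AutoModelForCausalLM.from_pretrained("linkanjarad/Gouda-50M-SFT")

# Wrap the user message in the chat template the model was fine-tuned on.
user_input = "What is Artificial Intelligence?"
prompt = f"<|turn>user\n{user_input}\n<turn|>\n<|turn>model\n"

inputs = tokenizer(prompt, return_tensors="pt")

# Sample a response; the output will differ from run to run.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    top_k=64,
    top_p=0.9,
    temperature=0.7,
)

# The decoded text contains the prompt followed by the generated reply.
print(tokenizer.decode(outputs[0]))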