Gros-Michel-90m-Base is a 90M-parameter bilingual (English and German) LLM pretrained on 4.5 billion tokens of a custom dataset mixture and then trained for a further 2 billion tokens in a continued-pretraining run. It is intended as a flexible base for finetuning on downstream tasks such as translation, sentiment analysis, and extraction.

Gros-Michel-90m-Base uses a tokenizer trained on both English and German data, with a vocabulary size of 20,000.
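
A minimal sketch of plain text completion with the base model. It assumes the checkpoint loads through the standard `transformers` Auto classes, which this card does not confirm; the helper name `complete` and the example prompt are illustrative only.

```python
MODEL_ID = "finnianx/Gros-Michel-90m-Base"


def complete(prompt: str, max_new_tokens: int = 40) -> str:
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this helper needs neither the
    # libraries nor a network connection.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository works with the generic Auto classes.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(complete("Die Hauptstadt von Deutschland ist"))
```

Because this is a base model without instruction tuning, prompts should be phrased as text to continue rather than as questions or commands.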

Pretrain Data mixture

| Dataset | Weight |
|---|---|
| HuggingFaceFW/fineweb-edu | 38% |
| epfml/FineWeb-HQ | 18% |
| HuggingFaceTB/cosmopedia (stories split) | 18% |
| HuggingFaceTB/finemath (finemath-4plus) | 6% |
| finnianx/de_corpus | 20% |
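
To make the weights concrete, the sketch below converts them into approximate token counts for the 4.5 billion token pretraining run. It assumes the weights are shares of training tokens (rather than of documents or bytes), which the card does not state; `token_budget` is a hypothetical helper name.

```python
PRETRAIN_TOKENS = 4_500_000_000

# Weights in percent, copied from the pretraining mixture above.
PRETRAIN_MIX = {
    "HuggingFaceFW/fineweb-edu": 38,
    "epfml/FineWeb-HQ": 18,
    "HuggingFaceTB/cosmopedia (stories split)": 18,
    "HuggingFaceTB/finemath (finemath-4plus)": 6,
    "finnianx/de_corpus": 20,
}


def token_budget(total_tokens: int, mix_percent: dict) -> dict:
    """Split a token total across datasets according to percent weights.

    Assumption: weights are shares of tokens, not of documents.
    """
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture weights must sum to 100%")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(PRETRAIN_TOKENS, PRETRAIN_MIX)
for name, tokens in budget.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
```

Under that assumption, the German corpus accounts for roughly 0.9 billion of the 4.5 billion pretraining tokens.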

Continued pretrain Data mixture

| Dataset | Weight |
|---|---|
| wikimedia/wikipedia (20231101.en) | 40% |
| wikimedia/wikipedia (20231101.de) | 40% |
| HuggingFaceTB/finemath (finemath-4plus) | 20% |
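
One common way to realize such a mixture is to pick the source of each training document at random with probability equal to its weight. The sketch below illustrates that idea with the standard library only; it is not the author's data pipeline, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Weights copied from the continued-pretraining mixture above.
CONTINUED_MIX = {
    "wikimedia/wikipedia (20231101.en)": 0.40,
    "wikimedia/wikipedia (20231101.de)": 0.40,
    "HuggingFaceTB/finemath (finemath-4plus)": 0.20,
}


def sample_sources(mix: dict, n: int, seed: int = 0) -> list:
    """Draw n source names, each chosen with probability equal to its weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(CONTINUED_MIX, 10_000)
counts = Counter(draws)
for name, count in counts.items():
    print(f"{name}: {count / len(draws):.1%}")
```

With enough draws the empirical shares approach 40/40/20, so English and German Wikipedia are seen equally often.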

Comparison to other models

| Maker | Model | Hellaswag | ARC (easy) | PIQA | BLiMP | Average |
|---|---|---|---|---|---|---|
| finnianx | Gros-Michel-90M | 30.26% | 41.50% | 59.41% | 78.35% | 52.38% |
| finnianx | Michel-Nano-v2 | 27.40% | 35.90% | 56.75% | 72.52% | 48.14% |
| Axiomic Labs | GPT-S-5M | 27.39% | 33.16% | 57.13% | 72.21% | 47.47% |
| EleutherAI | pythia-31m | 27.14% | 33.88% | 56.26% | 67.78% | 46.27% |
| MaliosDark | Isabel-50M | 27.1% | 43.81% | 57.12% | 73.75% | 50.44% |
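
The Average column is consistent with an unweighted mean of the four benchmark scores. The short check below recomputes it from the table values; the tolerance allows for rounding in the reported figures.

```python
# (Hellaswag, ARC easy, PIQA, BLiMP) in percent, copied from the table above.
SCORES = {
    "Gros-Michel-90M": (30.26, 41.50, 59.41, 78.35),
    "Michel-Nano-v2": (27.40, 35.90, 56.75, 72.52),
    "GPT-S-5M": (27.39, 33.16, 57.13, 72.21),
    "pythia-31m": (27.14, 33.88, 56.26, 67.78),
    "Isabel-50M": (27.1, 43.81, 57.12, 73.75),
}

# Average column as reported in the table.
REPORTED_AVERAGE = {
    "Gros-Michel-90M": 52.38,
    "Michel-Nano-v2": 48.14,
    "GPT-S-5M": 47.47,
    "pythia-31m": 46.27,
    "Isabel-50M": 50.44,
}


def unweighted_mean(values) -> float:
    """Arithmetic mean of the benchmark scores."""
    return sum(values) / len(values)


for model, scores in SCORES.items():
    recomputed = unweighted_mean(scores)
    print(f"{model}: recomputed {recomputed:.3f}, reported {REPORTED_AVERAGE[model]:.2f}")
```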

German Benchmarks

| Model | arc_de acc | arc_de acc_norm | hellaswag_de acc | hellaswag_de acc_norm | m_mmlu_de acc | truthfulqa_de_mc1 acc | truthfulqa_de_mc2 acc |
|---|---|---|---|---|---|---|---|
| Gros-Michel-90M-Base | 0.1865 | 0.2284 | 0.2697 | 0.2852 | 0.2346 | 0.2348 | 0.4285 |
| nanochat German v1 | 0.2241 | 0.2626 | 0.3203 | 0.3581 | 0.2285 | 0.2500 | 0.4184 |
| LLäMmlein-120M | 0.1942 | 0.2301 | 0.2945 | 0.3178 | 0.2285 | 0.2310 | 0.4055 |
| LLäMmlein-1B | 0.2515 | 0.2960 | 0.3703 | 0.4490 | 0.2317 | 0.2322 | 0.3617 |

Notice

This model has not undergone any alignment, and therefore may produce harmful content.

Evaluation was done with EleutherAI's lm-eval-harness. All benchmark scores are zero-shot and use normalized accuracy where applicable.
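
A sketch of how a comparable zero-shot run might look through the harness's Python API. The German task names are taken from the table headers on this card; the English task names and the `"acc_norm,none"` / `"acc,none"` result keys are assumptions based on recent lm-eval-harness releases and may differ from what was actually run. `pick_accuracy` and `run_zero_shot` are hypothetical helpers.

```python
ENGLISH_TASKS = ["hellaswag", "arc_easy", "piqa", "blimp"]  # assumed task names
GERMAN_TASKS = [
    "arc_de",
    "hellaswag_de",
    "m_mmlu_de",
    "truthfulqa_de_mc1",
    "truthfulqa_de_mc2",
]


def pick_accuracy(task_metrics: dict) -> float:
    """Prefer normalized accuracy when the task reports it, else plain accuracy."""
    # Assumption: metric keys follow the "<metric>,<filter>" naming scheme.
    for key in ("acc_norm,none", "acc,none"):
        if key in task_metrics:
            return task_metrics[key]
    raise KeyError("no accuracy metric found for this task")


def run_zero_shot(tasks, model_id="finnianx/Gros-Michel-90m-Base") -> dict:
    """Evaluate `model_id` zero-shot on `tasks` and return one score per task."""
    import lm_eval  # imported lazily; requires the lm-eval package

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=list(tasks),
        num_fewshot=0,  # zero-shot, as stated above
    )
    scores = {}
    for task, metrics in results["results"].items():
        try:
            scores[task] = pick_accuracy(metrics)
        except KeyError:
            continue  # skip entries that report no accuracy metric
    return scores


# Example call (downloads the model and the benchmark data):
# print(run_zero_shot(ENGLISH_TASKS))
```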

Future plans

Sometime in the near(ish) future I will release an instruction-tuned variant of this model, along with a translation-focused finetune. GGUF support is also planned.

Safetensors · Model size: 91.1M params · Tensor type: F16