How We Built OpenMythos: A Cybersecurity LLM Trained from Scratch

Published June 15, 2026

A behind-the-scenes look at the data, the compute, the fine-tuning pipeline, and the lessons learned.

We built OpenMythos for one reason: general-purpose LLMs are not good enough at cybersecurity. They hallucinate CVE details, miss vulnerability patterns in code, and give advice that sounds confident but is wrong in ways that matter. We wanted a model that had actually read the literature, understood real vulnerable codebases, and could reason about exploits and mitigations with precision.

This is the story of how we trained it.

The Data Problem

Everything starts with data. Cybersecurity is a domain where quality matters far more than quantity a model trained on vague, imprecise, or incorrect security content will produce vague, imprecise, or incorrect outputs. We needed data that was technically dense, accurate, and grounded in real vulnerabilities.

ArXiv cs.CR

Our first source was the ArXiv cs.CR (Cryptography and Security) category. We scraped and assembled a raw dataset of 10,000 papers everything from formal verification of cryptographic protocols to empirical studies of malware behavior. Raw scrapes are noisy though. Papers include LaTeX artifacts, author affiliations, acknowledgment sections, and references that add noise without adding signal.

We ran a multi-pass filtering pipeline that got to 90% completion before we finalized it, cleaning formatting artifacts, deduplicating near-identical abstracts, and removing papers that were only tangentially related to practical security. The cleaned dataset lives at himanshu17HF/ArvixImport-Filtered-Final about 1,840 high-quality records focused specifically on coding language vulnerabilities.

CVE Vulnerability Dataset

ArXiv gave us the theory. For the practice, we needed real vulnerability data. We assembled a structured dataset of CVEs with detailed descriptions, affected code patterns, and remediation context. This dataset is published at build-small-hackathon/CVE_Vulnerabilities_Detailed on Hugging Face.

The combination of academic research and real-world CVE data gave the model two complementary things: a deep understanding of why vulnerabilities exist, and a practical map of what they look like in production code.

Base Model Selection

We evaluated several strong base models for fine-tuning:

mistralai/Devstral-Small-2-24B-Instruct-2512
google/gemma-4-31B-it
'Qwen/Qwen3.6-27B'

The selection criteria came down to: instruction-following quality out of the box, code comprehension capability (since so much cybersecurity reasoning is about code), and whether the model's architecture would work well with our fine-tuning setup. We wanted a model that already had strong coding ability so that SFT could shift its focus toward security without having to teach it what code is.

So, we finally chose Qwen/Qwen3.6-27B as base.

Compute: Modal + H100s

Training a model of this scale requires serious hardware. We used Modal for our compute infrastructure, running on H100 GPUs.

Modal made sense for us for a few practical reasons. We didn't want to manage long-running GPU instances ourselves Modal's on-demand serverless GPU approach let us spin up exactly what we needed for each training run and tear it down when we were done. No idle GPU time burning money. The H100s gave us the memory bandwidth and FLOPS needed to run fine-tuning at a reasonable speed.

Modal Notebook feature is best for finetuning LLMs.

Training: Two Stages

Stage 1: Supervised Fine-Tuning (SFT)

The first stage was standard SFT using our combined cybersecurity dataset. We formatted the data as instruction-response pairs covering a range of security tasks: vulnerability identification, CVE explanation, code review for security issues, attack vector analysis, and remediation suggestions.

SFT teaches the model the shape of good cybersecurity reasoning what a thorough vulnerability analysis looks like, how to explain a CVE clearly, what a secure vs. insecure code pattern looks like side by side. It's the foundation.

Stage 2: Reinforcement Learning with Verifiable Reward (RLVR)

SFT alone has a ceiling. The model learns to imitate good responses, but it doesn't learn to verify its own outputs. For cybersecurity, that distinction matters enormously a model that confidently produces plausible-sounding but incorrect vulnerability analysis is worse than useless.

This is where RLVR came in. We constructed a dataset of GitHub repositories with known vulnerabilities: each entry paired a branch containing vulnerable code with the corresponding fixed version. The model's job was to identify the vulnerability and suggest the fix. The reward signal came from a separate evaluation model that checked the generated response against the ground truth did the model identify the right vulnerability? Was the suggested fix actually correct?

RLVR pushes the model toward responses that are not just fluent and well-structured, but verifiably accurate. Over training, the model learns to be more precise, more careful, and more honest about what it knows vs. what it's uncertain about. The improvement in output quality after RLVR versus SFT alone was noticeable responses became more targeted, less prone to conflating similar vulnerability classes, and better at flagging genuine uncertainty.

The Space

The model and demo are live on Hugging Face Spaces at build-small-hackathon/OpenMythos. The space runs the model behind an OpenAI-compatible API endpoint instead of ZeroGPU because of limits of it.

What We Learned

A few things stood out from this process:

Data quality over data quantity. 1,840 well-filtered security papers beat 10,000 noisy ones. The filtering pipeline took significant effort but paid off in training stability and output quality.

RLVR is worth the complexity. Setting up verifiable rewards for cybersecurity is harder than for domains like math, where correctness is easy to check programmatically. Having to build an evaluation model that can assess vulnerability analysis quality was non-trivial. But the gains in output precision made it worthwhile.

Modal + H100s removed the infrastructure headache. Being able to focus on the training pipeline rather than GPU instance management saved us real time, especially with a hackathon deadline.

OpenMythos is open the model weights, datasets, and space are all public on Hugging Face. If you're working on security tooling, red teaming pipelines, or vulnerability analysis workflows, we'd love to hear how it holds up in practice.

Built for the Build Small Hackathon. Model: OpenMythos · Dataset: CVE Vulnerabilities Detailed · ArXiv cs.CR Filtered · Space: OpenMythos

Models mentioned in this article 2

Datasets mentioned in this article 2

Spaces mentioned in this article 1

Muon vs MuonClip vs Muon+AdamW for Fine-Tuning

December 9, 2025

How OpenGPT 4o works

July 17, 2024

Community

abhishekkataria16

3 days ago

Amazing work, guys! I read your article and tested the project, and I absolutely loved it. Seriously, I'm really impressed, great job!

By the way, I couldn't find the trained model in your repository. Is it hosted somewhere else?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote