---
source_url: https://arxiv.org/abs/2605.31268
source: arxiv
arxiv_id: "2605.31268"
title: "Mellum2 Technical Report"
authors: ["Nikita Pavlichenko et al. (JetBrains)"]
published: 2026-05-29
ingested: 2026-06-03
sha256: pending
---

# Mellum2 Technical Report

**Source**: https://arxiv.org/abs/2605.31268 (arXiv:2605.31268 [cs.CL])
**Authors**: Nikita Pavlichenko et al. (JetBrains)
**Published**: 2026-05-29

## Abstract

> We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model.

## Architecture Details

- **MoE configuration**: 64 experts, 8 active per token
- **Grouped-Query Attention**: 4 KV heads
- **Sliding Window Attention**: applied to three of every four layers
- **Multi-Token Prediction head**: serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding
- **Design constraint**: inference efficiency on commodity GPUs; each architectural choice was validated via ablation
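
To make the sparse configuration above concrete, here is a minimal sketch of top-8-of-64 expert routing and of a "three of every four layers" sliding-window pattern. Only the numbers 64, 8, and the 3-in-4 ratio come from the report. The function names, the softmax over the selected logits, and the choice of which layer in each group of four keeps full attention are illustrative assumptions, not the report's implementation.

```python
import math
import random

NUM_EXPERTS = 64  # from the report
TOP_K = 8         # from the report


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    The summary does not say how Mellum 2 normalizes router scores.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def layer_uses_sliding_window(layer_idx, period=4):
    """Assumed layout: every fourth layer uses full attention, the rest a sliding window."""
    return (layer_idx + 1) % period != 0


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
assignment = route_token(logits)
```

Each token therefore runs through 8 expert feed-forward blocks rather than 64, which is where the gap between total and active parameters comes from.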

## Pre-training

- **Token count**: ~10.6 trillion tokens
- **Curriculum**: three-phase progressive shift from diverse web data to curated code and math
- **Optimizer**: Muon
- **Precision**: FP8 hybrid
- **Schedule**: Warmup-Hold-Decay with linear decay to zero
- **Context extension**: 128K via layer-selective YaRN
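
The Warmup-Hold-Decay schedule listed above can be sketched as a piecewise function of the training step. The linear decay to zero is from the report. The linear warmup shape, the function name `whd_lr`, and all step counts and learning-rate values used with it are assumptions chosen for illustration.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Learning rate at `step` under a Warmup-Hold-Decay schedule.

    Assumptions: linear warmup from 0 to peak_lr, constant hold,
    then linear decay that reaches exactly 0 at total_steps.
    """
    if warmup_steps <= 0 or decay_steps <= 0:
        raise ValueError("warmup_steps and decay_steps must be positive")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")

    hold_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps           # warmup phase
    if step < hold_end:
        return peak_lr                                 # hold phase
    return peak_lr * (total_steps - step) / decay_steps  # linear decay to zero


# Hypothetical values, not taken from the report.
schedule = [whd_lr(s, total_steps=1000, peak_lr=1e-3,
                   warmup_steps=100, decay_steps=200)
            for s in (0, 50, 500, 900, 1000)]
```

Holding the peak rate for most of training and decaying only at the end means the decay phase can be scheduled late, which is a common reason for choosing this shape over cosine decay.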

## Post-training

Two stages:
1. **Supervised fine-tuning (SFT)**
2. **RLVR** (Reinforcement Learning with Verifiable Rewards)
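
As an illustration of what "verifiable" means in the RLVR stage, the sketch below scores a generated solution by running it against known test cases and returning a binary reward. This is a generic example of the idea, not the report's reward function: `verifiable_reward`, the binary 0/1 scoring, and the test-case format are all assumptions, and a real pipeline would execute untrusted code in a sandbox rather than with `exec`.

```python
def verifiable_reward(candidate_source, func_name, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0.

    candidate_source: model-generated Python source defining `func_name`.
    test_cases: list of (args_tuple, expected_result) pairs.
    Assumption: binary reward; any exception counts as failure.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # unsafe outside a sandbox
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        return 0.0
    return 1.0


tests = [((1, 2), 3), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

rewards = [verifiable_reward(src, "add", tests) for src in (good, bad, broken)]
```

Because the reward is computed by a program rather than a learned judge, it cannot be satisfied by text that merely looks correct.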

Two released variants:
- **Instruct**: direct answer
- **Thinking**: emits explicit reasoning trace before final answer

## Benchmark Performance

Competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. Benchmarks cover code generation, math and reasoning, tool use, knowledge, and safety.
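
For a rough sense of the sparsity behind that comparison, the figures stated in the abstract and the architecture section give the following ratios. This is back-of-the-envelope arithmetic on the reported numbers, not a result from the report.

$$
\frac{N_{\text{active}}}{N_{\text{total}}} = \frac{2.5\,\text{B}}{12\,\text{B}} \approx 0.21,
\qquad
\frac{k}{E} = \frac{8}{64} = 0.125
$$

The active-parameter fraction is higher than the expert fraction, presumably because non-expert parameters such as attention and embeddings are used for every token.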

## License

Apache 2.0. The base, Instruct, and Thinking checkpoints are all released.

## Submission History

v1: 2026-05-29 13:01:11 UTC (1,508 KB)