{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/duoan/TorchCode/blob/master/templates/35_bpe.ipynb)\n",
    "\n",
    "# 🔴 Hard: Byte-Pair Encoding (BPE)\n",
    "\n",
    "Implement a simple **BPE tokenizer**, the subword algorithm behind GPT- and LLaMA-style tokenizers.\n",
    "\n",
    "### Signature\n",
    "```python\n",
    "class SimpleBPE:\n",
    "    def __init__(self): ...\n",
    "    def train(self, corpus: list[str], num_merges: int): ...\n",
    "    def encode(self, text: str) -> list[str]: ...\n",
    "```\n",
    "\n",
    "### Algorithm (training)\n",
    "1. Split each word into characters and append the `</w>` end-of-word marker\n",
    "2. Count every pair of adjacent symbols across the corpus\n",
    "3. Merge the most frequent pair into a single token\n",
    "4. Repeat steps 2-3 for `num_merges` iterations"
   ]
  },
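  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A single training iteration (steps 1-3 above) can be sketched on a toy corpus as follows. This is a minimal illustration rather than the reference solution: `count_pairs` and `merge_pair` are hypothetical helper names, and breaking ties by first occurrence is an assumption that the grader may not share.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# Hypothetical helpers for illustration; not part of the exercise API.\n",
    "def count_pairs(words):\n",
    "    # Count adjacent symbol pairs inside each word (never across words).\n",
    "    pairs = Counter()\n",
    "    for symbols in words:\n",
    "        for left, right in zip(symbols, symbols[1:]):\n",
    "            pairs[(left, right)] += 1\n",
    "    return pairs\n",
    "\n",
    "def merge_pair(symbols, pair):\n",
    "    # Replace each left-to-right occurrence of `pair` with one merged symbol.\n",
    "    merged, i = [], 0\n",
    "    while i < len(symbols):\n",
    "        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:\n",
    "            merged.append(symbols[i] + symbols[i + 1])\n",
    "            i += 2\n",
    "        else:\n",
    "            merged.append(symbols[i])\n",
    "            i += 1\n",
    "    return merged\n",
    "\n",
    "corpus = ['low', 'low', 'lower']\n",
    "# Step 1: characters plus the end-of-word marker.\n",
    "words = [list(word) + ['</w>'] for word in corpus]\n",
    "# Step 2: count adjacent pairs.\n",
    "pairs = count_pairs(words)\n",
    "# Step 3: pick the most frequent pair and merge it in every word.\n",
    "# Assumption: ties go to the pair that was seen first.\n",
    "best = max(pairs, key=pairs.get)\n",
    "words = [merge_pair(symbols, best) for symbols in words]\n",
    "print(best, words)\n",
    "```\n",
    "\n",
    "On this toy corpus `('l', 'o')` and `('o', 'w')` both occur three times, so the tie-breaking rule decides which is merged first. Here it is `('l', 'o')`, which turns `lower` into `['lo', 'w', 'e', 'r', '</w>']`."
   ]
  },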
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Install torch-judge in Colab (no-op in JupyterLab/Docker)\n",
    "try:\n",
    "    import google.colab\n",
    "    get_ipython().run_line_magic('pip', 'install -q torch-judge')\n",
    "except ImportError:\n",
    "    pass\n"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# No imports needed"
   ],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✏️ YOUR IMPLEMENTATION HERE\n",
    "\n",
    "class SimpleBPE:\n",
    "    def __init__(self):\n",
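    "        self.merges = []  # learned merge rules, in the order they were learned\n",
    "\n",
    "    def train(self, corpus: list[str], num_merges: int):\n",
    "        pass  # TODO: repeatedly count adjacent pairs and merge the most frequent one\n",
    "\n",
    "    def encode(self, text: str) -> list[str]:\n",
    "        pass  # TODO: apply the learned merges to split text into tokens"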
   ],
   "execution_count": null
  },
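  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One plausible reading of the `encode` stub above is to split the text on whitespace and replay the merges on each word in the order they were learned. The sketch below assumes that reading. `apply_merges` is a hypothetical helper, the merge list is hand-written for illustration rather than produced by training, and the exact output format expected by the grader is not specified here.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper; `merges` is assumed to be a list of (left, right)\n",
    "# tuples stored in the order they were learned.\n",
    "def apply_merges(word, merges):\n",
    "    symbols = list(word) + ['</w>']\n",
    "    for left, right in merges:\n",
    "        merged, i = [], 0\n",
    "        while i < len(symbols):\n",
    "            # Merge the pair wherever it appears, scanning left to right.\n",
    "            if symbols[i:i + 2] == [left, right]:\n",
    "                merged.append(left + right)\n",
    "                i += 2\n",
    "            else:\n",
    "                merged.append(symbols[i])\n",
    "                i += 1\n",
    "        symbols = merged\n",
    "    return symbols\n",
    "\n",
    "# Hand-written merges for illustration only.\n",
    "merges = [('l', 'o'), ('lo', 'w'), ('low', '</w>')]\n",
    "tokens = [tok for word in 'low lower'.split() for tok in apply_merges(word, merges)]\n",
    "print(tokens)\n",
    "```\n",
    "\n",
    "With these three merges, `low` collapses to the single token `low</w>`, while `lower` stops at `['low', 'e', 'r', '</w>']` because no learned merge covers `e`, `r`, or the trailing marker."
   ]
  },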
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 🧪 Debug\n",
    "bpe = SimpleBPE()\n",
    "bpe.train(['low', 'low', 'low', 'lower', 'newest', 'widest'], num_merges=10)\n",
    "print('Merges:', bpe.merges[:5])\n",
    "print('Encode:', bpe.encode('low lower'))"
   ],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ SUBMIT\n",
    "from torch_judge import check\n",
    "check('bpe')"
   ],
   "execution_count": null
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}