{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/duoan/TorchCode/blob/master/templates/35_bpe.ipynb)\n", "\n", "# 🔴 Hard: Byte-Pair Encoding (BPE)\n", "\n", "Implement a simple **BPE tokenizer**, the foundation of GPT/LLaMA tokenization.\n", "\n", "### Signature\n", "```python\n", "class SimpleBPE:\n", "    def __init__(self): ...\n", "    def train(self, corpus: list[str], num_merges: int): ...\n", "    def encode(self, text: str) -> list[str]: ...\n", "```\n", "\n", "### Algorithm (training)\n", "1. Split each word into characters plus an end-of-word marker\n", "2. Count all adjacent pairs across the corpus\n", "3. Merge the most frequent pair into a single token\n", "4. Repeat for `num_merges` iterations" ], "outputs": [] }, { "cell_type": "code", "metadata": {}, "source": [ "# Install torch-judge in Colab (no-op in JupyterLab/Docker)\n", "try:\n", "    import google.colab\n", "    get_ipython().run_line_magic('pip', 'install -q torch-judge')\n", "except ImportError:\n", "    pass\n" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "outputs": [], "source": [ "# No imports needed" ], "execution_count": null }, { "cell_type": "code", "metadata": {}, "outputs": [], "source": [ "# ✏️ YOUR IMPLEMENTATION HERE\n", "\n", "class SimpleBPE:\n", "    def __init__(self):\n", "        self.merges = []\n", "\n", "    def train(self, corpus, num_merges):\n", "        pass  # iteratively find & merge most frequent pairs\n", "\n", "    def encode(self, text):\n", "        pass  # apply learned merges to split text" ], "execution_count": null }, { "cell_type": "code", "metadata": {}, "outputs": [], "source": [ "# 🧪 Debug\n", "bpe = SimpleBPE()\n", "bpe.train(['low', 'low', 'low', 'lower', 'newest', 'widest'], num_merges=10)\n", "print('Merges:', bpe.merges[:5])\n", "print('Encode:', bpe.encode('low lower'))" ], "execution_count": null }, { "cell_type":
"code", "metadata": {}, "outputs": [], "source": [ "# ✅ SUBMIT\n", "from torch_judge import check\n", "check('bpe')" ], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }