---
name: huggingface-tokenizers
description: Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Tokenization, HuggingFace, BPE, WordPiece, Unigram, Fast Tokenization, Rust, Custom Tokenizer, Alignment Tracking, Production]
dependencies: [tokenizers, transformers, datasets]
---

# HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

## When to use HuggingFace Tokenizers

**Use HuggingFace Tokenizers when:**
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently

**Performance**:
- **Speed**: <20 seconds to tokenize 1GB on CPU
- **Implementation**: Rust core with Python/Node.js bindings
- **Efficiency**: 10-100× faster than pure Python implementations

**Use alternatives instead**:
- **SentencePiece**: Language-independent, used by T5/ALBERT
- **tiktoken**: OpenAI's BPE tokenizer for GPT models
- **transformers AutoTokenizer**: Loading pretrained tokenizers only (uses this library internally)

## Quick start

### Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

### Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

### Train custom BPE tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

**Training time**: ~1-2 minutes for a 100MB corpus, ~10-20 minutes for 1GB

### Batch encoding with padding

```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```

## Tokenization algorithms

### BPE (Byte-Pair Encoding)

**How it works**:
1. Start with a character-level vocabulary
2. Find the most frequent adjacent pair of tokens
3. Merge it into a new token and add it to the vocabulary
4. Repeat until the target vocabulary size is reached (a toy sketch of this loop follows the list)
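To make these steps concrete, here is a minimal, pure-Python teaching sketch of the merge loop on a made-up word-frequency corpus (the corpus and target size are invented for illustration; the library's Rust trainer is far more efficient and also handles pre-tokenization, byte-level details, and special tokens):

```python
from collections import Counter

# Step 1: toy corpus as word (split into characters) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}
vocab = {ch for word in corpus for ch in word}   # initial character vocabulary
target_vocab_size = 15                           # arbitrary target for the demo

while len(vocab) < target_vocab_size:
    # Step 2: count adjacent token pairs across the corpus
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        break
    best = max(pairs, key=pairs.get)

    # Step 3: merge the most frequent pair everywhere, add it to the vocabulary
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    corpus = merged
    vocab.add(best[0] + best[1])  # Step 4: repeat until the target size is hit

print(sorted(vocab))  # for this corpus the vocabulary gains subwords like 'est' and 'low'
```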
**Used by**: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

**Advantages**:
- Handles OOV words well (breaks them into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages

**Trade-offs**:
- Tokenization depends on merge order
- May split common words unexpectedly

### WordPiece

**How it works**:
1. Start with a character vocabulary
2. Score candidate merge pairs: `frequency(pair) / (frequency(first) × frequency(second))` (a worked example follows this subsection)
3. Merge the highest-scoring pair
4. Repeat until the target vocabulary size is reached

**Used by**: BERT, DistilBERT, MobileBERT

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
```

**Advantages**:
- Prioritizes informative merges (a high score means the pair occurs together far more often than its parts occur on their own)
- Used successfully in BERT and its derivatives

**Trade-offs**:
- Words with no matching subwords in the vocabulary become `[UNK]`
- Only the final vocabulary is saved, not the merge rules
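To see what the scoring rule in step 2 favors, here is a tiny worked example with made-up frequency counts (illustration only; the real trainer computes these counts over the whole corpus):

```python
# Hypothetical counts from an imaginary corpus
token_freq = {"un": 150, "##able": 80, "th": 900, "##e": 1200}
pair_freq = {("un", "##able"): 60, ("th", "##e"): 700}

def wordpiece_score(first, second):
    # score = frequency(pair) / (frequency(first) * frequency(second))
    return pair_freq[(first, second)] / (token_freq[first] * token_freq[second])

print(f"{wordpiece_score('un', '##able'):.5f}")  # 0.00500
print(f"{wordpiece_score('th', '##e'):.5f}")     # 0.00065
# ("un", "##able") is merged first: although ("th", "##e") co-occurs far more
# often, "th" and "##e" are also extremely common on their own, so that merge
# is less informative and scores lower.
```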
### Unigram

**How it works**:
1. Start with a large vocabulary (e.g., all frequent substrings)
2. Compute the corpus loss under the current vocabulary
3. Remove the tokens whose removal increases the loss the least
4. Repeat until the target vocabulary size is reached

**Used by**: ALBERT, T5, mBART, XLNet (via SentencePiece)

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

**Advantages**:
- Probabilistic (finds the most likely tokenization)
- Works well for languages without word boundaries
- Can produce alternative segmentations with probabilities

**Trade-offs**:
- Computationally expensive to train
- More hyperparameters to tune

## Tokenization pipeline

Complete pipeline: **Normalization → Pre-tokenization → Model → Post-processing**

### Normalization

Clean and standardize text:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),          # Unicode normalization (decompose)
    Lowercase(),    # Convert to lowercase
    StripAccents()  # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
```

**Common normalizers**:
- `NFD`, `NFC`, `NFKD`, `NFKC` - Unicode normalization forms
- `Lowercase()` - Convert to lowercase
- `StripAccents()` - Remove accents (é → e)
- `Strip()` - Remove whitespace
- `Replace(pattern, content)` - Regex replacement

### Pre-tokenization

Split text into word-like units:

```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
```

**Common pre-tokenizers**:
- `Whitespace()` - Split on spaces, tabs, newlines
- `ByteLevel()` - GPT-2 style byte-level splitting
- `Punctuation()` - Isolate punctuation
- `Digits(individual_digits=True)` - Split digits individually
- `Metaspace()` - Replace spaces with ▁ (SentencePiece style)

### Post-processing

Add special tokens for model input:

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

**Common patterns**:

```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```
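Putting the four stages together, the sketch below assembles a small BERT-style pipeline around a WordPiece model; the vocabulary size and the `corpus.txt` path are placeholders, and the exact subwords in the output depend on whatever you train on:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# 1. Normalization
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

# 2. Pre-tokenization
tokenizer.pre_tokenizer = Whitespace()

# 3. Model training (corpus.txt is a placeholder path)
trainer = WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# 4. Post-processing: wrap every sequence with [CLS] ... [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

print(tokenizer.encode("Héllo, WORLD!").tokens)
# e.g. ['[CLS]', 'hello', ',', 'world', '!', '[SEP]'] (exact subwords depend on the corpus)
```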
## Alignment tracking

Track token positions in the original text:

```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```

**Use cases**:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)

## Integration with transformers

### Load with AutoTokenizer

```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers when available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using a fast tokenizer
print(tokenizer.is_fast)  # True

# Access the underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

### Convert custom tokenizer to transformers

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

## Common patterns

### Train from iterator (large datasets)

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

**Performance**: Processes 1GB in ~10-20 minutes

### Enable truncation and padding

```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

### Multi-processing

```python
from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# corpus: a list of strings loaded elsewhere
# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i+chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

**Speedup**: 5-8× with 8 cores

## Performance benchmarks

### Training speed

| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|-------------|-----------------|-----------------|--------------|
| 10 MB       | 15 sec          | 18 sec          | 25 sec       |
| 100 MB      | 1.5 min         | 2 min           | 4 min        |
| 1 GB        | 15 min          | 20 min          | 40 min       |

**Hardware**: 16-core CPU, tested on English Wikipedia

### Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|----------------|-------------|------------|
| Pure Python    | ~20 minutes | ~50 MB/min |
| HF Tokenizers  | ~15 seconds | ~4 GB/min  |
| **Speedup**    | **80×**     | **80×**    |

**Test**: English text, average sentence length 20 words

### Memory usage

| Task                  | Memory  |
|-----------------------|---------|
| Load tokenizer        | ~10 MB  |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences   | ~500 MB |
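These numbers vary with hardware and text; a quick way to estimate throughput on your own machine is a rough timing sketch like the one below (the model name, sample sentence, and repeat count are arbitrary choices for the measurement):

```python
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Any list of strings works; here we simply repeat one sample sentence.
texts = ["The quick brown fox jumps over the lazy dog."] * 100_000
total_bytes = sum(len(t.encode("utf-8")) for t in texts)

start = time.perf_counter()
encodings = tokenizer.encode_batch(texts)  # batched, parallelized in the Rust core
elapsed = time.perf_counter() - start

print(f"{total_bytes / 1e6:.1f} MB in {elapsed:.2f}s "
      f"({total_bytes / 1e6 / elapsed:.0f} MB/s, "
      f"{sum(len(e.ids) for e in encodings) / elapsed:,.0f} tokens/s)")
```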
## Supported models

Pre-trained tokenizers available via `from_pretrained()`:

**BERT family**:
- `bert-base-uncased`, `bert-large-cased`
- `distilbert-base-uncased`
- `roberta-base`, `roberta-large`

**GPT family**:
- `gpt2`, `gpt2-medium`, `gpt2-large`
- `distilgpt2`

**T5 family**:
- `t5-small`, `t5-base`, `t5-large`
- `google/flan-t5-xxl`

**Other**:
- `facebook/bart-base`, `facebook/mbart-large-cc25`
- `albert-base-v2`, `albert-xlarge-v2`
- `xlm-roberta-base`, `xlm-roberta-large`

Browse all: https://huggingface.co/models?library=tokenizers

## References

- **[Training Guide](references/training.md)** - Train custom tokenizers, configure trainers, handle large datasets
- **[Algorithms Deep Dive](references/algorithms.md)** - BPE, WordPiece, Unigram explained in detail
- **[Pipeline Components](references/pipeline.md)** - Normalizers, pre-tokenizers, post-processors, decoders
- **[Transformers Integration](references/integration.md)** - AutoTokenizer, PreTrainedTokenizerFast, special tokens

## Resources

- **Docs**: https://huggingface.co/docs/tokenizers
- **GitHub**: https://github.com/huggingface/tokenizers ⭐ 9,000+
- **Version**: 0.20.0+
- **Course**: https://huggingface.co/learn/nlp-course/chapter6/1
- **Papers**: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)