# SMILE NLP — Tokenizers and Text Splitters

Tokenization is the foundational step in almost every NLP pipeline: raw text must be broken into discrete units (tokens, sentences, paragraphs) before any further processing can take place. The `smile.nlp.tokenizer` package provides a clean three-level hierarchy of splitters together with multiple implementations suited for different use cases.

---

## Package Overview

| Level | Interface | Implementations |
|-------|-----------|-----------------|
| Paragraph | `ParagraphSplitter` | `SimpleParagraphSplitter` |
| Sentence | `SentenceSplitter` | `SimpleSentenceSplitter`, `BreakIteratorSentenceSplitter` |
| Word | `Tokenizer` | `SimpleTokenizer`, `PennTreebankTokenizer`, `BreakIteratorTokenizer` |

All three interfaces extend `java.util.function.Function`, so any splitter can be used directly in a Java stream pipeline.

A supporting dictionary class, `EnglishAbbreviations`, is used internally by the English-specific implementations to avoid mis-splitting abbreviation periods.

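
To illustrate how the three levels nest, here is a minimal sketch that uses only the JDK. The three `Function<String, String[]>` constants are hypothetical stand-ins built from naive regular expressions. They are not SMILE's implementations and handle none of the abbreviation or contraction cases described in this guide. The point is only the shape of the pipeline: each level maps a string to an array of smaller strings, and `flatMap` chains the levels.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SplitterPipelineSketch {
    // Hypothetical stand-ins (NOT SMILE classes): naive regex splitters.
    static final Function<String, String[]> PARAGRAPHS =
            text -> text.split("\\n\\s*\\n");          // blank line between paragraphs
    static final Function<String, String[]> SENTENCES =
            para -> para.split("(?<=[.?!])\\s+");      // whitespace after ., ? or !
    static final Function<String, String[]> WORDS =
            sentence -> sentence.split("\\s+");        // whitespace only, punctuation stays attached

    // paragraph -> sentence -> word, flattened into a single token list
    static List<String> tokens(String document) {
        return Arrays.stream(PARAGRAPHS.apply(document))
                .flatMap(p -> Arrays.stream(SENTENCES.apply(p)))
                .flatMap(s -> Arrays.stream(WORDS.apply(s)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("One two. Three!\n\nFour five."));
        // [One, two., Three!, Four, five.]
    }
}
```

Assuming the SMILE interfaces are typed as `Function<String, String[]>`, the stand-ins could be swapped for the real splitters without changing the stream code.
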

---

## Interfaces

### `Tokenizer`

```java
public interface Tokenizer extends Function<String, String[]> {
    String[] split(String text); // tokenize text into words/tokens
}
```

### `SentenceSplitter`

```java
public interface SentenceSplitter extends Function<String, String[]> {
    String[] split(String text); // segment text into sentences
}
```

### `ParagraphSplitter`

```java
public interface ParagraphSplitter extends Function<String, String[]> {
    String[] split(String text); // segment text into paragraphs
}
```

Because all three interfaces extend `Function`, you can compose them with standard Java functional utilities:

```java
import java.util.Arrays;
import smile.nlp.tokenizer.*;

SentenceSplitter splitter = SimpleSentenceSplitter.getInstance();
Tokenizer tokenizer = new SimpleTokenizer();

// Use as Function in a stream (text holds one paragraph of input)
String[] sentences = splitter.apply(text);
String[][] tokens = Arrays.stream(sentences)
        .map(tokenizer)
        .toArray(String[][]::new);
```

---

## Word Tokenizers

### `SimpleTokenizer`

`SimpleTokenizer` is the recommended general-purpose English word tokenizer. It handles contractions, possessives, punctuation, and abbreviation-final periods sensibly.

**Key behaviours:**

* Splits most punctuation from adjoining words.
* Expands negative contractions to their full forms and splits other contractions into separate tokens:

  | Input | Output tokens |
  |-------|---------------|
  | `won't` | `will not` |
  | `can't` | `can not` |
  | `shan't` | `shall not` |
  | `cannot` | `can not` |
  | `weren't` | `were not` |
  | `'tisn't` | `it is not` |
  | `I'm` | `I 'm` |
  | `he'll` | `he 'll` |
  | `gonna` | `gon na` |

* Keeps abbreviation-terminal periods attached (e.g., `etc.` stays `etc.`); when the abbreviation ends a sentence, an additional `.` token is emitted as the sentence terminator.
* Commas inside numbers (`2,500`) are **not** split.
* **Thread safety:** instances are independent and thread-safe (no shared mutable state; each call is stateless beyond the compiled `Pattern` constants).

#### Basic usage

```java
import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();
String[] tokens = tokenizer.split(
    "Dr. Smith won't attend the conference, but she'll send her notes.");
System.out.println(java.util.Arrays.toString(tokens));
// [Dr., Smith, will, not, attend, the, conference, ,, but, she, 'll, send, her, notes, .]
```

#### Numeric and punctuation edge cases

```java
import java.util.Arrays;
import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();

// Commas inside numbers are not split
System.out.println(Arrays.toString(tokenizer.split("Population is 2,500,000.")));
// [Population, is, 2,500,000, .]

// Ellipsis is separated
System.out.println(Arrays.toString(tokenizer.split("Wait... then go.")));
// [Wait, ..., then, go, .]
```

---

### `PennTreebankTokenizer`

`PennTreebankTokenizer` follows the tokenization conventions of the Penn Treebank corpus. It is a singleton (use `PennTreebankTokenizer.getInstance()`) and is the standard choice when your downstream models (e.g., `HMMPOSTagger`) were trained on Penn Treebank data.

**Key differences from `SimpleTokenizer`:**

| Input | `SimpleTokenizer` | `PennTreebankTokenizer` |
|-------|-------------------|------------------------|
| `won't` | `will not` | `wo n't` |
| `can't` | `can not` | `ca n't` |
| `'tisn't` | `it is not` | `'t is n't` |

The Penn Treebank convention keeps the contracted negative `n't` as a separate token; `SimpleTokenizer` expands it to the natural English form instead.

#### Basic usage

```java
import smile.nlp.tokenizer.PennTreebankTokenizer;

PennTreebankTokenizer tokenizer = PennTreebankTokenizer.getInstance();
String[] tokens = tokenizer.split("They couldn't have known.");
System.out.println(java.util.Arrays.toString(tokens));
// [They, could, n't, have, known, .]
```

#### When to use

* Use `PennTreebankTokenizer` when feeding tokens to models trained on Penn Treebank data (including `HMMPOSTagger`).
* Use `SimpleTokenizer` for all other English NLP tasks where natural-English token forms are preferred.

---

### `BreakIteratorTokenizer`

`BreakIteratorTokenizer` wraps Java's `java.text.BreakIterator` for word segmentation.
It supports **any locale** supported by the JVM, making it the right choice for non-English text.

> ⚠️ **Not thread-safe.** `BreakIterator` maintains internal state; each
> thread must create its own instance.

#### Basic usage

```java
import smile.nlp.tokenizer.BreakIteratorTokenizer;
import java.util.Locale;

// Default locale
BreakIteratorTokenizer tokenizer = new BreakIteratorTokenizer();
System.out.println(java.util.Arrays.toString(tokenizer.split("Hello, world!")));

// Explicit locale
BreakIteratorTokenizer frTokenizer = new BreakIteratorTokenizer(Locale.FRENCH);
System.out.println(java.util.Arrays.toString(frTokenizer.split("Bonjour, le monde!")));
```

#### Multi-threaded use

```java
// One tokenizer per thread, created lazily on first access
ThreadLocal<BreakIteratorTokenizer> tlTokenizer =
    ThreadLocal.withInitial(BreakIteratorTokenizer::new);

// In each thread:
BreakIteratorTokenizer tokenizer = tlTokenizer.get();
String[] tokens = tokenizer.split(text);
```

---

## Sentence Splitters

### `SimpleSentenceSplitter`

`SimpleSentenceSplitter` is the recommended English sentence splitter. It is a singleton that uses a set of regular-expression heuristics to handle the hardest cases:

* A `.` after a known abbreviation (`Mr.`, `Dr.`, `etc.`, `vs.`, …) is **not** treated as a sentence boundary.
* `.` followed by a lowercase letter is not a boundary.
* `.` at the end of the string or before a newline is always a boundary.
* `?` and `!` are always boundaries.
* Treats carriage returns as whitespace (expects paragraph-segmented input).

> Assumes input has already been split into paragraphs. Feed each paragraph
> individually for best results.

#### Basic usage

```java
import smile.nlp.tokenizer.SimpleSentenceSplitter;

SimpleSentenceSplitter splitter = SimpleSentenceSplitter.getInstance();
String paragraph = "Dr. Smith attended the conf. in Jan. He presented his findings. "
                 + "Was the result surprising? Absolutely!";

for (String sentence : splitter.split(paragraph)) {
    System.out.println(sentence);
}
// Dr. Smith attended the conf. in Jan.
// He presented his findings.
// Was the result surprising?
// Absolutely!
```

#### Thread safety

`SimpleSentenceSplitter` is a stateless singleton and is **thread-safe**.

---

### `BreakIteratorSentenceSplitter`

`BreakIteratorSentenceSplitter` wraps `java.text.BreakIterator` for sentence segmentation. Like `BreakIteratorTokenizer`, it supports **any locale**.

> ⚠️ **Not thread-safe.** Create one instance per thread.

#### Basic usage

```java
import smile.nlp.tokenizer.BreakIteratorSentenceSplitter;
import java.util.Locale;

// Default locale
BreakIteratorSentenceSplitter splitter = new BreakIteratorSentenceSplitter();

// Specific locale
BreakIteratorSentenceSplitter deSplitter =
    new BreakIteratorSentenceSplitter(Locale.GERMAN);

for (String sentence : deSplitter.split("Das ist ein Test. Und noch ein Satz.")) {
    System.out.println(sentence);
}
// Das ist ein Test.
// Und noch ein Satz.
```

---

## Paragraph Splitter

### `SimpleParagraphSplitter`

`SimpleParagraphSplitter` is a singleton that segments text into paragraphs by splitting on **one or more blank lines**. A blank line is any line containing only whitespace characters. It also handles the Unicode paragraph separator character (U+2029).

#### Basic usage

```java
import smile.nlp.tokenizer.SimpleParagraphSplitter;

SimpleParagraphSplitter splitter = SimpleParagraphSplitter.getInstance();
String document = "First paragraph with multiple sentences. It continues here.\n\n"
                + "Second paragraph begins after the blank line.\n\n"
                + "Third paragraph.";

for (String para : splitter.split(document)) {
    System.out.println("PARAGRAPH: " + para);
}
// PARAGRAPH: First paragraph with multiple sentences. It continues here.
// PARAGRAPH: Second paragraph begins after the blank line.
// PARAGRAPH: Third paragraph.
```

`SimpleParagraphSplitter` is stateless and **thread-safe**.

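
The blank-line rule described above can be approximated with a single regular expression. The sketch below uses only the JDK and is an assumption about how such a rule might be written, not SMILE's actual implementation. The class name `BlankLineParagraphSketch` and the `BOUNDARY` pattern are hypothetical. A boundary is either a line break followed by one or more whitespace-only lines, or a single U+2029 character.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class BlankLineParagraphSketch {
    // Hypothetical pattern: a line break followed by one or more lines that hold
    // only spaces/tabs (i.e. blank lines), or the Unicode paragraph separator.
    private static final Pattern BOUNDARY =
            Pattern.compile("\\R(?:[ \\t]*\\R)+|\\u2029");

    static String[] split(String text) {
        return Arrays.stream(BOUNDARY.split(text.trim()))
                .map(String::trim)
                .filter(p -> !p.isEmpty())   // drop empty fragments
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String doc = "First.\n\nSecond.\n \n\nThird.";
        System.out.println(Arrays.toString(split(doc)));
        // [First., Second., Third.]
    }
}
```

A single line break does not match `BOUNDARY`, so hard-wrapped lines inside one paragraph stay together.
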

---

## English Abbreviations — `EnglishAbbreviations`

`EnglishAbbreviations` is a package-private interface that exposes a static dictionary of common English abbreviations loaded from the classpath resource `abbreviations_en.txt`. It is used internally by `SimpleSentenceSplitter` and `PennTreebankTokenizer` to avoid splitting on abbreviation periods.

The dictionary includes titles (`Mr`, `Mrs`, `Dr`, `Prof`), calendar items (`Jan`, `Feb`, `Mon`, `Tue`), geographic terms (`Ave`, `Blvd`, `St`), Latin abbreviations (`etc`, `vs`, `cf`, `al`), and more. It is not directly accessible from outside the package.

---

## Complete Pipeline Example

A typical NLP preprocessing pipeline works in three stages: paragraph → sentence → token.

```java
import smile.nlp.tokenizer.*;
import smile.nlp.stemmer.PorterStemmer;
import smile.nlp.pos.*;

// ── Splitters & tokenizer ────────────────────────────────────────────
ParagraphSplitter paragraphSplitter = SimpleParagraphSplitter.getInstance();
SentenceSplitter  sentenceSplitter  = SimpleSentenceSplitter.getInstance();
// HMMPOSTagger expects Penn Treebank token conventions (see "When to use" above)
Tokenizer         tokenizer         = PennTreebankTokenizer.getInstance();
HMMPOSTagger      tagger            = HMMPOSTagger.getDefault();
ThreadLocal<PorterStemmer> tlStemmer = ThreadLocal.withInitial(PorterStemmer::new);

// ── Input document ───────────────────────────────────────────────────
String document =
    "Alan Turing was a British mathematician. " +
    "He proposed the Turing test in 1950.\n\n" +
    "His work laid the foundation for computer science.";

// ── Pipeline ─────────────────────────────────────────────────────────
PorterStemmer stemmer = tlStemmer.get();
for (String paragraph : paragraphSplitter.split(document)) {
    for (String sentence : sentenceSplitter.split(paragraph)) {
        String[] tokens = tokenizer.split(sentence);
        PennTreebankPOS[] tags = tagger.tag(tokens);
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].open) { // content word
                String stem = stemmer.stem(tokens[i].toLowerCase());
                System.out.printf("%-20s %-6s %s%n", tokens[i], tags[i], stem);
            }
        }
        System.out.println();
    }
}
```

---

## Choosing the Right Implementation

### Word tokenizer

| Scenario | Recommended |
|----------|-------------|
| General English text | `SimpleTokenizer` |
| Penn Treebank / pre-trained NLP models | `PennTreebankTokenizer` |
| Non-English or multilingual | `BreakIteratorTokenizer` |

### Sentence splitter

| Scenario | Recommended |
|----------|-------------|
| English text (production use) | `SimpleSentenceSplitter` |
| Multilingual / locale-sensitive | `BreakIteratorSentenceSplitter` |

### Paragraph splitter

| Scenario | Recommended |
|----------|-------------|
| Any text with blank-line paragraph boundaries | `SimpleParagraphSplitter` |

---

## Thread-Safety Summary

| Class | Thread-safe? | Notes |
|-------|-------------|-------|
| `SimpleTokenizer` | ✅ Yes | Stateless after construction |
| `PennTreebankTokenizer` | ✅ Yes | Stateless singleton |
| `BreakIteratorTokenizer` | ❌ No | `BreakIterator` is not thread-safe; use `ThreadLocal` |
| `SimpleSentenceSplitter` | ✅ Yes | Stateless singleton |
| `BreakIteratorSentenceSplitter` | ❌ No | `BreakIterator` is not thread-safe; use `ThreadLocal` |
| `SimpleParagraphSplitter` | ✅ Yes | Stateless singleton |

---

## API Quick-Reference

```java
// ── Word tokenizers ──────────────────────────────────────────────────
Tokenizer simple  = new SimpleTokenizer();                  // thread-safe
Tokenizer ptb     = PennTreebankTokenizer.getInstance();    // singleton, thread-safe
Tokenizer biTok   = new BreakIteratorTokenizer();           // per-thread
Tokenizer biTokFr = new BreakIteratorTokenizer(Locale.FRENCH); // locale-aware

String[] tokens = simple.split("He won't go.");
// [He, will, not, go, .]

// ── Sentence splitters ───────────────────────────────────────────────
SentenceSplitter ss  = SimpleSentenceSplitter.getInstance();     // singleton
SentenceSplitter bis = new BreakIteratorSentenceSplitter();      // per-thread
SentenceSplitter bde = new BreakIteratorSentenceSplitter(Locale.GERMAN);

String[] sentences = ss.split("Hello world. How are you?");
// [Hello world., How are you?]

// ── Paragraph splitter ───────────────────────────────────────────────
ParagraphSplitter ps = SimpleParagraphSplitter.getInstance();    // singleton

String[] paragraphs = ps.split("Para one.\n\nPara two.");
// [Para one., Para two.]

// ── As Function in streams ───────────────────────────────────────────
String[][] allTokens = Arrays.stream(sentences)
        .map(simple) // Tokenizer IS a Function
        .toArray(String[][]::new);
```

---

## Notes and Caveats

* **Input assumptions** — `SimpleSentenceSplitter` and the English word tokenizers (`SimpleTokenizer`, `PennTreebankTokenizer`) assume the input is a single paragraph (no embedded newlines from paragraph breaks). Pass the text through `SimpleParagraphSplitter` first and feed each paragraph separately.
* **Sentence-final abbreviations** — `SimpleSentenceSplitter` consults `EnglishAbbreviations` to avoid splitting on abbreviation periods, but the dictionary is not exhaustive. Domain-specific abbreviations may require a custom splitter.
* **Penn Treebank conventions** — if you use `PennTreebankTokenizer`, make sure your downstream models (taggers, parsers) were trained on Penn Treebank tokenized data. Mixing tokenization conventions can reduce accuracy.
* **Locale** — `BreakIterator`-based classes are locale-aware but rely on the break rules and locale data shipped with the Java runtime. Results may vary across JDK versions and vendors.
* **Empty tokens** — all implementations filter out blank tokens, so `String[] tokens` will never contain an empty string.

---

*SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.*