Coverage for nltk.tokenize.api : 54%
# Natural Language Toolkit: Tokenizer Interface
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Edward Loper <edloper@gradient.cis.upenn.edu>
#         Steven Bird <sb@csse.unimelb.edu.au>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
Tokenizer Interface
"""

from nltk.internals import overridden
from nltk.tokenize.util import string_span_tokenize


class TokenizerI(object):
    """
    A processing interface for tokenizing a string.
    Subclasses must define ``tokenize()`` or ``batch_tokenize()`` (or both).
    """
    def tokenize(self, s):
        """
        Return a tokenized copy of *s*.

        :rtype: list of str
        """
        if overridden(self.batch_tokenize):
            return self.batch_tokenize([s])[0]
        else:
            raise NotImplementedError()

    def span_tokenize(self, s):
        """
        Identify the tokens using integer offsets ``(start_i, end_i)``,
        where ``s[start_i:end_i]`` is the corresponding token.

        :rtype: iter(tuple(int, int))
        """
        raise NotImplementedError()

    def batch_tokenize(self, strings):
        """
        Apply ``self.tokenize()`` to each element of ``strings``.  I.e.:

            return [self.tokenize(s) for s in strings]

        :rtype: list(list(str))
        """
        return [self.tokenize(s) for s in strings]

    def batch_span_tokenize(self, strings):
        """
        Apply ``self.span_tokenize()`` to each element of ``strings``.  I.e.:

            return [self.span_tokenize(s) for s in strings]

        :rtype: iter(list(tuple(int, int)))
        """
        for s in strings:
            yield list(self.span_tokenize(s))
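The interface above says a subclass may define either ``tokenize()`` or ``batch_tokenize()``, and may additionally expose character offsets via ``span_tokenize()``. A minimal, self-contained sketch of a conforming tokenizer is shown below; it avoids importing nltk so it runs on its own, and the class name `WhitespaceSpanTokenizer` is illustrative, not part of the library:

```python
import re

class WhitespaceSpanTokenizer:
    """Illustrative TokenizerI-style tokenizer: defines span_tokenize()
    and derives tokenize() from the spans, honoring the contract that
    s[start:end] is the corresponding token."""

    def span_tokenize(self, s):
        # Yield (start, end) offsets; each maximal run of non-whitespace
        # characters is one token.
        for m in re.finditer(r"\S+", s):
            yield m.span()

    def tokenize(self, s):
        # A tokenized copy of s, recovered from the offsets.
        return [s[start:end] for start, end in self.span_tokenize(s)]

    def batch_tokenize(self, strings):
        # Same contract as TokenizerI.batch_tokenize().
        return [self.tokenize(s) for s in strings]

t = WhitespaceSpanTokenizer()
print(t.tokenize("good muffins cost $3.88"))   # tokens
print(list(t.span_tokenize("good muffins")))   # offsets into the string
```

Deriving ``tokenize()`` from ``span_tokenize()`` keeps the two views consistent by construction, which is useful when callers need to map tokens back to positions in the original text.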
"""A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses). """
for span in string_span_tokenize(s, self._string): yield span
import doctest doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE) |
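The fixed-separator pattern used by the string tokenizer, splitting on ``self._string`` while still reporting character spans, can be sketched without nltk as follows. The class name `SimpleStringTokenizer` is illustrative, and the span logic only approximates ``string_span_tokenize()``:

```python
class SimpleStringTokenizer:
    """Illustrative stand-in for a fixed-separator tokenizer: splits on
    a given separator string and reports (start, end) character spans."""

    def __init__(self, sep):
        self._string = sep

    def tokenize(self, s):
        return s.split(self._string)

    def span_tokenize(self, s):
        # Walk the pieces produced by split(), advancing past each piece
        # plus one separator to compute the next start offset.
        start = 0
        for piece in s.split(self._string):
            yield (start, start + len(piece))
            start += len(piece) + len(self._string)

tok = SimpleStringTokenizer(", ")
print(tok.tokenize("red, green, blue"))
print(list(tok.span_tokenize("red, green, blue")))
```

Because the spans are computed from the same ``split()`` call that produces the tokens, ``s[start:end]`` always equals the corresponding token, matching the offset contract stated in ``span_tokenize()``'s docstring.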