Coverage for nltk.tokenize.sexpr : 89%
# Natural Language Toolkit: Tokenizers
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Yoav Goldberg <yoavg@cs.bgu.ac.il>
#         Steven Bird <sb@csse.unimelb.edu.au> (minor edits)
# URL: <http://nltk.sourceforge.net>
# For license information, see LICENSE.TXT
"""
S-Expression Tokenizer
``SExprTokenizer`` is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.
    >>> from nltk.tokenize import SExprTokenizer
    >>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']
By default, ``SExprTokenizer`` raises a ``ValueError`` exception if the string being tokenized contains non-matching parentheses:
    >>> SExprTokenizer().tokenize('c) d) e (f (g')
    Traceback (most recent call last):
      ...
    ValueError: Un-matched close paren at char 1
The ``strict`` argument can be set to ``False`` to allow non-matching parentheses. Each unmatched close parenthesis is then listed as its own s-expression, and the last partial s-expression with unmatched open parentheses is listed as a single s-expression:
    >>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
    ['c', ')', 'd', ')', 'e', '(f (g']
The characters used for open and close parentheses may be customized using the ``parens`` argument to the ``SExprTokenizer`` constructor:
    >>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
    ['{a b {c d}}', 'e', 'f', '{g}']
The s-expression tokenizer is also available as a function:
    >>> from nltk.tokenize import sexpr_tokenize
    >>> sexpr_tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']
"""
"""
A tokenizer that divides strings into s-expressions.
An s-expression can be either:

  - a parenthesized expression, including any nested parenthesized
    expressions, or
  - a sequence of non-whitespace non-parenthesis characters.
For example, the string ``(a (b c)) d e (f)`` consists of four s-expressions: ``(a (b c))``, ``d``, ``e``, and ``(f)``.
By default, the characters ``(`` and ``)`` are treated as open and close parentheses, but alternative strings may be specified.
:param parens: A two-element sequence specifying the open and close
    parentheses that should be used to find sexprs.  This will typically
    be either a two-character string, or a list of two strings.
:type parens: str or list
:param strict: If true, then raise an exception when tokenizing an
    ill-formed sexpr.
"""
raise ValueError('parens must contain exactly two strings')
re.escape(parens[1])))
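The two fragments above suggest that the constructor validates ``parens`` and then escapes each delimiter for use in a regular expression. A minimal sketch of that idea follows; the helper name ``compile_paren_regexp`` is hypothetical and is not part of this module, and the surrounding constructor code is not reproduced here.

```python
import re


def compile_paren_regexp(parens='()'):
    """Hypothetical helper: validate *parens* and build a regexp that
    matches either the open or the close delimiter."""
    if len(parens) != 2:
        raise ValueError('parens must contain exactly two strings')
    open_paren, close_paren = parens[0], parens[1]
    # re.escape keeps delimiters such as "(" or "[" from being
    # interpreted as regular-expression syntax.
    return re.compile('%s|%s' % (re.escape(open_paren),
                                 re.escape(close_paren)))


# Locate every delimiter, with its character offset, in a sample string.
pattern = compile_paren_regexp('{}')
positions = [(m.group(), m.start()) for m in pattern.finditer('{a {b}} c')]
```

Because only the length of ``parens`` is checked, a list of two multi-character strings works the same way as a two-character string.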
""" Return a list of s-expressions extracted from *text*. For example:
    >>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']
All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the ``strict`` parameter to the constructor. If ``strict`` is ``True``, then raise a ``ValueError``. If ``strict`` is ``False``, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:
    >>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
    ['c', ')', 'd', ')', 'e', '(f (g']
:param text: the string to be tokenized
:type text: str or iter(str)
:rtype: list(str)
"""
% m.start())
raise ValueError('Un-matched open paren at char %d' % pos)
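The behavior described in the docstring can be illustrated with a depth counter over the delimiter matches. The following is a minimal standalone sketch under that assumption, not necessarily how this module implements ``tokenize``; the function name ``split_sexprs`` is hypothetical, and splitting the trailing text on whitespace is a choice made for this sketch.

```python
import re


def split_sexprs(text, parens='()', strict=True):
    """Hypothetical sketch: split *text* into s-expressions by tracking
    parenthesis nesting depth."""
    open_paren, close_paren = parens[0], parens[1]
    paren_regexp = re.compile('%s|%s' % (re.escape(open_paren),
                                         re.escape(close_paren)))
    result = []
    pos = 0    # start of the text not yet emitted
    depth = 0  # current nesting depth
    for m in paren_regexp.finditer(text):
        if depth == 0:
            # Outside any parentheses: emit whitespace-separated tokens.
            result += text[pos:m.start()].split()
            pos = m.start()
        if m.group() == open_paren:
            depth += 1
        else:
            if strict and depth == 0:
                raise ValueError('Un-matched close paren at char %d'
                                 % m.start())
            depth = max(0, depth - 1)
            if depth == 0:
                # A top-level expression (or a stray close paren) ends here.
                result.append(text[pos:m.end()])
                pos = m.end()
    if strict and depth > 0:
        raise ValueError('Un-matched open paren at char %d' % pos)
    if depth > 0:
        # Non-strict mode: keep the unfinished expression as one token.
        result.append(text[pos:])
    else:
        # Sketch-specific choice: split any trailing plain tokens.
        result += text[pos:].split()
    return result
```

As the docstring notes for the real tokenizer, this sketch also gives no special treatment to parentheses inside quotes, so ``split_sexprs('say "(" now')`` raises ``ValueError`` in strict mode.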
if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)