# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Steven Bird <sb@csse.unimelb.edu.au>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
#

"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text.  This task is
called "chunk parsing" or "chunking", and the identified groups are
called "chunks".  The chunked text is represented using a shallow
tree called a "chunk structure."  A chunk structure is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens.  For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

  (SENTENCE:
    (NP: <I>)
    <saw>
    (NP: <the> <big> <dog>)
    <on>
    (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's ``leaves()`` method.
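
For example (a minimal sketch, assuming the chunk structure is an
``nltk.Tree`` whose leaves are ``(word, tag)`` pairs, as produced by
the chunk parsers in this module when given tagged input)::

    from nltk.tree import Tree

    tree = Tree('S', [Tree('NP', [('I', 'PRP')]),
                      ('saw', 'VBD'),
                      Tree('NP', [('the', 'DT'), ('big', 'JJ'),
                                  ('dog', 'NN')])])
    tree.leaves()
    # [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('big', 'JJ'), ('dog', 'NN')]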

 

This module defines ``ChunkParserI``, a standard interface for
chunking texts; and ``RegexpChunkParser``, a regular-expression based
implementation of that interface.  It also defines ``ChunkScore``, a
utility class for scoring chunk parsers.

RegexpChunkParser
=================

``RegexpChunkParser`` is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text.  Its
``parse()`` method first constructs a ``ChunkString``, which encodes a
particular chunking of the input text.  Initially, nothing is
chunked.  ``RegexpChunkParser.parse()`` then applies a sequence of
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
the chunking that it encodes.  Finally, the ``ChunkString`` is
transformed back into a chunk structure, which is returned.

``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
For example, you can use a ``RegexpChunkParser`` to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text.  (This is a limitation of ``RegexpChunkParser``, not of
chunk parsers in general.)
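
As a minimal sketch of typical usage, via the higher-level
``RegexpParser`` wrapper, which builds chunk rules from a grammar
string (the grammar and the example sentence are purely
illustrative)::

    from nltk.chunk import RegexpParser

    # One chunk type (NP), defined by a single tag pattern.
    parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    sentence = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'),
                ('barked', 'VBD')]
    chunk_structure = parser.parse(sentence)   # an nltk.Tree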

 

RegexpChunkRules
----------------

A ``RegexpChunkRule`` is a transformational rule that updates the
chunking of a text by modifying its ``ChunkString``.  Each
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
the chunking encoded by a ``ChunkString``.  The
``RegexpChunkRule`` class itself can be used to implement any
transformational rule based on regular expressions.  There are
also a number of subclasses, which can be used to implement
simpler types of rules (see the sketch following this list):

    - ``ChunkRule`` chunks anything that matches a given regular
      expression.
    - ``ChinkRule`` chinks anything that matches a given regular
      expression.
    - ``UnChunkRule`` will un-chunk any chunk that matches a given
      regular expression.
    - ``MergeRule`` can be used to merge two contiguous chunks.
    - ``SplitRule`` can be used to split a single chunk into two
      smaller chunks.
    - ``ExpandLeftRule`` will expand a chunk to incorporate new
      unchunked material on the left.
    - ``ExpandRightRule`` will expand a chunk to incorporate new
      unchunked material on the right.
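
For example, a small rule sequence might be assembled and applied as
follows (a hedged sketch: the tag patterns, rule descriptions, and
chunk label are purely illustrative)::

    from nltk.chunk.regexp import RegexpChunkParser, ChunkRule, ChinkRule

    rules = [
        # Chunk any run of determiners, adjectives, and nouns...
        ChunkRule('(<DT>|<JJ>|<NN>)+', 'Chunk sequences of DT, JJ and NN'),
        # ...then chink (remove from any chunk) the determiners, purely
        # to illustrate that rules are applied in order.
        ChinkRule('<DT>', 'Chink determiners'),
    ]
    # The second argument gives the node label used for the chunks.
    np_chunker = RegexpChunkParser(rules, 'NP')
    chunked = np_chunker.parse([('the', 'DT'), ('little', 'JJ'),
                                ('cat', 'NN'), ('sat', 'VBD')])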

 

Tag Patterns
~~~~~~~~~~~~

A ``RegexpChunkRule`` uses a modified version of regular
expression patterns, called "tag patterns".  Tag patterns are
used to match sequences of tags.  Examples of tag patterns are::

     r'(<DT>|<JJ>|<NN>)+'
     r'<NN>+'
     r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

    - In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
      ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
      ``'<NN'`` followed by one or more repetitions of ``'>'``.
    - Whitespace in tag patterns is ignored.  So
      ``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
    - In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
      ``'<NN.*>'`` matches any single tag starting with ``'NN'``.

The function ``tag_pattern2re_pattern`` can be used to transform
a tag pattern to an equivalent regular expression pattern.
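
For instance (a small sketch; the exact string returned is an
implementation detail, so it is not shown here)::

    from nltk.chunk.regexp import tag_pattern2re_pattern

    tag_pattern2re_pattern('<NN.*>+')
    # -> an ordinary regular expression string, suitable for re.compile(),
    #    matching one or more tags that start with 'NN'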

 

Efficiency
----------

Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if ``RegexpChunkParser`` is used with more than
5,000 tokens at a time.  In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth.  We have attempted to minimize
these problems, but it is impossible to avoid them completely.  We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
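
In practice this means chunking sentence by sentence, for example
(assuming ``tagged_sentences`` is a list of tagged sentences and
``chunk_parser`` is any ``ChunkParserI`` implementation)::

    chunked_sentences = [chunk_parser.parse(sent)
                         for sent in tagged_sentences]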

 

Emacs Tip
---------

If you evaluate the following elisp expression in emacs, it will
colorize a ``ChunkString`` when you use an interactive python shell
with emacs or xemacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"``C-x C-e``".  You should evaluate it before running the interactive
session.  The change will last until you close emacs.

Unresolved Issues
-----------------

If we use the ``re`` module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion.  We
therefore use the ``pre`` module instead.  But note that ``pre``
does not include Unicode support, so this module will not work
with unicode strings.  Note also that ``pre`` regular expressions
are not quite as advanced as ``re`` ones (e.g., no leftward
zero-length assertions).

:type CHUNK_TAG_PATTERN: regexp
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
     pattern is valid.
"""

from nltk.data import load

from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import (ChunkScore, accuracy, tagstr2tree, conllstr2tree,
                             tree2conlltags, tree2conllstr,
                             ieerstr2tree)
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser

# Standard named entity chunker models
_BINARY_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
_MULTICLASS_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'

def ne_chunk(tagged_tokens, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to
    chunk the given list of tagged tokens.
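
    A typical call might look like this (an illustrative sketch; it
    assumes the ``maxent_ne_chunker`` model has been downloaded, and
    that the input has already been tokenized and POS-tagged, e.g.
    with ``nltk.pos_tag``)::

        from nltk import word_tokenize, pos_tag
        from nltk.chunk import ne_chunk

        tagged = pos_tag(word_tokenize("Samuel Johnson lived in London."))
        tree = ne_chunk(tagged)   # an nltk.Tree with named-entity subtrees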

    """ 

    if binary: 

        chunker_pickle = _BINARY_NE_CHUNKER 

    else: 

        chunker_pickle = _MULTICLASS_NE_CHUNKER 

    chunker = load(chunker_pickle) 

    return chunker.parse(tagged_tokens) 

 

def batch_ne_chunk(tagged_sentences, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to chunk the
    given list of tagged sentences, each consisting of a list of tagged tokens.
    """
    if binary:
        chunker_pickle = _BINARY_NE_CHUNKER
    else:
        chunker_pickle = _MULTICLASS_NE_CHUNKER
    chunker = load(chunker_pickle)
    return chunker.batch_parse(tagged_sentences)