
# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

""" 

Classes and interfaces for labeling tokens with category labels (or 

"class labels").  Typically, labels are represented with strings 

(such as ``'health'`` or ``'sports'``).  Classifiers can be used to 

perform a wide range of classification tasks.  For example, 

classifiers can be used... 

 

- to classify documents by topic 

- to classify ambiguous words by which word sense is intended 

- to classify acoustic signals by which phoneme they represent 

- to classify sentences by their author 

 

Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.
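A count-based feature detector along those lines might be sketched as
follows (the name ``word_count_features`` is illustrative, not part of
NLTK):

    >>> def word_count_features(document):
    ...     features = {}
    ...     for w in document:
    ...         name = 'count(%s)' % w
    ...         features[name] = features.get(name, 0) + 1
    ...     return features
    >>> word_count_features(['the', 'cat', 'saw', 'the', 'dog'])['count(the)']
    2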

 

Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values".  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a document
contains the word ``'library'``.  Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
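For instance, a featureset for one occurrence of the word "library"
might look like the following (the exact feature names, such as
``'word-length'``, are chosen by the feature detector and are shown
here purely for illustration):

    >>> featureset = {'prevword': 'the',
    ...               'contains-word(library)': True,
    ...               'word-length': 7}
    >>> featureset['prevword']
    'the'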

 

Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor").  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:

    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed
to the classifier:

.. doctest::
    :options: +SKIP

    >>> # Classify each Gutenberg document.
    >>> from nltk.corpus import gutenberg
    >>> for fileid in gutenberg.fileids():
    ...     doc = gutenberg.words(fileid)
    ...     print(fileid, classifier.classify(document_features(doc)))

The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word.  The following feature detector
for WSD includes features describing the up-to-three words to the left
and right of the target word:

    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     for i in range(max(0, index - 3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     for i in range(index + 1, min(len(sentence), index + 4)):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset
    >>> feats = wsd_features(['Her', 'interest', 'rate', 'is', 'low'], 2)
    >>> sorted(feats)
    ['left-context(Her)', 'left-context(interest)', 'right-context(is)', 'right-context(low)']

Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set".  Training sets are represented
as lists of ``(featureset, label)`` tuples.
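For example, a tiny hand-labeled training set for a topic classifier
might be written out directly (illustrative data only):

    >>> train_set = [
    ...     ({'contains-word(ball)': True, 'contains-word(team)': True}, 'sports'),
    ...     ({'contains-word(virus)': True, 'contains-word(cure)': True}, 'health'),
    ... ]
    >>> len(train_set)
    2

Such a list can then be passed to a trainer, e.g.
``classifier = NaiveBayesClassifier.train(train_set)``.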

""" 

 

from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.mallet import config_mallet, call_mallet
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, log_likelihood
 

# Conditional imports: these classifiers depend on optional third-party
# packages, and are made available only when those packages are installed.
# Each dependency gets its own try block, so that a missing svmlight does
# not also hide the numpy-based maxent classifiers.

try:
    from nltk.classify.scikitlearn import SklearnClassifier
except ImportError:
    pass

try:
    import numpy
    from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,
                                      TypedMaxentFeatureEncoding,
                                      ConditionalExponentialClassifier)
except ImportError:
    pass

try:
    import svmlight
    from nltk.classify.svm import SvmClassifier
except ImportError:
    pass