Coverage for nltk.classify.svm : 89%
# Natural Language Toolkit: SVM-based classifier
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Leon Derczynski <leon@dcs.shef.ac.uk>
#
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
A classifier based on a support vector machine. This code uses Thorsten
Joachims' SVM^light implementation (http://svmlight.joachims.org/), wrapped
using PySVMLight (https://bitbucket.org/wcauchois/pysvmlight). The default
settings are to train a linear classification kernel, though through minor
modification, full SVMlight capabilities should be accessible if needed.
Only binary classification is possible at present.
"""
#
# Interface to Support Vector Machine
#
# Create a boolean feature name for the SVM from a feature/value pair,
# that'll take on a 1.0 value if the original feature:value is asserted.
"""
:param feature: a string denoting a feature name
:param value: the value of the feature
"""
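A pair like `('suffix1', 'a')` might be flattened into a single boolean feature name along these lines; a minimal sketch, where the `'|'` separator is an assumption rather than anything SVMlight requires:

```python
def featurename(feature, value):
    # Join the NLTK feature name and its value into one boolean name.
    # Any separator that cannot occur inside a feature name would do;
    # '|' here is an illustrative choice, not the original's.
    return '|'.join([feature, str(value)])
```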
# Convert a set of NLTK classifier features to SVMlight format.
"""
:param features: a dict of features in the format {'feature': value}
:param svmfeatureindex: a mapping from feature:value pairs to integer
    SVMlight feature labels
"""
# SVMlight supports sparse feature sets, so we simply omit features
# that we don't include.
# Each feature is represented as an (int, float) tuple, where the int
# is the SVMlight feature label and the float is the value; as we
# either have or have not a feature, this is 1.0.
# This does not support scalar features - rather, each value that a
# feature may take on is a discrete independent label.
# Use 1.0 as the feature value to specify the presence of a
# feature:value couple.
# Skip over feature:value pairs that were not in the training data and
# so not included in our mappings.
continue
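The conversion the comments describe can be sketched as follows; the `'feature:value'` key format for `svmfeatureindex` is an assumption made for illustration:

```python
def map_features_to_svm(features, svmfeatureindex):
    # Build a sparse SVMlight vector: one (int, 1.0) pair per asserted
    # feature:value couple, skipping pairs unseen in the training data.
    instancefeatures = []
    for feature, value in features.items():
        key = '%s:%s' % (feature, value)  # assumed key format
        if key not in svmfeatureindex:
            continue  # not in the training mappings; omit
        instancefeatures.append((svmfeatureindex[key], 1.0))
    # SVMlight expects feature indices in ascending order
    return sorted(instancefeatures)
```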
# convert a whole instance (including label) from NLTK to SVMlight format
"""
:param instance: an NLTK format instance, which is in the tuple format
    (dict(), label), where the dict contains feature:value pairs, and
    the label signifies the target attribute's value for this instance
    (e.g. its class)
:param labelmapping: a previously-defined dict mapping from text labels
    in the NLTK instance format to SVMlight labels of either +1 or -1
:param svmfeatureindex: a mapping from feature:value pairs to integer
    SVMlight feature labels
"""
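Combining the two mappings, whole-instance conversion might look like this hedged sketch (again assuming the `'feature:value'` key format):

```python
def map_instance_to_svm(instance, labelmapping, svmfeatureindex):
    # Pair the +1/-1 mapped label with the sparse SVMlight feature
    # vector built from the instance's feature dict.
    features, label = instance
    svmfeatures = sorted(
        (svmfeatureindex['%s:%s' % (f, v)], 1.0)
        for f, v in features.items()
        if '%s:%s' % (f, v) in svmfeatureindex)
    return (labelmapping[label], svmfeatures)
```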
"""
A Support Vector Machine classifier. To explain briefly, support vector
machines (SVM) treat each feature as a dimension, and position instances
in n-dimensional feature space. An optimal hyperplane is then determined
that best divides feature space into classes, and future instances are
classified based on which side of the hyperplane they lie on, and their
proximity to it.
This implementation is for a binary SVM - that is, only two classes are
supported. You may perform classification with more classes by training
an SVM per class and then picking the best option for new instances
given the results from each binary class-SVM.
"""
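The one-vs-rest scheme the docstring suggests could be sketched like this; the interface is hypothetical, assuming each per-class SVM exposes a function returning a signed confidence:

```python
def one_vs_rest_classify(binary_svms, featureset):
    # binary_svms maps each class label to a scoring function that
    # returns a signed confidence (positive = "is this class").
    # Pick the class whose binary SVM is most confident.
    best_label, best_score = None, float('-inf')
    for label, score_fn in binary_svms.items():
        score = score_fn(featureset)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```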
"""
:param labels: a list of text labels for classes
:param labelmapping: a mapping from labels to SVM classes (-1, +1)
:param svmfeatures: a list of svm features, where the index is the
    integer feature number and the value a feature/value pair
:param model: the model generated by svmlight.learn()
"""
# _svmfeatureindex is the inverse of svmfeatures, allowing us
# to find an SVM feature number (int) given a feature/value pair
"""
Return the list of class labels.
"""
return self._labels
"""
Searches the values of _labelmapping to resolve +1 or -1 to a string.

:param label: the string label to look up
"""
"""
Resolve a float (in this case, probably from svmlight.learn().classify())
to either -1 or +1, then look up the label for that class in
_labelmapping, and return the text label.

:param prediction: a signed float describing classifier confidence
"""
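A standalone sketch of that lookup, with the label mapping passed in explicitly for illustration:

```python
def resolve_prediction(prediction, labelmapping):
    # Snap the signed confidence to an SVM class, then invert the
    # label -> {-1, +1} mapping to recover the text label.
    svmclass = 1 if prediction > 0 else -1
    for label, mapped in labelmapping.items():
        if mapped == svmclass:
            return label
```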
"""
Given a set of features, classify them with our trained model and
return a signed float.

:param featureset: a dict of feature/value pairs in NLTK format,
    representing a single instance
"""
print('instance', instance_to_classify)
# svmlight.classify expects a list; this should be taken advantage of
# when writing SvmClassifier.batch_classify / .batch_prob_classify.
# It returns a list of floats, too.
"""
Return a probability distribution of classifications.

:param featureset: a dict of feature/value pairs in NLTK format,
    representing a single instance
"""
raise Exception('This classifier is not yet trained')
return None
# do the classification
print('prediction', prediction)
# lump it into a boolean class, -1 or +1
# sometimes the result is not within -1 ... +1; clip it so
# that it is, and we get a sane-looking probability
# distribution. This will upset some results with non-linear
# partitioning where instance-hyperplane distance can be many
# orders of magnitude larger; I don't have a fix for that.
prediction = -1.0
# if the prediction is negative, then we will maximise the
# value of the -1 class; otherwise, that of the 1 class will
# be greater.
if prediction < 0:
    distribution = {str(self.resolve_prediction(1)): 1 + prediction,
                    str(self.resolve_prediction(-1)): 1 - prediction}
else:
    distribution = {str(self.resolve_prediction(1)): prediction + 1,
                    str(self.resolve_prediction(-1)): -prediction}
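The idea behind those branches can be illustrated more symmetrically; this is a hypothetical alternative, not the original's arithmetic, that clips the signed margin and splits the probability mass linearly:

```python
def margin_to_distribution(prediction, poslabel, neglabel):
    # Clip the signed margin to [-1, +1] so the result looks like a
    # sane probability, then split the mass so a positive margin
    # favours poslabel and a negative one favours neglabel.
    prediction = max(-1.0, min(1.0, prediction))
    p_pos = (prediction + 1.0) / 2.0
    return {poslabel: p_pos, neglabel: 1.0 - p_pos}
```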
"""
Use a trained SVM to predict a label for an unlabelled instance.

:param featureset: a dict of feature/value pairs in NLTK format,
    representing a single instance
"""
print('prediction', prediction)
def train(featuresets):
    """
    Given a set of training instances in NLTK format:
    [ ( {feature:value, ..}, str(label) ) ]
    train a support vector machine.

    :param featuresets: training instances
    """
# build a unique list of labels
# this is a binary classifier only
raise ValueError('Can only do boolean classification (labels: ' +
                 str(labels) + ')')
return False
# we need ordering, so a set's no good
# next, assign -1 and 1
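The two steps commented above might be sketched as follows; `build_labelmapping` is a hypothetical helper name, and which label receives -1 versus +1 is an arbitrary assumption:

```python
def build_labelmapping(labels):
    # Collect labels in a stable first-seen order (a set would lose
    # ordering), then assign the SVMlight classes -1 and +1.
    ordered = []
    for label in labels:
        if label not in ordered:
            ordered.append(label)
    if len(ordered) != 2:
        # binary classifier only
        raise ValueError('Can only do boolean classification '
                         '(labels: %s)' % ordered)
    return {ordered[0]: -1, ordered[1]: 1}
```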
# now for feature conversion
# iter through instances, building a set of feature:type:str(value)
# triples
# svmfeatures is indexable by integer svm feature number
# svmfeatureindex is the inverse (svm feature name -> number)
# build svm feature set case by case
# train the svm # TODO: implement passing of SVMlight parameters from train() to learn()
([(name, 'male') for name in names.words('male.txt')] +
 [(name, 'female') for name in names.words('female.txt')])
demo()