Coverage for nltk.classify.scikitlearn : 75%
# Natural Language Toolkit: Interface to scikit-learn classifiers
#
# Author: Lars Buitinck <L.J.Buitinck@uva.nl>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
scikit-learn (http://scikit-learn.org) is a machine learning library for Python, supporting most of the basic classification algorithms, including SVMs, Naive Bayes, logistic regression and decision trees.
This package implements a wrapper around scikit-learn classifiers. To use this wrapper, construct a scikit-learn classifier, then use it to construct a SklearnClassifier. E.g., to wrap a linear SVM classifier with default settings, do:
>>> from sklearn.svm.sparse import LinearSVC
>>> from nltk.classify.scikitlearn import SklearnClassifier
>>> classif = SklearnClassifier(LinearSVC())
The scikit-learn classifier may be arbitrarily complex. E.g., the following constructs and wraps a Naive Bayes estimator with tf-idf weighting and chi-square feature selection:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.pipeline import Pipeline
>>> pipeline = Pipeline([('tfidf', TfidfTransformer()),
...                      ('chi2', SelectKBest(chi2, k=1000)),
...                      ('nb', MultinomialNB())])
>>> classif = SklearnClassifier(pipeline)
(Such a classifier could be trained on word counts for text classification.) """
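As a rough illustration of the word-count input mentioned above, the sketch below builds labeled featuresets from tokenized text using only the standard library. The helper name `word_count_features` and the toy documents are assumptions made for illustration; they are not part of NLTK.

```python
from collections import Counter


def word_count_features(tokens):
    """Map a list of tokens to a {word: count} featureset (hypothetical helper)."""
    return dict(Counter(tokens))


# Toy labeled documents, invented for illustration only.
docs = [
    ("good great good".split(), "pos"),
    ("bad awful".split(), "neg"),
]

# A list of (featureset, label) tuples, one per document.
labeled_featuresets = [(word_count_features(toks), label) for toks, label in docs]
```

A list shaped like `labeled_featuresets` is the kind of input the wrapper's `train` method is documented to accept.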
"""Wrapper for scikit-learn classifiers."""
""" :param estimator: scikit-learn classifier object.
:param dtype: data type used when building feature array. scikit-learn estimators work exclusively on numeric data; use bool when all features are binary.
:param sparse: Whether to use sparse matrices. The estimator must support these; not all scikit-learn classifiers do. The default value is True, since most NLP problems involve sparse feature sets.
:type sparse: boolean.
"""
return "<SklearnClassifier(%r)>" % self._clf
X = self._convert(featuresets)
y_proba_list = self._clf.predict_proba(X)
return [self._make_probdist(y_proba) for y_proba in y_proba_list]
return self._label_index.keys()
""" Train (fit) the scikit-learn estimator.
:param labeled_featuresets: A list of classified featuresets, i.e., a list of tuples ``(featureset, label)``. """
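Before a scikit-learn estimator can be fitted, feature names and labels have to be mapped to integers. The following is a minimal sketch of one way to do that, under the assumption that indices are assigned in order of first appearance. The function name `build_indices` and the variable `feature_index` are hypothetical; `label_index` and `index_label` mirror the attribute names visible in this listing, but the body is not the module's actual implementation.

```python
def build_indices(labeled_featuresets):
    """Assign a column index to each feature name and an integer id to each label."""
    feature_index = {}  # feature name -> column index (hypothetical name)
    label_index = {}    # label -> integer class id
    index_label = []    # integer class id -> label
    for featureset, label in labeled_featuresets:
        for name in featureset:
            if name not in feature_index:
                feature_index[name] = len(feature_index)
        if label not in label_index:
            label_index[label] = len(label_index)
            index_label.append(label)
    return feature_index, label_index, index_label


# Toy data, invented for illustration only.
data = [({"a": 1, "b": 2}, "x"), ({"b": 1, "c": 3}, "y"), ({"a": 4}, "x")]
feature_index, label_index, index_label = build_indices(data)

# Integer target vector, in the same order as the input.
y = [label_index[label] for _, label in data]
```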
else:
"""Convert featuresets to sparse matrix (COO format)."""
except KeyError: pass
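The COO conversion named in the docstring above can be sketched in pure Python as three parallel lists of row indices, column indices, and values. This is an assumption-laden illustration, not the module's code: `to_coo_triples` and `feature_index` are hypothetical names, and features missing from the index are simply skipped, matching the `except KeyError` fragment.

```python
def to_coo_triples(featuresets, feature_index):
    """Return (rows, cols, values) lists describing a sparse matrix in COO form."""
    rows, cols, values = [], [], []
    for i, featureset in enumerate(featuresets):
        for name, value in featureset.items():
            try:
                j = feature_index[name]
            except KeyError:
                continue  # feature not in the index: leave it out
            rows.append(i)
            cols.append(j)
            values.append(value)
    return rows, cols, values


# Toy index and featuresets, invented for illustration only.
feature_index = {"a": 0, "b": 1}
featuresets = [{"a": 1.0, "b": 2.0}, {"b": 3.0, "zzz": 9.0}]
rows, cols, values = to_coo_triples(featuresets, feature_index)
```

If SciPy is available, triples like these could be handed to `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(len(featuresets), len(feature_index)))`.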
"""Convert featureset to Numpy array."""
dtype=self._dtype)
except KeyError:  # feature not seen in training
    pass
return DictionaryProbDist(dict((self._index_label[i], p) for i, p in enumerate(y_proba)))
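The mapping built in the `return` statement above pairs each position in a probability row with its label. A small stand-alone sketch, using a plain dict in place of `DictionaryProbDist` and an invented `index_label` list, might look like this; the function name `label_probabilities` is hypothetical.

```python
def label_probabilities(y_proba, index_label):
    """Pair each probability with the label stored at the same position."""
    return dict((index_label[i], p) for i, p in enumerate(y_proba))


# Toy label list and probability row, invented for illustration only.
index_label = ["female", "male"]
probs = label_probabilities([0.25, 0.75], index_label)

# The most probable label is the key with the largest value.
best = max(probs, key=probs.get)
```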
from nltk.classify.util import names_demo, binary_names_demo_features
try:
    from sklearn.linear_model.sparse import LogisticRegression
except ImportError:  # separate sparse LR to be removed in 0.12
    from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
print("scikit-learn Naive Bayes:")
names_demo(SklearnClassifier(BernoulliNB(binarize=False), dtype=bool).train,
           features=binary_names_demo_features)
print("scikit-learn logistic regression:")
names_demo(SklearnClassifier(LogisticRegression(), dtype=np.float64).train,
           features=binary_names_demo_features)