{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying logistic regression and SVM\n", "> In this chapter you will learn the basics of applying logistic regression and support vector machines (SVMs) to classification problems. You'll use the scikit-learn library to fit classification models to real data. This is a summary of the lecture \"Linear Classifiers in Python\", via DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/plot_4_classifiers.png" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn refresher" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### KNN classification\n", "In this exercise you'll explore a subset of the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). The X variables contain features based on the words in the movie reviews, and the y variables contain labels for whether the review sentiment is positive (+1) or negative (-1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Large Movie Review Dataset: This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. 
See the README file contained in the release for more details.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import load_svmlight_file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = load_svmlight_file('./dataset/aclImdb/train/labeledBow.feat')\n", "X_test, y_test = load_svmlight_file('./dataset/aclImdb/test/labeledBow.feat')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "y_train[y_train < 5] = -1.0\n", "y_train[y_train >= 5] = 1.0\n", "\n", "y_test[y_test < 5] = -1.0\n", "y_test[y_test >= 5] = 1.0" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction for test example 0: -1.0\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Create and fit the model\n", "knn = KNeighborsClassifier()\n", "knn.fit(X_train[:, :89523], y_train)\n", "\n", "# Predict on the test features, print the results\n", "pred = knn.predict(X_test)[0]\n", "print(\"Prediction for test example 0:\", pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying logistic regression and SVM\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running LogisticRegression and SVC\n", "In this exercise, you'll apply logistic regression and a support vector machine to classify images of handwritten digits.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n", "0.96\n", "0.9955456570155902\n", "0.9911111111111112\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: 
TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n" ] } ], "source": [ "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "\n", "digits = datasets.load_digits()\n", "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)\n", "\n", "# Apply logistic regression and print scores\n", "lr = LogisticRegression()\n", "lr.fit(X_train, y_train)\n", "print(lr.score(X_train, y_train))\n", "print(lr.score(X_test, y_test))\n", "\n", "# Apply SVM and print scores\n", "svm = SVC()\n", "svm.fit(X_train, y_train)\n", "print(svm.score(X_train, y_train))\n", "print(svm.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear classifiers\n", "- classification: learning to predict categories\n", "- decision boundary: the surface separating different predicted classes\n", "- linear classifier: a classifier that learns linear decision boundaries\n", " - e.g. logistic regression, linear SVM\n", "- linearly separable: a dataset that can be perfectly explained by a linear classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualizing decision boundaries\n", "In this exercise, you'll visualize the decision boundaries of various classifier types." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#hide\n", "X = np.array([[11.45, 2.4 ],\n", " [13.62, 4.95],\n", " [13.88, 1.89],\n", " [12.42, 2.55],\n", " [12.81, 2.31],\n", " [12.58, 1.29],\n", " [13.83, 1.57],\n", " [13.07, 1.5 ],\n", " [12.7 , 3.55],\n", " [13.77, 1.9 ],\n", " [12.84, 2.96],\n", " [12.37, 1.63],\n", " [13.51, 1.8 ],\n", " [13.87, 1.9 ],\n", " [12.08, 1.39],\n", " [13.58, 1.66],\n", " [13.08, 3.9 ],\n", " [11.79, 2.13],\n", " [12.45, 3.03],\n", " [13.68, 1.83],\n", " [13.52, 3.17],\n", " [13.5 , 3.12],\n", " [12.87, 4.61],\n", " [14.02, 1.68],\n", " [12.29, 3.17],\n", " [12.08, 1.13],\n", " [12.7 , 3.87],\n", " [11.03, 1.51],\n", " [13.32, 3.24],\n", " [14.13, 4.1 ],\n", " [13.49, 1.66],\n", " [11.84, 2.89],\n", " [13.05, 2.05],\n", " [12.72, 1.81],\n", " [12.82, 3.37],\n", " [13.4 , 4.6 ],\n", " [14.22, 3.99],\n", " [13.72, 1.43],\n", " [12.93, 2.81],\n", " [11.64, 2.06],\n", " [12.29, 1.61],\n", " [11.65, 1.67],\n", " [13.28, 1.64],\n", " [12.93, 3.8 ],\n", " [13.86, 1.35],\n", " [11.82, 1.72],\n", " [12.37, 1.17],\n", " [12.42, 1.61],\n", " [13.9 , 1.68],\n", " [14.16, 2.51]])\n", "\n", "y = np.array([ True, True, False, True, True, True, False, False, True,\n", " False, True, True, False, False, True, False, True, True,\n", " True, False, True, True, True, False, True, True, True,\n", " True, True, True, True, True, False, True, True, True,\n", " False, False, True, True, True, True, False, False, False,\n", " True, True, True, False, True])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def make_meshgrid(x, y, h=.02, lims=None):\n", " \"\"\"Create a mesh of points to plot in\n", " \n", " Parameters\n", " ----------\n", " x: data to base x-axis meshgrid on\n", " y: data to base y-axis meshgrid on\n", " h: stepsize for 
meshgrid, optional\n", " \n", " Returns\n", " -------\n", " xx, yy : ndarray\n", " \"\"\"\n", " \n", " if lims is None:\n", " x_min, x_max = x.min() - 1, x.max() + 1\n", " y_min, y_max = y.min() - 1, y.max() + 1\n", " else:\n", " x_min, x_max, y_min, y_max = lims\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", " np.arange(y_min, y_max, h))\n", " return xx, yy" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def plot_contours(ax, clf, xx, yy, proba=False, **params):\n", " \"\"\"Plot the decision boundaries for a classifier.\n", " \n", " Parameters\n", " ----------\n", " ax: matplotlib axes object\n", " clf: a classifier\n", " xx: meshgrid ndarray\n", " yy: meshgrid ndarray\n", " params: dictionary of params to pass to contourf, optional\n", " \"\"\"\n", " if proba:\n", " Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,-1]\n", " Z = Z.reshape(xx.shape)\n", " out = ax.imshow(Z,extent=(np.min(xx), np.max(xx), np.min(yy), np.max(yy)), \n", " origin='lower', vmin=0, vmax=1, **params)\n", " ax.contour(xx, yy, Z, levels=[0.5])\n", " else:\n", " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " out = ax.contourf(xx, yy, Z, **params)\n", " return out" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def plot_classifier(X, y, clf, ax=None, ticks=False, proba=False, lims=None): \n", " # assumes classifier \"clf\" is already fit\n", " X0, X1 = X[:, 0], X[:, 1]\n", " xx, yy = make_meshgrid(X0, X1, lims=lims)\n", " \n", " if ax is None:\n", " plt.figure()\n", " ax = plt.gca()\n", " show = True\n", " else:\n", " show = False\n", " \n", " # can abstract some of this into a higher-level function for learners to call\n", " cs = plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8, proba=proba)\n", " if proba:\n", " cbar = plt.colorbar(cs)\n", " cbar.ax.set_ylabel('probability of red $\\Delta$ class', fontsize=20, rotation=270, 
labelpad=30)\n", " cbar.ax.tick_params(labelsize=14)\n", " #ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=30, edgecolors=\\'k\\', linewidth=1)\n", " labels = np.unique(y)\n", " if len(labels) == 2:\n", " ax.scatter(X0[y==labels[0]], X1[y==labels[0]], cmap=plt.cm.coolwarm, \n", " s=60, c='b', marker='o', edgecolors='k')\n", " ax.scatter(X0[y==labels[1]], X1[y==labels[1]], cmap=plt.cm.coolwarm, \n", " s=60, c='r', marker='^', edgecolors='k')\n", " else:\n", " ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=50, edgecolors='k', linewidth=1)\n", "\n", " ax.set_xlim(xx.min(), xx.max())\n", " ax.set_ylim(yy.min(), yy.max())\n", " # ax.set_xlabel(data.feature_names[0])\n", " # ax.set_ylabel(data.feature_names[1])\n", " if ticks:\n", " ax.set_xticks(())\n", " ax.set_yticks(())\n", " # ax.set_title(title)\n", " if show:\n", " plt.show()\n", " else:\n", " return ax" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def plot_4_classifiers(X, y, clfs):\n", " # Set-up 2x2 grid for plotting.\n", " fig, sub = plt.subplots(2, 2)\n", " plt.subplots_adjust(wspace=0.2, hspace=0.2)\n", " \n", " for clf, ax, title in zip(clfs, sub.flatten(), (\"(1)\", \"(2)\", \"(3)\", \"(4)\")):\n", " # clf.fit(X, y)\n", " plot_classifier(X, y, clf, ax, ticks=True)\n", " ax.set_title(title)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\svm\\_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.\n", " \"the number of iterations.\", ConvergenceWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.svm import LinearSVC, SVC\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Define the classifiers\n", "classifiers = [LogisticRegression(), LinearSVC(), SVC(), KNeighborsClassifier()]\n", "\n", "# Fit the classifiers\n", "for c in classifiers:\n", " c.fit(X, y)\n", " \n", "# Plot the classifiers\n", "plot_4_classifiers(X, y, classifiers)\n", "plt.savefig('../images/plot_4_classifiers.png')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }