{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python examples and notes for Machine Learning for Computational Linguistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.2, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U numpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the discussion of SpamAssassin in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of the material for my course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SpamAssassin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The linear classifier can be described as follows. A test returns $1$ (true) if it succeeds; otherwise it returns $0$. The result of the $i^{th}$ test in the vector of test results $x$ is referred to as $x_i$. The weight of the $i^{th}$ test is denoted as $w_i$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total score of test results for a specific e-mail can be expressed as the sum of the products of $n$ test results and corresponding weights, that is $\\sum_{i=1}^{n} w_i x_i$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we assume two tests $x_1$ and $x_2$ with the corresponding weights $w_1 = 4$ and $w_2 = 4$, for some e-mail $e_1$ the tests could result in one positive, $x_1 = 1$, and one negative, $x_2 = 0$. The computation of the equation above for these results can be coded in Python in the following way:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "x = (1, 0)  # test results x_1, x_2\n", "w = (4, 4)  # test weights w_1, w_2\n", "\n", "# sum of the products of test results and weights\n", "result = 0\n", "for e in range(len(x)):\n", "    result += x[e] * w[e]\n", "\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we specify a threshold $t$ that separates spam from ham, with $t = 5$ in our example, the decision for spam or ham could be coded in Python as:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ham 4\n" ] } ], "source": [ "t = 5\n", "\n", "if result >= t:\n", "    print(\"spam\", result)\n", "else:\n", "    print(\"ham\", result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code example above we define $x$ and $w$ as vectors of the same length. 
The computation of the result is even simpler if we make use of linear algebra and calculate the dot product of $w$ and $x$, here for an e-mail on which both tests succeed, that is $x_1 = 1$ and $x_2 = 1$:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy\n", "\n", "wn = [4, 4]\n", "xn = [1, 1]\n", "\n", "numpy.dot(wn, xn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use a trick to manipulate the data to be rooted in the origin of an extended coordinate system. We can add a new dimension by adding a new virtual test result $x_0 = 1$ and a corresponding weight $w_0 = -t$. This way the decision threshold $t$ is moved to $0$, that is, $\\sum_{i=1}^{n} w_i x_i \\geq t$ is equivalent to $\\sum_{i=0}^{n} w_i x_i \\geq 0$:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x0 = (1, 1, 1)\n", "w0 = (-t, 4, 4)\n", "\n", "numpy.dot(w0, x0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kind of transformation of the vector space is useful for other purposes as well. More on that later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating and Using an SVM Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example is inspired by and partially taken from the page [Linear SVC Machine learning SVM example with Python](https://pythonprogramming.net/linear-svc-example-scikit-learn-svm-python/)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start learning and classifying, we will need to import some Python modules in addition to the ones above:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot\n", "from matplotlib import style\n", "\n", "style.use(\"ggplot\")\n", "\n", "from sklearn import svm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use two features that represent the axes of a graph. The samples are tuples taken from the ordered arrays $x$ and $y$, that is, the $i^{th}$ sample is $X_i = (x_i, y_i)$, sample $X_1 = (1, 2)$, sample $X_2 = (5, 8)$, and so on." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x = [1, 5, 1.5, 8, 1, 9]\n", "y = [2, 8, 1.8, 8, 0.6, 11]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot the datapoints now:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAgoAAAFqCAYAAAB73XKSAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAG/NJREFUeJzt3X9sVfX9x/HXbe9taSmX3mtvi7ZCU9TKYk0V3NL10pLNDBHjUDYhc2bVUUXiFv6QOSVMcKnLyNQlTv8pxM4fTEiUpobIFDELt5B0oiYQUay1RgfUq+3laumv257vH4Z+rfJRCudHe/t8/GXP9Zzz5pMr9+k95976LMuyBAAAcAYZXg8AAAAmLkIBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGI07FI4cOaK//vWvuuuuu7RixQq98cYbo48NDw/r2Wef1b333qvbbrtNd911l/7xj3+op6fnnIaLxWLntB/OHWvuPtbcfay5+1hz99m15uMOhYGBAZWWlmrVqlVnfOyjjz7SL37xC23evFnr1q3T8ePHtXnz5nMarrW19Zz2w7ljzd3HmruPNXcfa+4+u9bcP94dKisrVVlZecbHcnNztX79+jHb7rjjDj3wwAP6/PPPdcEFF5zblAAAwBOO36PQ29srn8+n6dOnO30qAABgM0dDYWhoSNu2bVM0GtW0adOcPBUAAHCAY6EwPDysRx99VD6f74z3M5yNefPm2TwVvk9RUZHXI0w5rLn7WHP3sebus+s11GdZlnWuO69YsULr1q3TggULxmw/HQnxeFx/+tOflJeX953HicVi37rpYt68ebrxxhvPdTQAAKa8lpYWHTlyZMy26upqRaPRsz6G7aFwOhI+/fRTPfjgg98bCd+np6dHqVTqvI6BsxcMBpVMJr0eY0phzd3HmruPNXeX3+9XKBSy51jj3aG/v18nTpwY/bmrq0udnZ3Ky8tTKBTSI488os7OTv3xj39UKpVSIpGQJOXl5cnvH/fplEqlNDQ0NO79cG4sy2K9Xcaau481dx9rPnmN+5W7o6NDmzZtGv356aefliTV1tbql7/8pQ4ePChJWrdu3Zj9HnzwQf3gBz84n1kBAIDLzuvSgxvi8TgV6qJwOKzu7m6vx5hSWHP3sebuY83dFQgEFIlEbDkWv+sBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwMjv9QAAAEw0qVSmenqylEz6FAxaCoUG5fcPez2WJ3hHAQCAr0mlMrVvX66qqvJVU5Ovqqp87duXq1Qq0+vRPEEoAADwNT09Waqvz1Nfn0+S1NfnU319nnp6sjyezBuEAgAAX5NM+kYj4bS+Pp+SSZ9hj/RGKAAA8DXBoKWcHGvMtpwcS8GgZdgjvREKAAB8TSg0qMbGL0djITfXUmPjlwqFBj2ezBt86gEAgK/x+4e1cOEpHTiQ4lMPIhQAAPgWv39YkUifIhGvJ/Eelx4AAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABG/vHucOTIEbW0tKijo0OJRELr1q3TggULxvw727dv1969e9Xb26vy8nLV19dr1qxZtg0NAADcMe53FAYGBlRaWqpVq1ad8fHm5mbt3r1bd955px5++GFlZ2eroaFBqVTqvIcFAADuGncoVFZWasWKFbrmmmvO+PjLL7+s5cuXa/78+Zo9e7buuecedXd3q62t7byHBQAA7rL1HoVPP/1UiURCFRUVo9tyc3N16aWX6ujRo3aeCgAAuMDWUEgkEpKkmTNnjtk+c+bM0ccAAMDkwaceAACA0bg/9fBd8vPzJUknT54c/efTP5eWlhr3i8Viam1tHbOtqKhIdXV1CgaDsizLzjHxHQKBgMLhsNdjTCmsuftYc/e
x5u7y+XySpKamJnV1dY15rLq6WtFo9KyPZWsoFBYWKj8/X4cOHdKcOXMkSadOndL777+vxYsXG/eLRqPGoZPJpIaGhuwcE98hHA6ru7vb6zGmFNbcfay5+1hzdwUCAUUiEdXV1Z33scYdCv39/Tpx4sToz11dXers7FReXp4KCgp0/fXX68UXX9SsWbNUWFio559/XhdccIHxUxIAAGDiGncodHR0aNOmTaM/P/3005Kk2tparVmzRj//+c81MDCgxsZG9fb2at68eXrggQfk99v65gUAAHCBz5rgNwDE43EuPbiItwfdx5q7jzV3H2vurtOXHuzApx4AAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADDy233AkZER7dixQ7FYTIlEQqFQSIsWLdLy5cvtPhUAAHCY7aHQ3NysPXv26J577lFJSYk++OADPfnkk5o+fbquu+46u08HAAAcZHsoHD16VAsWLFBlZaUkqaCgQLFYTO3t7XafCgAAOMz2exTKy8t1+PBhHT9+XJLU2dmp9957T1dddZXdpwLggVQqU/F4jj74IFfxeI5SqUyvR8IEdvr5cvDgIM+XScr2dxSWLVumvr4+rV27VhkZGbIsSytXrlR1dbXdpwLgslQqU/v25aq+Pk99fT7l5FhqbPxSCxeekt8/7PV4mGB4vqQH20Nh//79isViWrt2rUpKStTZ2ammpiaFw2HV1NTYfToALurpyRr9S1+S+vp8qq/P04EDKUUifR5Ph4mG50t6sD0Unn32Wd10002qqqqSJF188cWKx+PauXOnMRRisZhaW1vHbCsqKlJdXZ2CwaAsy7J7TBgEAgGFw2Gvx5hSJtOaf/jh4Ohf+qf19fnU25up8vLJ8WeQJteaT2bp8nyZjHy+r9a9qalJXV1dYx6rrq5WNBo962PZHgqDg4PKyBh764PP5/vOF/toNGocOplMamhoyNYZYRYOh9Xd3e31GFPKZFrzvLwc5eRYY/7yz8mxNH368KT5M0iTa80ns3R5vkxGgUBAkUhEdXV1530s229mnD9/vl544QW9+eabisfjamtr065du/TDH/7Q7lMBcFkoNKjGxi+Vk/NV+OfmfnXNORQa9HgyTEQ8X9KDz7L5ff3+/n5t375dbW1tSiaTCoVCikajWr58uTIzx3+3azwe5x0FF/F/Wu6bbGueSmWqpydLyaRPwaClUGhw0t2YNtnWfDI7/Xzp7c3U9OnDk/L5MhmdfkfBDraHgt0IBXfxF6j7WHP3sebuY83dZWco8LseAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAICR34mDdnd367nnntPbb7+tgYEBXXjhhbr77rtVVlbmxOkAAIBDbA+F3t5ebdiwQRUVFVq/fr1mzJih48ePKy8vz+5TAQAAh9keCs3NzSooKNDq1atHt0UiEbtPAwAAXGB7KBw8eFCVlZV69NFHdeTIEYXDYf3sZz/TT3/6U7tPBQAAHGZ7KHR
1demVV17RDTfcoJtvvlnt7e166qmnFAgEVFNTY/fpAACAg2wPBcuyNHfuXK1cuVKSVFpaqo8//livvvoqoQAAwCRjeyiEQiEVFxeP2VZcXKy2tjbjPrFYTK2trWO2FRUVqa6uTsFgUJZl2T0mDAKBgMLhsNdjTCmsuftYc/ex5u7y+XySpKamJnV1dY15rLq6WtFo9KyPZXsolJeX69ixY2O2HTt2TAUFBcZ9otGocehkMqmhoSFbZ4RZOBxWd3e312NMKay5+1hz97Hm7goEAopEIqqrqzvvY9n+hUtLly7V+++/r507d+rEiROKxWLau3evrrvuOrtPBQAAHGb7Owpz587Vvffeq23btumFF15QYWGh6urqVF1dbfepAACAwxz5Zsarr75aV199tROHBgAALuJ3PQAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI8dDobm5WStWrNA///lPp08FAABs5mgotLe3a8+ePZozZ46TpwEAAA5xLBT6+/v1+OOPa/Xq1Zo+fbpTpwEAAA5yLBS2bNmi+fPn64orrnDqFAAAwGGOhEJra6s++ugj/epXv3Li8AAAwCW2h8Lnn3+upqYm/e53v5Pf77f78AAAwEU+y7IsOw/43//+V3/729+UkfH/DTIyMiJJysjI0LZt2+Tz+cbsE4vF1NraOmZbUVGR6urqNDAwIJtHxHcIBAIaGhryeowphTV3H2vuPtbcXT6fT9nZ2WpqalJXV9eYx6qrqxWNRs/+WHaHQn9/vz777LMx25544gkVFxdr2bJlKikpGdfx4vE4Ty4XhcNhdXd3ez3GlMKau481dx9r7q5AIKBIJGLLsWy/NjBt2rRvxcC0adM0Y8aMcUcCAADwFt/MCAAAjFy52/DBBx904zQAAMBmvKMAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAICR3+4D7ty5U21tbTp27JiysrJ02WWX6dZbb9VFF11k96kAAIDDbA+Fd999V0uWLFFZWZlGRka0bds2NTQ06LHHHlNWVpbdpwMAAA6y/dLD/fffr5qaGpWUlGj27Nlas2aNPvvsM3V0dNh9KgAA4DDb31H4plOnTkmS8vLynD6V51KpTPX0ZCmZ9CkYtBQKDcrvH/Z6LAAAzpmjNzNalqWmpiZdfvnlKikpcfJUnkulMrVvX66qqvJVU5Ovqqp87duXq1Qq0+vRAAA4Z46GwpYtW/TJJ59o7dq1Tp5mQujpyVJ9fZ76+nySpL4+n+rr89TTw30ZAIDJy7FLD1u3btVbb72lhx56SKFQ6Dv/3VgsptbW1jHbioqKVFdXp2AwKMuynBrTNh9+ODgaCaf19fnU25up8vKwR1ONXyAQUDg8eeZNB6y5+1hz97Hm7vL5vno9ampqUldX15jHqqurFY1Gz/5YlgOvwlu3btUbb7yhjRs3qqio6LyOFY/HNTQ0ZNNkzonHc1RVlT8mFnJyLB04kFAk0ufhZOMTDofV3d3
t9RhTCmvuPtbcfay5uwKBgCKRiC3Hsv3Sw5YtWxSLxfT73/9e2dnZSiQSSiQSGhwctPtUE0ooNKjGxi+Vk/NVd+XmWmps/FKhkH1/7lQqU/F4jj74IFfxeA73PwAAHGf7pYdXX31VkrRx48Yx29esWaPa2lq7Tzdh+P3DWrjwlA4cSDnyqYfTN0uevg8iJ+erEFm48BSfrAAAOMaRSw92miyXHpzm1qUN3h50H2vuPtbcfay5uyb0pQc4I5n0nfFmyWTSZ9gDAIDzRyhMEsGgNXr/w2k5OZaCwQn9hhAAYJIjFCYJN26WBADgmxz/CmfYw+mbJQEAOBNCYRLx+4cVifTJpvtTAAD4Xlx6AAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaEAgAAMCIUAACAEaEAAACMCAUAAGBEKAAAACNCAQAAGBEKAADAiFAAAABGhAIAADAiFAAAgBGhAAAAjAgFAABgRCgAAAAjQgEAABj5vR4gnaRSmerpyVIy6VMwaCkUGpTfP+z1WAAAnDPeUbBJKpWpfftyVVWVr5qafFVV5WvfvlylUplejwYAwDkjFGzS05Ol+vo89fX5JEl9fT7V1+eppyfL48kAADh3hIJNkknfaCSc1tfnUzLpM+wBAMDERyjYJBi0lJNjjdmWk2MpGLQMewAAMPERCjYJhQbV2PjlaCzk5lpqbPxSodCgx5MBAHDu+NSDTfz+YS1ceEoHDqT41AMAIG0QCjby+4cVifQpEvF6EgAA7MGlBwAAYEQoAAAAI0IBAAAYEQoAAMCIUAAAAEaOfeph9+7deumll5RIJFRaWqrbb79dl1xyiVOnAwAADnDkHYX9+/frmWee0S233KLNmzdrzpw5amhoUDKZdOJ0AADAIY6Ewq5du3TttdeqtrZWxcXFqq+vV3Z2tl5//XUnTgcAABxieyikUil1dHSooqJidJvP51NFRYWOHj1q9+kAAICDbA+FL774QiMjI5o5c+aY7TNnzlQikbD7dAAAwEET/iuc/f4JP2Ja8fl8CgQCXo8xpbDm7mPN3ceau8vO107bX4VnzJihjIwMnTx5csz2kydPKj8//4z7xGIxtba2jtk2b9483XjjjQqFQnaPiO8R4ZdVuI41dx9r7j7W3H0tLS06cuTImG3V1dWKRqNnfQzbQ8Hv96usrEyHDh3SggULJEmWZenw4cNasmTJGfeJRqNnHLqlpUU33nij3SPiOzQ1Namurs7rMaYU1tx9rLn7WHP3nX4NPd/XUUc+9bB06VK99tpr+s9//qP//e9/amxs1MDAgBYtWjSu43yzguC8rq4ur0eYclhz97Hm7mPN3WfXa6gjNwD8+Mc/1hdffKEdO3aMfuHS+vXrFQwGnTgdAABwiGN3Ci5evFiLFy926vAAAMAF/K4HAABglLlx48aNXg/xXWbPnu31CFMOa+4+1tx9rLn7WHP32bHmPsuyLBtmAQAAaYhLDwAAwIhQAAAARoQCAAAwIhQAAIDRhPyNS7t379ZLL700+mVNt99+uy655BKvx0pLO3fuVFtbm44dO6asrCxddtlluvXWW3XRRRd5PdqU0dzcrH/961+6/vrr9Zvf/MbrcdJWd3e3nnvuOb399tsaGBjQhRdeqLvvvltlZWVej5aWRkZGtGPHDsViMSUSCYVCIS1atEjLly/3erS0cuTIEbW0tKijo0OJRELr1q0b/fUJp23fvl179+5Vb2+vysvLVV9fr1mzZp31OSbcOwr79+/XM888o1tuuUWbN2/WnDlz1NDQoGQy6fVoaendd9/VkiVL1NDQoA0bNmh4eFgNDQ0aHBz0erQpob29XXv27NGcOXO8HiWt9fb2asOGDQoEAlq/fr0ee+wx3XbbbcrLy/N6tLTV3NysPXv2aNWqVfr73/+uX//612ppadHu3bu9Hi2tDAwMqLS0VKtWrTrj483Nzdq9e7fuvPNOPfzww8rOzlZ
DQ4NSqdRZn2PChcKuXbt07bXXqra2VsXFxaqvr1d2drZef/11r0dLS/fff79qampUUlKi2bNna82aNfrss8/U0dHh9Whpr7+/X48//rhWr16t6dOnez1OWmtublZBQYFWr16tsrIyRSIRXXnllSosLPR6tLR19OhRLViwQJWVlSooKNCPfvQjXXnllWpvb/d6tLRSWVmpFStW6Jprrjnj4y+//LKWL1+u+fPna/bs2brnnnvU3d2ttra2sz7HhAqFVCqljo4OVVRUjG7z+XyqqKjQ0aNHPZxs6jh16pQk8X9aLtiyZYvmz5+vK664wutR0t7Bgwc1d+5cPfroo6qvr9d9992n1157zeux0lp5ebkOHz6s48ePS5I6Ozv13nvv6aqrrvJ4sqnj008/VSKRGPOampubq0svvXRcr6kT6h6FL774QiMjI5o5c+aY7TNnztSxY8c8mmrqsCxLTU1Nuvzyy1VSUuL1OGmttbVVH330kf7yl794PcqU0NXVpVdeeUU33HCDbr75ZrW3t+upp55SIBBQTU2N1+OlpWXLlqmvr09r165VRkaGLMvSypUrVV1d7fVoU0YikZCkM76mnn7sbEyoUIC3tmzZok8++UR//vOfvR4lrX3++edqamrShg0b5Pfzn6AbLMvS3LlztXLlSklSaWmpPv74Y7366quEgkP279+vWCymtWvXqqSkRJ2dnWpqalI4HGbNJ5kJ9bfUjBkzlJGRoZMnT47ZfvLkSeXn53s01dSwdetWvfXWW3rooYcUCoW8HietdXR0KJlM6r777hvdNjIyonfeeUe7d+/Wtm3b5PP5PJww/YRCIRUXF4/ZVlxcPK7rtBifZ599VjfddJOqqqokSRdffLHi8bh27txJKLjk9OvmN19DT548qdLS0rM+zoQKBb/fr7KyMh06dGj04x2WZenw4cNasmSJx9Olr61bt+qNN97Qxo0bVVBQ4PU4aa+iokKPPPLImG1PPPGEiouLtWzZMiLBAeXl5d+6fHns2DGe7w4aHBxURsbY2+B8Pp/49ULuKSwsVH5+vg4dOjT6yapTp07p/fff1+LFi8/6OBMqFCRp6dKlevLJJ1VWVqZLLrlEu3bt0sDAgBYtWuT1aGlpy5Ytam1t1R/+8AdlZ2ePXrfKzc1VVlaWx9Olp2nTpn3rHpBp06ZpxowZ3BvikKVLl2rDhg3auXOnqqqq1N7err179+quu+7yerS0NX/+fL3wwgsKh8O6+OKL9eGHH2rXrl36yU9+4vVoaaW/v18nTpwY/bmrq0udnZ3Ky8tTQUGBrr/+er344ouaNWuWCgsL9fzzz+uCCy4wfkriTCbkb4/897//rZaWltEvXLrjjjs0d+5cr8dKSytWrDjj9jVr1qi2ttblaaauTZs2qbS0lC9cctCbb76pbdu26cSJEyosLNQNN9zAi5aD+vv7tX37drW1tSmZTCoUCikajWr58uXKzMz0ery08c4772jTpk3f2l5bW6s1a9ZIknbs2KHXXntNvb29mjdvnn7729+O6wuXJmQoAACAiWFCfY8CAACYWAgFAABgRCgAAAAjQgEAABgRCgAAwIhQAAAARoQCAAAwIhQAAIARoQAAAIwIBQAAYEQoAAAAI0IBAAAY/R9hHeXyP21ZXAAAAABJRU5ErkJggg==", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "matplotlib.pyplot.scatter(x,y)\n", "matplotlib.pyplot.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create an array of features, that is, we convert the coordinates in the $x$ and $y$ feature arrays above to 
an array of tuples that represent the datapoints or features of samples:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "X = numpy.array([[1, 2],\n", "                 [5, 8],\n", "                 [1.5, 1.8],\n", "                 [8, 8],\n", "                 [1, 0.6],\n", "                 [9, 11]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming two classes represented by $0$ and $1$, we can encode the assignment of the datapoints in $X$ to classes $0$ or $1$ by using a vector with the class labels in the order of the samples in $X$. The $i^{th}$ datapoint of $X$ is assigned to the $i^{th}$ class label in $y$." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Note: this replaces the coordinate list y defined earlier with the class labels.\n", "y = [0, 1, 0, 1, 0, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define a classifier as a linear Support Vector Classifier using the *svm* module of [Scikit-learn](http://scikit-learn.org/):" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "classifier = svm.SVC(kernel='linear')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We train the classifier using our features in *X* and the labels in *y*:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape=None, degree=3, gamma='auto', kernel='linear',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now create a new sample and ask the classifier to predict which class this sample belongs to. Note that in the following code we generate a *numpy array* from the features $[0.58, 0.76]$. 
This array needs to be reshaped to an array that contains one element, an array with the feature values of one sample." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sample: [[ 0.58 0.76]]\n", " Class: [0]\n" ] } ], "source": [ "sample = numpy.array([0.58, 0.76]).reshape(1, -1)\n", "\n", "print(\"Sample:\", sample)\n", "\n", "print(\" Class:\", classifier.predict(sample))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of using the *reshape()* method, we could have also defined the sample directly as an array that contains one sample feature array:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sample = numpy.array([[0.58, 0.76]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code will visualize the data and the identified hyperplane (in two dimensions, a line) that separates the two classes." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0.1380943 0.24462418]\n" ] }, { "data": { "image/png": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The separating line satisfies w[0] * x + w[1] * y + b = 0,\n", "# where b is classifier.intercept_[0].\n", "w = classifier.coef_[0]\n", "print(w)\n", "\n", "# Solving for y gives the slope a and the offset -b / w[1].\n", "a = -w[0] / w[1]\n", "\n", "xx = numpy.linspace(0, 12)\n", "yy = a * xx - classifier.intercept_[0] / w[1]\n", "\n", "h0 = matplotlib.pyplot.plot(xx, yy, 'k-', label=\"non weighted div\")\n", "\n", "# Color the datapoints by their class label.\n", "matplotlib.pyplot.scatter(X[:, 0], X[:, 1], c=y)\n", "matplotlib.pyplot.legend()\n", "matplotlib.pyplot.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)); portions taken from the referenced sources."
] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "latex_metadata": { "affiliation": "Indiana University, Department of Linguistics, Bloomington, IN, USA", "author": "Damir Cavar", "title": "Python examples and notes for Machine Learning for Computational Linguistics" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }