{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Scikit-Learn for Computational Linguists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of my course material for the course *Machine Learning for Computational Linguistics* in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This material is based on various other tutorials, including:\n", "\n", "- [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the problems that Machine Learning aims to solve is making predictions from previous experience. This can be achieved by extracting features from existing data collections. Scikit-Learn comes with some sample datasets. 
These include the [Iris flower data](https://en.wikipedia.org/wiki/Iris_flower_data_set) (classification), the [Optical Recognition of Handwritten Digits Data Set](http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) (classification), and the [Boston Housing Data Set](http://archive.ics.uci.edu/ml/datasets/Housing) (regression; removed from scikit-learn as of version 1.2). The datasets are part of Scikit-Learn and do not have to be downloaded. We can load these datasets by loading the *datasets* module from *sklearn* and then loading the individual datasets." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load a dataset, for example the diabetes dataset, using the following function:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": true }, "outputs": [], "source": [ "diabetes = datasets.load_diabetes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some datasets provide a description in the DESCR field:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iris Plants Database\n", "====================\n", "\n", "Notes\n", "-----\n", "Data Set Characteristics:\n", " :Number of Instances: 150 (50 in each of three classes)\n", " :Number of Attributes: 4 numeric, predictive attributes and the class\n", " :Attribute Information:\n", " - sepal length in cm\n", " - sepal width in cm\n", " - petal length in cm\n", " - petal width in cm\n", " - class:\n", " - Iris-Setosa\n", " - Iris-Versicolour\n", " - Iris-Virginica\n", " :Summary Statistics:\n", "\n", " ============== ==== ==== ======= ===== ====================\n", " Min Max Mean SD Class Correlation\n", " ============== ==== ==== ======= ===== ====================\n", " sepal length: 4.3 7.9 5.84 0.83 0.7826\n", " sepal width: 2.0 4.4 3.05 0.43 -0.4194\n", " petal length: 1.0 
6.9 3.76 1.76 0.9490 (high!)\n", " petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n", " ============== ==== ==== ======= ===== ====================\n", "\n", " :Missing Attribute Values: None\n", " :Class Distribution: 33.3% for each of 3 classes.\n", " :Creator: R.A. Fisher\n", " :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n", " :Date: July, 1988\n", "\n", "This is a copy of UCI ML iris datasets.\n", "http://archive.ics.uci.edu/ml/datasets/Iris\n", "\n", "The famous Iris database, first used by Sir R.A Fisher\n", "\n", "This is perhaps the best known database to be found in the\n", "pattern recognition literature. Fisher's paper is a classic in the field and\n", "is referenced frequently to this day. (See Duda & Hart, for example.) The\n", "data set contains 3 classes of 50 instances each, where each class refers to a\n", "type of iris plant. One class is linearly separable from the other 2; the\n", "latter are NOT linearly separable from each other.\n", "\n", "References\n", "----------\n", " - Fisher,R.A. \"The use of multiple measurements in taxonomic problems\"\n", " Annual Eugenics, 7, Part II, 179-188 (1936); also in \"Contributions to\n", " Mathematical Statistics\" (John Wiley, NY, 1950).\n", " - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n", " (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n", " - Dasarathy, B.V. (1980) \"Nosing Around the Neighborhood: A New System\n", " Structure and Classification Rule for Recognition in Partially Exposed\n", " Environments\". IEEE Transactions on Pattern Analysis and Machine\n", " Intelligence, Vol. PAMI-2, No. 1, 67-71.\n", " - Gates, G.W. (1972) \"The Reduced Nearest Neighbor Rule\". IEEE Transactions\n", " on Information Theory, May 1972, 431-433.\n", " - See also: 1988 MLC Proceedings, 54-64. 
Cheeseman et al\"s AUTOCLASS II\n", " conceptual clustering system finds 3 classes in the data.\n", " - Many, many more ...\n", "\n" ] } ], "source": [ "iris = datasets.load_iris()\n", "print(iris.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the content of the datasets by printing them out:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'DESCR': \"Optical Recognition of Handwritten Digits Data Set\\n===================================================\\n\\nNotes\\n-----\\nData Set Characteristics:\\n :Number of Instances: 5620\\n :Number of Attributes: 64\\n :Attribute Information: 8x8 image of integer pixels in the range 0..16.\\n :Missing Attribute Values: None\\n :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\\n :Date: July; 1998\\n\\nThis is a copy of the test set of the UCI ML hand-written digits datasets\\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\\n\\nThe data set contains images of hand-written digits: 10 classes where\\neach class refers to a digit.\\n\\nPreprocessing programs made available by NIST were used to extract\\nnormalized bitmaps of handwritten digits from a preprinted form. From a\\ntotal of 43 people, 30 contributed to the training set and different 13\\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\\n4x4 and the number of on pixels are counted in each block. This generates\\nan input matrix of 8x8 where each element is an integer in the range\\n0..16. This reduces dimensionality and gives invariance to small\\ndistortions.\\n\\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\\n1994.\\n\\nReferences\\n----------\\n - C. 
Kaynak (1995) Methods of Combining Multiple Classifiers and Their\\n Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\\n Graduate Studies in Science and Engineering, Bogazici University.\\n - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\\n - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\\n Linear dimensionalityreduction using relevance weighted LDA. School of\\n Electrical and Electronic Engineering Nanyang Technological University.\\n 2005.\\n - Claudio Gentile. A New Approximate Maximal Margin Classification\\n Algorithm. NIPS. 2000.\\n\", 'images': array([[[ 0., 0., 5., ..., 1., 0., 0.],\n", " [ 0., 0., 13., ..., 15., 5., 0.],\n", " [ 0., 3., 15., ..., 11., 8., 0.],\n", " ..., \n", " [ 0., 4., 11., ..., 12., 7., 0.],\n", " [ 0., 2., 14., ..., 12., 0., 0.],\n", " [ 0., 0., 6., ..., 0., 0., 0.]],\n", "\n", " [[ 0., 0., 0., ..., 5., 0., 0.],\n", " [ 0., 0., 0., ..., 9., 0., 0.],\n", " [ 0., 0., 3., ..., 6., 0., 0.],\n", " ..., \n", " [ 0., 0., 1., ..., 6., 0., 0.],\n", " [ 0., 0., 1., ..., 6., 0., 0.],\n", " [ 0., 0., 0., ..., 10., 0., 0.]],\n", "\n", " [[ 0., 0., 0., ..., 12., 0., 0.],\n", " [ 0., 0., 3., ..., 14., 0., 0.],\n", " [ 0., 0., 8., ..., 16., 0., 0.],\n", " ..., \n", " [ 0., 9., 16., ..., 0., 0., 0.],\n", " [ 0., 3., 13., ..., 11., 5., 0.],\n", " [ 0., 0., 0., ..., 16., 9., 0.]],\n", "\n", " ..., \n", " [[ 0., 0., 1., ..., 1., 0., 0.],\n", " [ 0., 0., 13., ..., 2., 1., 0.],\n", " [ 0., 0., 16., ..., 16., 5., 0.],\n", " ..., \n", " [ 0., 0., 16., ..., 15., 0., 0.],\n", " [ 0., 0., 15., ..., 16., 0., 0.],\n", " [ 0., 0., 2., ..., 6., 0., 0.]],\n", "\n", " [[ 0., 0., 2., ..., 0., 0., 0.],\n", " [ 0., 0., 14., ..., 15., 1., 0.],\n", " [ 0., 4., 16., ..., 16., 7., 0.],\n", " ..., \n", " [ 0., 0., 0., ..., 16., 2., 0.],\n", " [ 0., 0., 4., ..., 16., 2., 0.],\n", " [ 0., 0., 5., ..., 12., 0., 0.]],\n", "\n", " [[ 0., 0., 10., ..., 1., 0., 0.],\n", " [ 0., 2., 16., ..., 1., 0., 0.],\n", " [ 0., 0., 
15., ..., 15., 0., 0.],\n", " ..., \n", " [ 0., 4., 16., ..., 16., 6., 0.],\n", " [ 0., 8., 16., ..., 16., 8., 0.],\n", " [ 0., 1., 8., ..., 12., 1., 0.]]]), 'data': array([[ 0., 0., 5., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 10., 0., 0.],\n", " [ 0., 0., 0., ..., 16., 9., 0.],\n", " ..., \n", " [ 0., 0., 1., ..., 6., 0., 0.],\n", " [ 0., 0., 2., ..., 12., 0., 0.],\n", " [ 0., 0., 10., ..., 12., 1., 0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}\n" ] } ], "source": [ "digits = datasets.load_digits()\n", "print(digits)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data of the digits dataset is stored in the *data* member. This data represents the features of the digit image." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0. 0. 5. ..., 0. 0. 0.]\n", " [ 0. 0. 0. ..., 10. 0. 0.]\n", " [ 0. 0. 0. ..., 16. 9. 0.]\n", " ..., \n", " [ 0. 0. 1. ..., 6. 0. 0.]\n", " [ 0. 0. 2. ..., 12. 0. 0.]\n", " [ 0. 0. 10. ..., 12. 1. 0.]]\n" ] } ], "source": [ "print(digits.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *target* member contains the real target labels or values of the feature sets, that is the numbers that the feature sets represent." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 ..., 8 9 8]\n", "Optical Recognition of Handwritten Digits Data Set\n", "===================================================\n", "\n", "Notes\n", "-----\n", "Data Set Characteristics:\n", " :Number of Instances: 5620\n", " :Number of Attributes: 64\n", " :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n", " :Missing Attribute Values: None\n", " :Creator: E. 
Alpaydin (alpaydin '@' boun.edu.tr)\n", " :Date: July; 1998\n", "\n", "This is a copy of the test set of the UCI ML hand-written digits datasets\n", "http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n", "\n", "The data set contains images of hand-written digits: 10 classes where\n", "each class refers to a digit.\n", "\n", "Preprocessing programs made available by NIST were used to extract\n", "normalized bitmaps of handwritten digits from a preprinted form. From a\n", "total of 43 people, 30 contributed to the training set and different 13\n", "to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n", "4x4 and the number of on pixels are counted in each block. This generates\n", "an input matrix of 8x8 where each element is an integer in the range\n", "0..16. This reduces dimensionality and gives invariance to small\n", "distortions.\n", "\n", "For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\n", "T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\n", "L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n", "1994.\n", "\n", "References\n", "----------\n", " - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n", " Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n", " Graduate Studies in Science and Engineering, Bogazici University.\n", " - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n", " - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n", " Linear dimensionalityreduction using relevance weighted LDA. School of\n", " Electrical and Electronic Engineering Nanyang Technological University.\n", " 2005.\n", " - Claudio Gentile. A New Approximate Maximal Margin Classification\n", " Algorithm. NIPS. 
2000.\n", "\n" ] } ], "source": [ "print(digits.target)\n", "print(digits.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of the *digits* dataset, the 2D shapes of the images are mapped onto an 8x8 matrix. You can print them out using the *images* member:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 \n", " [[ 0. 0. 5. 13. 9. 1. 0. 0.]\n", " [ 0. 0. 13. 15. 10. 15. 5. 0.]\n", " [ 0. 3. 15. 2. 0. 11. 8. 0.]\n", " [ 0. 4. 12. 0. 0. 8. 8. 0.]\n", " [ 0. 5. 8. 0. 0. 9. 8. 0.]\n", " [ 0. 4. 11. 0. 1. 12. 7. 0.]\n", " [ 0. 2. 14. 5. 10. 12. 0. 0.]\n", " [ 0. 0. 6. 13. 10. 0. 0. 0.]]\n", "\n", "1 \n", " [[ 0. 0. 0. 12. 13. 5. 0. 0.]\n", " [ 0. 0. 0. 11. 16. 9. 0. 0.]\n", " [ 0. 0. 3. 15. 16. 6. 0. 0.]\n", " [ 0. 7. 15. 16. 16. 2. 0. 0.]\n", " [ 0. 0. 1. 16. 16. 3. 0. 0.]\n", " [ 0. 0. 1. 16. 16. 6. 0. 0.]\n", " [ 0. 0. 1. 16. 16. 6. 0. 0.]\n", " [ 0. 0. 0. 11. 16. 10. 0. 0.]]\n" ] } ], "source": [ "print(0, '\\n', digits.images[0])\n", "print()\n", "print(1, '\\n', digits.images[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *digits* dataset is a set of images of digits that can be used to train a classifier and test the classification on unseen images. To use a Support Vector Classifier we import the *svm* module:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import svm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a classifier instance with manually set parameters. Good parameter values can also be found automatically, for example with grid search and cross-validation." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": true }, "outputs": [], "source": [ "classifier = svm.SVC(gamma=0.001, C=100.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *classifier* instance has to be trained on the data. 
The *fit* method of the instance requires two parameters, the features and the array with the corresponding classes or labels. The features are stored in the *data* member. The labels are stored in the *target* member. We use all but the last data and target element for training or fitting." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.fit(digits.data[:-1], digits.target[:-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the *predict* method to request a guess about the last element in the *data* member:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction: [8]\n", "Image:\n", " [[ 0. 0. 10. 14. 8. 1. 0. 0.]\n", " [ 0. 2. 16. 14. 6. 1. 0. 0.]\n", " [ 0. 0. 15. 15. 8. 15. 0. 0.]\n", " [ 0. 0. 5. 16. 16. 10. 0. 0.]\n", " [ 0. 0. 12. 15. 15. 12. 0. 0.]\n", " [ 0. 4. 16. 6. 4. 16. 6. 0.]\n", " [ 0. 8. 16. 10. 8. 16. 8. 0.]\n", " [ 0. 1. 8. 12. 14. 12. 1. 
0.]]\n", "Label: 8\n" ] } ], "source": [ "print(\"Prediction:\", classifier.predict(digits.data[-1:]))\n", "print(\"Image:\\n\", digits.images[-1])\n", "print(\"Label:\", digits.target[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can train a new model from the Iris data using the *fit* method:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,\n", " decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',\n", " max_iter=-1, probability=False, random_state=None, shrinking=True,\n", " tol=0.001, verbose=False)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.fit(iris.data, iris.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To store the model in a file, we can use the *pickle* module:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pickle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can serialize the *classifier* to a variable that we can process or save to disk:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s = pickle.dumps(classifier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will save the model to a file *irisModel.dat*." 
] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "# Write the pickled bytes to disk; the with-block closes the file\n", "with open(\"irisModel.dat\", mode='wb') as ofp:\n", "    ofp.write(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model can be read back into memory using the following code:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the pickled bytes back and restore the classifier\n", "with open(\"irisModel.dat\", mode='rb') as ifp:\n", "    model = ifp.read()\n", "classifier2 = pickle.loads(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use this *unpickled* *classifier2* in the same way as shown above:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prediction: [0]\n", "Target: 0\n" ] } ], "source": [ "print(\"Prediction:\", classifier2.predict(iris.data[0:1]))\n", "print(\"Target:\", iris.target[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nearest Neighbor Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the *numpy* module for arrays and operations on those." 
] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print out the array of classes (or targets) from the *iris* dataset, followed by its unique values, using the following code:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n", "[0 1 2]\n" ] } ], "source": [ "print(iris.target)\n", "print(numpy.unique(iris.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can split the *iris* dataset into a training and a testing dataset using random permutations." 
] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[114 62 33 107 7 100 40 86 76 71 134 51 73 54 63 37 78 90\n", " 45 16 121 66 24 8 126 22 44 97 93 26 137 84 27 127 132 59\n", " 18 83 61 92 112 2 141 43 10 60 116 144 119 108 69 135 56 80\n", " 123 133 106 146 50 147 85 30 101 94 64 89 91 125 48 13 111 95\n", " 20 15 52 3 149 98 6 68 109 96 12 102 120 104 128 46 11 110\n", " 124 41 148 1 113 139 42 4 129 17 38 5 53 143 105 0 34 28\n", " 55 75 35 23 74 31 118 57 131 65 32 138 14 122 19 29 130 49\n", " 136 99 82 79 115 145 72 77 25 81 140 142 39 58 88 70 87 36\n", " 21 9 103 67 117 47]\n", "[ 92 141 130 119 48 143 122 63 26 64 42 108 91 77 22 148 6 65\n", " 47 68 60 15 124 58 142 12 59 105 89 78 52 131 113 98 30 136\n", " 66 133 49 62 74 17 106 8 135 80 107 90 0 36 112 5 57 102\n", " 55 34 128 33 21 73 7 45 129 103 146 120 94 50 134 99 126 114\n", " 9 39 97 101 29 81 20 46 51 53 23 27 2 28 37 111 10 84\n", " 137 127 43 87 69 144 140 35 76 3 82 145 116 88 44 147 1 93\n", " 38 11 115 54 40 18 41 79 24 56 71 13 31 85 70 132 125 123\n", " 100 32 104 83 117 118 138 25 110 16 75 109 121 86 139 4 96 14\n", " 61 67 149 95 19 72]\n" ] } ], "source": [ "numpy.random.seed(0)\n", "indices = numpy.random.permutation(len(iris.data))\n", "print(indices)\n", "indices = numpy.random.permutation(len(iris.data))\n", "print(indices)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 : H\n", "1 : e\n", "2 : l\n", "3 : l\n", "4 : o\n" ] } ], "source": [ "text = \"Hello\"\n", "for i in range(len(text)):\n", " print(i, ':', text[i])" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": true }, "outputs": [], "source": [ "irisTrain_data = iris.data[indices[:-10]]\n", "irisTrain_target = iris.target[indices[:-10]]\n", "irisTest_data = iris.data[indices[-10:]]\n", "irisTest_target = 
iris.target[indices[-10:]]" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", " weights='uniform')" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier()\n", "knn.fit(irisTrain_data, irisTrain_target)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 2])" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn.predict(irisTest_data)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "irisTest_target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-means Clustering" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import cluster" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": true }, "outputs": [], "source": [ "k_means = cluster.KMeans(n_clusters=3)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", " n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',\n", " random_state=None, tol=0.0001, verbose=0)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "k_means.fit(iris.data)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]\n" ] } ], "source": [ "# Cluster labels of every 10th sample; cluster IDs are arbitrary\n", "# and need not be equal to the class IDs in iris.target\n", "print(k_means.labels_[::10])" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]\n" ] } ], "source": [ "print(iris.target[::10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Kernels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear kernel:\n", "\n", "![title](http://scikit-learn.org/stable/_images/sphx_glr_plot_svm_kernels_001.png)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8]\n", "[8]\n" ] } ], "source": [ "# gamma is ignored by the linear kernel\n", "svc = svm.SVC(kernel='linear', gamma=0.001, C=100.)\n", "svc.fit(digits.data[:-1], digits.target[:-1])\n", "print(svc.predict(digits.data[-1:]))\n", "print(digits.target[-1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polynomial kernel:\n", "\n", "![title](http://scikit-learn.org/stable/_images/sphx_glr_plot_svm_kernels_002.png)\n", "\n", "The *degree* parameter sets the degree of the polynomial." 
] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8]\n", "[8]\n" ] } ], "source": [ "svc = svm.SVC(kernel='poly', degree=3, gamma=0.001, C=100.)\n", "svc.fit(digits.data[:-1], digits.target[:-1])\n", "print(svc.predict(digits.data[-1:]))\n", "print(digits.target[-1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RBF kernel (Radial Basis Function):\n", "\n", "![title](http://scikit-learn.org/stable/_images/sphx_glr_plot_svm_kernels_003.png)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8]\n", "[8]\n" ] } ], "source": [ "svc = svm.SVC(kernel='rbf', gamma=0.001, C=100.)\n", "svc.fit(digits.data[:-1], digits.target[:-1])\n", "print(svc.predict(digits.data[-1:]))\n", "print(digits.target[-1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "logistic = linear_model.LogisticRegression(C=1e5)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=100000.0, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logistic.fit(irisTrain_data, irisTrain_target)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logistic.predict(irisTest_data)" ] }, { "cell_type": "code", 
"execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "irisTest_target" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_split=1e-07, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=10, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import ensemble\n", "rfc = ensemble.RandomForestClassifier()\n", "rfc.fit(irisTrain_data, irisTrain_target)" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rfc.predict(irisTest_data)" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "irisTest_target" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13\n", "0\n" ] } ], "source": [ "text_s1 = \"\"\"\n", "User (computing)\n", "A user is a person who uses a computer or network service. 
Users generally use a system or a software product[1] without the technical expertise required to fully understand it.[1] Power users use advanced features of programs, though they are not necessarily capable of computer programming and system administration.[2][3]\n", "\n", "A user often has a user account and is identified to the system by a username (or user name). Other terms for username include login name, screenname (or screen name), nickname (or nick) and handle, which is derived from the identical Citizen's Band radio term.\n", "\n", "Some software products provide services to other systems and have no direct end users.\n", "End user\n", "See also: End user\n", "\n", "End users are the ultimate human users (also referred to as operators) of a software product. The term is used to abstract and distinguish those who only use the software from the developers of the system, who enhance the software for end users.[4] In user-centered design, it also distinguishes the software operator from the client who pays for its development and other stakeholders who may not directly use the software, but help establish its requirements.[5][6] This abstraction is primarily useful in designing the user interface, and refers to a relevant subset of characteristics that most expected users would have in common.\n", "\n", "In user-centered design, personas are created to represent the types of users. It is sometimes specified for each persona which types of user interfaces it is comfortable with (due to previous experience or the interface's inherent simplicity), and what technical expertise and degree of knowledge it has in specific fields or disciplines. 
When few constraints are imposed on the end-user category, especially when designing programs for use by the general public, it is common practice to expect minimal technical expertise or previous training in end users.[7] In this context, graphical user interfaces (GUIs) are usually preferred to command-line interfaces (CLIs) for the sake of usability.[8]\n", "\n", "The end-user development discipline blurs the typical distinction between users and developers. It designates activities or techniques in which people who are not professional developers create automated behavior and complex data objects without significant knowledge of a programming language.\n", "\n", "Systems whose actor is another system or a software agent have no direct end users.\n", "User account\n", "\n", "A user's account allows a user to authenticate to a system and potentially to receive authorization to access resources provided by or connected to that system; however, authentication does not imply authorization. To log in to an account, a user is typically required to authenticate oneself with a password or other credentials for the purposes of accounting, security, logging, and resource management.\n", "\n", "Once the user has logged on, the operating system will often use an identifier such as an integer to refer to them, rather than their username, through a process known as identity correlation. 
In Unix systems, the username is correlated with a user identifier or user id.\n", "\n", "Computer systems operate in one of two types based on what kind of users they have:\n", "\n", " Single-user systems do not have a concept of several user accounts.\n", " Multi-user systems have such a concept, and require users to identify themselves before using the system.\n", "\n", "Each user account on a multi-user system typically has a home directory, in which to store files pertaining exclusively to that user's activities, which is protected from access by other users (though a system administrator may have access). User accounts often contain a public user profile, which contains basic information provided by the account's owner. The files stored in the home directory (and all other directories in the system) have file system permissions which are inspected by the operating system to determine which users are granted access to read or execute a file, or to store a new file in that directory.\n", "\n", "While systems expect most user accounts to be used by only a single person, many systems have a special account intended to allow anyone to use the system, such as the username \"anonymous\" for anonymous FTP and the username \"guest\" for a guest account.\n", "Usernames\n", "\n", "Various computer operating-systems and applications expect/enforce different rules for the formats of user names.\n", "\n", "In Microsoft Windows environments, for example, note the potential use of:[9]\n", "\n", " User Principal Name (UPN) format - for example: UserName@orgName.com\n", " Down-Level Logon Name format - for example: DOMAIN\\accountName\n", "\n", "Some online communities use usernames as nicknames for the account holders. 
In some cases, a user may be better known by their username than by their real name, such as CmdrTaco (Rob Malda), founder of the website Slashdot.\n", "Terminology\n", "\n", "Some usability professionals have expressed their dislike of the term \"user\", proposing it to be changed.[10] Don Norman stated that \"One of the horrible words we use is 'users'. I am on a crusade to get rid of the word 'users'. I would prefer to call them 'people'.\"[11]\n", "See also\n", "\n", " Information technology portal iconSoftware portal \n", "\n", " 1% rule (Internet culture)\n", " Anonymous post\n", " Pseudonym\n", " End-user computing, systems in which non-programmers can create working applications.\n", " End-user database, a collection of data developed by individual end-users.\n", " End-user development, a technique that allows people who are not professional developers to perform programming tasks, i.e. to create or modify software.\n", " End-User License Agreement (EULA), a contract between a supplier of software and its purchaser, granting the right to use it.\n", " User error\n", " User agent\n", " User experience\n", " User space\n", "\"\"\"\n", "\n", "text_s2 = \"\"\"\n", "Personal account\n", "\n", "A personal account is an account for use by an individual for that person's own needs. It is a relative term to differentiate them from those accounts for corporate or business use. The term \"personal account\" may be used generically for financial accounts at banks and for service accounts such as accounts with the phone company, or even for e-mail accounts.\n", "\n", "Banking\n", "\n", "In banking \"personal account\" refers to one's account at the bank that is used for non-business purposes. 
Most likely, the service at the bank consists of one of two kinds of accounts or sometimes both: a savings account and a current account.\n", "\n", "Banks differentiate their services for personal accounts from business accounts by setting lower minimum balance requirements, lower fees, free checks, free ATM usage, free debit card (Check card) usage, etc. The term does not apply to any one service or limit the banks from providing the same services to non-individuals. Personal account can be classified into three categories: 1. Persons of Nature, 2. Persons of Artificial Relationship, 3. Persons of Representation.\n", "\n", "At the turn of the 21st century, many banks started offering free checking, a checking account with no minimum balance, a free check book, and no hidden fees. This encouraged Americans who would otherwise live from check to check to open their \"personal\" account at financial institutions. For businesses that issue corporate checks to employees, this enables reduction in the amount of paperwork.\n", "\n", "Finance\n", "\n", "In the financial industry, 'personal account' (usually \"PA\") refers to trading or investing for yourself, rather than the company one is working for. There are often restrictions on what may be done with a PA, to avoid conflict of interest.\n", "\"\"\"\n", "\n", "test_text = \"\"\"\n", "A user account is a location on a network server used to store a computer username, password, and other information. A user account allows or does not allow a user to connect to a network, another computer, or other share. 
Any network that has multiple users requires user accounts.\n", "\"\"\"\n", "\n", "# word_tokenize and sent_tokenize need NLTK's Punkt tokenizer data to be installed.\n", "from nltk import word_tokenize, sent_tokenize\n", "\n", "# Split the first text into sentences and tokenize each sentence (kept for inspection).\n", "sentences_s1 = sent_tokenize(text_s1)\n", "#print(sentences_s1)\n", "\n", "toksentences_s1 = [ word_tokenize(sentence) for sentence in sentences_s1 ]\n", "#print(toksentences_s1)\n", "\n", "# Vocabulary (set of token types) of each text.\n", "tokens_s1 = set(word_tokenize(text_s1))\n", "tokens_s2 = set(word_tokenize(text_s2))\n", "\n", "#print(set.intersection(tokens_s1, tokens_s2))\n", "\n", "# Token types that occur in only one of the two texts.\n", "unique_s1 = tokens_s1 - tokens_s2\n", "unique_s2 = tokens_s2 - tokens_s1\n", "#print(unique_s1)\n", "#print(unique_s2)\n", "\n", "# Count the token types of the test text that are unique to each text;\n", "# the larger count suggests which text the test text is closer to.\n", "testTokens = set(word_tokenize(test_text))\n", "print(len(set.intersection(testTokens, unique_s1)))\n", "print(len(set.intersection(testTokens, unique_s2)))\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false }, "vscode": { "interpreter": { "hash": "1e28a5307a9b5c2fbeb0b263581f1cf3bfba9739188743f6a231f74c7de58892" } } }, "nbformat": 4, "nbformat_minor": 2 }