{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Extraction and Preprocessing" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer\n", "from sklearn.metrics.pairwise import euclidean_distances\n", "from sklearn import preprocessing\n", "\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk.stem import PorterStemmer\n", "from nltk import word_tokenize\n", "from nltk import pos_tag\n", "\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** DictVectorizer **" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0. 1. 0.]\n", " [ 0. 0. 1.]\n", " [ 1. 0. 0.]]\n" ] } ], "source": [ "onehot_encoder = DictVectorizer()\n", "instances = [\n", " {'city': 'New York'},\n", " {'city': 'San Francisco'},\n", " {'city': 'Chapel Hill'} ]\n", "\n", "print (onehot_encoder.fit_transform(instances).toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** CountVectorizer **" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1 1 0 1 0 1 0 1]\n", " [1 1 1 0 1 0 1 0]]\n", "{'played': 5, 'game': 2, 'basketball': 0, 'in': 3, 'duke': 1, 'lost': 4, 'the': 6, 'unc': 7}\n" ] } ], "source": [ "corpus = [\n", " 'UNC played Duke in basketball',\n", " 'Duke lost the basketball game'\n", "]\n", "\n", "\n", "vectorizer = CountVectorizer()\n", "print (vectorizer.fit_transform(corpus).todense())\n", "print (vectorizer.vocabulary_)\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 0 1 0 0 1 0 0 0 1]\n", " [0 1 1 1 0 0 1 0 0 1 0 0]\n", " [1 0 0 0 0 1 0 0 1 0 1 0]]\n", "{'played': 7, 'is': 5, 'game': 3, 'basketball': 1, 'singh': 8, 'in': 4, 'atul': 0, 'duke': 2, 'lost': 6, 'the': 9, 'unc': 11, 'this': 10}\n" ] } ], "source": [ "# adding one more sentence in corpus\n", "\n", "corpus = [\n", " 'UNC played Duke in basketball',\n", " 'Duke lost the basketball game',\n", " 'This is Atul Singh'\n", "]\n", "\n", "vectorizer = CountVectorizer()\n", "print (vectorizer.fit_transform(corpus).todense())\n", "print (vectorizer.vocabulary_)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 & 2 [[ 2.44948974]]\n", "2 & 3 [[ 3.]]\n", "1 & 3 [[ 3.]]\n" ] } ], "source": [ "# checking the euclidean distance \n", "\n", "# converting sentence into CountVectorizer\n", "counts = vectorizer.fit_transform(corpus).todense()\n", "\n", "print(\"1 & 2\", euclidean_distances(counts[0], counts[1]))\n", "print(\"2 & 3\", euclidean_distances(counts[1], counts[2]))\n", "print(\"1 & 3\", euclidean_distances(counts[0], counts[2]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Stop Word Filtering **" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 0 0 1 0 1]\n", " [0 1 1 1 1 0 0 0]\n", " [1 0 0 0 0 0 1 0]]\n", "{'played': 5, 'game': 3, 'basketball': 1, 'singh': 6, 'atul': 0, 'duke': 2, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Stop Word Filtering**" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 0 0 1 0 1]\n", " [0 1 1 1 1 0 0 0]\n", " [1 0 0 0 0 0 1 0]]\n", "{'played': 5, 'game': 3, 'basketball': 1, 'singh': 6, 'atul': 0, 'duke': 2, 'lost': 4, 'unc': 7}\n", "1 & 2 [[ 2.]]\n", "2 & 3 [[ 2.44948974]]\n", "1 & 3 [[ 2.44948974]]\n" ] } ], "source": [ "vectorizer = CountVectorizer(stop_words='english')  # stop_words='english' drops common English grammar (stop) words from the corpus\n", "counts = vectorizer.fit_transform(corpus).todense()  # recompute the count vectors so the distances below reflect the filtered features\n", "print (counts)\n", "print (vectorizer.vocabulary_)\n", "\n", "print(\"1 & 2\", euclidean_distances(counts[0], counts[1]))\n", "print(\"2 & 3\", euclidean_distances(counts[1], counts[2]))\n", "print(\"1 & 3\", euclidean_distances(counts[0], counts[2]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Stemming and Lemmatization**\n", "\n", "**Lemmatization** is the process of determining the lemma, or the morphological root, of an inflected word based on its context. Lemmas are the base forms of words that are used to key the word in a dictionary.\n", "\n", "**Stemming** has a similar goal to lemmatization, but it does not attempt to produce the morphological roots of words. Instead, stemming removes all patterns of characters that appear to be affixes, resulting in a token that is not necessarily a valid word.\n", "\n", "Lemmatization frequently requires a lexical resource, like WordNet, and the word's part of speech. Stemming \n", "algorithms frequently use rules instead of lexical resources to produce stems and can \n", "operate on any token, even without its context." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1 0 0 1]\n", " [0 1 1 0]]\n", "{'sandwich': 2, 'ate': 0, 'sandwiches': 3, 'eaten': 1}\n" ] } ], "source": [ "corpus = [\n", " 'He ate the sandwiches',\n", " 'Every sandwich was eaten by him'\n", "]\n", "\n", "vectorizer = CountVectorizer(stop_words='english')  # stop_words='english' drops common English grammar (stop) words from the corpus\n", "print (vectorizer.fit_transform(corpus).todense())\n", "print (vectorizer.vocabulary_)" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "As we can see, both sentences have the same meaning, but their feature vectors have no elements in common. Let's apply lemmatization and stemming to the data." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gather\n", "gathering\n" ] } ], "source": [ "lemmatizer = WordNetLemmatizer()\n", "print (lemmatizer.lemmatize('gathering', 'v'))\n", "print (lemmatizer.lemmatize('gathering', 'n'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The Porter stemmer cannot take the inflected form's part of speech into account and returns 'gather' in both cases:" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gather\n" ] } ], "source": [ "stemmer = PorterStemmer()\n", "print (stemmer.stem('gathering'))" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stemmed: [['He', 'ate', 'the', 'sandwich'], ['Everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]\n" ] } ], "source": [ "wordnet_tags = ['n', 'v']\n", "corpus = [\n", "'He ate the sandwiches',\n", "'Every sandwich was eaten by him'\n", "]\n", "stemmer = PorterStemmer()\n", "print ('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]\n" ] } ], "source": [ "def lemmatize(token, tag):\n", "    if tag[0].lower() in ['n', 'v']:\n", "        return lemmatizer.lemmatize(token, tag[0].lower())\n", "    return token\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "\n", "tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]\n", "print ('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])" ] },
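{ "cell_type": "markdown", "metadata": {}, "source": [ "To tie this back to the bag-of-words representation, here is a small added sketch (not part of the original run, so no output is recorded): join the lemmatized tokens back into strings and vectorize them again. Unlike the raw corpus above, the two documents should now share features such as 'eat' and 'sandwich'. The name `lemmatized_corpus` is introduced here only for illustration." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: rebuild each document from its lemmatized tokens and vectorize again.\n", "# The two sentences should now share features such as 'eat' and 'sandwich'.\n", "# lemmatized_corpus is an illustrative name, not part of the original notebook.\n", "lemmatized_corpus = [' '.join(lemmatize(token, tag) for token, tag in document)\n", "                     for document in tagged_corpus]\n", "vectorizer = CountVectorizer(stop_words='english')\n", "print(vectorizer.fit_transform(lemmatized_corpus).todense())\n", "print(vectorizer.vocabulary_)" ] },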
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Extending bag-of-words with TF-IDF weights\n", "\n", "It is intuitive that the frequency with which a word appears in a document could indicate the extent to which a document pertains to that word. A long document that contains one occurrence of a word may discuss an entirely different topic than a document that contains many occurrences of the same word. In this section, we will create feature vectors that encode the frequencies of words, and discuss strategies to mitigate two problems caused by encoding term frequencies.\n", "Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the word appears in the document." ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2 1 3 1 1]]\n", "{'transfigured': 3, 'wizard': 4, 'dog': 1, 'ate': 0, 'sandwich': 2}\n" ] } ], "source": [ "corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']\n", "vectorizer = CountVectorizer(stop_words='english')\n", "print (vectorizer.fit_transform(corpus).todense())\n", "print(vectorizer.vocabulary_)" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0.75458397 0.37729199 0.53689271 0. 0. ]\n", " [ 0. 0. 0.44943642 0.6316672 0.6316672 ]]\n", "{'transfigured': 3, 'wizard': 4, 'dog': 1, 'ate': 0, 'sandwich': 2}\n" ] } ], "source": [ "corpus = ['The dog ate a sandwich and I ate a sandwich',\n", " 'The wizard transfigured a sandwich']\n", "vectorizer = TfidfVectorizer(stop_words='english')\n", "print (vectorizer.fit_transform(corpus).todense())\n", "print(vectorizer.vocabulary_)" ] },
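{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick added check (a sketch, not part of the original run, so no output is recorded): a fitted `TfidfVectorizer` exposes the learned inverse document frequencies through its `idf_` attribute. 'sandwich', which occurs in both documents, receives a lower IDF than the terms that occur in only one, which is why its tf-idf weight above is damped relative to its raw count." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: print each term with its learned IDF weight. Terms that appear in\n", "# both documents (e.g. 'sandwich') get a lower IDF than terms that appear\n", "# in only one, so their tf-idf weights are damped.\n", "for term, index in sorted(vectorizer.vocabulary_.items()):\n", "    print(term, vectorizer.idf_[index])" ] },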
{ "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.37796447 0. 0.75592895 0.37796447 0. -0.37796447]\n", " [-0.5 0. 0.5 0.5 0. 0.5 ]]\n" ] } ], "source": [ "corpus = ['The dog ate a sandwich and I ate a sandwich',\n", " 'The wizard transfigured a sandwich']\n", "\n", "# HashingVectorizer hashes tokens straight to column indices, so it builds no\n", "# vocabulary; with only 6 features, hash collisions are possible.\n", "vectorizer = HashingVectorizer(n_features=6)\n", "\n", "print (vectorizer.fit_transform(corpus).todense())\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Data Standardization" ] },
{ "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-1.33630621 -1.37281295 1.22474487]\n", " [ 1.06904497 0.39223227 -1.22474487]\n", " [ 0.26726124 0.98058068 0. ]]\n" ] } ], "source": [ "X = [[1,2,3],\n", " [4,5,1],\n", " [3,6,2]\n", " ]\n", "\n", "# scale standardizes each column to zero mean and unit variance\n", "print(preprocessing.scale(X))" ] },
{ "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StandardScaler(copy=True, with_mean=True, with_std=True)\n", "[[-1.33630621 -1.37281295 1.22474487]\n", " [ 1.06904497 0.39223227 -1.22474487]\n", " [ 0.26726124 0.98058068 0. ]]\n" ] } ], "source": [ "# StandardScaler is an estimator, so the means and variances it learns here\n", "# can be reused later to transform new data with the same parameters\n", "x1 = preprocessing.StandardScaler()\n", "print(x1)\n", "print(x1.fit_transform(X))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }