{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Tokenizing" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. \n" ] } ], "source": [ "# read the sample text from disk\n", "with open(\"short.txt\", \"r\") as f:\n", "    text = f.read()\n", "print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### tokenize on whitespace\n", "- str.split() with no arguments splits on any run of whitespace, so punctuation stays attached to the preceding word (e.g. \"Dursley,\" and \"much.\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'and',\n", " 'Mrs.',\n", " 'Dursley,',\n", " 'of',\n", " 'number',\n", " 'four,',\n", " 'Privet',\n", " 'Drive,',\n", " 'were',\n", " 'proud',\n", " 'to',\n", " 'say',\n", " 'that',\n", " 'they',\n", " 'were',\n", " 'perfectly',\n", " 'normal,',\n", " 'thank',\n", " 'you',\n", " 'very',\n", " 'much.',\n", " 'They',\n", " 'were',\n", " 'the',\n", " 'last',\n", " 'people',\n", " \"you'd\",\n", " 'expect',\n", " 'to',\n", " 'be',\n", " 'involved',\n", " 'in',\n", " 'anything',\n", " 'strange',\n", " 'or',\n", " 'mysterious,',\n", " 'because',\n", " 'they',\n", " 'just',\n", " \"didn't\",\n", " 'hold',\n", " 'with',\n", " 'such',\n", " 'nonsense.']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### sklearn\n", "- note that CountVectorizer's default token pattern keeps only tokens of two or more characters, so single-character pieces such as the \"d\" of \"you'd\" and the \"t\" of \"didn't\" are discarded\n", "- the full CountVectorizer pipeline also lowercases text during preprocessing, but build_tokenizer() returns only the tokenization step, so case is preserved below\n", "- build_tokenizer(): return a function that splits a string into a sequence of tokens" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": {
"text/plain": [ "['Mr',\n", " 'and',\n", " 'Mrs',\n", " 'Dursley',\n", " 'of',\n", " 'number',\n", " 'four',\n", " 'Privet',\n", " 'Drive',\n", " 'were',\n", " 'proud',\n", " 'to',\n", " 'say',\n", " 'that',\n", " 'they',\n", " 'were',\n", " 'perfectly',\n", " 'normal',\n", " 'thank',\n", " 'you',\n", " 'very',\n", " 'much',\n", " 'They',\n", " 'were',\n", " 'the',\n", " 'last',\n", " 'people',\n", " 'you',\n", " 'expect',\n", " 'to',\n", " 'be',\n", " 'involved',\n", " 'in',\n", " 'anything',\n", " 'strange',\n", " 'or',\n", " 'mysterious',\n", " 'because',\n", " 'they',\n", " 'just',\n", " 'didn',\n", " 'hold',\n", " 'with',\n", " 'such',\n", " 'nonsense']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "CountVectorizer().build_tokenizer()(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### nltk\n", "- does not discard punctuation: each punctuation mark becomes its own token\n", "- splits contractions: \"you'd\" becomes \"you\" + \"'d\", and \"didn't\" becomes \"did\" + \"n't\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'and',\n", " 'Mrs.',\n", " 'Dursley',\n", " ',',\n", " 'of',\n", " 'number',\n", " 'four',\n", " ',',\n", " 'Privet',\n", " 'Drive',\n", " ',',\n", " 'were',\n", " 'proud',\n", " 'to',\n", " 'say',\n", " 'that',\n", " 'they',\n", " 'were',\n", " 'perfectly',\n", " 'normal',\n", " ',',\n", " 'thank',\n", " 'you',\n", " 'very',\n", " 'much',\n", " '.',\n", " 'They',\n", " 'were',\n", " 'the',\n", " 'last',\n", " 'people',\n", " 'you',\n", " \"'d\",\n", " 'expect',\n", " 'to',\n", " 'be',\n", " 'involved',\n", " 'in',\n", " 'anything',\n", " 'strange',\n", " 'or',\n", " 'mysterious',\n", " ',',\n", " 'because',\n", " 'they',\n", " 'just',\n", " 'did',\n", " \"n't\",\n", " 'hold',\n", " 'with',\n", " 'such',\n", " 'nonsense',\n", " '.']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.tokenize import word_tokenize\n",
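"# note (setup hint): word_tokenize relies on NLTK's 'punkt' tokenizer models; if\n", "# the call below raises a LookupError, run: import nltk; nltk.download('punkt')\n",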
"word_tokenize(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Stemming" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish\n" ] } ], "source": [ "from nltk.stem.snowball import SnowballStemmer\n", "# See which languages are supported\n", "print(\" \".join(SnowballStemmer.languages))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "generous\n" ] } ], "source": [ "# Create a new instance of a language-specific subclass.\n", "stemmer = SnowballStemmer(\"english\")\n", "\n", "print(stemmer.stem('generously'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Chunking\n", "\"Splitting a long text into smaller samples is a common task in text analysis. As most kinds of quantitative text analysis take as inputs an unordered list of words, breaking a text up into smaller chunks allows one to preserve context that would otherwise be discarded; observing two words together in a paragraph-sized chunk of text tells us much more about the relationship between those two words than observing two words occurring together in a 100,000-word book. 
Or, as we will be using a selection of tragedies as our examples, we might consider the difference between knowing that two character names occur in the same scene versus knowing that the two names occur in the same play.\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }