{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "### 2017年计算传播学工作坊\n", "***\n", "***\n", "# Text Data Processing Basics: \n", "## How to segment words using Python\n", "\n", "***\n", "***\n", "\n", "张伦\n", "\n", "\n", "计算传播网 http://computational-communication.com\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 1. Preparation" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'/Users/zhanglun/Desktop/Workshop/PythonDemo'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set the Working Directory \n", "import os\n", "os.getcwd() \n", "os.chdir('/Users/zhanglun/Desktop/Workshop/PythonDemo')\n", "os.getcwd()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "###Import packages. \n", "\n", "import nltk\n", "import string\n", "import re\n", "\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk.corpus import stopwords\n", "from nltk.corpus import brown\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 2. Loading Lemmatizer and Stemmer" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']\n" ] } ], "source": [ "brown_tagged_sents=brown.tagged_sents(categories=None)\n", "unigram_tagger=nltk.UnigramTagger(brown_tagged_sents)\n", "wnl = nltk.WordNetLemmatizer()\n", "porter=nltk.PorterStemmer()\n", "stopwords=nltk.corpus.stopwords.words('english')\n", "print (stopwords)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 2.5 Import the text to be analysed" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trump was born and raised in Queens, New York City, and earned an economics degree from the 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 2.5. Import the text to be analysed" ] },
{ "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trump was born and raised in Queens, New York City, and earned an economics degree from the Wharton School. Later, he took charge of The Trump Organization, the real estate and construction firm founded by his paternal grandmother, which he ran for 45 years until 2016.\n" ] } ], "source": [ "# The sample text to be processed in the following steps\n", "text = 'Trump was born and raised in Queens, New York City, and earned an economics degree from the Wharton School. Later, he took charge of The Trump Organization, the real estate and construction firm founded by his paternal grandmother, which he ran for 45 years until 2016.'\n", "print(text)\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 3. Remove Punctuation" ] },
{ "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trump was born and raised in Queens, New York City, and earned an economics degree from the Wharton School. Later, he took charge of The Trump Organization, the real estate and construction firm founded by his paternal grandmother, which he ran for 45 years until 2016.\n", "trump was born and raised in queens new york city and earned an economics degree from the wharton school later he took charge of the trump organization the real estate and construction firm founded by his paternal grandmother which he ran for 45 years until 2016\n" ] } ], "source": [ "# Strip punctuation and lowercase the text (str.maketrans requires Python 3)\n", "text2 = text.translate(str.maketrans(\"\", \"\", string.punctuation)).lower()\n", "print(text)\n", "print(text2)\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 4. Tokenization" ] },
{ "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['trump', 'was', 'born', 'and', 'raised', 'in', 'queens', 'new', 'york', 'city', 'and', 'earned', 'an', 'economics', 'degree', 'from', 'the', 'wharton', 'school', 'later', 'he', 'took', 'charge', 'of', 'the', 'trump', 'organization', 'the', 'real', 'estate', 'and', 'construction', 'firm', 'founded', 'by', 'his', 'paternal', 'grandmother', 'which', 'he', 'ran', 'for', '45', 'years', 'until', '2016']\n" ] } ], "source": [ "# Tokenization: split the cleaned text on whitespace\n", "wordlist = text2.split()\n", "print(wordlist)\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 5. Dropping Common/Stop Words" ] },
{ "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "trump 1\n", "was 0\n", "born 1\n", "and 0\n", "raised 1\n", "in 0\n", "queens 1\n", "new 1\n", "york 1\n", "city 1\n", "and 0\n", "earned 1\n", "an 0\n", "economics 1\n", "degree 1\n", "from 0\n", "the 0\n", "wharton 1\n", "school 1\n", "later 1\n", "he 0\n", "took 1\n", "charge 1\n", "of 0\n", "the 0\n", "trump 1\n", "organization 1\n", "the 0\n", "real 1\n", "estate 1\n", "and 0\n", "construction 1\n", "firm 1\n", "founded 1\n", "by 0\n", "his 0\n", "paternal 1\n", "grandmother 1\n", "which 0\n", "he 0\n", "ran 1\n", "for 0\n", "45 1\n", "years 1\n", "until 0\n", "2016 1\n" ] } ], "source": [ "# Flag stop words: 0 = stop word (to drop), 1 = content word (to keep)\n", "for item in wordlist:\n", "    if item in stopwords:\n", "        print(item, '0')\n", "    else:\n", "        print(item, '1')" ] },
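{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The loop above only flags each token. A minimal sketch of actually dropping the stop words is the list comprehension below; the name `wordlist_clean` is illustrative, and the later cells keep working with the full `wordlist`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Keep only the tokens that are not in the stop-word list\n", "# (illustrative: the following cells still tag the full wordlist)\n", "wordlist_clean = [item for item in wordlist if item not in stopwords]\n", "print(wordlist_clean)\n" ] },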
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 6. Stemming and Lemmatization" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'woman'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lemmatization: reduce inflected forms to dictionary forms ('v' = verb, 'n' = noun)\n", "# Note: only the value of the last expression is displayed below\n", "wnl.lemmatize('was', 'v')\n", "wnl.lemmatize('going', 'v')\n", "wnl.lemmatize('boys', 'n')\n", "wnl.lemmatize('women', 'n')\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Stemming: strip affixes with the Porter stemmer\n", "porter.stem('going')\n", "porter.stem('amazing')\n", "porter.stem('girls')\n", "porter.stem('ponies')\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Step 7. Part-of-Speech Tagging" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('trump', 'VB'), ('was', 'BEDZ'), ('born', 'VBN'), ('and', 'CC'), ('raised', 'VBN'), ('in', 'IN'), ('queens', 'NNS'), ('new', 'JJ'), ('york', None), ('city', 'NN'), ('and', 'CC'), ('earned', 'VBN'), ('an', 'AT'), ('economics', 'NN'), ('degree', 'NN'), ('from', 'IN'), ('the', 'AT'), ('wharton', None), ('school', 'NN'), ('later', 'RBR'), ('he', 'PPS'), ('took', 'VBD'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('trump', 'VB'), ('organization', 'NN'), ('the', 'AT'), ('real', 'JJ'), ('estate', 'NN'), ('and', 'CC'), ('construction', 'NN'), ('firm', 'NN'), ('founded', 'VBN'), ('by', 'IN'), ('his', 'PP$'), ('paternal', None), ('grandmother', 'NN'), ('which', 'WDT'), ('he', 'PPS'), ('ran', 'VBD'), ('for', 'IN'), ('45', 'CD'), ('years', 'NNS'), ('until', 'CS'), ('2016', None)]\n" ] } ], "source": [ "# POS tagging with the unigram tagger trained on the Brown corpus\n", "# (words not seen in the training data are tagged None)\n", "tag = unigram_tagger.tag(wordlist)\n", "print(tag)\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Final Step (optional)\n", "\n", "Export the result\n", "(or just keep it in memory for the follow-up analysis)" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Write out the tagging result with pandas\n", "import pandas as pd\n", "my_df = pd.DataFrame(tag)\n", "my_df.to_csv('out.csv', index=False, header=False)\n", "# Alternatively, in a single line:\n", "pd.DataFrame(tag).to_csv('out1.csv', index=False, header=False)" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Congratulations!\n", "## You are a programmer now!" ] }
], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 }, "toc": { "toc_cell": false, "toc_number_sections": false, "toc_section_display": "none", "toc_threshold": 6, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }