{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Import the libraries we will be using" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import random\n", "import nltk\n", "import gensim\n", "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "\n", "from nltk.stem.porter import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we download the necessary resources for NLTK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nltk.download('punkt')\n", "nltk.download('tagsets')\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 1: Tokenization" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Tokenization involves segmenting text into tokens. It is a common preprocessing step in many NLP applications" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "text = \"Are you crazy? I don't know.\"" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "A simple method is just to split based on white space. Note that this doesn't work for many other languages (like Chinese)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "text.split()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will now explore tokenization as provided by two NLP tools. 
First, look at how NLTK tokenizes words:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "nltk.word_tokenize(text)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now let's look at how Gensim handles tokenization:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "list(gensim.utils.tokenize(text))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "It often makes sense to lowercase the text, as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"HELLO world\".lower()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "