{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Torchtext Tutorial 01: Getting Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Brief overview\n", "\n", "#### In this tutorial we will...\n", "1. create Field objects for preprocessing / tokenizing text\n", "2. create Dataset objects with input text files and Field object(s) which return tokenized and processed text data\n", "3. create Vocab objects that contain the vocabulary and word embeddings of the words from our data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Here's a brief implementation of what we'll be doing in this session\n", "\n", "import torch\n", "import spacy\n", "from torchtext import data, datasets\n", "\n", "# 1. create Field objects for preprocessing / tokenizing text\n", "TEXT = data.Field(tokenize=data.get_tokenizer('spacy'), \n", " init_token='', eos_token='',lower=True)\n", "\n", "# 2. create Dataset objects with input text files and Field object(s) which return tokenized and processed text data\n", "train,val,test = datasets.WikiText2.splits(text_field=TEXT)\n", "\n", "# 3. create Vocab objects that contain the vocabulary and word embeddings of the words from our data\n", "TEXT.build_vocab(train, wv_type='glove.6B')\n", "vocab = TEXT.vocab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setting a Field object\n", "\n", "#### Purpose of a Field object\n", "- Intuitively, it sets rules for preprocessing the input text, and creates a vocabulary of the words that are introduced from the data\n", "\n", "#### Things you can do with a Field object\n", " - 1.1. apply self-defined or external (**spacy is recommended**) tokenizers to tokenize strings into word tokens\n", " - 1.2. automatically convert word tokens (strings) to indices (ints)\n", " - 1.3. automatically add SOS(start-of-sentence) or EOS(end-of-sentence) tokens to input strings\n", " - 1.4. convert text to lowercase\n", " - 1.5. determine whether to pad sentences to a fixed length or leave them as variable lengths\n", "\n", "#### Things you CAN'T do in a Field object\n", "- print batches of text data (remember, a Field defines how to tokenize/preprocess/label your data and arrange a vocabulary, **and does not store the data itself**)\n", "\n", "##### MISC.\n", "- Spacy is a NLP package in Python which provides many features such as tokenizing, POS-tagging, preprocessing etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n", "from torchtext import data, datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1.0. setting up a default0 Field object\n", "TEXT = data.Field()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 1.1. 
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 1.1. applying an external tokenizer\n", "\n", "\"\"\"\n", "The default tokenizer implemented in torchtext is merely\n", "a .split() function, so we usually have to use our own version.\n", "Luckily, the most commonly used 'spacy' tokenizer can be\n", "easily called.\n", "Here we call a spacy tokenizer for English and add it to our Field.\n", "\"\"\"\n", "\n", "tokenizer = data.get_tokenizer('spacy') # spacy tokenizer function\n", "test_string = 'This is a string to be tokenized...'\n", "print('original string: ', test_string)\n", "print('tokenized string: ', list(tokenizer(test_string)))\n", "\n", "TEXT = data.Field(tokenize=tokenizer) # add our tokenizer to our Field object" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\"\"\"\n", "Or you can do this manually by calling the tokenizer from the spacy package.\n", "FYI, spacy provides tokenizers for MANY languages.\n", "\"\"\"\n", "\n", "import spacy\n", "spacy_en = spacy.load('en') # the default English package by spaCy\n", "\n", "def tokenizer2(text): # create a tokenizer function\n", "    return [tok.text for tok in spacy_en.tokenizer(text)]\n", "\n", "test_string = 'This is a string to be tokenized...'\n", "print('original string: ', test_string)\n", "print('tokenized string: ', list(tokenizer2(test_string)))\n", "\n", "TEXT = data.Field(tokenize=tokenizer2)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1.2. converting word tokens to indices\n", "\n", "\"\"\"\n", "use_vocab defaults to True. By setting it to False,\n", "we instead take numerical indices as inputs.\n", "In most cases your inputs will be text rather than integer indices,\n", "so you won't be using this feature that much.\n", "\"\"\"\n", "\n", "TEXT = data.Field(use_vocab=False)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1.3. adding SOS and EOS tokens to input strings\n", "\n", "\"\"\"\n", "In seq2seq models, it is (almost) necessary to append\n", "SOS (start-of-sentence) and EOS (end-of-sentence) tokens to let the model\n", "determine when to start and stop generating new words.\n", "These are applied to the dataset created using this Field.\n", "\"\"\"\n", "\n", "TEXT = data.Field(init_token='<sos>', eos_token='<eos>')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1.4. converting text to lowercase\n", "\n", "\"\"\"\n", "Converts all words to lower case.\n", "\"\"\"\n", "\n", "TEXT = data.Field(lower=True)" ] },
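{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# checking the combined effect of tokenize + lower\n", "\n", "\"\"\"\n", "A minimal sketch, assuming this torchtext version exposes Field.preprocess(),\n", "which applies the configured tokenizer and lowercasing to a raw string.\n", "This lets us inspect the preprocessing without building a dataset first.\n", "\"\"\"\n", "\n", "TEXT_demo = data.Field(tokenize=tokenizer, lower=True)\n", "print(TEXT_demo.preprocess('This String Will Be Tokenized AND Lowercased...'))" ] },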
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1.5. fixed vs. variable lengths\n", "\n", "\"\"\"\n", "By default, RNNs in PyTorch can handle variable sequence lengths,\n", "so it is possible to train models on variable-length inputs.\n", "However, you may have to fix the input length,\n", "for example when using CNNs and other models.\n", "\n", "Note: even when set to variable length, all sentences in a batch\n", "are padded to match the longest sentence in that batch.\n", "\"\"\"\n", "\n", "TEXT = data.Field(fix_length=40) # shorter strings will be padded" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# actual Field object to be used for the next step\n", "\n", "TEXT = data.Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>',\n", " lower=True)\n", "vars(TEXT)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Creating a Dataset object\n", "\n", "Here we will create a dataset for language modelling. The input text data will\n", "be tokenized and preprocessed according to our Field settings.\n", "\n", "\n", "#### Ingredients!\n", "- a Field object used to store the vocabulary of the text file\n", "- the path to a text file\n", "- an appropriate Dataset class\n", "\n", "#### Things you can do with a Dataset object\n", "- 2.1. print examples from the text\n", " \n", "#### Introducing Datasets for various purposes\n", "- 2.2. Language modelling (WikiText2)\n", "- 2.3. Sentiment analysis (SST)\n", "\n", "##### MISC.\n", "- Creating custom datasets will be covered in subsequent tutorials" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create a LanguageModelingDataset instance using TEXT as its Field\n", "\n", "\"\"\"\n", "dataset: Cornell Movie-Dialogs Corpus\n", "\"\"\"\n", "\n", "lang = datasets.LanguageModelingDataset(path='movie.txt',\n", " text_field=TEXT)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 2.1. print examples from the text\n", "\n", "\"\"\"\n", "You can print examples (i.e. all of the text) of the given dataset using Dataset.examples.\n", "When using the basic LanguageModelingDataset, the entire text corpus is\n", "stored as one long list of word tokens.\n", "\"\"\"\n", "\n", "examples = lang.examples\n", "print(\"Number of tokens: \", len(examples[0].text))\n", "print(\"\\n\")\n", "print(\"First 100 tokens: \", examples[0].text[:100])\n", "print(\"\\n\")\n", "print(\"Last 10 tokens: \", examples[0].text[-10:])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 2.2. Dataset for language modelling\n", "\n", "\"\"\"\n", "Torchtext provides the WikiText-2 dataset as a downloadable corpus for language modelling.\n", "Through the .splits() function of the WikiText2 class, we can create\n", "training / validation / test datasets.\n", "Note that this class can ONLY read the WikiText2 data.\n", "\"\"\"\n", "\n", "# get Field\n", "TEXT_wiki = data.Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>',\n", " lower=True)\n", "\n", "# split into train, val, test\n", "train, val, test = datasets.WikiText2.splits(text_field=TEXT_wiki)" ] },
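{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# a quick sanity check on the WikiText2 splits\n", "\n", "\"\"\"\n", "A small sketch: each split returned above is itself a LanguageModelingDataset,\n", "so (as in 2.1) the whole split is stored as one long example of word tokens.\n", "\"\"\"\n", "\n", "print(\"train tokens: \", len(train.examples[0].text))\n", "print(\"valid tokens: \", len(val.examples[0].text))\n", "print(\"test tokens: \", len(test.examples[0].text))" ] },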
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 2.3. Dataset for sentiment analysis\n", "\n", "\"\"\"\n", "We will use the same .splits() function, provided here by the SST class in torchtext.\n", "The only difference is that we also need a label field to keep a vocabulary of\n", "the labels from 'very negative' to 'very positive'.\n", "\"\"\"\n", "\n", "# get Fields - sentiment analysis also requires a label field\n", "TEXT_sst = data.Field(tokenize=tokenizer, init_token='<sos>', eos_token='<eos>',\n", " lower=True)\n", "LABEL_sst = data.Field(sequential=False)\n", "\n", "# split into train, val, test\n", "train, val, test = datasets.SST.splits(text_field=TEXT_sst, label_field=LABEL_sst)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 2.4. Dataset for natural language inference\n", "\n", "# WIP" ] },
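{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# a hedged sketch for natural language inference\n", "\n", "\"\"\"\n", "A minimal sketch, assuming this torchtext version ships a datasets.SNLI class\n", "with a .splits() method analogous to SST.splits(). As with sentiment analysis,\n", "NLI needs a text field plus a non-sequential label field\n", "(entailment / neutral / contradiction).\n", "The actual call is left commented out because it downloads the SNLI corpus.\n", "\"\"\"\n", "\n", "TEXT_nli = data.Field(tokenize=tokenizer, lower=True)\n", "LABEL_nli = data.Field(sequential=False)\n", "\n", "# train_nli, val_nli, test_nli = datasets.SNLI.splits(text_field=TEXT_nli, label_field=LABEL_nli)" ] },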
\n", "Our new vocabulary may\n", "1) only contains word which appear more than N times,\n", "2) be smaller than a given maximum size\n", "\"\"\"\n", "\n", "counter = vocab.freqs # frequency of the original vocabulary created by Field\n", "vocab2 = data.Vocab(counter=counter,min_freq=10) # discard words appearing less than 10 times\n", "vocab3 = data.Vocab(counter=counter,max_size=100000) # set max number of words for a vocabulary\n", "\n", "print(len(vocab))\n", "print(len(vocab2))\n", "print(len(vocab3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 3.4. load external word embeddings\n", "\n", "\"\"\"\n", "We can load external word embeddings such as word2vec or glove with Vocab objects.\n", "Here we build a Vocab object using 'glove.6B'\n", "wv_dir = only if you have a custom word embedding file in .pt, .txt, .zip\n", "wv_dim = 100, 200, 300 (provided by glove) # default=300\n", "wv_type = 'glove.6B', 'glove.840B', 'glove.twitter.27B', 'glove.42B'\n", "\"\"\"\n", "\n", "########## NOTE: this external word embedding requires 800+ MB space ########## \n", "# 3.4.1. downloading embedding and loading into Field object \n", "GLOVE = data.Field()\n", "lang2 = datasets.LanguageModelingDataset(path='movie.txt',\n", " text_field=GLOVE)\n", "GLOVE.build_vocab(lang2, wv_type='glove.6B')\n", "\n", "# 3.4.2. loading embedding into specific Vocab object\n", "vocab2.load_vectors(wv_type='glove.6B', wv_dim=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 3.4.3. word embeddings in Vocab objects\n", "\n", "\"\"\"\n", "Word embedding vectors are accessible via Vocab.vectors.\n", "They are treated as any normal FloatTensor.\n", "\"\"\"\n", "\n", "print(\"Word embedding size: \", vocab2.vectors.size())\n", "print(vocab2.vectors[0:10])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 3.5. easily handle unknown words\n", "\n", "\"\"\"\n", "While you needed exceptions for dealing with unknown words in normal dictionaries,\n", "when using a Vocab object it automatically assigns to any unknown word.\n", "\"\"\"\n", "\n", "unknown_word = \"humbahumba\"\n", "print(\"Index for unknown word %s: %d\" %(unknown_word, vocab2.stoi[unknown_word]))\n", "print(\"Token for unknown word: \", vocab2.itos[vocab2.stoi[unknown_word]])" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda env:pytorch]", "language": "python", "name": "conda-env-pytorch-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }