{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Processing\n", "\n", "**Data processing for labeled datasets involves the following operations:**\n", "1. Read and preprocess train, validation and test CSV files (involves data alterations, e.g., tokenization).\n", "2. Compute global vocabulary (set of all unique words in all 3 datasets).\n", "3. Compute word embeddings for all words in vocabulary.\n", "4. Compute metadata on training set (e.g., word frequencies).\n", "5. Save all 3 processed datasets to cache along with data processing info.\n", "\n", "**Data processing for unlabeled datasets involves the following operations:**\n", "1. Read and preprocess unlabeled CSV file (involves data alterations, e.g., tokenization).\n", "2. Compute vocabulary (set of all unique words in the unlabeled dataset).\n", "3. Compute word embeddings for words in vocabulary that are not present in labeled datasets.\n", "\n", "In `deepmatcher` we aim to simplify and abstract away the complexity of data processing as much as possible. In some cases, however, you may need to customize some aspects of it. \n", "\n", "This tutorial is structured into four sections, each describing one kind of customization:\n", "1. CSV format\n", "2. Data alterations\n", "3. Word embeddings\n", "4. Caching\n", "\n", "## 1. CSV format\n", "\n", "As described in the getting started tutorial, each CSV file is assumed to have the following kinds of columns:\n", "\n", "* **\"Left\" attributes (required):** Remember that our goal is to match tuple pairs across two tables (e.g., table A and B). \"Left\" attributes are columns that correspond to the \"left\" tuple or the first tuple (in table A) in a tuple pair. These column names are expected to be prefixed with \"left_\" by default. 
This can be customized by setting the `left_prefix` parameter (e.g., use \"ltable_\" as the prefix).\n", "* **\"Right\" attributes (required):** \"Right\" attributes are columns that correspond to the \"right\" tuple or the second tuple (in table B) in a tuple pair. These column names are expected to be prefixed with \"right_\" by default. This can be customized by setting the `right_prefix` parameter (e.g., use \"rtable_\" as the prefix).\n", "* **Label column (required for train, validation, test):** Column containing the labels (match or non-match) for each tuple pair. Expected to be named \"label\" by default. This can be customized by setting the `label_attr` parameter.\n", "* **ID column (required):** Column containing a unique ID for each tuple pair. This is for evaluation convenience. Expected to be named \"id\" by default. This can be customized by setting the `id_attr` parameter.\n", "\n", "An example of this is shown below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import deepmatcher as dm\n", "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " left_prefix='left_',\n", " right_prefix='right_',\n", " label_attr='label',\n", " id_attr='id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data alterations\n", "\n", "By default, data processing involves performing the following two modifications to all data:\n", "\n", "**Tokenization:** Tokenization involves dividing text into a sequence of tokens, which roughly correspond to \"words\". E.g., \"This ain't funny. It's actually hilarious.\" will be converted to the following sequence after tokenization: \\['This', 'ain', ''t', 'funny', '.', 'It', ''s', 'actually', 'hilarious', '.'\\]. The tokenizer can be set by specifying the `tokenize` parameter. 
By default, this is set to `\"nltk\"`, which will use the **[default nltk tokenizer](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize)**. Alternatively, you may set this to `\"spacy\"`, which will use the tokenizer provided by the `spacy` package. You first need to [install and set up](https://spacy.io/usage/) `spacy` to do this." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rebuilding data cache because: {'Field arguments have changed.'}\n", "Load time: 0.747850532643497\n", "Vocab time: 13.160778391174972\n", "Metadata time: 0.3041890738531947\n", "Cache time: 0.5588693134486675\n" ] } ], "source": [ "import deepmatcher as dm\n", "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " tokenize='spacy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Lowercasing:** By default all data is lowercased to improve generalization. This can be disabled by setting the `lowercase` parameter to `False`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rebuilding data cache because: {'Field arguments have changed.'}\n", "Load time: 0.8287883093580604\n", "Vocab time: 0.07724822405725718\n", "Metadata time: 0.2948675565421581\n", "Cache time: 0.5322669483721256\n" ] } ], "source": [ "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " lowercase=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Word Embeddings\n", "\n", "**[Word embeddings](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#word-embeddings)** are obtained using **[fastText](https://fasttext.cc/docs/en/pretrained-vectors.html)** by default. This can be customized by setting the `embeddings` parameter to any of the pre-trained word embedding models described in the **[API docs](http://pages.cs.wisc.edu/~sidharth/deepmatcher/deepmatcher.html#deepmatcher.process)** under the `embeddings` parameter. We list a few common settings for `embeddings` below:\n", "\n", "* `fasttext.en.bin`: Uncased **character level** `fastText` embeddings trained on English Wikipedia. This is the default.\n", "* `fasttext.crawl.vec`: Uncased word level `fastText` embeddings trained on Common Crawl\n", "* `fasttext.wiki.vec`: Uncased word level `fastText` embeddings trained on English Wikipedia and news data\n", "* `glove.42B.300d`: Uncased glove word embeddings trained on Common Crawl \n", "* `glove.6B.300d`: Uncased glove word embeddings trained on English Wikipedia and news data\n", "\n", "An example follows:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rebuilding data cache because: {'Field arguments have changed.'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:torchtext.vocab:Downloading vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip\n", "/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 0.00B [00:00, ?B/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Load time: 0.6479603135958314\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/u/s/i/sidharth/.vector_cache/glove.42B.300d.zip: 1.88GB [04:29, 6.96MB/s] \n", "INFO:torchtext.vocab:Extracting vectors into /u/s/i/sidharth/.vector_cache\n", "INFO:torchtext.vocab:Loading vectors from /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt\n", "100%|██████████| 1917494/1917494 [02:42<00:00, 
11825.27it/s]\n", "INFO:torchtext.vocab:Saving vectors to /u/s/i/sidharth/.vector_cache/glove.42B.300d.txt.pt\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Vocab time: 647.3909602137282\n", "Metadata time: 0.33549712132662535\n", "Cache time: 0.5913551291450858\n" ] } ], "source": [ "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " embeddings='glove.42B.300d')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to avoid redownloading pre-trained embeddings during each run, the downloads are saved to a shared directory which serves as a cache for word embeddings. By default this is set to `~/.vector_cache`, but you may customize this location by setting the `embeddings_cache_path` parameter as follows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:deepmatcher.data.field:Downloading vectors from https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Load time: 0.6334623005241156\n", "downloading from Google Drive; may take a few minutes\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed \n", " download_from_url(url, self.destination)\n", "/afs/cs.wisc.edu/u/s/i/sidharth/private/deepmatcher/deepmatcher/data/field.py:62: ResourceWarning: unclosed \n", " download_from_url(url, self.destination)\n", "INFO:deepmatcher.data.field:Extracting vectors into /u/s/i/sidharth/custom_embeddings_cache_dir\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Vocab time: 435.4614539416507\n", "Metadata time: 0.34553063195198774\n", "Cache time: 0.6381127825006843\n" ] } 
], "source": [ "# First, remove data cache file. Otherwise `embeddings_cache_path` won't be used - cache \n", "# already contains embeddings information.\n", "!rm -f sample_data/itunes-amazon/*.pth\n", "\n", "# Also reset in-memory vector cache. Otherwise in-memory fastText embeddings will be used \n", "# instead of loading them from disk.\n", "dm.data.reset_vector_cache()\n", "\n", "# Then, re-process.\n", "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " embeddings_cache_path='~/custom_embeddings_cache_dir')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Caching" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Processing data is time-consuming. To reduce the time spent on processing, `deepmatcher` automatically caches the result of the data processing step for all labeled datasets. It is designed so that users do not need to be aware of its existence at all - you only need to call `deepmatcher.data.process` as you normally would. On the first such call, `deepmatcher` does the processing and caches the result. On subsequent calls, unless there are changes that would necessitate re-processing the data, e.g., a modification to any of the data files or a change in the tokenizer used, `deepmatcher` re-uses the cached results. If there are changes that make the cache invalid, it automatically rebuilds the cache. The caching behavior can be customized by setting these parameters in the `deepmatcher.data.process` call:\n", "\n", "* **cache:** The filename of the cache file, which will be stored in the same `path` as the data sets. This file will store the processed train, validation and test data, along with all relevant information about data processing (e.g., columns to ignore, tokenizer, lowercasing, etc.). 
This defaults to `cacheddata.pth`. \n", "* **check_cached_data:** Whether to check the contents of the cache file to ensure its compatibility with the specified data processing arguments. This defaults to `True`, and we strongly recommend against disabling the check.\n", "* **auto_rebuild_cache:** Whether to automatically rebuild the cache file if the cache is stale, i.e., if the data processing arguments have changed in a way that makes the previously processed data invalid. If this is `False` and `check_cached_data` is enabled, then `deepmatcher` will throw an exception if the cache is stale. This defaults to `True`.\n", "\n", "An example of using these parameters is shown below:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Load time: 0.6869373824447393\n", "Vocab time: 0.05685392860323191\n", "Metadata time: 0.2930362243205309\n", "Cache time: 0.5228710686787963\n" ] } ], "source": [ "train, validation, test = dm.data.process(\n", " path='sample_data/itunes-amazon',\n", " train='train.csv',\n", " validation='validation.csv',\n", " test='test.csv',\n", " ignore_columns=('left_id', 'right_id'),\n", " cache='my_itunes_cache.pth',\n", " check_cached_data=True,\n", " auto_rebuild_cache=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if you change a processing parameter while `auto_rebuild_cache` is set to `False`, `deepmatcher` will throw an exception instead of rebuilding the cache." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }