{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# FeatureHasher and DictVectorizer Comparison\n\nIn this example we illustrate text vectorization, which is the process of\nrepresenting non-numerical input data (such as dictionaries or text documents)\nas vectors of real numbers.\n\nWe first compare :func:`~sklearn.feature_extraction.FeatureHasher` and\n:func:`~sklearn.feature_extraction.DictVectorizer` by using both methods to\nvectorize text documents that are preprocessed (tokenized) with the help of a\ncustom Python function.\n\nLater we introduce and analyze the text-specific vectorizers\n:func:`~sklearn.feature_extraction.text.HashingVectorizer`,\n:func:`~sklearn.feature_extraction.text.CountVectorizer` and\n:func:`~sklearn.feature_extraction.text.TfidfVectorizer` that handle both the\ntokenization and the assembling of the feature matrix within a single class.\n\nThe objective of the example is to demonstrate the usage of text vectorization\nAPI and to compare their processing time. See the example scripts\n`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`\nand `sphx_glr_auto_examples_text_plot_document_clustering.py` for actual\nlearning on text documents.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data\n\nWe load data from `20newsgroups_dataset`, which comprises around\n18000 newsgroups posts on 20 topics split in two subsets: one for training and\none for testing. For the sake of simplicity and reducing the computational\ncost, we select a subset of 7 topics and use the training set only.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n\ncategories = [\n \"alt.atheism\",\n \"comp.graphics\",\n \"comp.sys.ibm.pc.hardware\",\n \"misc.forsale\",\n \"rec.autos\",\n \"sci.space\",\n \"talk.religion.misc\",\n]\n\nprint(\"Loading 20 newsgroups training data\")\nraw_data, _ = fetch_20newsgroups(subset=\"train\", categories=categories, return_X_y=True)\ndata_size_mb = sum(len(s.encode(\"utf-8\")) for s in raw_data) / 1e6\nprint(f\"{len(raw_data)} documents - {data_size_mb:.3f}MB\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define preprocessing functions\n\nA token may be a word, part of a word or anything comprised between spaces or\nsymbols in a string. Here we define a function that extracts the tokens using\na simple regular expression (regex) that matches Unicode word characters. This\nincludes most characters that can be part of a word in any language, as well\nas numbers and the underscore:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re\n\n\ndef tokenize(doc):\n \"\"\"Extract tokens from doc.\n\n This uses a simple regex that matches word characters to break strings\n into tokens. For a more principled approach, see CountVectorizer or\n TfidfVectorizer.\n \"\"\"\n return (tok.lower() for tok in re.findall(r\"\\w+\", doc))\n\n\nlist(tokenize(\"This is a simple example, isn't it?\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We define an additional function that counts the (frequency of) occurrence of\neach token in a given document. 
It returns a frequency dictionary to be used\nby the vectorizers.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import defaultdict\n\n\ndef token_freqs(doc):\n    \"\"\"Extract a dict mapping tokens from doc to their occurrences.\"\"\"\n\n    freq = defaultdict(int)\n    for tok in tokenize(doc):\n        freq[tok] += 1\n    return freq\n\n\ntoken_freqs(\"That is one example, but this is another one\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe in particular that the repeated token `\"is\"` is counted twice.\n\nBreaking a text document into word tokens, potentially losing the order\ninformation between the words in a sentence, is often called a [Bag of Words\nrepresentation](https://en.wikipedia.org/wiki/Bag-of-words_model).\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DictVectorizer\n\nFirst we benchmark the :func:`~sklearn.feature_extraction.DictVectorizer`,\nthen we compare it to :func:`~sklearn.feature_extraction.FeatureHasher` as\nboth of them receive dictionaries as input.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from time import time\n\nfrom sklearn.feature_extraction import DictVectorizer\n\ndict_count_vectorizers = defaultdict(list)\n\nt0 = time()\nvectorizer = DictVectorizer()\nvectorizer.fit_transform(token_freqs(d) for d in raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(\n    vectorizer.__class__.__name__ + \"\\non freq dicts\"\n)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The actual mapping from text token to column index is explicitly stored in\nthe `.vocabulary_` attribute, which is a potentially very large Python\ndictionary:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "type(vectorizer.vocabulary_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "len(vectorizer.vocabulary_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "vectorizer.vocabulary_[\"example\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## FeatureHasher\n\nDictionaries take up a large amount of storage space and grow in size as the\ntraining set grows. Instead of growing the vectors along with a dictionary,\nfeature hashing builds a vector of pre-defined length by applying a hash\nfunction `h` to the features (e.g., tokens), then using the hash values\ndirectly as feature indices and updating the resulting vector at those\nindices. When the feature space is not large enough, hashing functions tend to\nmap distinct values to the same hash code (hash collisions). As a result, it\nis impossible to determine what object generated any particular hash code.\n\nBecause of this, it is impossible to recover the original tokens from the\nfeature matrix, and the best approach to estimate the number of unique terms\nin the original dictionary is to count the number of active columns in the\nencoded feature matrix.
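\n\nTo make the mechanism more concrete, the following cell sketches the hashing\ntrick on the toy sentence used above. It relies on Python's built-in `hash`\nreduced modulo a small number of buckets and is only meant as an illustration:\nthe actual :func:`~sklearn.feature_extraction.FeatureHasher` uses a dedicated\nhashing function (MurmurHash3) together with an alternating sign to limit the\nimpact of collisions.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A toy sketch of the hashing trick (illustration only): map each token to a\n# column index with Python's built-in hash(). Note that hash() is randomized\n# per Python process, so the exact indices may change between runs; the real\n# FeatureHasher uses a dedicated, stable hashing function (MurmurHash3).\nn_buckets = 8  # deliberately tiny so that collisions are likely\n\n\ndef hashed_indices(doc, n_buckets):\n    \"\"\"Map each token of doc to a column index in [0, n_buckets).\"\"\"\n    return {tok: hash(tok) % n_buckets for tok in tokenize(doc)}\n\n\nhashed_indices(\"This is a simple example, isn't it?\", n_buckets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever two distinct tokens land in the same bucket we get a collision, which\nis exactly what happens in the hashed feature matrix when `n_features` is too\nsmall.\n\n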
To count the active columns we define the following function:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\n\ndef n_nonzero_columns(X):\n    \"\"\"Number of columns with at least one non-zero value in a CSR matrix.\n\n    This is useful to count the number of feature columns that are effectively\n    active when using the FeatureHasher.\n    \"\"\"\n    return len(np.unique(X.nonzero()[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default number of features for the\n:func:`~sklearn.feature_extraction.FeatureHasher` is `2**20`. Here we set\n`n_features = 2**18` to illustrate hash collisions.\n\n**FeatureHasher on frequency dictionaries**\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction import FeatureHasher\n\nt0 = time()\nhasher = FeatureHasher(n_features=2**18)\nX = hasher.transform(token_freqs(d) for d in raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(\n    hasher.__class__.__name__ + \"\\non freq dicts\"\n)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {n_nonzero_columns(X)} unique tokens\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of unique tokens when using the\n:func:`~sklearn.feature_extraction.FeatureHasher` is lower than the number of\nunique terms obtained with the\n:func:`~sklearn.feature_extraction.DictVectorizer`. This is due to hash\ncollisions.\n\nThe number of collisions can be reduced by increasing the feature space.\nNotice that the speed of the vectorizer does not change significantly when\nsetting a large number of features, though it results in larger coefficient\ndimensions and therefore requires more memory to store them, even if a\nmajority of them are inactive.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "t0 = time()\nhasher = FeatureHasher(n_features=2**22)\nX = hasher.transform(token_freqs(d) for d in raw_data)\nduration = time() - t0\n\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {n_nonzero_columns(X)} unique tokens\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We confirm that the number of unique tokens gets closer to the number of\nunique terms found by the :func:`~sklearn.feature_extraction.DictVectorizer`.\n\n**FeatureHasher on raw tokens**\n\nAlternatively, one can set `input_type=\"string\"` in the\n:func:`~sklearn.feature_extraction.FeatureHasher` to directly vectorize the\nstrings output by the customized `tokenize` function.
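\n\nAs a quick illustration of this input mode, here is a toy call on a couple of\nmade-up token lists (not part of the benchmark):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Toy illustration of input_type=\"string\": each sample is an iterable of raw\n# tokens and each string is treated as a feature. The token lists below are\n# made up for this illustration and are not part of the benchmark.\ntoy_hasher = FeatureHasher(n_features=2**10, input_type=\"string\")\nX_toy = toy_hasher.transform([[\"hello\", \"world\"], [\"hashing\", \"trick\", \"example\"]])\nX_toy.shape, X_toy.nnz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "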
Passing raw tokens this way is equivalent to passing a dictionary with an\nimplied frequency of 1 for each feature name.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "t0 = time()\nhasher = FeatureHasher(n_features=2**18, input_type=\"string\")\nX = hasher.transform(tokenize(d) for d in raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(\n    hasher.__class__.__name__ + \"\\non raw tokens\"\n)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {n_nonzero_columns(X)} unique tokens\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now plot the vectorization speed of the above methods.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(figsize=(12, 6))\n\ny_pos = np.arange(len(dict_count_vectorizers[\"vectorizer\"]))\nax.barh(y_pos, dict_count_vectorizers[\"speed\"], align=\"center\")\nax.set_yticks(y_pos)\nax.set_yticklabels(dict_count_vectorizers[\"vectorizer\"])\nax.invert_yaxis()\n_ = ax.set_xlabel(\"speed (MB/s)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both cases :func:`~sklearn.feature_extraction.FeatureHasher` is\napproximately twice as fast as\n:func:`~sklearn.feature_extraction.DictVectorizer`. This is handy when dealing\nwith large amounts of data, with the downside of losing the invertibility of\nthe transformation, which in turn makes the interpretation of a model a more\ncomplex task.\n\nThe `FeatureHasher` with `input_type=\"string\"` is slightly faster than the\nvariant that works on frequency dicts because it does not count repeated\ntokens: each token is implicitly counted once, even if it was repeated.\nDepending on the downstream machine learning task, this may or may not be a\nlimitation.\n\n## Comparison with special purpose text vectorizers\n\n:func:`~sklearn.feature_extraction.text.CountVectorizer` accepts raw data as\nit internally implements tokenization and occurrence counting. It is similar\nto the :func:`~sklearn.feature_extraction.DictVectorizer` when used along with\nthe customized function `token_freqs` as done in the previous section. The\ndifference is that :func:`~sklearn.feature_extraction.text.CountVectorizer`\nis more flexible. In particular, it accepts various regex patterns through the\n`token_pattern` parameter.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n\nt0 = time()\nvectorizer = CountVectorizer()\nvectorizer.fit_transform(raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that using the :func:`~sklearn.feature_extraction.text.CountVectorizer`\nimplementation is approximately twice as fast as using the\n:func:`~sklearn.feature_extraction.DictVectorizer` along with the simple\nfunction we defined for mapping the tokens.
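\n\nBefore looking at why, here is a small aside on the flexibility mentioned\nabove: the `token_pattern` parameter lets us change what counts as a token.\nThe pattern and the sentence below are made up for the illustration.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Toy illustration (made-up pattern and sentence): only keep purely alphabetic\n# tokens with at least two characters, dropping numbers and single letters.\ntoy_vectorizer = CountVectorizer(token_pattern=r\"(?u)\\\\b[a-zA-Z]{2,}\\\\b\".replace(\"\\\\\\\\\", \"\\\\\"))\ntoy_vectorizer.fit_transform([\"One example, 2 tokens dropped: a 42\"])\ntoy_vectorizer.get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "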
The reason for this speed difference is that\n:func:`~sklearn.feature_extraction.text.CountVectorizer` is optimized by\nreusing a compiled regular expression for the full training set instead of\ncreating one per document, as done in our naive `tokenize` function.\n\nNow we make a similar experiment with the\n:func:`~sklearn.feature_extraction.text.HashingVectorizer`, which is\nequivalent to combining the \"hashing trick\" implemented by the\n:func:`~sklearn.feature_extraction.FeatureHasher` class and the text\npreprocessing and tokenization of the\n:func:`~sklearn.feature_extraction.text.CountVectorizer`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n\nt0 = time()\nvectorizer = HashingVectorizer(n_features=2**18)\nvectorizer.fit_transform(raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that this is the fastest text tokenization strategy so far,\nassuming that the downstream machine learning task can tolerate a few\ncollisions.\n\n## TfidfVectorizer\n\nIn a large text corpus, some words appear with a very high frequency (e.g.\n\"the\", \"a\", \"is\" in English) and do not carry meaningful information about the\nactual contents of a document. If we were to feed the word count data directly\nto a classifier, those very common terms would shadow the frequencies of rarer\nyet more informative terms. In order to re-weight the count features into\nfloating point values suitable for usage by a classifier, it is very common to\nuse the tf-idf transform as implemented by the\n:func:`~sklearn.feature_extraction.text.TfidfTransformer`.
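\n\nAs a quick illustration of the re-weighting, here is the transform applied to\na tiny, made-up count matrix (not part of the benchmark):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n\n# Tiny made-up count matrix: 3 documents x 3 terms. The first term appears in\n# every document and therefore gets the lowest idf weight, while the two\n# rarer terms are relatively boosted.\ntoy_counts = [[3, 0, 1], [2, 0, 0], [3, 1, 0]]\nTfidfTransformer().fit_transform(toy_counts).toarray().round(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "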
Here \"tf\" stands for \"term-frequency\" while \"tf-idf\" means term-frequency\ntimes inverse document-frequency.\n\nWe now benchmark the :func:`~sklearn.feature_extraction.text.TfidfVectorizer`,\nwhich is equivalent to combining the tokenization and occurrence counting of\nthe :func:`~sklearn.feature_extraction.text.CountVectorizer` along with the\nnormalizing and weighting from a\n:func:`~sklearn.feature_extraction.text.TfidfTransformer`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n\nt0 = time()\nvectorizer = TfidfVectorizer()\nvectorizer.fit_transform(raw_data)\nduration = time() - t0\ndict_count_vectorizers[\"vectorizer\"].append(vectorizer.__class__.__name__)\ndict_count_vectorizers[\"speed\"].append(data_size_mb / duration)\nprint(f\"done in {duration:.3f} s at {data_size_mb / duration:.1f} MB/s\")\nprint(f\"Found {len(vectorizer.get_feature_names_out())} unique terms\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n\nLet's conclude this notebook by summarizing all the recorded processing speeds\nin a single plot:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12, 6))\n\ny_pos = np.arange(len(dict_count_vectorizers[\"vectorizer\"]))\nax.barh(y_pos, dict_count_vectorizers[\"speed\"], align=\"center\")\nax.set_yticks(y_pos)\nax.set_yticklabels(dict_count_vectorizers[\"vectorizer\"])\nax.invert_yaxis()\n_ = ax.set_xlabel(\"speed (MB/s)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice from the plot that\n:func:`~sklearn.feature_extraction.text.TfidfVectorizer` is slightly slower\nthan :func:`~sklearn.feature_extraction.text.CountVectorizer` because of the\nextra operations induced by the\n:func:`~sklearn.feature_extraction.text.TfidfTransformer`.\n\nAlso notice that, by setting the number of features `n_features = 2**18`, the\n:func:`~sklearn.feature_extraction.text.HashingVectorizer` is faster than the\n:func:`~sklearn.feature_extraction.text.CountVectorizer` at the expense of the\ninvertibility of the transformation due to hash collisions.\n\nWe highlight that :func:`~sklearn.feature_extraction.text.CountVectorizer` and\n:func:`~sklearn.feature_extraction.text.HashingVectorizer` are faster than\ntheir equivalent :func:`~sklearn.feature_extraction.DictVectorizer` and\n:func:`~sklearn.feature_extraction.FeatureHasher` on manually tokenized\ndocuments since the internal tokenization step of the former vectorizers\ncompiles a regular expression once and then reuses it for all the documents.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }