{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Benchmark: Implement Levenshtein term similarity matrix and fast SCM between corpora ([RaRe-Technologies/gensim PR #2016][#2016])\n", "\n", " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "d429fedf094e00c4bb5c27589d5befb53b2e4b13\r\n" ] } ], "source": [ "!git rev-parse HEAD" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from copy import deepcopy\n", "from datetime import timedelta\n", "from itertools import product\n", "import logging\n", "from math import floor, ceil, log10\n", "import pickle\n", "from random import sample, seed, shuffle\n", "from time import time\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from tqdm import tqdm_notebook\n", "\n", "def tqdm(iterable, total=None, desc=None):\n", " if total is None:\n", " total = len(iterable)\n", " for num_done, element in enumerate(tqdm_notebook(iterable, total=total)):\n", " logger.info(\"%s: %d / %d\", desc, num_done, total)\n", " yield element\n", "\n", "from gensim.corpora import Dictionary\n", "import gensim.downloader as api\n", "from gensim.similarities.index import AnnoyIndexer\n", "from gensim.similarities import SparseTermSimilarityMatrix\n", "from gensim.similarities import UniformTermSimilarityIndex\n", "from gensim.similarities import LevenshteinSimilarityIndex\n", "from gensim.similarities import WordEmbeddingSimilarityIndex\n", "from gensim.utils import simple_preprocess\n", "\n", "RANDOM_SEED = 12345\n", "\n", "logger = logging.getLogger()\n", "fhandler = logging.FileHandler(filename='matrix_speed.log', mode='a')\n", "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n", 
"fhandler.setFormatter(formatter)\n", "logger.addHandler(fhandler)\n", "logger.setLevel(logging.INFO)\n", "\n", "pd.set_option('display.max_rows', None, 'display.max_seq_items', None)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "\"\"\"Repeatedly run a benchmark callable given various configurations and\n", "get a list of results.\n", "\n", "Parameters\n", "----------\n", "benchmark : callable tuple -> dict\n", " A benchmark callable that accepts a configuration and returns results.\n", "configurations : iterable of tuple\n", " An iterable of configurations that are used for calling the benchmark function.\n", "results_filename : str\n", " A filename of a file that will be used to persistently store the results using\n", " pickle. If the file exists, then the function will load the stored results\n", " instead of calling the benchmark callable.\n", "\n", "Returns\n", "-------\n", "list of dict\n", " The return values of the individual invocations of the benchmark callable.\n", "\n", "\"\"\"\n", "def benchmark_results(benchmark, configurations, results_filename):\n", " try:\n", " with open(results_filename, \"rb\") as file:\n", " results = pickle.load(file)\n", " except IOError:\n", " configurations = list(configurations)\n", " shuffle(configurations)\n", " results = list(tqdm(\n", " (benchmark(configuration) for configuration in configurations),\n", " total=len(configurations), desc=\"benchmark\"))\n", " with open(results_filename, \"wb\") as file:\n", " pickle.dump(results, file)\n", " return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implement Levenshtein term similarity matrix\n", "\n", "In Gensim PR [#1827][], we added a base implementation of the soft cosine measure (SCM). The base implementation would create term similarity matrices using a single complex procedure. 
In Gensim PR [#2016][], we split the procedure into:\n", "\n", "- **TermSimilarityIndex** builder classes that produce the $k$ most similar terms for a given term $t$ that are distinct from $t$, along with the corresponding term similarities, and\n", "- the **SparseTermSimilarityMatrix** director class that constructs term similarity matrices and consumes term similarities produced by **TermSimilarityIndex** instances.\n", "\n", "One of the benefits of this separation is that we can easily measure the speed at which a **TermSimilarityIndex** builder class produces term similarities and compare it with the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities. This allows us to see which of the classes is the bottleneck that slows down the construction of term similarity matrices.\n", "\n", "In this notebook, we measure all the currently available builder and director classes. For the measurements, we use the [Google News word embeddings][word2vec-google-news-300] distributed with the C implementation of Word2Vec. 
From the word embeddings, we will derive a dictionary of 2.01M terms.\n", "\n", " [word2vec-google-news-300]: https://github.com/mmihaltz/word2vec-GoogleNews-vectors (word2vec-GoogleNews-vectors)\n", " [#1827]: https://github.com/RaRe-Technologies/gensim/pull/1827 (Implement Soft Cosine Measure - Pull Request #1827)\n", " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "full_model = api.load(\"word2vec-google-news-300\")\n", "\n", "try:\n", " full_dictionary = Dictionary.load(\"matrix_speed.dictionary\")\n", "except IOError:\n", " full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])\n", " full_dictionary.save(\"matrix_speed.dictionary\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Director class benchmark\n", "#### SparseTermSimilarityMatrix\n", "First, we measure the speed at which the **SparseTermSimilarityMatrix** director class consumes term similarities." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [], "source": [ "def benchmark(configuration):\n", " dictionary, nonzero_limit, symmetric, positive_definite, repetition = configuration\n", " index = UniformTermSimilarityIndex(dictionary)\n", " \n", " start_time = time()\n", " matrix = SparseTermSimilarityMatrix(\n", " index, dictionary, nonzero_limit=nonzero_limit, symmetric=symmetric,\n", " positive_definite=positive_definite, dtype=np.float16).matrix\n", " end_time = time()\n", " \n", " duration = end_time - start_time\n", " return {\n", " \"dictionary_size\": len(dictionary),\n", " \"nonzero_limit\": nonzero_limit,\n", " \"matrix_nonzero\": matrix.nnz,\n", " \"repetition\": repetition,\n", " \"symmetric\": symmetric,\n", " \"positive_definite\": positive_definite,\n", " \"duration\": duration, }" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4aef903a70e24247ad3c889237ed4c48", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=4), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "dictionary_sizes = [10**k for k in range(3, int(ceil(log10(len(full_dictionary)))))]\n", "seed(RANDOM_SEED)\n", "dictionaries = []\n", "for size in tqdm(dictionary_sizes, desc=\"dictionaries\"):\n", " dictionary = Dictionary([sample(list(full_dictionary.values()), size)])\n", " dictionaries.append(dictionary)\n", "dictionaries.append(full_dictionary)\n", "nonzero_limits = [1, 10, 100]\n", "symmetry = (True, False)\n", "positive_definiteness = (True, False)\n", "repetitions = range(10)\n", "\n", "configurations = product(dictionaries, nonzero_limits, symmetry, positive_definiteness, repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.director_results\")" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables show how long it takes to construct a term similarity matrix (the **duration** column), how many nonzero elements there are in the matrix (the **matrix_nonzero** column), and the mean term similarity consumption speed (the **consumption_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of nonzero elements outside the diagonal in every column of the matrix (the **nonzero_limit** column), the matrix symmetry constraint (the **symmetric** column), and the matrix positive definiteness constraint (the **positive_definite** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n", "\n", "We can see that the symmetry and positive definiteness constraints severely limit the number of nonzero elements in the resulting matrix. This in turn increases the consumption speed, since we end up throwing away most of the elements that we consume. The effects of the dictionary size on the mean term similarity consumption speed are minor to none." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"consumption_speed\"] = df.dictionary_size * df.nonzero_limit / df.duration\n", "df = df.groupby([\"dictionary_size\", \"nonzero_limit\", \"symmetric\", \"positive_definite\"])\n", "\n", "def display(df):\n", " df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n", " df[\"matrix_nonzero\"] = [int(nonzero) for nonzero in df[\"matrix_nonzero\"]]\n", " df[\"consumption_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"consumption_speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationmatrix_nonzeroconsumption_speed
dictionary_sizenonzero_limitsymmetricpositive_definite
100001FalseFalse00:00:00.4355332000022.96 Kword pairs / s
True00:00:00.4926062000020.30 Kword pairs / s
TrueFalse00:00:00.1855631000253.90 Kword pairs / s
True00:00:00.2404711000241.59 Kword pairs / s
10FalseFalse00:00:02.68783611000037.21 Kword pairs / s
True00:00:00.61549220000162.49 Kword pairs / s
TrueFalse00:00:00.50118810118199.53 Kword pairs / s
True00:00:01.3805861001072.44 Kword pairs / s
100FalseFalse00:00:25.262807101000039.58 Kword pairs / s
True00:00:01.13252420000883.02 Kword pairs / s
TrueFalse00:00:03.59566620198278.13 Kword pairs / s
True00:00:11.8189121010084.61 Kword pairs / s
20100001FalseFalse00:01:31.786585402000021.90 Kword pairs / s
True00:01:40.954580402000019.91 Kword pairs / s
TrueFalse00:00:39.050064201000251.48 Kword pairs / s
True00:00:49.238437201000240.82 Kword pairs / s
10FalseFalse00:09:35.4703732211000034.93 Kword pairs / s
True00:02:02.9203344020000163.52 Kword pairs / s
TrueFalse00:01:39.5766932010118201.88 Kword pairs / s
True00:04:35.646501201001072.92 Kword pairs / s
100FalseFalse01:42:01.74756820301000032.88 Kword pairs / s
True00:03:36.4207784020000928.75 Kword pairs / s
TrueFalse00:10:58.4340602020198305.30 Kword pairs / s
True00:39:40.319479201010084.44 Kword pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 00:00:00.435533 \n", " True 00:00:00.492606 \n", " True False 00:00:00.185563 \n", " True 00:00:00.240471 \n", " 10 False False 00:00:02.687836 \n", " True 00:00:00.615492 \n", " True False 00:00:00.501188 \n", " True 00:00:01.380586 \n", " 100 False False 00:00:25.262807 \n", " True 00:00:01.132524 \n", " True False 00:00:03.595666 \n", " True 00:00:11.818912 \n", "2010000 1 False False 00:01:31.786585 \n", " True 00:01:40.954580 \n", " True False 00:00:39.050064 \n", " True 00:00:49.238437 \n", " 10 False False 00:09:35.470373 \n", " True 00:02:02.920334 \n", " True False 00:01:39.576693 \n", " True 00:04:35.646501 \n", " 100 False False 01:42:01.747568 \n", " True 00:03:36.420778 \n", " True False 00:10:58.434060 \n", " True 00:39:40.319479 \n", "\n", " matrix_nonzero \\\n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 20000 \n", " True 20000 \n", " True False 10002 \n", " True 10002 \n", " 10 False False 110000 \n", " True 20000 \n", " True False 10118 \n", " True 10010 \n", " 100 False False 1010000 \n", " True 20000 \n", " True False 20198 \n", " True 10100 \n", "2010000 1 False False 4020000 \n", " True 4020000 \n", " True False 2010002 \n", " True 2010002 \n", " 10 False False 22110000 \n", " True 4020000 \n", " True False 2010118 \n", " True 2010010 \n", " 100 False False 203010000 \n", " True 4020000 \n", " True False 2020198 \n", " True 2010100 \n", "\n", " consumption_speed \n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 22.96 Kword pairs / s \n", " True 20.30 Kword pairs / s \n", " True False 53.90 Kword pairs / s \n", " True 41.59 Kword pairs / s \n", " 10 False False 37.21 Kword pairs / s \n", " True 162.49 Kword pairs / s \n", " True False 199.53 Kword pairs / s \n", " True 72.44 Kword pairs / s \n", " 100 False False 39.58 Kword pairs / s \n", " 
True 883.02 Kword pairs / s \n", " True False 278.13 Kword pairs / s \n", " True 84.61 Kword pairs / s \n", "2010000 1 False False 21.90 Kword pairs / s \n", " True 19.91 Kword pairs / s \n", " True False 51.48 Kword pairs / s \n", " True 40.82 Kword pairs / s \n", " 10 False False 34.93 Kword pairs / s \n", " True 163.52 Kword pairs / s \n", " True False 201.88 Kword pairs / s \n", " True 72.92 Kword pairs / s \n", " 100 False False 32.88 Kword pairs / s \n", " True 928.75 Kword pairs / s \n", " True False 305.30 Kword pairs / s \n", " True 84.44 Kword pairs / s " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [10000, len(full_dictionary)], :, :].loc[\n", " :, [\"duration\", \"matrix_nonzero\", \"consumption_speed\"]]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationmatrix_nonzeroconsumption_speed
dictionary_sizenonzero_limitsymmetricpositive_definite
100001FalseFalse00:00:00.00533400.28 Kword pairs / s
True00:00:00.00407200.17 Kword pairs / s
TrueFalse00:00:00.00312400.90 Kword pairs / s
True00:00:00.00179700.31 Kword pairs / s
10FalseFalse00:00:00.01198600.17 Kword pairs / s
True00:00:00.00597201.59 Kword pairs / s
TrueFalse00:00:00.00286901.15 Kword pairs / s
True00:00:00.01141100.60 Kword pairs / s
100FalseFalse00:00:00.11111800.17 Kword pairs / s
True00:00:00.00761105.94 Kword pairs / s
TrueFalse00:00:00.03087502.38 Kword pairs / s
True00:00:00.05019800.36 Kword pairs / s
20100001FalseFalse00:00:00.76730500.18 Kword pairs / s
True00:00:00.17243200.03 Kword pairs / s
TrueFalse00:00:00.34623900.46 Kword pairs / s
True00:00:00.17707500.15 Kword pairs / s
10FalseFalse00:00:05.15665500.31 Kword pairs / s
True00:00:00.63167600.83 Kword pairs / s
TrueFalse00:00:01.21606702.41 Kword pairs / s
True00:00:00.54777300.14 Kword pairs / s
100FalseFalse00:04:10.37103501.24 Kword pairs / s
True00:00:00.63441602.73 Kword pairs / s
TrueFalse00:00:06.58676703.05 Kword pairs / s
True00:00:09.03093200.32 Kword pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 00:00:00.005334 \n", " True 00:00:00.004072 \n", " True False 00:00:00.003124 \n", " True 00:00:00.001797 \n", " 10 False False 00:00:00.011986 \n", " True 00:00:00.005972 \n", " True False 00:00:00.002869 \n", " True 00:00:00.011411 \n", " 100 False False 00:00:00.111118 \n", " True 00:00:00.007611 \n", " True False 00:00:00.030875 \n", " True 00:00:00.050198 \n", "2010000 1 False False 00:00:00.767305 \n", " True 00:00:00.172432 \n", " True False 00:00:00.346239 \n", " True 00:00:00.177075 \n", " 10 False False 00:00:05.156655 \n", " True 00:00:00.631676 \n", " True False 00:00:01.216067 \n", " True 00:00:00.547773 \n", " 100 False False 00:04:10.371035 \n", " True 00:00:00.634416 \n", " True False 00:00:06.586767 \n", " True 00:00:09.030932 \n", "\n", " matrix_nonzero \\\n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", " 10 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", " 100 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", "2010000 1 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", " 10 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", " 100 False False 0 \n", " True 0 \n", " True False 0 \n", " True 0 \n", "\n", " consumption_speed \n", "dictionary_size nonzero_limit symmetric positive_definite \n", "10000 1 False False 0.28 Kword pairs / s \n", " True 0.17 Kword pairs / s \n", " True False 0.90 Kword pairs / s \n", " True 0.31 Kword pairs / s \n", " 10 False False 0.17 Kword pairs / s \n", " True 1.59 Kword pairs / s \n", " True False 1.15 Kword pairs / s \n", " True 0.60 Kword pairs / s \n", " 100 False False 0.17 Kword pairs / s \n", " True 5.94 Kword pairs / s \n", " True False 2.38 Kword pairs / s \n", " True 0.36 Kword pairs / s \n", "2010000 1 False False 0.18 Kword 
pairs / s \n", " True 0.03 Kword pairs / s \n", " True False 0.46 Kword pairs / s \n", " True 0.15 Kword pairs / s \n", " 10 False False 0.31 Kword pairs / s \n", " True 0.83 Kword pairs / s \n", " True False 2.41 Kword pairs / s \n", " True 0.14 Kword pairs / s \n", " 100 False False 1.24 Kword pairs / s \n", " True 2.73 Kword pairs / s \n", " True False 3.05 Kword pairs / s \n", " True 0.32 Kword pairs / s " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [10000, len(full_dictionary)], :, :].loc[\n", " :, [\"duration\", \"matrix_nonzero\", \"consumption_speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Builder class benchmark\n", "#### UniformTermSimilarityIndex\n", "First, we measure the speed at which the **UniformTermSimilarityIndex** builder class produces term similarities. **UniformTermSimilarityIndex** is a dummy class that just generates a sequence of constants. It produces much more term similarities per second than the **SparseTermSimilarityMatrix** is capable of consuming and its results will serve as an upper limit." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " dictionary, nonzero_limit, repetition = configuration\n", " \n", " start_time = time()\n", " index = UniformTermSimilarityIndex(dictionary)\n", " end_time = time()\n", " constructor_duration = end_time - start_time\n", " \n", " start_time = time()\n", " for term in dictionary.values():\n", " for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n", " pass\n", " end_time = time()\n", " production_duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": len(dictionary),\n", " \"nonzero_limit\": nonzero_limit,\n", " \"repetition\": repetition,\n", " \"constructor_duration\": constructor_duration,\n", " \"production_duration\": production_duration, }" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "nonzero_limits = [1, 10, 100, 1000]\n", "\n", "configurations = product(dictionaries, nonzero_limits, repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.uniform\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables show how long it takes to retrieve the most similar terms for all terms in a dictionary (the **production_duration** column) and the mean term similarity production speed (the **production_speed** column) as we vary the dictionary size (the **dictionary_size** column), and the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n", "\n", "The **production_speed** is proportional to **nonzero_limit**." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"processing_speed\"] = df.dictionary_size ** 2 / df.production_duration\n", "df[\"production_speed\"] = df.dictionary_size * df.nonzero_limit / df.production_duration\n", "df = df.groupby([\"dictionary_size\", \"nonzero_limit\"])\n", "\n", "def display(df):\n", " df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n", " df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n", " df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n", " df[\"production_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"production_speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
production_durationproduction_speed
dictionary_sizenonzero_limit
1000100:00:00.002973336.41 Kword pairs / s
1000:00:00.0053721861.64 Kword pairs / s
10000:00:00.0267523738.79 Kword pairs / s
100000:00:00.2902653449.16 Kword pairs / s
2010000100:00:06.318446318.12 Kword pairs / s
1000:00:10.7836111863.96 Kword pairs / s
10000:00:53.1086443785.04 Kword pairs / s
100000:09:45.1037413437.36 Kword pairs / s
\n", "
" ], "text/plain": [ " production_duration production_speed\n", "dictionary_size nonzero_limit \n", "1000 1 00:00:00.002973 336.41 Kword pairs / s\n", " 10 00:00:00.005372 1861.64 Kword pairs / s\n", " 100 00:00:00.026752 3738.79 Kword pairs / s\n", " 1000 00:00:00.290265 3449.16 Kword pairs / s\n", "2010000 1 00:00:06.318446 318.12 Kword pairs / s\n", " 10 00:00:10.783611 1863.96 Kword pairs / s\n", " 100 00:00:53.108644 3785.04 Kword pairs / s\n", " 1000 00:09:45.103741 3437.36 Kword pairs / s" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [1000, len(full_dictionary)], :, :].loc[\n", " :, [\"production_duration\", \"production_speed\"]]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
production_durationproduction_speed
dictionary_sizenonzero_limit
1000100:00:00.0000171.93 Kword pairs / s
1000:00:00.00006221.50 Kword pairs / s
10000:00:00.00040856.66 Kword pairs / s
100000:00:00.010500123.82 Kword pairs / s
2010000100:00:00.0234951.18 Kword pairs / s
1000:00:00.0355876.16 Kword pairs / s
10000:00:00.53576537.76 Kword pairs / s
100000:00:15.03781689.56 Kword pairs / s
\n", "
" ], "text/plain": [ " production_duration production_speed\n", "dictionary_size nonzero_limit \n", "1000 1 00:00:00.000017 1.93 Kword pairs / s\n", " 10 00:00:00.000062 21.50 Kword pairs / s\n", " 100 00:00:00.000408 56.66 Kword pairs / s\n", " 1000 00:00:00.010500 123.82 Kword pairs / s\n", "2010000 1 00:00:00.023495 1.18 Kword pairs / s\n", " 10 00:00:00.035587 6.16 Kword pairs / s\n", " 100 00:00:00.535765 37.76 Kword pairs / s\n", " 1000 00:00:15.037816 89.56 Kword pairs / s" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [1000, len(full_dictionary)], :, :].loc[\n", " :, [\"production_duration\", \"production_speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LevenshteinSimilarityIndex\n", "Next, we measure the speed at which the **LevenshteinSimilarityIndex** builder class produces term similarities. **LevenshteinSimilarityIndex** is currently just a naïve implementation that produces much fewer term similarities per second than the **SparseTermSimilarityMatrix** class is capable of consuming." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " dictionary, nonzero_limit, query_terms, repetition = configuration\n", " \n", " start_time = time()\n", " index = LevenshteinSimilarityIndex(dictionary)\n", " end_time = time()\n", " constructor_duration = end_time - start_time\n", " \n", " start_time = time()\n", " for term in query_terms:\n", " for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n", " pass\n", " end_time = time()\n", " production_duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": len(dictionary),\n", " \"mean_query_term_length\": np.mean([len(term) for term in query_terms]),\n", " \"nonzero_limit\": nonzero_limit,\n", " \"repetition\": repetition,\n", " \"constructor_duration\": constructor_duration,\n", " \"production_duration\": production_duration, }" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "nonzero_limits = [1, 10, 100]\n", "seed(RANDOM_SEED)\n", "min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]\n", "query_terms = sample(list(min_dictionary.values()), 10)\n", "\n", "configurations = product(dictionaries, nonzero_limits, [query_terms], repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.levenshtein\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables show how long it takes to retrieve the most similar terms for ten randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column) and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), and the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column). 
Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n", "\n", "The **production_speed** is proportional to **nonzero_limit / dictionary_size**. The **processing_speed** is constant." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"processing_speed\"] = df.dictionary_size * len(query_terms) / df.production_duration\n", "df[\"production_speed\"] = df.nonzero_limit * len(query_terms) / df.production_duration\n", "df = df.groupby([\"dictionary_size\", \"nonzero_limit\"])\n", "\n", "def display(df):\n", " df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n", " df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n", " df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n", " df[\"production_speed\"] = [\"%.02f word pairs / s\" % speed for speed in df[\"production_speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
production_durationproduction_speedprocessing_speed
dictionary_sizenonzero_limit
1000100:00:00.055994178.61 word pairs / s178.61 Kword pairs / s
1000:00:00.0560971782.70 word pairs / s178.27 Kword pairs / s
10000:00:00.05621217791.65 word pairs / s177.92 Kword pairs / s
1000000100:01:20.6180700.12 word pairs / s124.05 Kword pairs / s
1000:01:20.0482381.25 word pairs / s124.92 Kword pairs / s
10000:01:20.06499912.49 word pairs / s124.90 Kword pairs / s
2010000100:02:44.0693990.06 word pairs / s122.51 Kword pairs / s
1000:02:43.9146010.61 word pairs / s122.63 Kword pairs / s
10000:02:43.8924086.10 word pairs / s122.64 Kword pairs / s
\n", "
" ], "text/plain": [ " production_duration production_speed \\\n", "dictionary_size nonzero_limit \n", "1000 1 00:00:00.055994 178.61 word pairs / s \n", " 10 00:00:00.056097 1782.70 word pairs / s \n", " 100 00:00:00.056212 17791.65 word pairs / s \n", "1000000 1 00:01:20.618070 0.12 word pairs / s \n", " 10 00:01:20.048238 1.25 word pairs / s \n", " 100 00:01:20.064999 12.49 word pairs / s \n", "2010000 1 00:02:44.069399 0.06 word pairs / s \n", " 10 00:02:43.914601 0.61 word pairs / s \n", " 100 00:02:43.892408 6.10 word pairs / s \n", "\n", " processing_speed \n", "dictionary_size nonzero_limit \n", "1000 1 178.61 Kword pairs / s \n", " 10 178.27 Kword pairs / s \n", " 100 177.92 Kword pairs / s \n", "1000000 1 124.05 Kword pairs / s \n", " 10 124.92 Kword pairs / s \n", " 100 124.90 Kword pairs / s \n", "2010000 1 122.51 Kword pairs / s \n", " 10 122.63 Kword pairs / s \n", " 100 122.64 Kword pairs / s " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [1000, 1000000, len(full_dictionary)], :].loc[\n", " :, [\"production_duration\", \"production_speed\", \"processing_speed\"]]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
production_durationproduction_speedprocessing_speed
dictionary_sizenonzero_limit
1000100:00:00.0006732.16 word pairs / s2.16 Kword pairs / s
1000:00:00.00040913.06 word pairs / s1.31 Kword pairs / s
10000:00:00.000621196.80 word pairs / s1.97 Kword pairs / s
1000000100:00:00.8106610.00 word pairs / s1.23 Kword pairs / s
1000:00:00.1100130.00 word pairs / s0.17 Kword pairs / s
10000:00:00.1649590.03 word pairs / s0.26 Kword pairs / s
2010000100:00:01.1592730.00 word pairs / s0.85 Kword pairs / s
1000:00:00.4290110.00 word pairs / s0.32 Kword pairs / s
10000:00:00.4336870.02 word pairs / s0.32 Kword pairs / s
\n", "
" ], "text/plain": [ " production_duration production_speed \\\n", "dictionary_size nonzero_limit \n", "1000 1 00:00:00.000673 2.16 word pairs / s \n", " 10 00:00:00.000409 13.06 word pairs / s \n", " 100 00:00:00.000621 196.80 word pairs / s \n", "1000000 1 00:00:00.810661 0.00 word pairs / s \n", " 10 00:00:00.110013 0.00 word pairs / s \n", " 100 00:00:00.164959 0.03 word pairs / s \n", "2010000 1 00:00:01.159273 0.00 word pairs / s \n", " 10 00:00:00.429011 0.00 word pairs / s \n", " 100 00:00:00.433687 0.02 word pairs / s \n", "\n", " processing_speed \n", "dictionary_size nonzero_limit \n", "1000 1 2.16 Kword pairs / s \n", " 10 1.31 Kword pairs / s \n", " 100 1.97 Kword pairs / s \n", "1000000 1 1.23 Kword pairs / s \n", " 10 0.17 Kword pairs / s \n", " 100 0.26 Kword pairs / s \n", "2010000 1 0.85 Kword pairs / s \n", " 10 0.32 Kword pairs / s \n", " 100 0.32 Kword pairs / s " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [1000, 1000000, len(full_dictionary)], :].loc[\n", " :, [\"production_duration\", \"production_speed\", \"processing_speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### WordEmbeddingSimilarityIndex\n", "Lastly, we measure the speed at which the **WordEmbeddingSimilarityIndex** builder class constructs an instance and produces term similarities. Gensim currently supports slow and precise nearest neighbor search, and also approximate nearest neighbor search using [ANNOY][]. 
We evaluate both options.\n", "\n", " [ANNOY]: https://github.com/spotify/annoy (Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " (model, dictionary), nonzero_limit, annoy_n_trees, query_terms, repetition = configuration\n", " use_annoy = annoy_n_trees > 0\n", " model.init_sims()\n", " \n", " start_time = time()\n", " if use_annoy:\n", " annoy = AnnoyIndexer(model, annoy_n_trees)\n", " kwargs = {\"indexer\": annoy}\n", " else:\n", " kwargs = {}\n", " index = WordEmbeddingSimilarityIndex(model, kwargs=kwargs)\n", " end_time = time()\n", " constructor_duration = end_time - start_time\n", " \n", " start_time = time()\n", " for term in query_terms:\n", " for _j, _k in zip(index.most_similar(term, topn=nonzero_limit), range(nonzero_limit)):\n", " pass\n", " end_time = time()\n", " production_duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": len(dictionary),\n", " \"mean_query_term_length\": np.mean([len(term) for term in query_terms]),\n", " \"nonzero_limit\": nonzero_limit,\n", " \"use_annoy\": use_annoy,\n", " \"annoy_n_trees\": annoy_n_trees,\n", " \"repetition\": repetition,\n", " \"constructor_duration\": constructor_duration,\n", " \"production_duration\": production_duration, }" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "842bb1a60f814110a8f20eb44a973397", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=5), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "models = []\n", "for dictionary in tqdm(dictionaries, desc=\"models\"):\n", " if dictionary == full_dictionary:\n", " 
models.append(full_model)\n", " continue\n", " model = full_model.__class__(full_model.vector_size)\n", " model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}\n", " model.index2entity = []\n", " vector_indices = []\n", " for index, word in enumerate(full_model.index2entity):\n", " if word in model.vocab.keys():\n", " model.index2entity.append(word)\n", " model.vocab[word].index = len(vector_indices)\n", " vector_indices.append(index)\n", " model.vectors = full_model.vectors[vector_indices]\n", " models.append(model)\n", "annoy_n_trees = [0] + [10**k for k in range(3)]\n", "seed(RANDOM_SEED)\n", "query_terms = sample(list(min_dictionary.values()), 1000)\n", "\n", "configurations = product(zip(models, dictionaries), nonzero_limits, annoy_n_trees, [query_terms], repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.builder_results.wordembeddings\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables show how long it takes to construct an ANNOY index and the builder class instance (the **constructor_duration** column), how long it takes to retrieve the most similar terms for 1,000 randomly sampled terms from a dictionary (the **production_duration** column), the mean term similarity production speed (the **production_speed** column) and the mean term similarity processing speed (the **processing_speed** column) as we vary the dictionary size (the **dictionary_size** column), the maximum number of most similar terms that will be retrieved (the **nonzero_limit** column), and the number of constructed ANNOY trees (the **annoy_n_trees** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n", "\n", "If we do not use ANNOY (**annoy_n_trees**${}=0$), then **production_speed** is proportional to **nonzero_limit / dictionary_size**. 
\n", "If we do use ANNOY (**annoy_n_trees**${}>0$), then **production_speed** is proportional to **nonzero_limit / (annoy_n_trees)**${}^{1/2}$." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"processing_speed\"] = df.dictionary_size * len(query_terms) / df.production_duration\n", "df[\"production_speed\"] = df.nonzero_limit * len(query_terms) / df.production_duration\n", "df = df.groupby([\"dictionary_size\", \"nonzero_limit\", \"annoy_n_trees\"])\n", "\n", "def display(df):\n", " df[\"constructor_duration\"] = [timedelta(0, duration) for duration in df[\"constructor_duration\"]]\n", " df[\"production_duration\"] = [timedelta(0, duration) for duration in df[\"production_duration\"]]\n", " df[\"processing_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"processing_speed\"]]\n", " df[\"production_speed\"] = [\"%.02f Kword pairs / s\" % (speed / 1000) for speed in df[\"production_speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
constructor_durationproduction_durationproduction_speedprocessing_speed
dictionary_sizenonzero_limitannoy_n_trees
10000001000:00:00.00000700:00:19.9629770.05 Kword pairs / s50094.22 Kword pairs / s
100:00:30.26879700:00:00.09701110.32 Kword pairs / s10320061.76 Kword pairs / s
10000:06:23.41598200:00:00.1608706.24 Kword pairs / s6236688.27 Kword pairs / s
100000:00:00.00000800:00:22.8683724.37 Kword pairs / s43729.34 Kword pairs / s
100:00:31.15487600:00:00.156238641.91 Kword pairs / s6419086.99 Kword pairs / s
10000:06:23.29057200:00:01.29744577.13 Kword pairs / s771277.71 Kword pairs / s
20100001000:00:00.00000700:01:55.3032160.01 Kword pairs / s17432.79 Kword pairs / s
100:01:34.00419600:00:00.1904635.25 Kword pairs / s10561607.14 Kword pairs / s
10000:23:29.79600600:00:00.3395002.96 Kword pairs / s5954865.50 Kword pairs / s
100000:00:00.00000700:02:11.9268610.76 Kword pairs / s15236.46 Kword pairs / s
100:01:35.81341400:00:00.301120332.38 Kword pairs / s6680879.02 Kword pairs / s
10000:23:05.15539900:00:03.03152733.42 Kword pairs / s671683.05 Kword pairs / s
\n", "
" ], "text/plain": [ " constructor_duration \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 00:00:00.000007 \n", " 1 00:00:30.268797 \n", " 100 00:06:23.415982 \n", " 100 0 00:00:00.000008 \n", " 1 00:00:31.154876 \n", " 100 00:06:23.290572 \n", "2010000 1 0 00:00:00.000007 \n", " 1 00:01:34.004196 \n", " 100 00:23:29.796006 \n", " 100 0 00:00:00.000007 \n", " 1 00:01:35.813414 \n", " 100 00:23:05.155399 \n", "\n", " production_duration \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 00:00:19.962977 \n", " 1 00:00:00.097011 \n", " 100 00:00:00.160870 \n", " 100 0 00:00:22.868372 \n", " 1 00:00:00.156238 \n", " 100 00:00:01.297445 \n", "2010000 1 0 00:01:55.303216 \n", " 1 00:00:00.190463 \n", " 100 00:00:00.339500 \n", " 100 0 00:02:11.926861 \n", " 1 00:00:00.301120 \n", " 100 00:00:03.031527 \n", "\n", " production_speed \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 0.05 Kword pairs / s \n", " 1 10.32 Kword pairs / s \n", " 100 6.24 Kword pairs / s \n", " 100 0 4.37 Kword pairs / s \n", " 1 641.91 Kword pairs / s \n", " 100 77.13 Kword pairs / s \n", "2010000 1 0 0.01 Kword pairs / s \n", " 1 5.25 Kword pairs / s \n", " 100 2.96 Kword pairs / s \n", " 100 0 0.76 Kword pairs / s \n", " 1 332.38 Kword pairs / s \n", " 100 33.42 Kword pairs / s \n", "\n", " processing_speed \n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 50094.22 Kword pairs / s \n", " 1 10320061.76 Kword pairs / s \n", " 100 6236688.27 Kword pairs / s \n", " 100 0 43729.34 Kword pairs / s \n", " 1 6419086.99 Kword pairs / s \n", " 100 771277.71 Kword pairs / s \n", "2010000 1 0 17432.79 Kword pairs / s \n", " 1 10561607.14 Kword pairs / s \n", " 100 5954865.50 Kword pairs / s \n", " 100 0 15236.46 Kword pairs / s \n", " 1 6680879.02 Kword pairs / s \n", " 100 671683.05 Kword pairs / s " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " 
[1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[\n", " :, [\"constructor_duration\", \"production_duration\", \"production_speed\", \"processing_speed\"]]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
constructor_durationproduction_durationproduction_speedprocessing_speed
dictionary_sizenonzero_limitannoy_n_trees
10000001000:00:00.00000200:00:00.1156440.00 Kword pairs / s286.27 Kword pairs / s
100:00:01.85409700:00:00.0035170.37 Kword pairs / s367959.55 Kword pairs / s
10000:00:04.70203500:00:00.0104440.35 Kword pairs / s350506.05 Kword pairs / s
100000:00:00.00000200:00:00.1048720.02 Kword pairs / s198.86 Kword pairs / s
100:00:01.16367800:00:00.00893936.14 Kword pairs / s361441.71 Kword pairs / s
10000:00:06.81856800:00:00.0369792.07 Kword pairs / s20741.69 Kword pairs / s
20100001000:00:00.00000100:00:00.6531770.00 Kword pairs / s97.50 Kword pairs / s
100:00:04.67720900:00:00.0056790.16 Kword pairs / s311832.91 Kword pairs / s
10000:01:38.56268400:00:00.0298870.22 Kword pairs / s434681.25 Kword pairs / s
100000:00:00.00000100:00:00.9796130.01 Kword pairs / s111.85 Kword pairs / s
100:00:03.20747400:00:00.00947910.18 Kword pairs / s204614.80 Kword pairs / s
10000:00:55.11959500:00:00.4195313.46 Kword pairs / s69543.35 Kword pairs / s
\n", "
" ], "text/plain": [ " constructor_duration \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 00:00:00.000002 \n", " 1 00:00:01.854097 \n", " 100 00:00:04.702035 \n", " 100 0 00:00:00.000002 \n", " 1 00:00:01.163678 \n", " 100 00:00:06.818568 \n", "2010000 1 0 00:00:00.000001 \n", " 1 00:00:04.677209 \n", " 100 00:01:38.562684 \n", " 100 0 00:00:00.000001 \n", " 1 00:00:03.207474 \n", " 100 00:00:55.119595 \n", "\n", " production_duration \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 00:00:00.115644 \n", " 1 00:00:00.003517 \n", " 100 00:00:00.010444 \n", " 100 0 00:00:00.104872 \n", " 1 00:00:00.008939 \n", " 100 00:00:00.036979 \n", "2010000 1 0 00:00:00.653177 \n", " 1 00:00:00.005679 \n", " 100 00:00:00.029887 \n", " 100 0 00:00:00.979613 \n", " 1 00:00:00.009479 \n", " 100 00:00:00.419531 \n", "\n", " production_speed \\\n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 0.00 Kword pairs / s \n", " 1 0.37 Kword pairs / s \n", " 100 0.35 Kword pairs / s \n", " 100 0 0.02 Kword pairs / s \n", " 1 36.14 Kword pairs / s \n", " 100 2.07 Kword pairs / s \n", "2010000 1 0 0.00 Kword pairs / s \n", " 1 0.16 Kword pairs / s \n", " 100 0.22 Kword pairs / s \n", " 100 0 0.01 Kword pairs / s \n", " 1 10.18 Kword pairs / s \n", " 100 3.46 Kword pairs / s \n", "\n", " processing_speed \n", "dictionary_size nonzero_limit annoy_n_trees \n", "1000000 1 0 286.27 Kword pairs / s \n", " 1 367959.55 Kword pairs / s \n", " 100 350506.05 Kword pairs / s \n", " 100 0 198.86 Kword pairs / s \n", " 1 361441.71 Kword pairs / s \n", " 100 20741.69 Kword pairs / s \n", "2010000 1 0 97.50 Kword pairs / s \n", " 1 311832.91 Kword pairs / s \n", " 100 434681.25 Kword pairs / s \n", " 100 0 111.85 Kword pairs / s \n", " 1 204614.80 Kword pairs / s \n", " 100 69543.35 Kword pairs / s " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - 
x.mean()).std())).loc[\n", " [1000000, len(full_dictionary)], [1, 100], [0, 1, 100]].loc[\n", " :, [\"constructor_duration\", \"production_duration\", \"production_speed\", \"processing_speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implement fast SCM between corpora\n", "\n", "In Gensim PR [#1827][], we added a base implementation of the soft cosine measure (SCM). The base implementation would compute SCM between single documents using the **softcossim** function. In Gensim PR [#2016][], we introduced the **SparseTermSimilarityMatrix.inner_product** method, which computes SCM not only between single documents, but also between a document and a corpus, and between two corpora.\n", "\n", "For the measurements, we use the [Google News word embeddings][word2vec-google-news-300] distributed with the C implementation of Word2Vec. From the word embeddings, we will derive a dictionary of 2.01M terms. As a corpus, we will use a random sample of 100K articles from the 4.92M English [Wikipedia articles][enwiki].\n", "\n", " [word2vec-google-news-300]: https://github.com/mmihaltz/word2vec-GoogleNews-vectors (word2vec-GoogleNews-vectors)\n", " [enwiki]: https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 (wiki-english-20171001)\n", " [#1827]: https://github.com/RaRe-Technologies/gensim/pull/1827 (Implement Soft Cosine Measure - Pull Request #1827)\n", " [#2016]: https://github.com/RaRe-Technologies/gensim/pull/2016 (Implement Levenshtein term similarity matrix and fast SCM between corpora - Pull Request #2016)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "full_model = api.load(\"word2vec-google-news-300\")\n", "\n", "try:\n", " with open(\"matrix_speed.corpus\", \"rb\") as file:\n", " full_corpus = pickle.load(file) \n", "except IOError:\n", " original_corpus = list(tqdm(api.load(\"wiki-english-20171001\"), desc=\"original_corpus\", total=4924894))\n", 
seed(RANDOM_SEED)\n", " full_corpus = [\n", " simple_preprocess(u'\\n'.join(article[\"section_texts\"]))\n", " for article in tqdm(sample(original_corpus, 10**5), desc=\"full_corpus\", total=10**5)]\n", " del original_corpus\n", " with open(\"matrix_speed.corpus\", \"wb\") as file:\n", " pickle.dump(full_corpus, file)\n", "\n", "try:\n", " full_dictionary = Dictionary.load(\"matrix_speed.dictionary\")\n", "except IOError:\n", " full_dictionary = Dictionary([[term] for term in full_model.vocab.keys()])\n", " full_dictionary.save(\"matrix_speed.dictionary\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SCM between two documents\n", "First, we measure the speed at which the **inner_product** method produces term similarities between single documents." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n", " corpus_size = len(corpus)\n", " corpus = [dictionary.doc2bow(doc) for doc in corpus]\n", " corpus = [vec for vec in corpus if len(vec) > 0]\n", " \n", " start_time = time()\n", " for vec1 in corpus:\n", " for vec2 in corpus:\n", " matrix.inner_product(vec1, vec2, normalized=normalized)\n", " end_time = time()\n", " duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": matrix.matrix.shape[0],\n", " \"matrix_nonzero\": matrix.matrix.nnz,\n", " \"nonzero_limit\": nonzero_limit,\n", " \"normalized\": normalized,\n", " \"corpus_size\": corpus_size,\n", " \"corpus_actual_size\": len(corpus),\n", " \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n", " \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n", " \"repetition\": repetition,\n", " \"duration\": duration, }" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": 
"110675d5552847819754f0dc5b1c19e1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "744e400d597440f79b5923dafb1974fc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0f84efc0c79a4628a9543736fc5f0c9a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8a185a8e530e4481b90056222f5f0a1c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=6), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "seed(RANDOM_SEED)\n", "dictionary_sizes = [1000, 100000]\n", "dictionaries = []\n", "for size in tqdm(dictionary_sizes, desc=\"dictionaries\"):\n", " dictionary = Dictionary([sample(list(full_dictionary.values()), size)])\n", " dictionaries.append(dictionary)\n", "min_dictionary = sorted((len(dictionary), dictionary) for dictionary in dictionaries)[0][1]\n", "\n", "corpus_sizes = [100, 1000]\n", "corpora = []\n", "for size in tqdm(corpus_sizes, desc=\"corpora\"):\n", " corpus = sample(full_corpus, size)\n", " corpora.append(corpus)\n", "\n", "models = []\n", "for dictionary in tqdm(dictionaries, desc=\"models\"):\n", " if dictionary == full_dictionary:\n", " models.append(full_model)\n", " continue\n", " model = full_model.__class__(full_model.vector_size)\n", " model.vocab = {word: deepcopy(full_model.vocab[word]) for word in dictionary.values()}\n", " model.index2entity = []\n", " vector_indices = []\n", " for index, word in enumerate(full_model.index2entity):\n", " if word in model.vocab.keys():\n", " model.index2entity.append(word)\n", " model.vocab[word].index = len(vector_indices)\n", " vector_indices.append(index)\n", " model.vectors = full_model.vectors[vector_indices]\n", " models.append(model)\n", "\n", "nonzero_limits = [1, 10, 100]\n", "matrices = []\n", "for (model, dictionary), nonzero_limit in tqdm(\n", " list(product(zip(models, dictionaries), nonzero_limits)), desc=\"matrices\"):\n", " annoy = AnnoyIndexer(model, 1)\n", " index = WordEmbeddingSimilarityIndex(model, kwargs={\"indexer\": annoy})\n", " matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)\n", " matrices.append((matrix, dictionary, nonzero_limit))\n", " del annoy\n", "\n", "normalization = (True, False)\n", "repetitions = range(10)" ] }, { "cell_type": "code", 
"execution_count": 28, "metadata": {}, "outputs": [], "source": [ "configurations = product(matrices, corpora, normalization, repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.doc_doc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following tables show how long the **inner_product** method takes to compute inner products between all pairs of document vectors in a corpus (the **duration** column), how many nonzero elements there are in a corpus matrix (the **corpus_nonzero** column), how many nonzero elements there are in a term similarity matrix (the **matrix_nonzero** column) and the mean document similarity production speed (the **speed** column) as we vary the dictionary size (the **dictionary_size** column), the size of the corpus (the **corpus_size** column), the maximum number of nonzero elements in a single column of the matrix (the **nonzero_limit** column), and whether the inner products are normalized (the **normalized** column). Ten independent measurements were taken. The top table shows the mean values and the bottom table shows the standard deviations.\n", "\n", "The **speed** is inversely proportional to the square of the number of unique terms shared by the two document vectors. In our scenario as well as the standard IR scenario, this means **speed** is constant. Computing a normalized inner product (**normalized**${}={}$True) results in a constant speed decrease."
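The arithmetic behind the **inner_product** method can be sketched in plain NumPy: the plain inner product is the bilinear form ⟨x, y⟩ = xᵀSy over a term similarity matrix S, and the normalized variant (the soft cosine measure) divides by the S-norms of both vectors. The following is a minimal sketch with a toy, hypothetical 3-term similarity matrix and dense bag-of-words vectors, not the gensim API itself:

```python
import numpy as np

# Toy 3-term similarity matrix S (symmetric, unit diagonal) and two
# bag-of-words document vectors x and y over the same dictionary.
# Values are hypothetical, for illustration only.
S = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.0, 1.0])
y = np.array([0.0, 2.0, 1.0])

# inner_product(x, y, normalized=False) corresponds to the bilinear form x^T S y:
inner = x @ S @ y  # 2.0

# normalized=True divides by the S-norms of both vectors, yielding the
# soft cosine measure, which lies in [-1, 1] for a positive semi-definite S:
soft_cosine = inner / np.sqrt((x @ S @ x) * (y @ S @ y))  # ≈ 0.632
```

In the actual benchmark, the same computation runs over sparse bag-of-words vectors produced by `dictionary.doc2bow` and the sparse matrix stored in `SparseTermSimilarityMatrix.matrix`.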
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n", "del df[\"corpus_actual_size\"]\n", "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n", "\n", "def display(df):\n", " df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n", " df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0073833.01000.01.23 Kdoc pairs / s
True00:00:00.0090283.01000.01.01 Kdoc pairs / s
100False00:00:00.0076573.084944.01.19 Kdoc pairs / s
True00:00:00.0082383.084944.01.10 Kdoc pairs / s
10001False00:00:00.41436426.01000.01.39 Kdoc pairs / s
True00:00:00.47378926.01000.01.22 Kdoc pairs / s
100False00:00:00.43083326.084944.01.35 Kdoc pairs / s
True00:00:00.45347726.084944.01.27 Kdoc pairs / s
1000001001False00:00:05.236376423.0101868.01.29 Kdoc pairs / s
True00:00:05.623463423.0101868.01.20 Kdoc pairs / s
100False00:00:05.083829423.08202884.01.33 Kdoc pairs / s
True00:00:05.576003423.08202884.01.21 Kdoc pairs / s
10001False00:08:59.2853475162.0101868.01.26 Kdoc pairs / s
True00:09:57.6932195162.0101868.01.14 Kdoc pairs / s
100False00:09:23.2134505162.08202884.01.21 Kdoc pairs / s
True00:10:10.6124585162.08202884.01.12 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.007383 \n", " True 00:00:00.009028 \n", " 100 False 00:00:00.007657 \n", " True 00:00:00.008238 \n", " 1000 1 False 00:00:00.414364 \n", " True 00:00:00.473789 \n", " 100 False 00:00:00.430833 \n", " True 00:00:00.453477 \n", "100000 100 1 False 00:00:05.236376 \n", " True 00:00:05.623463 \n", " 100 False 00:00:05.083829 \n", " True 00:00:05.576003 \n", " 1000 1 False 00:08:59.285347 \n", " True 00:09:57.693219 \n", " 100 False 00:09:23.213450 \n", " True 00:10:10.612458 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 3.0 \n", " True 3.0 \n", " 100 False 3.0 \n", " True 3.0 \n", " 1000 1 False 26.0 \n", " True 26.0 \n", " 100 False 26.0 \n", " True 26.0 \n", "100000 100 1 False 423.0 \n", " True 423.0 \n", " 100 False 423.0 \n", " True 423.0 \n", " 1000 1 False 5162.0 \n", " True 5162.0 \n", " 100 False 5162.0 \n", " True 5162.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1000.0 \n", " True 1000.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", " 1000 1 False 1000.0 \n", " True 1000.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", "100000 100 1 False 101868.0 \n", " True 101868.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", " 1000 1 False 101868.0 \n", " True 101868.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", "\n", " speed \n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1.23 Kdoc pairs / s \n", " True 1.01 Kdoc pairs / s \n", " 100 False 1.19 Kdoc pairs / s \n", " True 1.10 Kdoc pairs / s \n", " 1000 1 False 1.39 Kdoc pairs / s \n", " True 1.22 Kdoc pairs / s \n", " 100 False 1.35 Kdoc pairs / s \n", " True 1.27 Kdoc pairs / s \n", "100000 100 1 False 1.29 Kdoc pairs / s \n", " True 1.20 Kdoc pairs / s \n", " 100 False 1.33 Kdoc pairs / s \n", " True 
1.21 Kdoc pairs / s \n", " 1000 1 False 1.26 Kdoc pairs / s \n", " True 1.14 Kdoc pairs / s \n", " 100 False 1.21 Kdoc pairs / s \n", " True 1.12 Kdoc pairs / s " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [1000, 100000], :, [1, 100], :].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0008710.00.00.13 Kdoc pairs / s
True00:00:00.0013150.00.00.14 Kdoc pairs / s
100False00:00:00.0008930.00.00.12 Kdoc pairs / s
True00:00:00.0006310.00.00.08 Kdoc pairs / s
10001False00:00:00.0144600.00.00.05 Kdoc pairs / s
True00:00:00.0252500.00.00.07 Kdoc pairs / s
100False00:00:00.0390880.00.00.11 Kdoc pairs / s
True00:00:00.0236020.00.00.06 Kdoc pairs / s
1000001001False00:00:00.2763590.00.00.07 Kdoc pairs / s
True00:00:00.2788060.00.00.06 Kdoc pairs / s
100False00:00:00.2867810.00.00.07 Kdoc pairs / s
True00:00:00.3133970.00.00.06 Kdoc pairs / s
10001False00:00:14.3211010.00.00.03 Kdoc pairs / s
True00:00:23.5261040.00.00.05 Kdoc pairs / s
100False00:00:05.8995270.00.00.01 Kdoc pairs / s
True00:00:24.4544220.00.00.05 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.000871 \n", " True 00:00:00.001315 \n", " 100 False 00:00:00.000893 \n", " True 00:00:00.000631 \n", " 1000 1 False 00:00:00.014460 \n", " True 00:00:00.025250 \n", " 100 False 00:00:00.039088 \n", " True 00:00:00.023602 \n", "100000 100 1 False 00:00:00.276359 \n", " True 00:00:00.278806 \n", " 100 False 00:00:00.286781 \n", " True 00:00:00.313397 \n", " 1000 1 False 00:00:14.321101 \n", " True 00:00:23.526104 \n", " 100 False 00:00:05.899527 \n", " True 00:00:24.454422 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " speed \n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.13 Kdoc pairs / s \n", " True 0.14 Kdoc pairs / s \n", " 100 False 0.12 Kdoc pairs / s \n", " True 0.08 Kdoc pairs / s \n", " 1000 1 False 0.05 Kdoc pairs / s \n", " True 0.07 Kdoc pairs / s \n", " 100 False 0.11 Kdoc pairs / s \n", " True 0.06 Kdoc pairs / s \n", "100000 100 1 False 0.07 Kdoc pairs / s \n", " True 0.06 Kdoc pairs / s \n", " 100 False 0.07 Kdoc pairs / s \n", " True 0.06 Kdoc pairs / s \n", " 1000 1 False 0.03 Kdoc pairs / s \n", " True 0.05 Kdoc pairs / s \n", " 
100 False 0.01 Kdoc pairs / s \n", " True 0.05 Kdoc pairs / s " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [1000, 100000], :, [1, 100], :].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SCM between a document and a corpus\n", "Next, we measure the speed at which the **inner_product** method computes soft cosine similarities between individual documents and a corpus." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n", " corpus_size = len(corpus)\n", " corpus = [dictionary.doc2bow(doc) for doc in corpus if doc]\n", " \n", " start_time = time()\n", " for vec in corpus:\n", " matrix.inner_product(vec, corpus, normalized=normalized)\n", " end_time = time()\n", " duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": matrix.matrix.shape[0],\n", " \"matrix_nonzero\": matrix.matrix.nnz,\n", " \"nonzero_limit\": nonzero_limit,\n", " \"normalized\": normalized,\n", " \"corpus_size\": corpus_size,\n", " \"corpus_actual_size\": len(corpus),\n", " \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n", " \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n", " \"repetition\": repetition,\n", " \"duration\": duration, }" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "configurations = product(matrices, corpora, normalization, repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.doc_corpus\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **speed** is inversely proportional to **matrix_nonzero**. 
Computing a normalized inner product (**normalized**${}={}$True) incurs a constant-factor slowdown." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n", "del df[\"corpus_actual_size\"]\n", "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n", "\n", "def display(df):\n", " df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n", " df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0093633.01000.01117.12 Kdoc pairs / s
True00:00:00.0109483.01000.0954.13 Kdoc pairs / s
100False00:00:00.0141283.084944.0728.91 Kdoc pairs / s
True00:00:00.0181643.084944.0551.78 Kdoc pairs / s
10001False00:00:00.07209126.01000.013872.12 Kdoc pairs / s
True00:00:00.07928426.01000.012615.36 Kdoc pairs / s
100False00:00:00.16248326.084944.06188.43 Kdoc pairs / s
True00:00:00.20308126.084944.04924.48 Kdoc pairs / s
1000001001False00:00:00.278253423.0101868.036.05 Kdoc pairs / s
True00:00:00.298519423.0101868.033.56 Kdoc pairs / s
100False00:00:36.326167423.08202884.00.28 Kdoc pairs / s
True00:00:36.928802423.08202884.00.27 Kdoc pairs / s
10001False00:00:07.4033015162.0101868.0135.08 Kdoc pairs / s
True00:00:07.7949435162.0101868.0128.29 Kdoc pairs / s
100False00:05:55.6747125162.08202884.02.81 Kdoc pairs / s
True00:06:05.5613985162.08202884.02.74 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.009363 \n", " True 00:00:00.010948 \n", " 100 False 00:00:00.014128 \n", " True 00:00:00.018164 \n", " 1000 1 False 00:00:00.072091 \n", " True 00:00:00.079284 \n", " 100 False 00:00:00.162483 \n", " True 00:00:00.203081 \n", "100000 100 1 False 00:00:00.278253 \n", " True 00:00:00.298519 \n", " 100 False 00:00:36.326167 \n", " True 00:00:36.928802 \n", " 1000 1 False 00:00:07.403301 \n", " True 00:00:07.794943 \n", " 100 False 00:05:55.674712 \n", " True 00:06:05.561398 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 3.0 \n", " True 3.0 \n", " 100 False 3.0 \n", " True 3.0 \n", " 1000 1 False 26.0 \n", " True 26.0 \n", " 100 False 26.0 \n", " True 26.0 \n", "100000 100 1 False 423.0 \n", " True 423.0 \n", " 100 False 423.0 \n", " True 423.0 \n", " 1000 1 False 5162.0 \n", " True 5162.0 \n", " 100 False 5162.0 \n", " True 5162.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1000.0 \n", " True 1000.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", " 1000 1 False 1000.0 \n", " True 1000.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", "100000 100 1 False 101868.0 \n", " True 101868.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", " 1000 1 False 101868.0 \n", " True 101868.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", "\n", " speed \n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1117.12 Kdoc pairs / s \n", " True 954.13 Kdoc pairs / s \n", " 100 False 728.91 Kdoc pairs / s \n", " True 551.78 Kdoc pairs / s \n", " 1000 1 False 13872.12 Kdoc pairs / s \n", " True 12615.36 Kdoc pairs / s \n", " 100 False 6188.43 Kdoc pairs / s \n", " True 4924.48 Kdoc pairs / s \n", "100000 100 1 False 36.05 Kdoc pairs / s \n", " True 33.56 Kdoc pairs / s \n", " 100 False 0.28 Kdoc 
pairs / s \n", " True 0.27 Kdoc pairs / s \n", " 1000 1 False 135.08 Kdoc pairs / s \n", " True 128.29 Kdoc pairs / s \n", " 100 False 2.81 Kdoc pairs / s \n", " True 2.74 Kdoc pairs / s " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [1000, 100000], :, [1, 100], :].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0021200.00.0242.09 Kdoc pairs / s
True00:00:00.0023870.00.0207.64 Kdoc pairs / s
100False00:00:00.0025310.00.0130.94 Kdoc pairs / s
True00:00:00.0009110.00.027.68 Kdoc pairs / s
10001False00:00:00.0005870.00.0112.92 Kdoc pairs / s
True00:00:00.0011910.00.0187.31 Kdoc pairs / s
100False00:00:00.0119440.00.0513.79 Kdoc pairs / s
True00:00:00.0017930.00.043.54 Kdoc pairs / s
1000001001False00:00:00.0161560.00.02.06 Kdoc pairs / s
True00:00:00.0134510.00.01.47 Kdoc pairs / s
100False00:00:01.3397870.00.00.01 Kdoc pairs / s
True00:00:01.6173400.00.00.01 Kdoc pairs / s
10001False00:00:00.0389610.00.00.71 Kdoc pairs / s
True00:00:00.0241540.00.00.40 Kdoc pairs / s
100False00:00:07.6048050.00.00.06 Kdoc pairs / s
True00:00:14.7995190.00.00.10 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.002120 \n", " True 00:00:00.002387 \n", " 100 False 00:00:00.002531 \n", " True 00:00:00.000911 \n", " 1000 1 False 00:00:00.000587 \n", " True 00:00:00.001191 \n", " 100 False 00:00:00.011944 \n", " True 00:00:00.001793 \n", "100000 100 1 False 00:00:00.016156 \n", " True 00:00:00.013451 \n", " 100 False 00:00:01.339787 \n", " True 00:00:01.617340 \n", " 1000 1 False 00:00:00.038961 \n", " True 00:00:00.024154 \n", " 100 False 00:00:07.604805 \n", " True 00:00:14.799519 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " speed \n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 242.09 Kdoc pairs / s \n", " True 207.64 Kdoc pairs / s \n", " 100 False 130.94 Kdoc pairs / s \n", " True 27.68 Kdoc pairs / s \n", " 1000 1 False 112.92 Kdoc pairs / s \n", " True 187.31 Kdoc pairs / s \n", " 100 False 513.79 Kdoc pairs / s \n", " True 43.54 Kdoc pairs / s \n", "100000 100 1 False 2.06 Kdoc pairs / s \n", " True 1.47 Kdoc pairs / s \n", " 100 False 0.01 Kdoc pairs / s \n", " True 0.01 Kdoc pairs / s \n", " 1000 1 False 0.71 Kdoc pairs / s \n", " True 0.40 Kdoc 
pairs / s \n", " 100 False 0.06 Kdoc pairs / s \n", " True 0.10 Kdoc pairs / s " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [1000, 100000], :, [1, 100], :].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SCM between two corpora\n", "Lastly, we measure the speed at which the **inner_product** method computes soft cosine similarities between entire corpora." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def benchmark(configuration):\n", " (matrix, dictionary, nonzero_limit), corpus, normalized, repetition = configuration\n", " corpus_size = len(corpus)\n", " corpus = [dictionary.doc2bow(doc) for doc in corpus]\n", " corpus = [vec for vec in corpus if len(vec) > 0]\n", " \n", " start_time = time()\n", " matrix.inner_product(corpus, corpus, normalized=normalized)\n", " end_time = time()\n", " duration = end_time - start_time\n", " \n", " return {\n", " \"dictionary_size\": matrix.matrix.shape[0],\n", " \"matrix_nonzero\": matrix.matrix.nnz,\n", " \"nonzero_limit\": nonzero_limit,\n", " \"normalized\": normalized,\n", " \"corpus_size\": corpus_size,\n", " \"corpus_actual_size\": len(corpus),\n", " \"corpus_nonzero\": sum(len(vec) for vec in corpus),\n", " \"mean_document_length\": np.mean([len(doc) for doc in corpus]),\n", " \"repetition\": repetition,\n", " \"duration\": duration, }" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "84e1344be5d944fa98368e6b3994944a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, max=2), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ 
"/mnt/storage/home/novotny/.virtualenvs/gensim/lib/python3.4/site-packages/gensim/matutils.py:738: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", " if np.issubdtype(vec.dtype, np.int):\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "nonzero_limits = [1000]\n", "dense_matrices = []\n", "for (model, dictionary), nonzero_limit in tqdm(\n", " list(product(zip(models, dictionaries), nonzero_limits)), desc=\"matrices\"):\n", " annoy = AnnoyIndexer(model, 1)\n", " index = WordEmbeddingSimilarityIndex(model, kwargs={\"indexer\": annoy})\n", " matrix = SparseTermSimilarityMatrix(index, dictionary, nonzero_limit=nonzero_limit)\n", " dense_matrices.append((matrix, dictionary, nonzero_limit))\n", " del annoy" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "configurations = product(matrices + dense_matrices, corpora + [full_corpus], normalization, repetitions)\n", "results = benchmark_results(benchmark, configurations, \"matrix_speed.inner-product_results.corpus_corpus\")" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(results)\n", "df[\"speed\"] = df.corpus_actual_size**2 / df.duration\n", "del df[\"corpus_actual_size\"]\n", "df = df.groupby([\"dictionary_size\", \"corpus_size\", \"nonzero_limit\", \"normalized\"])\n", "\n", "def display(df):\n", " df[\"duration\"] = [timedelta(0, duration) for duration in df[\"duration\"]]\n", " df[\"speed\"] = [\"%.02f Kdoc pairs / s\" % (speed / 1000) for speed in df[\"speed\"]]\n", " return df" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0014033.01000.06.69 Kdoc pairs / s
True00:00:00.0053133.01000.01.70 Kdoc pairs / s
10False00:00:00.0015653.08634.05.80 Kdoc pairs / s
True00:00:00.0053073.08634.01.70 Kdoc pairs / s
100False00:00:00.0031723.084944.03.05 Kdoc pairs / s
True00:00:00.0084613.084944.01.07 Kdoc pairs / s
1000False00:00:00.0213773.0838588.00.42 Kdoc pairs / s
True00:00:00.0552343.0838588.00.16 Kdoc pairs / s
10001False00:00:00.00137626.01000.0418.61 Kdoc pairs / s
True00:00:00.00501926.01000.0114.78 Kdoc pairs / s
10False00:00:00.00151126.08634.0381.50 Kdoc pairs / s
True00:00:00.00520826.08634.0110.60 Kdoc pairs / s
100False00:00:00.00353926.084944.0164.03 Kdoc pairs / s
True00:00:00.00850226.084944.067.81 Kdoc pairs / s
1000False00:00:00.02154826.0838588.026.73 Kdoc pairs / s
True00:00:00.05442526.0838588.010.59 Kdoc pairs / s
1000001False00:00:00.0199152914.01000.0391443.20 Kdoc pairs / s
True00:00:00.0261182914.01000.0298377.75 Kdoc pairs / s
10False00:00:00.0201522914.08634.0386722.55 Kdoc pairs / s
True00:00:00.0269982914.08634.0288567.14 Kdoc pairs / s
100False00:00:00.0283452914.084944.0274905.36 Kdoc pairs / s
True00:00:00.0410692914.084944.0189709.57 Kdoc pairs / s
1000False00:00:00.0899782914.0838588.086598.15 Kdoc pairs / s
True00:00:00.1856112914.0838588.041971.58 Kdoc pairs / s
1000001001False00:00:00.003345423.0101868.02013.92 Kdoc pairs / s
True00:00:00.008857423.0101868.0760.13 Kdoc pairs / s
10False00:00:00.032639423.0814154.0206.66 Kdoc pairs / s
True00:00:00.080591423.0814154.083.46 Kdoc pairs / s
100False00:00:00.488467423.08202884.013.77 Kdoc pairs / s
True00:00:01.454507423.08202884.04.62 Kdoc pairs / s
1000False00:00:04.973667423.089912542.01.35 Kdoc pairs / s
True00:00:15.035711423.089912542.00.45 Kdoc pairs / s
10001False00:00:00.0101415162.0101868.067139.73 Kdoc pairs / s
True00:00:00.0166855162.0101868.040798.02 Kdoc pairs / s
10False00:00:00.0413925162.0814154.016444.18 Kdoc pairs / s
True00:00:00.0916865162.0814154.07425.08 Kdoc pairs / s
100False00:00:00.5089165162.08202884.01338.94 Kdoc pairs / s
True00:00:01.4975565162.08202884.0454.49 Kdoc pairs / s
1000False00:00:05.1014895162.089912542.0133.44 Kdoc pairs / s
True00:00:15.3254155162.089912542.044.42 Kdoc pairs / s
1000001False00:00:37.145526525310.0101868.0192578.80 Kdoc pairs / s
True00:00:45.729004525310.0101868.0156431.36 Kdoc pairs / s
10False00:00:44.981806525310.0814154.0159029.88 Kdoc pairs / s
True00:00:54.245450525310.0814154.0131871.88 Kdoc pairs / s
100False00:01:15.925860525310.08202884.094216.21 Kdoc pairs / s
True00:01:29.232076525310.08202884.080177.08 Kdoc pairs / s
1000False00:03:17.140191525310.089912542.036286.25 Kdoc pairs / s
True00:04:05.865666525310.089912542.029097.14 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.001403 \n", " True 00:00:00.005313 \n", " 10 False 00:00:00.001565 \n", " True 00:00:00.005307 \n", " 100 False 00:00:00.003172 \n", " True 00:00:00.008461 \n", " 1000 False 00:00:00.021377 \n", " True 00:00:00.055234 \n", " 1000 1 False 00:00:00.001376 \n", " True 00:00:00.005019 \n", " 10 False 00:00:00.001511 \n", " True 00:00:00.005208 \n", " 100 False 00:00:00.003539 \n", " True 00:00:00.008502 \n", " 1000 False 00:00:00.021548 \n", " True 00:00:00.054425 \n", " 100000 1 False 00:00:00.019915 \n", " True 00:00:00.026118 \n", " 10 False 00:00:00.020152 \n", " True 00:00:00.026998 \n", " 100 False 00:00:00.028345 \n", " True 00:00:00.041069 \n", " 1000 False 00:00:00.089978 \n", " True 00:00:00.185611 \n", "100000 100 1 False 00:00:00.003345 \n", " True 00:00:00.008857 \n", " 10 False 00:00:00.032639 \n", " True 00:00:00.080591 \n", " 100 False 00:00:00.488467 \n", " True 00:00:01.454507 \n", " 1000 False 00:00:04.973667 \n", " True 00:00:15.035711 \n", " 1000 1 False 00:00:00.010141 \n", " True 00:00:00.016685 \n", " 10 False 00:00:00.041392 \n", " True 00:00:00.091686 \n", " 100 False 00:00:00.508916 \n", " True 00:00:01.497556 \n", " 1000 False 00:00:05.101489 \n", " True 00:00:15.325415 \n", " 100000 1 False 00:00:37.145526 \n", " True 00:00:45.729004 \n", " 10 False 00:00:44.981806 \n", " True 00:00:54.245450 \n", " 100 False 00:01:15.925860 \n", " True 00:01:29.232076 \n", " 1000 False 00:03:17.140191 \n", " True 00:04:05.865666 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 3.0 \n", " True 3.0 \n", " 10 False 3.0 \n", " True 3.0 \n", " 100 False 3.0 \n", " True 3.0 \n", " 1000 False 3.0 \n", " True 3.0 \n", " 1000 1 False 26.0 \n", " True 26.0 \n", " 10 False 26.0 \n", " True 26.0 \n", " 100 False 26.0 \n", " True 26.0 \n", " 1000 False 26.0 \n", " True 26.0 \n", " 
100000 1 False 2914.0 \n", " True 2914.0 \n", " 10 False 2914.0 \n", " True 2914.0 \n", " 100 False 2914.0 \n", " True 2914.0 \n", " 1000 False 2914.0 \n", " True 2914.0 \n", "100000 100 1 False 423.0 \n", " True 423.0 \n", " 10 False 423.0 \n", " True 423.0 \n", " 100 False 423.0 \n", " True 423.0 \n", " 1000 False 423.0 \n", " True 423.0 \n", " 1000 1 False 5162.0 \n", " True 5162.0 \n", " 10 False 5162.0 \n", " True 5162.0 \n", " 100 False 5162.0 \n", " True 5162.0 \n", " 1000 False 5162.0 \n", " True 5162.0 \n", " 100000 1 False 525310.0 \n", " True 525310.0 \n", " 10 False 525310.0 \n", " True 525310.0 \n", " 100 False 525310.0 \n", " True 525310.0 \n", " 1000 False 525310.0 \n", " True 525310.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1000.0 \n", " True 1000.0 \n", " 10 False 8634.0 \n", " True 8634.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", " 1000 False 838588.0 \n", " True 838588.0 \n", " 1000 1 False 1000.0 \n", " True 1000.0 \n", " 10 False 8634.0 \n", " True 8634.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", " 1000 False 838588.0 \n", " True 838588.0 \n", " 100000 1 False 1000.0 \n", " True 1000.0 \n", " 10 False 8634.0 \n", " True 8634.0 \n", " 100 False 84944.0 \n", " True 84944.0 \n", " 1000 False 838588.0 \n", " True 838588.0 \n", "100000 100 1 False 101868.0 \n", " True 101868.0 \n", " 10 False 814154.0 \n", " True 814154.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", " 1000 False 89912542.0 \n", " True 89912542.0 \n", " 1000 1 False 101868.0 \n", " True 101868.0 \n", " 10 False 814154.0 \n", " True 814154.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", " 1000 False 89912542.0 \n", " True 89912542.0 \n", " 100000 1 False 101868.0 \n", " True 101868.0 \n", " 10 False 814154.0 \n", " True 814154.0 \n", " 100 False 8202884.0 \n", " True 8202884.0 \n", " 1000 False 89912542.0 \n", " True 89912542.0 \n", "\n", " speed \n", "dictionary_size corpus_size 
nonzero_limit normalized \n", "1000 100 1 False 6.69 Kdoc pairs / s \n", " True 1.70 Kdoc pairs / s \n", " 10 False 5.80 Kdoc pairs / s \n", " True 1.70 Kdoc pairs / s \n", " 100 False 3.05 Kdoc pairs / s \n", " True 1.07 Kdoc pairs / s \n", " 1000 False 0.42 Kdoc pairs / s \n", " True 0.16 Kdoc pairs / s \n", " 1000 1 False 418.61 Kdoc pairs / s \n", " True 114.78 Kdoc pairs / s \n", " 10 False 381.50 Kdoc pairs / s \n", " True 110.60 Kdoc pairs / s \n", " 100 False 164.03 Kdoc pairs / s \n", " True 67.81 Kdoc pairs / s \n", " 1000 False 26.73 Kdoc pairs / s \n", " True 10.59 Kdoc pairs / s \n", " 100000 1 False 391443.20 Kdoc pairs / s \n", " True 298377.75 Kdoc pairs / s \n", " 10 False 386722.55 Kdoc pairs / s \n", " True 288567.14 Kdoc pairs / s \n", " 100 False 274905.36 Kdoc pairs / s \n", " True 189709.57 Kdoc pairs / s \n", " 1000 False 86598.15 Kdoc pairs / s \n", " True 41971.58 Kdoc pairs / s \n", "100000 100 1 False 2013.92 Kdoc pairs / s \n", " True 760.13 Kdoc pairs / s \n", " 10 False 206.66 Kdoc pairs / s \n", " True 83.46 Kdoc pairs / s \n", " 100 False 13.77 Kdoc pairs / s \n", " True 4.62 Kdoc pairs / s \n", " 1000 False 1.35 Kdoc pairs / s \n", " True 0.45 Kdoc pairs / s \n", " 1000 1 False 67139.73 Kdoc pairs / s \n", " True 40798.02 Kdoc pairs / s \n", " 10 False 16444.18 Kdoc pairs / s \n", " True 7425.08 Kdoc pairs / s \n", " 100 False 1338.94 Kdoc pairs / s \n", " True 454.49 Kdoc pairs / s \n", " 1000 False 133.44 Kdoc pairs / s \n", " True 44.42 Kdoc pairs / s \n", " 100000 1 False 192578.80 Kdoc pairs / s \n", " True 156431.36 Kdoc pairs / s \n", " 10 False 159029.88 Kdoc pairs / s \n", " True 131871.88 Kdoc pairs / s \n", " 100 False 94216.21 Kdoc pairs / s \n", " True 80177.08 Kdoc pairs / s \n", " 1000 False 36286.25 Kdoc pairs / s \n", " True 29097.14 Kdoc pairs / s " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.mean()).loc[\n", " [1000, 100000], :, [1, 10, 100, 1000], 
:].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
durationcorpus_nonzeromatrix_nonzerospeed
dictionary_sizecorpus_sizenonzero_limitnormalized
10001001False00:00:00.0002920.00.01.48 Kdoc pairs / s
True00:00:00.0002250.00.00.08 Kdoc pairs / s
100False00:00:00.0007470.00.01.02 Kdoc pairs / s
True00:00:00.0004880.00.00.07 Kdoc pairs / s
10001False00:00:00.0000270.00.08.10 Kdoc pairs / s
True00:00:00.0000690.00.01.56 Kdoc pairs / s
100False00:00:00.0003090.00.016.26 Kdoc pairs / s
True00:00:00.0002680.00.02.24 Kdoc pairs / s
1000001False00:00:00.0005760.00.011256.03 Kdoc pairs / s
True00:00:00.0005740.00.06512.19 Kdoc pairs / s
100False00:00:00.0005620.00.05233.50 Kdoc pairs / s
True00:00:00.0006090.00.02743.63 Kdoc pairs / s
1000001001False00:00:00.0001520.00.098.97 Kdoc pairs / s
True00:00:00.0003220.00.028.10 Kdoc pairs / s
100False00:00:00.0049970.00.00.14 Kdoc pairs / s
True00:00:00.0222060.00.00.07 Kdoc pairs / s
10001False00:00:00.0002100.00.01420.00 Kdoc pairs / s
True00:00:00.0001920.00.0467.23 Kdoc pairs / s
100False00:00:00.0190220.00.045.91 Kdoc pairs / s
True00:00:00.0044310.00.01.35 Kdoc pairs / s
1000001False00:00:00.0244660.00.0126.77 Kdoc pairs / s
True00:00:00.0624470.00.0213.64 Kdoc pairs / s
100False00:00:00.0876920.00.0108.55 Kdoc pairs / s
True00:00:01.0658890.00.0968.80 Kdoc pairs / s
\n", "
" ], "text/plain": [ " duration \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 00:00:00.000292 \n", " True 00:00:00.000225 \n", " 100 False 00:00:00.000747 \n", " True 00:00:00.000488 \n", " 1000 1 False 00:00:00.000027 \n", " True 00:00:00.000069 \n", " 100 False 00:00:00.000309 \n", " True 00:00:00.000268 \n", " 100000 1 False 00:00:00.000576 \n", " True 00:00:00.000574 \n", " 100 False 00:00:00.000562 \n", " True 00:00:00.000609 \n", "100000 100 1 False 00:00:00.000152 \n", " True 00:00:00.000322 \n", " 100 False 00:00:00.004997 \n", " True 00:00:00.022206 \n", " 1000 1 False 00:00:00.000210 \n", " True 00:00:00.000192 \n", " 100 False 00:00:00.019022 \n", " True 00:00:00.004431 \n", " 100000 1 False 00:00:00.024466 \n", " True 00:00:00.062447 \n", " 100 False 00:00:00.087692 \n", " True 00:00:01.065889 \n", "\n", " corpus_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 100000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 100000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " matrix_nonzero \\\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 100000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "100000 100 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 1000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", " 100000 1 False 0.0 \n", " True 0.0 \n", " 100 False 0.0 \n", " True 0.0 \n", "\n", " speed 
\n", "dictionary_size corpus_size nonzero_limit normalized \n", "1000 100 1 False 1.48 Kdoc pairs / s \n", " True 0.08 Kdoc pairs / s \n", " 100 False 1.02 Kdoc pairs / s \n", " True 0.07 Kdoc pairs / s \n", " 1000 1 False 8.10 Kdoc pairs / s \n", " True 1.56 Kdoc pairs / s \n", " 100 False 16.26 Kdoc pairs / s \n", " True 2.24 Kdoc pairs / s \n", " 100000 1 False 11256.03 Kdoc pairs / s \n", " True 6512.19 Kdoc pairs / s \n", " 100 False 5233.50 Kdoc pairs / s \n", " True 2743.63 Kdoc pairs / s \n", "100000 100 1 False 98.97 Kdoc pairs / s \n", " True 28.10 Kdoc pairs / s \n", " 100 False 0.14 Kdoc pairs / s \n", " True 0.07 Kdoc pairs / s \n", " 1000 1 False 1420.00 Kdoc pairs / s \n", " True 467.23 Kdoc pairs / s \n", " 100 False 45.91 Kdoc pairs / s \n", " True 1.35 Kdoc pairs / s \n", " 100000 1 False 126.77 Kdoc pairs / s \n", " True 213.64 Kdoc pairs / s \n", " 100 False 108.55 Kdoc pairs / s \n", " True 968.80 Kdoc pairs / s " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display(df.apply(lambda x: (x - x.mean()).std())).loc[\n", " [1000, 100000], :, [1, 100], :].loc[\n", " :, [\"duration\", \"corpus_nonzero\", \"matrix_nonzero\", \"speed\"]]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.2" } }, "nbformat": 4, "nbformat_minor": 2 }