{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ethen 2018-11-17 17:24:06 \n", "\n", "CPython 3.6.4\n", "IPython 6.4.0\n", "\n", "requests 2.20.1\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "import math\n", "import json\n", "import requests\n", "\n", "# 1. magic to print version\n", "# 2. magic so that the notebook will reload external python modules\n", "%load_ext watermark\n", "%load_ext autoreload\n", "%autoreload 2\n", "%watermark -a 'Ethen' -d -t -v -p requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quick Introduction to Okapi BM25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem that **BM25 (Best Match 25)** tries to solve is similar to that of [TFIDF (Term Frequency, Inverse Document Frequency)](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/clustering/tfidf/tfidf.ipynb), that is representing our text in a vector space (it can be applied to field outside of text, but text is where it has the biggest presence) so we can search/find similar documents for a given document or query.\n", "\n", "The gist behind TFIDF is that is relies on two main factors to determine whether a document is similar to our query.\n", "\n", "- Term Frequency aka tf: how often does the term occur in the document? 3 times? 10 times?\n", "- Inverse Document Frequency aka idf: measures how many documents the term appeared in. Inverse document frequency (1/df) then measures how special the term is. Is the term a very rare (occurs in just one doc) word? Or a relatively common one (occurs in nearly all the docs)?\n", "\n", "Using these two factors, TFIDF measures the relative concentration of a term in a given piece of document. If the term is common in this article, but relatively rare elsewhere, then the TFIDF score will be high, and documents that have higher TFIDF score would be considered as very relevant to the search term.\n", "\n", "BM25 improves upon TFIDF by casting relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant. Instead of going through how the formula was derived, here we'll take a look a the formula and try to digest it to see why it makes some kind of sense." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gaining Intuition for Okapi BM25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**BM25 (Best Match 25)** function scores each document in a corpus according to the document's relevance to a particular text query. 
For a query $Q$, with terms $q_1, \\ldots, q_n$, the BM25 score for document $D$ is:\n", "\n", "\n", "\\begin{align}\n", "\\mbox{BM25}(D, Q) = \\sum_{i=1}^n IDF(q_i, D) \\frac{f(q_i, D) \\cdot (k_1 + 1)}{f(q_i) + k_1 \\cdot (1-b + b \\cdot |D|/d_{avg}))}\n", "\\end{align}\n", "\n", "where:\n", "\n", "- $f(q_i, D)$ is the number of times term $q_i$ occurs in document $D$.\n", "- $|D|$ is the number of words in document $D$.\n", "- $d_{avg}$ is the average number of words per document.\n", "- $b$ and $k_1$ are hyperparameters for BM25.\n", "\n", "Let's break the formula down into smaller components to see why it makes sense.\n", "\n", "- First of all, there's $f(q_i, D)$ and $k_1$. $f(q_i, D)$ should match our intuition, as this means the more times the query term appears in the document, the higher the document's score will be. The interesting part is the **parameter $k_1$, which determines the term frequency saturation characteristic. The higher the value, the slower the saturation.** And when we say saturation, we are referring to the fact that if terms occurring extra times add extra score. We can observe this diminishing return in term frequency from the graph below.\n", "\n", "\n", "\n", "- Then $|D|/d_{avg}$ part in the denominator means a document that is longer than the average documents will result in a bigger denominator, resulting in a decrease in the score. The intuition is that the more terms in the document that does not match our input query - the lower the document's score should be. In other words, if a 300 page long document mentions the query term once, it's less likely to have as much to do with the query compared to a short tweet which mentions query once.\n", "\n", "\n", "\n", "From the graph above, we can see that shorter documents hit the asymptote much faster. Hopefully, this resembles our intuition as the more matches we have on shorter documents, the more certain we are about the relevance, whereas, for a lengthy book, it might take us longer to get to a point where we feel confident that the book is indeed relevant to the given query.\n", "\n", "- The parameter $b$ (bound 0.0 ~ 1.0) in the denominator is multiplied by the ratio of the document length we just discussed. **If $b$ is bigger, the effects of the document length compared to the average length are more amplified.** We can imagine if we set $b$ to 0, the effect of the length ratio would be completely nullified." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As for the inverse document frequency part, ${IDF}(q_i, D)$. For a corpus with $N$ documents, inverse document frequency for term $q_i$ is computed as follows:\n", "\n", "\\begin{align}\n", "\\mbox{IDF}(q_i, D) = \\log \\frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\n", "\\end{align}\n", "\n", "where\n", "\n", "- $N(q_i)$ is the number of documents in the corpus that contain term $q_i$.\n", "\n", "The inverse document frequency part is very similar to that of TFIDF, whose role is to make sure that rarer words will have a higher score and contribute more to the final score.\n", "\n", "Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term \"the\" appeared in 6 of them, its IDF would be $log(10 - 6 + 0.5 / 6 + 0.5) = log(4.5 / 6.5)$. 
Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.\n", "\n", "- Add a 1 to the equation.\n", "\n", "\\begin{align}\n", "\\mbox{IDF}(q_i) = \\log \\big( 1 + \\frac{N - N(q_i) + 0.5}{N(q_i) + 0.5} \\big)\n", "\\end{align}\n", "\n", "- For term that resulted in a negative IDF value, swap it with an small positive value, usually denoted as $\\epsilon$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like all hyperparameters in general, the default ones are usually a good starting point, and we should probably focus on tweaking other stuff before jumping into the rabbit hole of hyperparameter tuning. In the context of search, it might be making sure our ranking scores older documents lower in application such as news ranking. But if we were to start tuning, remember to always measure the performance of various settings and the following questions are general starting points that we can reference to.\n", "\n", "- For $k_1$, we should be asking, \"when do we think a term is likely to be saturated?\" For very long documents like books, it's very likely to have a lot of different terms appear several times in a work, even when the term isn't the primary subject of the document. We may not want terms to be saturated as quickly in this situation, so the suggestion is that $k_1$ should generally trend toward larger numbers when the text is a lot longer and more diverse. On the opposite side of things, it's been suggested to set $k_1$ on the lower side. It's very unlikely that a collection of short tweets would have a term multiple times without being highly related to that term.\n", "- For $b$, we should be asking, \"when do we think a document is likely to be very long, and when should that hinder its relevance to a term?\" Documents which are highly specific like engineering specifications or patents are lengthy in order to be more specific about a subject. Their length is unlikely to be detrimental to the relevance and lower $b$ may be more appropriate. On the other end of the spectrum, documents which touch on several different topics in a broad way — news articles (a political article may touch on economics, international affairs, and certain corporations), user reviews, etc. — often benefit by choosing a larger $b$ so that irrelevant topics to a user's search, including spam and the like, are penalized." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implementation" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['human', 'interface', 'computer'],\n", " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n", " ['eps', 'user', 'interface', 'system'],\n", " ['system', 'human', 'system', 'eps'],\n", " ['user', 'response', 'time'],\n", " ['trees'],\n", " ['graph', 'trees'],\n", " ['graph', 'minors', 'trees'],\n", " ['graph', 'minors', 'survey']]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we'll generate some fake texts to experiment with\n", "corpus = [\n", " 'Human machine interface for lab abc computer applications',\n", " 'A survey of user opinion of computer system response time',\n", " 'The EPS user interface management system',\n", " 'System and human system engineering testing of EPS',\n", " 'Relation of user perceived response time to error measurement',\n", " 'The generation of random binary unordered trees',\n", " 'The intersection graph of paths in trees',\n", " 'Graph minors IV Widths of trees and well quasi ordering',\n", " 'Graph minors A survey'\n", "]\n", "\n", "# remove stop words and tokenize them (we probably want to do some more\n", "# preprocessing with our text in a real world setting, but we'll keep\n", "# it simple here)\n", "stopwords = set(['for', 'a', 'of', 'the', 'and', 'to', 'in'])\n", "texts = [\n", " [word for word in document.lower().split() if word not in stopwords]\n", " for document in corpus\n", "]\n", "\n", "# build a word count dictionary so we can remove words that appear only once\n", "word_count_dict = {}\n", "for text in texts:\n", " for token in text:\n", " word_count = word_count_dict.get(token, 0) + 1\n", " word_count_dict[token] = word_count\n", "\n", "texts = [[token for token in text if word_count_dict[token] > 1] for text in texts]\n", "texts" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class BM25:\n", " \"\"\"\n", " Best Match 25.\n", "\n", " Parameters\n", " ----------\n", " k1 : float, default 1.5\n", "\n", " b : float, default 0.75\n", "\n", " Attributes\n", " ----------\n", " tf_ : list[dict[str, int]]\n", " Term Frequency per document. So [{'hi': 1}] means\n", " the first document contains the term 'hi' 1 time.\n", "\n", " df_ : dict[str, int]\n", " Document Frequency per term. i.e. Number of documents in the\n", " corpus that contains the term.\n", "\n", " idf_ : dict[str, float]\n", " Inverse Document Frequency per term.\n", "\n", " doc_len_ : list[int]\n", " Number of terms per document. 
So [3] means the first\n", " document contains 3 terms.\n", "\n", " corpus_ : list[list[str]]\n", " The input corpus.\n", "\n", " corpus_size_ : int\n", " Number of documents in the corpus.\n", "\n", " avg_doc_len_ : float\n", " Average number of terms for documents in the corpus.\n", " \"\"\"\n", "\n", " def __init__(self, k1=1.5, b=0.75):\n", " self.b = b\n", " self.k1 = k1\n", "\n", " def fit(self, corpus):\n", " \"\"\"\n", " Fit the various statistics that are required to calculate BM25 ranking\n", " score using the corpus given.\n", "\n", " Parameters\n", " ----------\n", " corpus : list[list[str]]\n", " Each element in the list represents a document, and each document\n", " is a list of the terms.\n", "\n", " Returns\n", " -------\n", " self\n", " \"\"\"\n", " tf = []\n", " df = {}\n", " idf = {}\n", " doc_len = []\n", " corpus_size = 0\n", " for document in corpus:\n", " corpus_size += 1\n", " doc_len.append(len(document))\n", "\n", " # compute tf (term frequency) per document\n", " frequencies = {}\n", " for term in document:\n", " term_count = frequencies.get(term, 0) + 1\n", " frequencies[term] = term_count\n", "\n", " tf.append(frequencies)\n", "\n", " # compute df (document frequency) per term\n", " for term, _ in frequencies.items():\n", " df_count = df.get(term, 0) + 1\n", " df[term] = df_count\n", "\n", " for term, freq in df.items():\n", " idf[term] = math.log(1 + (corpus_size - freq + 0.5) / (freq + 0.5))\n", "\n", " self.tf_ = tf\n", " self.df_ = df\n", " self.idf_ = idf\n", " self.doc_len_ = doc_len\n", " self.corpus_ = corpus\n", " self.corpus_size_ = corpus_size\n", " self.avg_doc_len_ = sum(doc_len) / corpus_size\n", " return self\n", "\n", " def search(self, query):\n", " scores = [self._score(query, index) for index in range(self.corpus_size_)]\n", " return scores\n", "\n", " def _score(self, query, index):\n", " score = 0.0\n", "\n", " doc_len = self.doc_len_[index]\n", " frequencies = self.tf_[index]\n", " for term in query:\n", " if term not in frequencies:\n", " continue\n", "\n", " freq = frequencies[term]\n", " numerator = self.idf_[term] * freq * (self.k1 + 1)\n", " denominator = freq + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_len_)\n", " score += (numerator / denominator)\n", "\n", " return score" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0\tHuman machine interface for lab abc computer applications\n", "1.025\tA survey of user opinion of computer system response time\n", "0.0\tThe EPS user interface management system\n", "0.0\tSystem and human system engineering testing of EPS\n", "0.0\tRelation of user perceived response time to error measurement\n", "1.462\tThe generation of random binary unordered trees\n", "2.485\tThe intersection graph of paths in trees\n", "2.161\tGraph minors IV Widths of trees and well quasi ordering\n", "2.507\tGraph minors A survey\n" ] } ], "source": [ "# query our corpus to see which document is more relevant\n", "query = 'The intersection of graph survey and trees'\n", "query = [word for word in query.lower().split() if word not in stopwords]\n", "\n", "bm25 = BM25()\n", "bm25.fit(texts)\n", "scores = bm25.search(query)\n", "\n", "for score, doc in zip(scores, corpus):\n", " score = round(score, 3)\n", " print(str(score) + '\\t' + doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code chunk above, we printed each corpus's BM25 relevance score along with the original text, note that this isn't 
sorted in decreasing order of the relevance score yet, which is usually what we want to do in a real world setting. That is to find the more relevant document, sort them in decreasing order and present them to the user. Also here, we are computing the scores for every document, this becomes computationally expensive when we start have a large corpus size. Thus search engine uses **inverted index** to speed things up. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears, this allows us to quickly find the documents that contains the term in our query and only then do we compute the relevance score for this smaller recall set. This [link](https://www.elastic.co/guide/en/elasticsearch/guide/master/inverted-index.html) contains a good high level description of this concept. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ElasticSearch BM25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see BM25 in action to rank documents using ElasticSearch, this notebook isn't an ElasticSearch tutorial, so hopefully, the reader are some what familiar with the tool, if not, each code chunk contains links to some helpful references.\n", "\n", "We will follow the standard process of creating the index to store our documents, add some sample documents to the index and provide a query against the index to return the relevant documents sorted in decreasing order based on the relevance score, which will be based on BM25." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# installation instructions\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html\n", "\n", "# creating an index\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html\n", "settings = {\n", " 'settings': {\n", " 'index': {\n", " 'number_of_shards': 1,\n", " 'number_of_replicas': 1,\n", "\n", " # configure our default similarity algorithm explicitly to use bm25,\n", " # this allows it to use it for all the fields\n", " 'similarity': {\n", " 'default': {\n", " 'type': 'BM25'\n", " }\n", " }\n", " }\n", " },\n", " # we will be indexing our documents in the title field using the English analyzer,\n", " # which removes stop words for us, the default standard analyzer doesn't have\n", " # this preprocessing step\n", " # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html\n", " 'mappings': {\n", " # this key is the \"type\", which will be explained in the next code chunk\n", " '_doc': {\n", " 'properties': {\n", " 'title': {\n", " 'type': 'text',\n", " 'analyzer': 'english'\n", " }\n", " }\n", " }\n", " }\n", "}\n", "headers = {'Content-Type': 'application/json'}\n", "response = requests.put('http://localhost:9200/experiment', data=json.dumps(settings), headers=headers)\n", "response" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indexing document\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html\n", "# https://www.elastic.co/guide/en/elasticsearch/guide/master/index-doc.html\n", "\n", "# a document is uniquely identified by the index, the type and id\n", "# it's worth 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ElasticSearch BM25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see BM25 in action ranking documents using ElasticSearch. This notebook isn't an ElasticSearch tutorial, so hopefully the reader is somewhat familiar with the tool; if not, each code chunk contains links to some helpful references.\n", "\n", "We will follow the standard process of creating an index to store our documents, adding some sample documents to the index, and issuing a query against the index to return the relevant documents sorted in decreasing order of the relevance score, which will be based on BM25." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Response [200]>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# installation instructions\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html\n", "\n", "# creating an index\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html\n", "settings = {\n", "    'settings': {\n", "        'index': {\n", "            'number_of_shards': 1,\n", "            'number_of_replicas': 1,\n", "\n", "            # configure our default similarity algorithm explicitly to use bm25,\n", "            # this applies it to all the fields\n", "            'similarity': {\n", "                'default': {\n", "                    'type': 'BM25'\n", "                }\n", "            }\n", "        }\n", "    },\n", "    # we will be indexing our documents in the title field using the English analyzer,\n", "    # which removes stop words for us, the default standard analyzer doesn't have\n", "    # this preprocessing step\n", "    # https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html\n", "    'mappings': {\n", "        # this key is the \"type\", which will be explained in the next code chunk\n", "        '_doc': {\n", "            'properties': {\n", "                'title': {\n", "                    'type': 'text',\n", "                    'analyzer': 'english'\n", "                }\n", "            }\n", "        }\n", "    }\n", "}\n", "headers = {'Content-Type': 'application/json'}\n", "response = requests.put('http://localhost:9200/experiment', data=json.dumps(settings), headers=headers)\n", "response" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Response [201]>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# indexing document\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html\n", "# https://www.elastic.co/guide/en/elasticsearch/guide/master/index-doc.html\n", "\n", "# a document is uniquely identified by the index, the type and id;\n", "# it's worth noting that there's a note on removing the capability of\n", "# having multiple types under one index, and going forward the type will\n", "# just be set to '_doc'\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html\n", "url = 'http://localhost:9200/experiment/_doc'\n", "\n", "for document in corpus:\n", "    # we insert the document into the 'title' field\n", "    data = {'title': document}\n", "    response = requests.post(url, data=json.dumps(data), headers=headers)\n", "\n", "response" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def search(query, headers):\n", "    url = 'http://localhost:9200/experiment/_doc/_search'\n", "    response = requests.get(url, data=json.dumps(query), headers=headers)\n", "\n", "    # the response contains other information, such as the time it took to\n", "    # serve the request; here we are only interested in the matched\n", "    # results, which are stored under 'hits'\n", "    search_hits = json.loads(response.text)['hits']['hits']\n", "\n", "    print('Num\\tRelevance Score\\tTitle')\n", "    for idx, hit in enumerate(search_hits):\n", "        print('%s\\t%s\\t%s' % (idx + 1, hit['_score'], hit['_source']['title']))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num\tRelevance Score\tTitle\n", "1\t4.572298\tThe intersection graph of paths in trees\n", "2\t3.0325541\tGraph minors A survey\n", "3\t1.814194\tGraph minors IV Widths of trees and well quasi ordering\n", "4\t1.2758815\tA survey of user opinion of computer system response time\n", "5\t1.1110051\tThe generation of random binary unordered trees\n" ] } ], "source": [ "# match query\n", "# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html\n", "query = {\n", "    'query': {\n", "        'match': {\n", "            # search against the 'title' field\n", "            'title': 'The intersection of graph survey and trees'\n", "        }\n", "    }\n", "}\n", "search(query, headers={'Content-Type': 'application/json'})" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<Response [200]>" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we can delete this experimental index to prevent occupying space\n", "response = requests.delete('http://localhost:9200/experiment')\n", "response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Wiki: Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)\n", "- [Blog: BM25 The Next Generation of Lucene Relevance](https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/)\n", "- [Blog: Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch](https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch)\n", "- [Blog: Practical BM25 - Part 2: The BM25 Algorithm and its Variables](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)\n", "- [Blog: Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch](https://www.elastic.co/blog/practical-bm25-part-3-considerations-for-picking-b-and-k1-in-elasticsearch)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 },
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "264px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }