{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivoted Document Length Normalization\n", "\n", "## Background\n", "\n", "In many cases, normalizing the tfidf weights of each term favors the terms of shorter documents. The _pivoted document length normalization_ scheme counters this bias towards short documents by making tfidf independent of the document length.\n", "\n", "This is achieved by *tilting* the normalization curve around a user-defined pivot point with a given slope,\n", "roughly following the equation:\n", "\n", "`pivoted_norm = (1 - slope) * pivot + slope * old_norm`\n", "\n", "With `slope = 1` the equation reduces to `pivoted_norm = old_norm`, so no pivoting is applied.\n", "\n", "This scheme was proposed in the paper [Pivoted Document Length Normalization](http://singhal.info/pivoted-dln.pdf) by Singhal, Buckley and Mitra.\n", "\n", "Overall, this approach can often increase the accuracy of a model when document lengths vary widely across the corpus.\n", "\n", "## Introduction\n", "\n", "This guide demonstrates how to perform pivoted document length normalization.\n", "We will train a logistic regression model to distinguish between texts from two different newsgroups.\n", "Our results will show that using pivoted document length normalization yields a better model (higher classification accuracy)."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#\n", "# Download our dataset\n", "#\n", "import gensim.downloader as api\n", "nws = api.load(\"20-newsgroups\")\n", "\n", "#\n", "# Pick texts from relevant newsgroups, split into training and test set.\n", "#\n", "cat1, cat2 = ('sci.electronics', 'sci.space')\n", "\n", "#\n", "# X_* contain the actual texts as strings.\n", "# y_* contain labels, 0 for cat1 (sci.electronics) and 1 for cat2 (sci.space)\n", "#\n", "X_train = []\n", "X_test = []\n", "y_train = []\n", "y_test = []\n", "\n", "for i in nws:\n", "    if i[\"set\"] == \"train\" and i[\"topic\"] == cat1:\n", "        X_train.append(i[\"data\"])\n", "        y_train.append(0)\n", "    elif i[\"set\"] == \"train\" and i[\"topic\"] == cat2:\n", "        X_train.append(i[\"data\"])\n", "        y_train.append(1)\n", "    elif i[\"set\"] == \"test\" and i[\"topic\"] == cat1:\n", "        X_test.append(i[\"data\"])\n", "        y_test.append(0)\n", "    elif i[\"set\"] == \"test\" and i[\"topic\"] == cat2:\n", "        X_test.append(i[\"data\"])\n", "        y_test.append(1)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from gensim.parsing.preprocessing import preprocess_string\n", "from gensim.corpora import Dictionary\n", "\n", "id2word = Dictionary([preprocess_string(doc) for doc in X_train])\n", "train_corpus = [id2word.doc2bow(preprocess_string(doc)) for doc in X_train]\n", "test_corpus = [id2word.doc2bow(preprocess_string(doc)) for doc in X_test]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1184 787\n" ] } ], "source": [ "print(len(X_train), len(X_test))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# We perform our analysis on the top k documents, which is roughly 10% of the test set when ranked by score\n", "k = len(X_test) // 10" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], 
"source": [ "from gensim.sklearn_api.tfidf import TfIdfTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "from gensim.matutils import corpus2csc\n", "\n", "# This function returns the model accuracy and indivitual document prob values using\n", "# gensim's TfIdfTransformer and sklearn's LogisticRegression\n", "def get_tfidf_scores(kwargs):\n", " tfidf_transformer = TfIdfTransformer(**kwargs).fit(train_corpus)\n", "\n", " X_train_tfidf = corpus2csc(tfidf_transformer.transform(train_corpus), num_terms=len(id2word)).T\n", " X_test_tfidf = corpus2csc(tfidf_transformer.transform(test_corpus), num_terms=len(id2word)).T\n", "\n", " clf = LogisticRegression().fit(X_train_tfidf, y_train)\n", "\n", " model_accuracy = clf.score(X_test_tfidf, y_test)\n", " doc_scores = clf.decision_function(X_test_tfidf)\n", "\n", " return model_accuracy, doc_scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get TFIDF scores for corpus without pivoted document length normalisation" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9682337992376112\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. 
Specify a solver to silence this warning.\n", " FutureWarning)\n" ] } ], "source": [ "params = {}\n", "model_accuracy, doc_scores = get_tfidf_scores(params)\n", "print(model_accuracy)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Normal cosine normalisation favors short documents as our top 78 docs have a smaller mean doc length of 1668.179 compared to the corpus mean doc length of 1577.799\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Sort the documents by score (ascending) and return the sorted scores\n", "# together with the corresponding document lengths (in characters).\n", "def sort_length_by_score(doc_scores, X_test):\n", "    doc_scores = sorted(enumerate(doc_scores), key=lambda x: x[1])\n", "    doc_leng = np.empty(len(doc_scores))\n", "\n", "    ds = np.empty(len(doc_scores))\n", "\n", "    for i, _ in enumerate(doc_scores):\n", "        doc_leng[i] = len(X_test[_[0]])\n", "        ds[i] = _[1]\n", "\n", "    return ds, doc_leng\n", "\n", "\n", "print(\n", "    \"Normal cosine normalisation favors short documents as our top {} \"\n", "    \"docs have a smaller mean doc length of {:.3f} compared to the corpus mean doc length of {:.3f}\"\n", "    .format(\n", "        k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(),\n", "        sort_length_by_score(doc_scores, X_test)[1].mean()\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get TFIDF scores for corpus with pivoted document length normalisation, testing various values of slope" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Score for slope 0.0 is 0.9720457433290979\n", "Score for slope 0.1 is 0.9758576874205845\n", "Score for slope 0.2 is 0.97712833545108\n", "Score for slope 0.30000000000000004 is 0.9783989834815756\n", "Score for slope 0.4 is 0.97712833545108\n", "Score for slope 0.5 is 0.9758576874205845\n", "Score for slope 0.6000000000000001 is 0.9733163913595934\n", "Score for slope 0.7000000000000001 is 0.9733163913595934\n", "Score for slope 0.8 is 0.9733163913595934\n", "Score for slope 0.9 is 0.9733163913595934\n", "Score for slope 1.0 is 0.9682337992376112\n", "We get best score of 0.9783989834815756 at slope 0.30000000000000004\n" ] } ], "source": [ "best_model_accuracy = 0\n", "optimum_slope = 0\n", "for slope in np.arange(0, 1.1, 0.1):\n", "    params = {\"pivot\": 10, \"slope\": slope}\n", "\n", "    model_accuracy, doc_scores = get_tfidf_scores(params)\n", "\n", "    if model_accuracy > best_model_accuracy:\n", "        best_model_accuracy = model_accuracy\n", "        optimum_slope = slope\n", "\n", "    print(\"Score for slope {} is {}\".format(slope, model_accuracy))\n", "\n", "print(\"We get best score of {} at slope {}\".format(best_model_accuracy, optimum_slope))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9783989834815756\n" ] } ], "source": [ "params = {\"pivot\": 10, \"slope\": optimum_slope}\n", "model_accuracy, doc_scores = get_tfidf_scores(params)\n", "print(model_accuracy)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "With pivoted normalisation top 78 docs have mean length of 2077.346 which is much closer to the corpus mean doc length of 1577.799\n" ] } ], "source": [ "print(\n", "    \"With pivoted normalisation top {} docs have mean length of {:.3f} \"\n", "    \"which is 
much closer to the corpus mean doc length of {:.3f}\"\n", "    .format(\n", "        k, sort_length_by_score(doc_scores, X_test)[1][:k].mean(),\n", "        sort_length_by_score(doc_scores, X_test)[1].mean()\n", "    )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the pivoted normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cosine normalization favors the retrieval of short documents. In the plot below, with `slope=1` (i.e. without pivoted normalisation), short documents with a length of around 500 receive very good scores, which shows the bias towards short documents. As we vary the slope from 1 towards 0, we introduce a bias towards long documents that counters the bias caused by cosine normalisation. At some point in between we reach an optimum slope, about 0.3 in the sweep above, where the overall accuracy of the model is highest.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as py\n", "\n", "best_model_accuracy = 0\n", "optimum_slope = 0\n", "\n", "w = 2\n", "h = 2\n", "f, axarr = py.subplots(h, w, figsize=(15, 7))\n", "\n", "it = 0\n", "for slope in [1, 0.2]:\n", " params = {\"pivot\": 10, \"slope\": slope}\n", "\n", " model_accuracy, doc_scores = get_tfidf_scores(params)\n", "\n", " if model_accuracy > best_model_accuracy:\n", " best_model_accuracy = model_accuracy\n", " optimum_slope = slope\n", "\n", " doc_scores, doc_leng = sort_length_by_score(doc_scores, X_test)\n", "\n", " y = abs(doc_scores[:k, np.newaxis])\n", " x = doc_leng[:k, np.newaxis]\n", "\n", " py.subplot(1, 2, it+1).bar(x, y, width=20, linewidth=0)\n", " py.title(\"slope = \" + str(slope) + \" Model accuracy = \" + str(model_accuracy))\n", " py.ylim([0, 4.5])\n", " py.xlim([0, 3200])\n", " py.xlabel(\"document length\")\n", " py.ylabel(\"confidence score\")\n", " \n", " it += 1\n", "\n", "py.tight_layout()\n", "py.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above histogram plot helps us visualize the effect of `slope`. For top k documents we have document length on the x axis and their respective scores of belonging to a specific class on y axis. \n", "As we decrease the slope the density of bins is shifted from low document length (around ~250-500) to over ~500 document length. This suggests that the positive biasness which was seen at `slope=1` (or when regular tfidf was used) for short documents is now reduced. 
Of the two slopes plotted, `slope=0.2` gives the higher model accuracy.\n", "\n", "## Conclusion\n", "\n", "Using pivoted document length normalization improved the classification accuracy by about 0.9 percentage points:\n", "\n", "- Before (slope=1, identical to default cosine normalization): 0.9682\n", "- After (slope=0.2): 0.9771" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }