{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "3f6c2bfe6b2e26c92357e896a1511195d836956e" }, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", "\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "cb01ca96934e5c83a36a2308da9645b87a9c52a0" }, "source": [ "##
Assignment 4 (demo). Solution\n", "###
Sarcasm detection with logistic regression\n", " \n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit) + [solution](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution).**\n", "\n", "\n", "We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) \"A Large Self-Annotated Corpus for Sarcasm\" with more than 1 million comments from Reddit, each labeled as either sarcastic or not. A processed version is available on Kaggle as a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "ed87ab2845921166bb73ca854bfe1ef013c035e9" }, "outputs": [], "source": [ "PATH_TO_DATA = \"../input/sarcasm/train-balanced-sarcasm.csv\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "ffa03aec57ab6150f9bec0fa56cd3a5791a3e6f4" }, "outputs": [], "source": [ "# some necessary imports\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b23e4fc7a1973d60e0c6da8bd60f3d921542a856" }, "outputs": [], "source": [ "train_df = pd.read_csv(PATH_TO_DATA)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "4dc7b3787afa46c7eb0d0e33b0c41ab9821c4a27" }, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "0a7ed9557943806c6813ad59c3d5ebdb403ffd78" }, "outputs": [], "source": [ "train_df.info()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": 
"6472f52fb5ecb8bb2a6e3b292678a2042fcfe34c" }, "source": [ "Some comments are missing, so we drop the corresponding rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "97b2d85627fcde52a506dbdd55d4d6e4c87d3f08" }, "outputs": [], "source": [ "train_df.dropna(subset=[\"comment\"], inplace=True)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "9d51637ee70dca7693737ad0da1dbb8c6ce9230b" }, "source": [ "We notice that the dataset is indeed balanced." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "addd77c640423d30fd146c8d3a012d3c14481e11" }, "outputs": [], "source": [ "train_df[\"label\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "5b836574e5093c5eb2e9063fefe1c8d198dcba79" }, "source": [ "We split data into training and validation parts." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "c200add4e1dcbaa75164bbcc73b9c12ecb863c96" }, "outputs": [], "source": [ "train_texts, valid_texts, y_train, y_valid = train_test_split(\n", "    train_df[\"comment\"], train_df[\"label\"], random_state=17\n", ")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "8ccf65310c5e61194dd6a8913276e54cc32e7712" }, "source": [ "## Tasks:\n", "1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.\n", "2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a comment on Reddit (`comment`).\n", "3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).\n", "4. (Optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "ba1a8f65032c5954476a68e01b607655145b746d" }, "source": [ "### Part 1. 
Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6a045e347fabd462a643639a1334b51f8780627d" }, "source": [ "The distribution of comment lengths is almost the same for sarcastic and normal comments." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "dadaf341602993a7854867a1df3004d0aa5d9b8c" }, "outputs": [], "source": [ "train_df.loc[train_df[\"label\"] == 1, \"comment\"].str.len().apply(np.log1p).hist(\n", "    label=\"sarcastic\", alpha=0.5\n", ")\n", "train_df.loc[train_df[\"label\"] == 0, \"comment\"].str.len().apply(np.log1p).hist(\n", "    label=\"normal\", alpha=0.5\n", ")\n", "plt.legend();" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "c2c613ee2052a2c0379682adf5c23d1f751f4c3b" }, "outputs": [], "source": [ "from wordcloud import STOPWORDS, WordCloud" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "ae7333d67f6a17673d2aa16aed3017e2fbef9b58" }, "outputs": [], "source": [ "wordcloud = WordCloud(\n", "    background_color=\"black\",\n", "    stopwords=STOPWORDS,\n", "    max_words=200,\n", "    max_font_size=100,\n", "    random_state=17,\n", "    width=800,\n", "    height=400,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "7a59f35dc359cb1b13a363b0f515b38a03c7b940" }, "source": [ "Word clouds are nice, but not very useful." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "7bc095d9b0c549d4a21478ed52845210c0ffbb57" }, "outputs": [], "source": [ "plt.figure(figsize=(16, 12))\n", "# join the comments themselves; str(Series) would only give a truncated repr\n", "wordcloud.generate(\" \".join(train_df.loc[train_df[\"label\"] == 1, \"comment\"].astype(str)))\n", "plt.imshow(wordcloud);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "f2423d8c8adf818c0f7c709665f741e1f885e47f" }, "outputs": [], "source": [ "plt.figure(figsize=(16, 12))\n", "# join the comments themselves; str(Series) would only give a truncated repr\n", "wordcloud.generate(\" \".join(train_df.loc[train_df[\"label\"] == 0, \"comment\"].astype(str)))\n", "plt.imshow(wordcloud);" ] }, { "cell_type": "markdown", "metadata": { "_uuid": 
"15c4140c01495b4d62a5b18d1906ba191e01505c" }, "source": [ "Let's analyze whether some subreddits are more \"sarcastic\" on average than others." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "6ba84720d54144e054b6963d78b48bf648e5c652" }, "outputs": [], "source": [ "sub_df = train_df.groupby(\"subreddit\")[\"label\"].agg([\"size\", \"mean\", \"sum\"])\n", "sub_df.sort_values(by=\"sum\", ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "575136be3a080cdd9a93ee7ee0c5fc9d9bf754e9" }, "outputs": [], "source": [ "sub_df[sub_df[\"size\"] > 1000].sort_values(by=\"mean\", ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4e9cf503955b00599953ef8f87d0662155876519" }, "source": [ "The same analysis for authors doesn't yield much insight, except that some authors' comments were apparently sampled in a balanced way: we see equal numbers of sarcastic and non-sarcastic comments for them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "1cde1056ef0d3c62b98d70a299d83df29fcf6071" }, "outputs": [], "source": [ "sub_df = train_df.groupby(\"author\")[\"label\"].agg([\"size\", \"mean\", \"sum\"])\n", "sub_df[sub_df[\"size\"] > 300].sort_values(by=\"mean\", ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "46f4aa1c9524d525acec04ca0754f85a5190514b" }, "outputs": [], "source": [ "sub_df = (\n", "    train_df[train_df[\"score\"] >= 0]\n", "    .groupby(\"score\")[\"label\"]\n", "    .agg([\"size\", \"mean\", \"sum\"])\n", ")\n", "sub_df[sub_df[\"size\"] > 300].sort_values(by=\"mean\", ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "3e2f474c6219798c2fef2e4004ad4a6c1022f8ee" }, "outputs": [], "source": [ "sub_df = (\n", "    train_df[train_df[\"score\"] < 0]\n", "    .groupby(\"score\")[\"label\"]\n", "    .agg([\"size\", \"mean\", \"sum\"])\n", ")\n", "sub_df[sub_df[\"size\"] > 
300].sort_values(by=\"mean\", ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "416321f19f5a27290bc5622e8b3384b7bbbd28c6" }, "source": [ "### Part 2. Training the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "3048a070a56b08eb4e5fe2c54b6d14905031e74a" }, "outputs": [], "source": [ "# use unigrams and bigrams, limit the maximal number of features\n", "# and set a minimal document frequency\n", "tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)\n", "# binary logistic regression (two classes: sarcastic or not)\n", "logit = LogisticRegression(C=1, n_jobs=4, solver=\"lbfgs\", random_state=17, verbose=1)\n", "# sklearn's pipeline\n", "tfidf_logit_pipeline = Pipeline([(\"tf_idf\", tf_idf), (\"logit\", logit)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "8756bac7457218e4daf08ec276211f03971c17fb" }, "outputs": [], "source": [ "%%time\n", "tfidf_logit_pipeline.fit(train_texts, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "d2e47f77f999c2fb5aee9ef1de1542bc93de4c98" }, "outputs": [], "source": [ "%%time\n", "valid_pred = tfidf_logit_pipeline.predict(valid_texts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "a8f93efc3db12910eaa6d7944feebb2418714203" }, "outputs": [], "source": [ "accuracy_score(y_valid, valid_pred)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "fefd0178f43ce832031653be70f0a0e47f62cf4c" }, "source": [ "### Part 3. 
Explaining the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "247a13fd3ae4d5c015c0ca0489a9a95d72ad7e9f" }, "outputs": [], "source": [ "def plot_confusion_matrix(\n", "    actual,\n", "    predicted,\n", "    classes,\n", "    normalize=False,\n", "    title=\"Confusion matrix\",\n", "    figsize=(7, 7),\n", "    cmap=plt.cm.Blues,\n", "    path_to_save_fig=None,\n", "):\n", "    \"\"\"\n", "    This function plots the confusion matrix (rows: predicted, columns: true).\n", "    Normalization can be applied by setting `normalize=True`.\n", "    \"\"\"\n", "    import itertools\n", "\n", "    from sklearn.metrics import confusion_matrix\n", "\n", "    cm = confusion_matrix(actual, predicted).T\n", "    if normalize:\n", "        cm = cm.astype(\"float\") / cm.sum(axis=1)[:, np.newaxis]\n", "\n", "    plt.figure(figsize=figsize)\n", "    plt.imshow(cm, interpolation=\"nearest\", cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "    tick_marks = np.arange(len(classes))\n", "    plt.xticks(tick_marks, classes, rotation=90)\n", "    plt.yticks(tick_marks, classes)\n", "\n", "    fmt = \".2f\" if normalize else \"d\"\n", "    thresh = cm.max() / 2.0\n", "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", "        plt.text(\n", "            j,\n", "            i,\n", "            format(cm[i, j], fmt),\n", "            horizontalalignment=\"center\",\n", "            color=\"white\" if cm[i, j] > thresh else \"black\",\n", "        )\n", "\n", "    plt.tight_layout()\n", "    plt.ylabel(\"Predicted label\")\n", "    plt.xlabel(\"True label\")\n", "\n", "    if path_to_save_fig:\n", "        plt.savefig(path_to_save_fig, dpi=300, bbox_inches=\"tight\")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "ff4a79b0368176a518fb0b84b45a508499e6183f" }, "source": [ "The confusion matrix is quite balanced." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "6df0c058a45b48b756e57e01a23bbc0974407195" }, "outputs": [], "source": [ "plot_confusion_matrix(\n", "    y_valid,\n", "    valid_pred,\n", "    tfidf_logit_pipeline.named_steps[\"logit\"].classes_,\n", "    figsize=(8, 8),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6af3a7c93afef23ce9d215bf1daa2c91feb57d5d" }, "source": [ "Indeed, we can recognize some phrases indicative of sarcasm, like \"yes sure\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "f62f3043b6e94fb6bbd5683a0e9662c572847fa6" }, "outputs": [], "source": [ "import eli5\n", "\n", "eli5.show_weights(\n", "    estimator=tfidf_logit_pipeline.named_steps[\"logit\"],\n", "    vec=tfidf_logit_pipeline.named_steps[\"tf_idf\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "be94f2065f86c65d5ee5590c9b2e5a541135732c" }, "source": [ "So sarcasm detection is easy.\n", "" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "5648f6ad7a14ef3a582909f7c0c72c4fc80204aa" }, "source": [ "### Part 4. Improving the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "aaefd2eb6f829f9eb9e9cd12c7903d3086182acc" }, "outputs": [], "source": [ "subreddits = train_df[\"subreddit\"]\n", "train_subreddits, valid_subreddits = train_test_split(subreddits, random_state=17)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4a0dac5edc6b5c4078622637454baa820bc18a5f" }, "source": [ "We'll have separate Tf-Idf vectorizers for comments and for subreddits. It's possible to stick to a pipeline as well, but in that case it becomes a bit less straightforward. 
[Example](https://stackoverflow.com/questions/36731813/computing-separate-tfidf-scores-for-two-different-columns-using-sklearn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "88be690ae260d824fbd8df4a4d02e5abcce0d5a7" }, "outputs": [], "source": [ "tf_idf_texts = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)\n", "tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "453f8bc16c7726cbff936f3170442341eca3b45e" }, "source": [ "Do transformations separately for comments and subreddits. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "bfd35a513a3a090485df667e8ec773c57682aae7" }, "outputs": [], "source": [ "%%time\n", "X_train_texts = tf_idf_texts.fit_transform(train_texts)\n", "X_valid_texts = tf_idf_texts.transform(valid_texts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "bea0b7a09c57afefb77e5eabd2b8ed18bf019855" }, "outputs": [], "source": [ "X_train_texts.shape, X_valid_texts.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b81a4ad703e2fb0f97f21a449af7e0d8b39a860c" }, "outputs": [], "source": [ "%%time\n", "X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)\n", "X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "d55aec9753a506f41310b4b717cdbb8694af8a8e" }, "outputs": [], "source": [ "X_train_subreddits.shape, X_valid_subreddits.shape" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "74b044c4b73653181ddde0f32e65a14d4867319c" }, "source": [ "Then, stack all features together." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "bc458646e3c36792cf845f86cb9b6d6cc384b32c" }, "outputs": [], "source": [ "from scipy.sparse import hstack\n", "\n", "X_train = hstack([X_train_texts, X_train_subreddits])\n", "X_valid = hstack([X_valid_texts, X_valid_subreddits])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "d5d2505567c9303b2ce9f81c9bdc11fc799af91e" }, "outputs": [], "source": [ "X_train.shape, X_valid.shape" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a7f2040b76acbf2ff58e5dc68d7437ee6c3e9989" }, "source": [ "Train the same logistic regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "77b08444a449def113803365001d1f844620d5aa" }, "outputs": [], "source": [ "logit.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "1a48a2e637db7e695766af5be6f5aa60596f6ad9" }, "outputs": [], "source": [ "%%time\n", "valid_pred = logit.predict(X_valid)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b50af0bc6ea1297538daaff183b7eecf8dfa7c2c" }, "outputs": [], "source": [ "accuracy_score(y_valid, valid_pred)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "5fe6b371e91831198dda10d8b198c9e1ebfe07aa" }, "source": [ "As we can see, accuracy slightly increased." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "7f0f47b98e49a185cd5cffe19fcbe28409bf00c0" }, "source": [ "## Links:\n", " - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. 
sklearn)\n", " - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), also a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection\n", " - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) \"Approaching (Almost) Any NLP Problem on Kaggle\"\n", " - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }