{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "3f6c2bfe6b2e26c92357e896a1511195d836956e" }, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", "\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "cb01ca96934e5c83a36a2308da9645b87a9c52a0" }, "source": [ "##
Assignment 4 (demo)\n", "###
Sarcasm detection with logistic regression\n", "    \n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit) + [solution](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution).**\n", "\n", "\n", "We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) \"A Large Self-Annotated Corpus for Sarcasm\" with >1 million comments from Reddit, labeled as either sarcastic or not. A processed version can be found on Kaggle in the form of a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).\n", "\n", "Sarcasm detection is easy. \n", "" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "23a833b42b3c214b5191dfdc2482f2f901118247" }, "outputs": [], "source": [ "!ls ../input/sarcasm/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "ffa03aec57ab6150f9bec0fa56cd3a5791a3e6f4" }, "outputs": [], "source": [ "# some necessary imports\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b23e4fc7a1973d60e0c6da8bd60f3d921542a856" }, "outputs": [], "source": [ "train_df = pd.read_csv(\"../input/sarcasm/train-balanced-sarcasm.csv\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "4dc7b3787afa46c7eb0d0e33b0c41ab9821c4a27" }, "outputs": [], "source": [ "train_df.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "0a7ed9557943806c6813ad59c3d5ebdb403ffd78" }, "outputs": [], "source": [ "train_df.info()" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "6472f52fb5ecb8bb2a6e3b292678a2042fcfe34c" }, "source": [ "Some comments are missing, so we drop the corresponding rows." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "97b2d85627fcde52a506dbdd55d4d6e4c87d3f08" }, "outputs": [], "source": [ "train_df.dropna(subset=[\"comment\"], inplace=True)" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "9d51637ee70dca7693737ad0da1dbb8c6ce9230b" }, "source": [ "We notice that the dataset is indeed balanced:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "addd77c640423d30fd146c8d3a012d3c14481e11" }, "outputs": [], "source": [ "train_df[\"label\"].value_counts()" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "5b836574e5093c5eb2e9063fefe1c8d198dcba79" }, "source": [ "We split the data into training and validation parts." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "c200add4e1dcbaa75164bbcc73b9c12ecb863c96" }, "outputs": [], "source": [ "train_texts, valid_texts, y_train, y_valid = train_test_split(\n", "    train_df[\"comment\"], train_df[\"label\"], random_state=17\n", ")" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "7f0f47b98e49a185cd5cffe19fcbe28409bf00c0" }, "source": [ "## Tasks:\n", "1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.\n", "2. 
Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a Reddit comment (`comment`); a rough starter sketch is given at the end of this notebook.\n", "3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).\n", "4. (optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.\n", "\n", "## Links:\n", " - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)\n", " - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), as well as a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection\n", " - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) \"Approaching (Almost) Any NLP Problem on Kaggle\"\n", " - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions" ] }
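, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starter sketch (tasks 2 and 3)\n", "\n", "The cell below is a minimal sketch, not the author's official solution (that one is linked at the top of this notebook): a Tf-Idf + logistic regression baseline for task 2, plus a quick look at the tokens with the largest positive logit coefficients for task 3, done with plain scikit-learn rather than eli5. It relies on the imports and the `train_texts` / `valid_texts` split defined above, and all hyperparameters (`ngram_range`, `max_features`, `min_df`, `C`) are illustrative choices, not values prescribed by the assignment." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A baseline sketch: Tf-Idf on word uni- and bigrams feeding a logistic regression.\n", "# Hyperparameters are illustrative guesses, not tuned or prescribed values.\n", "tfidf_logit_pipeline = Pipeline(\n", "    [\n", "        (\"tfidf\", TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)),\n", "        (\"logit\", LogisticRegression(C=1.0, solver=\"liblinear\", random_state=17)),\n", "    ]\n", ")\n", "\n", "tfidf_logit_pipeline.fit(train_texts, y_train)\n", "valid_pred = tfidf_logit_pipeline.predict(valid_texts)\n", "print(\"Validation accuracy:\", accuracy_score(y_valid, valid_pred))\n", "\n", "# Tokens most indicative of sarcasm: largest positive logit coefficients.\n", "# (On scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out().)\n", "feature_names = np.array(\n", "    tfidf_logit_pipeline.named_steps[\"tfidf\"].get_feature_names_out()\n", ")\n", "coefs = tfidf_logit_pipeline.named_steps[\"logit\"].coef_.ravel()\n", "top_sarcastic = np.argsort(coefs)[::-1][:20]\n", "print(pd.Series(coefs[top_sarcastic], index=feature_names[top_sarcastic]))" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }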