{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "3f6c2bfe6b2e26c92357e896a1511195d836956e" }, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", "\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "cb01ca96934e5c83a36a2308da9645b87a9c52a0" }, "source": [ "##
Assignment 4 (demo)\n", "###
Sarcasm detection with logistic regression\n", "    \n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit) + [solution](https://www.kaggle.com/kashnitsky/a4-demo-sarcasm-detection-with-logit-solution).**\n", "\n", "\n", "We'll be using the dataset from the [paper](https://arxiv.org/abs/1704.05579) \"A Large Self-Annotated Corpus for Sarcasm\" with >1 million comments from Reddit, labeled as either sarcastic or not. A processed version can be found on Kaggle in the form of a [Kaggle Dataset](https://www.kaggle.com/danofer/sarcasm).\n", "\n", "Sarcasm detection is easy. \n", "" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "23a833b42b3c214b5191dfdc2482f2f901118247" }, "outputs": [], "source": [ "!ls ../input/sarcasm/" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "ffa03aec57ab6150f9bec0fa56cd3a5791a3e6f4" }, "outputs": [], "source": [ "# some necessary imports\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b23e4fc7a1973d60e0c6da8bd60f3d921542a856" }, "outputs": [], "source": [ "train_df = pd.read_csv(\"../input/sarcasm/train-balanced-sarcasm.csv\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "4dc7b3787afa46c7eb0d0e33b0c41ab9821c4a27" }, "outputs": [], "source": [ "train_df.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "0a7ed9557943806c6813ad59c3d5ebdb403ffd78" }, "outputs": [], "source": [ "train_df.info()" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "6472f52fb5ecb8bb2a6e3b292678a2042fcfe34c" }, "source": [ "Some comments are missing, so we drop the corresponding rows." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "97b2d85627fcde52a506dbdd55d4d6e4c87d3f08" }, "outputs": [], "source": [ "train_df.dropna(subset=[\"comment\"], inplace=True)" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "9d51637ee70dca7693737ad0da1dbb8c6ce9230b" }, "source": [ "We notice that the dataset is indeed balanced:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "addd77c640423d30fd146c8d3a012d3c14481e11" }, "outputs": [], "source": [ "train_df[\"label\"].value_counts()" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "5b836574e5093c5eb2e9063fefe1c8d198dcba79" }, "source": [ "We split the data into training and validation parts." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "c200add4e1dcbaa75164bbcc73b9c12ecb863c96" }, "outputs": [], "source": [ "train_texts, valid_texts, y_train, y_valid = train_test_split(\n", "    train_df[\"comment\"], train_df[\"label\"], random_state=17\n", ")" ] },
{ "cell_type": "markdown", "metadata": { "_uuid": "7f0f47b98e49a185cd5cffe19fcbe28409bf00c0" }, "source": [ "## Tasks:\n", "1. Analyze the dataset, make some plots. This [Kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) might serve as an example.\n", "2. 
Build a Tf-Idf + logistic regression pipeline to predict sarcasm (`label`) based on the text of a Reddit comment (`comment`); a rough starter sketch is given at the end of this notebook.\n", "3. Plot the words/bigrams which are most predictive of sarcasm (you can use [eli5](https://github.com/TeamHG-Memex/eli5) for that).\n", "4. (optional) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e. treat each subreddit as a new feature.\n", "\n", "## Links:\n", " - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)\n", " - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its applications to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), as well as a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection\n", " - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) \"Approaching (Almost) Any NLP Problem on Kaggle\"\n", " - [ELI5](https://github.com/TeamHG-Memex/eli5) to explain model predictions" ] }
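, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starter sketch (tasks 2 and 3)\n", "\n", "The cell below is a minimal sketch, not the author's official solution (that one is linked at the top of this notebook): a Tf-Idf + logistic regression baseline for task 2, plus a quick look at the tokens with the largest positive logit coefficients for task 3, done with plain scikit-learn rather than eli5. It relies on the imports and the `train_texts` / `valid_texts` split defined above, and all hyperparameters (`ngram_range`, `max_features`, `min_df`, `C`) are illustrative choices, not values prescribed by the assignment." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A baseline sketch: Tf-Idf on word uni- and bigrams feeding a logistic regression.\n", "# Hyperparameters are illustrative guesses, not tuned or prescribed values.\n", "tfidf_logit_pipeline = Pipeline(\n", "    [\n", "        (\"tfidf\", TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)),\n", "        (\"logit\", LogisticRegression(C=1.0, solver=\"liblinear\", random_state=17)),\n", "    ]\n", ")\n", "\n", "tfidf_logit_pipeline.fit(train_texts, y_train)\n", "valid_pred = tfidf_logit_pipeline.predict(valid_texts)\n", "print(\"Validation accuracy:\", accuracy_score(y_valid, valid_pred))\n", "\n", "# Tokens most indicative of sarcasm: largest positive logit coefficients.\n", "# (On scikit-learn < 1.0, use get_feature_names() instead of get_feature_names_out().)\n", "feature_names = np.array(\n", "    tfidf_logit_pipeline.named_steps[\"tfidf\"].get_feature_names_out()\n", ")\n", "coefs = tfidf_logit_pipeline.named_steps[\"logit\"].coef_.ravel()\n", "top_sarcastic = np.argsort(coefs)[::-1][:20]\n", "print(pd.Series(coefs[top_sarcastic], index=feature_names[top_sarcastic]))" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }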