{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" }, "outputs": [], "source": [ "# This Python 3 environment comes with many helpful analytics libraries installed\n", "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", "# For example, here's several helpful packages to load in \n", "\n", "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "color = sns.color_palette()\n", "\n", "%matplotlib inline\n", "\n", "from plotly import tools\n", "import plotly.offline as py\n", "py.init_notebook_mode(connected=True)\n", "import plotly.graph_objs as go\n", "\n", "\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import metrics\n", "\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "\n", "from sklearn.model_selection import cross_val_score\n", "\n", "# Input data files are available in the \"../input/\" directory.\n", "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", "import os\n", "import time\n", "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "from tqdm import tqdm\n", "import math\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import metrics\n", "\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "\n", "\n", "from sklearn.model_selection import KFold\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from keras.preprocessing import text, sequence\n", "import os\n", "print(os.listdir(\"../input\"))\n", "\n", "# Any results you write to the current directory are saved as output." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "5f8edfb2e3260acf27bbcda13da73c5c0c899ee4" }, "source": [ "**1. Feature and data explanation**" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "eb41beb67cc33f5e094499b36cf2233e9b5fa1e3" }, "source": [ "\n", "The basis of this project is taken kaggle competition \"Quora Insincere Questions Classification\". Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.\n", "In this competition, we will develop models that identify and flag insincere questions.\n", "An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.\n", "\n", "In this project I will try to help them with this. Let's start!" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "1e2224b616cb5080ce7c82ed7c1b4812328c971e" }, "source": [ "In this competition we will be predicting whether a question asked on Quora is sincere or not.\n", "An insincere question is defined as a question intended to make a statement rather than look for helpful answers. 
Some characteristics that can signify that a question is insincere:\n", "* Has a non-neutral tone:\n", " * Has an exaggerated tone to underscore a point about a group of people\n", " * Is rhetorical and meant to imply a statement about a group of people\n", "* Is disparaging or inflammatory:\n", " * Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype\n", " * Makes disparaging attacks/insults against a specific person or group of people\n", " * Based on an outlandish premise about a group of people\n", " * Disparages against a characteristic that is not fixable and not measurable\n", "* Isn't grounded in reality:\n", " * Based on false information, or contains absurd assumptions\n", "* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers\n", "\n", "The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6b14f8622946cfddb8e033e3b07d0613a1afc8a1" }, "source": [ " * *File descriptions* (data is here https://www.kaggle.com/c/quora-insincere-questions-classification/data)\n", " * train.csv - the training set\n", " * test.csv - the test set\n", " * sample_submission.csv - a sample submission in the correct format\n", " * embeddings/ - (see below)\n", " \n", "* *Data fields*\n", " * qid - unique question identifier\n", " * question_text - Quora question text\n", " * target - a question labeled \"insincere\" has a value of 1, otherwise 0" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "2875a65e64ed10a75767cd19ea3f4c8e73c9bc52" }, "source": [ "External data sources are not allowed for this competition, but Quora provides a number of pre-trained word vectors:\n", "* GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/\n", "* glove.840B.300d - https://nlp.stanford.edu/projects/glove/\n", "* paragram_300_sl999 - https://cogcomp.org/page/resource_view/106\n", "* wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a48c52c50c238827960dbcf1a179bf879f1bc80d" }, "source": [ "**2. Primary data analysis**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "29e78c5bac555ce495db6e6b92e1857a924b9b0c" }, "outputs": [], "source": [ "# load train and test datasets\n", "train_df = pd.read_csv(\"../input/train.csv\")\n", "test_df = pd.read_csv(\"../input/test.csv\")\n", "print(\"Train dataset shape:\", train_df.shape)\n", "print(\"Test dataset shape:\", test_df.shape)\n", "train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "58dcf2071146cc185de6319ab4fc561fbf71cf64" }, "outputs": [], "source": [ "train_df.info()\n", "test_df.info()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "6bce8d9bf74b42e0863c8463230958800673640e" }, "source": [ "We can see that there are no missing values." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "1738e870e2c8722193b58ddd21c0c06103cd7809" }, "source": [ "First, let us look at the distribution of the target variable to understand how imbalanced the classes are."
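] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting, here is a quick numeric check of the class balance (an illustrative addition): `value_counts` gives the absolute and relative frequency of each target value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick look at the class balance: absolute counts and proportions\n", "print(train_df['target'].value_counts())\n", "print(train_df['target'].value_counts(normalize=True).round(4))\n"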
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "891f6cc3eeded1ece24dfac06445fb67f6e4b6a7" }, "outputs": [], "source": [ "sns.countplot(train_df['target'])" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "b11ae038811ffd6e0c0041793e7376c319676918" }, "source": [ "So we see that data unbalanced.\n", "\n", "Since we only have text data, let's take a look at some of the mean values ​​of words / sentences." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "c66c24eee79396a9cc3c78179a714d0b14597d66" }, "outputs": [], "source": [ "print('Average word length of questions in train is {0:.0f}.'.format(np.mean(train_df['question_text'].apply(lambda x: len(x.split())))))\n", "print('Average word length of questions in test is {0:.0f}.'.format(np.mean(test_df['question_text'].apply(lambda x: len(x.split())))))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "62745d43f382f4c60729c22ff0c560f67314976c" }, "outputs": [], "source": [ "print('Max word length of questions in train is {0:.0f}.'.format(np.max(train_df['question_text'].apply(lambda x: len(x.split())))))\n", "print('Max word length of questions in test is {0:.0f}.'.format(np.max(test_df['question_text'].apply(lambda x: len(x.split())))))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "0e0e44246fd5d9c81549c5813ba5a6bf21662756" }, "outputs": [], "source": [ "print('Average character length of questions in train is {0:.0f}.'.format(np.mean(train_df['question_text'].apply(lambda x: len(x)))))\n", "print('Average character length of questions in test is {0:.0f}.'.format(np.mean(test_df['question_text'].apply(lambda x: len(x)))))" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "8b933f1d75be9e55385af4fce3731449365195e7" }, "source": [ "As we can see on average questions in train and test datasets are similar, but there are quite long questions in train dataset." 
] }, { "cell_type": "markdown", "metadata": { "_uuid": "a41a10a402b75536852806d7e0ca72d06b9e39d0" }, "source": [ "Let's look at the most frequent words in each of the classes separately and immediately visualize it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "a34f07fcbc950cbfe4ee880df2a5e8fe7fdc1dc8" }, "outputs": [], "source": [ "from collections import defaultdict\n", "from wordcloud import WordCloud, STOPWORDS\n", "train1_df = train_df[train_df[\"target\"]==1]\n", "train0_df = train_df[train_df[\"target\"]==0]\n", "\n", "## custom function for ngram generation ##\n", "def generate_ngrams(text, n_gram=1):\n", " token = [token for token in text.lower().split(\" \") if token != \"\" if token not in STOPWORDS]\n", " ngrams = zip(*[token[i:] for i in range(n_gram)])\n", " return [\" \".join(ngram) for ngram in ngrams]\n", "\n", "## custom function for horizontal bar chart ##\n", "def horizontal_bar_chart(df, color):\n", " trace = go.Bar(\n", " y=df[\"word\"].values[::-1],\n", " x=df[\"wordcount\"].values[::-1],\n", " showlegend=False,\n", " orientation = 'h',\n", " marker=dict(\n", " color=color,\n", " ),\n", " )\n", " return trace\n", "\n", "## Get the bar chart from sincere questions ##\n", "freq_dict = defaultdict(int)\n", "for sent in train0_df[\"question_text\"]:\n", " for word in generate_ngrams(sent):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace0 = horizontal_bar_chart(fd_sorted.head(50), 'blue')\n", "\n", "## Get the bar chart from insincere questions ##\n", "freq_dict = defaultdict(int)\n", "for sent in train1_df[\"question_text\"]:\n", " for word in generate_ngrams(sent):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace1 = horizontal_bar_chart(fd_sorted.head(50), 'blue')\n", "\n", "# Creating two subplots\n", "fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,\n", " subplot_titles=[\"Frequent words of sincere questions\", \n", " \"Frequent words of insincere questions\"])\n", "fig.append_trace(trace0, 1, 1)\n", "fig.append_trace(trace1, 1, 2)\n", "fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title=\"Word Count Plots\")\n", "py.iplot(fig, filename='word-plots')\n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "69c3db60580f529864bacadbd9b75048ddf912dc" }, "source": [ "It can be seen that in insincere questions words as 'black', 'white', 'muslims', 'trump', 'woman' is predominate, which hints at us on racial discrimination, sexual content. But for example, the word 'people' is often found in both classes. Let's look at bi-gramm and tri-gramm to understand the context in which they most often used." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "589b2573d4dc803e134a0f132c76f91558658c7f" }, "outputs": [], "source": [ "freq_dict = defaultdict(int)\n", "for sent in train0_df[\"question_text\"]:\n", " for word in generate_ngrams(sent,2):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace0 = horizontal_bar_chart(fd_sorted.head(50), 'orange')\n", "\n", "\n", "freq_dict = defaultdict(int)\n", "for sent in train1_df[\"question_text\"]:\n", " for word in generate_ngrams(sent,2):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace1 = horizontal_bar_chart(fd_sorted.head(50), 'orange')\n", "\n", "# Creating two subplots\n", "fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,\n", " subplot_titles=[\"Frequent bigrams of sincere questions\", \n", " \"Frequent bigrams of insincere questions\"])\n", "fig.append_trace(trace0, 1, 1)\n", "fig.append_trace(trace1, 1, 2)\n", "fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title=\"Bigram Count Plots\")\n", "py.iplot(fig, filename='word-plots')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "2e2be63858b13e618172868939f0a12dac7798a2" }, "outputs": [], "source": [ "freq_dict = defaultdict(int)\n", "for sent in train0_df[\"question_text\"]:\n", " for word in generate_ngrams(sent,3):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace0 = horizontal_bar_chart(fd_sorted.head(50), 'green')\n", "\n", "\n", "freq_dict = defaultdict(int)\n", "for sent in train1_df[\"question_text\"]:\n", " for word in generate_ngrams(sent,3):\n", " freq_dict[word] += 1\n", "fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])\n", "fd_sorted.columns = [\"word\", \"wordcount\"]\n", "trace1 = horizontal_bar_chart(fd_sorted.head(50), 'green')\n", "\n", "# Creating two subplots\n", "fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, horizontal_spacing=0.15,\n", " subplot_titles=[\"Frequent trigrams of sincere questions\", \n", " \"Frequent trigrams of insincere questions\"])\n", "fig.append_trace(trace0, 1, 1)\n", "fig.append_trace(trace1, 1, 2)\n", "fig['layout'].update(height=1200, width=1000, paper_bgcolor='rgb(233,233,233)', title=\"Trigram Count Plots\")\n", "py.iplot(fig, filename='word-plots')" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "ba6965131abeac9256bfff78f319e78c356e4d61" }, "source": [ "On these two graphs, you can see phrases that determine the insincere question or not. They clearly indicate unfriendly content.\n", "No wonder that the word 'donald trump' is so common. These words apparently refer to the political context, and there, as is well known, there are always heated discussions between his supporters and those who do not like him." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a41dc24f0c36f413dd6b126e3eb4a9eb90dc0c17" }, "source": [ "**3. 
Primary visual data analysis**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "dd5f9d58ff216d3ab35190a090020b2f1c719c96" }, "outputs": [], "source": [ "train_df['question_text'].apply(lambda x: len(x.split())).plot(kind='hist');\n", "plt.yscale('log');\n", "plt.title('Distribution of question text length in words (train)')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "a545a8758ed5b208db1a6e317861e93ec2e97117" }, "outputs": [], "source": [ "test_df['question_text'].apply(lambda x: len(x.split())).plot(kind='hist');\n", "plt.yscale('log');\n", "plt.title('Distribution of question text length in words (test)')" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a3f288109da7c0aba15fba0b3de1b432c0ed77ce" }, "source": [ "We can see that most of the questions in train and test are 40 words long or shorter. \n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4f6451ae7d7a34db48e19640dbe161e8ea97e6c3" }, "source": [ "Now let us see how some simple features are distributed across sincere and insincere questions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "a07172c64c074c3c0d13cccac5211e240e8c0a0c" }, "outputs": [], "source": [ "## Number of words in the text ##\n", "train_df[\"num_words\"] = train_df[\"question_text\"].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df[\"question_text\"].apply(lambda x: len(str(x).split()))\n", "\n", "## Number of characters in the text ##\n", "train_df[\"num_chars\"] = train_df[\"question_text\"].apply(lambda x: len(str(x)))\n", "test_df[\"num_chars\"] = test_df[\"question_text\"].apply(lambda x: len(str(x)))\n", "\n", "f, axes = plt.subplots(2, 1, figsize=(10,20))\n", "sns.boxplot(x='target', y='num_words', data=train_df, ax=axes[0])\n", "axes[0].set_xlabel('Target', fontsize=12)\n", "axes[0].set_title(\"Number of words in each class\", fontsize=15)\n", "\n", "sns.boxplot(x='target', y='num_chars', data=train_df, ax=axes[1])\n", "axes[1].set_xlabel('Target', fontsize=12)\n", "axes[1].set_title(\"Number of characters in each class\", fontsize=15)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "f225a32b4b0735fb47df455913af1c538e30328e" }, "source": [ "We can see that insincere questions tend to have more words as well as more characters than sincere questions." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "05cda4a655e1ed8350211c69769440c6fc9ec38a" }, "source": [ "**4. Insights and found dependencies**\n", "\n", "The problem is closely related to sentiment analysis. Neural networks, for example RNNs/LSTMs, which can reveal hidden patterns, are most often used for such tasks.\n", "But we noticed that some word combinations that determine the class of a question stand out very well, so let's do some text preprocessing, apply a TF-IDF transform and build a simple LogisticRegression baseline." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "c0a97473f8d8938648e567f62104f4b1b1e66e2b" }, "source": [ "**5. Metrics selection**" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "df423154190d9aef6ffe9539d2e0a6cc45eef349" }, "source": [ "The competition metric is the **F1 score**. F1 is a function of Precision and Recall, namely their harmonic mean:\n", "\n", "**F1 = 2 * (Precision * Recall) / (Precision + Recall)**"
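] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A small worked example (an illustrative addition, not part of the original pipeline; the toy arrays `y_true` and `y_majority` are made up): on data this imbalanced, a model that always predicts the majority class looks good by accuracy but gets an F1 of 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, f1_score\n", "\n", "# Toy labels with ~6% positive class, roughly like this dataset\n", "y_true = np.array([0] * 94 + [1] * 6)\n", "y_majority = np.zeros_like(y_true)  # always predict 'sincere'\n", "\n", "print('Accuracy:', accuracy_score(y_true, y_majority))  # 0.94 -- looks good\n", "print('F1 score:', f1_score(y_true, y_majority))        # 0.0 -- useless for the positive class\n"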
\n", "F1 formula:\n", "**F1 = 2*(Precision * Recall)/(Precision + Recall)**" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "d69668dc11968e50a8f3916a9863f731511a176d" }, "source": [ "F1 Score might be a better measure to use if we need to seek a balance between Precision and Recall. \n", "Since dataset is very unbalanced, it’s no wonder why this metric is chosen." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "1f7be43a989bd51adaac7b9f0582e1dd76f387f4" }, "source": [ "**6. Model selection**\n", "We will predict the class of the question using a simple logistic regression." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "3bc196b19b682c58163e146614a5f2b27c74fb67" }, "source": [ "**7. Data preprocessing**" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "a881b5ede5e820946f043b8d3fc7104c68ec6432" }, "source": [ "Lets make some text preprocessing:\n", "* Removing numbers and special characters\n", "* Lowering letters\n", "* Removing Stopwords\n", "* Lemmatization " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "004356e48b0db9669e46ddac3f276d288d3ab378" }, "outputs": [], "source": [ "from nltk import WordNetLemmatizer\n", "import re\n", "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('english'))\n", "wnl = WordNetLemmatizer()\n", "def preprocess_text(text): \n", " # Keeping letters + lowerring\n", " text = re.sub(\"[^a-zA-Z]\",\" \", text)\n", " text = re.findall(r\"[a-zA-Z]+\", text.lower())\n", " \n", " # Removing stopwords\n", " text = [word for word in text if (word not in stop_words and len(word)>2)]\n", " \n", " # Lemming\n", " text = [wnl.lemmatize(word) for word in text]\n", " \n", " # Removing repetitions\n", " text = re.sub(r'(.)\\1+', r'\\1\\1', ' '.join(text))\n", " \n", " return text\n", "\n", "print('Cleaning data ... ')\n", "train_df['clean_text'] = train_df['question_text'].transform(preprocess_text)\n", "test_df['clean_text'] = train_df['question_text'].transform(preprocess_text)\n", "print(train_df['clean_text'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "11d10b4a5b5ad91d9b775b7b3724a773f2303b02" }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "9ae984ddc760bdb8e3f10782e767b2fbbe656439" }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "tfidf = TfidfVectorizer(ngram_range=(1, 3))\n", "x_train = tfidf.fit_transform(train_df['clean_text'])\n", "y_train = train_df['target']\n", "\n", "print(x_train.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "7edeb7ff4ca093eb48cac0a8159a79f81b0d14a2" }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import f1_score\n", "from sklearn.metrics import classification_report\n", "from sklearn.model_selection import StratifiedKFold, KFold\n", "from sklearn.model_selection import cross_val_score\n", "skf = StratifiedKFold(n_splits=5, random_state=7, shuffle=True)\n", "#C = 12.5 best param using gridsearch\n", "logit = LogisticRegression(C=12.5, random_state = 7)\n", "\n", "# cv_scores = cross_val_score(logit, x_train, train_df['target'], cv=skf,\n", "# scoring='f1', n_jobs=-1)\n", "# print(cv_scores.mean(), cv_scores)\n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "d131cc48a1261d3ca2fcad6c395e5d5967e0fb41" }, "source": [ "Now try to add a little bit meta feature." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "5d2f70193c48cf1f70aabd98160c2cc4d6561c51" }, "outputs": [], "source": [ "\n", "import string\n", "train_df[\"num_words\"] = train_df['question_text'].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df['question_text'].apply(lambda x: len(str(x).split()))\n", "\n", "train_df[\"num_unique_words\"] = train_df['question_text'].apply(lambda x: len(set(str(x).split())))\n", "test_df[\"num_unique_words\"] = test_df['question_text'].apply(lambda x: len(set(str(x).split())))\n", "\n", "train_df[\"num_chars\"] = train_df['question_text'].apply(lambda x: len(str(x)))\n", "test_df[\"num_chars\"] = test_df['question_text'].apply(lambda x: len(str(x)))\n", "\n", "train_df[\"mean_word_len\"] = train_df['question_text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] = test_df['question_text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "\n", "train_df[\"num_stopwords\"] = train_df[\"question_text\"].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))\n", "test_df[\"num_stopwords\"] = test_df[\"question_text\"].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))\n", "\n", "## Number of punctuations in the text ##\n", "train_df[\"num_punctuations\"] =train_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )\n", "test_df[\"num_punctuations\"] =test_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )\n", "\n", "## Number of title case words in the text ##\n", "train_df[\"num_words_upper\"] = train_df[\"question_text\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "test_df[\"num_words_upper\"] = test_df[\"question_text\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "\n", "## Number of title case words in the text ##\n", "train_df[\"num_words_title\"] = train_df[\"question_text\"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))\n", "test_df[\"num_words_title\"] = test_df[\"question_text\"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))\n", "\n", "## Average length of the words in the text ##\n", "train_df[\"mean_word_len\"] = train_df[\"question_text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] =test_df[\"question_text\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "print(train_df.head(5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "d239e282965b951e13902910cb34d2215588b33a" }, "outputs": [], "source": [ "from scipy.sparse import hstack\n", "feats = [\"num_words\", \"num_unique_words\", \"num_chars\", \"mean_word_len\", \"num_stopwords\", \"num_punctuations\", \"num_words_upper\", \"num_words_title\"]\n", "X_train_full = hstack([x_train, train_df[feats].values])\n", "\n", "# cv_scores = cross_val_score(logit, X_train_full, train_df['target'], cv=skf,\n", "# scoring='f1', n_jobs=-1)\n", "# print(cv_scores.mean(), cv_scores)\n" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "1df60109820fe6aaf6799e11c64cf690d6938cab" }, "source": [ "Now, let's find optimal parametr C for logregression using GridSearch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "e322d9738fd9d6d7cb93a46156af30017a354b4f" }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", " #GridSearch for C\n", "c_values = 
np.linspace(10, 15, 5)\n", "\n", "logit_grid_searcher = GridSearchCV(estimator=logit, param_grid={'C': c_values},\n", "                                   scoring='f1', n_jobs=-1, cv=skf, verbose=1)\n", "\n", "logit_grid_searcher.fit(X_train_full, train_df['target'])\n", "print(logit_grid_searcher.best_score_, logit_grid_searcher.best_params_)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4e549b6e1ada35682b98e836ea4b56d8450e7e6c" }, "source": [ "**10. Plotting training and validation curves**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "f789a36c63104b22972275b29d17e038d3ff1cff" }, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve\n", "\n", "def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n", "                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):\n", "    plt.figure()\n", "    plt.title(title)\n", "    if ylim is not None:\n", "        plt.ylim(*ylim)\n", "    plt.xlabel(\"Training examples\")\n", "    plt.ylabel(\"Score\")\n", "    train_sizes, train_scores, test_scores = learning_curve(\n", "        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)\n", "    train_scores_mean = np.mean(train_scores, axis=1)\n", "    train_scores_std = np.std(train_scores, axis=1)\n", "    test_scores_mean = np.mean(test_scores, axis=1)\n", "    test_scores_std = np.std(test_scores, axis=1)\n", "    plt.grid()\n", "\n", "    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", "                     train_scores_mean + train_scores_std, alpha=0.1,\n", "                     color=\"r\")\n", "    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n", "                     test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n", "    plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", "             label=\"Training score\")\n", "    plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n", "             label=\"Cross-validation score\")\n", "\n", "    plt.legend(loc=\"best\")\n", "    plt.show()\n", "\n", "title = \"Learning Curves (Logistic Regression)\"\n", "plot_learning_curve(logit, title, X_train_full, train_df['target'], ylim=(0.3, 1.01), cv=skf, n_jobs=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "a04ef05135a322c4dbecf7a1844e76eae864dce6" }, "outputs": [], "source": [ "scores = logit_grid_searcher.cv_results_['mean_test_score']\n", "scores_std = logit_grid_searcher.cv_results_['std_test_score']\n", "plt.figure().set_size_inches(8, 6)\n", "plt.semilogx(c_values, scores)\n", "\n", "# plot error lines showing +/- std. errors of the scores\n", "std_error = scores_std / np.sqrt(5)\n", "\n", "plt.semilogx(c_values, scores + std_error, 'b--')\n", "plt.semilogx(c_values, scores - std_error, 'b--')\n", "\n", "plt.fill_between(c_values, scores + std_error, scores - std_error, alpha=0.2)\n", "\n", "plt.ylabel('CV score +/- std error')\n", "plt.xlabel('c_values')\n", "plt.axhline(np.max(scores), linestyle='--', color='.5')\n", "plt.xlim([c_values[0], c_values[-1]])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "2901ce48923c1803a8196dff59d4c9de87d69f84" }, "source": [ "It's time to make predictions for the test set!"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "2f7388ae2975c83489ee00af6a69ce4708a9e6fd" }, "outputs": [], "source": [ "x_test = tfidf.transform(test_df['clean_text'])\n", "X_test_full = hstack([x_test, test_df[feats].values])\n", "logit.fit(X_train_full, train_df['target'])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "e767d76ae46f211abceabba05b296043ac74c788" }, "outputs": [], "source": [ "pred = logit.predict(X_test_full)\n", "example = pd.read_csv('../input/sample_submission.csv')\n", "example['prediction'] = pred\n", "example.to_csv('submission.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 1 }