{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "\n", "\n", "### Individual Project\n", "# Predicting Wine Expert Rating\n", "\n", "
Author: Maxim Klyuchnikov
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Table of Contents

\n", "
\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "from scipy import stats\n", "import plotly.offline as py\n", "import warnings\n", "import pycountry\n", "from statsmodels.graphics.gofplots import qqplot\n", "from wordcloud import WordCloud, STOPWORDS\n", "\n", "warnings.filterwarnings('ignore')\n", "\n", "import re\n", "from sklearn.linear_model import Ridge, RidgeCV\n", "from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.preprocessing import StandardScaler\n", "from scipy.sparse import csr_matrix, hstack\n", "from yellowbrick.model_selection import ValidationCurve, LearningCurve\n", "\n", "py.init_notebook_mode(connected=True)\n", "import plotly.graph_objs as go\n", "\n", "RANDOM_SEED=17" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Project Description\n", "\n", "### Dataset\n", "The data is taken from the Kaggle dataset https://www.kaggle.com/zynicide/wine-reviews/home, which in turn was scraped by the dataset's author from https://www.winemag.com/
\n", "There are lot of reviews from differents experts for the wines from the whole world. Also, some wine-specific information is also provided as a part of the dataset.
\n", "Dataset consists of the following fields (per info from https://github.com/zackthoutt/wine-deep-learning):\n", "\n", "### Features\n", "* **Points**: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)\n", "* **Title**: the title of the wine review, which often contains the vintage if you're interested in extracting that feature\n", "* **Variety**: the type of grapes used to make the wine (ie Pinot Noir)\n", "* **Description**: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.\n", "* **Country**: the country that the wine is from\n", "* **Province**: the province or state that the wine is from\n", "* **Region 1**: the wine growing area in a province or state (ie Napa)\n", "* **Region 2**: sometimes there are more specific regions specified within a wine growing area (ie Rutherford inside the Napa Valley), but this value can sometimes be blank\n", "* **Winery**: the winery that made the wine\n", "* **Designation**: the vineyard within the winery where the grapes that made the wine are from\n", "* **Price**: the cost for a bottle of the wine, in US$\n", "* **Taster Name**: name of the person who tasted and reviewed the wine\n", "* **Taster Twitter Handle**: Twitter handle for the person who tasted ane reviewed the wine\n", "\n", "### Target\n", "We have wine rating (**Points**) as a target. Reviewers from the original site provide rating for the wines varying from 80 to 100, here is the details of different ranges:\n", "\n", "| Range | Mark | Description |\n", "|--------|------------|--------------------------------------------------------|\n", "| 98–100 | Classic | The pinnacle of quality |\n", "| 94–97 | Superb | A great achievement |\n", "| 90–93 | Excellent | Highly recommended |\n", "| 87–89 | Very | Often good value; well recommended |\n", "| 83–86 | Good | Suitable for everyday consumption; often good value |\n", "| 80–82 | Acceptable | Can be employed in casual, less-critical circumstances |\n", "\n", "### Our goal and possible applications\n", "Originally, dataset author collected the data to ```create a predictive model to identify wines through blind tasting like a master sommelier would```.\n", "Here we will try to solve simpler, yet useful in real life, task: predict the wine rating based on the wine features and words used in its review. This can have the following practical applications:\n", "#### Understanding the unrated wine quality\n", "Unlike other beverages, wines comes in overwhelming variety: it's about 10k grapes exists (and their number is growing), they can be blended in different proportions, the grape collection year and growing conditions comes into play, the wine may be seasoned for different amount of time in different types of barrels, etc, etc.\n", "\n", "So review of the specific wine or lists like \"top 10 wines of the season\" doesn't make any sense - if you go to 2 different local stores there is a good chance you won't find the same wine in both of them. Finding the specific wine may require journey to another city or even country :) In such conditions it's worth to have a model which may predict the wine quality without having an exact rating given by the expert, but based on the wine features which you can get from the bottle.\n", "#### Blind testing the expert predictions\n", "While this is an area of purely personal taste, professionals always try to become free from the biases and provide objective observations. 
Blind testing may help uncover the biases of a specific reviewer.
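\n", "Actually, the model could be used for _cross-validation_ of the expert ratings :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For later reference, here is a small helper mapping points back to the magazine's verbal marks from the table above (our own sketch; the ```points_to_mark``` name is hypothetical and not used elsewhere):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def points_to_mark(points):\n", "    # Thresholds follow the WineEnthusiast table above\n", "    if points >= 98: return 'Classic'\n", "    if points >= 94: return 'Superb'\n", "    if points >= 90: return 'Excellent'\n", "    if points >= 87: return 'Very Good'\n", "    if points >= 83: return 'Good'\n", "    return 'Acceptable'\n", "\n", "points_to_mark(91)  # 'Excellent'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "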
\n", "Actually, the model could be used for _cross-validation_ of the expert ratings :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Analysis and Cleaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's download the data from Kaggle, extract them into ```data``` folder and check the main properties of the resulting DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data/winemag-data-130k-v2.csv', index_col=0)\n", "df.info(memory_usage='deep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there are many null values in the data, we need to deal with them later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the data for possible categorical features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like the following features can be represented as categorical:\n", "* designation\n", "* province\n", "* region_1\n", "* region_2\n", "* taster_name\n", "* taster_twitter_handle\n", "* variety\n", "* winery\n", "\n", "Let's explore the data now to get acquainted to the dataset more closely:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Country" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(13, 10))\n", "ax = sns.countplot(y=df.country, order=df.country.value_counts().index, palette='tab10')\n", "for p, label in zip(ax.patches, df.country.value_counts()):\n", " ax.annotate(\"{0:,d}\".format(label), (p.get_width() + 50, p.get_y() + 0.7))\n", "ax.set_title('Number of wine reviews per country', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that we have a lot of the reviews for the wines from US, which can be explained by the fact that the reviewers are mostly located in the US.
\n", "Also it should be noted that we have countries with less number of reviews which may cause problems.\n", "\n", "Let's see how the countries are distributed on the map, along with the number if review in them.
\n", "For the Choropleth to display the coloring in a more understantable way, let's ```log1p```-transform the number of reviews per country:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countries = df.groupby('country').size().reset_index()\n", "countries.columns = ['name', 'size']\n", "countries.name = countries.name.replace({ # making the country names compatible with pycountry\n", " 'England': 'United Kingdom',\n", " 'Czech Republic': 'United Kingdom',\n", " 'Macedonia': 'Macedonia, Republic of',\n", " 'Moldova': 'Moldova, Republic of',\n", " 'US': 'United States'\n", "})\n", "\n", "data = pd.DataFrame(index=countries.index)\n", "data['name'] = countries.name\n", "data['size'] = countries['size']\n", "data['code'] = countries.apply(lambda x: pycountry.countries.get(name=x['name']), axis=1)\n", "data['code'] = data.code.apply(lambda x: x.alpha_3 if x else None)\n", "data = data.dropna()\n", "\n", "choropleth_data = [dict(\n", " type='choropleth',\n", " locations=data['code'],\n", " z=np.log1p(data['size']),\n", " #showscale=False,\n", " text=data['name'],\n", " marker=dict(\n", " line=dict(\n", " color='rgb(180,180,180)',\n", " width=0.5\n", " )),\n", ")]\n", "\n", "layout = dict(\n", " title='Number of wine reviews per country, log-transformed',\n", " geo=dict(\n", " showframe=False,\n", " showcoastlines=True,\n", " projection=dict(\n", " type='natural earth'\n", " )\n", " ))\n", "\n", "fig = dict(data=choropleth_data, layout=layout)\n", "py.iplot(fig, validate=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_rated_countries = df[['country', 'points']].groupby('country').mean().reset_index().sort_values('points', ascending=False).country[:10]\n", "\n", "data = df[df.country.isin(top_rated_countries)]\n", "\n", "plt.figure(figsize=(15, 7))\n", "ax = sns.violinplot(x='country', y='points', data=data, order=top_rated_countries, palette='tab10')\n", "ax.set_title('Top 10 countries with highest average rating', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that the some of the countries with low number of reviews has pretty high average rating.
\n", "Probably, it's because wines with the highest potential rating are the first to be reviewed by the experts.
\n", "The dependency between the **Country** and **Points** is clear." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cleaning and transforming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Countries with less number of reviews does not have too much predictive power and introduce unnecessary noise, so let's replace them with the name 'Other' instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vc = df.country.value_counts()\n", "df['trans_country'] = df.country.replace(vc[vc < 100].index, 'Other')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]\n", "\n", "top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]\n", "\n", "plt.figure(figsize=(15, 7))\n", "ax = sns.violinplot(x='trans_country', y='points', data=top_rated_countries_data, order=top_rated_countries, palette='tab10')\n", "ax.set_title('Top 10 countries with highest average rating', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now see better distribution of the rating among countries in the top 10 list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Province, Region 1 and Region 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These features are actually parts of the wine location hierarchy, so they better be joined into one field with the **Country**.\n", "Let's take a look at them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['trans_country', 'province', 'region_1', 'region_2']].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['trans_country', 'province', 'region_1', 'region_2']].nunique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Countries with Region 2:', df[~df.region_2.isna()].trans_country.unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like **Region 2** is a US-specific feature, but it won't hurt if we include it as well, so we get better categorization for US wines." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)\n", "df.location.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try to see if there is a dependency between the **Points** and **Location**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_top_locations = df[df.location.isin(df.location.value_counts().index[:10])]\n", "\n", "plt.figure(figsize=(12, 10))\n", "ax = sns.violinplot(y='location', x='points', data=df_top_locations, palette='tab10');\n", "ax.set_title('Wine rating distribution over top 10 locations with highest average rating', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cleaning and transforming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if we can get something from the title:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['region_1', 'title']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, some regions are repeated in title and even if region is NaN, it is possible to fill it with the value from the title, so let's do it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def extract_region_1(row):\n", " if row.region_1 == 'nan':\n", " return row.region_1\n", " if not row.title.endswith(')'):\n", " return None\n", " return row.title[row.title.rindex('(')+1:-1]\n", "\n", "df.region_1 = df.apply(extract_region_1, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['region_1', 'title']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, now let's recreate the **Location**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)\n", "df.location.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's replace the locations with lower amount of reviews with the name 'Other'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vc = df.location.value_counts()\n", "df.location = df.location.replace(vc[vc < 2].index, 'Other')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Price is is given in the US$, let's see how it's distributed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "data = df[~df.price.isna()]\n", "plt.scatter(range(data.shape[0]), np.sort(data.price.values)[::-1])\n", "plt.title(\"Distribution of wine prices\", fontsize=18)\n", "plt.ylabel('Price');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, there are wines with more than $3000 price. 
That's not your usual weekend wine :)\n", "\n", "As we see, the price distribution is very skewed; let's try to log-transform it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 3))\n", "series_price = df[~df.price.isna()].price.apply(np.log1p)\n", "ax = sns.distplot(series_price);\n", "ax.set_title(\"Distribution of wine prices\", fontsize=18)\n", "ax.set_ylabel('Density')\n", "ax.set_xlabel('Price (log1p)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Still, it's not normal:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Shapiro-Wilk test:', stats.shapiro(series_price))\n", "print('Kolmogorov-Smirnov test:', stats.kstest(series_price, cdf='norm'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But it is not very skewed anymore:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Skewness:', series_price.skew())\n", "print('Kurtosis:', series_price.kurt())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see the connection between the **Price** (not log-transformed) and **Points**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "ax = sns.regplot(x='points', y='price', data=df, fit_reg=False, x_jitter=True)\n", "ax.set_title('Correlation between the wine price and points given', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now let's see which countries have the most expensive wines (on average):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(13, 7))\n", "data = df[['country', 'price']].groupby('country').mean().reset_index().sort_values('price', ascending=False)\n", "ax = sns.barplot(y='country', x='price', data=data, palette='tab10')\n", "\n", "for p, label in zip(ax.patches, data.price):\n", "    if np.isnan(label):\n", "        continue\n", "    ax.annotate('{0:.2f}'.format(label), (p.get_width() + 0.2, p.get_y() + 0.5))\n", "\n", "ax.set_title('Top countries with the most expensive average wine prices');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, we see, for example, Germany, Hungary and France among the leaders here; they are also among the leaders in average wine rating above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take the countries with the top rated wines and see the price distribution in them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "sns.violinplot(x='country', y='price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, not good: the **Price** needs to be transformed." 
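] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A side note on the normality tests used above: ```stats.kstest(x, cdf='norm')``` compares against the standard normal N(0, 1), so strictly speaking the series should be standardized first, otherwise the location/scale mismatch alone guarantees rejection. A minimal sketch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Standardize before comparing against N(0, 1)\n", "standardized = (series_price - series_price.mean()) / series_price.std()\n", "print('Kolmogorov-Smirnov test (standardized):', stats.kstest(standardized, cdf='norm'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either way the conclusion stands: the log-transformed price is much less skewed but still not normal." 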
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cleaning and transforming" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['trans_price'] = df.price.apply(np.log1p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]\n", "\n", "top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]\n", "\n", "plt.figure(figsize=(15, 5))\n", "sns.violinplot(x='trans_country', y='trans_price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "ax = sns.regplot(x='points', y='trans_price', data=df, fit_reg=False, x_jitter=True)\n", "ax.set_title('Correlation between the wine price (log) and points given', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Variety" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the top 10 varietes with their wine counts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_top_varieties = df[df.variety.isin(df.variety.value_counts().index[:10])]\n", "plt.figure(figsize=(13, 5))\n", "ax = sns.countplot(y=df_top_varieties.variety, order=df_top_varieties.variety.value_counts().index, palette='tab10')\n", "for p, label in zip(ax.patches, df_top_varieties.variety.value_counts()):\n", " ax.annotate(\"{0:,d}\".format(label), (p.get_width() + 50, p.get_y() + 0.5))\n", "ax.set_title('Number of wines per variety', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see the dependency between the **Variety** and **Points**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(14, 10))\n", "ax = sns.violinplot(y='variety', x='points', data=df_top_varieties, palette='tab10', order=df_top_varieties.variety.value_counts().index)\n", "ax.set_title('Wine rating distribution over top 10 varietes by wine count', fontsize=18);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we see, somevarietes get higher points than the other and points distribution is also may vary." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variety has the same problem as other categorical features: there are some varietes where almost no samples, but they affect the points heavily:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_rated_varietes = df[['variety', 'points']].groupby('variety').mean().reset_index().sort_values('points', ascending=False).variety[:10]\n", "\n", "top_rated_varietes_data = df[df.variety.isin(top_rated_varietes)]\n", "\n", "plt.figure(figsize=(15, 5))\n", "ax = sns.violinplot(x='variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');\n", "ax.set_xticklabels(ax.get_xticklabels(), rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cleaning and transforming" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vc = df.variety.value_counts()\n", "df['trans_variety'] = df.variety.replace(vc[vc < 2].index, 'Other')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top_rated_varietes = df[['trans_variety', 'points']].groupby('trans_variety').mean().reset_index().sort_values('points', ascending=False).trans_variety[:10]\n", "\n", "top_rated_varietes_data = df[df.trans_variety.isin(top_rated_varietes)]\n", "\n", "plt.figure(figsize=(15, 5))\n", "ax = sns.violinplot(x='trans_variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');\n", "ax.set_xticklabels(ax.get_xticklabels(), rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Title" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.title.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The title itself seems to be not containing valuable information except that we already used it for filling the nulls in **Region 1** and we can extract a **Year** (vintage) from it, we will do it later." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That a typical textual varible, which we can try to analyze with word clouds.\n", "\n", "Let's see what experts tell about wines that has low rating:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopwords = set(STOPWORDS)\n", "stopwords.update(['wine', 'a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'first', 'for', 'four', 'from', 'full', 'fully', 'further', 'furthered', 'furthering', 'furthers', 'g', 'gave', 'general', 'generally', 'get', 'gets', 'give', 'given', 'gives', 'go', 'going', 'good', 'goods', 'got', 'great', 'greater', 'greatest', 'group', 'grouped', 'grouping', 'groups', 'h', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'herself', 'high', 'high', 'high', 'higher', 'highest', 'him', 'himself', 'his', 'how', 'however', 'i', 'if', 'important', 'in', 'interest', 'interested', 'interesting', 'interests', 'into', 'is', 'it', 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kind', 'knew', 'know', 'known', 'knows', 'l', 'large', 'largely', 'last', 'later', 'latest', 'least', 'less', 'let', 'lets', 'like', 'likely', 'long', 'longer', 'longest', 'm', 'made', 'make', 'making', 'man', 'many', 'may', 'me', 'member', 'members', 'men', 'might', 'more', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', 'my', 'myself', 'n', 'necessary', 'need', 'needed', 'needing', 'needs', 'never', 'new', 'new', 'newer', 'newest', 'next', 'no', 'nobody', 'non', 'noone', 'not', 'nothing', 'now', 'nowhere', 'number', 'numbers', 'o', 'of', 'off', 'often', 'old', 'older', 'oldest', 'on', 'once', 'one', 'only', 'open', 'opened', 'opening', 'opens', 'or', 'order', 'ordered', 'ordering', 'orders', 'other', 'others', 'our', 'out', 'over', 'p', 'part', 'parted', 'parting', 'parts', 'per', 'perhaps', 'place', 'places', 'point', 'pointed', 'pointing', 'points', 'possible', 'present', 'presented', 'presenting', 'presents', 'problem', 'problems', 'put', 'puts', 'q', 'quite', 'r', 'rather', 'really', 'right', 'right', 'room', 'rooms', 's', 'said', 'same', 'saw', 'say', 'says', 'second', 'seconds', 'see', 'seem', 'seemed', 'seeming', 'seems', 'sees', 'several', 'shall', 'she', 'should', 'show', 'showed', 'showing', 'shows', 'side', 'sides', 'since', 'small', 'smaller', 'smallest', 'so', 'some', 'somebody', 'someone', 'something', 'somewhere', 'state', 'states', 'still', 'still', 'such', 'sure', 't', 'take', 'taken', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'therefore', 'these', 
'they', 'thing', 'things', 'think', 'thinks', 'this', 'those', 'though', 'thought', 'thoughts', 'three', 'through', 'thus', 'to', 'today', 'together', 'too', 'took', 'toward', 'turn', 'turned', 'turning', 'turns', 'two', 'u', 'under', 'until', 'up', 'upon', 'us', 'use', 'used', 'uses', 'v', 'very', 'w', 'want', 'wanted', 'wanting', 'wants', 'was', 'way', 'ways', 'we', 'well', 'wells', 'went', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole', 'whose', 'why', 'will', 'with', 'within', 'without', 'work', 'worked', 'working', 'works', 'would', 'x', 'y', 'year', 'years', 'yet', 'you', 'young', 'younger', 'youngest', 'your', 'yours', 'z'])\n", "\n", "wordcloud = WordCloud(background_color='white', stopwords=stopwords,\n", " max_words=500, max_font_size=200, width=2000, height=800,\n", " random_state=RANDOM_SEED).generate(' '.join(df[df.points < 83].description.str.lower()))\n", "\n", "plt.figure(figsize=(15, 7))\n", "plt.imshow(wordcloud)\n", "plt.title(\"Low Rated Wines Description Word Cloud\", fontsize=20)\n", "plt.axis('off');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bitter```, ```sour```, ```simple```, ```sharp```, ```tart``` - there must be definitely something wrong with these wines!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcloud = WordCloud(background_color='white', stopwords=stopwords,\n", " max_words=500, max_font_size=200, width=2000, height=800,\n", " random_state=RANDOM_SEED).generate(' '.join(df[df.points > 97].description.str.lower()))\n", "\n", "plt.figure(figsize=(15, 7))\n", "plt.imshow(wordcloud)\n", "plt.title(\"High Rated Wines Description Word Cloud\", fontsize=20)\n", "plt.axis('off');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oh yeah, much better: ```structured```, ```complex```, ```classic```, ```rich```, ```ripe```, ```powerful```, ```intense``` and other good words which you would expect for the pricey and high rated wines :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Taster and their Twitter Handle\n", "We don't need these fields per our goals, since we will not have them to perform predictions for the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Winery and Designation\n", "Let's skip these features for now to see if we can use them later." 
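] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see why they are tricky to use directly, a quick look at how fragmented these two fields are:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Unique wineries:', df.winery.nunique())\n", "print('Unique designations:', df.designation.nunique())\n", "print('Median reviews per winery:', df.groupby('winery').size().median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If most categories have only a handful of reviews, one-hot encoding them directly would add many noisy columns." 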
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Target (Points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how our target is distributed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(15, 5))\n", "sns.distplot(df.points, kde=False);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, looks like we have binomial distribution here and while it may look like normally-distributed, the tests don't confirm it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Shapiro-Wilk test:', stats.shapiro(df.points))\n", "print('Kolmogorov-Smirnov test:', stats.kstest(df.points, cdf='norm'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Skeweness is pretty low however:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Skeweness:', df.points.skew())\n", "print('Kurtosis:', df.points.kurt())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the QQ-plot, in addition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (7, 7)\n", "qqplot(df.points, line='r');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem here is that our **Points** has discrete values instead of continuous. Which might tell us that we need to treat this problem as a classification or _Ordered Regression_.
\n", "But still, simple regression should also work well in our case, even though the data is discrete." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Metrics Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two most popular metrics which we can choose from: MAE (mean absolute error) and MSE (mean squared error)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MSE would be the better choice for this problem because:\n", "* our train target does not contain outliers and its variance is relatively low. So we want our model to penalize large errors in predictions, which is an immanent feature of MSE\n", "* MSE is smoothly differentiable which makes it easier for calculations\n", "\n", "In scikit-learn MSE is represented in negative form and has the following name, let's save it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SCORING = 'neg_mean_squared_error'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the following meaningful properties of our task:\n", "* it's a regression problem\n", "* we have relatively much data, >100k samples\n", "* since we have some categorical feature candidates withe a lot of unique values + textual data, we can expect that we will have a lot of features, 10k+, and much of them will be important for our predictions\n", "\n", "We can use both SGD and Ridge giving these properties.
\n", "While SGD will be much faster than Ridge in this task, it will not give us the same level of accuracy and must be tuned a lot more than Ridge, which essentially has one hyperparameter.\n", "\n", "So let's use Ridge and see how it will perform:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL = Ridge(random_state=RANDOM_SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-Validation Selection\n", "\n", "Since our data does not have any heavy specifics, so we can choose simple KFold cross validation for 10 folds, with shuffle:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CV = KFold(n_splits=10, shuffle=True, random_state=RANDOM_SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_full = df.copy(deep=True)\n", "df_full = df_full.drop(['country', 'price', 'taster_name', 'taster_twitter_handle', 'variety', 'province', 'region_1', 'region_2'], axis=1)\n", "df_full.columns = [x.replace('trans_', '') for x in df_full.columns]\n", "df_full.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with nulls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fill missing countries with \"Other\"\n", "df_full.country = df_full.country.fillna('Other')\n", "\n", "# Fill missing locations with \"Other\"\n", "df_full.location = df_full.location.fillna('Other')\n", "\n", "# Remove samples with missing prices since there are not so much of them and it's and important feature\n", "df_full = df_full[~df_full.price.isna()]\n", "\n", "df_full.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train-test split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)\n", "df_train.shape, df_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we will be processing the train and test sets separately, to not introduce \"looking into the future\" problem." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical features encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "categorical_features = ['country', 'variety', 'location']\n", "\n", "for feature in categorical_features:\n", " categorical = pd.Categorical(df_train[feature].unique())\n", " \n", " df_train[feature] = df_train[feature].astype(categorical)\n", " df_test[feature] = df_test[feature].astype(categorical)\n", " \n", "X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)\n", "X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)\n", "\n", "X_train_cat.shape, X_test_cat.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text vectorization with TF-IDF" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)\n", "X_train_desc = tv.fit_transform(df_train.description)\n", "X_test_desc = tv.transform(df_test.description)\n", "X_train_desc.shape, X_test_desc.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scaling numerical features\n", "Our model is sensitive to non-centered numeric features, so we need to scale them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss = StandardScaler()\n", "df_train.price = ss.fit_transform(df_train[['price']])\n", "df_test.price = ss.transform(df_test[['price']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting features together" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = csr_matrix(hstack([\n", " df_train[['price']],\n", " X_train_cat,\n", " X_train_desc,\n", "]))\n", "X_test = csr_matrix(hstack([\n", " df_test[['price']],\n", " X_test_cat,\n", " X_test_desc,\n", "]))\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting preprocessing steps together" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def prepare_data(df_full, categorical_features):\n", " df_train, df_test, y_train, y_test = train_test_split(df_full.drop(['points'], axis=1), df_full.points, test_size=0.25, random_state=RANDOM_SEED)\n", " df_train.shape, df_test.shape, y_train.shape, y_test.shape\n", "\n", " print('processing categorical features')\n", " for feature in categorical_features:\n", " categorical = pd.Categorical(df_train[feature].unique())\n", "\n", " df_train[feature] = df_train[feature].astype(categorical)\n", " df_test[feature] = df_test[feature].astype(categorical)\n", "\n", " print('preparing dummies')\n", " X_train_cat = pd.get_dummies(df_train[categorical_features], sparse=True)\n", " X_test_cat = pd.get_dummies(df_test[categorical_features], sparse=True)\n", " \n", " print('extracting word vectors')\n", " tv = TfidfVectorizer(stop_words=stopwords, max_features=10000)\n", " X_train_desc = tv.fit_transform(df_train.description)\n", " X_test_desc = tv.transform(df_test.description)\n", " X_train_desc.shape, X_test_desc.shape\n", " \n", " print('scaling')\n", " ss = StandardScaler()\n", " df_train.price = ss.fit_transform(df_train[['price']])\n", " df_test.price = ss.transform(df_test[['price']])\n", " df_train.describe()\n", "\n", " print('combining features')\n", " X_train = csr_matrix(hstack([\n", " df_train[['price']],\n", " X_train_cat,\n", " X_train_desc,\n", " ]))\n", " 
X_test = csr_matrix(hstack([\n", "        df_test[['price']],\n", "        X_test_cat,\n", "        X_test_desc,\n", "    ]))\n", "\n", "    return X_train, X_test, y_train, y_test\n", "\n", "X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location'])\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fit our model for the first time and see how it performs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_and_cv(model, X_train, y_train, X_test, y_test):\n", "    cvs = cross_val_score(model, X_train, y_train, cv=CV, scoring=SCORING, n_jobs=-1)\n", "    print('MSE and STD on CV:\\t', -cvs.mean(), cvs.std())\n", "\n", "    model.fit(X_train, y_train)\n", "    # use the model passed in (the original code mistakenly referenced the global MODEL here)\n", "    print('MSE on holdout:\\t\\t', mean_squared_error(y_test, model.predict(X_test)))\n", "\n", "    return model\n", "\n", "train_and_cv(MODEL, X_train, y_train, X_test, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the results are pretty good: we already have a relatively low error. But we can definitely improve it even further." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see the learning curve:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_learning_curve(model, X_train, y_train):\n", "    plt.figure(figsize=(10, 5))\n", "    viz = LearningCurve(model, cv=CV, train_sizes=np.linspace(.1, 1.0, 10), scoring=SCORING, n_jobs=-1)\n", "    viz.fit(X_train, y_train)\n", "    viz.poof()\n", "\n", "plot_learning_curve(MODEL, X_train, y_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see a typical picture: the training score decreases with the number of samples given to the model, while the cross-validation score increases.
\n", "But still, there is gap between them, so our model will be improved with larger number of provided samples.
\n", "Also it is worth noticing that the variance of the cross-val scores is pretty low, which tells us that our model gives pretty stable predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter Tuning\n", "One of the key characteristics of our model is its simplicity, so we have only on parameter to adjust: [alpha](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), which is regularization strength for ridge regularization.
\n", "Let's try to see how what is the best value for it, visually:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 5))\n", "viz = ValidationCurve(MODEL, param_name='alpha', cv=CV, param_range=np.logspace(-1, 1, 10), logx=True, scoring=SCORING, n_jobs=-1)\n", "viz.fit(X_train, y_train)\n", "viz.poof()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we see, we have a best value for our model located at the maximum of the cross-val curve.
\n", "The variance is still low, as on learning curve, which is a good indicator.
\n", "\n", "Now let's calculate the best ```alpha``` using simple grid search:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = { 'alpha': np.logspace(-1, 1, 10) }\n", "\n", "gs = GridSearchCV(MODEL, param_grid=params, verbose=10, n_jobs=-1, cv=CV, scoring=SCORING)\n", "gs.fit(X_train, y_train)\n", "\n", "print('Best alpha:', gs.best_params_['alpha'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL = Ridge(alpha=gs.best_params_['alpha'], random_state=RANDOM_SEED)\n", "train_and_cv(MODEL, X_train, y_train, X_test, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Winery + Designation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how adding the combination of the **Winery** and **Designation** (designation is a part of a winery, so they are expected to processed together) affect our model.
\n", "Specific winery and its designation may define a specific winery \"factory\" where the wine is produced and may affect the wine quality." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_full['winery_designation'] = df_full.winery + ' / ' + df_full.designation\n", "vc = df_full.winery_designation.value_counts()\n", "df_full.winery_designation = df_full.winery_designation.replace(vc[vc < 2].index, 'Other')\n", "df_full.winery_designation = df_full.winery_designation.fillna('Other')\n", "\n", "print('Number of unique winery + designation:', len(df_full.winery_designation.unique()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation'])\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_and_cv(MODEL, X_train, y_train, X_test, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our assumption was correct and performance of the model is improved, so let's keep this feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Year (Vintage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Turns out, we have a year inside the title:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_full.title.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wine year is the when the grape was collected and this is a categorical feature in our case.
\n", "Year can tell about the weather and other conditions related to the specific harvest and often affect the quality of the wine." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def extract_year(title):\n", " matches = re.findall(r'\\d{4}', title)\n", " return next(filter(lambda x: 1000 < x <= 2018, map(int, matches)), 0)\n", "\n", "df_full['year'] = df.title.apply(extract_year)\n", "df_full.year = df_full.year.fillna(0)\n", "\n", "print('Number of unique years:', len(df_full.year.unique()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_and_cv(MODEL, X_train, y_train, X_test, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again, the performance of the model is improved, we will keep this feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Winery + Year\n", "We can pretend that the winery and year together may define a quality of the wine - for example, if for some winery the weather was good and it had a good financial status in a specific year, we can expect better grape quality from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_full['winery_year'] = df_full.winery + ' / ' + df_full.year.astype(str)\n", "vc = df_full.winery_year.value_counts()\n", "df_full.winery_year = df_full.winery_year.replace(vc[vc < 2].index, 'Other')\n", "df_full.winery_year = df_full.winery_year.fillna('Other')\n", "\n", "print('Number of unique winery + year:', len(df_full.winery_year.unique()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year', 'winery_year'])\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_and_cv(MODEL, X_train, y_train, X_test, y_test);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've got minor improvement in holdout score, but worse CV score.
\n", "Since this feature also inroduce a lot of dummy values, let's not add it to the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrain the Best Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = prepare_data(df_full, ['country', 'variety', 'location', 'winery_designation', 'year'])\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_and_cv(MODEL, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our model is ready to be used or improved further.
\n", "We can train it on the whole dataset (train+test) to get better results in real use, as our learning curve suggested." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've reached our goal on building a model which may predict the wine rating based on the wine features and textual description.
\n", "Models of such type can be used in the wine industry to predict wine ratings and augment predictions of an experts to find and resolve their biases.\n", "\n", "Also it is possible to deploy this model in a form of a web or mobile application to allow wine buyers to get predictions for the random wines in the local stores (they can use wine description on the bottle in this case).\n", "\n", "However, the following things can be tried to improve the model:\n", "\n", "* we can approach a problem from the classification perspective and build a classifier instead of regressor, or implement an ordered regression;\n", "* the words in the Description can be stemmed, we can use word2vec, GloVe, LDA to improve features extracted from this field;\n", "* if we need to retrain the model often, better to switch to SGDRegressor, which will be way faster for this task;\n", "* we can apply feature selection to remove noisy and unnecessary features which may improve the accuracy and speed of the model;\n", "* it is possible to get the data apart from the dataset provided, to add some other features - for example, the wheater conditions during specific year in a specific location may greatly affect the wine quality.\n", "\n", "Choose the best wines and **drink responsibly**! :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }