{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "plt.style.use('ggplot')\n",
    "plt.rcParams['figure.figsize'] = (15, 10)\n",
    "df = pd.read_csv('data/Consumo_cerveja.csv', \n",
    "                 decimal=',', \n",
    "                 thousands='.', \n",
    "                 header=0, \n",
    "                 names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], \n",
    "                 parse_dates=['date'], \n",
    "                 nrows=365)\n",
    "holidays = pd.read_html('https://en.wikipedia.org/wiki/Public_holidays_in_Brazil', header=0)[0]\n",
    "df = df.set_index('date')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Let's start asking questions\n",
    "\n",
    "Now we've gone through a number of techniques for working with data in Pandas - let's try applying some of these to answer some questions about our data - remember, we are interested in learning about factors that affect beer consumption in Sao Paulo!\n",
    "\n",
    "So what kind of questions can we ask of our data?\n",
    "\n",
    "## Does temperature matter in beer consumption?\n",
    "We would hypothesize that when it's warmer outside, people go out for a drink more often. However, as we know from here in Denmark, Danes often grab a beer inside a nice, warm bar when it's cold and dark!\n",
    "\n",
    "## Do people drink more beer in the weekend?\n",
    "From collective personal experience, we would assume that consumption is higher during the weekend.\n",
    "\n",
    "## Is there seasonality in beer drinking?\n",
    "Are people drinking more beer in autumn vs winter?\n",
    "\n",
    "## Does day of the week matter?\n",
    "Does the weekend start on Friday? Or maybe people need a beer on Monday to help them start the week?\n",
    "\n",
    "## Does precipitation affect beer drinking?\n",
    "Does more rain lead to less beer drinking? Or do people stay inside the bars longer?\n",
    "\n",
    "## Are weekends different when it comes to weather?\n",
    "It feels like the weather is worse when we have time off, no?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Correlations\n",
    "We can always start with a simple correlation matrix - this is easy to get in Pandas!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's a bit hard to parse just by looking - we could use seaborn to draw a heatmap, but we are sticking as much as we can to Pandas. Let's use the new `.style` api!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note! Calculated row-or-columnwise\n",
    "df.corr().style.background_gradient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's pick out the consumption correlations to better see what we are interested in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.corr()[['consumption']].style.bar(align='mid', color=['red', 'green'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see some evidence to back up our hypotheses - temperature and weekend appears correllated with consumption, while precipiation is slightly negatively correlated.\n",
    "\n",
    "![Fun Fact](images/fun_fact.resized.jpeg)Before we dive in a bit more, just want to mention one technique that's very useful - I've been using it throughout, but now we'll give it a name:\n",
    "\n",
    "# Method Chaining\n",
    "\n",
    "One neat feature of the pandas api is that many operations return a modified DataFrame/Series. This means that we can continue to apply operations in a chain, making it simple and readable to transform data. My favorite example of the difference this makes is stolen from [Tom Augspurg](https://tomaugspurger.github.io/method-chaining) (who stole it from [Jeff Allen](https://trestletech.com/wp-content/uploads/2015/07/dplyr.pdf)). Imagine we are telling the story of Jack and Jill - traditionally it would look like this:\n",
    "\n",
    "```python\n",
    "tumble_after(\n",
    "    broke(\n",
    "        fell_down(\n",
    "            fetch(went_up(jack_jill, \"hill\"), \"water\"),\n",
    "            jack),\n",
    "        \"crown\"),\n",
    "    \"jill\"\n",
    ")\n",
    "```\n",
    "\n",
    "What's the story happening here?\n",
    "You might be a clever software engineer and rewrite it to something like this:\n",
    "```python\n",
    "on_hill = went_up(jack_jill, 'hill')\n",
    "with_water = fetch(on_hill, 'water')\n",
    "fallen = fell_down(with_water, 'jack')\n",
    "broken = broke(fallen, 'jack')\n",
    "after = tumble_after(broken, 'jill')\n",
    "```\n",
    "\n",
    "That looks better, but now I had to come up with 5 sensible variable names that are essentially thrown out - and it's not even that clear what the story is!\n",
    "\n",
    "Pandas let's us do something like this:\n",
    "```python\n",
    "(jack_jill.went_up('hill')\n",
    "    .fetch('water')\n",
    "    .fell_down('jack')\n",
    "    .broke('crown')\n",
    "    .tumble_after('jill')\n",
    ")\n",
    "```\n",
    "\n",
    "That's the best of both worlds! I can clearly read the story, while at the same time, avoiding the need for \"throwaway\" variable names\n",
    "\n",
    "Now back to the data!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Plot](images/plot_all_the_things.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the EDA phase, we just want to plot everything, but we are not looking at production grade plots - we want quick and dirty with features - pandas has you covered!\n",
    "Let's start by plotting each column over time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(subplots=True, title='Daily trend');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a bit hard to parse - There are definitely some seasonal trends in the precipitation, but it's hard to get a clear picture of what's happening. Weekend is also not very useful in this context\n",
    "\n",
    "Let's use some of the techniques we looked at to capture a higher-level trend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric = df.drop('weekend',axis='columns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rolling = numeric.rolling(7)\n",
    "axes = rolling.mean().plot(subplots=True, title='Rolling 7 Days Trend');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation between temperature and consumption is now more obvious - we also see that the big increase in precipitation around March does not significantly impact consumption, but in mid September, consumption is very low. Let's try to smooth it out even more by looking at monthly data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "resampler = numeric.resample('M')\n",
    "resampler.mean().plot(subplots=True, title='Mean Monthly Trend');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is now a clear trend between temperature and consumption but we lost some knowledge. For example, the inverse relationship with preciptiation in mid-september is not as clear as in the rolling 7 graph. Always use different aggregation levels!\n",
    "\n",
    "Let's start trying to answer our questions about the data:\n",
    "\n",
    "## Does temperature matter in beer consumption?\n",
    "Looking at our trend over time graphs above, we would conclude that yes, the temperature clearly affects consumption (It would be difficult to make the argument that consumption affects temperature!)\n",
    "\n",
    "Let's examine the relationship a bit more using a scatterplot. Scatterplots are a great way to visualize numerical relationships - we lose the time dimension, but are able to see the correlation better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.scatter(x='median_temp', y='consumption', title='Median Temp vs Consumption');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot.scatter(x='min_temp', y='consumption', title='Min Temp vs Consumption');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These two graphs are a bit hard to compare as they have different x axes. Just for illustration purposes, let's plot them on the same graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = df.plot(kind='scatter', x='min_temp', y='consumption', marker='^', label='min_temp');\n",
    "df.plot(kind='scatter', x='max_temp', y='consumption', c='red', marker='+', ax=ax, label='max_temp');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Min_temp and max_temp are clearly different - Surprising! :-) It would be more clear if we could plot them each in their own graph, but keep the y axis scale. We could adjust them manually, but it's better to have that done for us! That does require some matplotlib code though - knowing some matplotlib is very helpful to get plots the way you want them!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(nrows=3, sharey=True)\n",
    "for label, ax in zip(['min_temp', 'max_temp', 'median_temp'], axes):\n",
    "    df.plot.scatter(x=label, y='consumption', ax=ax, label=label, title=f\"Consumption vs {label}\")\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks a lot like our line graphs before! Just looking at the graphs - max_temp seems a bit more clustered around an imaginary trend line while min_temp is fairly spaced out. Let's note that max_temp does seem be a more linear predictor than the other two.\n",
    "\n",
    "We can confirm this by looking at the mean consumption per 5 degrees"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(pd.cut(df.max_temp, np.arange(10, 45, 5))).consumption.mean().plot.bar(title='Mean Consumption per 5 degrees Max Temp');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I think we can conclusively say temperature matters!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Do people drink more beer in the weekend?\n",
    "This is a pretty easy question to answer - we simply take the mean of weekends vs mean of non-weekends!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('weekend').consumption.mean().plot(kind='bar');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That seems pretty conclusive! We can also examine the distribution of values using a boxplot "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='consumption', by='weekend')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We must be careful with averages! While there is a clear difference, there are a fair number of weekdays where consumption is at weekend levels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weekend_mean = df.query('weekend == 1').consumption.mean()\n",
    "\n",
    "num_above_mean_weekdays = len(df[(df.weekend == 0) & (df.consumption > weekend_mean)])\n",
    "print(f\"There are {num_above_mean_weekdays} weekdays above the weekend mean\")\n",
    "print(f\"That's {num_above_mean_weekdays / len(df.weekend == 0):.1%}!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Almost 8% of our weekdays would look like a weekend if we trusted our average! We can conclude that being the weekend is definitely related to consumption - but always examine the distributions!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Does seasonality affect consumption?\n",
    "Seasons can have an impact in many different ways - holidays, number of tourists and of course the weather! Unfortunately we only have one year of data, so it will be hard to estimate these effects in a robust manner.\n",
    "\n",
    "First let's use our mapper from previously to add in seasons. The easy mistake here is to forget that Brazil is in the Southern Hemisphere, so seasons are inverted!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "season = {\n",
    "    \"summer\": [12, 1, 2],\n",
    "    \"autumn\": [3, 4, 5],\n",
    "    \"winter\": [6, 7, 8],\n",
    "    \"spring\": [9, 10,11]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "season_map = {i: k\n",
    "              for k, v in season.items()\n",
    "              for i in v\n",
    "             }\n",
    "\n",
    "df['season'] = df.index.month.map(season_map)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('season').consumption.mean().sort_values().plot.bar(title='')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spring and summer does seem to have a higher beer consumption. Let's check the boxplot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column=['consumption'], by='season')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Spring is fairly spread out here while summer seems the best differentiator - Note the overlaps here!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get a similar picture by looking at month, but notice how season \"wraps around\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['month'] = df.index.month\n",
    "df.boxplot(column='consumption', by='month')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There does seem to be a significant jump going into August - could be worth noting!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['holidays'] = np.where(df.index.isin(holidays.Date), 1, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do holidays matter?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('holidays').consumption.mean().plot.bar()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['season', 'holidays']).consumption.mean().unstack().plot.bar()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's how easy it is to incorporate some extra data in!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Does day of the week matter?\n",
    "Does the party start on Friday? Are Thursday bars popular?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = df.groupby(df.index.weekday).consumption.mean()\n",
    "data.plot(kind='bar');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.sort_values().to_frame().style.format('{:,.0f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The weekend is clearly the most popular day, with no particular standouts during the week"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['weekday'] = df.index.weekday\n",
    "df.boxplot(column='consumption', by='weekday')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Maybe there is a difference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pivoted = df.pivot_table(values='consumption', index='weekday', columns='month')\n",
    "pivoted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a lot of numbers! Let's use our style trick again to see a bit better"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pivoted.style.background_gradient(cmap='Blues')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some days that stick out, such as Thursdays in September, or Tuesdays in January - so the combination can matter!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Does Precipitation matter?\n",
    "Let's have a look at precipitation to round out our analysis - our hypothesis would be that rainy weather and beer consumption do not match!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(kind='scatter', x='precip', y='consumption', title='Precipitation vs Consumption');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We don't see a nice linear trend, but this is also affected by many days with close to zero rain - lucky Brazilians!\n",
    "\n",
    "We can try to get a bit fancy - let's logtransform `precip` and see if that makes a difference. Here we are using method chaining to get a nice story out of our transforms\n",
    "\n",
    "![Fun Fact](images/fun_fact.resized.jpeg) If we wrap our method chain in `()` we can align each operation nicely on a new line!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(df.query('precip > 0.1')\n",
    " .assign(precip=lambda x: np.log1p(x.precip))\n",
    " .plot(kind='scatter', x='precip', y='consumption', title='Precipitation vs Consumption - Rainy Days'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There does seem to be a slight downward trend if we squint a bit, but definitely not as clear cut as with temperature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Are weekends different when it comes to weather?\n",
    "Is it true that the weather is always bad when we have time off? It can certainly feel that way!\n",
    "\n",
    "Let's see how true that is (at least for Brazil!) in a oneliner!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(df.groupby('weekend')\n",
    " [['median_temp', 'min_temp', 'max_temp', 'precip']]\n",
    " .mean()\n",
    " .pct_change()\n",
    " .loc[1, :]\n",
    " .plot\n",
    " .barh());"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building a model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After all that exploration - let's build a regression model to see how good it is!\n",
    "\n",
    "Let's start from a clean read and incorporate our various transformations we have looked at"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('data/Consumo_cerveja.csv', \n",
    "                 decimal=',', \n",
    "                 thousands='.', \n",
    "                 header=0, \n",
    "                 names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], \n",
    "                 parse_dates=['date'], \n",
    "                 nrows=365)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_lags(df, col, lags=5):\n",
    "    lags = pd.concat([df[col].shift(i).rename(f\"{col} t_{-i}\") for i in range(1, lags)], axis=1)\n",
    "    return df.join(lags, how='inner')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = (df.set_index('date')\n",
    "        .assign(mean_temp=lambda x: (x.max_temp + x.min_temp) / 2)\n",
    "        .assign(month=lambda x: x.index.month)\n",
    "        .assign(day=lambda x: x.index.day)\n",
    "        .assign(weekday=lambda x: x.index.dayofweek)\n",
    "        .assign(precip=lambda x: np.log1p(x.precip))\n",
    "        .assign(seasons=lambda x: x.index.month.map(season_map))\n",
    "        .assign(holidays=lambda x: np.where(x.index.isin(holidays.Date), 1, 0))\n",
    "        .pipe(add_lags, 'consumption', lags=10)\n",
    "        .pipe(add_lags, 'max_temp', lags=10)\n",
    "        .dropna()\n",
    "        .reset_index(drop=True)\n",
    "    )\n",
    "\n",
    "y = data['consumption']\n",
    "X = pd.get_dummies(data.drop('consumption', axis=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with a standard Linear Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regressor_X = sm.add_constant(X)\n",
    "model = sm.OLS(y, regressor_X, hasconst=True)\n",
    "result = model.fit()\n",
    "\n",
    "result.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And of course we have to have a Random Forest - this is PyData after all!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = RandomForestRegressor(n_estimators=1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf.fit(train_x, train_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf.score(test_x, test_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.Series(index=test_x.columns,  data=clf.feature_importances_).sort_values().plot.barh()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}