{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (15, 3)\n", "plt.rcParams['font.family'] = 'sans-serif'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saw earlier that pandas is really good at dealing with dates. It is also amazing with strings! We're going to go back to our weather data from Chapter 5." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Temp (C)</th><th>Dew Point Temp (C)</th><th>Rel Hum (%)</th><th>Wind Spd (km/h)</th><th>Visibility (km)</th><th>Stn Press (kPa)</th><th>Weather</th></tr>\n",
"    <tr><th>Date/Time</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2012-01-01 00:00:00</th><td>-1.8</td><td>-3.9</td><td>86</td><td>4</td><td>8.0</td><td>101.24</td><td>Fog</td></tr>\n",
"    <tr><th>2012-01-01 01:00:00</th><td>-1.8</td><td>-3.7</td><td>87</td><td>4</td><td>8.0</td><td>101.24</td><td>Fog</td></tr>\n",
"    <tr><th>2012-01-01 02:00:00</th><td>-1.8</td><td>-3.4</td><td>89</td><td>7</td><td>4.0</td><td>101.26</td><td>Freezing Drizzle,Fog</td></tr>\n",
"    <tr><th>2012-01-01 03:00:00</th><td>-1.5</td><td>-3.2</td><td>88</td><td>6</td><td>4.0</td><td>101.27</td><td>Freezing Drizzle,Fog</td></tr>\n",
"    <tr><th>2012-01-01 04:00:00</th><td>-1.5</td><td>-3.3</td><td>88</td><td>7</td><td>4.8</td><td>101.23</td><td>Fog</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " Temp (C) Dew Point Temp (C) Rel Hum (%) \\\n", "Date/Time \n", "2012-01-01 00:00:00 -1.8 -3.9 86 \n", "2012-01-01 01:00:00 -1.8 -3.7 87 \n", "2012-01-01 02:00:00 -1.8 -3.4 89 \n", "2012-01-01 03:00:00 -1.5 -3.2 88 \n", "2012-01-01 04:00:00 -1.5 -3.3 88 \n", "\n", " Wind Spd (km/h) Visibility (km) Stn Press (kPa) \\\n", "Date/Time \n", "2012-01-01 00:00:00 4 8.0 101.24 \n", "2012-01-01 01:00:00 4 8.0 101.24 \n", "2012-01-01 02:00:00 7 4.0 101.26 \n", "2012-01-01 03:00:00 6 4.0 101.27 \n", "2012-01-01 04:00:00 7 4.8 101.23 \n", "\n", " Weather \n", "Date/Time \n", "2012-01-01 00:00:00 Fog \n", "2012-01-01 01:00:00 Fog \n", "2012-01-01 02:00:00 Freezing Drizzle,Fog \n", "2012-01-01 03:00:00 Freezing Drizzle,Fog \n", "2012-01-01 04:00:00 Fog " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather_2012 = pd.read_csv('../data/weather_2012.csv', parse_dates=True, index_col='Date/Time')\n", "weather_2012[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.1 String operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll see that the 'Weather' column has a text description of the weather that was going on each hour. We'll assume it's snowing if the text description contains \"Snow\".\n", "\n", "pandas provides vectorized string functions, to make it easy to operate on columns containing text. There are some great [examples](http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods) in the documentation." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "weather_description = weather_2012['Weather']\n", "is_snowing = weather_description.str.contains('Snow')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a binary vector, which is a bit hard to look at, so we'll plot it." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date/Time\n", "2012-01-01 00:00:00 False\n", "2012-01-01 01:00:00 False\n", "2012-01-01 02:00:00 False\n", "2012-01-01 03:00:00 False\n", "2012-01-01 04:00:00 False\n", "Name: Weather, dtype: bool" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Not super useful\n", "is_snowing[:5]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# More useful!\n", "is_snowing=is_snowing.astype(float)\n", "is_snowing.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.2 Use resampling to find the snowiest month" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted the median temperature each month, we could use the `resample()` method like this:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "weather_2012['Temp (C)'].resample('M').apply(np.median).plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unsurprisingly, July and August are the warmest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can think of snowiness as being a bunch of 1s and 0s instead of `True`s and `False`s:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date/Time\n", "2012-01-01 00:00:00 0.0\n", "2012-01-01 01:00:00 0.0\n", "2012-01-01 02:00:00 0.0\n", "2012-01-01 03:00:00 0.0\n", "2012-01-01 04:00:00 0.0\n", "2012-01-01 05:00:00 0.0\n", "2012-01-01 06:00:00 0.0\n", "2012-01-01 07:00:00 0.0\n", "2012-01-01 08:00:00 0.0\n", "2012-01-01 09:00:00 0.0\n", "Name: Weather, dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_snowing.astype(float)[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and then use `resample` to find the percentage of time it was snowing each month" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date/Time\n", "2012-01-31 0.240591\n", "2012-02-29 0.162356\n", "2012-03-31 0.087366\n", "2012-04-30 0.015278\n", "2012-05-31 0.000000\n", "2012-06-30 0.000000\n", "2012-07-31 0.000000\n", "2012-08-31 0.000000\n", "2012-09-30 0.000000\n", "2012-10-31 0.000000\n", "2012-11-30 0.038889\n", "2012-12-31 0.251344\n", "Freq: M, Name: Weather, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_snowing.astype(float).resample('M').apply(np.mean)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": 
[ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "is_snowing.astype(float).resample('M').apply(np.mean).plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we know! In 2012, December was the snowiest month. Also, this graph suggests something that I feel -- it starts snowing pretty abruptly in November, and then tapers off slowly and takes a long time to stop, with the last snow usually being in April or May." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.3 Plotting temperature and snowiness stats together" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also combine these two statistics (temperature, and snowiness) into one dataframe and plot them together:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "temperature = weather_2012['Temp (C)'].resample('M').apply(np.median)\n", "is_snowing = weather_2012['Weather'].str.contains('Snow')\n", "snowiness = is_snowing.astype(float).resample('M').apply(np.mean)\n", "\n", "# Name the columns\n", "temperature.name = \"Temperature\"\n", "snowiness.name = \"Snowiness\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use `concat` again to combine the two statistics into a single dataframe." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Temperature</th><th>Snowiness</th></tr>\n",
"    <tr><th>Date/Time</th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2012-01-31</th><td>-7.05</td><td>0.240591</td></tr>\n",
"    <tr><th>2012-02-29</th><td>-4.10</td><td>0.162356</td></tr>\n",
"    <tr><th>2012-03-31</th><td>2.60</td><td>0.087366</td></tr>\n",
"    <tr><th>2012-04-30</th><td>6.30</td><td>0.015278</td></tr>\n",
"    <tr><th>2012-05-31</th><td>16.05</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-06-30</th><td>19.60</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-07-31</th><td>22.90</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-08-31</th><td>22.20</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-09-30</th><td>16.10</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-10-31</th><td>11.30</td><td>0.000000</td></tr>\n",
"    <tr><th>2012-11-30</th><td>1.05</td><td>0.038889</td></tr>\n",
"    <tr><th>2012-12-31</th><td>-2.85</td><td>0.251344</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " Temperature Snowiness\n", "Date/Time \n", "2012-01-31 -7.05 0.240591\n", "2012-02-29 -4.10 0.162356\n", "2012-03-31 2.60 0.087366\n", "2012-04-30 6.30 0.015278\n", "2012-05-31 16.05 0.000000\n", "2012-06-30 19.60 0.000000\n", "2012-07-31 22.90 0.000000\n", "2012-08-31 22.20 0.000000\n", "2012-09-30 16.10 0.000000\n", "2012-10-31 11.30 0.000000\n", "2012-11-30 1.05 0.038889\n", "2012-12-31 -2.85 0.251344" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats = pd.concat([temperature, snowiness], axis=1)\n", "stats" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "stats.plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uh, that didn't work so well because the scale was wrong. We can do better by plotting them on two separate graphs:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([,\n", " ],\n", " dtype=object)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "stats.plot(kind='bar', subplots=True, figsize=(15, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "