{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning More About Pandas by Building and Using a Weighted Average Function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is based on the article on [Pbpython.com](http://pbpython.com/weighted-average.html). Please reference it for the background and additional details" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in our sample sales data that includes projected price for our new product launch" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AccountNameStateRepManagerCurrent_PriceQuantityNew_Product_Price
0714466Trantow-BarrowsMNCraig BookerDebra Henley500100550
1737550Fritsch, Russel and AndersonMNCraig BookerDebra Henley60090725
2146832Kiehn-SpinkaTXDaniel HiltonDebra Henley225475255
3218895Kulas IncTXDaniel HiltonDebra Henley290375300
4412290Jerde-HilpertWIJohn SmithDebra Henley375400400
5740150Barton LLCWIJohn SmithDebra Henley550100600
6141962Herman LLCCACedric MossFred Anderson400200425
7163416Purdy-KundeCACedric MossFred Anderson450150475
8239344Stokes LLCWACedric MossFred Anderson55075610
9307599Kassulke, Ondricka and MetzNVWendy YuleFred Anderson275450300
10688981Keeling LLCNVWendy YuleFred Anderson300250350
11729833Koepp LtdNVWendy YuleFred Anderson350100375
\n", "
" ], "text/plain": [ " Account Name State Rep Manager \\\n", "0 714466 Trantow-Barrows MN Craig Booker Debra Henley \n", "1 737550 Fritsch, Russel and Anderson MN Craig Booker Debra Henley \n", "2 146832 Kiehn-Spinka TX Daniel Hilton Debra Henley \n", "3 218895 Kulas Inc TX Daniel Hilton Debra Henley \n", "4 412290 Jerde-Hilpert WI John Smith Debra Henley \n", "5 740150 Barton LLC WI John Smith Debra Henley \n", "6 141962 Herman LLC CA Cedric Moss Fred Anderson \n", "7 163416 Purdy-Kunde CA Cedric Moss Fred Anderson \n", "8 239344 Stokes LLC WA Cedric Moss Fred Anderson \n", "9 307599 Kassulke, Ondricka and Metz NV Wendy Yule Fred Anderson \n", "10 688981 Keeling LLC NV Wendy Yule Fred Anderson \n", "11 729833 Koepp Ltd NV Wendy Yule Fred Anderson \n", "\n", " Current_Price Quantity New_Product_Price \n", "0 500 100 550 \n", "1 600 90 725 \n", "2 225 475 255 \n", "3 290 375 300 \n", "4 375 400 400 \n", "5 550 100 600 \n", "6 400 200 425 \n", "7 450 150 475 \n", "8 550 75 610 \n", "9 275 450 300 \n", "10 300 250 350 \n", "11 350 100 375 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales = pd.read_excel(\"https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True\", sheetname=\"projections\")\n", "sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show the mean for our current and new product price" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "405.416666667\n", "447.083333333\n" ] } ], "source": [ "print(sales[\"Current_Price\"].mean())\n", "print(sales[\"New_Product_Price\"].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the weighted average using the long form" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "342.540687161\n", "374.638336347\n" ] } ], "source": [ "print((sales[\"Current_Price\"] * sales[\"Quantity\"]).sum() / sales[\"Quantity\"].sum())\n", "print((sales[\"New_Product_Price\"] * sales[\"Quantity\"]).sum() / sales[\"Quantity\"].sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use np.average to simplify the formula" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "342.540687161\n", "374.638336347\n" ] } ], "source": [ "print(np.average(sales[\"Current_Price\"], weights=sales[\"Quantity\"]))\n", "print(np.average(sales[\"New_Product_Price\"], weights=sales[\"Quantity\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For maximum flexibility, build our own weighted average function" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def wavg(group, avg_name, weight_name):\n", " \"\"\" http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns\n", " In rare instance, we may not have weights, so just return the mean\n", " \"\"\"\n", " d = group[avg_name]\n", " w = group[weight_name]\n", " try:\n", " return (d * w).sum() / w.sum()\n", " except ZeroDivisionError:\n", " return d.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call the weighted average on all of the data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "342.540687161\n", "374.638336347\n" ] } ], "source": [ "print(wavg(sales, \"Current_Price\", \"Quantity\"))\n", "print(wavg(sales, \"New_Product_Price\", \"Quantity\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Group the data by manager" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Manager\n", "Debra Henley 340.665584\n", "Fred Anderson 344.897959\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.groupby(\"Manager\").apply(wavg, \"Current_Price\", \"Quantity\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Manager\n", "Debra Henley 372.646104\n", "Fred Anderson 377.142857\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.groupby(\"Manager\").apply(wavg, \"New_Product_Price\", \"Quantity\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also group by state" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "State\n", "CA 446.428571\n", "MN 632.894737\n", "NV 325.000000\n", "TX 274.852941\n", "WA 610.000000\n", "WI 440.000000\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.groupby(\"State\").apply(wavg, \"New_Product_Price\", \"Quantity\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also group by multiple criteria and the function will work correctly." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Manager State\n", "Debra Henley MN 632.894737\n", " TX 274.852941\n", " WI 440.000000\n", "Fred Anderson CA 446.428571\n", " NV 325.000000\n", " WA 610.000000\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.groupby([\"Manager\", \"State\"]).apply(wavg, \"New_Product_Price\", \"Quantity\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of applying multiple aggregation functions" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QuantityNew_Product_PriceCurrent_Price
summeanmeanmedian
Manager
Debra Henley1540256.666667471.666667437.5
Fred Anderson1225204.166667422.500000375.0
\n", "
" ], "text/plain": [ " Quantity New_Product_Price Current_Price\n", " sum mean mean median\n", "Manager \n", "Debra Henley 1540 256.666667 471.666667 437.5\n", "Fred Anderson 1225 204.166667 422.500000 375.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = {'New_Product_Price': ['mean'],'Current_Price': ['median'], 'Quantity': ['sum', 'mean']}\n", "sales.groupby(\"Manager\").agg(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar method to group multiple custom functions together into a single DataFrame" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_1 = sales.groupby(\"Manager\").apply(wavg, \"New_Product_Price\", \"Quantity\")\n", "data_2 = sales.groupby(\"Manager\").apply(wavg, \"Current_Price\", \"Quantity\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
New Product PriceCurrent Product Price
Manager
Debra Henley372.646104340.665584
Fred Anderson377.142857344.897959
\n", "
" ], "text/plain": [ " New Product Price Current Product Price\n", "Manager \n", "Debra Henley 372.646104 340.665584\n", "Fred Anderson 377.142857 344.897959" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary = pd.DataFrame(data=dict(s1=data_1, s2=data_2))\n", "summary.columns = [\"New Product Price\",\"Current Product Price\"]\n", "summary.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, numpy has an average function that can be used:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "342.54068716094031" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.average(sales[\"Current_Price\"], weights=sales[\"Quantity\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a lambda function for it to work with grouped data" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Manager\n", "Debra Henley 372.646104\n", "Fred Anderson 377.142857\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.groupby(\"Manager\").apply(lambda x: np.average(x['New_Product_Price'], weights=x['Quantity']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.4" } }, "nbformat": 4, "nbformat_minor": 0 }