{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datascience import *\n", "import numpy as np\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lecture 9 ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical Distribution ##\n", "\n", "Let's examine visualizations of numerical data by looking at how old the top-grossing movies are." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top = Table.read_table('top_movies_2017.csv')\n", "top" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add each movie's age (as of 2022) to the top table\n", "ages = 2022 - top.column('Year')\n", "top = top.with_column('Age', ages)\n", "top" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binning ##\n", "\n", "We can bin numerical data by creating a set of bin endpoints and then counting how many data points fall within each bin.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[min(ages), max(ages)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the bin endpoints: 0, 20, ..., 120\n", "my_bins = np.arange(0, 121, 20)\n", "my_bins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Bin the ages of movies into intervals of the form [lower, upper). The last row only marks the right end of the last bin, so its count is always 0. 
\n", "top.bin('Age', bins = my_bins)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It is possible to bin with intervals of different sizes\n", "uneven_bins = make_array(0, 5, 10, 15, 25, 40, 65, 101)\n", "uneven_bins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Bin the ages of movies into intervals of the form [lower, upper). The last row only marks the right end of the last bin, so its count is always 0.\n", "top.bin('Age', bins = uneven_bins)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Total count across all of the uneven bins\n", "sum(top.bin('Age', bins = uneven_bins).column(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Histograms ##\n", "\n", "Histograms are a useful way to visualize numerical data. To create a histogram, we bin the data, treat the bins as categories, and create a bar plot of the resulting data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Histogram with even bin sizes\n", "top.hist('Age', bins = np.arange(0, 110, 10), unit = 'Years')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also specify the number of evenly sized bins we want.\n", "top.hist('Age', bins = 20, unit = 'Years')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can create histograms with uneven bin sizes.\n", "# The *area* of each bar is proportional to the number of items in that bin's range. 
\n", "top.hist('Age', bins = uneven_bins, unit = 'Years')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing functions ##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def double(x):\n", " return x * 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(15/3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_number = 12" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(my_number)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(my_number / 8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(make_array(3, 4, 5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double('data')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#\"local scope\"\n", "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = 17" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "double(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion Question" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#What does this function do?\n", "def percents(values):\n", " return np.round(100 * 
values / sum(values), 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "percents(make_array(1, 2, 3, 4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "percents(make_array(1, 4, 30))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Can have multiple inputs\n", "def percents(values, places):\n", " return np.round(values / sum(values) * 100, places)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "percents(make_array(1, 4, 30), 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply ##" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "ages = Table().with_columns(\n", " 'Person', make_array('A', 'B', 'C', 'D'),\n", " 'Age', make_array(63, 110, 99, 102)\n", ")\n", "ages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cut_off_at_100(z):\n", " return min(z, 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cut_off_at_100(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cut_off_at_100(107)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cut_age_array = ages.apply(cut_off_at_100, 'Age')\n", "cut_age_array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ages.with_column('Cut off ages', cut_age_array)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(cut_off_at_100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction ##" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "galton = Table.read_table('galton.csv')" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Each row corresponds to one adult child\n", "#family = family indicator\n", "#father height (inches) \n", "#mother height (inches) \n", "#\"midparent height\"= weighted average of parents' heights\n", "#children= # of children in the family\n", "#childNum = child's birth rank (1 = oldest)\n", "#gender\n", "#height (inches)\n", "galton" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights = galton.select(3, 7).relabeled(0, 'MidParent').relabeled(1, 'Child')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Side note: overlapping histogram \n", "heights.hist(bins=my_bins, unit='inches')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights.scatter('MidParent', 'Child')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights.scatter('MidParent', 'Child')\n", "plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)\n", "plots.plot([68.5, 68.5], [50, 85], color='red', lw=2);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nearby = heights.where('MidParent', are.between(67.5, 68.5))\n", "nearby.column('Child').mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights.scatter('MidParent', 'Child')\n", "plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)\n", "plots.plot([68.5, 68.5], [50, 85], color='red', lw=2)\n", "plots.scatter(68, 66.24, color='gold', s=75);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_child(h):\n", " nearby = heights.where('MidParent', are.between(h-0.5, h+0.5))\n", " return 
nearby.column('Child').mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict_child(68)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict_child(65)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = heights.apply(predict_child, 'MidParent')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights = heights.with_column('Child Prediction', predictions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "heights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "heights.scatter('MidParent')" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }