{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[Oregon Curriculum Network](http://4dsolutions.net/ocn/)\n", "\n", "[Home](School_of_Tomorrow.ipynb)\n", "\n", "# Data Visualization (Part One)\n", "\n", "Cleveland High School, Portland, Oregon\n", "\n", "## Introduction to Data Science\n", "\n", "In entering the realm of Data Science, we come upon a world concerned with predicting the future, anticipating what's next, based on extrapolation and sometimes interpolation. Many data science practices inherit from the insurance industry, which is about assessing and socializing (spreading the costs of) risk.\n", "\n", "We make predictions about the past as well. We're often keen to know of events that may have already taken place.\n", "\n", "### Andragogy / Pedagogy\n", "\n", "Statisticians talk a lot about sampling a population. The population is what we wish to characterize accurately, but we rarely have the means to survey all of it. The algorithms differ depending on whether we're describing the entire population or a sample drawn from it." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def pascal(r):\n", "    row = [1]\n", "    for _ in range(r):\n", "        # pad with a zero on each side, then add the two shifted copies\n", "        row = [a + b for a, b in zip(row + [0], [0] + row)]\n", "        yield row\n", "\n", "for r in pascal(20):  # run the generator dry; r ends up as the last row\n", "    pass" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.bar(range(len(r)), r);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "At the School of Tomorrow, we recommend immersion, as when learning a language, to pick up on correlation, regression, normal distribution, confidence intervals and so on. Absorb the semantics and make connections to some glue language like Python for specific workouts.\n", "\n", "The concept of a vector is especially important, given its embodiment as an almost literal tip-to-tail arrow pointing from the origin to anywhere in an n-D space. Such pointing, with corresponding labeling, is the bread-and-butter input of supervised machine learning algorithms. \n", "\n", "Up to 3-D we have the visualizable space of polyhedrons.\n", "\n", "### Historical Sidebar\n", "\n", "In coordinated Martian Math segments, on polyhedrons, the School of Tomorrow may introduce quadrays, as a questioning and investigational tool à la Ludwig Wittgenstein. How many basis vectors do we need again? The famous three need their three opposites, rotated 180 degrees. \"What minimum basis might get by without needing opposites?\"\n", "\n", "We guess about this and that, whether this or that happened in the past, or has yet to happen. When making these guesses, we use existing data as evidence. A model that's scoring well is able to correctly predict what we already know to be the case.\n", "\n", "### The Science of Predicting\n", "\n", "Under the heading of \"prediction\", therefore, comes \"the ability to guess correctly\", whether we're looking into the future or into the past. Keep in mind that Physics, including Quantum Physics, is just as interested in prediction, in \"guessing with some confidence\", as any discipline.\n", "\n", "A goal, in engineering, is to have some influence over outcomes, and that means looking for trimtabs. 
\n", "\n", "How might we optimize various distribution networks, such as the internet itself, so that they're less likely to bog down in traffic jams?\n", "\n", "### Historical Sidebar\n", "\n", "\"Data Science\" is a relatively recent name for what used to be called Statistics. We still have Statistics, but ever since statistics joined forces with Machine Learning, the term \"data science\" has been in the foreground. The evolution of Machine Learning has been against the backdrop of some professional debates the statisticians have been having. One of these debates has been between so-called \"Frequentists\" and another camp known as \"Bayesians\".\n", "\n", "### Research Project: Recent History of Data Science\n", "\n", "Looking for a research topic? Here's [a place to start](https://www.amazon.com/dp/B0050QB3EQ): *The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy* by Sharon Bertsch McGrayne." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hypertext transfer protocol'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "with open(\"glossary.json\", 'r') as infile: # context manager syntax\n", "    glossary = json.load(infile)\n", "glossary['HTTP']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Programming Interlude: Context Managers\n", "\n", "Since we've chosen Python for a kernel language (many choices exist), we might as well dig into it from time to time. In the code cell above, you'll notice the keyword ```with``` with the optional ```as``` piece, with indented code underneath (as many lines as we like). \n", "\n", "The indented code is the body of our \"context\" which is entered at the top and exited at the bottom. 
Entering and exiting a context automatically trigger the ```__enter__``` and ```__exit__``` methods of the object we're using with ```with```." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Content of knight_bag: Holy Grail from Goth Castle\n" ] } ], "source": [ "class Castle:\n", "    \"\"\"\n", "    Example of a class designed to perform as a \n", "    context manager, as triggered by keyword 'with'\n", "    \"\"\"\n", "\n", "    def __init__(self, name):\n", "        self.name = name\n", "\n", "    def __enter__(self):\n", "        return self # pass forward through as\n", "\n", "    def inner_sanctum(self):\n", "        # Monty Python allusion\n", "        return \"Holy Grail from %s\" % self.name\n", "\n", "    def __exit__(self, *oops):\n", "        # oops is (exc_type, exc_value, traceback), all None if no error\n", "        if oops[0]:\n", "            # do cleanup\n", "            pass\n", "        return True # a true value suppresses any exception from the body\n", "\n", "with Castle(\"Goth Castle\") as castle:\n", "    knight_bag = castle.inner_sanctum()\n", "\n", "print(\"Content of knight_bag:\", knight_bag)" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "glossary[\"Bayesian\"] = \"inferential methods useable even in the absense of any prospect for controlled studies\"\n", "glossary[\"Pharo\"] = \"a Smalltalk-like language and ecosystem that competes with Python's\"\n", "glossary[\"Sphinx\"] = \"a documentation generator, targeting the web in particular, for use with Python\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Digital Mathematics curriculum\n", "\n", "In [Digital Mathematics: Heuristics for Teachers](http://wikieducator.org/Digital_Math) you will find a way of carving up our mathematical domain into four sections:\n", "\n", "* Martian Math (looking towards the future)\n", "* Neolithic Math (looking towards the past)\n", "* Casino Math (looking at risks)\n", "* Supermarket Math (looking at ecological systems)\n", "\n", "Ready for [Part Two](dataviz2.ipynb)?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Tools...\n", "\n", "In addition to Python the language, which is our Kernel, we need to use some add-on 3rd party packages, which usually get installed in a subfolder called ```site-packages``` and associated with the specific Python you're using. \n", "\n", "Four of these packages are: \n", "\n", "* ```numpy``` (a workhorse that works with tensors, or n-dimensional arrays)\n", "* ```pandas``` for encapsulating [tensors](https://www.tensorflow.org/guide/tensors) and adding dictionary-like labeling\n", "* ```matplotlib``` for doing the actual visualizations\n", "* ```seaborn``` for making matplotlib even prettier\n", "\n", "### Will I Ever Know Enough?\n", "\n", "What you might be asking yourself, perhaps having glanced at some documentation, is:\n", "\n", "1. where to begin? and\n", "2. will I really need to memorize hundreds of commands to control each one of these products?\n", "\n", "Our assumption here is you're involved in \"world game\" meaning thinking globally, acting locally. \n", "\n", "You're on the faculty of a think tank. People look to you for guidance.\n", "\n", "To get a stronger grasp on what's going on, you read a lot, but you also look at data that's sometimes too new to have yet led many, if any, to draw conclusions. You are one of those privileged data analysts with a special vantage point, who will share your sense of what it all means with your peers.\n", "\n", "That's partly why you read, and also write a lot: to keep your communication skills polished. We're learning new language our entire lives. New vocabularies. New \"games\" (language games), some of which are literally games. Learning from data also involves applying the techniques of data science, which may include using machine learning algorithms.\n", "\n", "The data you're studying is not necessarily \"big data\" although it may be. \"Small data\" may still be quite a lot, by 20th Century standards.\n", "\n", "#### Research Project: Apache Foundation\n", "\n", "[The Apache Foundation](https://www.apache.org/) helps fund a number of valuable free and open source products built to work with big data. In order to gain some fluency with the concepts, do some research on these projects.\n", "\n", "\n", "### What's an API?\n", "\n", "As for memorization, your best bet is to stay in the habit of consulting documentation, and deciphering it. What you're often looking for is advice on how to use an \"API\" or Application Programming Interface. You might call it a control panel or dashboard, but unless you're operating a GUI, the API is likely encountered in the thick of some programming language, such as Python, Ruby, or JavaScript.\n", "\n", "#### Reading the Docs\n", "\n", "Looking ahead to the next Notebook: \n", "\n", "* How do I sort a DataFrame by index? [Check here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html)\n", "* How do I sort a DataFrame by any column? [Check here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)\n", "\n", "If you work with these tools on an everyday basis, then you'll become more adept through practice. However, what's rewarding about programming is that your code will run extremely fast even if you take a relatively long time to write it, compared to someone else who writes code faster. \n", "\n", "Better to take your time and understand what you're doing, than just cut and paste a lot of code you find on the internet. It's fine to cut and paste code, but plan to spend time getting to understand it in some detail. That way, you'll continue along your learning curve.\n", "\n", "A common misapprehension about \"learning to code\" is that \"real programming\" always involves starting with a blank canvas and writing everything from scratch. 
Certainly piano players don't do that when it comes to piano playing. Sometimes starting from scratch is a good approach. Other times, your best bet is to begin with some existing code and modify it to suit your own purposes.\n", "\n", "Without further delay, let's get to know some of our data science tools, each with its own API." ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt # done above already\n", "import matplotlib as mpl\n", "import seaborn as sns\n", "\n", "from math import sin, cos, radians # let's plot some trig functions!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice that you don't need to import Python itself. That's because Python is the Kernel behind the scenes running all these code cells. One specifies the Kernel upon starting a new Jupyter Notebook." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy version : 1.17.2\n", "Pandas version : 0.25.1\n", "Matplotlib version: 3.1.1\n", "Seaborn version : 0.9.0\n" ] } ], "source": [ "# Kernel is Python 3.6 or above\n", "print(f\"\"\"\\\n", "Numpy version : {np.__version__}\n", "Pandas version : {pd.__version__}\n", "Matplotlib version: {mpl.__version__}\n", "Seaborn version : {sns.__version__}\"\"\"\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You probably won't want or need to upgrade each time there's a version change. In fact sometimes you may find yourself in the opposite situation, of needing to lock in an old version of something. Programmers use containers and virtual environments to preserve old ecosystems and keep them from contaminating each other.\n", "\n", "When you do upgrade a package, you may find rerunning the same code results in warnings or outright errors. Packages with stable APIs are less likely to surprise you in this way. 
It's a good idea to consult documentation to find out what's new, if you actually have a choice about whether to upgrade or not." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy version : 1.17.2\n", "Pandas version : 0.25.1\n", "Matplotlib version: 3.1.1\n", "Seaborn version : 0.9.0\n" ] } ], "source": [ "# if you have an earlier kernel\n", "print(\"\"\"\\\n", "Numpy version : {}\n", "Pandas version : {}\n", "Matplotlib version: {}\n", "Seaborn version : {}\"\"\".format(\n", "np.__version__, pd.__version__, mpl.__version__, sns.__version__)\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The code cell below is quite typical of how we might use ```plt``` (matplotlib.pyplot) together with ```np``` (numpy). Note that ```pd``` (pandas) is not yet involved. We'll be seeing it soon.\n", "\n", "The ```np.linspace``` command is one of the most used, as we so often need a particular number of evenly spaced numbers between a minimum and maximum extreme, both of which it includes by default. ```np.arange``` is the other workhorse. It takes a minimum and maximum extreme, just like ```linspace```; however, the third argument is the increment you wish to use rather than the number of points. ```arange``` will figure out how many elements you need, up to but not including the limiting value.\n", "\n", "Note that both of these functions return ```np.ndarray``` objects, where the ```ndarray``` type is the star of ```numpy```. An ```ndarray``` is a multi-dimensional array, meaning it has one or more axes. These axes define the coordinate system structure used to address the contained elements. You'll learn more about the ins and outs of ```ndarrays``` from other notebooks." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "domain = np.linspace(-5, 5, 100) # give me 100 points from -5 to 5\n", "y_sin = np.sin(domain) # sine of all 100 points at once\n", "y_cos = np.cos(domain) # cosine of all 100 points at once\n", "\n", "def plot_functions():\n", "    plt.figure(figsize=(10, 5))\n", "    plt.xlabel(\"X\")\n", "    plt.ylabel(\"Y\")\n", "    plt.title(\"Trig Functions\")\n", "    # 'go' = green circles, 'y^' = yellow triangles\n", "    lines = plt.plot(domain, y_sin, 'go', domain, y_cos, 'y^')\n", "    # https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html\n", "    leg = plt.legend(lines, (\"sine\", \"cosine\"),\n", "                     title=\"Key\", frameon=True,\n", "                     shadow=True, facecolor=\"gray\",\n", "                     borderaxespad=2)\n", "    plt.axis([-6, 6, -1.5, 1.5]) # [xmin, xmax, ymin, ymax]\n", "    plt.show()\n", "\n", "plot_functions()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Historical Sidebar\n", "\n", "Do you live in a dome home? Trigonometric functions prove useful when it comes to computing the vertexes of a geodesic sphere. \n", "\n", "One of the best primers on the topic is [Divided Spheres](http://www.dividedspheres.com/) by [Ed Popko](http://www.dividedspheres.com/?page_id=19). Dome homes became popular from the 1960s onward, as an alternative to the more conventional house.\n", "\n", "The two videos below talk about how we might (or might not) want to envision dome homes going forward." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import YouTubeVideo\n", "YouTubeVideo(\"QV4m76Om7bk\") # https://youtu.be/QV4m76Om7bk" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "YouTubeVideo(\"rnkjVd1h8oE\") # https://youtu.be/rnkjVd1h8oE" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The [nbviewer view](https://nbviewer.jupyter.org/github/4dsolutions/School_of_Tomorrow/blob/master/dataviz.ipynb) of this notebook will render the YouTube videos in place. GitHub does not." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Literary Sidebar\n", "\n", "Another author who ventured into the realm of geodesic dome design was [Hugh Kenner](https://en.wikipedia.org/wiki/Hugh_Kenner), better known for [The Pound Era](https://en.wikipedia.org/wiki/The_Pound_Era). 
\n", "\n", "He also wrote [Bucky](https://www.amazon.com/Bucky-Guided-Tour-Buckminster-Fuller/dp/0688001416) and [Geodesic Math and How to Use It](https://www.amazon.com/dp/0520239318)." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def make_table():\n", "    # dtype=object is the builtin; the np.object alias was removed in NumPy 1.24\n", "    keys = pd.Series( list(glossary.keys()), dtype=object)\n", "    values = pd.Series( list(glossary.values()), dtype=object)\n", "    df = pd.DataFrame({\"term\":keys, \"definition\":values}).set_index(\"term\")\n", "    # create and delete a sorting column, wherein the terms are all uppercase\n", "    df[\"sort_column\"] = df.index.str.upper()\n", "    df.sort_values(['sort_column'], axis=0, ascending=True, inplace=True)\n", "    del df[\"sort_column\"] # now that the df is sorted, delete the sorting column\n", "    return df" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# glossary is an ordinary Python dict, stored as JSON in a text file\n", "glossary[\"matplotlib\"] = \"data visualization package for Python, originally written by John D. 
Hunter\"\n", "glossary[\"numpy\"] = \"number crunchy goodness, vectorizes computations on n-dimensional arrays\"\n", "glossary[\"pandas\"] = \"wraps numpy arrays in handsome frames with row and column indexes\"\n", "glossary[\"seaborn\"] = \"adds new powers to matplotlib, makes pretty plots\"\n", "glossary[\"API\"] = \"a set of functions that take variable arguments, providing programmed control of something\"\n", "glossary[\"Ruby\"] = \"a programming language somewhat like Python and Perl, invented by Yukihiro Matsumoto\"\n", "glossary[\"ndarray\"] = \"n-dimensional array, the star of the numpy package, a multi-axis data structure\"\n", "glossary[\"DataFrame\"] = \"the star of the pandas package, providing ndarrays with a framing infrastructure\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "glossary_df = make_table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python function above has the job of taking our ```glossary``` object, a Python dictionary, and turning it into a pandas DataFrame object. The dict's keys should comprise our index of terms and be sorted in a case-insensitive manner." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>definition</th></tr>\n",
"    <tr><th>term</th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>API</th><td>a set of functions that take variable arguments, providing programmed control of something</td></tr>\n",
"    <tr><th>Bayesian</th><td>inferential methods useable even in the absense of any prospect for controlled studies</td></tr>\n",
"    <tr><th>cell</th><td>a Jupyter Notebook consists of mostly Code and Markdown cells</td></tr>\n",
"    <tr><th>code cell</th><td>where runnable code, interpreted by the Kernel, is displayed and color coded</td></tr>\n",
"    <tr><th>CSV</th><td>comma-separated values, one of the simplest data sharing formats</td></tr>\n",
"    <tr><th>DataFrame</th><td>the star of the pandas package, providing ndarrays with a framing infrastructure</td></tr>\n",
"    <tr><th>DOM</th><td>the Document Object Model is a tree of graph of a document in a web browser</td></tr>\n",
"    <tr><th>HTML</th><td>hypertext markup language, almost an XML, defines the DOM in tandem with CSS</td></tr>\n",
"    <tr><th>HTTP</th><td>hypertext transfer protocol</td></tr>\n",
"    <tr><th>JavaScript</th><td>a computer language, not confined to running inside browsers but happy there</td></tr>\n",
"    <tr><th>json</th><td>JavaScript Object Notation is a way to save data (compare with XML)</td></tr>\n",
"    <tr><th>Jupyter Notebook (JN)</th><td>like a web page, but interactive, stored as json</td></tr>\n",
"    <tr><th>Kernel</th><td>an interpreter, e.g. Python, ready to process JN code cells and return results</td></tr>\n",
"    <tr><th>localhost</th><td>the IP address of the host computer: 127.0.0.1</td></tr>\n",
"    <tr><th>markdown cell</th><td>uses a markup called markdown to format the text cells in a Jupyter Notebook</td></tr>\n",
"    <tr><th>matplotlib</th><td>data visualization package for Python, originally written by John D. Hunter</td></tr>\n",
"    <tr><th>ndarray</th><td>n-dimensional array, the star of the numpy package, a multi-axis data structure</td></tr>\n",
"    <tr><th>numpy</th><td>number crunchy goodness, vectorizes computations on n-dimensional arrays</td></tr>\n",
"    <tr><th>pandas</th><td>wraps numpy arrays in handsome frames with row and column indexes</td></tr>\n",
"    <tr><th>Pascal</th><td>an early computer language, later commercially available as Delphi from Borland</td></tr>\n",
"    <tr><th>PGP</th><td>Pretty Good Privacy, RSA before the US patent expired, by Phil Zimmerman</td></tr>\n",
"    <tr><th>Pharo</th><td>a Smalltalk-like language and ecosystem that competes with Python's</td></tr>\n",
"    <tr><th>port</th><td>internet services connect through IP:port addresses, JN usually on port 8888</td></tr>\n",
"    <tr><th>Python</th><td>a computer language from Holland (the Netherlands) that went viral</td></tr>\n",
"    <tr><th>RSA</th><td>public key crypto algorithm, named for collaborators Rivest, Shamir, Adleman</td></tr>\n",
"    <tr><th>Ruby</th><td>a programming language somewhat like Python and Perl, invented by Yukihiro Matsumoto</td></tr>\n",
"    <tr><th>seaborn</th><td>adds new powers to matplotlib, makes pretty plots</td></tr>\n",
"    <tr><th>SGML</th><td>a parent specification behind what eventually became XML</td></tr>\n",
"    <tr><th>Sphinx</th><td>a documentation generator, targeting the web in particular, for use with Python</td></tr>\n",
"    <tr><th>TLS</th><td>Transport Layer Security, used to turn HTTP into HTTPS</td></tr>\n",
"    <tr><th>web browser</th><td>HTTP client, sends requests, gets responses</td></tr>\n",
"    <tr><th>web server</th><td>accepts and processes (or rejects) HTTP requests, sends responses</td></tr>\n",
"    <tr><th>XML</th><td>a markup language using pointy brackets, reminiscent of HTML, for structured data</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " definition\n", "term \n", "API a set of functions that take variable arguments, providing programmed control of something\n", "Bayesian inferential methods useable even in the absense of any prospect for controlled studies \n", "cell a Jupyter Notebook consists of mostly Code and Markdown cells \n", "code cell where runnable code, interpreted by the Kernel, is displayed and color coded \n", "CSV comma-separated values, one of the simplest data sharing formats \n", "DataFrame the star of the pandas package, providing ndarrays with a framing infrastructure \n", "DOM the Document Object Model is a tree of graph of a document in a web browser \n", "HTML hypertext markup language, almost an XML, defines the DOM in tandem with CSS \n", "HTTP hypertext transfer protocol \n", "JavaScript a computer language, not confined to running inside browsers but happy there \n", "json JavaScript Object Notation is a way to save data (compare with XML) \n", "Jupyter Notebook (JN) like a web page, but interactive, stored as json \n", "Kernel an interpreter, e.g. Python, ready to process JN code cells and return results \n", "localhost the IP address of the host computer: 127.0.0.1 \n", "markdown cell uses a markup called markdown to format the text cells in a Jupyter Notebook \n", "matplotlib data visualization package for Python, originally written by John D. 
Hunter \n", "ndarray n-dimensional array, the star of the numpy package, a multi-axis data structure \n", "numpy number crunchy goodness, vectorizes computations on n-dimensional arrays \n", "pandas wraps numpy arrays in handsome frames with row and column indexes \n", "Pascal an early computer language, later commercially available as Delphi from Borland \n", "PGP Pretty Good Privacy, RSA before the US patent expired, by Phil Zimmerman \n", "Pharo a Smalltalk-like language and ecosystem that competes with Python's \n", "port internet services connect through IP:port addresses, JN usually on port 8888 \n", "Python a computer language from Holland (the Netherlands) that went viral \n", "RSA public key crypto algorithm, named for collaborators Rivest, Shamir, Adleman \n", "Ruby a programming language somewhat like Python and Perl, invented by Yukihiro Matsumoto \n", "seaborn adds new powers to matplotlib, makes pretty plots \n", "SGML a parent specification behind what eventually became XML \n", "Sphinx a documentation generator, targeting the web in particular, for use with Python \n", "TLS Transport Layer Security, used to turn HTTP into HTTPS \n", "web browser HTTP client, sends requests, gets responses \n", "web server accepts and processes (or rejects) HTTP requests, sends responses \n", "XML a markup language using pointy brackets, reminiscent of HTML, for structured data " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_colwidth', -1) # no limit on column width (newer pandas versions expect None instead of -1)\n", "glossary_df" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "glossary_df.to_json('glossary2.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Slicing a Pandas DataFrame\n", "\n", "We're free to pick out a range of rows based on starting and ending index labels, using the .loc method with square brackets. 
The .iloc method, by contrast, selects rows by integer position, counting from 0, whether or not the index itself is numeric." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
definition
term
code cellwhere runnable code, interpreted by the Kernel, is displayed and color coded
CSVcomma-separated values, one of the simplest data sharing formats
DataFramethe star of the pandas package, providing ndarrays with a framing infrastructure
DOMthe Document Object Model is a tree of graph of a document in a web browser
HTMLhypertext markup language, almost an XML, defines the DOM in tandem with CSS
HTTPhypertext transfer protocol
JavaScripta computer language, not confined to running inside browsers but happy there
\n", "
" ], "text/plain": [ " definition\n", "term \n", "code cell where runnable code, interpreted by the Kernel, is displayed and color coded \n", "CSV comma-separated values, one of the simplest data sharing formats \n", "DataFrame the star of the pandas package, providing ndarrays with a framing infrastructure\n", "DOM the Document Object Model is a tree of graph of a document in a web browser \n", "HTML hypertext markup language, almost an XML, defines the DOM in tandem with CSS \n", "HTTP hypertext transfer protocol \n", "JavaScript a computer language, not confined to running inside browsers but happy there " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glossary_df.iloc[3:10] # numeric indexing is from 0 and non-inclusive of the outer bound" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
definition
term
HTMLhypertext markup language, almost an XML, defines the DOM in tandem with CSS
HTTPhypertext transfer protocol
JavaScripta computer language, not confined to running inside browsers but happy there
jsonJavaScript Object Notation is a way to save data (compare with XML)
Jupyter Notebook (JN)like a web page, but interactive, stored as json
Kernelan interpreter, e.g. Python, ready to process JN code cells and return results
\n", "
" ], "text/plain": [ " definition\n", "term \n", "HTML hypertext markup language, almost an XML, defines the DOM in tandem with CSS \n", "HTTP hypertext transfer protocol \n", "JavaScript a computer language, not confined to running inside browsers but happy there \n", "json JavaScript Object Notation is a way to save data (compare with XML) \n", "Jupyter Notebook (JN) like a web page, but interactive, stored as json \n", "Kernel an interpreter, e.g. Python, ready to process JN code cells and return results" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glossary_df.loc[\"HTML\":\"Kernel\"]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
definition
term
Pythona computer language from Holland (the Netherlands) that went viral
RSApublic key crypto algorithm, named for collaborators Rivest, Shamir, Adleman
Rubya programming language somewhat like Python and Perl, invented by Yukihiro Matsumoto
seabornadds new powers to matplotlib, makes pretty plots
SGMLa parent specification behind what eventually became XML
Sphinxa documentation generator, targeting the web in particular, for use with Python
TLSTransport Layer Security, used to turn HTTP into HTTPS
web browserHTTP client, sends requests, gets responses
web serveraccepts and processes (or rejects) HTTP requests, sends responses
XMLa markup language using pointy brackets, reminiscent of HTML, for structured data
\n", "
" ], "text/plain": [ " definition\n", "term \n", "Python a computer language from Holland (the Netherlands) that went viral \n", "RSA public key crypto algorithm, named for collaborators Rivest, Shamir, Adleman \n", "Ruby a programming language somewhat like Python and Perl, invented by Yukihiro Matsumoto\n", "seaborn adds new powers to matplotlib, makes pretty plots \n", "SGML a parent specification behind what eventually became XML \n", "Sphinx a documentation generator, targeting the web in particular, for use with Python \n", "TLS Transport Layer Security, used to turn HTTP into HTTPS \n", "web browser HTTP client, sends requests, gets responses \n", "web server accepts and processes (or rejects) HTTP requests, sends responses \n", "XML a markup language using pointy brackets, reminiscent of HTML, for structured data " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glossary_df.loc[\"Python\":] # slice from Python to the end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A First Look at Seaborn\n", "\n", "[Seaborn](https://seaborn.pydata.org/introduction.html)\n", "\n", "The only change between the two plots below is that ```sns.set()``` is run before the second call to the very same ```plot_functions()```. \n", "\n", "Notice the cosmetic differences, procured for free." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_functions()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "sns.set()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_functions()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does more advanced seaborn look like? [Click here](https://towardsdatascience.com/3-awesome-visualization-techniques-for-every-dataset-9737eecacbe8) for an example on *Medium*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anatomy of a pandas Series\n", "\n", "I am a Series, what are my parts? I'm more than just a numpy array, but you could say I have a numpy array as payload.\n", "\n", "#### What does it eat?\n", "\n", "How might I be [initialized](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#constructor)? Let's try me." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from pandas import Series" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "data = {'a':1, 'b':2, 'z':22}\n", "test1 = Series(data)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1 \n", "b 2 \n", "z 22\n", "dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so a dictionary works. You could decompose (deconstruct) a dict into its values and keys, using the corresponding methods, and feed those in separately, with the keys as the index, but why bother? Still, it's nice to know that we can." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1 \n", "b 2 \n", "z 22\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test1a = Series(data=list(data.values()), index=data.keys())\n", "test1a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why was it necessary to feed ```data.values()``` to the list type, instead of just using it directly? \n", "\n", "Modify the code and see. \n", "\n", "The object returned by ```data.values()``` is a ```dict_values``` view rather than a list; in the pandas version used here it is treated as a single object to be repeated over and over, for each index row. Atom smash it with ```list()``` into component particles and you're set." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from string import ascii_lowercase as letters\n", "test2 = Series(np.arange(10), \n", "               index=list(letters)[:10], # just as many as needed\n", "               name = \"Labeled\", dtype=np.int8)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 1\n", "c 2\n", "d 3\n", "e 4\n", "f 5\n", "g 6\n", "h 7\n", "i 8\n", "j 9\n", "Name: Labeled, dtype: int8" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test2" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "payload = test2.values # extract the numpy array nutty goodness" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(payload) # or tolist() if you wish a Python list" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)" ] }, "execution_count": 31, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "payload" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def digitrange(minlen, maxlen, base=2):\n", "    \"\"\"Generator yielding tuples of maxlen digits in the given base, counting upward, most significant digit first.\"\"\"\n", "    digits = [0] * maxlen  # least significant digit is stored first\n", "    loop = True\n", "    if minlen > 0:\n", "        digits[minlen] = 1  # as written, counting then starts at base**minlen\n", "    while loop:\n", "        yield tuple(reversed(digits))\n", "        digits[0] += 1\n", "        i = 0\n", "        while digits[i] >= base:  # propagate the carry\n", "            if ((i+1) >= maxlen):  # overflow past maxlen digits: stop\n", "                loop = False\n", "                break\n", "            digits[i] = 0\n", "            digits[i+1] += 1\n", "            i += 1" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_values([1, 5, 10, 10, 5, 1])\n" ] } ], "source": [ "gen = digitrange(0, 5, base=2)\n", "from collections import defaultdict\n", "tally = defaultdict(int)\n", "# count the 5-bit tuples by how many 1s each contains; the tallies form a row of Pascal's triangle\n", "for p in gen:\n", "    tally[p.count(1)] += 1\n", "print(tally.values())" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "image/jpeg": "\n", "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "YouTubeVideo(\"WWv0RUxDfbs\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }