{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "init_cell": true, "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" }, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "notebookRunGroups": { "groupValue": "" }, "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# NumPy and Pandas" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "cell_style": "center", "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 1. NumPy\n", "\n", "Numpy is a library for working with tensors, i.e. multi-dimensional arrays. Array has values of the same underlying type, and it is simpler than dataframe, but it offers more mathematical operations, and creates less overhead.\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### The basics of NumPy arrays\n", "\n", "Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.\n", "\n", "We’ll cover a few categories of basic array manipulations here:\n", "\n", "- Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.\n", "- Indexing of arrays: Getting and setting the value of individual array elements.\n", "- Slicing of arrays: Getting and setting smaller subarrays within a larger array.\n", "- Reshaping of arrays: Changing the shape of a given array Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### NumPy array attributes\n", "\n", "First let’s discuss some useful array attributes. We’ll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import numpy as np\n", "np.random.seed(0) # seed for reproducibility\n", "\n", "x1 = np.random.randint(10, size=6) # One-dimensional array\n", "x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array\n", "x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x3 ndim: 3\n", "x3 shape: (3, 4, 5)\n", "x3 size: 60\n" ] } ], "source": [ "print(\"x3 ndim: \", x3.ndim)\n", "print(\"x3 shape:\", x3.shape)\n", "print(\"x3 size: \", x3.size)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dtype: int64\n" ] } ], "source": [ "print(\"dtype:\", x3.dtype)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "itemsize: 8 bytes\n", "nbytes: 480 bytes\n" ] } ], "source": [ "print(\"itemsize:\", x3.itemsize, \"bytes\")\n", "print(\"nbytes:\", x3.nbytes, \"bytes\")" ] }, { "cell_type": "markdown", "metadata": { "cell_style": "split", "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Computation on NumPy arrays: universal functions\n", "\n", "Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use vectorized operations, generally implemented through NumPy’s universal functions (ufuncs)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([0.16666667, 1. , 0.25 , 0.25 , 0.125 ])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.random.seed(0)\n", "\n", "\n", "def compute_reciprocals(values):\n", " output = np.empty(len(values))\n", " for i in range(len(values)):\n", " output[i] = 1.0 / values[i]\n", " return output\n", "\n", "\n", "values = np.random.randint(1, 10, size=5)\n", "compute_reciprocals(values)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cell_style": "split", "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.87 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "big_array = np.random.randint(1, 100, size=1000000)\n", "%timeit compute_reciprocals(big_array)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.16666667 1. 0.25 0.25 0.125 ]\n", "[0.16666667 1. 0.25 0.25 0.125 ]\n" ] } ], "source": [ "print(compute_reciprocals(values))\n", "print(1.0 / values)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "606 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "source": [ "%timeit (1.0 / big_array)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Exploring NumPy’s ufuncs" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = [0 1 2 3]\n", "x + 5 = [5 6 7 8]\n", "x - 5 = [-5 -4 -3 -2]\n", "x * 2 = [0 2 4 6]\n", "x / 2 = [0. 0.5 1. 1.5]\n", "x // 2 = [0 0 1 1]\n" ] } ], "source": [ "x = np.arange(4)\n", "print(\"x =\", x)\n", "print(\"x + 5 =\", x + 5)\n", "print(\"x - 5 =\", x - 5)\n", "print(\"x * 2 =\", x * 2)\n", "print(\"x / 2 =\", x / 2)\n", "print(\"x // 2 =\", x // 2) # floor division" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Aggregations: min, max, and everything in between\n", "\n", "Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "50.461758453195614" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L = np.random.random(100)\n", "sum(L)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "50.46175845319564" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(L)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74.9 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "193 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "big_array = np.random.rand(1000000)\n", "%timeit sum(big_array)\n", "%timeit np.sum(big_array)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Minimum and maximum" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(7.071203171893359e-07, 0.9999997207656334)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "min(big_array), max(big_array)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(7.071203171893359e-07, 0.9999997207656334)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.min(big_array), np.max(big_array)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "54.5 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "175 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n" ] } ], "source": [ "%timeit min(big_array)\n", "%timeit np.min(big_array)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.071203171893359e-07 0.9999997207656334 500216.8034810001\n" ] } ], "source": [ "print(big_array.min(), big_array.max(), big_array.sum())" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Computation on arrays: broadcasting\n", "\n", "Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([5, 6, 7])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.array([0, 1, 2])\n", "b = np.array([5, 5, 5])\n", "a + b" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([5, 6, 7])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a + 5" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([[1., 1., 1.],\n", " [1., 1., 1.],\n", " [1., 1., 1.]])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M = np.ones((3, 3))\n", "M" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([[1., 2., 3.],\n", " [1., 2., 3.],\n", " [1., 2., 3.]])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M + a" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Structured data: NumPy’s structured arrays\n", "\n", "While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy’s structured arrays and record arrays" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "name = ['Alice', 'Bob', 'Cathy', 'Doug']\n", "age = [25, 45, 37, 19]\n", "weight = [55.0, 85.5, 68.0, 61.5]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('name', '\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Introducing Pandas objects\n", "\n", "At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### The Pandas Series object" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 0.25\n", "1 0.50\n", "2 0.75\n", "3 1.00\n", "dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "data = pd.Series([0.25, 0.5, 0.75, 1.0])\n", "data" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([0.25, 0.5 , 0.75, 1. ])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.values" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.index" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1 0.50\n", "2 0.75\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[1]\n", "data[1:3]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Data indexing and selection\n", "\n", "In Pandas, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., `arr[2, 1]`), slicing (e.g., `arr[:, 1:5]`), masking (e.g., `arr[arr > 0]`), fancy indexing (e.g., `arr[0, [1, 5]]`), and combinations thereof (e.g., `arr[:, [1, 5]]`)." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data selection in Series" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "a 0.25\n", "b 0.50\n", "c 0.75\n", "d 1.00\n", "dtype: float64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series(\n", " [0.25, 0.5, 0.75, 1.0],\n", " index=['a', 'b', 'c', 'd']\n", ")\n", "data" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['b']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'a' in data" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c', 'd'], dtype='object')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.keys()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(data.items())" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "1 a\n", "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])\n", "data" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3 b\n", "5 c\n", "dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[1:3]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[1]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Data selection in DataFrame\n", "\n", "Recall that a `DataFrame` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of `Series` structures sharing the same index. " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapop
California42396738332521
Texas69566226448193
New York14129719651127
Florida17031219552860
Illinois14999512882135
\n", "
" ], "text/plain": [ " area pop\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127\n", "Florida 170312 19552860\n", "Illinois 149995 12882135" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "area = pd.Series({\n", " 'California': 423967, 'Texas': 695662,\n", " 'New York': 141297, 'Florida': 170312,\n", " 'Illinois': 149995\n", "})\n", "pop = pd.Series({\n", " 'California': 38332521, 'Texas': 26448193,\n", " 'New York': 19651127, 'Florida': 19552860,\n", " 'Illinois': 12882135\n", "})\n", "data = pd.DataFrame({'area': area, 'pop': pop})\n", "data" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "California 423967\n", "Texas 695662\n", "New York 141297\n", "Florida 170312\n", "Illinois 149995\n", "Name: area, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['area']" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "California 423967\n", "Texas 695662\n", "New York 141297\n", "Florida 170312\n", "Illinois 149995\n", "Name: area, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.area" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.area is data['area']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### DataFrame as two-dimensional array\n", "\n", "As mentioned previously, we can also view the `DataFrame` as an enhanced two-dimensional array. We can examine the raw underlying data array using the `values` attribute:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([[ 423967, 38332521],\n", " [ 695662, 26448193],\n", " [ 141297, 19651127],\n", " [ 170312, 19552860],\n", " [ 149995, 12882135]])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.values" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CaliforniaTexasNew YorkFloridaIllinois
area423967695662141297170312149995
pop3833252126448193196511271955286012882135
\n", "
" ], "text/plain": [ " California Texas New York Florida Illinois\n", "area 423967 695662 141297 170312 149995\n", "pop 38332521 26448193 19651127 19552860 12882135" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.T" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapop
California42396738332521
Texas69566226448193
New York14129719651127
\n", "
" ], "text/plain": [ " area pop\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[:3, :2]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
areapop
California42396738332521
Texas69566226448193
New York14129719651127
Florida17031219552860
Illinois14999512882135
\n", "
" ], "text/plain": [ " area pop\n", "California 423967 38332521\n", "Texas 695662 26448193\n", "New York 141297 19651127\n", "Florida 170312 19552860\n", "Illinois 149995 12882135" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.loc[:'Illinois', :'pop']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Operating on data in Pandas\n", "\n", "Pandas inherits much of this functionality from NumPy, and the ufuncs are key to this." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Ufuncs: index preservation" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 6\n", "1 3\n", "2 7\n", "3 4\n", "dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(42)\n", "ser = pd.Series(rng.randint(0, 10, 4))\n", "ser" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABCD
06926
17437
27254
\n", "
" ], "text/plain": [ " A B C D\n", "0 6 9 2 6\n", "1 7 4 3 7\n", "2 7 2 5 4" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " rng.randint(0, 10, (3, 4)),\n", " columns=['A', 'B', 'C', 'D']\n", ")\n", "df" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 403.428793\n", "1 20.085537\n", "2 1096.633158\n", "3 54.598150\n", "dtype: float64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.exp(ser)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Handling missing data\n", "\n", "Real-world data is rarely clean and homogeneous.\n", "\n", "Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point `NaN` value, and the Python `None` object." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### `None`: pythonic missing data" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array([1, None, 3, 4], dtype=object)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vals1 = np.array([1, None, 3, 4])\n", "vals1" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "This `dtype=object` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dtype = object\n", "50.6 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "\n", "dtype = int\n", "589 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n", "\n" ] } ], "source": [ "for dtype in ['object', 'int']:\n", " print(\"dtype =\", dtype)\n", " %timeit np.arange(1E6, dtype=dtype).sum()\n", " print()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" }, "tags": [ "raises-exception" ] }, "outputs": [ { "ename": "TypeError", "evalue": "unsupported operand type(s) for +: 'int' and 'NoneType'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m/Users/zhangqi/ws/machine-learning/open-machine-learning-jupyter-book/slides/python-programming/python-in-data-science.ipynb Cell 79\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0m vals1\u001b[39m.\u001b[39;49msum()\n", "File \u001b[0;32m/usr/local/lib/python3.9/site-packages/numpy/core/_methods.py:48\u001b[0m, in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims, initial, where)\u001b[0m\n\u001b[1;32m 46\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_sum\u001b[39m(a, axis\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, dtype\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, out\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, keepdims\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m,\n\u001b[1;32m 47\u001b[0m initial\u001b[39m=\u001b[39m_NoValue, where\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m):\n\u001b[0;32m---> 48\u001b[0m \u001b[39mreturn\u001b[39;00m umr_sum(a, axis, dtype, out, keepdims, initial, where)\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" ] } ], "source": [ "vals1.sum()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### NaN: missing numerical data" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "dtype('float64')" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vals2 = np.array([1, np.nan, 3, 4]) \n", "vals2.dtype" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. " ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "590 µs ± 7.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n", "\n" ] } ], "source": [ "%timeit np.arange(1E6, dtype=dtype).sum()\n", "print()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "nan" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vals2.sum()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### NaN and None in Pandas" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 NaN\n", "2 2.0\n", "3 NaN\n", "dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([1, np.nan, 2, None])" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.Series(range(2), dtype=int)\n", "x" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 1.0\n", "dtype: float64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x[0] = None\n", "x" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Operating on null values" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "data = pd.Series([1, np.nan, 'hello', None])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "dtype: bool" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "2 hello\n", "dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data.notnull()]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "2 hello\n", "dtype: object" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dropna()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b NaN\n", "c 2.0\n", "d NaN\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", "data\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 0.0\n", "c 2.0\n", "d 0.0\n", "e 3.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.fillna(0)\n", "data" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "### Combining datasets: concat and append\n", "\n", "Some of the most interesting studies of data come from combining different data sources." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
0A0B0C0
1A1B1C1
2A2B2C2
\n", "
" ], "text/plain": [ " A B C\n", "0 A0 B0 C0\n", "1 A1 B1 C1\n", "2 A2 B2 C2" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def make_df(cols, ind):\n", " \"\"\"Quickly make a DataFrame\"\"\"\n", " data = {c: [str(c) + str(i) for i in ind]\n", " for c in cols}\n", " return pd.DataFrame(data, ind)\n", "\n", "\n", "# example DataFrame\n", "make_df('ABC', range(3))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Simple concatenation with pd.concat\n", "\n", "Pandas has a function, `pd.concat()`, which has a similar syntax to `np.concatenate` but contains a number of options that we’ll discuss momentarily:\n", "\n", "```\n", "pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,\n", " keys=None, levels=None, names=None, verify_integrity=False,\n", " copy=True)\n", "```" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "1 A\n", "2 B\n", "3 C\n", "dtype: object" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])\n", "ser1" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "4 D\n", "5 E\n", "6 F\n", "dtype: object" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])\n", "ser2" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "1 A\n", "2 B\n", "3 C\n", "4 D\n", "5 E\n", "6 F\n", "dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([ser1, ser2])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Combining datasets: merge and join\n", "\n", "One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.\n", "\n", "The `pd.merge()` function implements a number of types of joins: the *one-to-one*, *many-to-one*, and *many-to-many* joins." ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
\n", "
" ], "text/plain": [ " employee group\n", "0 Bob Accounting\n", "1 Jake Engineering\n", "2 Lisa Engineering\n", "3 Sue HR" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\n", " 'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],\n", " 'group': ['Accounting', 'Engineering', 'Engineering', 'HR']\n", "})\n", "df1" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeehire_date
0Lisa2004
1Bob2008
2Jake2012
3Sue2014
\n", "
" ], "text/plain": [ " employee hire_date\n", "0 Lisa 2004\n", "1 Bob 2008\n", "2 Jake 2012\n", "3 Sue 2014" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame({\n", " 'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],\n", " 'hire_date': [2004, 2008, 2012, 2014]\n", "})\n", "df2" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
employeegrouphire_date
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
\n", "
" ], "text/plain": [ " employee group hire_date\n", "0 Bob Accounting 2008\n", "1 Jake Engineering 2012\n", "2 Lisa Engineering 2004\n", "3 Sue HR 2014" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = pd.merge(df1, df2)\n", "df3" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 3. Your turn! 🚀\n", "\n", "[Data processing in Python](../../data-science/working-with-data/numpy.html#your-turn)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## 4. References\n", "\n", "1. [Working with data - Numpy](https://ocademy-ai.github.io/machine-learning/data-science/working-with-data/numpy.html)\n", "2. [Working with data - Pandas](https://ocademy-ai.github.io/machine-learning/data-science/working-with-data/pandas.html)" ] } ], "metadata": { "celltoolbar": "Slideshow", "init_cell": "run_on_kernel_ready", "jupytext": { "formats": "ipynb" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "rise": { "autolaunch": true, "chalkboard": { "color": [ "rgb(250, 250, 250)", "rgb(250, 250, 250)" ] }, "enable_chalkboard": true, "header": "", "scroll": true }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 2 }