{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "A6iFUUQLNDlE" }, "source": [ "\n", " \n", " \n", " \n", "
\n", " Run in Google Colab\n", " \n", " View on Github\n", " \n", " View raw on Github\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "Cps1y3AA5TqD" }, "source": [ "# Module 6: Data types and tidy data" ] }, { "cell_type": "markdown", "metadata": { "id": "WU236-s35TqM" }, "source": [ "## Tidy data\n", "\n", "Let's take a look at this small dataset: https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1686938700592, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "MG17akhx5TqN" }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 613 }, "executionInfo": { "elapsed": 168, "status": "ok", "timestamp": 1686938700927, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "Ah7fiB905TqQ", "outputId": "2435ddff-18ae-4d03-af72-ece1ab02df84" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k$75-100k$100-150k>150kDon't know/refused
0Agnostic27346081761371221098496
1Atheist12273752357073597476
2Buddhist27213034335862395354
3Catholic41861773267063811169497926331489
4Don’t know/refused151415111035211718116
5Evangelical Prot575869106498288114869497234141529
6Hindu1979113447485437
7Historically Black Prot2282442362381972231318178339
8Jehovah's Witness2027242421301511637
9Jewish1919252530956987151162
10Mainline Prot28949561965565111079397536341328
11Mormon294048515611285494269
12Muslim67910923168622
13Orthodox13172332324738424673
14Other Christian971113131418141218
15Other Faiths20334046496346404171
16Other World Religions5234273448
17Unaffiliated217299374365341528407321258597
\n", "
" ], "text/plain": [ " religion <$10k $10-20k $20-30k $30-40k $40-50k \n", "0 Agnostic 27 34 60 81 76 \\\n", "1 Atheist 12 27 37 52 35 \n", "2 Buddhist 27 21 30 34 33 \n", "3 Catholic 418 617 732 670 638 \n", "4 Don’t know/refused 15 14 15 11 10 \n", "5 Evangelical Prot 575 869 1064 982 881 \n", "6 Hindu 1 9 7 9 11 \n", "7 Historically Black Prot 228 244 236 238 197 \n", "8 Jehovah's Witness 20 27 24 24 21 \n", "9 Jewish 19 19 25 25 30 \n", "10 Mainline Prot 289 495 619 655 651 \n", "11 Mormon 29 40 48 51 56 \n", "12 Muslim 6 7 9 10 9 \n", "13 Orthodox 13 17 23 32 32 \n", "14 Other Christian 9 7 11 13 13 \n", "15 Other Faiths 20 33 40 46 49 \n", "16 Other World Religions 5 2 3 4 2 \n", "17 Unaffiliated 217 299 374 365 341 \n", "\n", " $50-75k $75-100k $100-150k >150k Don't know/refused \n", "0 137 122 109 84 96 \n", "1 70 73 59 74 76 \n", "2 58 62 39 53 54 \n", "3 1116 949 792 633 1489 \n", "4 35 21 17 18 116 \n", "5 1486 949 723 414 1529 \n", "6 34 47 48 54 37 \n", "7 223 131 81 78 339 \n", "8 30 15 11 6 37 \n", "9 95 69 87 151 162 \n", "10 1107 939 753 634 1328 \n", "11 112 85 49 42 69 \n", "12 23 16 8 6 22 \n", "13 47 38 42 46 73 \n", "14 14 18 14 12 18 \n", "15 63 46 40 41 71 \n", "16 7 3 4 4 8 \n", "17 528 407 321 258 597 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/relig_income.csv')\n", "pew_df" ] }, { "cell_type": "markdown", "metadata": { "id": "syUy_FXc5TqT" }, "source": [ "This dataset is about the relationships between income and religion, assembled from a research by the Pew Research Center. Is this dataset tidy or not? Why?\n", "\n", "Yes, many of the columns are values, not variable names. How should we fix it?\n", "\n", "Pandas provides a convenient function called [melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). You specify the `id_vars` that are variable columns, and `value_vars` that are value columns, and provide the name for the variable as well as the name for the values.\n", "\n", "**Q: so please go ahead and tidy it up! I'd suggest to use the variable name \"income\" and value name \"frequency\"**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "executionInfo": { "elapsed": 131, "status": "ok", "timestamp": 1686938983642, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "1MYnYfhe5TqU" }, "outputs": [], "source": [ "# YOUR SOLUTION HERE" ] }, { "cell_type": "markdown", "metadata": { "id": "k7Xz2GAn5TqV" }, "source": [ "If you were successful, you'll have something like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
religionincomefrequency
0Agnostic<$10k27
1Atheist<$10k12
2Buddhist<$10k27
3Catholic<$10k418
4Don’t know/refused<$10k15
............
175OrthodoxDon't know/refused73
176Other ChristianDon't know/refused18
177Other FaithsDon't know/refused71
178Other World ReligionsDon't know/refused8
179UnaffiliatedDon't know/refused597
\n", "

180 rows × 3 columns

\n", "
" ], "text/plain": [ " religion income frequency\n", "0 Agnostic <$10k 27\n", "1 Atheist <$10k 12\n", "2 Buddhist <$10k 27\n", "3 Catholic <$10k 418\n", "4 Don’t know/refused <$10k 15\n", ".. ... ... ...\n", "175 Orthodox Don't know/refused 73\n", "176 Other Christian Don't know/refused 18\n", "177 Other Faiths Don't know/refused 71\n", "178 Other World Religions Don't know/refused 8\n", "179 Unaffiliated Don't know/refused 597\n", "\n", "[180 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This \"tidy\" form is the default form that you want to keep your datasets. It is easy to filter, manipulate, and plot. Many of the visualization libraries we'll use in this course expect the data to be in this form." ] }, { "cell_type": "markdown", "metadata": { "id": "dDyOarQy5TqY" }, "source": [ "## Data types\n", "\n", "Let's talk about data types briefly. Understanding data types is not only important for choosing the right visualizations, but also important for efficient computing and storage of data. You may not have thought about how pandas represent data in memory. A Pandas `Dataframe` is essentially a bunch of `Series`, and those `Series` are essentially `numpy` arrays. An array may contain a fixed-length items such as integers or variable length items such as strings. Putting some efforts to think about the correct data type can potentially save a lot of memory as well as time.\n", "\n", "A nice example would be the categorical data type. If you have a variable that only has several possible values, it's essentially a categorical data. Take a look at the `income` variable." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 25, "status": "ok", "timestamp": 1686938043328, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "0KoiuKUo5TqZ", "outputId": "f2742f7d-2fb9-47f2-8190-a167b33c230b" }, "outputs": [ { "data": { "text/plain": [ "income\n", "<$10k 18\n", "$10-20k 18\n", "$20-30k 18\n", "$30-40k 18\n", "$40-50k 18\n", "$50-75k 18\n", "$75-100k 18\n", "$100-150k 18\n", ">150k 18\n", "Don't know/refused 18\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.income.value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "wJW_t5_K5Tqb" }, "source": [ "These were the column names in the original non-tidy data. The value can take only one of these income ranges and thus it is a categorical data. What is the data type that pandas use to store this column?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 23, "status": "ok", "timestamp": 1686938043329, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "6oIXzznW5Tqb", "outputId": "b077eecd-5dd0-4af5-9c85-896196c9e0ea" }, "outputs": [ { "data": { "text/plain": [ "dtype('O')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.income.dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "gq7wZ8RM5Tqc" }, "source": [ "The `O` means that it is an **O**bject data type, which does not have a fixed size like integer or float. The series contains a sort of pointer to the actual text objects. You can actually inspect the amount of memory used by the dataset." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 226, "status": "ok", "timestamp": 1686938043536, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "bZKUkW2o5Tqd", "outputId": "0528d96b-2750-4980-a014-16f615c5484e" }, "outputs": [ { "data": { "text/plain": [ "Index 132\n", "religion 1440\n", "income 1440\n", "frequency 1440\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.memory_usage()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 38, "status": "ok", "timestamp": 1686938043538, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "NOGRStlY5Tqe", "outputId": "06cb3b8f-30d6-416e-9fd2-2281dc53987b" }, "outputs": [ { "data": { "text/plain": [ "Index 132\n", "religion 12780\n", "income 11700\n", "frequency 1440\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "LuX_IM-W5Tqe" }, "source": [ "What's going on with the `deep=True` option? When you don't specify `deep=True`, the memory usage method just tells you the amount of memory used by the numpy arrays in the pandas dataframe. When you pass `deep=True`, it tells you the total amount of memory by including the memory used by all the text objects. So, the `religion` and `income` columns occupies almost ten times of memory than the `frequency` column, which is simply an array of integers." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 33, "status": "ok", "timestamp": 1686938043539, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "5_THR_Ux5Tqf", "outputId": "e9a12ef7-df32-4cfc-9ce0-8ac9e2316bd1" }, "outputs": [ { "data": { "text/plain": [ "dtype('int64')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.frequency.dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "31_v9CQE5Tqf" }, "source": [ "Is there any way to save up the memory? Note that there are only 10 categories in the income variable. That means we just need 10 numbers to represent the categories! Of course we need to store the names of each category, but that's just one-time cost. The simplest way to convert a column is using `astype` method." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "cH2j6lrU5Tqg" }, "outputs": [], "source": [ "income_categorical_series = pew_tidy_df.income.astype('category')\n", "# you can do pew_tidy_df.income = pew_tidy_df.income.astype('category')" ] }, { "cell_type": "markdown", "metadata": { "id": "LO3liL7A5Tqh" }, "source": [ "Now, this series has the `CategoricalDtype` dtype." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 26, "status": "ok", "timestamp": 1686938043542, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "VdDV-3FQ5Tqh", "outputId": "5e804917-ba2c-4eff-9416-738385f3affd" }, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=['$10-20k', '$100-150k', '$20-30k', '$30-40k', '$40-50k',\n", " '$50-75k', '$75-100k', '<$10k', '>150k',\n", " 'Don't know/refused'],\n", ", ordered=False)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income_categorical_series.dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "1Ns4mncK5Tqi" }, "source": [ "How much memory do we use?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 23, "status": "ok", "timestamp": 1686938043543, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "ZqDBtWw75Tqi", "outputId": "94155744-dab0-4cf3-9d53-979f949abc88" }, "outputs": [ { "data": { "text/plain": [ "1262" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income_categorical_series.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compared with the original column?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 20, "status": "ok", "timestamp": 1686938043544, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "eEyB-K3Y5Tqj", "outputId": "36f232c2-2bc2-48ea-fecf-ddf458e7e612" }, "outputs": [ { "data": { "text/plain": [ "11832" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.income.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "mRkVknuH5Tqk" }, "source": [ "We have reduced the memory usage by almost 10 fold! Not only that, because now the values are just numbers, it will be much faster to match, filter, manipulate. If your dataset is huge, this can save up a lot of space and time.\n", "\n", "If the categories have ordering, you can specify the ordering too." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 17, "status": "ok", "timestamp": 1686938043545, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "dbXVdCUw5Tqk", "outputId": "a6ae3122-6ecb-4ba1-fdc7-0e973cf1ec7a" }, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',\n", " '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',\n", " '>150k'],\n", ", ordered=True)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pandas.api.types import CategoricalDtype\n", "income_type = CategoricalDtype(categories=[\"Don't know/refused\", '<$10k', '$10-20k', '$20-30k', '$30-40k',\n", " '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'], ordered=True)\n", "income_type" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 28, "status": "ok", "timestamp": 1686938043721, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "g5NwkZjB5Tqk", "outputId": "53f7fa6f-87f0-4fca-8546-d5ea01c4c328" }, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=['Don't know/refused', '<$10k', '$10-20k', '$20-30k',\n", " '$30-40k', '$40-50k', '$50-75k', '$75-100k', '$100-150k',\n", " '>150k'],\n", ", ordered=True)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pew_tidy_df.income.astype(income_type).dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "IFQtbxyF5Tql" }, "source": [ "This data type now allows you to compare and sort based on the ordering.\n", "\n", "**Q: ok, now convert both religion and income columns of `pew_tidy_df` as categorical dtype (in place) and show that `pew_tidy_df` now uses much less memory**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 170, "status": "ok", "timestamp": 1686938128628, "user": { "displayName": "Vincent Wong", "userId": "06927694896148305320" }, "user_tz": 240 }, "id": "kaRvFJwV5Tql", "outputId": "09f3cd6d-b95f-46a2-ff87-e778269ba296" }, "outputs": [ { "data": { "text/plain": [ "Index 132\n", "religion 2035\n", "income 1130\n", "frequency 1440\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# YOUR SOLUTION HERE\n", "\n", "pew_tidy_df.memory_usage(deep=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "opJC8rlG5Tqm" }, "source": [ "## If you want to know more\n", "\n", "- [Jean-Nicholas Hould: Tidy Data in Python](http://www.jeannicholashould.com/tidy-data-in-python.html)\n", "- [Stephen Simmons | Pandas from the Inside](https://www.youtube.com/watch?v=CowlcrtSyME)\n", "- [Data school: How do I make my pandas DataFrame smaller and faster?](https://www.youtube.com/watch?v=wDYDYGyN_cw)" ] } ], "metadata": { "anaconda-cloud": {}, "colab": { "provenance": [] }, "kernel_info": { "name": "dviz" }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "nteract": { "version": "0.15.0" } }, "nbformat": 4, "nbformat_minor": 0 }