{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "\n", "
\n", "\"Unidata\n", "
\n", "\n", "

Pandas and Numpy for CSV files

\n", "

Unidata AMS 2021 Student Conference

\n", "\n", "
\n", "
\n", "\n", "---\n", "
\"Population
\n", "\n", "\n", "### Focuses\n", "* Using this notebook we will read in and manipulate a csv file full of data! [What is a csv file?](https://en.wikipedia.org/wiki/Comma-separated_values)\n", "* We will use pandas to organize, process, and plot miscellaneous data\n", "* Plot data from a csv file with [Matplotlib](https://matplotlib.org/)\n", "\n", "\n", "\n", "\n", "### Objectives\n", "1. [Read in a csv file](#1.-Read-in-a-csv-file)\n", "1. [Replace Missing Data With NaNs](#2.-Replace-Missing-Data-With-NaNs)\n", "1. [Plot data from our csv file](#3.-Plot-data-from-csv-file)\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Read in a csv file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CSV stands for Comma Separated Values. These files are easy to work with and ubiquitous in atmospheric science. \n", "In this first goal, we will read in a csv file that has data on global populations.\n", "\n", "The data we will work with is stored at '../../instructors/practice_files/populationbycountry19802010millions.csv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_to_data = '../../instructors/practice_files/populationbycountry19802010millions.csv'\n", "\n", "# To generate data from the csv file we can use pandas's read_csv() function\n", "population_data = pd.read_csv(path_to_data, delimiter=',', index_col=0) \n", "# delimiter tells python what distinguishes one entry from the next\n", "# index_col tells the read_csv() function which column will be used as the index\n", "print(type(population_data))\n", "print(population_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pd.read_csv()\n", "pd.read_csv() returns a pandas DataFrame object for us to work with. The 'index_col=0' was very helpful here. Because the data we wanted to read from our csv had the first column filled with the names of countries for the population data, selecting this to be our index was a natural choice. At the bottom of the printed DataFrame, we can see the shape of the DataFrame, [64 rows x 31 columns] for 64 countries and 31 years of data.\n", "\n", "Missing data is entered as '--', this may make our plotting more difficult so lets replace those values with NaN values. NaN stands for Not a Number and using these python designated data types with numpy will simplify analysis when working with incomplete data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Replace Missing Data With NaNs\n", "Here we will find and replace all missing data from our csv with NaN values. To do this we will use the replace() function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To use replace function, input the first entry with what you want to replace, and the second by its replacement\n", "population_data = population_data.replace('--', np.NaN)\n", "\n", "population_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect! We can now see that the population data has NaN values for missing entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Plot data from csv file\n", "\n", "We can plot temperature and pressure data by using subsetting the columns as we did above and using matplotlib. Instead, we will try to use the pandas built in plot functions!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the pandas .plot() function on our DataFrame\n", "population_data=population_data.astype(float)\n", "population_data.plot()\n", "plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently our pd.DataFrame() is plotting with countries on the x-axis. We would like to have each line represent its own country." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To plot each countries population growth, we have to swap the rows and columns of the DataFrame. This can be done uing the transpose function, or including a .T after the DataFrame you want to transpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population_data_transpose = population_data.T # Flip rows and columns of the population DataFrame\n", "population_data_transpose.plot(legend=True) # Plot the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Woohoo!\n", "We found a work around for the problem of plotting the data by country, but it looks like we may be trying to show too much data at once. Lets just focus on 5 countries from the DataFrame: Mali, Bhutan, Uzbekistan, Guinea, and Cambodia." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#lets just plot 5 different countries\n", "population_data_transpose.Mali.plot(legend=True)\n", "population_data_transpose.Bhutan.plot(legend=True)\n", "population_data_transpose.Uzbekistan.plot(legend=True)\n", "population_data_transpose.Guinea.plot(legend=True)\n", "population_data_transpose.Cambodia.plot(legend=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Almost done!\n", "\n", "Finally, lets add some labels and a title before finishing with our plot!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#lets just plot 5 different countries\n", "population_data_transpose.Mali.plot(legend=True)\n", "population_data_transpose.Bhutan.plot(legend=True)\n", "population_data_transpose.Uzbekistan.plot(legend=True)\n", "population_data_transpose.Guinea.plot(legend=True)\n", "population_data_transpose.Cambodia.plot(legend=True)\n", "\n", "plt.title('Population of 5 countries fromm 1980 to 2010')\n", "plt.xlabel('Years')\n", "plt.ylabel('Population in Millions')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Top\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nice job! You now know the basics of working with csv data!" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:pyaos-ams-2021]", "language": "python", "name": "conda-env-pyaos-ams-2021-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }