{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Importing Data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "> There is no data science without data.\n", ">\n", "> \\- A wise person" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Applied Review" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Fundamentals and Data in Python\n", "\n", "* Python stores its data in **variables** - words that you choose to represent values you've stored\n", "* This is done using **assignment** - you assign data to a variable" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Packages/Modules and Data in Python\n", "\n", "* Data is frequently represented inside a **DataFrame** - a class from the pandas library\n", "* The pandas library has **functions** for importing different types of files into DataFrames - operations that import data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## General Model for Importing Data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Memory and Size" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Python stores its data in **memory** - this makes it quick to access, but it can cause size limitations when datasets get very large." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* With that being said, you are likely not going to run into space limitations anytime soon." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "* Python memory is session-specific, so quitting Python (e.g. shutting down JupyterLab) removes the data from memory." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### General Framework" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "A general way to conceptualize how data is imported into and used within Python:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "1. Data sits on the computer/server - this is frequently called \"disk\"\n", "2. Python code can be used to copy a data file from disk to the Python session's memory\n", "3. Python data then sits within Python's memory ready to be used by other Python code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here is a visualization of this process:\n", "\n", "
<center>\n", "<img src=\"images/import-framework.png\" alt=\"import-framework.png\">\n", "</center>
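" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As a small illustration of steps 1 and 2, we can ask the operating system about a data file sitting on disk before anything is copied into memory. This is a minimal sketch using Python's built-in `os` module and the `../data/planes.csv` file imported below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import os\n", "\n", "# The file exists on disk, but nothing has been loaded into Python's memory yet\n", "os.path.getsize('../data/planes.csv')  # size of the file on disk, in bytes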
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Importing Tabular Data\n", "\n", "For much of data science, tabular data -- again, think 2-dimensional datasets -- is the most common format of data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing pandas\n", "\n", "This data format can be imported into Python using the pandas library. We can load pandas with the below code:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
<div class=\"admonition note alert alert-info\">\n", "    <b><p class=\"first admonition-title\" style=\"font-weight: bold\">Note</p></b>\n", "    <p>Recall that the pandas library is the primary library for representing and working with tabular data in Python.</p>\n", "</div>
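" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As a quick, optional sanity check, we can confirm the import worked by asking pandas for its version (the exact number will vary by environment):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "pd.__version__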
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing Tabular Data with Pandas\n", "\n", "pandas is preferred because it imports the data directly into a DataFrame -- the data structure of choice for tabular data in Python." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `read_csv` function is used to import a tabular data file, a CSV, into a DataFrame:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "planes = pd.read_csv('../data/planes.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "And recall we can visualize the first few rows of our DataFrame using the `head()` method:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<style scoped>\n",
"    .dataframe tbody tr th:only-of-type {\n",
"        vertical-align: middle;\n",
"    }\n",
"\n",
"    .dataframe tbody tr th {\n",
"        vertical-align: top;\n",
"    }\n",
"\n",
"    .dataframe thead th {\n",
"        text-align: right;\n",
"    }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>tailnum</th>\n",
"      <th>year</th>\n",
"      <th>type</th>\n",
"      <th>manufacturer</th>\n",
"      <th>model</th>\n",
"      <th>engines</th>\n",
"      <th>seats</th>\n",
"      <th>speed</th>\n",
"      <th>engine</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>N10156</td>\n",
"      <td>2004.0</td>\n",
"      <td>Fixed wing multi engine</td>\n",
"      <td>EMBRAER</td>\n",
"      <td>EMB-145XR</td>\n",
"      <td>2</td>\n",
"      <td>55</td>\n",
"      <td>NaN</td>\n",
"      <td>Turbo-fan</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>N102UW</td>\n",
"      <td>1998.0</td>\n",
"      <td>Fixed wing multi engine</td>\n",
"      <td>AIRBUS INDUSTRIE</td>\n",
"      <td>A320-214</td>\n",
"      <td>2</td>\n",
"      <td>182</td>\n",
"      <td>NaN</td>\n",
"      <td>Turbo-fan</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>N103US</td>\n",
"      <td>1999.0</td>\n",
"      <td>Fixed wing multi engine</td>\n",
"      <td>AIRBUS INDUSTRIE</td>\n",
"      <td>A320-214</td>\n",
"      <td>2</td>\n",
"      <td>182</td>\n",
"      <td>NaN</td>\n",
"      <td>Turbo-fan</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>N104UW</td>\n",
"      <td>1999.0</td>\n",
"      <td>Fixed wing multi engine</td>\n",
"      <td>AIRBUS INDUSTRIE</td>\n",
"      <td>A320-214</td>\n",
"      <td>2</td>\n",
"      <td>182</td>\n",
"      <td>NaN</td>\n",
"      <td>Turbo-fan</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>N10575</td>\n",
"      <td>2002.0</td>\n",
"      <td>Fixed wing multi engine</td>\n",
"      <td>EMBRAER</td>\n",
"      <td>EMB-145LR</td>\n",
"      <td>2</td>\n",
"      <td>55</td>\n",
"      <td>NaN</td>\n",
"      <td>Turbo-fan</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " tailnum year type manufacturer model \\\n", "0 N10156 2004.0 Fixed wing multi engine EMBRAER EMB-145XR \n", "1 N102UW 1998.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "2 N103US 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "3 N104UW 1999.0 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 \n", "4 N10575 2002.0 Fixed wing multi engine EMBRAER EMB-145LR \n", "\n", " engines seats speed engine \n", "0 2 55 NaN Turbo-fan \n", "1 2 182 NaN Turbo-fan \n", "2 2 182 NaN Turbo-fan \n", "3 2 182 NaN Turbo-fan \n", "4 2 55 NaN Turbo-fan " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "planes.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `read_csv()` function has many parameters for importing data. A few examples:\n", "\n", "* `sep` - the data's delimiter\n", "* `header` - the row number containing the column names (use `header=None` if the file has no header row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Full documentation can be pulled up by running the function name followed by a question mark:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;31mSignature:\u001b[0m\n", "\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0msep\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None | lib.NoDefault'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m<\u001b[0m\u001b[0mno_default\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdelimiter\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None | lib.NoDefault'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mheader\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"int | Sequence[int] | None | Literal['infer']\"\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'infer'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mnames\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'Sequence[Hashable] | None | lib.NoDefault'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m<\u001b[0m\u001b[0mno_default\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mindex_col\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'IndexLabel | Literal[False] | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0musecols\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DtypeArg | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'CSVEngine | None'\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mconverters\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mtrue_values\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mfalse_values\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mskipinitialspace\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mskiprows\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mskipfooter\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mnrows\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mna_values\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mkeep_default_na\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mna_filter\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mverbose\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mskip_blank_lines\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mparse_dates\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool | Sequence[Hashable] | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0minfer_datetime_format\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool | lib.NoDefault'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m<\u001b[0m\u001b[0mno_default\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mkeep_date_col\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdate_parser\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m<\u001b[0m\u001b[0mno_default\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdate_format\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdayfirst\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m 
\u001b[0mcache_dates\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mchunksize\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mcompression\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'CompressionOptions'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'infer'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mthousands\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdecimal\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'.'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mlineterminator\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mquotechar\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'\"'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mquoting\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'int'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdoublequote\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mescapechar\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mcomment\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mencoding\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mencoding_errors\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'strict'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdialect\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str | csv.Dialect | None'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mon_bad_lines\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'str'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'error'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdelim_whitespace\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m 
\u001b[0mlow_memory\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mmemory_map\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'bool'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mfloat_precision\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m\"Literal['high', 'legacy'] | None\"\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mstorage_options\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'StorageOptions'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdtype_backend\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'DtypeBackend | lib.NoDefault'\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m<\u001b[0m\u001b[0mno_default\u001b[0m\u001b[0;34m>\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;34m'DataFrame | TextFileReader'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Read a comma-separated values (csv) file into DataFrame.\n", "\n", "Also supports optionally iterating or breaking of the file\n", "into chunks.\n", "\n", "Additional help can be found in the online docs for\n", "`IO Tools `_.\n", "\n", "Parameters\n", "----------\n", "filepath_or_buffer : str, path object or file-like object\n", " Any valid string path is acceptable. The string could be a URL. Valid\n", " URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is\n", " expected. A local file could be: file://localhost/path/to/table.csv.\n", "\n", " If you want to pass in a path object, pandas accepts any ``os.PathLike``.\n", "\n", " By file-like object, we refer to objects with a ``read()`` method, such as\n", " a file handle (e.g. via builtin ``open`` function) or ``StringIO``.\n", "sep : str, default ','\n", " Delimiter to use. If sep is None, the C engine cannot automatically detect\n", " the separator, but the Python parsing engine can, meaning the latter will\n", " be used and automatically detect the separator by Python's builtin sniffer\n", " tool, ``csv.Sniffer``. In addition, separators longer than 1 character and\n", " different from ``'\\s+'`` will be interpreted as regular expressions and\n", " will also force the use of the Python parsing engine. Note that regex\n", " delimiters are prone to ignoring quoted data. Regex example: ``'\\r\\t'``.\n", "delimiter : str, default ``None``\n", " Alias for sep.\n", "header : int, list of int, None, default 'infer'\n", " Row number(s) to use as the column names, and the start of the\n", " data. Default behavior is to infer the column names: if no names\n", " are passed the behavior is identical to ``header=0`` and column\n", " names are inferred from the first line of the file, if column\n", " names are passed explicitly then the behavior is identical to\n", " ``header=None``. Explicitly pass ``header=0`` to be able to\n", " replace existing names. The header can be a list of integers that\n", " specify row locations for a multi-index on the columns\n", " e.g. [0,1,3]. Intervening rows that are not specified will be\n", " skipped (e.g. 2 in this example is skipped). 
Note that this\n", " parameter ignores commented lines and empty lines if\n", " ``skip_blank_lines=True``, so ``header=0`` denotes the first line of\n", " data rather than the first line of the file.\n", "names : array-like, optional\n", " List of column names to use. If the file contains a header row,\n", " then you should explicitly pass ``header=0`` to override the column names.\n", " Duplicates in this list are not allowed.\n", "index_col : int, str, sequence of int / str, or False, optional, default ``None``\n", " Column(s) to use as the row labels of the ``DataFrame``, either given as\n", " string name or column index. If a sequence of int / str is given, a\n", " MultiIndex is used.\n", "\n", " Note: ``index_col=False`` can be used to force pandas to *not* use the first\n", " column as the index, e.g. when you have a malformed file with delimiters at\n", " the end of each line.\n", "usecols : list-like or callable, optional\n", " Return a subset of the columns. If list-like, all elements must either\n", " be positional (i.e. integer indices into the document columns) or strings\n", " that correspond to column names provided either by the user in `names` or\n", " inferred from the document header row(s). If ``names`` are given, the document\n", " header row(s) are not taken into account. For example, a valid list-like\n", " `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.\n", " Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.\n", " To instantiate a DataFrame from ``data`` with element order preserved use\n", " ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns\n", " in ``['foo', 'bar']`` order or\n", " ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``\n", " for ``['bar', 'foo']`` order.\n", "\n", " If callable, the callable function will be evaluated against the column\n", " names, returning names where the callable function evaluates to True. An\n", " example of a valid callable argument would be ``lambda x: x.upper() in\n", " ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster\n", " parsing time and lower memory usage.\n", "dtype : Type name or dict of column -> type, optional\n", " Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32,\n", " 'c': 'Int64'}\n", " Use `str` or `object` together with suitable `na_values` settings\n", " to preserve and not interpret dtype.\n", " If converters are specified, they will be applied INSTEAD\n", " of dtype conversion.\n", "\n", " .. versionadded:: 1.5.0\n", "\n", " Support for defaultdict was added. Specify a defaultdict as input where\n", " the default determines the dtype of the columns which are not explicitly\n", " listed.\n", "engine : {'c', 'python', 'pyarrow'}, optional\n", " Parser engine to use. The C and pyarrow engines are faster, while the python engine\n", " is currently more feature-complete. Multithreading is currently only supported by\n", " the pyarrow engine.\n", "\n", " .. versionadded:: 1.4.0\n", "\n", " The \"pyarrow\" engine was added as an *experimental* engine, and some features\n", " are unsupported, or may not work correctly, with this engine.\n", "converters : dict, optional\n", " Dict of functions for converting values in certain columns. 
Keys can either\n", " be integers or column labels.\n", "true_values : list, optional\n", " Values to consider as True in addition to case-insensitive variants of \"True\".\n", "false_values : list, optional\n", " Values to consider as False in addition to case-insensitive variants of \"False\".\n", "skipinitialspace : bool, default False\n", " Skip spaces after delimiter.\n", "skiprows : list-like, int or callable, optional\n", " Line numbers to skip (0-indexed) or number of lines to skip (int)\n", " at the start of the file.\n", "\n", " If callable, the callable function will be evaluated against the row\n", " indices, returning True if the row should be skipped and False otherwise.\n", " An example of a valid callable argument would be ``lambda x: x in [0, 2]``.\n", "skipfooter : int, default 0\n", " Number of lines at bottom of file to skip (Unsupported with engine='c').\n", "nrows : int, optional\n", " Number of rows of file to read. Useful for reading pieces of large files.\n", "na_values : scalar, str, list-like, or dict, optional\n", " Additional strings to recognize as NA/NaN. If dict passed, specific\n", " per-column NA values. By default the following values are interpreted as\n", " NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',\n", " '1.#IND', '1.#QNAN', '', 'N/A', 'NA', 'NULL', 'NaN', 'None',\n", " 'n/a', 'nan', 'null'.\n", "keep_default_na : bool, default True\n", " Whether or not to include the default NaN values when parsing the data.\n", " Depending on whether `na_values` is passed in, the behavior is as follows:\n", "\n", " * If `keep_default_na` is True, and `na_values` are specified, `na_values`\n", " is appended to the default NaN values used for parsing.\n", " * If `keep_default_na` is True, and `na_values` are not specified, only\n", " the default NaN values are used for parsing.\n", " * If `keep_default_na` is False, and `na_values` are specified, only\n", " the NaN values specified `na_values` are used for parsing.\n", " * If `keep_default_na` is False, and `na_values` are not specified, no\n", " strings will be parsed as NaN.\n", "\n", " Note that if `na_filter` is passed in as False, the `keep_default_na` and\n", " `na_values` parameters will be ignored.\n", "na_filter : bool, default True\n", " Detect missing value markers (empty strings and the value of na_values). In\n", " data without any NAs, passing na_filter=False can improve the performance\n", " of reading a large file.\n", "verbose : bool, default False\n", " Indicate number of NA values placed in non-numeric columns.\n", "skip_blank_lines : bool, default True\n", " If True, skip over blank lines rather than interpreting as NaN values.\n", "parse_dates : bool or list of int or names or list of lists or dict, default False\n", " The behavior is as follows:\n", "\n", " * boolean. If True -> try parsing the index.\n", " * list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3\n", " each as a separate date column.\n", " * list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as\n", " a single date column.\n", " * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call\n", " result 'foo'\n", "\n", " If a column or index cannot be represented as an array of datetimes,\n", " say because of an unparsable value or a mixture of timezones, the column\n", " or index will be returned unaltered as an object data type. 
For\n", " non-standard datetime parsing, use ``pd.to_datetime`` after\n", " ``pd.read_csv``.\n", "\n", " Note: A fast-path exists for iso8601-formatted dates.\n", "infer_datetime_format : bool, default False\n", " If True and `parse_dates` is enabled, pandas will attempt to infer the\n", " format of the datetime strings in the columns, and if it can be inferred,\n", " switch to a faster method of parsing them. In some cases this can increase\n", " the parsing speed by 5-10x.\n", "\n", " .. deprecated:: 2.0.0\n", " A strict version of this argument is now the default, passing it has no effect.\n", "\n", "keep_date_col : bool, default False\n", " If True and `parse_dates` specifies combining multiple columns then\n", " keep the original columns.\n", "date_parser : function, optional\n", " Function to use for converting a sequence of string columns to an array of\n", " datetime instances. The default uses ``dateutil.parser.parser`` to do the\n", " conversion. Pandas will try to call `date_parser` in three different ways,\n", " advancing to the next if an exception occurs: 1) Pass one or more arrays\n", " (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the\n", " string values from the columns defined by `parse_dates` into a single array\n", " and pass that; and 3) call `date_parser` once for each row using one or\n", " more strings (corresponding to the columns defined by `parse_dates`) as\n", " arguments.\n", "\n", " .. deprecated:: 2.0.0\n", " Use ``date_format`` instead, or read in as ``object`` and then apply\n", " :func:`to_datetime` as-needed.\n", "date_format : str or dict of column -> format, default ``None``\n", " If used in conjunction with ``parse_dates``, will parse dates according to this\n", " format. For anything more complex,\n", " please read in as ``object`` and then apply :func:`to_datetime` as-needed.\n", "\n", " .. versionadded:: 2.0.0\n", "dayfirst : bool, default False\n", " DD/MM format dates, international and European format.\n", "cache_dates : bool, default True\n", " If True, use a cache of unique, converted dates to apply the datetime\n", " conversion. May produce significant speed-up when parsing duplicate\n", " date strings, especially ones with timezone offsets.\n", "\n", "iterator : bool, default False\n", " Return TextFileReader object for iteration or getting chunks with\n", " ``get_chunk()``.\n", "\n", " .. versionchanged:: 1.2\n", "\n", " ``TextFileReader`` is a context manager.\n", "chunksize : int, optional\n", " Return TextFileReader object for iteration.\n", " See the `IO Tools docs\n", " `_\n", " for more information on ``iterator`` and ``chunksize``.\n", "\n", " .. versionchanged:: 1.2\n", "\n", " ``TextFileReader`` is a context manager.\n", "compression : str or dict, default 'infer'\n", " For on-the-fly decompression of on-disk data. 
If 'infer' and 'filepath_or_buffer' is\n", " path-like, then detect compression from the following extensions: '.gz',\n", " '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2'\n", " (otherwise no compression).\n", " If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in.\n", " Set to ``None`` for no decompression.\n", " Can also be a dict with key ``'method'`` set\n", " to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``, ``'tar'``} and other\n", " key-value pairs are forwarded to\n", " ``zipfile.ZipFile``, ``gzip.GzipFile``,\n", " ``bz2.BZ2File``, ``zstandard.ZstdDecompressor`` or\n", " ``tarfile.TarFile``, respectively.\n", " As an example, the following could be passed for Zstandard decompression using a\n", " custom compression dictionary:\n", " ``compression={'method': 'zstd', 'dict_data': my_compression_dict}``.\n", "\n", " .. versionadded:: 1.5.0\n", " Added support for `.tar` files.\n", "\n", " .. versionchanged:: 1.4.0 Zstandard support.\n", "\n", "thousands : str, optional\n", " Thousands separator.\n", "decimal : str, default '.'\n", " Character to recognize as decimal point (e.g. use ',' for European data).\n", "lineterminator : str (length 1), optional\n", " Character to break file into lines. Only valid with C parser.\n", "quotechar : str (length 1), optional\n", " The character used to denote the start and end of a quoted item. Quoted\n", " items can include the delimiter and it will be ignored.\n", "quoting : int or csv.QUOTE_* instance, default 0\n", " Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of\n", " QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).\n", "doublequote : bool, default ``True``\n", " When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate\n", " whether or not to interpret two consecutive quotechar elements INSIDE a\n", " field as a single ``quotechar`` element.\n", "escapechar : str (length 1), optional\n", " One-character string used to escape other characters.\n", "comment : str, optional\n", " Indicates remainder of line should not be parsed. If found at the beginning\n", " of a line, the line will be ignored altogether. This parameter must be a\n", " single character. Like empty lines (as long as ``skip_blank_lines=True``),\n", " fully commented lines are ignored by the parameter `header` but not by\n", " `skiprows`. For example, if ``comment='#'``, parsing\n", " ``#empty\\na,b,c\\n1,2,3`` with ``header=0`` will result in 'a,b,c' being\n", " treated as the header.\n", "encoding : str, optional, default \"utf-8\"\n", " Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python\n", " standard encodings\n", " `_ .\n", "\n", " .. versionchanged:: 1.2\n", "\n", " When ``encoding`` is ``None``, ``errors=\"replace\"`` is passed to\n", " ``open()``. Otherwise, ``errors=\"strict\"`` is passed to ``open()``.\n", " This behavior was previously only the case for ``engine=\"python\"``.\n", "\n", " .. versionchanged:: 1.3.0\n", "\n", " ``encoding_errors`` is a new argument. ``encoding`` has no longer an\n", " influence on how encoding errors are handled.\n", "\n", "encoding_errors : str, optional, default \"strict\"\n", " How encoding errors are treated. `List of possible values\n", " `_ .\n", "\n", " .. 
versionadded:: 1.3.0\n", "\n", "dialect : str or csv.Dialect, optional\n", " If provided, this parameter will override values (default or not) for the\n", " following parameters: `delimiter`, `doublequote`, `escapechar`,\n", " `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to\n", " override values, a ParserWarning will be issued. See csv.Dialect\n", " documentation for more details.\n", "on_bad_lines : {'error', 'warn', 'skip'} or callable, default 'error'\n", " Specifies what to do upon encountering a bad line (a line with too many fields).\n", " Allowed values are :\n", "\n", " - 'error', raise an Exception when a bad line is encountered.\n", " - 'warn', raise a warning when a bad line is encountered and skip that line.\n", " - 'skip', skip bad lines without raising or warning when they are encountered.\n", "\n", " .. versionadded:: 1.3.0\n", "\n", " .. versionadded:: 1.4.0\n", "\n", " - callable, function with signature\n", " ``(bad_line: list[str]) -> list[str] | None`` that will process a single\n", " bad line. ``bad_line`` is a list of strings split by the ``sep``.\n", " If the function returns ``None``, the bad line will be ignored.\n", " If the function returns a new list of strings with more elements than\n", " expected, a ``ParserWarning`` will be emitted while dropping extra elements.\n", " Only supported when ``engine=\"python\"``\n", "\n", "delim_whitespace : bool, default False\n", " Specifies whether or not whitespace (e.g. ``' '`` or ``' '``) will be\n", " used as the sep. Equivalent to setting ``sep='\\s+'``. If this option\n", " is set to True, nothing should be passed in for the ``delimiter``\n", " parameter.\n", "low_memory : bool, default True\n", " Internally process the file in chunks, resulting in lower memory use\n", " while parsing, but possibly mixed type inference. To ensure no mixed\n", " types either set False, or specify the type with the `dtype` parameter.\n", " Note that the entire file is read into a single DataFrame regardless,\n", " use the `chunksize` or `iterator` parameter to return the data in chunks.\n", " (Only valid with C parser).\n", "memory_map : bool, default False\n", " If a filepath is provided for `filepath_or_buffer`, map the file object\n", " directly onto memory and access the data directly from there. Using this\n", " option can improve performance because there is no longer any I/O overhead.\n", "float_precision : str, optional\n", " Specifies which converter the C engine should use for floating-point\n", " values. The options are ``None`` or 'high' for the ordinary converter,\n", " 'legacy' for the original lower precision pandas converter, and\n", " 'round_trip' for the round-trip converter.\n", "\n", " .. versionchanged:: 1.2\n", "\n", "storage_options : dict, optional\n", " Extra options that make sense for a particular storage connection, e.g.\n", " host, port, username, password, etc. For HTTP(S) URLs the key-value pairs\n", " are forwarded to ``urllib.request.Request`` as header options. For other\n", " URLs (e.g. starting with \"s3://\", and \"gcs://\") the key-value pairs are\n", " forwarded to ``fsspec.open``. Please see ``fsspec`` and ``urllib`` for more\n", " details, and for more examples on storage options refer `here\n", " `_.\n", "\n", " .. versionadded:: 1.2\n", "\n", "dtype_backend : {\"numpy_nullable\", \"pyarrow\"}, defaults to NumPy backed DataFrames\n", " Which dtype_backend to use, e.g. 
whether a DataFrame should have NumPy\n", " arrays, nullable dtypes are used for all dtypes that have a nullable\n", " implementation when \"numpy_nullable\" is set, pyarrow is used for all\n", " dtypes if \"pyarrow\" is set.\n", "\n", " The dtype_backends are still experimential.\n", "\n", " .. versionadded:: 2.0\n", "\n", "Returns\n", "-------\n", "DataFrame or TextFileReader\n", " A comma-separated values (csv) file is returned as two-dimensional\n", " data structure with labeled axes.\n", "\n", "See Also\n", "--------\n", "DataFrame.to_csv : Write DataFrame to a comma-separated values (csv) file.\n", "read_csv : Read a comma-separated values (csv) file into DataFrame.\n", "read_fwf : Read a table of fixed-width formatted lines into DataFrame.\n", "\n", "Examples\n", "--------\n", ">>> pd.read_csv('data.csv')  # doctest: +SKIP\n", "\u001b[0;31mFile:\u001b[0m /usr/local/anaconda3/envs/uc-python/lib/python3.11/site-packages/pandas/io/parsers/readers.py\n", "\u001b[0;31mType:\u001b[0m function" ] } ], "source": [ "pd.read_csv?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Your Turn\n", "\n", "1. Python stores its data in ____________. \n", "2. What happens to Python's data when the Python session is terminated?\n", "3. Load the `../data/flights.csv` data file into Python using the `pandas` library." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Importing Other Files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* While tabular data is the most popular in data science, other types of data are used as well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* These are *not* as important as the pandas DataFrame, but it *is* good to be exposed to them." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* These additional data formats are going to be more common in a general-purpose programming language like Python." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### JSON Files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A common example is a [JSON](https://en.wikipedia.org/wiki/JSON) file -- these are non-tabular data files that are popular in data engineering due to their space efficiency and flexibility." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here is an example JSON file:\n", "\n", "```json\n", "{\n", "    \"planeId\": \"1xc2345g\",\n", "    \"manufacturerDetails\": {\n", "        \"manufacturer\": \"Airbus\",\n", "        \"model\": \"A330\",\n", "        \"year\": 1999\n", "    },\n", "    \"airlineDetails\": {\n", "        \"currentAirline\": \"Southwest\",\n", "        \"previousAirlines\": {\n", "            \"1st\": \"Delta\"\n", "        },\n", "        \"lastPurchased\": 2013\n", "    },\n", "    \"numberOfFlights\": 4654\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
<div class=\"admonition question alert alert-warning\">\n", "    <b><p class=\"first admonition-title\" style=\"font-weight: bold\">Question</p></b>\n", "    <p>Does this JSON data structure remind you of a Python data structure?</p>\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The JSON file bears a striking resemblance to the Python `dict` structure due to the key-value pairings." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing JSON Files\n", "\n", "JSON files can be imported using the `json` library paired with the `with` statement and the `open()` function:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import json\n", "\n", "with open('../data/json_example.json', 'r') as f:\n", "    imported_json = json.load(f)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can then verify that `imported_json` is a `dict`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(imported_json)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "And we can view the data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'planeId': '1xc2345g',\n", " 'manufacturerDetails': {'manufacturer': 'Airbus',\n", " 'model': 'A330',\n", " 'year': 1999},\n", " 'airlineDetails': {'currentAirline': 'Southwest',\n", " 'previousAirlines': {'1st': 'Delta'},\n", " 'lastPurchased': 2013},\n", " 'numberOfFlights': 4654}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imported_json" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Pickle Files" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So far, we've seen that tabular data files can be imported and represented as DataFrames and JSON files can be imported and represented as dicts, but what about other, more complex data?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python's native data files are known as **Pickle** files:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pickle files conventionally use the `.pickle` (or `.pkl`) extension" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pickle files are great for saving native Python data that can't easily be represented by other file types such as:\n", "    * pre-processed data,\n", "    * models,\n", "    * any other Python object... (see the sketch below)"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing Pickle Files\n", "\n", "Pickle files can be imported using the `pickle` library paired with the `with` statement and the `open()` function:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pickle\n", "\n", "with open('../data/pickle_example.pickle', 'rb') as f:\n", "    imported_pickle = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can view this file and see it contains the same data as the JSON example:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "{'planeId': '1xc2345g',\n", " 'manufacturerDetails': {'manufacturer': 'Airbus',\n", " 'model': 'A330',\n", " 'year': 1999},\n", " 'airlineDetails': {'currentAirline': 'Southwest',\n", " 'previousAirlines': {'1st': 'Delta'},\n", " 'lastPurchased': 2013},\n", " 'numberOfFlights': 4654}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imported_pickle" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And that it was loaded directly as a `dict`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(imported_pickle)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Questions\n", "\n", "Are there any questions before we move on?" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }