# Importing Data

> There is no data science without data.
>
> \- A wise person

## Applied Review

### Fundamentals and Data in Python

* Python stores its data in **variables** - words that you choose to represent values you've stored
* This is done using **assignment** - you assign data to a variable

### Packages/Modules and Data in Python

* Data is frequently represented inside a **DataFrame** - a class from the pandas library
* The pandas library has **methods** for importing different types of files into DataFrames - operations that import data

## General Model for Importing Data

### Memory and Size

* Python stores its data in **memory** - this makes the data relatively quick to access, but it limits the size of the data you can work with to what fits in memory.

* With that being said, you are likely not going to run into space limitations anytime soon.

* Python memory is session-specific, so quitting Python (e.g. shutting down JupyterLab) removes the data from memory.
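To get a rough sense of how much memory a dataset occupies, pandas can report it directly. The sketch below is a minimal illustration, assuming a small made-up DataFrame (the name `df` and its values are hypothetical, not part of the course data):

```python
import pandas as pd

# Hypothetical small DataFrame standing in for a real dataset
df = pd.DataFrame({"tailnum": ["N10156", "N102UW"], "seats": [55, 182]})

# memory_usage(deep=True) reports bytes per column, including string contents
bytes_per_column = df.memory_usage(deep=True)
total_bytes = bytes_per_column.sum()

print(bytes_per_column)
print(f"Total: {total_bytes} bytes")
```

Comparing this total against the memory available on your machine gives a quick check on whether a dataset is anywhere near the limit.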

### General Framework

A general way to conceptualize data import into and use within Python:

1. Data sits on the computer/server - this is frequently called "disk"
2. Python code can be used to copy a data file from disk to the Python session's memory
3. Python data then sits within Python's memory ready to be used by other Python code

Here is a visualization of this process:


<center>
<img src="images/import-framework.png" alt="import-framework.png" width="1000" height="1000">
</center>
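The three steps of the framework can also be sketched in code. This is only an illustration using Python's standard library; the temporary file `example.csv` is a hypothetical stand-in for a real data file:

```python
import os
import tempfile

# Step 1: data sits on disk (a temporary file stands in for a real data file)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    with open(path, "w") as f:
        f.write("tailnum,year\nN10156,2004\n")

    # Step 2: Python code copies the file's contents from disk into memory
    with open(path, "r") as f:
        contents = f.read()

# The temporary directory (the copy on disk) has now been deleted...
file_still_exists = os.path.exists(path)

# Step 3: ...but the copy in memory is still available to other Python code
lines = contents.splitlines()

print(file_still_exists)  # False
print(lines)              # ['tailnum,year', 'N10156,2004']
```

Once the data has been read, the in-memory copy is independent of the file on disk.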

## Importing Tabular Data

For much of data science, tabular data -- again, think 2-dimensional datasets -- is the most common format of data.

### Importing pandas

This data format can be imported into Python using the pandas library. We can load pandas with the below code:

In [1]:
import pandas as pd

<div class="admonition note alert alert-info">
    <b><p class="first admonition-title" style="font-weight: bold">Note</p></b>
    <p>Recall that the pandas library is the primary library for representing and working with tabular data in Python.</p>
</div>

### Importing Tabular Data with Pandas

pandas is preferred because it imports the data directly into a DataFrame -- the data structure of choice for tabular data in Python.

The `read_csv` function is used to import a tabular data file, a CSV, into a DataFrame:

In [2]:
planes = pd.read_csv('../data/planes.csv')

And recall we can view the first few rows of our DataFrame using the `head()` method:

In [3]:
planes.head()

| Unnamed: 0 | tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | | Turbo-fan |
| 1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | | Turbo-fan |
| 4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | | Turbo-fan |
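An `Unnamed: 0` column like the one above typically appears when a CSV was saved with its row index as an unlabeled first column. Assuming that is the case here, the `index_col` parameter of `read_csv()` can treat that column as the index instead. The sketch below uses a small inline string (hypothetical data, not the real `planes.csv`) so it runs on its own:

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has no name in the header row
raw = ",tailnum,year\n0,N10156,2004.0\n1,N102UW,1998.0\n"

# Default behavior: the unlabeled column is imported as 'Unnamed: 0'
default_import = pd.read_csv(io.StringIO(raw))
print(list(default_import.columns))  # ['Unnamed: 0', 'tailnum', 'year']

# index_col=0 uses the first column as the row index instead
indexed_import = pd.read_csv(io.StringIO(raw), index_col=0)
print(list(indexed_import.columns))  # ['tailnum', 'year']
```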


The `read_csv()` function has many parameters for importing data. A few examples:

* `sep` - the data's delimiter
* `header` - the row number containing the column names (`0` is the first row; `None` indicates there is no header)
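As a minimal sketch of these two parameters, the example below reads semicolon-delimited text that has no header row. The inline data and the column names passed to `names` are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data with no header row
raw = "N10156;2004;EMBRAER\nN102UW;1998;AIRBUS INDUSTRIE\n"

planes_small = pd.read_csv(
    io.StringIO(raw),   # a file path would normally go here
    sep=";",            # the delimiter is a semicolon, not a comma
    header=None,        # the first line is data, not column names
    names=["tailnum", "year", "manufacturer"],  # supply column names ourselves
)

print(planes_small)
```

Without `header=None`, pandas would treat the first data row as the column names and the DataFrame would be one row short.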

In Jupyter, full documentation can be pulled up by running the function name followed by a question mark:

In [4]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None'
    ...

### Your Turn

1. Python stores its data in ____________. 
2. What happens to Python's data when the Python session is terminated?
3. Load the `../data/flights.csv` data file into Python using the `pandas` library.

## Importing Other Files

* While tabular data is the most popular in data science, other types of data are used as well.

* These are *not* as important as the pandas DataFrame, but it *is* good to be exposed to them.

* These additional data formats are more common in a general-purpose programming language like Python.

### JSON Files

A common example is a [JSON](https://en.wikipedia.org/wiki/JSON) file -- these are non-tabular data files that are popular in data engineering due to their space efficiency and flexibility.

Here is an example JSON file:

```json
{
    "planeId": "1xc2345g",
    "manufacturerDetails": {
        "manufacturer": "Airbus",
        "model": "A330",
        "year": 1999
    },
    "airlineDetails": {
        "currentAirline": "Southwest",
        "previousAirlines": {
            "1st": "Delta"
        },
        "lastPurchased": 2013
    },
    "numberOfFlights": 4654
}
```

<div class="admonition tip alert alert-warning">
    <b><p class="first admonition-title" style="font-weight: bold">Question</p></b>
    <p>Does this JSON data structure remind you of a Python data structure?</p>
</div>

The JSON file bears a striking resemblance to the Python `dict` structure due to the key-value pairings.

### Importing JSON Files

JSON files can be imported using the `json` library paired with the `with` statement and the `open()` function:

In [5]:
import json

with open('../data/json_example.json', 'r') as f:
    imported_json = json.load(f)

We can then verify that `imported_json` is a `dict`:

In [6]:
type(imported_json)

dict

And we can view the data:

In [7]:
imported_json

{'planeId': '1xc2345g',
 'manufacturerDetails': {'manufacturer': 'Airbus',
  'model': 'A330',
  'year': 1999},
 'airlineDetails': {'currentAirline': 'Southwest',
  'previousAirlines': {'1st': 'Delta'},
  'lastPurchased': 2013},
 'numberOfFlights': 4654}
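Because the imported JSON is an ordinary `dict`, nested values can be reached by chaining keys. The sketch below parses the same content from a string with `json.loads()` so it runs without the data file; the variable names are illustrative only:

```python
import json

# The same content as the example JSON, held in a string so the sketch is self-contained
raw = """
{
    "planeId": "1xc2345g",
    "manufacturerDetails": {"manufacturer": "Airbus", "model": "A330", "year": 1999},
    "airlineDetails": {
        "currentAirline": "Southwest",
        "previousAirlines": {"1st": "Delta"},
        "lastPurchased": 2013
    },
    "numberOfFlights": 4654
}
"""

plane = json.loads(raw)

# Chain keys to reach nested values
model = plane["manufacturerDetails"]["model"]
first_airline = plane["airlineDetails"]["previousAirlines"]["1st"]

# .get() returns a default instead of raising KeyError when a key is missing
second_airline = plane["airlineDetails"]["previousAirlines"].get("2nd", "unknown")

print(model, first_airline, second_airline)  # A330 Delta unknown
```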

### Pickle Files

So far, we've seen that tabular data files can be imported and represented as DataFrames and JSON files can be imported and represented as dicts, but what about other, more complex data?

Python's native data files are known as **Pickle** files:

* Pickle files conventionally use the `.pickle` (or `.pkl`) extension

* Pickle files are great for saving native Python data that can't easily be represented by other file types such as:
  * pre-processed data,
  * models,
  * any other Python object...

### Importing Pickle Files

Pickle files can be imported using the `pickle` library paired with the `with` statement and the `open()` function:

In [8]:
import pickle

with open('../data/pickle_example.pickle', 'rb') as f:
    imported_pickle = pickle.load(f)

We can view this file and see it's the same data as the JSON:

In [9]:
imported_pickle

{'planeId': '1xc2345g',
 'manufacturerDetails': {'manufacturer': 'Airbus',
  'model': 'A330',
  'year': 1999},
 'airlineDetails': {'currentAirline': 'Southwest',
  'previousAirlines': {'1st': 'Delta'},
  'lastPurchased': 2013},
 'numberOfFlights': 4654}

And that it was loaded directly as a `dict`:

In [10]:
type(imported_pickle)

dict
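For context, a Pickle file is created with `pickle.dump()`, the counterpart of `pickle.load()`. The sketch below is a minimal round trip, assuming a made-up object and a temporary file (the `tags` key and the file name `plane.pickle` are hypothetical). It includes a `set`, which JSON cannot represent directly. Note that unpickling can execute arbitrary code, so only load Pickle files from sources you trust.

```python
import os
import pickle
import tempfile

# Hypothetical object; the set under "tags" has no direct JSON equivalent
plane = {
    "planeId": "1xc2345g",
    "numberOfFlights": 4654,
    "tags": {"wide-body", "leased"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "plane.pickle")

    # Write the object to disk in binary mode ('wb')
    with open(path, "wb") as f:
        pickle.dump(plane, f)

    # Read it back in binary mode ('rb')
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == plane)        # True
print(type(restored["tags"]))   # <class 'set'>
```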

## Questions

Are there any questions before we move on?