# Data Persistence

We can get data, we can twist data, we can visualise data, but how do we effectively store and share data? 

Rudimentary knowledge of data storage and data formats is a major part of the Data Science ecosystem. 

I'm very biased, YMMV.

## Starting at the start: Why are there different formats?

* Compatibility with older/different systems
* Different performance priorities
* Different levels of 'human readability'
* Religion.

## Data IO

We've been playing with `pandas` for a while, reading data with `read_csv`, and the eagle-eyed may have noticed a matching `to_csv` as well. But CSV is a woefully inadequate (if 'simple') format, especially for numerical data: everything is stored as text, so column types are not recorded in the file.
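
As a small illustration of that inadequacy, the sketch below round-trips a tiny frame through CSV in memory and compares the dtypes. The frame `typed` and its column names are made up for this example; only `to_csv` and `read_csv` come from the text above.

```python
import io

import pandas as pd

# Hypothetical three-row frame with a datetime, a categorical and a float column
typed = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "kind": pd.Categorical(["a", "b", "a"]),
    "value": [1.0, 2.5, 4.0],
})

# Write to an in-memory CSV and read it straight back with default settings
buffer = io.StringIO()
typed.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The datetime and categorical columns no longer carry their original dtypes
print(typed.dtypes)
print(restored.dtypes)
```

You can recover the types by passing `parse_dates` and `dtype` to `read_csv`, but then the schema lives in your code rather than in the file.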

`pandas` supports a huge range of IO capabilities straight out of the box, but now that we're going a little lower level, let's just make up some data and see how different formats perform:




In [None]:
import pandas as pd
import numpy as np
import string
import random
from pathlib import Path


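def get_random_string(length):
    """Return `length` random lowercase ASCII letters (repeats allowed)."""
    letters = string.ascii_lowercase
    # random.choices samples with replacement, so length can exceed 26
    result_str = ''.join(random.choices(letters, k=length))
    return result_str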

def get_random_unicode(length):
    """Shamelessly stolen from https://stackoverflow.com/a/21666621/252556"""
    try:
        get_char = unichr  # Python 2
    except NameError:
        get_char = chr  # Python 3

    # Update this to include code point ranges to be sampled
    include_ranges = [
        (0x0021, 0x0021),
        (0x0023, 0x0026),
        (0x0028, 0x007E),
        (0x00A1, 0x00AC),
        (0x00AE, 0x00FF),
        (0x0100, 0x017F),
        (0x0180, 0x024F),
        (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0),
        (0x0370, 0x0377),
        (0x037A, 0x037E),
        (0x0384, 0x038A),
        (0x038C, 0x038C),
    ]

    # Flatten the inclusive ranges into one list of candidate characters
    alphabet = [
        get_char(code_point) for current_range in include_ranges
        for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))


size = int(1e6)
cats = [get_random_string(12) for _ in range(4)]
df = pd.DataFrame({'randn': np.random.randint(0, 100, size=size),  # ints
                   'randnorm': np.random.normal(size=size),  # floats
                   'randstr': [get_random_string(8) for _ in range(size)],  # strs
                   'randutf': [get_random_unicode(8) for _ in range(size)],  # unicode
                   'randcat': random.choices(cats, k=size)  # potential categories
                   })
csv_path = Path('data/stress.csv')
csv_path.parent.mkdir(parents=True, exist_ok=True)  # make sure data/ exists

df.to_csv(csv_path, index=False)

In [None]:
df.head()

## Challenge: 

Check out the [`pandas` IO Tools documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), pick 4 data formats, and evaluate them on these characteristics:

1) Data Stability: Is the result of reading it the same as what you put in?

2) Compression Size: How much smaller is the resultant file compared to `data/stress.csv`?

3) Decompression Speed: How quickly can you perform operations on the data you read?

This should take no more than 10 minutes (less if you read ahead a bit...)

(Bonus, try different numbers for `size`)
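
If you want a starting point, one way to structure the measurements is a small helper that writes a frame, times the read, and reports all three characteristics. This is only a sketch: `evaluate_format`, the `sample` frame and the temporary file names are invented for illustration, and CSV and pickle are used simply because they need nothing beyond `pandas` itself.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def evaluate_format(df, path, writer, reader):
    """Write df with `writer`, read it back with `reader`, report the three criteria."""
    writer(df, path)
    start = time.perf_counter()
    restored = reader(path)
    read_seconds = time.perf_counter() - start
    return {
        "stable": df.equals(restored),              # same values and dtypes?
        "size_mb": path.stat().st_size / 1024**2,   # file size on disk
        "read_seconds": read_seconds,               # time to load back
    }


# Small stand-in for the stress-test frame
sample = pd.DataFrame({
    "ints": np.arange(1000),
    "floats": np.random.normal(size=1000),
    "cats": pd.Categorical(np.random.choice(["a", "b", "c"], size=1000)),
})

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    results = {
        "csv": evaluate_format(sample, tmp / "sample.csv",
                               lambda d, p: d.to_csv(p, index=False), pd.read_csv),
        "pickle": evaluate_format(sample, tmp / "sample.pkl",
                                  lambda d, p: d.to_pickle(p), pd.read_pickle),
    }

print(pd.DataFrame(results).T)
```

Swap in other writer/reader pairs from the IO Tools page to fill out your four formats. Note that a single timed read is noisy, so repeat it a few times before drawing conclusions.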

## Apache Arrow

Cross-language in-memory data sharing format and interface protocol.

(i.e. "You don't have to convert everything to JSON for inter-process communication")

![](img/arrow_mem.png)

* Co-created by Wes McKinney, the original author of `pandas`, so they play well together. 
* Column Based Format
* Binary protocol and serialisation functions
* Memory Mapping and zero-copy reads (Bigger-than-RAM operation)
* Includes type information and metadata
* Lossless compression
* SQL-style querying engine (Column oriented!)

`pyarrow` is directly supported for use with `to_parquet`


In [None]:
pq_path = Path('data/stress.pa.pq')
df.to_parquet(pq_path, engine='pyarrow')

In [None]:
pq_path.stat().st_size/1024**2 #MB

In [None]:
csv_path.stat().st_size/1024**2 #MB

## Question 3

(No notebook this time, answers in the [Miro Board](https://miro.com/app/board/o9J_kj1tCRo=/))

So far we've only dealt with non-timeseries data. 

Can you find an example dataset that has a timeseries component and convert it to a `pyarrow` parquet format?

# Learning Outcomes

In this section we got a whistle-stop tour of `pandas.io` and all the formats you can play with, but *I strongly recommend Parquet with `pyarrow` as your default, unless you have a good reason to use something else.*