{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Persistence\n", "\n", "We can get data, we can twist data, we can visualise data, but how do we effectively store and share data? \n", "\n", "Rudimentary knowledge of Data Storage and Data Formats are a major part of the Data Science ecosystem. \n", "\n", "I'm very biased, YMMV.\n", "\n", "## Starting at the start: Why are there different formats\n", "\n", "* Compatibility with older/different systems\n", "* Different performance priorities\n", "* Different levels of 'human readability'\n", "* Religion.\n", "\n", "## Data IO\n", "\n", "We've been playing with `pandas` for a while, reading data with `read_csv`, and the eagle eyed may have noticed a `write_csv` as well, but CSV is a woefully inadequate (if 'simple') format, especially for numerical data.\n", "\n", "`pandas` supports a huge range of IO capabilities straight out of the box, but now that we're going a little lower level, lets just make up some data and see how different formats perform:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import string\n", "import random\n", "from pathlib import Path\n", "\n", "\n", "def get_random_string(length):\n", " letters = string.ascii_lowercase\n", " result_str = ''.join(random.sample(letters,k=length))\n", " return result_str\n", "\n", "def get_random_unicode(length):\n", " \"\"\"shamelessly stolen https://stackoverflow.com/a/21666621/252556\"\"\"\n", " try:\n", " get_char = unichr\n", " except NameError:\n", " get_char = chr\n", "\n", " # Update this to include code point ranges to be sampled\n", " include_ranges = [\n", " ( 0x0021, 0x0021 ),\n", " ( 0x0023, 0x0026 ),\n", " ( 0x0028, 0x007E ),\n", " ( 0x00A1, 0x00AC ),\n", " ( 0x00AE, 0x00FF ),\n", " ( 0x0100, 0x017F ),\n", " ( 0x0180, 0x024F ),\n", " ( 0x2C60, 0x2C7F ),\n", " ( 0x16A0, 0x16F0 ),\n", " ( 0x0370, 0x0377 ),\n", " ( 0x037A, 0x037E ),\n", " ( 0x0384, 0x038A ),\n", " ( 0x038C, 0x038C ),\n", " ]\n", "\n", " alphabet = [\n", " get_char(code_point) for current_range in include_ranges\n", " for code_point in range(current_range[0], current_range[1] + 1)\n", " ]\n", " return ''.join(random.choice(alphabet) for i in range(length))\n", "\n", "\n", "size = int(1e6)\n", "cats = [get_random_string(12) for _ in range(4)]\n", "df = pd.DataFrame({'randn': np.random.randint(0,100, size=size), # ints\n", " 'randnorm': np.random.normal(size=size),# floats\n", " 'randstr': [get_random_string(8) for _ in range(size)], #strs\n", " 'randutf': [get_random_unicode(8) for _ in range(size)], #unicode\n", " 'randcat': random.choices(cats,k=size) # potential categories\n", " })\n", "csv_path = Path('data/stress.csv')\n", "\n", "df.to_csv(csv_path, index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge: \n", "\n", "Check out the [`pandas` IO Tools documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)\n", "Pick 4 Data Formats, and evaluate them on these characteristics:\n", "\n", "1) Data Stability: Is the result of reading it the same as what you put in?\n", "\n", "2) Compression Size: How much smaller is the resultant file compared to `data/stress.csv`\n", "\n", "3) Decompression Speed: How quickly can you perform operations on the data you read?\n", "\n", "This should take no more than 10 minutes (less if you read ahead a bit...)\n", "\n", "(Bonus, try different numbers for `size`)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apache Arrow\n", "\n", "Cross-language in-memory data sharing format and interface protocol.\n", "\n", "(i.e \"You don't have to convert everyting to json for inter-process communication\")\n", "\n", "![](img/arrow_mem.png)\n", "\n", "* Originally developed by Wes McKinney, the author of `pandas`, so they play well together. \n", "* Column Based Format\n", "* Binary protocol and serialisation functions\n", "* Memory Mapping and zero-copy reads (Bigger-than-RAM operation)\n", "* Includes type information and metadata\n", "* Lossless compression\n", "* SQL-style querying engine (Column oriented!)\n", "\n", "`pyarrow` is directly supported for use with `to_parquet`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pq_path = Path('data/stress.pa.pq')\n", "df.to_parquet(pq_path, engine='pyarrow')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pq_path.stat().st_size/1024**2 #MB" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_path.stat().st_size/1024**2 #MB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3\n", "\n", "(No notebook this time, answers in the [Miro Board](https://miro.com/app/board/o9J_kj1tCRo=/))\n", "\n", "So far we've only dealt with non-timeseries data. \n", "\n", "Can you find an example dataset that has a timeseries component and convert it to a `pyarrow` parquet format?\n", "\n", "# Learning Outcomes\n", "\n", "In this section we got a whistle stop tour of `pandas.io` and all the formats you can play with, but *I strongly recommend that unless you have a good reason not to, Parquet with `pyarrow` is your best bet*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }