{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Complex and Nested Data\n", "In this notebook, I wanna try to go through a similar process as this Spark notebook on processing complex and nested data https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "data = [\n", " (0, \"\"\"{\"device_id\": 0, \"device_type\": \"sensor-ipad\", \"ip\": \"68.161.225.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 25, \"signal\": 23, \"battery_level\": 8, \"c02_level\": 917, \"timestamp\" :1475600496 }\"\"\"),\n", " (1, \"\"\"{\"device_id\": 1, \"device_type\": \"sensor-igauge\", \"ip\": \"213.161.254.1\", \"cca3\": \"NOR\", \"cn\": \"Norway\", \"temp\": 30, \"signal\": 18, \"battery_level\": 6, \"c02_level\": 1413, \"timestamp\" :1475600498 }\"\"\"),\n", " (2, \"\"\"{\"device_id\": 2, \"device_type\": \"sensor-ipad\", \"ip\": \"88.36.5.1\", \"cca3\": \"ITA\", \"cn\": \"Italy\", \"temp\": 18, \"signal\": 25, \"battery_level\": 5, \"c02_level\": 1372, \"timestamp\" :1475600500 }\"\"\"),\n", " (3, \"\"\"{\"device_id\": 3, \"device_type\": \"sensor-inest\", \"ip\": \"66.39.173.154\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 47, \"signal\": 12, \"battery_level\": 1, \"c02_level\": 1447, \"timestamp\" :1475600502 }\"\"\"),\n", "(4, \"\"\"{\"device_id\": 4, \"device_type\": \"sensor-ipad\", \"ip\": \"203.82.41.9\", \"cca3\": \"PHL\", \"cn\": \"Philippines\", \"temp\": 29, \"signal\": 11, \"battery_level\": 0, \"c02_level\": 983, \"timestamp\" :1475600504 }\"\"\"),\n", "(5, \"\"\"{\"device_id\": 5, \"device_type\": \"sensor-istick\", \"ip\": \"204.116.105.67\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 50, \"signal\": 16, \"battery_level\": 8, \"c02_level\": 1574, \"timestamp\" :1475600506 }\"\"\"),\n", "(6, \"\"\"{\"device_id\": 6, \"device_type\": \"sensor-ipad\", \"ip\": \"220.173.179.1\", \"cca3\": \"CHN\", \"cn\": \"China\", \"temp\": 21, \"signal\": 18, \"battery_level\": 9, \"c02_level\": 1249, \"timestamp\" :1475600508 }\"\"\"),\n", "(7, \"\"\"{\"device_id\": 7, \"device_type\": \"sensor-ipad\", \"ip\": \"118.23.68.227\", \"cca3\": \"JPN\", \"cn\": \"Japan\", \"temp\": 27, \"signal\": 15, \"battery_level\": 0, \"c02_level\": 1531, \"timestamp\" :1475600512 }\"\"\"),\n", "(8 ,\"\"\" {\"device_id\": 8, \"device_type\": \"sensor-inest\", \"ip\": \"208.109.163.218\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 40, \"signal\": 16, \"battery_level\": 9, \"c02_level\": 1208, \"timestamp\" :1475600514 }\"\"\"),\n", "(9,\"\"\"{\"device_id\": 9, \"device_type\": \"sensor-ipad\", \"ip\": \"88.213.191.34\", \"cca3\": \"ITA\", \"cn\": \"Italy\", \"temp\": 19, \"signal\": 11, \"battery_level\": 0, \"c02_level\": 1171, \"timestamp\" :1475600516 }\"\"\"),\n", "(10,\"\"\"{\"device_id\": 10, \"device_type\": \"sensor-igauge\", \"ip\": \"68.28.91.22\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 32, \"signal\": 26, \"battery_level\": 7, \"c02_level\": 886, \"timestamp\" :1475600518 }\"\"\"),\n", "(11,\"\"\"{\"device_id\": 11, \"device_type\": \"sensor-ipad\", \"ip\": \"59.144.114.250\", \"cca3\": \"IND\", \"cn\": \"India\", \"temp\": 46, \"signal\": 25, \"battery_level\": 4, \"c02_level\": 863, \"timestamp\" :1475600520 }\"\"\"),\n", "(12, \"\"\"{\"device_id\": 12, \"device_type\": \"sensor-igauge\", \"ip\": \"193.156.90.200\", \"cca3\": \"NOR\", \"cn\": \"Norway\", \"temp\": 18, \"signal\": 26, \"battery_level\": 8, \"c02_level\": 1220, \"timestamp\" :1475600522 }\"\"\"),\n", "(13, \"\"\"{\"device_id\": 13, \"device_type\": \"sensor-ipad\", \"ip\": \"67.185.72.1\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 34, \"signal\": 20, \"battery_level\": 8, \"c02_level\": 1504, \"timestamp\" :1475600524 }\"\"\"),\n", "(14, \"\"\"{\"device_id\": 14, \"device_type\": \"sensor-inest\", \"ip\": \"68.85.85.106\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 39, \"signal\": 17, \"battery_level\": 8, \"c02_level\": 831, \"timestamp\" :1475600526 }\"\"\"),\n", "(15, \"\"\"{\"device_id\": 15, \"device_type\": \"sensor-ipad\", \"ip\": \"161.188.212.254\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 27, \"signal\": 26, \"battery_level\": 5, \"c02_level\": 1378, \"timestamp\" :1475600528 }\"\"\"),\n", "(16, \"\"\"{\"device_id\": 16, \"device_type\": \"sensor-igauge\", \"ip\": \"221.3.128.242\", \"cca3\": \"CHN\", \"cn\": \"China\", \"temp\": 10, \"signal\": 24, \"battery_level\": 6, \"c02_level\": 1423, \"timestamp\" :1475600530 }\"\"\"),\n", "(17, \"\"\"{\"device_id\": 17, \"device_type\": \"sensor-ipad\", \"ip\": \"64.124.180.215\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 38, \"signal\": 17, \"battery_level\": 9, \"c02_level\": 1304, \"timestamp\" :1475600532 }\"\"\"),\n", "(18, \"\"\"{\"device_id\": 18, \"device_type\": \"sensor-igauge\", \"ip\": \"66.153.162.66\", \"cca3\": \"USA\", \"cn\": \"United States\", \"temp\": 26, \"signal\": 10, \"battery_level\": 0, \"c02_level\": 902, \"timestamp\" :1475600534 }\"\"\"),\n", "(19, \"\"\"{\"device_id\": 19, \"device_type\": \"sensor-ipad\", \"ip\": \"193.200.142.254\", \"cca3\": \"AUT\", \"cn\": \"Austria\", \"temp\": 32, \"signal\": 27, \"battery_level\": 5, \"c02_level\": 1282, \"timestamp\" :1475600536 }\"\"\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's through this in XND" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ndt(\"20 * (int64, string)\")" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import xnd\n", "eventsDS = xnd.xnd(data)\n", "eventsDS.type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good! Let's see if we can make that a struct instead of a tuple." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "xnd: expected dict, not 'tuple'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0meventsDS\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxnd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mxnd\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"20 * {id: int64, device: string}\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/miniconda3/envs/numba-xnd/lib/python3.6/site-packages/xnd/__init__.py\u001b[0m in \u001b[0;36m__new__\u001b[0;34m(cls, value, type, dtype, levels, typedef, dtypedef)\u001b[0m\n\u001b[1;32m 125\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 126\u001b[0m \u001b[0mtype\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtypeof\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 127\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0msuper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__new__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcls\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 128\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 129\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__repr__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: xnd: expected dict, not 'tuple'" ] } ], "source": [ "eventsDS = xnd.xnd(data, type=\"20 * {id: int64, device: string}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No, looks like we can't do this.\n", "\n", "*Are memory of tuple and struct the same? Can we cast one to the other in XND?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's parse the JSON. Ideally we could do this in XND, but let's use Python for now for JSON parsing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Do we have a way of mapping a function (json.loads) over a column in xnd? Like can I get a `20 * string` from a `20 * (int64, string)`?*" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ndt(\"20 * {id : int64, json : {device_id : int64, device_type : string, ip : string, cca3 : string, cn : string, temp : int64, signal : int64, battery_level : int64, c02_level : int64, timestamp : int64}}\")" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "eventsFromJSONDF = xnd.xnd([{\"id\": i.value, \"json\": json.loads(s.value)} for (i, s) in eventsDS])\n", "eventsFromJSONDF.type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works :) Now let's get a couple of columns.\n", "\n", "*Can we \"flatten\" a struct in xnd somehow? Like this in spark: https://stackoverflow.com/questions/38753898/how-to-flatten-a-struct-in-a-spark-dataframe. Would this best be done by numba compiling this Python code?*" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "xnd([{'id': 0, 'device_type': 'sensor-ipad', 'ip': '68.161.225.1', 'cca3': 'USA'},\n", " {'id': 1, 'device_type': 'sensor-igauge', 'ip': '213.161.254.1', 'cca3': 'NOR'},\n", " {'id': 2, 'device_type': 'sensor-ipad', 'ip': '88.36.5.1', 'cca3': 'ITA'},\n", " {'id': 3, 'device_type': 'sensor-inest', 'ip': '66.39.173.154', 'cca3': 'USA'},\n", " {'id': 4, 'device_type': 'sensor-ipad', 'ip': '203.82.41.9', 'cca3': 'PHL'},\n", " {'id': 5, 'device_type': 'sensor-istick', 'ip': '204.116.105.67', 'cca3': 'USA'},\n", " {'id': 6, 'device_type': 'sensor-ipad', 'ip': '220.173.179.1', 'cca3': 'CHN'},\n", " {'id': 7, 'device_type': 'sensor-ipad', 'ip': '118.23.68.227', 'cca3': 'JPN'},\n", " {'id': 8, 'device_type': 'sensor-inest', 'ip': '208.109.163.218', 'cca3': 'USA'},\n", " ...],\n", " type='20 * {id : int64, device_type : string, ip : string, cca3 : string}')" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jsDF = xnd.xnd([{\n", " \"id\": r['id'].value,\n", " \"device_type\": r['json'][\"device_type\"].value,\n", " \"ip\": r['json'][\"ip\"].value,\n", " \"cca3\": r['json'][\"cca3\"].value\n", "} for r in eventsFromJSONDF])\n", "jsDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK that seems to work!\n", "\n", "*Is there a way to pretty print this in a table?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's do some filering:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ndt(\"20 * (int64, string)\")" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eventsDS.type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Can we infer types from existing xnd types? For example, `xnd([xnd(1), xnd(2)])` Then we don't have to get `.value` when we are making xnd from other xnds*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do the filtering in Python instead of xnd. " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ndt(\"14 * {device_id : int64, device_type : string, ip : string, cca3 : string, cn : string, temp : int64, signal : int64, battery_level : int64, c02_level : int64, timestamp : int64}\")" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "devices = [json.loads(device.value) for (_, device) in eventsDS]\n", "devicesDF = xnd.xnd([device for device in devices if device['temp'] > 10 and device['signal'] > 15])\n", "devicesDF.type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we get a nice plot of this like we can in spark?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One option is to convert it to pandas dataframe... Lemme put this on my todolist. Make NumPy and Pandas DF wrappers for XND so you can use them in Altair or Dask." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "xnd(23, type='int64')" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "devicesDF[0]['signal']" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'device_id': 8,\n", " 'device_type': 'sensor-inest',\n", " 'ip': '208.109.163.218',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 40,\n", " 'signal': 16,\n", " 'battery_level': 9,\n", " 'c02_level': 1208,\n", " ...: ...},\n", " {'device_id': 5,\n", " 'device_type': 'sensor-istick',\n", " 'ip': '204.116.105.67',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 50,\n", " 'signal': 16,\n", " 'battery_level': 8,\n", " 'c02_level': 1574,\n", " ...: ...},\n", " {'device_id': 17,\n", " 'device_type': 'sensor-ipad',\n", " 'ip': '64.124.180.215',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 38,\n", " 'signal': 17,\n", " 'battery_level': 9,\n", " 'c02_level': 1304,\n", " ...: ...},\n", " {'device_id': 14,\n", " 'device_type': 'sensor-inest',\n", " 'ip': '68.85.85.106',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 39,\n", " 'signal': 17,\n", " 'battery_level': 8,\n", " 'c02_level': 831,\n", " ...: ...},\n", " {'device_id': 6,\n", " 'device_type': 'sensor-ipad',\n", " 'ip': '220.173.179.1',\n", " 'cca3': 'CHN',\n", " 'cn': 'China',\n", " 'temp': 21,\n", " 'signal': 18,\n", " 'battery_level': 9,\n", " 'c02_level': 1249,\n", " ...: ...},\n", " {'device_id': 1,\n", " 'device_type': 'sensor-igauge',\n", " 'ip': '213.161.254.1',\n", " 'cca3': 'NOR',\n", " 'cn': 'Norway',\n", " 'temp': 30,\n", " 'signal': 18,\n", " 'battery_level': 6,\n", " 'c02_level': 1413,\n", " ...: ...},\n", " {'device_id': 13,\n", " 'device_type': 'sensor-ipad',\n", " 'ip': '67.185.72.1',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 34,\n", " 'signal': 20,\n", " 'battery_level': 8,\n", " 'c02_level': 1504,\n", " ...: ...},\n", " {'device_id': 0,\n", " 'device_type': 'sensor-ipad',\n", " 'ip': '68.161.225.1',\n", " 'cca3': 'USA',\n", " 'cn': 'United States',\n", " 'temp': 25,\n", " 'signal': 23,\n", " 'battery_level': 8,\n", " 'c02_level': 917,\n", " ...: ...},\n", " {'device_id': 2,\n", " 'device_type': 'sensor-ipad',\n", " 'ip': '88.36.5.1',\n", " 'cca3': 'ITA',\n", " 'cn': 'Italy',\n", " 'temp': 18,\n", " 'signal': 25,\n", " 'battery_level': 5,\n", " 'c02_level': 1372,\n", " ...: ...},\n", " ...]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "devicesUSDF = xnd.xnd(sorted([device.value for device in devicesDF], key=lambda d: (d['signal'], d['temp'])))\n", "devicesUSDF.short_value(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK I don't really feel like going through the rest of this notebook and replicating all of it.\n", "\n", "A few other things I noticed was that it supports Map types. We could support these as tuples of keys and values in XND, if we are ok with O(N) value lookups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are definately some things that are missing if we compare using xnd to spark here. These are:\n", "\n", "1. Map types\n", "2. converting between tuples and structs\n", "3. Taking \"column\". Like if we have a list of structs, converting this to a list of a certain value of that struct. This is also like flattening a list of struct into a struct of lists.\n", "4. Filtering and sorting\n", "5. Displaying Xnd objects that resemble dataframes as dataframes, by converting to pandas or subclassing or acting like pandas Dataframe.\n", "6. Serializing to/from JSON\n", "\n", "\n", "I think that 6. (JSON) could be handled by libxnd, since it really has to do with doing effecient creation of xnd structures. It would be harder to do that in a higher level library and get good performance.\n", "\n", "For 5. (viewing as DF) I think this could be handled by a higher level library that uses XND. So it uses XND as data backing but provides a user facing API on top of it.\n", "\n", "For 1, 2, 3, 4, I am not sure where these should live. It would be nice if we could do filtering inside XND, but maybe this belongs in a higher level library? I can see array abstraction library being useful for this, but then that library wouldn't just be array abstractions but more like nested data abstractions. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }