{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hi, Are you in Google Colab?\n", "In Google Colab you can easily run Optimus. If you are not there yet, you may want to open this notebook here:\n", "https://colab.research.google.com/github/ironmussa/Optimus/blob/master/examples/10_min_from_spark_to_pandas_with_optimus.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install Optimus and all its dependencies." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "if 'google.colab' in sys.modules:\n", " !apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", " !wget -q https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz\n", " !tar xf spark-2.4.1-bin-hadoop2.7.tgz\n", " !pip install optimuspyspark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Restart Runtime\n", "Before you continue, please go to the 'Runtime' menu above and select 'Restart Runtime (Ctrl + M + .)'." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Re-import sys: restarting the runtime cleared the earlier import\n", "import sys\n", "if 'google.colab' in sys.modules:\n", " import os\n", " os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", " os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.1-bin-hadoop2.7\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## You are done. Enjoy Optimus!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hacking Optimus!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To hack on Optimus, we recommend cloning the repo and setting ```repo_path``` to its location relative to this notebook." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "repo_path=\"..\"\n", "\n", "# This will reload the changes you make to Optimus in real time\n", "%load_ext autoreload\n", "%autoreload 2\n", "import sys\n", "sys.path.append(repo_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Optimus\n", "\n", "From the command line:\n", "\n", "`pip install optimuspyspark`\n", "\n", "From a notebook you can use:\n", "\n", "`!pip install optimuspyspark`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Optimus and start it" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\argenisleon\\Anaconda3\\lib\\site-packages\\socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working\n", " from collections import Callable\n", "\n", " You are using PySparkling of version 2.4.10, but your PySpark is of\n", " version 2.3.1. Please make sure Spark and PySparkling versions are compatible. \n", "`formatargspec` is deprecated since Python 3.5. Use `signature` and the `Signature` object directly\n" ] } ], "source": [ "from optimus import Optimus" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "op = Optimus(master=\"local\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe creation\n", "\n", "Create a dataframe by passing a list of column names and a list of rows. Unlike pandas, you need to specify the column names.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 6 of 6 rows / 8 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height(ft)
\n", "
2 (float)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
weight(t)
\n", "
5 (float)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
japanese name
\n", "
6 (array<string>)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
last position
\n", "
7 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
attributes
\n", "
8 (array<float>)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Leader\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
4.300000190734863\n", "
\n", "
\n", "
['Inochi',⋅'Convoy']\n", "
\n", "
\n", "
19.442735,-99.201111\n", "
\n", "
\n", "
[8.53439998626709,⋅4300.0]\n", "
\n", "
\n", "
bumbl#ebéé⋅⋅\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
2.0\n", "
\n", "
\n", "
['Bumble',⋅'Goldback']\n", "
\n", "
\n", "
10.642707,-71.612534\n", "
\n", "
\n", "
[5.334000110626221,⋅2000.0]\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
26.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
4.0\n", "
\n", "
\n", "
['Roadbuster']\n", "
\n", "
\n", "
37.789563,-122.400356\n", "
\n", "
\n", "
[7.924799919128418,⋅4000.0]\n", "
\n", "
\n", "
Jazz\n", "
\n", "
\n", "
13.0\n", "
\n", "
\n", "
First⋅Lieutenant\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
1.7999999523162842\n", "
\n", "
\n", "
['Meister']\n", "
\n", "
\n", "
33.670666,-117.841553\n", "
\n", "
\n", "
[3.962399959564209,⋅1800.0]\n", "
\n", "
\n", "
Megatron\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
5.699999809265137\n", "
\n", "
\n", "
['Megatron']\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
[None,⋅5700.0]\n", "
\n", "
\n", "
Metroplex_)^$\n", "
\n", "
\n", "
300.0\n", "
\n", "
\n", "
Battle⋅Station\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
['Metroflex']\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
[91.44000244140625,⋅None]\n", "
\n", "
\n", "\n", "\n", "
Viewing 6 of 6 rows / 8 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = op.create.df(\n", " [\n", " \"names\",\n", " \"height(ft)\",\n", " \"function\",\n", " \"rank\",\n", " \"weight(t)\",\n", " \"japanese name\",\n", " \"last position\",\n", " \"attributes\"\n", " ],\n", " [\n", "\n", " (\"Optim'us\", 28.0, \"Leader\", 10, 4.3, [\"Inochi\", \"Convoy\"], \"19.442735,-99.201111\", [8.5344, 4300.0]),\n", " (\"bumbl#ebéé \", 17.5, \"Espionage\", 7, 2.0, [\"Bumble\", \"Goldback\"], \"10.642707,-71.612534\", [5.334, 2000.0]),\n", " (\"ironhide&\", 26.0, \"Security\", 7, 4.0, [\"Roadbuster\"], \"37.789563,-122.400356\", [7.9248, 4000.0]),\n", " (\"Jazz\", 13.0, \"First Lieutenant\", 8, 1.8, [\"Meister\"], \"33.670666,-117.841553\", [3.9624, 1800.0]),\n", " (\"Megatron\", None, \"None\", None, 5.7, [\"Megatron\"], None, [None, 5700.0]),\n", " (\"Metroplex_)^$\", 300.0, \"Battle Station\", 8, None, [\"Metroflex\"], None, [91.44, None]),\n", "\n", " ]).h_repartition(1)\n", "df.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dataframe by passing a list of tuples that specify each column's data type. The data type can be given as a string or as a Spark data type: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/types/package-summary.html\n", "\n", "You can also use some Optimus predefined types:\n", "* \"str\" = StringType()\n", "* \"int\" = IntegerType()\n", "* \"float\" = FloatType()\n", "* \"bool\" = BooleanType()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 5 of 5 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (float)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
bumbl#ebéé⋅⋅\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Leader\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
26.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Jazz\n", "
\n", "
\n", "
13.0\n", "
\n", "
\n", "
First⋅Lieutenant\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
Megatron\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "\n", "\n", "
Viewing 5 of 5 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = op.create.df(\n", " [\n", " (\"names\", \"str\"),\n", " (\"height\", \"float\"),\n", " (\"function\", \"str\"),\n", " (\"rank\", \"int\"),\n", " ],\n", " [\n", " (\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n", " (\"Optim'us\", 28.0, \"Leader\", 10),\n", " (\"ironhide&\", 26.0, \"Security\", 7),\n", " (\"Jazz\", 13.0, \"First Lieutenant\", 8),\n", " (\"Megatron\", None, \"None\", None),\n", "\n", " ])\n", "df.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dataframe and specify whether each column accepts null values." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 5 of 5 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (float)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
bumbl#ebéé⋅⋅\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Leader\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
26.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Jazz\n", "
\n", "
\n", "
13.0\n", "
\n", "
\n", "
First⋅Lieutenant\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
Megatron\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "\n", "\n", "
Viewing 5 of 5 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = op.create.df(\n", " [\n", " (\"names\", \"str\", True),\n", " (\"height\", \"float\", True),\n", " (\"function\", \"str\", True),\n", " (\"rank\", \"int\", True),\n", " ],\n", " [\n", " (\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n", " (\"Optim'us\", 28.0, \"Leader\", 10),\n", " (\"ironhide&\", 26.0, \"Security\", 7),\n", " (\"Jazz\", 13.0, \"First Lieutenant\", 8),\n", " (\"Megatron\", None, \"None\", None),\n", "\n", " ])\n", "df.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a dataframe from a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
bumbl#ebéé⋅⋅\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Leader\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
26.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "\n", "data = [(\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n", " (\"Optim'us\", 28.0, \"Leader\", 10),\n", " (\"ironhide&\", 26.0, \"Security\", 7)]\n", "labels = [\"names\", \"height\", \"function\", \"rank\"]\n", "\n", "# Create pandas dataframe\n", "pdf = pd.DataFrame.from_records(data, columns=labels)\n", "\n", "df = op.create.df(pdf=pdf)\n", "df.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Viewing data\n", "Here is how to view the first 10 rows of a dataframe." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
bumbl#ebéé⋅⋅\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Leader\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
26.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.table(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## About Spark\n", "Spark and Optimus work differently than pandas or R. If you are not familiar with Spark, we recommend taking the time to look at the links below.\n", "\n", "### Partitions\n", "Partitions are the way Spark divides the data across your local computer or cluster so that it can be processed in parallel. The number of partitions can greatly impact Spark performance.\n", "\n", "Take 5 minutes to read this article:\n", "https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297\n", "\n", "### Lazy operations\n", "Lazy evaluation in Spark means that execution will not start until an action is triggered.\n", "\n", "https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage\n", "\n", "### Immutability\n", "Immutability rules out a big set of potential problems caused by updates from multiple threads at once. Immutable data is safe to share across processes.\n", "\n", "https://www.quora.com/Why-is-RDD-immutable-in-Spark\n", "\n", "### Spark Architecture\n", "https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Columns and Rows\n", "\n", "Optimus organizes operations into columns and rows. This is a little different from how pandas works, where all operations revolve around the pandas class. We think this approach can better help you access and transform data. 
For a deep dive into the design decisions, please read:\n", "\n", "https://towardsdatascience.com/announcing-optimus-v2-agile-data-science-workflows-made-easy-c127a12d9e13" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort columns by name" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
function
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
names
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " Espionage\n", " \n", " 17.5\n", " \n", " bumbl#ebéé⸱⸱\n", " \n", " 7\n", "
\n", " Leader\n", " \n", " 28.0\n", " \n", " Optim'us\n", " \n", " 10\n", "
\n", " Security\n", " \n", " 26.0\n", " \n", " ironhide&\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.cols.sort().table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sort rows by the rank value" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
3 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
3 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.rows.sort(\"rank\").table()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 5 of 5 rows / 5 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
summary
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
names
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
5 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
count\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
mean\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
23.833333333333332\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
8.0\n", "
\n", "
\n", "
stddev\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
5.575242894559244\n", "
\n", "
\n", "
None\n", "
\n", "
\n", "
1.7320508075688772\n", "
\n", "
\n", "
min\n", "
\n", "
\n", "
Optim'us\n", "
\n", "
\n", "
17.5\n", "
\n", "
\n", "
Espionage\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
max\n", "
\n", "
\n", "
ironhide&\n", "
\n", "
\n", "
28.0\n", "
\n", "
\n", "
Security\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "\n", "\n", "
Viewing 5 of 5 rows / 5 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.describe().table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection\n", "\n", "Unlike pandas, Spark dataframes don't support random row access, so methods like `loc` in pandas are not available.\n", "\n", "Spark dataframes also have no row index, so methods like `iloc` are not available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select and show a specific column" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 1 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", "
\n", " Optim'us\n", "
\n", " ironhide&\n", "
\n", "\n", "
Viewing 3 of 3 rows / 1 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.cols.select(\"names\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select rows from a dataframe where a condition is met" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 1 of 1 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", "\n", "
Viewing 1 of 1 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.rows.select(df[\"rank\"] > 7).table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select rows whose column contains any of the given values" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.rows.is_in(\"rank\", [7, 10]).table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a unique id for every row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.rows.create_id().table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create new columns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 5 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
Affiliation
\n", "
5 (string)
\n", "
\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", " \n", " Autobot\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", " \n", " Autobot\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", " \n", " Autobot\n", "
\n", "\n", "
Viewing 3 of 3 rows / 5 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.cols.append(\"Affiliation\", \"Autobot\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.rows.drop_na(\"*\", how='any').table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filling missing data." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.cols.fill_na(\"*\", \"N//A\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a boolean mask showing where values are missing." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (boolean)
\n", "
\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (boolean)
\n", "
\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " False\n", " \n", " Espionage\n", " \n", " False\n", "
\n", " Optim'us\n", " \n", " False\n", " \n", " Leader\n", " \n", " False\n", "
\n", " ironhide&\n", " \n", " False\n", " \n", " Security\n", " \n", " False\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.cols.is_na(\"*\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stats" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "23.833333333333332" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cols.mean(\"height\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "{'rank': {'mean': 8.0}, 'height': {'mean': 23.833333333333332}}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cols.mean(\"*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (float)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 18.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 29.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 27.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def func(value, args):\n", " return value + 1\n", "\n", "\n", "df.cols.apply(\"height\", func, \"float\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histogramming" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'names': {'approx_count_distinct': 3},\n", " 'height': {'approx_count_distinct': 3},\n", " 'function': {'approx_count_distinct': 3},\n", " 'rank': {'approx_count_distinct': 2}}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cols.count_uniques(\"*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String Methods" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " ESPIONAGE\n", " \n", " 7\n", "
\n", " optim'us\n", " \n", " 28.0\n", " \n", " LEADER\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " SECURITY\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df \\\n", " .cols.lower(\"names\") \\\n", " .cols.upper(\"function\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Concat\n", "\n", "Optimus provides an intuitive way to concat Dataframes by columns or rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_new = op.create.df(\n", " [\n", " \"class\"\n", " ],\n", " [\n", " (\"Autobot\"),\n", " (\"Autobot\"),\n", " (\"Autobot\"),\n", " (\"Autobot\"),\n", " (\"Decepticons\"),\n", "\n", " ]).h_repartition(1)\n", "\n", "op.append([df, df_new], \"columns\").table()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 4 of 4 rows / 4 columns
\n", "
2 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", " Grimlock\n", " \n", " 22.9\n", " \n", " Dinobot⸱Commander\n", " \n", " 9\n", "
\n", "\n", "
Viewing 4 of 4 rows / 4 columns
\n", "
2 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_new = op.create.df(\n", " [\n", " \"names\",\n", " \"height\",\n", " \"function\",\n", " \"rank\",\n", " ],\n", " [\n", " (\"Grimlock\", 22.9, \"Dinobot Commander\", 9),\n", " ]).h_repartition(1)\n", "\n", "op.append([df, df_new], \"rows\").table()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [], "source": [ "# Operations like `join` and `group` are handled using Spark directly" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_melt = df.melt(id_vars=[\"names\"], value_vars=[\"height\", \"function\", \"rank\"])\n", "df.table()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
200 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " Espionage\n", " \n", " 17.5\n", " \n", " 7\n", "
\n", " ironhide&\n", " \n", " Security\n", " \n", " 26.0\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " Leader\n", " \n", " 28.0\n", " \n", " 10\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
200 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_melt.pivot(\"names\", \"variable\", \"value\").table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "bucketizer() executed in 0.1 sec\n", "hist() executed in 1.27 sec\n", "hist() executed in 3.39 sec\n" ] }, { "data": { "image/png": "\n", 
"text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.plot.hist(\"height\", 10)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.plot.frequency(\"*\", 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Data In/Out" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "['names', 'height', 'function', 'rank']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.cols.names()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [], "source": [ "df.to_json()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "inputHidden": false, "outputHidden": false }, "outputs": [ { "data": { "text/plain": [ "StructType(List(StructField(names,StringType,true),StructField(height,DoubleType,true),StructField(function,StringType,true),StructField(rank,LongType,true)))" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.table()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "inputHidden": false, "lines_to_next_cell": 0, "outputHidden": false, "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Processing column 'height'...\n", "_count_data_types() executed in 1.11 sec\n", "count_data_types() executed in 1.11 sec\n", "cast_columns() executed in 0.0 sec\n", "_exprs() executed in 1.18 sec\n", "general_stats() executed in 1.19 sec\n", "------------------------------\n", "Processing column 'height'...\n", "frequency() executed in 1.19 sec\n", "stats_by_column() executed in 0.0 sec\n", "percentile() executed in 0.04 sec\n", "extra_numeric_stats() executed in 0.17 sec\n", "bucketizer() executed in 0.19 sec\n", "hist() executed in 1.38 sec\n", "dataset_info() executed in 1.21 sec\n" ] }, { "data": { "text/html": [ "\n", "\n", "
\n", "

Overview

\n", "
\n", "
\n", "
\n", "

Dataset info

\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "
Number of columns4
Number of rows3
Total Missing (%)0.0%
Total size in memory81.7 MB
\n", "
\n", "
\n", "

Column types

\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "
String0
Numeric1
Date0
Bool0
Array0
Not available0
\n", "
\n", "
\n", "\n", "
\n", "
\n", "\n", " \n", "\n", "
\n", "
\n", "

height

\n", "
numeric
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unique 3
Unique (%) 100.0
Missing0.0
Missing (%)0
\n", "
\n", "

\n", " Datatypes\n", "

\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "
\n", " String\n", " \n", " 0\n", "
\n", " Integer\n", " \n", " 0\n", "
\n", " Float\n", " \n", " 0\n", "
\n", " Bool\n", " \n", " 0\n", "
\n", " Date\n", " \n", " 0\n", "
\n", " Missing\n", " \n", " 0\n", "
\n", " Null\n", " \n", " 0\n", "
\n", " \n", "
\n", "

\n", " Basic Stats\n", "

\n", "\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Mean23.833333333333332
Minimum17.5
Maximum28.0
Zeros(%)0
\n", " \n", "\n", "
\n", "
\n", "

Frequency

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ValueCountFrequency (%)
28.0133.333%
26.0133.333%
17.5133.333%
\"Missing\"00.0%
\n", "
\n", " \n", "\n", " \n", "
\n", "\n", "\n", "

Quantile statistics

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Minimum17.5
5-th percentile17.5
Q117.5
Median17.5
Q317.5
95-th percentile17.5
Maximum28.0
Range10.5
Interquartile range0.0
\n", "
\n", "
\n", "

Descriptive statistics

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Standard deviation5.575242894559244
Coef of variation0.23393
Kurtosis-1.5000000000000004
Mean23.833333333333332
MAD0.0
Skewness0
Sum71.5
Variance31.083333333333336
\n", "
\n", " \n", "
\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "
\n", "\n", "
\n", " \n", "
\n", "
\n", "
\n", " \n", "
\n", "\n", "
\n", "
\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Pika version 0.12.0 connecting to ::1:5672\n", "Created channel=1\n", "Closing channel (0): 'Normal shutdown' on ('::1', 5672, 0, 0) params=>>\n", "Received on ('::1', 5672, 0, 0) params=>>\n", "run() executed in 8.76 sec\n" ] } ], "source": [ "op.profiler.run(df, \"height\", infer=True)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading foo.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv\n", "Downloaded 967 bytes\n", "Creating DataFrame for foo.csv. Please wait...\n", "Successfully created DataFrame for 'foo.csv'\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 5 of 5 rows / 8 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
id
\n", "
1 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
firstName
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
lastName
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
billingId
\n", "
4 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
product
\n", "
5 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
price
\n", "
6 (int)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
birth
\n", "
7 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
dummyCol
\n", "
8 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " 1\n", " \n", " Luis\n", " \n", " Alvarez$$%!\n", " \n", " 123\n", " \n", " Cake\n", " \n", " 10\n", " \n", " 1980/07/07\n", " \n", " never\n", "
\n", " 2\n", " \n", " André\n", " \n", " Ampère\n", " \n", " 423\n", " \n", " piza\n", " \n", " 8\n", " \n", " 1950/07/08\n", " \n", " gonna\n", "
\n", " 3\n", " \n", " NiELS\n", " \n", " Böhr//((%%\n", " \n", " 551\n", " \n", " pizza\n", " \n", " 8\n", " \n", " 1990/07/09\n", " \n", " give\n", "
\n", " 4\n", " \n", " PAUL\n", " \n", " dirac$\n", " \n", " 521\n", " \n", " pizza\n", " \n", " 8\n", " \n", " 1954/07/10\n", " \n", " you\n", "
\n", " 5\n", " \n", " Albert\n", " \n", " Einstein\n", " \n", " 634\n", " \n", " pizza\n", " \n", " 8\n", " \n", " 1990/07/11\n", " \n", " up\n", "
\n", "\n", "
Viewing 5 of 5 rows / 8 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_csv = op.load.csv(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv\").limit(5)\n", "df_csv.table()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading foo.json from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json\n", "Downloaded 2596 bytes\n", "Creating DataFrame for foo.json. Please wait...\n", "Successfully created DataFrame for 'foo.json'\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 5 of 5 rows / 8 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
billingId
\n", "
1 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
birth
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
dummyCol
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
firstName
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
id
\n", "
5 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
lastName
\n", "
6 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
price
\n", "
7 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
product
\n", "
8 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " 123\n", " \n", " 1980/07/07\n", " \n", " never\n", " \n", " Luis\n", " \n", " 1\n", " \n", " Alvarez$$%!\n", " \n", " 10\n", " \n", " Cake\n", "
\n", " 423\n", " \n", " 1950/07/08\n", " \n", " gonna\n", " \n", " André\n", " \n", " 2\n", " \n", " Ampère\n", " \n", " 8\n", " \n", " piza\n", "
\n", " 551\n", " \n", " 1990/07/09\n", " \n", " give\n", " \n", " NiELS\n", " \n", " 3\n", " \n", " Böhr//((%%\n", " \n", " 8\n", " \n", " pizza\n", "
\n", " 521\n", " \n", " 1954/07/10\n", " \n", " you\n", " \n", " PAUL\n", " \n", " 4\n", " \n", " dirac$\n", " \n", " 8\n", " \n", " pizza\n", "
\n", " 634\n", " \n", " 1990/07/11\n", " \n", " up\n", " \n", " Albert\n", " \n", " 5\n", " \n", " Einstein\n", " \n", " 8\n", " \n", " pizza\n", "
\n", "\n", "
Viewing 5 of 5 rows / 8 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_json = op.load.json(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json\").limit(5)\n", "df_json.table()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_csv.save.csv(\"test.csv\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
names
\n", "
1 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
height
\n", "
2 (double)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
function
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
rank
\n", "
4 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", " bumbl#ebéé⸱⸱\n", " \n", " 17.5\n", " \n", " Espionage\n", " \n", " 7\n", "
\n", " Optim'us\n", " \n", " 28.0\n", " \n", " Leader\n", " \n", " 10\n", "
\n", " ironhide&\n", " \n", " 26.0\n", " \n", " Security\n", " \n", " 7\n", "
\n", "\n", "
Viewing 3 of 3 rows / 4 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enrichment" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df = op.load.json(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 10 of 19 rows / 8 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
billingId
\n", "
1 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
birth
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
dummyCol
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
firstName
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
id
\n", "
5 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
lastName
\n", "
6 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
price
\n", "
7 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
product
\n", "
8 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
123\n", "
\n", "
\n", "
1980/07/07\n", "
\n", "
\n", "
never\n", "
\n", "
\n", "
Luis\n", "
\n", "
\n", "
1\n", "
\n", "
\n", "
Alvarez$$%!\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
Cake\n", "
\n", "
\n", "
423\n", "
\n", "
\n", "
1950/07/08\n", "
\n", "
\n", "
gonna\n", "
\n", "
\n", "
André\n", "
\n", "
\n", "
2\n", "
\n", "
\n", "
Ampère\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
piza\n", "
\n", "
\n", "
551\n", "
\n", "
\n", "
1990/07/09\n", "
\n", "
\n", "
give\n", "
\n", "
\n", "
NiELS\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
Böhr//((%%\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
521\n", "
\n", "
\n", "
1954/07/10\n", "
\n", "
\n", "
you\n", "
\n", "
\n", "
PAUL\n", "
\n", "
\n", "
4\n", "
\n", "
\n", "
dirac$\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
634\n", "
\n", "
\n", "
1990/07/11\n", "
\n", "
\n", "
up\n", "
\n", "
\n", "
Albert\n", "
\n", "
\n", "
5\n", "
\n", "
\n", "
Einstein\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
672\n", "
\n", "
\n", "
1930/08/12\n", "
\n", "
\n", "
never\n", "
\n", "
\n", "
Galileo\n", "
\n", "
\n", "
6\n", "
\n", "
\n", "
⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI\n", "
\n", "
\n", "
5\n", "
\n", "
\n", "
arepa\n", "
\n", "
\n", "
323\n", "
\n", "
\n", "
1970/07/13\n", "
\n", "
\n", "
gonna\n", "
\n", "
\n", "
CaRL\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Ga%%%uss\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "
624\n", "
\n", "
\n", "
1950/07/14\n", "
\n", "
\n", "
let\n", "
\n", "
\n", "
David\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
H$$$ilbert\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taaaccoo\n", "
\n", "
\n", "
735\n", "
\n", "
\n", "
1920/04/22\n", "
\n", "
\n", "
you\n", "
\n", "
\n", "
Johannes\n", "
\n", "
\n", "
9\n", "
\n", "
\n", "
KEPLER\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "
875\n", "
\n", "
\n", "
1923/03/12\n", "
\n", "
\n", "
down\n", "
\n", "
\n", "
JaMES\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
M$$ax%%well\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "\n", "\n", "
Viewing 10 of 19 rows / 8 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.table()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "count is deprecated. Use Collection.count_documents instead.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0720079ce4774eba846f02c6b8a49949", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(IntProgress(value=0, description='Processing...', max=19, style=ProgressStyle(description_width…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "find_and_modify is deprecated, use find_one_and_delete, find_one_and_replace, or find_one_and_update instead\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "import requests\n", "\n", "\n", "def func_request(params):\n", "    # Add any headers or auth info the API needs here.\n", "    # See the requests library documentation for details.\n", "\n", "    url = \"https://jsonplaceholder.typicode.com/todos/\" + str(params[\"id\"])\n", "    return requests.get(url)\n", "\n", "def func_response(response):\n", "    # Parse the response here and return the value to keep.\n", "    return response[\"title\"]\n", "\n", "\n", "e = op.enrich(host=\"localhost\", port=27017, db_name=\"jazz\")\n", "e.flush()\n", "df_result = e.run(df, func_request, func_response, calls=60, period=60, max_tries=8)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Viewing 10 of 19 rows / 9 columns
\n", "
1 partition(s)
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
billingId
\n", "
1 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
birth
\n", "
2 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
dummyCol
\n", "
3 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
firstName
\n", "
4 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
id
\n", "
5 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
lastName
\n", "
6 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
price
\n", "
7 (bigint)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
product
\n", "
8 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
jazz_results
\n", "
9 (string)
\n", "
\n", " \n", " nullable\n", " \n", "
\n", "
\n", "
123\n", "
\n", "
\n", "
1980/07/07\n", "
\n", "
\n", "
never\n", "
\n", "
\n", "
Luis\n", "
\n", "
\n", "
1\n", "
\n", "
\n", "
Alvarez$$%!\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
Cake\n", "
\n", "
\n", "
delectus⋅aut⋅autem\n", "
\n", "
\n", "
423\n", "
\n", "
\n", "
1950/07/08\n", "
\n", "
\n", "
gonna\n", "
\n", "
\n", "
André\n", "
\n", "
\n", "
2\n", "
\n", "
\n", "
Ampère\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
piza\n", "
\n", "
\n", "
quis⋅ut⋅nam⋅facilis⋅et⋅officia⋅qui\n", "
\n", "
\n", "
551\n", "
\n", "
\n", "
1990/07/09\n", "
\n", "
\n", "
give\n", "
\n", "
\n", "
NiELS\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
Böhr//((%%\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
fugiat⋅veniam⋅minus\n", "
\n", "
\n", "
521\n", "
\n", "
\n", "
1954/07/10\n", "
\n", "
\n", "
you\n", "
\n", "
\n", "
PAUL\n", "
\n", "
\n", "
4\n", "
\n", "
\n", "
dirac$\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
et⋅porro⋅tempora\n", "
\n", "
\n", "
634\n", "
\n", "
\n", "
1990/07/11\n", "
\n", "
\n", "
up\n", "
\n", "
\n", "
Albert\n", "
\n", "
\n", "
5\n", "
\n", "
\n", "
Einstein\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
pizza\n", "
\n", "
\n", "
laboriosam⋅mollitia⋅et⋅enim⋅quasi⋅adipisci⋅quia⋅provident⋅illum\n", "
\n", "
\n", "
672\n", "
\n", "
\n", "
1930/08/12\n", "
\n", "
\n", "
never\n", "
\n", "
\n", "
Galileo\n", "
\n", "
\n", "
6\n", "
\n", "
\n", "
⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI\n", "
\n", "
\n", "
5\n", "
\n", "
\n", "
arepa\n", "
\n", "
\n", "
qui⋅ullam⋅ratione⋅quibusdam⋅voluptatem⋅quia⋅omnis\n", "
\n", "
\n", "
323\n", "
\n", "
\n", "
1970/07/13\n", "
\n", "
\n", "
gonna\n", "
\n", "
\n", "
CaRL\n", "
\n", "
\n", "
7\n", "
\n", "
\n", "
Ga%%%uss\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "
illo⋅expedita⋅consequatur⋅quia⋅in\n", "
\n", "
\n", "
624\n", "
\n", "
\n", "
1950/07/14\n", "
\n", "
\n", "
let\n", "
\n", "
\n", "
David\n", "
\n", "
\n", "
8\n", "
\n", "
\n", "
H$$$ilbert\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taaaccoo\n", "
\n", "
\n", "
quo⋅adipisci⋅enim⋅quam⋅ut⋅ab\n", "
\n", "
\n", "
735\n", "
\n", "
\n", "
1920/04/22\n", "
\n", "
\n", "
you\n", "
\n", "
\n", "
Johannes\n", "
\n", "
\n", "
9\n", "
\n", "
\n", "
KEPLER\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "
molestiae⋅perspiciatis⋅ipsa\n", "
\n", "
\n", "
875\n", "
\n", "
\n", "
1923/03/12\n", "
\n", "
\n", "
down\n", "
\n", "
\n", "
JaMES\n", "
\n", "
\n", "
10\n", "
\n", "
\n", "
M$$ax%%well\n", "
\n", "
\n", "
3\n", "
\n", "
\n", "
taco\n", "
\n", "
\n", "
illo⋅est⋅ratione⋅doloremque⋅quia⋅maiores⋅aut\n", "
\n", "
\n", "\n", "\n", "
Viewing 10 of 19 rows / 9 columns
\n", "
1 partition(s)
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_result.table()" ] } ], "metadata": { "jupytext": { "formats": "ipynb,py" }, "kernel_info": { "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "nteract": { "version": "0.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }