{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data science intro with panthera\n", "## Clojure + Pandas + Numpy = 💖\n", "\n", "I'll show how it is possible to get the most out of [Pandas](https://pandas.pydata.org/) and the Clojure ecosystem at the same time.\n", "\n", "This intro is based on this [Kaggle notebook](https://www.kaggle.com/kanncaa1/data-sciencetutorial-for-beginners); you can follow along with that if you come from the Python world.\n", "\n", "## Env setup\n", "\n", "The easiest way to go is the provided [Docker image](https://cloud.docker.com/u/alanmarazzi/repository/docker/alanmarazzi/panthera), but if you want to set up your machine just follow along.\n", "\n", "### System install\n", "\n", "If you want to install everything at the system level you should do something equivalent to what we do below:\n", "\n", "```bash\n", "sudo apt-get update\n", "sudo apt-get install libpython3.6-dev\n", "pip3 install numpy pandas\n", "```\n", "\n", "### conda\n", "\n", "To work within a conda environment just create a new one with:\n", "\n", "```bash\n", "conda create -n panthera python=3.6 numpy pandas\n", "conda activate panthera\n", "```\n", "\n", "Then start your REPL from the activated conda environment. This is the best way to install requirements for panthera because in the process you get MKL as well with Numpy.\n", "\n", "### Here\n", "\n", "Let's just add panthera to our classpath and we're good to go!" 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ ":ok" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.misc.helper :as helper])\n", "(helper/add-dependencies '[panthera \"0.1-alpha.11\"])\n", ":ok" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now require panthera main API namespace and define a little helper to better inspect data-frames" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(require '[panthera.panthera :as pt])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/show" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.display :as display])\n", "(require '[libpython-clj.python :as py])\n", "\n", "(defn show\n", " [obj]\n", " (display/html\n", " (py/call-attr obj \"to_html\")))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nil" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(helper/add-dependencies '[metasoarous/oz \"1.5.4\"])\n", "(require '[oz.notebook.clojupyter :as oz])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A brief primer\n", "\n", "We will work with Pokemons! Datasets are available [here](https://www.kaggle.com/kanncaa1/data-sciencetutorial-for-beginners/data).\n", "\n", "We can read data into panthera from various formats, one of the most used is `read-csv`. 
Most panthera functions accept a data-frame or a series as their first argument, followed by one or more required arguments and then a map of options.\n", "\n", "To see which options are available you can check the docs or even the original [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv), just remember that if you pass keywords they'll be converted to Python automatically (for example `:index-col` becomes `index_col`), while if you pass strings you have to use the original Python names.\n", "\n", "Below as an example we `read-csv` our file, but we want to get only the first 10 rows, so we pass a map to the function like `{:nrows 10}`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", 
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | 6 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 6 | 7 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 7 | 8 | Mega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 8 | 9 | Mega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
| 9 | 10 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/read-csv \"../resources/pokemon.csv\" {:nrows 10}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cool thing is that we can chain operations, the thread-first macro is our friend!\n", "\n", "Below we read the whole csv, get the correlation matrix and then show it" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", 
| | # | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|
| # | 1.000000 | 0.097712 | 0.102664 | 0.094691 | 0.089199 | 0.085596 | 0.012181 | 0.983428 | 0.154336 |
| HP | 0.097712 | 1.000000 | 0.422386 | 0.239622 | 0.362380 | 0.378718 | 0.175952 | 0.058683 | 0.273620 |
| Attack | 0.102664 | 0.422386 | 1.000000 | 0.438687 | 0.396362 | 0.263990 | 0.381240 | 0.051451 | 0.345408 |
| Defense | 0.094691 | 0.239622 | 0.438687 | 1.000000 | 0.223549 | 0.510747 | 0.015227 | 0.042419 | 0.246377 |
| Sp. Atk | 0.089199 | 0.362380 | 0.396362 | 0.223549 | 1.000000 | 0.506121 | 0.473018 | 0.036437 | 0.448907 |
| Sp. Def | 0.085596 | 0.378718 | 0.263990 | 0.510747 | 0.506121 | 1.000000 | 0.259133 | 0.028486 | 0.363937 |
| Speed | 0.012181 | 0.175952 | 0.381240 | 0.015227 | 0.473018 | 0.259133 | 1.000000 | -0.023121 | 0.326715 |
| Generation | 0.983428 | 0.058683 | 0.051451 | 0.042419 | 0.036437 | 0.028486 | -0.023121 | 1.000000 | 0.079794 |
| Legendary | 0.154336 | 0.273620 | 0.345408 | 0.246377 | 0.448907 | 0.363937 | 0.326715 | 0.079794 | 1.000000 |
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/read-csv \"../resources/pokemon.csv\")\n", " pt/corr\n", " show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we'll be using `pokemon.csv` a lot, let's give it a name, `defonce` is great here" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/pokemon" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defonce pokemon (pt/read-csv \"../resources/pokemon.csv\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how plotting goes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/heatmap" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn heatmap \n", " [data x y z]\n", " {:data {:values data}\n", " :width 500\n", " :height 500\n", " :encoding {:x {:field x\n", " :type \"nominal\"}\n", " :y {:field y\n", " :type \"nominal\"}}\n", " :layer [{:mark \"rect\"\n", " :encoding {:color {:field z\n", " :type \"quantitative\"}}}\n", " {:mark \"text\"\n", " :encoding {:text \n", " {:field z\n", " :type \"quantitative\"\n", " :format \".2f\"}\n", " :color {:value \"white\"}}}]})" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " pt/corr\n", " pt/reset-index\n", " (pt/melt {:id-vars :index})\n", " pt/->clj\n", " (heatmap :index :variable :value)\n", " oz/view!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we did was plot the heatmap of the correlation matrix shown above. Don't worry too much about all the steps we took, we'll be seeing all of them one by one later on!\n", "\n", "What if we have already read our data but we want to see only some rows? We have the `head` function for that" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", 
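| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |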
" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/head pokemon))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 4 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 5 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | 6 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 6 | 7 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 7 | 8 | Mega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 8 | 9 | Mega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
| 9 | 10 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/head pokemon 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another nice thing we can do is to get column names" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',\n", " 'Sp. Def', 'Speed', 'Generation', 'Legendary'],\n", " dtype='object')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/names pokemon)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you see an output like the one above, it means that the data we have is still in Python. That's ok if you keep working within panthera, but what if you want to do something with column names using Clojure?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"#\" \"Name\" \"Type 1\" \"Type 2\" \"HP\" \"Attack\" \"Defense\" \"Sp. Atk\" \"Sp. Def\" \"Speed\" \"Generation\" \"Legendary\"]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(vec (pt/names pokemon))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's it! Just call `vec` and now you have a nice Clojure vector that you can deal with.\n", "\n", "> N.B.: many Python objects can be treated directly like the corresponding Clojure collections. For instance in this case we can do something like below" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "Name\n", "Type 1\n", "Type 2\n", "HP\n", "Attack\n", "Defense\n", "Sp. Atk\n", "Sp. 
Def\n", "Speed\n", "Generation\n", "Legendary\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(doseq [a (pt/names pokemon)] (println a))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some plotting\n", "\n", "Plotting is nice to learn how to munge data: you get a fast visual feedback and usually results are nice to look at!\n", "\n", "Let's plot `Speed` and `Defense`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/line-plot" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn line-plot\n", " [data x y & [color]]\n", " (let [spec {:data {:values data}\n", " :mark \"line\"\n", " :width 600\n", " :height 300\n", " :encoding {:x {:field x\n", " :type \"quantitative\"}\n", " :y {:field y\n", " :type \"quantitative\"}\n", " :color {}}}]\n", " (if color\n", " (assoc-in spec [:encoding :color] {:field color\n", " :type \"nominal\"})\n", " (assoc-in spec [:encoding :color] {:value \"blue\"}))))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols :# :Speed :Defense)\n", " (pt/melt {:id-vars :#})\n", " pt/->clj\n", " (line-plot :# :value :variable)\n", " oz/view!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the operation above:\n", "\n", "- `subset-cols`: we use this to, well, subset columns. We can choose N columns by label, and we get a 'new' data-frame with only the selected columns\n", "- `melt`: this transforms the data-frame from wide to long format (for more info about it see [further below](#reshape))\n", "- `->clj`: this turns data-frames and series into a Clojure vector of maps\n", "\n", "`subset-cols` is pretty straightforward:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", 
| | Speed | Attack |
|---|---|---|
| 0 | 45 | 49 |
| 1 | 60 | 62 |
| 2 | 80 | 82 |
| 3 | 80 | 100 |
| 4 | 65 | 52 |
" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | Speed | Attack | HP | # |
|---|---|---|---|---|
| 0 | 45 | 49 | 45 | 1 |
| 1 | 60 | 62 | 60 | 2 |
| 2 | 80 | 82 | 80 | 3 |
| 3 | 80 | 100 | 80 | 4 |
| 4 | 65 | 52 | 39 | 5 |
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " # Attack\n", "0 1 49\n", "1 2 62\n", "2 3 82\n", "3 4 100\n", "4 5 52" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols :# :Attack) pt/head)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`->clj` tries to work out the best way to transform panthera data structures into Clojure ones. Note in the outputs below that column names come back as lower-case keywords (`Speed` becomes `:speed`)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{:speed 45} {:speed 60} {:speed 80} {:speed 80} {:speed 65}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80} {:speed 80, :hp 80} {:speed 65, :hp 39}]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to see what happens when we plot `Attack` vs `Defense`" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/scatter" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn scatter\n", " [data x y & [color]]\n", " (let [spec {:data {:values data}\n", " :mark \"point\"\n", " :width 600\n", " :height 300\n", " :encoding {:x {:field x\n", " :type \"quantitative\"}\n", " :y {:field y\n", " :type \"quantitative\"}\n", " :color {}}}]\n", " (if color\n", " (assoc-in spec [:encoding 
:color] {:field color\n", " :type \"nominal\"})\n", " (assoc-in spec [:encoding :color] {:value \"dodgerblue\"}))))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols :Attack :Defense)\n", " pt/->clj\n", " (scatter :attack :defense)\n", " oz/view!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now the `Speed` histogram" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/hist" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn hist\n", " [data x & [color]]\n", " (let [spec {:data {:values data}\n", " :mark \"bar\"\n", " :width 600\n", " :height 300\n", " :encoding {:x {:field x\n", " :bin {:maxbins 50}\n", " :type \"quantitative\"}\n", " :y {:aggregate \"count\"\n", " :type \"quantitative\"}\n", " :color {}}}]\n", " (if color\n", " (assoc-in spec [:encoding :color] {:field color\n", " :type \"nominal\"})\n", " (assoc-in spec [:encoding :color] {:value \"dodgerblue\"}))))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols :Speed)\n", " pt/->clj\n", " (hist :speed)\n", " oz/view!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data-frames basics\n", "\n", "### Creation\n", "\n", "How to create data-frames? Above we read a csv, but what if we already have some data in the runtime we want to deal with? Nothing easier than this:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | a | b |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | 0 | 1 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |
" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns of data-frames are just series:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ ":series" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols \"Defense\") pt/pytype)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series [1 2 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column name is the name of the series:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series [1 2 3] {:name :my-series})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering\n", "\n", "One of the most straightforward ways to filter data-frames is with booleans. 
We have `filter-rows` that takes either booleans or a function that generates booleans" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
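| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 225 | Mega Steelix | Steel | Ground | 75 | 125 | 230 | 55 | 95 | 30 | 2 | False |
| 230 | 231 | Shuckle | Bug | Rock | 20 | 10 | 230 | 10 | 230 | 5 | 2 | False |
| 333 | 334 | Mega Aggron | Steel | NaN | 70 | 140 | 230 | 60 | 80 | 50 | 3 | False |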
" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/filter-rows #(-> % (pt/subset-cols \"Defense\") (pt/gt 200)))\n", " show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`gt` is exactly what you think it is: `>`. Check the [Basic concepts](https://github.com/alanmarazzi/panthera/blob/master/examples/basic-concepts.ipynb) notebook to better understand how math works in panthera.\n", "\n", "Now we'll have to introduce Numpy in the equation. Let's say we want to filter the data-frame based on 2 conditions at the same time, we can do that using `npy`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "nil" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[panthera.numpy :refer [npy]])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/my-filter" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn my-filter\n", " [col1 col2]\n", " (npy :logical-and \n", " {:args [(-> pokemon\n", " (pt/subset-cols col1)\n", " (pt/gt 200))\n", " (-> pokemon\n", " (pt/subset-cols col2)\n", " (pt/gt 100))]}))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 225 | Mega Steelix | Steel | Ground | 75 | 125 | 230 | 55 | 95 | 30 | 2 | False |
| 333 | 334 | Mega Aggron | Steel | NaN | 70 | 140 | 230 | 60 | 80 | 50 | 3 | False |
" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/filter-rows (my-filter :Defense :Attack))\n", " show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`panthera.numpy` works a little differently than regular panthera: usually you need only `npy` to have access to all of Numpy's functions.\n", "\n", "For instance:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3.891820\n", "1 4.143135\n", "2 4.418841\n", "3 4.812184\n", "4 3.761200\n", "Name: Defense, dtype: float64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols :Defense)\n", " ((npy :log))\n", " pt/head)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we just calculated the `log` of the whole `Defense` column! Remember that `npy` operations are vectorized, so usually it is faster to use them (or equivalent panthera ones) than Clojure ones (unless you're doing more complicated operations, then Clojure would *probably* be faster).\n", "\n", "Now let's try to do some more complicated things:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "27311/400" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(/ (pt/sum (pt/subset-cols pokemon :Speed)) \n", " (pt/n-rows pokemon))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we see how we can combine operations on series, but of course that's a `mean`, and we have a function for that!" 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/col-mean" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn col-mean\n", " [col]\n", " (pt/mean (pt/subset-cols pokemon col)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we would like to add a new column that says `high` when the value is above the mean, and `low` for the opposite.\n", "\n", "`npy` is really helpful here:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['low' 'low' 'high' 'high' 'low']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))\n", " \"high\"\n", " \"low\"]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/where" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn where\n", " [& args]\n", " (npy :where {:args args}))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['low' 'low' 'high' 'high' 'low']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols :Speed)\n", " pt/head\n", " (pt/gt (col-mean :Speed))\n", " (where \"high\" \"low\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That seems to work! 
Let's add a new column to our data-frame:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
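| | speed_level | Speed |
|---|---|---|
| 0 | low | 45 |
| 1 | low | 60 |
| 2 | high | 80 |
| 3 | high | 80 |
| 4 | low | 65 |
| 5 | high | 80 |
| 6 | high | 100 |
| 7 | high | 100 |
| 8 | high | 100 |
| 9 | low | 43 |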
" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def speed-level\n", " (-> pokemon\n", " (pt/subset-cols :Speed)\n", " (pt/gt (col-mean :Speed))\n", " (where \"high\" \"low\")))\n", "\n", "(-> pokemon\n", " (pt/assign {:speed-level speed-level})\n", " (pt/subset-cols :speed_level :Speed)\n", " (pt/head 10)\n", " show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course we didn't actually add `speed_level` to `pokemon`, we created a new data-frame. Everything here is as immutable as possible, let's check if this is really the case:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"#\" \"Name\" \"Type 1\" \"Type 2\" \"HP\" \"Attack\" \"Defense\" \"Sp. Atk\" \"Sp. Def\" \"Speed\" \"Generation\" \"Legendary\"]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(vec (pt/names pokemon))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting data\n", "\n", "Other than `head` we have `tail`" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
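<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>795</th><td>796</td><td>Diancie</td><td>Rock</td><td>Fairy</td><td>50</td><td>100</td><td>150</td><td>100</td><td>150</td><td>50</td><td>6</td><td>True</td></tr>
<tr><th>796</th><td>797</td><td>Mega Diancie</td><td>Rock</td><td>Fairy</td><td>50</td><td>160</td><td>110</td><td>160</td><td>110</td><td>110</td><td>6</td><td>True</td></tr>
<tr><th>797</th><td>798</td><td>Hoopa Confined</td><td>Psychic</td><td>Ghost</td><td>80</td><td>110</td><td>60</td><td>150</td><td>130</td><td>70</td><td>6</td><td>True</td></tr>
<tr><th>798</th><td>799</td><td>Hoopa Unbound</td><td>Psychic</td><td>Dark</td><td>80</td><td>160</td><td>60</td><td>170</td><td>130</td><td>80</td><td>6</td><td>True</td></tr>
<tr><th>799</th><td>800</td><td>Volcanion</td><td>Fire</td><td>Water</td><td>80</td><td>110</td><td>120</td><td>130</td><td>90</td><td>70</td><td>6</td><td>True</td></tr>
</tbody>
</table>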
" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/tail pokemon))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can always check what's the shape of the data structure we're interested in. `shape` returns rows and columns count" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(800, 12)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/shape pokemon)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want just one of the two you can either use one of `n-rows` or `n-cols`, or get the required value by index:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "800" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/n-rows pokemon)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "800" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "((pt/shape pokemon) 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory data analysis\n", "\n", "Now we can move to something a little more interesting: some data analysis.\n", "\n", "One of the first things we might want to do is to look at some frequencies. 
`value-counts` is our friend" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Water 112\n", "Normal 98\n", "Grass 70\n", "Bug 69\n", "Psychic 57\n", "Fire 52\n", "Rock 44\n", "Electric 44\n", "Ground 32\n", "Ghost 32\n", "Dragon 32\n", "Dark 31\n", "Poison 28\n", "Fighting 27\n", "Steel 27\n", "Ice 24\n", "Fairy 17\n", "Flying 4\n", "Name: Type 1, dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols \"Type 1\")\n", " (pt/value-counts {:dropna false}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see we get counts by group automatically and this can come in handy!\n", "\n", "There's also a nice way to see many stats at once for all the numeric columns: `describe`" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
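<table>
<thead>
<tr><th></th><th>#</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th></tr>
</thead>
<tbody>
<tr><th>count</th><td>800.0000</td><td>800.000000</td><td>800.000000</td><td>800.000000</td><td>800.000000</td><td>800.000000</td><td>800.000000</td><td>800.00000</td></tr>
<tr><th>mean</th><td>400.5000</td><td>69.258750</td><td>79.001250</td><td>73.842500</td><td>72.820000</td><td>71.902500</td><td>68.277500</td><td>3.32375</td></tr>
<tr><th>std</th><td>231.0844</td><td>25.534669</td><td>32.457366</td><td>31.183501</td><td>32.722294</td><td>27.828916</td><td>29.060474</td><td>1.66129</td></tr>
<tr><th>min</th><td>1.0000</td><td>1.000000</td><td>5.000000</td><td>5.000000</td><td>10.000000</td><td>20.000000</td><td>5.000000</td><td>1.00000</td></tr>
<tr><th>25%</th><td>200.7500</td><td>50.000000</td><td>55.000000</td><td>50.000000</td><td>49.750000</td><td>50.000000</td><td>45.000000</td><td>2.00000</td></tr>
<tr><th>50%</th><td>400.5000</td><td>65.000000</td><td>75.000000</td><td>70.000000</td><td>65.000000</td><td>70.000000</td><td>65.000000</td><td>3.00000</td></tr>
<tr><th>75%</th><td>600.2500</td><td>80.000000</td><td>100.000000</td><td>90.000000</td><td>95.000000</td><td>90.000000</td><td>90.000000</td><td>5.00000</td></tr>
<tr><th>max</th><td>800.0000</td><td>255.000000</td><td>190.000000</td><td>230.000000</td><td>194.000000</td><td>230.000000</td><td>180.000000</td><td>6.00000</td></tr>
</tbody>
</table>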
" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show (pt/describe pokemon))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need some of these stats only for some columns, chances are that there's a function for that!" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[69.25875 25.53466903233207 1 255]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/subset-cols pokemon :HP)\n", " ((juxt pt/mean pt/std pt/minimum pt/maximum)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Reshaping data\n", "\n", "Some of the most common operations with rectangular data is to reshape them how we most please to make other operations easier.\n", "\n", "The R people perfectly know what I mean when I talk about [tidy data](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf), if you have no idea about this check the link, but the main point is that while most are used to work with double entry matrices (like the one above built with `describe`), it is much easier to work with *long data*: one row per observation and one column per variable.\n", "\n", "In panthera there's `melt` as a workhorse for this process" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
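<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>1</td><td>Bulbasaur</td><td>Grass</td><td>Poison</td><td>45</td><td>49</td><td>49</td><td>65</td><td>65</td><td>45</td><td>1</td><td>False</td></tr>
<tr><th>1</th><td>2</td><td>Ivysaur</td><td>Grass</td><td>Poison</td><td>60</td><td>62</td><td>63</td><td>80</td><td>80</td><td>60</td><td>1</td><td>False</td></tr>
<tr><th>2</th><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>3</th><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>4</th><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td></tr>
</tbody>
</table>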
" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon pt/head show)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
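<table>
<thead>
<tr><th></th><th>Name</th><th>variable</th><th>value</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>Bulbasaur</td><td>Attack</td><td>49</td></tr>
<tr><th>1</th><td>Ivysaur</td><td>Attack</td><td>62</td></tr>
<tr><th>2</th><td>Venusaur</td><td>Attack</td><td>82</td></tr>
<tr><th>3</th><td>Mega Venusaur</td><td>Attack</td><td>100</td></tr>
<tr><th>4</th><td>Charmander</td><td>Attack</td><td>52</td></tr>
<tr><th>5</th><td>Bulbasaur</td><td>Defense</td><td>49</td></tr>
<tr><th>6</th><td>Ivysaur</td><td>Defense</td><td>63</td></tr>
<tr><th>7</th><td>Venusaur</td><td>Defense</td><td>83</td></tr>
<tr><th>8</th><td>Mega Venusaur</td><td>Defense</td><td>123</td></tr>
<tr><th>9</th><td>Charmander</td><td>Defense</td><td>43</td></tr>
</tbody>
</table>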
" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon pt/head (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we told panthera that we wanted to `melt` our data-frame and that we would like to have the column `Name` act as the main id, while we're interested in the value of `Attack` and `Defense`.\n", "\n", "This makes much easier to group values by some variable:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " value\n", "variable \n", "Attack 69.0\n", "Defense 72.2" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon \n", " pt/head \n", " (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) \n", " (pt/groupby :variable)\n", " pt/mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you've ever used Excel you already know about `pivot`, which is the opposite of `melt`" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
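<table>
<thead>
<tr><th>variable</th><th>Attack</th><th>Defense</th></tr>
<tr><th>Name</th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>Bulbasaur</th><td>49</td><td>49</td></tr>
<tr><th>Charmander</th><td>52</td><td>43</td></tr>
<tr><th>Ivysaur</th><td>62</td><td>63</td></tr>
<tr><th>Mega Venusaur</th><td>100</td><td>123</td></tr>
<tr><th>Venusaur</th><td>82</td><td>83</td></tr>
</tbody>
</table>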
" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon \n", " pt/head \n", " (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) \n", " (pt/pivot {:index \"Name\" :columns \"variable\" :values \"value\"})\n", " show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we have more than one data-frame? We can combine them however we want!" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
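<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>1</td><td>Bulbasaur</td><td>Grass</td><td>Poison</td><td>45</td><td>49</td><td>49</td><td>65</td><td>65</td><td>45</td><td>1</td><td>False</td></tr>
<tr><th>1</th><td>2</td><td>Ivysaur</td><td>Grass</td><td>Poison</td><td>60</td><td>62</td><td>63</td><td>80</td><td>80</td><td>60</td><td>1</td><td>False</td></tr>
<tr><th>2</th><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>3</th><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>4</th><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td></tr>
<tr><th>5</th><td>796</td><td>Diancie</td><td>Rock</td><td>Fairy</td><td>50</td><td>100</td><td>150</td><td>100</td><td>150</td><td>50</td><td>6</td><td>True</td></tr>
<tr><th>6</th><td>797</td><td>Mega Diancie</td><td>Rock</td><td>Fairy</td><td>50</td><td>160</td><td>110</td><td>160</td><td>110</td><td>110</td><td>6</td><td>True</td></tr>
<tr><th>7</th><td>798</td><td>Hoopa Confined</td><td>Psychic</td><td>Ghost</td><td>80</td><td>110</td><td>60</td><td>150</td><td>130</td><td>70</td><td>6</td><td>True</td></tr>
<tr><th>8</th><td>799</td><td>Hoopa Unbound</td><td>Psychic</td><td>Dark</td><td>80</td><td>160</td><td>60</td><td>170</td><td>130</td><td>80</td><td>6</td><td>True</td></tr>
<tr><th>9</th><td>800</td><td>Volcanion</td><td>Fire</td><td>Water</td><td>80</td><td>110</td><td>120</td><td>130</td><td>90</td><td>70</td><td>6</td><td>True</td></tr>
</tbody>
</table>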
" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show \n", " (pt/concatenate\n", " [(pt/head pokemon)\n", " (pt/tail pokemon)]\n", " {:axis 0\n", " :ignore-index true}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just a second to discuss some options:\n", "\n", "- `:axis`: most of panthera operations can be applied either by rows or columns, we decide which with this keyword where 0 = rows and 1 = columns\n", "- `:ignore-index`: panthera works by index, to better understand what kind of indexes there are and most of their quirks check [Basic concepts](https://nbviewer.jupyter.org/github/alanmarazzi/panthera/blob/master/examples/basic-concepts.ipynb#Indexing-and-subsetting)\n", "\n", "To better understand `:axis` let's make another example" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
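<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>1</td><td>Bulbasaur</td><td>Grass</td><td>Poison</td><td>45</td><td>49</td><td>49</td><td>65</td><td>65</td><td>45</td><td>1</td><td>False</td><td>1</td><td>Bulbasaur</td><td>Grass</td><td>Poison</td><td>45</td><td>49</td><td>49</td><td>65</td><td>65</td><td>45</td><td>1</td><td>False</td></tr>
<tr><th>1</th><td>2</td><td>Ivysaur</td><td>Grass</td><td>Poison</td><td>60</td><td>62</td><td>63</td><td>80</td><td>80</td><td>60</td><td>1</td><td>False</td><td>2</td><td>Ivysaur</td><td>Grass</td><td>Poison</td><td>60</td><td>62</td><td>63</td><td>80</td><td>80</td><td>60</td><td>1</td><td>False</td></tr>
<tr><th>2</th><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>3</th><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>4</th><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td></tr>
</tbody>
</table>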
" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(show\n", " (pt/concatenate\n", " (repeat 2 (pt/head pokemon))\n", " {:axis 1}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types, types everywhere\n", "\n", "There are many dedicated types, but no worries, there are nice ways to deal with them." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "# int64\n", "Name object\n", "Type 1 object\n", "Type 2 object\n", "HP int64\n", "Attack int64\n", "Defense int64\n", "Sp. Atk int64\n", "Sp. Def int64\n", "Speed int64\n", "Generation int64\n", "Legendary bool\n", "dtype: object" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/dtype pokemon)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I guess there isn't much to say about `:int64` and `:bool`, but surely `:object` looks more interesting. When panthera (numpy included) finds either strings or something it doesn't know how to deal with it goes to the less tight type possible which is an `:object`.\n", "\n", "`:object`s are usually bloated, if we want to save some overhead and it makes sense to deal with categorical values we can convert them to `:category`" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Grass\n", "1 Grass\n", "2 Grass\n", "3 Grass\n", "4 Fire\n", "Name: Type 1, dtype: category\n", "Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols \"Type 1\")\n", " (pt/astype :category)\n", " pt/head)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 45.0\n", "1 60.0\n", "2 80.0\n", "3 80.0\n", "4 65.0\n", "Name: Speed, dtype: float64" ] }, 
"execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols \"Speed\")\n", " (pt/astype :float)\n", " pt/head)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with missing data\n", "\n", "One of the most painful operations for data scientists and engineers is dealing with the unknown: `NaN` (or `nil`, `Null`, etc).\n", "\n", "panthera tries to make this as painless as possible:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NaN 386\n", "Flying 97\n", "Ground 35\n", "Poison 34\n", "Psychic 33\n", "Fighting 26\n", "Grass 25\n", "Fairy 23\n", "Steel 22\n", "Dark 20\n", "Dragon 18\n", "Ice 14\n", "Ghost 14\n", "Water 14\n", "Rock 14\n", "Fire 12\n", "Electric 6\n", "Normal 4\n", "Bug 3\n", "Name: Type 2, dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols \"Type 2\")\n", " (pt/value-counts {:dropna false}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could check for `NaN` in other ways has well:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[true false]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon (pt/subset-cols \"Type 2\") ((juxt pt/hasnans? (comp pt/all? 
pt/not-na?))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the ways to deal with missing data is to just drop the rows that contain it:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Flying 97\n", "Ground 35\n", "Poison 34\n", "Psychic 33\n", "Fighting 26\n", "Grass 25\n", "Fairy 23\n", "Steel 22\n", "Dark 20\n", "Dragon 18\n", "Ice 14\n", "Rock 14\n", "Water 14\n", "Ghost 14\n", "Fire 12\n", "Electric 6\n", "Normal 4\n", "Bug 3\n", "Name: Type 2, dtype: int64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/dropna {:subset [\"Type 2\"]})\n", " (pt/subset-cols \"Type 2\")\n", " (pt/value-counts {:dropna false}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But let's say we want to replace missing observations with a flag or value of some kind. We can do that easily with `fill-na`:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Poison\n", "1 Poison\n", "2 Poison\n", "3 Poison\n", "4 empty\n", "5 empty\n", "6 Flying\n", "7 Dragon\n", "8 Flying\n", "9 empty\n", "Name: Type 2, dtype: object" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " (pt/subset-cols \"Type 2\")\n", " (pt/fill-na :empty)\n", " (pt/head 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Time and dates\n", "\n", "Programmers hate time, that's a fact. 
Panthera tries to make this experience as painless as possible" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15',\n", " '1993-03-16'],\n", " dtype='datetime64[ns]', freq=None)" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def times\n", " [\"1992-01-10\",\"1992-02-10\",\"1992-03-10\",\"1993-03-15\",\"1993-03-16\"])\n", "\n", "(pt/->datetime times)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
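<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>1992-01-10</th><td>1</td><td>Bulbasaur</td><td>Grass</td><td>Poison</td><td>45</td><td>49</td><td>49</td><td>65</td><td>65</td><td>45</td><td>1</td><td>False</td></tr>
<tr><th>1992-02-10</th><td>2</td><td>Ivysaur</td><td>Grass</td><td>Poison</td><td>60</td><td>62</td><td>63</td><td>80</td><td>80</td><td>60</td><td>1</td><td>False</td></tr>
<tr><th>1992-03-10</th><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>1993-03-15</th><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>1993-03-16</th><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td></tr>
</tbody>
</table>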
" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " pt/head\n", " (pt/set-index (pt/->datetime times))\n", " show)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "# 5\n", "Name Charmander\n", "Type 1 Fire\n", "Type 2 NaN\n", "HP 39\n", "Attack 52\n", "Defense 43\n", "Sp. Atk 60\n", "Sp. Def 50\n", "Speed 65\n", "Generation 1\n", "Legendary False\n", "Name: 1993-03-16 00:00:00, dtype: object" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " pt/head\n", " (pt/set-index (pt/->datetime times))\n", " (pt/select-rows \"1993-03-16\" :loc))" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
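<table>
<thead>
<tr><th></th><th>#</th><th>Name</th><th>Type 1</th><th>Type 2</th><th>HP</th><th>Attack</th><th>Defense</th><th>Sp. Atk</th><th>Sp. Def</th><th>Speed</th><th>Generation</th><th>Legendary</th></tr>
</thead>
<tbody>
<tr><th>1992-03-10</th><td>3</td><td>Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>82</td><td>83</td><td>100</td><td>100</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>1993-03-15</th><td>4</td><td>Mega Venusaur</td><td>Grass</td><td>Poison</td><td>80</td><td>100</td><td>123</td><td>122</td><td>120</td><td>80</td><td>1</td><td>False</td></tr>
<tr><th>1993-03-16</th><td>5</td><td>Charmander</td><td>Fire</td><td>NaN</td><td>39</td><td>52</td><td>43</td><td>60</td><td>50</td><td>65</td><td>1</td><td>False</td></tr>
</tbody>
</table>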
" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> pokemon\n", " pt/head\n", " (pt/set-index (pt/->datetime times))\n", " (pt/select-rows (pt/slice \"1992-03-10\" \"1993-03-16\") :loc)\n", " show)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Lein-Clojure", "language": "clojure", "name": "lein-clojure" }, "language_info": { "file_extension": ".clj", "mimetype": "text/x-clojure", "name": "clojure", "version": "1.10.0" } }, "nbformat": 4, "nbformat_minor": 2 }