{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data science intro with panthera\n",
"## Clojure + Pandas + Numpy = 💖\n",
"\n",
"I'll show how it is possible to get the most out of the [Pandas](https://pandas.pydata.org/) & the Clojure ecosystem at the same time.\n",
"\n",
"This intro is based on this [Kaggle notebook](https://www.kaggle.com/kanncaa1/data-sciencetutorial-for-beginners) you can follow along with that if you come from the Python world.\n",
"\n",
"## Env setup\n",
"\n",
"The easiest way to go is the provided [Docker image](https://cloud.docker.com/u/alanmarazzi/repository/docker/alanmarazzi/panthera), but if you want to setup your machine just follow along.\n",
"\n",
"### System install\n",
"\n",
"If you want to install everything at the system level you should do something equivalent to what we do below:\n",
"\n",
"```bash\n",
"sudo apt-get update\n",
"sudo apt-get install libpython3.6-dev\n",
"pip3 install numpy pandas\n",
"```\n",
"\n",
"### conda\n",
"\n",
"To work within a conda environment just create a new one with:\n",
"\n",
"```bash\n",
"conda create -n panthera python=3.6 numpy pandas\n",
"conda activate panthera\n",
"```\n",
"Than start your REPL from the activated conda environment. This is the best way to install requirements for panthera because in the process you get MKL as well with Numpy.\n",
"\n",
"### Here\n",
"\n",
"Let's just add panthera to our classpath and we're good to go!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
":ok"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(require '[clojupyter.misc.helper :as helper])\n",
"(helper/add-dependencies '[panthera \"0.1-alpha.11\"])\n",
":ok"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now require panthera main API namespace and define a little helper to better inspect data-frames"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"(require '[panthera.panthera :as pt])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/show"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(require '[clojupyter.display :as display])\n",
"(require '[libpython-clj.python :as py])\n",
"\n",
"(defn show\n",
" [obj]\n",
" (display/html\n",
" (py/call-attr obj \"to_html\")))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nil"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(helper/add-dependencies '[metasoarous/oz \"1.5.4\"])\n",
"(require '[oz.notebook.clojupyter :as oz])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A brief primer\n",
"\n",
"We will work with Pokemons! Datasets are available [here](https://www.kaggle.com/kanncaa1/data-sciencetutorial-for-beginners/data).\n",
"\n",
"We can read data into panthera from various formats, one of the most used is `read-csv`. Most panthera functions accept either a data-frame and/or a series as a first argument, one or more required arguments and then a map of options.\n",
"\n",
"To see which options are available you can check docs or even original [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv), just remember that if you pass keywords they'll be converted to Python automatically (for example `:index-col` becomes `index_col`), while if you pass strings you have to use its original name.\n",
"\n",
"Below as an example we `read-csv` our file, but we want to get only the first 10 rows, so we pass a map to the function like `{:nrows 10}`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 5 | \n",
" 6 | \n",
" Charmeleon | \n",
" Fire | \n",
" NaN | \n",
" 58 | \n",
" 64 | \n",
" 58 | \n",
" 80 | \n",
" 65 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 6 | \n",
" 7 | \n",
" Charizard | \n",
" Fire | \n",
" Flying | \n",
" 78 | \n",
" 84 | \n",
" 78 | \n",
" 109 | \n",
" 85 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 7 | \n",
" 8 | \n",
" Mega Charizard X | \n",
" Fire | \n",
" Dragon | \n",
" 78 | \n",
" 130 | \n",
" 111 | \n",
" 130 | \n",
" 85 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 8 | \n",
" 9 | \n",
" Mega Charizard Y | \n",
" Fire | \n",
" Flying | \n",
" 78 | \n",
" 104 | \n",
" 78 | \n",
" 159 | \n",
" 115 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 9 | \n",
" 10 | \n",
" Squirtle | \n",
" Water | \n",
" NaN | \n",
" 44 | \n",
" 48 | \n",
" 65 | \n",
" 50 | \n",
" 64 | \n",
" 43 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/read-csv \"../resources/pokemon.csv\" {:nrows 10}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cool thing is that we can chain operations, the threading first macro is our friend!\n",
"\n",
"Below we read the whole csv, get the correlation matrix and then show it"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | # | \n",
" 1.000000 | \n",
" 0.097712 | \n",
" 0.102664 | \n",
" 0.094691 | \n",
" 0.089199 | \n",
" 0.085596 | \n",
" 0.012181 | \n",
" 0.983428 | \n",
" 0.154336 | \n",
"
\n",
" \n",
" | HP | \n",
" 0.097712 | \n",
" 1.000000 | \n",
" 0.422386 | \n",
" 0.239622 | \n",
" 0.362380 | \n",
" 0.378718 | \n",
" 0.175952 | \n",
" 0.058683 | \n",
" 0.273620 | \n",
"
\n",
" \n",
" | Attack | \n",
" 0.102664 | \n",
" 0.422386 | \n",
" 1.000000 | \n",
" 0.438687 | \n",
" 0.396362 | \n",
" 0.263990 | \n",
" 0.381240 | \n",
" 0.051451 | \n",
" 0.345408 | \n",
"
\n",
" \n",
" | Defense | \n",
" 0.094691 | \n",
" 0.239622 | \n",
" 0.438687 | \n",
" 1.000000 | \n",
" 0.223549 | \n",
" 0.510747 | \n",
" 0.015227 | \n",
" 0.042419 | \n",
" 0.246377 | \n",
"
\n",
" \n",
" | Sp. Atk | \n",
" 0.089199 | \n",
" 0.362380 | \n",
" 0.396362 | \n",
" 0.223549 | \n",
" 1.000000 | \n",
" 0.506121 | \n",
" 0.473018 | \n",
" 0.036437 | \n",
" 0.448907 | \n",
"
\n",
" \n",
" | Sp. Def | \n",
" 0.085596 | \n",
" 0.378718 | \n",
" 0.263990 | \n",
" 0.510747 | \n",
" 0.506121 | \n",
" 1.000000 | \n",
" 0.259133 | \n",
" 0.028486 | \n",
" 0.363937 | \n",
"
\n",
" \n",
" | Speed | \n",
" 0.012181 | \n",
" 0.175952 | \n",
" 0.381240 | \n",
" 0.015227 | \n",
" 0.473018 | \n",
" 0.259133 | \n",
" 1.000000 | \n",
" -0.023121 | \n",
" 0.326715 | \n",
"
\n",
" \n",
" | Generation | \n",
" 0.983428 | \n",
" 0.058683 | \n",
" 0.051451 | \n",
" 0.042419 | \n",
" 0.036437 | \n",
" 0.028486 | \n",
" -0.023121 | \n",
" 1.000000 | \n",
" 0.079794 | \n",
"
\n",
" \n",
" | Legendary | \n",
" 0.154336 | \n",
" 0.273620 | \n",
" 0.345408 | \n",
" 0.246377 | \n",
" 0.448907 | \n",
" 0.363937 | \n",
" 0.326715 | \n",
" 0.079794 | \n",
" 1.000000 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> (pt/read-csv \"../resources/pokemon.csv\")\n",
" pt/corr\n",
" show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we'll be using `pokemon.csv` a lot, let's give it a name, `defonce` is great here"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/pokemon"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defonce pokemon (pt/read-csv \"../resources/pokemon.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how plotting goes"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/heatmap"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn heatmap \n",
" [data x y z]\n",
" {:data {:values data}\n",
" :width 500\n",
" :height 500\n",
" :encoding {:x {:field x\n",
" :type \"nominal\"}\n",
" :y {:field y\n",
" :type \"nominal\"}}\n",
" :layer [{:mark \"rect\"\n",
" :encoding {:color {:field z\n",
" :type \"quantitative\"}}}\n",
" {:mark \"text\"\n",
" :encoding {:text \n",
" {:field z\n",
" :type \"quantitative\"\n",
" :format \".2f\"}\n",
" :color {:value \"white\"}}}]})"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" pt/corr\n",
" pt/reset-index\n",
" (pt/melt {:id-vars :index})\n",
" pt/->clj\n",
" (heatmap :index :variable :value)\n",
" oz/view!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we did is plotting the heatmap of the correlation matrix shown above. Don't worry too much to all the steps we took, we'll be seeing all of them one by one later on!\n",
"\n",
"What if we already read our data but we want to see only some rows? We have the `head` function for that"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/head pokemon))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 5 | \n",
" 6 | \n",
" Charmeleon | \n",
" Fire | \n",
" NaN | \n",
" 58 | \n",
" 64 | \n",
" 58 | \n",
" 80 | \n",
" 65 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 6 | \n",
" 7 | \n",
" Charizard | \n",
" Fire | \n",
" Flying | \n",
" 78 | \n",
" 84 | \n",
" 78 | \n",
" 109 | \n",
" 85 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 7 | \n",
" 8 | \n",
" Mega Charizard X | \n",
" Fire | \n",
" Dragon | \n",
" 78 | \n",
" 130 | \n",
" 111 | \n",
" 130 | \n",
" 85 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 8 | \n",
" 9 | \n",
" Mega Charizard Y | \n",
" Fire | \n",
" Flying | \n",
" 78 | \n",
" 104 | \n",
" 78 | \n",
" 159 | \n",
" 115 | \n",
" 100 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 9 | \n",
" 10 | \n",
" Squirtle | \n",
" Water | \n",
" NaN | \n",
" 44 | \n",
" 48 | \n",
" 65 | \n",
" 50 | \n",
" 64 | \n",
" 43 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/head pokemon 10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another nice thing we can do is to get columns names"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',\n",
" 'Sp. Def', 'Speed', 'Generation', 'Legendary'],\n",
" dtype='object')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/names pokemon)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now when you see an output as the above one, that means that the data we have is still in Python. That's ok if you keep working within panthera, but what if you want to do something with column names using Clojure?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[\"#\" \"Name\" \"Type 1\" \"Type 2\" \"HP\" \"Attack\" \"Defense\" \"Sp. Atk\" \"Sp. Def\" \"Speed\" \"Generation\" \"Legendary\"]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(vec (pt/names pokemon))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's it! Just call `vec`and now you have a nice Clojure vector that you can deal with.\n",
"\n",
"> N.B.: with many Python objects you can directly treat them as similar Clojure collections. For instance in this case we can do something like below"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#\n",
"Name\n",
"Type 1\n",
"Type 2\n",
"HP\n",
"Attack\n",
"Defense\n",
"Sp. Atk\n",
"Sp. Def\n",
"Speed\n",
"Generation\n",
"Legendary\n"
]
},
{
"data": {
"text/plain": [
"nil"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(doseq [a (pt/names pokemon)] (println a))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some plotting\n",
"\n",
"Plotting is nice to learn how to munge data: you get a fast visual feedback and usually results are nice to look at!\n",
"\n",
"Let's plot `Speed` and `Defense`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/line-plot"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn line-plot\n",
" [data x y & [color]]\n",
" (let [spec {:data {:values data}\n",
" :mark \"line\"\n",
" :width 600\n",
" :height 300\n",
" :encoding {:x {:field x\n",
" :type \"quantitative\"}\n",
" :y {:field y\n",
" :type \"quantitative\"}\n",
" :color {}}}]\n",
" (if color\n",
" (assoc-in spec [:encoding :color] {:field color\n",
" :type \"nominal\"})\n",
" (assoc-in spec [:encoding :color] {:value \"blue\"}))))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols :# :Speed :Defense)\n",
" (pt/melt {:id-vars :#})\n",
" pt/->clj\n",
" (line-plot :# :value :variable)\n",
" oz/view!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the operation above:\n",
"\n",
"- `subset-cols`: we use this to, well, subset columns. We can choose N columns by label, we will get a 'new' data-frame with only the selected columns\n",
"- `melt`: this transforms the data-frame from wide to long format (for more info about it see [further below](#reshape)\n",
"- `->clj`: this turns data-frames and serieses to a Clojure vector of maps\n",
"\n",
"`subset-cols` is pretty straightforward:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" Speed | \n",
" Attack | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 45 | \n",
" 49 | \n",
"
\n",
" \n",
" | 1 | \n",
" 60 | \n",
" 62 | \n",
"
\n",
" \n",
" | 2 | \n",
" 80 | \n",
" 82 | \n",
"
\n",
" \n",
" | 3 | \n",
" 80 | \n",
" 100 | \n",
"
\n",
" \n",
" | 4 | \n",
" 65 | \n",
" 52 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" Speed | \n",
" Attack | \n",
" HP | \n",
" # | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 45 | \n",
" 49 | \n",
" 45 | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" 60 | \n",
" 62 | \n",
" 60 | \n",
" 2 | \n",
"
\n",
" \n",
" | 2 | \n",
" 80 | \n",
" 82 | \n",
" 80 | \n",
" 3 | \n",
"
\n",
" \n",
" | 3 | \n",
" 80 | \n",
" 100 | \n",
" 80 | \n",
" 4 | \n",
"
\n",
" \n",
" | 4 | \n",
" 65 | \n",
" 52 | \n",
" 39 | \n",
" 5 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" # Attack\n",
"0 1 49\n",
"1 2 62\n",
"2 3 82\n",
"3 4 100\n",
"4 5 52"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols :# :Attack) pt/head)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`->clj` tries to understand what's the better way to transform panthera data structures to Clojure ones"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{:speed 45} {:speed 60} {:speed 80} {:speed 80} {:speed 65}]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80} {:speed 80, :hp 80} {:speed 65, :hp 39}]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to see what happens when we plot `Attack` vs `Defense`"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/scatter"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn scatter\n",
" [data x y & [color]]\n",
" (let [spec {:data {:values data}\n",
" :mark \"point\"\n",
" :width 600\n",
" :height 300\n",
" :encoding {:x {:field x\n",
" :type \"quantitative\"}\n",
" :y {:field y\n",
" :type \"quantitative\"}\n",
" :color {}}}]\n",
" (if color\n",
" (assoc-in spec [:encoding :color] {:field color\n",
" :type \"nominal\"})\n",
" (assoc-in spec [:encoding :color] {:value \"dodgerblue\"}))))"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols :Attack :Defense)\n",
" pt/->clj\n",
" (scatter :attack :defense)\n",
" oz/view!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now the `Speed` histogram"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/hist"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn hist\n",
" [data x & [color]]\n",
" (let [spec {:data {:values data}\n",
" :mark \"bar\"\n",
" :width 600\n",
" :height 300\n",
" :encoding {:x {:field x\n",
" :bin {:maxbins 50}\n",
" :type \"quantitative\"}\n",
" :y {:aggregate \"count\"\n",
" :type \"quantitative\"}\n",
" :color {}}}]\n",
" (if color\n",
" (assoc-in spec [:encoding :color] {:field color\n",
" :type \"nominal\"})\n",
" (assoc-in spec [:encoding :color] {:value \"dodgerblue\"}))))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols :Speed)\n",
" pt/->clj\n",
" (hist :speed)\n",
" oz/view!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data-frames basics\n",
"\n",
"### Creation\n",
"\n",
"How to create data-frames? Above we read a csv, but what if we already have some data in the runtime we want to deal with? Nothing easier than this:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" | 1 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" | 1 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Columns of data-frames are just serieses:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
":series"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols \"Defense\") pt/pytype)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/series [1 2 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column name is the name of the series:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 2\n",
"2 3\n",
"Name: my-series, dtype: int64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/series [1 2 3] {:name :my-series})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering\n",
"\n",
"One of the most straightforward ways to filter data-frames is with booleans. We have `filter-rows` that takes either booleans or a function that generates booleans"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 224 | \n",
" 225 | \n",
" Mega Steelix | \n",
" Steel | \n",
" Ground | \n",
" 75 | \n",
" 125 | \n",
" 230 | \n",
" 55 | \n",
" 95 | \n",
" 30 | \n",
" 2 | \n",
" False | \n",
"
\n",
" \n",
" | 230 | \n",
" 231 | \n",
" Shuckle | \n",
" Bug | \n",
" Rock | \n",
" 20 | \n",
" 10 | \n",
" 230 | \n",
" 10 | \n",
" 230 | \n",
" 5 | \n",
" 2 | \n",
" False | \n",
"
\n",
" \n",
" | 333 | \n",
" 334 | \n",
" Mega Aggron | \n",
" Steel | \n",
" NaN | \n",
" 70 | \n",
" 140 | \n",
" 230 | \n",
" 60 | \n",
" 80 | \n",
" 50 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/filter-rows #(-> % (pt/subset-cols \"Defense\") (pt/gt 200)))\n",
" show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`gt` is exactly what you think it is: `>`. Check the [Basic concepts](https://github.com/alanmarazzi/panthera/blob/master/examples/basic-concepts.ipynb) notebook to better understand how math works in panthera.\n",
"\n",
"Now we'll have to introduce Numpy in the equation. Let's say we want to filter the data-frame based on 2 conditions at the same time, we can do that using `npy`:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"nil"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(require '[panthera.numpy :refer [npy]])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/my-filter"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn my-filter\n",
" [col1 col2]\n",
" (npy :logical-and \n",
" {:args [(-> pokemon\n",
" (pt/subset-cols col1)\n",
" (pt/gt 200))\n",
" (-> pokemon\n",
" (pt/subset-cols col2)\n",
" (pt/gt 100))]}))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 224 | \n",
" 225 | \n",
" Mega Steelix | \n",
" Steel | \n",
" Ground | \n",
" 75 | \n",
" 125 | \n",
" 230 | \n",
" 55 | \n",
" 95 | \n",
" 30 | \n",
" 2 | \n",
" False | \n",
"
\n",
" \n",
" | 333 | \n",
" 334 | \n",
" Mega Aggron | \n",
" Steel | \n",
" NaN | \n",
" 70 | \n",
" 140 | \n",
" 230 | \n",
" 60 | \n",
" 80 | \n",
" 50 | \n",
" 3 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/filter-rows (my-filter :Defense :Attack))\n",
" show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`panthera.numpy` works a little differently than regular panthera, usually you need only `npy` to have access to all of numpy functions.\n",
"\n",
"For instance:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 3.891820\n",
"1 4.143135\n",
"2 4.418841\n",
"3 4.812184\n",
"4 3.761200\n",
"Name: Defense, dtype: float64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols :Defense)\n",
" ((npy :log))\n",
" pt/head)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we just calculated the `log` of the whole `Defense` column! Remember that `npy` operations are vectorized, so usually it is faster to use them (or equivalent panthera ones) than Clojure ones (unless you're doing more complicated operations, then Clojure would *probably* be faster).\n",
"\n",
"Now let's try to do some more complicated things:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"27311/400"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(/ (pt/sum (pt/subset-cols pokemon :Speed)) \n",
" (pt/n-rows pokemon))"
]
},
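{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result above is printed as `27311/400` because dividing two integers in Clojure yields an exact ratio. If a decimal is more convenient, a minimal sketch in plain Clojure is to coerce it with `double`:\n",
"\n",
"```clojure\n",
";; Coerce the exact ratio to a floating point number\n",
"(double 27311/400)\n",
";; => 68.2775\n",
"```"
]
},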
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we see how we can combine operations on serieses, but of course that's a `mean`, and we have a function for that!"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/col-mean"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn col-mean\n",
" [col]\n",
" (pt/mean (pt/subset-cols pokemon col)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we would like to add a new column that says `high` when the value is above the mean, and `low` for the opposite.\n",
"\n",
"`npy` is really helpful here:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['low' 'low' 'high' 'high' 'low']"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))\n",
" \"high\"\n",
" \"low\"]})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"#'user/where"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(defn where\n",
" [& args]\n",
" (npy :where {:args args}))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['low' 'low' 'high' 'high' 'low']"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols :Speed)\n",
" pt/head\n",
" (pt/gt (col-mean :Speed))\n",
" (where \"high\" \"low\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That seems to work! Let's add a new column to our data-frame:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" speed_level | \n",
" Speed | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" low | \n",
" 45 | \n",
"
\n",
" \n",
" | 1 | \n",
" low | \n",
" 60 | \n",
"
\n",
" \n",
" | 2 | \n",
" high | \n",
" 80 | \n",
"
\n",
" \n",
" | 3 | \n",
" high | \n",
" 80 | \n",
"
\n",
" \n",
" | 4 | \n",
" low | \n",
" 65 | \n",
"
\n",
" \n",
" | 5 | \n",
" high | \n",
" 80 | \n",
"
\n",
" \n",
" | 6 | \n",
" high | \n",
" 100 | \n",
"
\n",
" \n",
" | 7 | \n",
" high | \n",
" 100 | \n",
"
\n",
" \n",
" | 8 | \n",
" high | \n",
" 100 | \n",
"
\n",
" \n",
" | 9 | \n",
" low | \n",
" 43 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(def speed-level\n",
" (-> pokemon\n",
" (pt/subset-cols :Speed)\n",
" (pt/gt (col-mean :Speed))\n",
" (where \"high\" \"low\")))\n",
"\n",
"(-> pokemon\n",
" (pt/assign {:speed-level speed-level})\n",
" (pt/subset-cols :speed_level :Speed)\n",
" (pt/head 10)\n",
" show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course we didn't actually add `speed_level` to `pokemon`, we created a new data-frame. Everything here is as immutable as possible, let's check if this is really the case:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[\"#\" \"Name\" \"Type 1\" \"Type 2\" \"HP\" \"Attack\" \"Defense\" \"Sp. Atk\" \"Sp. Def\" \"Speed\" \"Generation\" \"Legendary\"]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(vec (pt/names pokemon))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting data\n",
"\n",
"Other than `head` we have `tail`"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 795 | \n",
" 796 | \n",
" Diancie | \n",
" Rock | \n",
" Fairy | \n",
" 50 | \n",
" 100 | \n",
" 150 | \n",
" 100 | \n",
" 150 | \n",
" 50 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 796 | \n",
" 797 | \n",
" Mega Diancie | \n",
" Rock | \n",
" Fairy | \n",
" 50 | \n",
" 160 | \n",
" 110 | \n",
" 160 | \n",
" 110 | \n",
" 110 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 797 | \n",
" 798 | \n",
" Hoopa Confined | \n",
" Psychic | \n",
" Ghost | \n",
" 80 | \n",
" 110 | \n",
" 60 | \n",
" 150 | \n",
" 130 | \n",
" 70 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 798 | \n",
" 799 | \n",
" Hoopa Unbound | \n",
" Psychic | \n",
" Dark | \n",
" 80 | \n",
" 160 | \n",
" 60 | \n",
" 170 | \n",
" 130 | \n",
" 80 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 799 | \n",
" 800 | \n",
" Volcanion | \n",
" Fire | \n",
" Water | \n",
" 80 | \n",
" 110 | \n",
" 120 | \n",
" 130 | \n",
" 90 | \n",
" 70 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/tail pokemon))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can always check what's the shape of the data structure we're interested in. `shape` returns rows and columns count"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(800, 12)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/shape pokemon)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want just one of the two you can either use one of `n-rows` or `n-cols`, or get the required value by index:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"800"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/n-rows pokemon)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"800"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"((pt/shape pokemon) 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploratory data analysis\n",
"\n",
"Now we can move to something a little more interesting: some data analysis.\n",
"\n",
"One of the first things we might want to do is to look at some frequencies. `value-counts` is our friend"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Water 112\n",
"Normal 98\n",
"Grass 70\n",
"Bug 69\n",
"Psychic 57\n",
"Fire 52\n",
"Rock 44\n",
"Electric 44\n",
"Ground 32\n",
"Ghost 32\n",
"Dragon 32\n",
"Dark 31\n",
"Poison 28\n",
"Fighting 27\n",
"Steel 27\n",
"Ice 24\n",
"Fairy 17\n",
"Flying 4\n",
"Name: Type 1, dtype: int64"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols \"Type 1\")\n",
" (pt/value-counts {:dropna false}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see we get counts by group automatically and this can come in handy!\n",
"\n",
"There's also a nice way to see many stats at once for all the numeric columns: `describe`"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 800.0000 | \n",
" 800.000000 | \n",
" 800.000000 | \n",
" 800.000000 | \n",
" 800.000000 | \n",
" 800.000000 | \n",
" 800.000000 | \n",
" 800.00000 | \n",
"
\n",
" \n",
" | mean | \n",
" 400.5000 | \n",
" 69.258750 | \n",
" 79.001250 | \n",
" 73.842500 | \n",
" 72.820000 | \n",
" 71.902500 | \n",
" 68.277500 | \n",
" 3.32375 | \n",
"
\n",
" \n",
" | std | \n",
" 231.0844 | \n",
" 25.534669 | \n",
" 32.457366 | \n",
" 31.183501 | \n",
" 32.722294 | \n",
" 27.828916 | \n",
" 29.060474 | \n",
" 1.66129 | \n",
"
\n",
" \n",
" | min | \n",
" 1.0000 | \n",
" 1.000000 | \n",
" 5.000000 | \n",
" 5.000000 | \n",
" 10.000000 | \n",
" 20.000000 | \n",
" 5.000000 | \n",
" 1.00000 | \n",
"
\n",
" \n",
" | 25% | \n",
" 200.7500 | \n",
" 50.000000 | \n",
" 55.000000 | \n",
" 50.000000 | \n",
" 49.750000 | \n",
" 50.000000 | \n",
" 45.000000 | \n",
" 2.00000 | \n",
"
\n",
" \n",
" | 50% | \n",
" 400.5000 | \n",
" 65.000000 | \n",
" 75.000000 | \n",
" 70.000000 | \n",
" 65.000000 | \n",
" 70.000000 | \n",
" 65.000000 | \n",
" 3.00000 | \n",
"
\n",
" \n",
" | 75% | \n",
" 600.2500 | \n",
" 80.000000 | \n",
" 100.000000 | \n",
" 90.000000 | \n",
" 95.000000 | \n",
" 90.000000 | \n",
" 90.000000 | \n",
" 5.00000 | \n",
"
\n",
" \n",
" | max | \n",
" 800.0000 | \n",
" 255.000000 | \n",
" 190.000000 | \n",
" 230.000000 | \n",
" 194.000000 | \n",
" 230.000000 | \n",
" 180.000000 | \n",
" 6.00000 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show (pt/describe pokemon))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need some of these stats only for some columns, chances are that there's a function for that!"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[69.25875 25.53466903233207 1 255]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> (pt/subset-cols pokemon :HP)\n",
" ((juxt pt/mean pt/std pt/minimum pt/maximum)))"
]
},
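{
"cell_type": "markdown",
"metadata": {},
"source": [
"The double parentheses around `juxt` above may look odd if you come from Python. `juxt` is a Clojure core function that takes several functions and returns a single function, which applies each of them to the same argument and collects the results in a vector. A minimal sketch on a plain vector, with no panthera involved (`min-max-count` is just an illustrative name):\n",
"\n",
"```clojure\n",
";; Build one function out of three\n",
"(def min-max-count\n",
"  (juxt #(apply min %) #(apply max %) count))\n",
"\n",
";; Each function sees the same input; results come back in order\n",
"(min-max-count [3 1 2])\n",
";; => [1 3 3]\n",
"```\n",
"\n",
"So `(juxt pt/mean pt/std pt/minimum pt/maximum)` builds the function, and the outer pair of parentheses calls it on the series threaded in by `->`."
]
},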
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Reshaping data\n",
"\n",
"Some of the most common operations with rectangular data is to reshape them how we most please to make other operations easier.\n",
"\n",
"The R people perfectly know what I mean when I talk about [tidy data](https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf), if you have no idea about this check the link, but the main point is that while most are used to work with double entry matrices (like the one above built with `describe`), it is much easier to work with *long data*: one row per observation and one column per variable.\n",
"\n",
"In panthera there's `melt` as a workhorse for this process"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon pt/head show)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" Name | \n",
" variable | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Bulbasaur | \n",
" Attack | \n",
" 49 | \n",
"
\n",
" \n",
" | 1 | \n",
" Ivysaur | \n",
" Attack | \n",
" 62 | \n",
"
\n",
" \n",
" | 2 | \n",
" Venusaur | \n",
" Attack | \n",
" 82 | \n",
"
\n",
" \n",
" | 3 | \n",
" Mega Venusaur | \n",
" Attack | \n",
" 100 | \n",
"
\n",
" \n",
" | 4 | \n",
" Charmander | \n",
" Attack | \n",
" 52 | \n",
"
\n",
" \n",
" | 5 | \n",
" Bulbasaur | \n",
" Defense | \n",
" 49 | \n",
"
\n",
" \n",
" | 6 | \n",
" Ivysaur | \n",
" Defense | \n",
" 63 | \n",
"
\n",
" \n",
" | 7 | \n",
" Venusaur | \n",
" Defense | \n",
" 83 | \n",
"
\n",
" \n",
" | 8 | \n",
" Mega Venusaur | \n",
" Defense | \n",
" 123 | \n",
"
\n",
" \n",
" | 9 | \n",
" Charmander | \n",
" Defense | \n",
" 43 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon pt/head (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we told panthera that we wanted to `melt` our data-frame and that we would like to have the column `Name` act as the main id, while we're interested in the value of `Attack` and `Defense`.\n",
"\n",
"This makes much easier to group values by some variable:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" value\n",
"variable \n",
"Attack 69.0\n",
"Defense 72.2"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon \n",
" pt/head \n",
" (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) \n",
" (pt/groupby :variable)\n",
" pt/mean)"
]
},
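{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the wide-to-long reshaping concrete, here is a plain-Clojure sketch of the same idea on a vector of maps. `melt-rows` is a hypothetical helper written only for illustration; it is not how panthera or pandas implement `melt`, and it ignores indexes entirely:\n",
"\n",
"```clojure\n",
";; Wide data: one row per pokemon, one column per stat\n",
"(def wide\n",
"  [{:name :bulbasaur :attack 49 :defense 49}\n",
"   {:name :ivysaur   :attack 62 :defense 63}])\n",
"\n",
";; Hypothetical helper: one output row per (value column, input row) pair\n",
"(defn melt-rows\n",
"  [rows id-key value-keys]\n",
"  (vec (for [k   value-keys\n",
"             row rows]\n",
"         {id-key    (get row id-key)\n",
"          :variable k\n",
"          :value    (get row k)})))\n",
"\n",
"(melt-rows wide :name [:attack :defense])\n",
";; => [{:name :bulbasaur, :variable :attack, :value 49}\n",
";;     {:name :ivysaur, :variable :attack, :value 62}\n",
";;     {:name :bulbasaur, :variable :defense, :value 49}\n",
";;     {:name :ivysaur, :variable :defense, :value 63}]\n",
"```\n",
"\n",
"Once every row has the same three keys, grouping by `:variable` is a single `group-by` away, which is the point of the long format."
]
},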
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you've ever used Excel you already know about `pivot`, which is the opposite of `melt`"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | variable | \n",
" Attack | \n",
" Defense | \n",
"
\n",
" \n",
" | Name | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | Bulbasaur | \n",
" 49 | \n",
" 49 | \n",
"
\n",
" \n",
" | Charmander | \n",
" 52 | \n",
" 43 | \n",
"
\n",
" \n",
" | Ivysaur | \n",
" 62 | \n",
" 63 | \n",
"
\n",
" \n",
" | Mega Venusaur | \n",
" 100 | \n",
" 123 | \n",
"
\n",
" \n",
" | Venusaur | \n",
" 82 | \n",
" 83 | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon \n",
" pt/head \n",
" (pt/melt {:id-vars \"Name\" :value-vars [\"Attack\" \"Defense\"]}) \n",
" (pt/pivot {:index \"Name\" :columns \"variable\" :values \"value\"})\n",
" show)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What if we have more than one data-frame? We can combine them however we want!"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 5 | \n",
" 796 | \n",
" Diancie | \n",
" Rock | \n",
" Fairy | \n",
" 50 | \n",
" 100 | \n",
" 150 | \n",
" 100 | \n",
" 150 | \n",
" 50 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 6 | \n",
" 797 | \n",
" Mega Diancie | \n",
" Rock | \n",
" Fairy | \n",
" 50 | \n",
" 160 | \n",
" 110 | \n",
" 160 | \n",
" 110 | \n",
" 110 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 7 | \n",
" 798 | \n",
" Hoopa Confined | \n",
" Psychic | \n",
" Ghost | \n",
" 80 | \n",
" 110 | \n",
" 60 | \n",
" 150 | \n",
" 130 | \n",
" 70 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 8 | \n",
" 799 | \n",
" Hoopa Unbound | \n",
" Psychic | \n",
" Dark | \n",
" 80 | \n",
" 160 | \n",
" 60 | \n",
" 170 | \n",
" 130 | \n",
" 80 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
" | 9 | \n",
" 800 | \n",
" Volcanion | \n",
" Fire | \n",
" Water | \n",
" 80 | \n",
" 110 | \n",
" 120 | \n",
" 130 | \n",
" 90 | \n",
" 70 | \n",
" 6 | \n",
" True | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show \n",
" (pt/concatenate\n",
" [(pt/head pokemon)\n",
" (pt/tail pokemon)]\n",
" {:axis 0\n",
" :ignore-index true}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just a second to discuss some options:\n",
"\n",
"- `:axis`: most of panthera operations can be applied either by rows or columns, we decide which with this keyword where 0 = rows and 1 = columns\n",
"- `:ignore-index`: panthera works by index, to better understand what kind of indexes there are and most of their quirks check [Basic concepts](https://nbviewer.jupyter.org/github/alanmarazzi/panthera/blob/master/examples/basic-concepts.ipynb#Indexing-and-subsetting)\n",
"\n",
"To better understand `:axis` let's make another example"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(show\n",
" (pt/concatenate\n",
" (repeat 2 (pt/head pokemon))\n",
" {:axis 1}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Types, types everywhere\n",
"\n",
"There are many dedicated types, but no worries, there are nice ways to deal with them."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"# int64\n",
"Name object\n",
"Type 1 object\n",
"Type 2 object\n",
"HP int64\n",
"Attack int64\n",
"Defense int64\n",
"Sp. Atk int64\n",
"Sp. Def int64\n",
"Speed int64\n",
"Generation int64\n",
"Legendary bool\n",
"dtype: object"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(pt/dtype pokemon)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I guess there isn't much to say about `:int64` and `:bool`, but surely `:object` looks more interesting. When panthera (numpy included) finds either strings or something it doesn't know how to deal with it goes to the less tight type possible which is an `:object`.\n",
"\n",
"`:object`s are usually bloated, if we want to save some overhead and it makes sense to deal with categorical values we can convert them to `:category`"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Grass\n",
"1 Grass\n",
"2 Grass\n",
"3 Grass\n",
"4 Fire\n",
"Name: Type 1, dtype: category\n",
"Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols \"Type 1\")\n",
" (pt/astype :category)\n",
" pt/head)"
]
},
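{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where does the saving come from? A categorical column keeps each distinct value once and stores a small integer code per row instead of a full string. The plain-Clojure sketch below mimics that encoding; the names `categories`, `code-of` and `codes` are illustrative and this is not panthera code:\n",
"\n",
"```clojure\n",
"(def types [:grass :grass :fire :grass])\n",
"\n",
";; Each distinct value is stored once, in sorted order\n",
"(def categories (vec (sort (distinct types))))\n",
";; categories => [:fire :grass]\n",
"\n",
";; Lookup table from value to its position among the categories\n",
"(def code-of (zipmap categories (range)))\n",
"\n",
";; Every observation becomes an integer pointing at its category\n",
"(def codes (mapv code-of types))\n",
";; codes => [1 1 0 1]\n",
"```\n",
"\n",
"With 800 rows and only 18 distinct values in `Type 1`, storing codes instead of repeated strings is where the overhead goes away."
]
},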
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 45.0\n",
"1 60.0\n",
"2 80.0\n",
"3 80.0\n",
"4 65.0\n",
"Name: Speed, dtype: float64"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols \"Speed\")\n",
" (pt/astype :float)\n",
" pt/head)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dealing with missing data\n",
"\n",
"One of the most painful operations for data scientists and engineers is dealing with the unknown: `NaN` (or `nil`, `Null`, etc).\n",
"\n",
"panthera tries to make this as painless as possible:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NaN 386\n",
"Flying 97\n",
"Ground 35\n",
"Poison 34\n",
"Psychic 33\n",
"Fighting 26\n",
"Grass 25\n",
"Fairy 23\n",
"Steel 22\n",
"Dark 20\n",
"Dragon 18\n",
"Ice 14\n",
"Ghost 14\n",
"Water 14\n",
"Rock 14\n",
"Fire 12\n",
"Electric 6\n",
"Normal 4\n",
"Bug 3\n",
"Name: Type 2, dtype: int64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols \"Type 2\")\n",
" (pt/value-counts {:dropna false}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could check for `NaN` in other ways has well:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[true false]"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon (pt/subset-cols \"Type 2\") ((juxt pt/hasnans? (comp pt/all? pt/not-na?))))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the ways to deal with missing data is to just drop rows"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Flying 97\n",
"Ground 35\n",
"Poison 34\n",
"Psychic 33\n",
"Fighting 26\n",
"Grass 25\n",
"Fairy 23\n",
"Steel 22\n",
"Dark 20\n",
"Dragon 18\n",
"Ice 14\n",
"Rock 14\n",
"Water 14\n",
"Ghost 14\n",
"Fire 12\n",
"Electric 6\n",
"Normal 4\n",
"Bug 3\n",
"Name: Type 2, dtype: int64"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/dropna {:subset [\"Type 2\"]})\n",
" (pt/subset-cols \"Type 2\")\n",
" (pt/value-counts {:dropna false}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But let's say we want to replace missing observations with a flag or value of some kind, we can do that easily with `fill-na`"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Poison\n",
"1 Poison\n",
"2 Poison\n",
"3 Poison\n",
"4 empty\n",
"5 empty\n",
"6 Flying\n",
"7 Dragon\n",
"8 Flying\n",
"9 empty\n",
"Name: Type 2, dtype: object"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" (pt/subset-cols \"Type 2\")\n",
" (pt/fill-na :empty)\n",
" (pt/head 10))"
]
},
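{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two strategies above, dropping and filling, can be summarized with a plain-Clojure sketch. Here `nil` stands in for `NaN`, which is an assumption made only for this illustration; panthera itself delegates the real work to pandas.\n",
"\n",
"```clojure\n",
"(def type-2 [:poison nil :flying nil])\n",
"\n",
";; Strategy 1: drop the missing observations (fewer rows remain)\n",
"(filterv some? type-2)\n",
";; => [:poison :flying]\n",
"\n",
";; Strategy 2: keep every observation and flag the missing ones\n",
"(mapv #(if (nil? %) :empty %) type-2)\n",
";; => [:poison :empty :flying :empty]\n",
"```\n",
"\n",
"Dropping changes the number of rows, while filling preserves it, so the choice depends on whether later steps need the rows to stay aligned with the rest of the data-frame."
]
},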
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time and dates\n",
"\n",
"Programmers hate time, that's a fact. Panthera tries to make this experience as painless as possible"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15',\n",
" '1993-03-16'],\n",
" dtype='datetime64[ns]', freq=None)"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(def times\n",
" [\"1992-01-10\",\"1992-02-10\",\"1992-03-10\",\"1993-03-15\",\"1993-03-16\"])\n",
"\n",
"(pt/->datetime times)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1992-01-10 | \n",
" 1 | \n",
" Bulbasaur | \n",
" Grass | \n",
" Poison | \n",
" 45 | \n",
" 49 | \n",
" 49 | \n",
" 65 | \n",
" 65 | \n",
" 45 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1992-02-10 | \n",
" 2 | \n",
" Ivysaur | \n",
" Grass | \n",
" Poison | \n",
" 60 | \n",
" 62 | \n",
" 63 | \n",
" 80 | \n",
" 80 | \n",
" 60 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1992-03-10 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1993-03-15 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1993-03-16 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" pt/head\n",
" (pt/set-index (pt/->datetime times))\n",
" show)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"# 5\n",
"Name Charmander\n",
"Type 1 Fire\n",
"Type 2 NaN\n",
"HP 39\n",
"Attack 52\n",
"Defense 43\n",
"Sp. Atk 60\n",
"Sp. Def 50\n",
"Speed 65\n",
"Generation 1\n",
"Legendary False\n",
"Name: 1993-03-16 00:00:00, dtype: object"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" pt/head\n",
" (pt/set-index (pt/->datetime times))\n",
" (pt/select-rows \"1993-03-16\" :loc))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" | \n",
" # | \n",
" Name | \n",
" Type 1 | \n",
" Type 2 | \n",
" HP | \n",
" Attack | \n",
" Defense | \n",
" Sp. Atk | \n",
" Sp. Def | \n",
" Speed | \n",
" Generation | \n",
" Legendary | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1992-03-10 | \n",
" 3 | \n",
" Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 82 | \n",
" 83 | \n",
" 100 | \n",
" 100 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1993-03-15 | \n",
" 4 | \n",
" Mega Venusaur | \n",
" Grass | \n",
" Poison | \n",
" 80 | \n",
" 100 | \n",
" 123 | \n",
" 122 | \n",
" 120 | \n",
" 80 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
" | 1993-03-16 | \n",
" 5 | \n",
" Charmander | \n",
" Fire | \n",
" NaN | \n",
" 39 | \n",
" 52 | \n",
" 43 | \n",
" 60 | \n",
" 50 | \n",
" 65 | \n",
" 1 | \n",
" False | \n",
"
\n",
" \n",
"
"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(-> pokemon\n",
" pt/head\n",
" (pt/set-index (pt/->datetime times))\n",
" (pt/select-rows (pt/slice \"1992-03-10\" \"1993-03-16\") :loc)\n",
" show)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Lein-Clojure",
"language": "clojure",
"name": "lein-clojure"
},
"language_info": {
"file_extension": ".clj",
"mimetype": "text/x-clojure",
"name": "clojure",
"version": "1.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}