{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic panthera concepts\n", "\n", "This is an introductory guide to the concepts driving panthera and its usage.\n", "\n", "## Setup\n", "\n", "Let's add panthera to our classpath and require it to start playing around." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ ":ok" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.misc.helper :as helper])\n", "(helper/add-dependencies '[panthera \"0.1-alpha.13\"])\n", ":ok" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "nil" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[panthera.panthera :as pt])\n", "(require '[libpython-clj.python :as py])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `show` function is a helper to render data-frames" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/show" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.display :as display])\n", "\n", "(defn show\n", " [obj]\n", " (display/html\n", " (py/call-attr obj \"to_html\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series\n", "\n", "*Serieses* are like vectors that act also as columns for *data-frames* (see [Data-frame](#dataframe) section). One *series* must have all the contained data with the same data type and if there is more than one type when you create a *series* than this one takes the most relaxed one.\n", "\n", "### Create" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series [1 2 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we print the *series* we see on the left its index and on the right its values. As you can see below the *series* itself we get the underlying data type (*dtype*) as well. Let's swap 3 with \"a\" and see what happens." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 a\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series [1 2 \"a\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the *dtype* it's become `object`, which in *panthera* means either `string` or something that can be represented with a `string` and is not a primitive.\n", "\n", "If we get this data back to Clojure we'll see that we get the underlying original representation with mixed data types." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1 2 \"a\"]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(vec (pt/series [1 2 \"a\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that we can always move from a representation to another without many problems. A *series* can be treated as a Clojure vector if we want to:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1 2 3)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(map inc (pt/series (range 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But when we do this we lose metadata tied to it. The difference with regular vectors is mostly this metadata, a *series* specifically:\n", "\n", "- can have a name\n", "- has a *dtype*\n", "- has an index that can be freely named\n", "\n", "Let's see a few examples:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name my-series\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series {:name \"my-series\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We just created an empty *series* with the name \"my-series\" to show that it can exist even with just metadata. The map passed as an argument lets you add other options to the function call without bothering about their position (in Python there is a clear distinction between *arguments* and *keyword arguments*, [more info](https://treyhunner.com/2018/04/keyword-arguments-in-python/)).\n", "\n", "We can combine arguments together to get the wanted outcome" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "idx 1\n", "Name: my-series, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/series 1 {:name \"my-series\" :index [\"idx\"]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and subsetting\n", "\n", "Now \"my-series\" has a name, a value and a named index. This distinction is very important in *panthera*: indexing can be done by name and by position." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "d 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\" :index [\"a\" \"b\" \"c\" \"d\" \"e\"]})\n", " (pt/select-rows [0 3]))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "d 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\" :index [\"a\" \"b\" \"c\" \"d\" \"e\"]})\n", " (pt/select-rows [\"a\" \"d\"] :loc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see above we were able to get the same values from the *series*, but the first time we used pure positional indexing, while the second one we used named indexing.\n", "\n", "This isn't something logical, it just works like this in pandas. So you'll have to memorize:\n", "\n", "- `:iloc`: positional indexing\n", "- `:loc`: named indexing or booleans\n", "\n", "Be aware that the result of this can be this behaviour:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100 0\n", "103 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\" :index (map #(+ 100 %) (range 5))})\n", " (pt/select-rows [0 3] :iloc))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100 0\n", "103 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\" :index (map #(+ 100 %) (range 5))})\n", " (pt/select-rows [100 103] :loc))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "3 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\"})\n", " (pt/select-rows [0 3] :loc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens above is that somewhat unexpectedly we get always the same values. Let's review every cell by itself:\n", "\n", "- the first time our series can be thought as a map like `{100 0 101 1 102 2 ...}`, but this in panthera doesn't change the fact that the first value is 0, the second is 1 and so on. So by getting `[0 3]` the result is a *series* with the first and fourth values\n", "- the second time we ask for named indices, and this for Clojurians is probably the clearest case: `(select-keys {100 0 101 1 102 2 ...} [100 103])` would give the same result\n", "- the last case is probably the least clear, we ask for named indices (`:loc`), but they are integers and they are positional. This happens because when we don't have named indices both *serieses* and *data-frames* assign a monotonically increasing index that has the value of the index itself as a label. If we had to represent a panthera index in pure Clojure it would be something like `{0 \"0\" 1 \"1\" 2 \"2\" ...}`\n", "\n", "There's another way to subset by index: slicing" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 3\n", "4 4\n", "5 5\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 10))\n", " (pt/select-rows (pt/slice 3 6)))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 0\n", "b 1\n", "c 2\n", "d 3\n", "Name: my-series, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 5) {:name \"my-series\" :index [\"a\" \"b\" \"c\" \"d\" \"e\"]})\n", " (pt/select-rows (pt/slice \"a\" \"d\") :loc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Math and stats\n", "\n", "Math is easy with panthera! The only thing to keep in mind is that operations are **vectorized**, so something like `(+ [1 2 3] 1)` would result in `[2 3 4]`.\n", "\n", "To avoid confusion the panthera operations are named differently than the core functions (`+`, `-`, `*`, etc)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2\n", "1 3\n", "2 4\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/add (pt/series [1 2 3]) 1)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 8\n", "3 27\n", "4 64\n", "dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/pow (pt/series (range 5)) 3)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 1\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/add (pt/series [1 2 3]) 1 (pt/series [-1 -2 -3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only note about these operations is that in order to work the first argument has to be a panthera data structure.\n", "\n", "There are more advanced stats functions besides the more regular ones:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.5" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/mean (pt/series (range 10)))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10.712688874485469" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/kurtosis (pt/series (concat (range 10) [100])))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3.2568924988901746" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/skew (pt/series (concat (range 10) [100])))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "837.3636363636364" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/var (pt/series (concat (range 10) [100])))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-1.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/corr (pt/series (range 10)) (pt/series (range 9 0 -1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conversions\n", "\n", "It might happen that you'd like to work with different data types than the ones inferred by panthera. The advice here is to do this only on the Python side of things." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/->numeric (pt/series [\"1\" \"2\"]))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2019-01-01 00:00:00" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/->datetime \"2019-01-01\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2019-01-01\n", "1 2019-02-01\n", "dtype: datetime64[ns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/->datetime (pt/series [\"2019-01-01\" \"2019-02-01\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below an example of why you should be careful to deal with different data types in panthera" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{:unnamed 2019-01-01 00:00:00} {:unnamed 2019-02-01 00:00:00}]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series [\"2019-01-01\" \"2019-02-01\"])\n", " pt/->datetime\n", " pt/->clj)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ ":pyobject" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series [\"2019-01-01\" \"2019-02-01\"])\n", " pt/->datetime\n", " pt/->clj\n", " first\n", " :unnamed\n", " type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The safest way to deal with dates on the Clojure side of things is to convert them to strings" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"2019-01-01 00:00:00\"" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series [\"2019-01-01\" \"2019-02-01\"])\n", " pt/->datetime\n", " pt/->clj\n", " first\n", " :unnamed\n", " str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can have fun with regular numeric types as well" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 3.0\n", "dtype: float32" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/astype (pt/series [1 2 3]) :float32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reshaping\n", "\n", "There are many facilities to let you *hack 'n' slash* data almost however you want" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (-0.009, 3.0]\n", "1 (-0.009, 3.0]\n", "2 (-0.009, 3.0]\n", "3 (-0.009, 3.0]\n", "4 (3.0, 6.0]\n", "5 (3.0, 6.0]\n", "6 (3.0, 6.0]\n", "7 (6.0, 9.0]\n", "8 (6.0, 9.0]\n", "9 (6.0, 9.0]\n", "dtype: category\n", "Categories (3, interval[float64]): [(-0.009, 3.0] < (3.0, 6.0] < (6.0, 9.0]]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/cut (pt/series (range 10)) 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intervals aren't handled (yet) on the Clojure side, so keep 'em strictly in Python if you want to deal with them.\n", "\n", "With `factorize` you can convert values to ints, so basically you get categories." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0, 1, 2]), Index(['a', 'b', 'c'], dtype='object'))" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/factorize (pt/series [:a :b :c]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `remap` yu can, well, *remap* your values however you like. Just be aware that you have to pass `remap` every value present in the series in the new encoding, otherwise those not specified will be interpreted as NaNs." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 this\n", "1 that\n", "2 NaN\n", "dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/remap (pt/series [:a :b :c]) {:a \"this\" :b \"that\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example on one way to deal with `remap` when you want to remap only some values" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 a\n", "1 b\n", "2 c\n", "3 d\n", "4 only-this-one\n", "5 f\n", "6 g\n", "7 h\n", "8 i\n", "9 j\n", "dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def remapper \n", " (-> (pt/series [:a :b :c :d :e :f :g :h :i :j])\n", " pt/unique\n", " (#(zipmap % %))\n", " (assoc \"e\" \"only-this-one\")))\n", "\n", "(pt/remap\n", " (pt/series [:a :b :c :d :e :f :g :h :i :j])\n", " remapper)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`rolling` lets you calculate statistics on a rolling window basis" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Rolling [window=2,center=False,axis=0]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/rolling (pt/series (range 10)) 2)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 0.5\n", "2 1.5\n", "3 2.5\n", "4 3.5\n", "5 4.5\n", "6 5.5\n", "7 6.5\n", "8 7.5\n", "9 8.5\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> (pt/series (range 10))\n", " (pt/rolling 2)\n", " pt/mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing values\n", "\n", "Dealing with missing values is what really makes the difference between a full-fledged data analysis framework and much more limited solutions.\n", "\n", "panthera gives you many options to try easing the pain a bit" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "3 3.0\n", "dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/dropna (pt/series [1 2 nil 3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that though the name might let you think that we're mutating the original series, this is similar to Clojure's `drop`" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 NaN\n", "3 3.0\n", "dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def my-srs (pt/series [1 2 nil 3]))\n", "\n", "(pt/dropna my-srs)\n", "my-srs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are various ways to check if your data contains some missing observation. The easiest and fastest one is `hasnans?`." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/hasnans? (pt/series (concat (range 1000) [nil])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`hasnans?` is a cached value, but this shouldn't be an issue considering that everything is as immutable as possible.\n", "\n", "This is another potentially slower way to do the same thing" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/all? (pt/not-na? (pt/series (concat (range 1000) [nil]))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course `not-na?` and `all?` have their uses (for instance if you pass the result of `not-na?` to `select-rows` you'll filter NaNs out of the series).\n", "\n", "panthera's workhorse to deal with missing observations is `fill-na` which lets you assign a value to NaNs" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 4.0\n", "dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/fill-na (pt/series [1 2 nil 4]) 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data-frame\n", "\n", "A data-frame is basically a collection of serieses as columns. In other words it's a rectangular data structure akin to a matrix, but while the latter is usually only numeric, data-frames can have mixed column types.\n", "\n", "### Create" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " a b\n", "0 1 2\n", "1 3 4" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to create a data-frame is with a vector of maps where every map is a row, keys are columns names and values, well, are corresponding values.\n", "\n", "As we saw earlier data-frames are a collection of serieses, so we can create one starting from a bunch of them." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 0 1 2\n", "0 1 2 3\n", "1 4 5 6" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/data-frame [(pt/series [1 2 3]) (pt/series [4 5 6])])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " a b\n", "0 1 x\n", "1 2 y\n", "2 3 z" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/data-frame {:a (pt/series [1 2 3]) :b (pt/series [:x :y :z])})" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a int64\n", "b object\n", "dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/dtype (pt/data-frame {:a (pt/series [1 2 3]) :b (pt/series [:x :y :z])}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above we see that panthera doesn't complain about the column `a` having type `int64` and `b` having type `object` and we can keep working on them as much as we want.\n", "\n", "### Indexing and subsetting\n", "\n", "Now we have two dimensions to work with! No worries, it is always possible to operate on both of them. But first let's check all the metadata available to us" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=2, step=1)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def df (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))\n", "\n", "(pt/index df)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b'], dtype='object')" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/names df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, what we saw for serieses works for data-frames as well" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " a b c\n", "0 0 1 2\n", "5 15 16 17" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def df (pt/data-frame (map #(zipmap [:a :b :c] %) (partition 3 (range 30)))))\n", "\n", "(pt/select-rows df [0 5])" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " a b c\n", "2 6 7 8\n", "3 9 10 11\n", "4 12 13 14" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/select-rows df (pt/slice 2 5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new thing is subsetting columns, we can do this by name with `subset-cols`. You can select any number of columns in this way, as long as they are in the given data-frame" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " a c\n", "0 0 2\n", "1 3 5\n", "2 6 8\n", "3 9 11\n", "4 12 14\n", "5 15 17\n", "6 18 20\n", "7 21 23\n", "8 24 26\n", "9 27 29" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/subset-cols df :a :c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Math and stats" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 13.5\n", "b 14.5\n", "c 15.5\n", "dtype: float64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(pt/mean df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Clojure (clojupyter-v0.2.2)", "language": "clojure", "name": "clojupyter" }, "language_info": { "file_extension": ".clj", "mimetype": "text/x-clojure", "name": "clojure", "version": "1.10.0" } }, "nbformat": 4, "nbformat_minor": 2 }