{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Smoothing\n", "\n", "Smoothing can help to discover trends that otherwise might be hard to see in raw data. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%useLatestDescriptors\n", "%use lets-plot\n", "%use dataframe" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Lets-Plot Kotlin API v.4.4.2. Frontend: Notebook with dynamically loaded JS. Lets-Plot JS v.4.0.0." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LetsPlot.getInfo()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "application/kotlindataframe+json": "{\"nrow\":5,\"ncol\":12,\"columns\":[\"untitled\",\"manufacturer\",\"model\",\"displ\",\"year\",\"cyl\",\"trans\",\"drv\",\"cty\",\"hwy\",\"fl\",\"class\"],\"kotlin_dataframe\":[{\"untitled\":1,\"manufacturer\":\"audi\",\"model\":\"a4\",\"displ\":1.8,\"year\":1999,\"cyl\":4,\"trans\":\"auto(l5)\",\"drv\":\"f\",\"cty\":18,\"hwy\":29,\"fl\":\"p\",\"class\":\"compact\"},{\"untitled\":2,\"manufacturer\":\"audi\",\"model\":\"a4\",\"displ\":1.8,\"year\":1999,\"cyl\":4,\"trans\":\"manual(m5)\",\"drv\":\"f\",\"cty\":21,\"hwy\":29,\"fl\":\"p\",\"class\":\"compact\"},{\"untitled\":3,\"manufacturer\":\"audi\",\"model\":\"a4\",\"displ\":2.0,\"year\":2008,\"cyl\":4,\"trans\":\"manual(m6)\",\"drv\":\"f\",\"cty\":20,\"hwy\":31,\"fl\":\"p\",\"class\":\"compact\"},{\"untitled\":4,\"manufacturer\":\"audi\",\"model\":\"a4\",\"displ\":2.0,\"year\":2008,\"cyl\":4,\"trans\":\"auto(av)\",\"drv\":\"f\",\"cty\":21,\"hwy\":30,\"fl\":\"p\",\"class\":\"compact\"},{\"untitled\":5,\"manufacturer\":\"audi\",\"model\":\"a4\",\"displ\":2.8,\"year\":1999,\"cyl\":6,\"trans\":\"auto(l5)\",\"drv\":\"f\",\"cty\":16,\"hwy\":26,\"fl\":\"p\",\"class\":\"compact\"}]}", "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "\n", "

DataFrame: rowsCount = 5, columnsCount = 12

\n", " \n", " \n", " " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "var mpg_df = DataFrame.readCSV(\"https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv\")\n", "mpg_df.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The default smoothing method is `'linear model'` (or `'lm'`)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "// val dat = (mpg_df.names.map { Pair(it, mpg_df.get(it).values())}).toMap()\n", "val dat = mpg_df.toMap()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val mpg_plot = letsPlot(dat) {x=\"displ\"; y=\"hwy\"} \n", "mpg_plot + geomPoint() + geomSmooth()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `LOESS` model seems to fit the MPG data better than the linear model" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint() + statSmooth(method=\"loess\", size=1.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying smoothing to groups\n", "\n", "Let's map the vehicle drivetrain type (variable `'drv'`) to the color of the points.\n", "\n", "This makes it easy to see that points with the same drivetrain type form groups or clusters. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint {color=\"drv\"} +\n", " statSmooth(method=\"loess\", size=1.0) {color=\"drv\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply a linear model with a 2nd degree polynomial\n", "\n", "As the per-group `LOESS` prediction looks a bit weird, let's try 2nd degree polynomial regression." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint {color=\"drv\"} +\n", " statSmooth(method=\"lm\", deg=2, size=1.0) {color=\"drv\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the `asDiscrete()` function with a numeric data series\n", "\n", "In the previous examples we used a discrete (or categorical) variable `'drv'` to split the data into groups.\n", "\n", "Now let's try to use a numeric variable `'cyl'` for the same purpose." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint {color=\"cyl\"} +\n", " geomSmooth(method=\"lm\", deg=2, size=1.0) {color=\"cyl\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is easy to see that the data wasn't split into groups: the numeric variable is mapped to a continuous color scale, so a single curve is fitted to all points. \n", "`Lets-Plot` offers two solutions in this situation:\n", "\n", " * Use the `group` aesthetic\n", " * Use the `asDiscrete()` function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **group** aesthetic creates the groups explicitly." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint {color=\"cyl\"} +\n", " geomSmooth(method=\"lm\", deg=2, size=1.0) {color=\"cyl\"; group=\"cyl\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **asDiscrete('cyl')** function will \"annotate\" the `'cyl'` variable as `discrete`.\n", "\n", "This creates the groups and assigns a `discrete` color scale instead of a `continuous` one." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mpg_plot + geomPoint {color=\"cyl\"} +\n", " geomSmooth(method=\"lm\", deg=2, size=1.0) {color=asDiscrete(\"cyl\")}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Effect of the `span` parameter on the \"wiggliness\" of the LOESS smoother\n", "\n", "The span is the fraction of points used to fit each local regression.\n", "Small numbers make a wigglier curve, larger numbers make a smoother curve." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import kotlin.math.PI\n", "import kotlin.math.sin\n", "import kotlin.random.Random\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "val n = 150\n", "// Materialize the sequence so the noisy y values are generated exactly once\n", "val x_range = generateSequence( -2 * PI ) { it + 4 * PI / n }.takeWhile { it <= 2 * PI }.toList()\n", "val y_range = x_range.map{ sin( it ) + Random.nextDouble(-0.5, 0.5) }\n", "val df = mapOf(\n", " \"x\" to x_range,\n", " \"y\" to y_range\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val p = ggplot(df) {x=\"x\"; y=\"y\"} + geomPoint(shape=21, fill=\"yellow\", color=\"#8c564b\")\n", "val p1 = p + geomSmooth(method=\"loess\", size=1.5, color=\"#d62728\") + ggtitle(\"default (span = 0.5)\")\n", "val p2 = p + geomSmooth(method=\"loess\", span=.2, size=1.5, color=\"#9467bd\") + ggtitle(\"span = 0.2\")\n", "val p3 = p + geomSmooth(method=\"loess\", span=.7, size=1.5, color=\"#1f77b4\") + ggtitle(\"span = 0.7\")\n", "val p4 = p + geomSmooth(method=\"loess\", span=1, size=1.5, color=\"#2ca02c\") + ggtitle(\"span = 1\")\n", "\n", "GGBunch()\n", ".addPlot(p1, 0, 0, 400, 300)\n", ".addPlot(p2, 400, 0, 400, 300)\n", ".addPlot(p3, 0, 300, 400, 300)\n", ".addPlot(p4, 400, 300, 400, 300)" ] } ], "metadata": { "kernelspec": { "display_name": "Kotlin", "language": "kotlin", "name": "kotlin" }, "language_info": { "codemirror_mode": "text/x-kotlin", "file_extension": ".kt", "mimetype": "text/x-kotlin", "name": "kotlin", "nbconvert_exporter": "", "pygments_lexer": "kotlin", "version": "1.8.20" } }, "nbformat": 4, "nbformat_minor": 4 }