{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tablesaw \n", "\n", "[Tablesaw](https://tablesaw.tech/) is easy to add to the BeakerX Groovy kernel.\n", "Tablesaw makes it easy to transform, summarize, and filter data, and to compute descriptive statistics and run fundamental machine learning algorithms.\n", "\n", "This notebook has some basic demos of how to use Tablesaw, including visualizing the results. It uses the Beaker interactive visualization libraries, but Tablesaw's own plotting APIs also work. The notebook covers basic table manipulation, k-means clustering, and linear regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%classpath add mvn\n", "tech.tablesaw tablesaw-plot 0.11.4\n", "tech.tablesaw tablesaw-smile 0.11.4\n", "tech.tablesaw tablesaw-beakerx 0.11.4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%import tech.tablesaw.aggregate.*\n", "%import tech.tablesaw.api.*\n", "%import tech.tablesaw.api.ml.clustering.*\n", "%import tech.tablesaw.api.ml.regression.*\n", "%import tech.tablesaw.columns.*\n", "\n", "// display Tablesaw tables with BeakerX table display widget\n", "tech.tablesaw.beakerx.TablesawDisplayer.register()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tornadoes = Table.read().csv(\"../resources/data/tornadoes_2014.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//print dataset structure\n", "tornadoes.structure()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//get header names\n", "tornadoes.columnNames()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//displays the row and column counts\n", "tornadoes.shape()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "//displays the first n rows\n", "tornadoes.first(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import static tech.tablesaw.api.QueryHelper.column\n", "tornadoes.structure().selectWhere(column(\"Column Type\").isEqualTo(\"FLOAT\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//summarize the data in each column\n", "tornadoes.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//Mapping operations\n", "def month = tornadoes.dateColumn(\"Date\").month()\n", "tornadoes.addColumn(month);\n", "tornadoes.columnNames()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//Sorting by column\n", "tornadoes.sortOn(\"-Fatalities\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//Descriptive statistics\n", "tornadoes.column(\"Fatalities\").summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//Performing totals and sub-totals\n", "def injuriesByScale = tornadoes.median(\"Injuries\").by(\"Scale\")\n", "injuriesByScale.setName(\"Median injuries by Tornado Scale\")\n", "injuriesByScale" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//Cross Tabs\n", "CrossTab.xCount(tornadoes, tornadoes.categoryColumn(\"State\"), tornadoes.shortColumn(\"Scale\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## K-means clustering\n", "\n", "K-means is the most common form of “centroid” clustering. Unlike classification, clustering is an unsupervised learning method. The categories are not predetermined. Instead, the goal is to search for natural groupings in the dataset, such that the members of each group are similar to each other and different from the members of the other groups. 
The K represents the number of groups to find.\n", "\n", "We’ll use a well-known Scotch Whiskey dataset, which is used to cluster whiskeys according to their taste based on data collected from tasting notes. As always, we start by loading data and printing its structure.\n", "\n", "More description is available at https://jtablesaw.wordpress.com/2016/08/08/k-means-clustering-in-java/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = Table.read().csv(\"../resources/data/whiskey.csv\")\n", "t.structure()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = new Kmeans(\n", " 5,\n", " t.nCol(2), t.nCol(3), t.nCol(4), t.nCol(5), t.nCol(6), t.nCol(7),\n", " t.nCol(8), t.nCol(9), t.nCol(10), t.nCol(11), t.nCol(12), t.nCol(13)\n", ");\n", "\n", "//print cluster formation\n", "model.clustered(t.column(\"Distillery\"));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//print centroids for each cluster\n", "model.labeledCentroids();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//gets the distortion for our model\n", "model.distortion()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def n = t.rowCount();\n", "def kValues = new double[n - 2];\n", "def distortions = new double[n - 2];\n", "\n", "for (int k = 2; k < n; k++) {\n", " kValues[k - 2] = k;\n", " def kmeans = new Kmeans(k,\n", " t.nCol(2), t.nCol(3), t.nCol(4), t.nCol(5), t.nCol(6), t.nCol(7),\n", " t.nCol(8), t.nCol(9), t.nCol(10), t.nCol(11), t.nCol(12), t.nCol(13)\n", " );\n", " distortions[k - 2] = kmeans.distortion();\n", "}\n", "def linearYPlot = new Plot(title: \"K-means clustering demo\", xLabel:\"K\", yLabel: \"distortion\")\n", "linearYPlot << new Line(x: kValues, y: distortions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Play 
(Money)ball with Linear Regression\n", "\n", "In baseball, you make the playoffs by winning more games than your rivals. The number of games the rivals win is out of your control, so the A’s looked instead at how many wins it had historically taken to make the playoffs. They decided that 95 wins would give them a strong chance. Here’s how we might check that assumption in Tablesaw.\n", "\n", "More description is available at https://jtablesaw.wordpress.com/2016/07/31/play-moneyball-data-science-in-tablesaw/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import static tech.tablesaw.api.QueryHelper.column\n", "\n", "baseball = Table.read().csv(\"../resources/data/baseball.csv\");\n", "\n", "// filter to the data available at the start of the 2002 season\n", "moneyball = baseball.selectWhere(column(\"Year\").isLessThan(2002));\n", "wins = moneyball.nCol(\"W\");\n", "year = moneyball.nCol(\"Year\");\n", "playoffs = moneyball.column(\"Playoffs\");\n", "// run difference (RD) = runs scored (RS) - runs allowed (RA)\n", "runDifference = moneyball.shortColumn(\"RS\").subtract(moneyball.shortColumn(\"RA\"));\n", "runDifference.setName(\"RD\");\n", "moneyball.addColumn(runDifference);\n", "\n", "// use a variable name that does not shadow the Plot class\n", "def rdPlot = new Plot(title: \"RD x Wins\", xLabel:\"RD\", yLabel: \"W\")\n", "rdPlot << new Points(x: moneyball.numericColumn(\"RD\").toDoubleArray(), y: moneyball.numericColumn(\"W\").toDoubleArray())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "winsModel = LeastSquares.train(wins, runDifference);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// predict the number of wins for a run difference of 135\n", "def runDiff = new double[1];\n", "runDiff[0] = 135;\n", "def expectedWins = winsModel.predict(runDiff);\n", "runsScored2 = LeastSquares.train(moneyball.nCol(\"RS\"), moneyball.nCol(\"OBP\"), moneyball.nCol(\"SLG\"));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new Histogram(xLabel:\"X\",\n", " 
yLabel:\"Proportion\",\n", " data: Arrays.asList(runsScored2.residuals()), \n", " binCount: 25);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Financial and Economic Data\n", "\n", "You can fetch data from [Quandl](https://www.quandl.com/) and load it directly into Tablesaw." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%classpath add mvn com.jimmoores quandl-tablesaw 2.0.0\n", "%import com.jimmoores.quandl.*\n", "%import com.jimmoores.quandl.tablesaw.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TableSawQuandlSession session = TableSawQuandlSession.create();\n", "Table table = session.getDataSet(DataSetRequest.Builder.of(\"WIKI/AAPL\").build());\n", "// Create a new column containing the year\n", "ShortColumn yearColumn = table.dateColumn(\"Date\").year();\n", "yearColumn.setName(\"Year\");\n", "table.addColumn(yearColumn);\n", "// Create max, min and total volume tables aggregated by year\n", "Table summaryMax = table.groupBy(\"year\").max(\"Adj. Close\");\n", "Table summaryMin = table.groupBy(\"year\").min(\"Adj. Close\");\n", "Table summaryVolume = table.groupBy(\"year\").sum(\"Volume\");\n", "// Create a new table from each of these\n", "summary = Table.create(\"Summary\", summaryMax.column(0), summaryMax.column(1), \n", " summaryMin.column(1), summaryVolume.column(1));\n", "// Add back a DateColumn to the summary...will be used for plotting\n", "DateColumn yearDates = new DateColumn(\"YearDate\");\n", "for(year in summary.column('Year')){\n", " yearDates.append(java.time.LocalDate.of(year,1,1));\n", "}\n", "summary.addColumn(yearDates)\n", "\n", "summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "years = summary.column('YearDate').collect()\n", "\n", "plot = new TimePlot(title: 'Price Chart for AAPL', xLabel: 'Time', yLabel: 'Max [Adj. 
Close]')\n", "plot << new YAxis(label: 'Volume')\n", "plot << new Points(x: years, y: summary.column('Max [Adj. Close]').collect())\n", "plot << new Line(x: years, y: summary.column('Max [Adj. Close]').collect(), color: Color.blue)\n", "plot << new Stems(x: years, y: summary.column('Sum [Volume]').collect(), yAxis: 'Volume')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Groovy", "language": "groovy", "name": "groovy" }, "language_info": { "codemirror_mode": "groovy", "file_extension": ".groovy", "mimetype": "", "name": "Groovy", "nbconverter_exporter": "", "version": "2.4.3" } }, "nbformat": 4, "nbformat_minor": 2 }