{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started\n", "\n", "This is a tutorial for a couple of new Clojure libraries for Machine Learning and ETL -- part of the tech.ml stack.\n", "\n", "Author: Chris Nuernberger\n", "\n", "Translated to [Nextjournal](https://nextjournal.com/alan/tech-dataset-getting-started): Alan Marazzi\n", "\n", "The API is still alpha; we are putting our efforts into extending and beautifying it. Comments are welcome!\n", "\n", "We work from an excellent article on [advanced regression techniques](https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset).\n", "\n", "The goal is to predict the SalePrice column.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/print-table" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.misc.helper :as helper])\n", "(clojupyter.misc.stacktrace/set-print-stacktraces! 
true)\n", "(helper/add-dependencies '[techascent/tech.ml \"1.0-alpha2-SNAPSHOT\"])\n", ";;Order dependency due to oz including an *ancient* jna version\n", "(helper/add-dependencies '[metasoarous/oz \"1.6.0-alpha2\"])\n", "(require '[oz.notebook.clojupyter :as oz])\n", "\n", "(require '[tech.libs.smile.utils :as smile-utils])\n", "\n", "(require '[tech.ml.dataset.pipeline\n", " ;;We use col a lot, and int map is similar\n", " :refer [col]\n", " :as dsp])\n", "(require '[tech.ml.dataset.pipeline.column-filters :as cf])\n", "(require '[tech.v2.datatype :as dtype])\n", "(require '[tech.v2.datatype.functional :as dfn])\n", "(require '[tech.ml.dataset :as ds])\n", "(require '[tech.ml.dataset.column :as ds-col])\n", "(require '[tech.ml :as ml])\n", "(require '[tech.ml.loss :as loss])\n", "(require '[tech.ml.utils :as ml-utils])\n", "(require '[tech.ml.regression :as ml-regression])\n", "(require '[tech.ml.visualization.vega :as vega-viz])\n", "(require '[clojure.core.matrix :as m])\n", "\n", ";;use tablesaw as dataset backing store\n", "(require '[tech.libs.tablesaw :as tablesaw])\n", "\n", ";;model generators\n", "(require '[tech.libs.xgboost])\n", "(require '[tech.libs.smile.regression])\n", "\n", ";;put/get nippy\n", "(require '[tech.io :as io])\n", "(require '[clojure.pprint :as pp])\n", "(require '[clojure.set :as c-set])\n", "\n", "(import '[java.io File])\n", "\n", "\n", "(defn pp-str\n", " [ds]\n", " (with-out-str\n", " (pp/pprint ds)))\n", "\n", "\n", "(defn print-table\n", " ([ks data]\n", " (->> data\n", " (map (fn [item-map]\n", " (->> item-map\n", " (map (fn [[k v]]\n", " [k (if (or (float? v)\n", " (double? v))\n", " (format \"%.3f\" v)\n", " v)]))\n", " (into {}))))\n", " (pp/print-table ks)))\n", " ([data]\n", " (print-table (sort (keys (first data))) data)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that wasn't particularly pleasant but it at least is something you can cut & paste..." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[81 1460]\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def src-dataset (tablesaw/path->tablesaw-dataset \"data/ames-house-prices/train.csv\"))\n", "\n", "(println (m/shape src-dataset))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape is reversed compared to pandas, which returns [1460 81]. This is intentional: core.matrix is a row-major linear algebra system, while tech.ml.dataset is column-major. To keep conversions between the two sane, the dataset reports its shape as [columns rows], here 81 columns of 1460 rows each." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers\n", "\n", "We first check for outliers, graph them, and then remove them." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "