{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Started\n", "\n", "This is a tutorial for a couple of new Clojure libraries for Machine Learning and ETL -- part of the tech.ml stack.\n", "\n", "Author: Chris Nuernberger\n", "\n", "Translated to [Nextjournal](https://nextjournal.com/alan/tech-dataset-getting-started): Alan Marazzi\n", "\n", "The API is still alpha; we are putting our efforts into extending and beautifying it. Comments are welcome!\n", "\n", "This tutorial works through an excellent article on [advanced regression techniques](https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset).\n", "\n", "The target is to predict the SalePrice column.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/print-table" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(require '[clojupyter.misc.helper :as helper])\n", "(clojupyter.misc.stacktrace/set-print-stacktraces! 
true)\n", "(helper/add-dependencies '[techascent/tech.ml \"1.0-alpha2-SNAPSHOT\"])\n", ";;Order dependency due to oz including an *ancient* jna version\n", "(helper/add-dependencies '[metasoarous/oz \"1.6.0-alpha2\"])\n", "(require '[oz.notebook.clojupyter :as oz])\n", "\n", "(require '[tech.libs.smile.utils :as smile-utils])\n", "\n", "(require '[tech.ml.dataset.pipeline\n", " ;;We use col a lot, and int map is similar\n", " :refer [col]\n", " :as dsp])\n", "(require '[tech.ml.dataset.pipeline.column-filters :as cf])\n", "(require '[tech.v2.datatype :as dtype])\n", "(require '[tech.v2.datatype.functional :as dfn])\n", "(require '[tech.ml.dataset :as ds])\n", "(require '[tech.ml.dataset.column :as ds-col])\n", "(require '[tech.ml :as ml])\n", "(require '[tech.ml.loss :as loss])\n", "(require '[tech.ml.utils :as ml-utils])\n", "(require '[tech.ml.regression :as ml-regression])\n", "(require '[tech.ml.visualization.vega :as vega-viz])\n", "(require '[clojure.core.matrix :as m])\n", "\n", ";;use tablesaw as dataset backing store\n", "(require '[tech.libs.tablesaw :as tablesaw])\n", "\n", ";;model generators\n", "(require '[tech.libs.xgboost])\n", "(require '[tech.libs.smile.regression])\n", "\n", ";;put/get nippy\n", "(require '[tech.io :as io])\n", "(require '[clojure.pprint :as pp])\n", "(require '[clojure.set :as c-set])\n", "\n", "(import '[java.io File])\n", "\n", "\n", "(defn pp-str\n", " [ds]\n", " (with-out-str\n", " (pp/pprint ds)))\n", "\n", "\n", "(defn print-table\n", " ([ks data]\n", " (->> data\n", " (map (fn [item-map]\n", " (->> item-map\n", " (map (fn [[k v]]\n", " [k (if (or (float? v)\n", " (double? v))\n", " (format \"%.3f\" v)\n", " v)]))\n", " (into {}))))\n", " (pp/print-table ks)))\n", " ([data]\n", " (print-table (sort (keys (first data))) data)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that wasn't particularly pleasant but it at least is something you can cut & paste..." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[81 1460]\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def src-dataset (tablesaw/path->tablesaw-dataset \"data/ames-house-prices/train.csv\"))\n", "\n", "(println (m/shape src-dataset))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape is reversed compared to pandas, which returns [1460 81]. This is intentional: core.matrix is a row-major linear algebra system, while tech.ml.dataset is column-major. To keep conversions between the two sane, the dataset reports its shape as [column-count row-count]." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outliers\n", "\n", "We first check for outliers, graph them, and then remove them." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", " \n", "
\n", " " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(-> [:vega-lite {:data {:values\n", " (-> src-dataset\n", " (ds/select [\"SalePrice\" \"GrLivArea\"] :all)\n", " (ds/->flyweight))}\n", " :mark :point\n", " :encoding {:y {:field \"SalePrice\"\n", " :type :quantitative}\n", " :x {:field \"GrLivArea\"\n", " :type :quantitative}}}]\n", " oz/view!)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", " \n", "
\n", " " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def filtered-ds (dsp/filter src-dataset \"GrLivArea\" #(dfn/< (dsp/col) 4000)))\n", "(-> [:vega-lite {:data {:values\n", " (-> filtered-ds\n", " (ds/select [\"SalePrice\" \"GrLivArea\"] :all)\n", " (ds/->flyweight))}\n", " :mark :point\n", " :encoding {:y {:field \"SalePrice\"\n", " :type :quantitative}\n", " :x {:field \"GrLivArea\"\n", " :type :quantitative}}}]\n", " oz/view!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Pipeline\n", "\n", "We now begin to construct our data processing pipeline. Note that all pipeline operations are available as repl functions from the pipeline namespace." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "#'user/initial-pipeline-from-article" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn initial-pipeline-from-article\n", " [dataset]\n", " (-> dataset\n", " ;;Convert any numeric or boolean columns to be all of one datatype.\n", " (dsp/remove-columns [\"Id\"])\n", " (dsp/->datatype)\n", " (dsp/m= \"SalePrice\" #(dfn/log1p (dsp/col)))\n", " (ds/set-inference-target \"SalePrice\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Fixes\n", "\n", "Whether columns are categorical or not is defined by attributes." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pre-categorical-count 42\n", "post-categorical-count 45\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn more-categorical\n", " [dataset]\n", " (dsp/assoc-metadata dataset [\"MSSubClass\" \"OverallQual\" \"OverallCond\"] :categorical? true))\n", "\n", "(println \"pre-categorical-count\" (count (cf/categorical? 
filtered-ds)))\n", "\n", "(def post-categorical-fix (-> filtered-ds\n", " initial-pipeline-from-article\n", " more-categorical))\n", "\n", "(println \"post-categorical-count\" (count (cf/categorical? post-categorical-fix)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Entries\n", "\n", "Missing data is a theme that will come up again and again. Pandas has great tooling to clean up missing entries and we borrow heavily from them." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pre missing fix #1\n", "({:column-name \"Alley\", :missing-count 1365}\n", " {:column-name \"MasVnrType\", :missing-count 8}\n", " {:column-name \"BsmtQual\", :missing-count 37}\n", " {:column-name \"BsmtCond\", :missing-count 37}\n", " {:column-name \"BsmtExposure\", :missing-count 38}\n", " {:column-name \"BsmtFinType1\", :missing-count 37}\n", " {:column-name \"BsmtFinType2\", :missing-count 38}\n", " {:column-name \"Electrical\", :missing-count 1}\n", " {:column-name \"FireplaceQu\", :missing-count 690}\n", " {:column-name \"GarageType\", :missing-count 81}\n", " {:column-name \"GarageFinish\", :missing-count 81}\n", " {:column-name \"GarageQual\", :missing-count 81}\n", " {:column-name \"GarageCond\", :missing-count 81}\n", " {:column-name \"PoolQC\", :missing-count 1451}\n", " {:column-name \"Fence\", :missing-count 1176}\n", " {:column-name \"MiscFeature\", :missing-count 1402})\n", "post missing fix #1\n", "({:column-name \"Electrical\", :missing-count 1})\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ ";; Impressive patience to come up with this list!!\n", "(defn initial-missing-entries\n", " [dataset]\n", " (-> dataset\n", " ;; Handle missing values for features where median/mean or most common value doesn't\n", " ;; make sense\n", "\n", " ;; Alley : data description says NA means 
\"no alley access\"\n", " (dsp/replace-missing \"Alley\" \"None\")\n", " ;; BedroomAbvGr : NA most likely means 0\n", " (dsp/replace-missing [\"BedroomAbvGr\"\n", " \"BsmtFullBath\"\n", " \"BsmtHalfBath\"\n", " \"BsmtUnfSF\"\n", " \"EnclosedPorch\"\n", " \"Fireplaces\"\n", " \"GarageArea\"\n", " \"GarageCars\"\n", " \"HalfBath\"\n", " ;; KitchenAbvGr : NA most likely means 0\n", " \"KitchenAbvGr\"\n", " \"LotFrontage\"\n", " \"MasVnrArea\"\n", " \"MiscVal\"\n", " ;; OpenPorchSF : NA most likely means no open porch\n", " \"OpenPorchSF\"\n", " \"PoolArea\"\n", " ;; ScreenPorch : NA most likely means no screen porch\n", " \"ScreenPorch\"\n", " ;; TotRmsAbvGrd : NA most likely means 0\n", " \"TotRmsAbvGrd\"\n", " ;; WoodDeckSF : NA most likely means no wood deck\n", " \"WoodDeckSF\"\n", " ]\n", " 0)\n", " ;; BsmtQual etc : data description says NA for basement features is \"no basement\"\n", " (dsp/replace-missing [\"BsmtQual\"\n", " \"BsmtCond\"\n", " \"BsmtExposure\"\n", " \"BsmtFinType1\"\n", " \"BsmtFinType2\"\n", " ;; Fence : data description says NA means \"no fence\"\n", " \"Fence\"\n", " ;; FireplaceQu : data description says NA means \"no\n", " ;; fireplace\"\n", "\n", " \"FireplaceQu\"\n", " ;; GarageType etc : data description says NA for garage\n", " ;; features is \"no garage\"\n", " \"GarageType\"\n", " \"GarageFinish\"\n", " \"GarageQual\"\n", " \"GarageCond\"\n", " ;; MiscFeature : data description says NA means \"no misc\n", " ;; feature\"\n", " \"MiscFeature\"\n", " ;; PoolQC : data description says NA means \"no pool\"\n", " \"PoolQC\"\n", " ]\n", " \"No\")\n", " (dsp/replace-missing \"CentralAir\" \"N\")\n", " (dsp/replace-missing [\"Condition1\"\n", " \"Condition2\"]\n", " \"Norm\")\n", " ;; Condition : NA most likely means Normal\n", " ;; EnclosedPorch : NA most likely means no enclosed porch\n", " ;; External stuff : NA most likely means average\n", " (dsp/replace-missing [\"ExterCond\"\n", " \"ExterQual\"\n", " ;; HeatingQC : NA most likely 
means typical\n", " \"HeatingQC\"\n", " ;; KitchenQual : NA most likely means typical\n", " \"KitchenQual\"\n", " ]\n", " \"TA\")\n", " ;; Functional : data description says NA means typical\n", " (dsp/replace-missing \"Functional\" \"Typ\")\n", " ;; LotShape : NA most likely means regular\n", " (dsp/replace-missing \"LotShape\" \"Reg\")\n", " ;; MasVnrType : NA most likely means no veneer\n", " (dsp/replace-missing \"MasVnrType\" \"None\")\n", " ;; PavedDrive : NA most likely means not paved\n", " (dsp/replace-missing \"PavedDrive\" \"N\")\n", " (dsp/replace-missing \"SaleCondition\" \"Normal\")\n", " (dsp/replace-missing \"Utilities\" \"AllPub\")))\n", "\n", "(println \"pre missing fix #1\")\n", "(pp/pprint (ds/columns-with-missing-seq post-categorical-fix))\n", "\n", "(def post-missing (initial-missing-entries post-categorical-fix))\n", "\n", "(println \"post missing fix #1\")\n", "\n", "(pp/pprint (ds/columns-with-missing-seq post-missing))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String->Number\n", "\n", "\n", "We need to convert string data into numbers somehow. One method is to build a lookup table such that 1 string column gets converted into 1 numeric column. The exact encoding of these strings can be very important to communicate semantic information from the dataset to the ml system. We remember all these mappings because we have to use them later. They get stored both in the recorded pipeline and in the options map so we can reverse-map label values back into their categorical initial values." 
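, "\n",
"\n",
"As a rough illustration of the lookup-table idea, here is a minimal sketch in plain Clojure. It does not use `dsp/string->number`; the names `quality->int`, `int->quality`, and `encode-column` are hypothetical and exist only for this example.\n",
"\n",
"```clojure\n",
"(require '[clojure.set :as c-set])\n",
"\n",
";; Hypothetical lookup table for one quality-style column.\n",
"(def quality->int {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5})\n",
"\n",
";; Keep the inverse table so numeric labels can be mapped back to strings.\n",
"(def int->quality (c-set/map-invert quality->int))\n",
"\n",
";; Replace every string in values with its number from table.\n",
"(defn encode-column\n",
"  [table values]\n",
"  (mapv table values))\n",
"\n",
"(encode-column quality->int [\"TA\" \"Gd\" \"No\"]) ;; => [3 4 0]\n",
"(int->quality 4) ;; => \"Gd\"\n",
"```\n",
"\n",
"Because the order of the numbers carries meaning (\"Ex\" is greater than \"Gd\"), an explicit table like this passes ordinal information to the model that an arbitrary encoding would lose."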
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"PoolQC\" {\"No\" 0, \"Fa\" 1, \"TA\" 2, \"Gd\" 3, \"Ex\" 4},\n", " \"BsmtCond\" {\"No\" 0, \"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"GarageQual\" {\"No\" 0, \"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"Alley\" {\"Grvl\" 1, \"Pave\" 2, \"None\" 0},\n", " \"LandSlope\" {\"Sev\" 1, \"Mod\" 2, \"Gtl\" 3},\n", " \"PavedDrive\" {\"N\" 0, \"P\" 1, \"Y\" 2},\n", " \"BsmtFinType2\"\n", " {\"No\" 0, \"Unf\" 1, \"LwQ\" 2, \"Rec\" 3, \"BLQ\" 4, \"ALQ\" 5, \"GLQ\" 6},\n", " \"Street\" {\"Grvl\" 1, \"Pave\" 2},\n", " \"ExterQual\" {\"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"BsmtFinType1\"\n", " {\"No\" 0, \"Unf\" 1, \"LwQ\" 2, \"Rec\" 3, \"BLQ\" 4, \"ALQ\" 5, \"GLQ\" 6},\n", " \"FireplaceQu\" {\"No\" 0, \"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"LotShape\" {\"IR3\" 1, \"IR2\" 2, \"IR1\" 3, \"Reg\" 4},\n", " \"HeatingQC\" {\"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"KitchenQual\" {\"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"GarageCond\" {\"No\" 0, \"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"BsmtQual\" {\"No\" 0, \"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"ExterCond\" {\"Po\" 1, \"Fa\" 2, \"TA\" 3, \"Gd\" 4, \"Ex\" 5},\n", " \"Utilities\" {\"ELO\" 1, \"NoSeWa\" 2, \"NoSewr\" 3, \"AllPub\" 4},\n", " \"BsmtExposure\" {\"No\" 0, \"Mn\" 1, \"Av\" 2, \"Gd\" 3},\n", " \"Functional\"\n", " {\"Sal\" 1,\n", " \"Sev\" 2,\n", " \"Maj2\" 3,\n", " \"Maj1\" 4,\n", " \"Mod\" 5,\n", " \"Min2\" 6,\n", " \"Min1\" 7,\n", " \"Typ\" 8}}\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def str->number-initial-map\n", " {\n", " \"Alley\" {\"Grvl\" 1 \"Pave\" 2 \"None\" 0}\n", " \"BsmtCond\" {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " 
\"BsmtExposure\" {\"No\" 0 \"Mn\" 1 \"Av\" 2 \"Gd\" 3}\n", " \"BsmtFinType1\" {\"No\" 0 \"Unf\" 1 \"LwQ\" 2 \"Rec\" 3 \"BLQ\" 4\n", " \"ALQ\" 5 \"GLQ\" 6}\n", " \"BsmtFinType2\" {\"No\" 0 \"Unf\" 1 \"LwQ\" 2 \"Rec\" 3 \"BLQ\" 4\n", " \"ALQ\" 5 \"GLQ\" 6}\n", " \"BsmtQual\" {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"ExterCond\" {\"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"ExterQual\" {\"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"FireplaceQu\" {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"Functional\" {\"Sal\" 1 \"Sev\" 2 \"Maj2\" 3 \"Maj1\" 4 \"Mod\" 5\n", " \"Min2\" 6 \"Min1\" 7 \"Typ\" 8}\n", " \"GarageCond\" {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"GarageQual\" {\"No\" 0 \"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"HeatingQC\" {\"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"KitchenQual\" {\"Po\" 1 \"Fa\" 2 \"TA\" 3 \"Gd\" 4 \"Ex\" 5}\n", " \"LandSlope\" {\"Sev\" 1 \"Mod\" 2 \"Gtl\" 3}\n", " \"LotShape\" {\"IR3\" 1 \"IR2\" 2 \"IR1\" 3 \"Reg\" 4}\n", " \"PavedDrive\" {\"N\" 0 \"P\" 1 \"Y\" 2}\n", " \"PoolQC\" {\"No\" 0 \"Fa\" 1 \"TA\" 2 \"Gd\" 3 \"Ex\" 4}\n", " \"Street\" {\"Grvl\" 1 \"Pave\" 2}\n", " \"Utilities\" {\"ELO\" 1 \"NoSeWa\" 2 \"NoSewr\" 3 \"AllPub\" 4}\n", " })\n", "\n", "\n", "(defn str->number-pipeline\n", " [dataset]\n", " (->> str->number-initial-map\n", " (reduce (fn [dataset str-num-entry]\n", " (apply dsp/string->number dataset str-num-entry))\n", " dataset)))\n", "\n", "(def str-num-dataset (str->number-pipeline post-missing))\n", "\n", "(pp/pprint (ds/dataset-label-map str-num-dataset))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Replacing values\n", "\n", "There is a numeric operator that allows you to map values from one value to another in a column. We now use this to provide simplified versions of some of the columns." 
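, "\n",
"\n",
"The idea of value replacement can be sketched in plain Clojure. This is not the library's implementation; `simplify-quality` and `replace-values` are hypothetical names, and the sketch assumes that values without an entry in the table (such as 0) are left unchanged.\n",
"\n",
"```clojure\n",
";; Hypothetical simplification table: 1-3 -> 1, 4-5 -> 2.\n",
"(def simplify-quality {1 1, 2 1, 3 1, 4 2, 5 2})\n",
"\n",
";; Look each value up; fall back to the value itself when it has no entry.\n",
"(defn replace-values\n",
"  [table values]\n",
"  (mapv #(get table % %) values))\n",
"\n",
"(replace-values simplify-quality [0 3 4 5]) ;; => [0 1 2 2]\n",
"```"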
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#{2.0 4.0 5.0 3.0}\n", "#{2.0 1.0}\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def replace-maps\n", " {\n", " ;; Create new features\n", " ;; 1* Simplifications of existing features\n", " ;; The author implicitly leaves values at zero to be zero, so these maps\n", " ;; are intentionally incomplete\n", " \"SimplOverallQual\" {\"OverallQual\" {1 1, 2 1, 3 1, ;; bad\n", " 4 2, 5 2, 6 2, ;; average\n", " 7 3, 8 3, 9 3, 10 3 ;; good\n", " }}\n", " \"SimplOverallCond\" {\"OverallCond\" {1 1, 2 1, 3 1, ;; bad\n", " 4 2, 5 2, 6 2, ;; average\n", " 7 3, 8 3, 9 3, 10 3 ;; good\n", " }}\n", " \"SimplPoolQC\" {\"PoolQC\" {1 1, 2 1, ;; average\n", " 3 2, 4 2 ;; good\n", " }}\n", " \"SimplGarageCond\" {\"GarageCond\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplGarageQual\" {\"GarageQual\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplFireplaceQu\" {\"FireplaceQu\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplFunctional\" {\"Functional\" {1 1, 2 1, ;; bad\n", " 3 2, 4 2, ;; major\n", " 5 3, 6 3, 7 3, ;; minor\n", " 8 4 ;; typical\n", " }}\n", " \"SimplKitchenQual\" {\"KitchenQual\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplHeatingQC\" {\"HeatingQC\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplBsmtFinType1\" {\"BsmtFinType1\" {1 1, ;; unfinished\n", " 2 1, 3 1, ;; rec room\n", " 4 2, 5 2, 6 2 ;; living quarters\n", " }}\n", " \"SimplBsmtFinType2\" {\"BsmtFinType2\" {1 1, ;; unfinished\n", " 2 1, 3 1, ;; rec room\n", " 4 2, 5 2, 6 2 ;; living quarters\n", " }}\n", " \"SimplBsmtCond\" {\"BsmtCond\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; 
good\n", " }}\n", " \"SimplBsmtQual\" {\"BsmtQual\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplExterCond\" {\"ExterCond\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}\n", " \"SimplExterQual\" {\"ExterQual\" {1 1, ;; bad\n", " 2 1, 3 1, ;; average\n", " 4 2, 5 2 ;; good\n", " }}})\n", "\n", "\n", "(defn simplifications\n", " [dataset]\n", " (->> replace-maps\n", " (reduce (fn [dataset [target-name coldata-map]]\n", " (let [[col-name replace-data] (first coldata-map)]\n", " (dsp/m= dataset target-name\n", " #(dsp/int-map replace-data (dsp/col col-name)\n", " :not-strict? true))))\n", " dataset)))\n", "\n", "(def replace-dataset (simplifications str-num-dataset))\n", "\n", "(pp/pprint (-> (ds/column str-num-dataset \"KitchenQual\")\n", " (ds-col/unique)))\n", "\n", "(pp/pprint (-> (ds/column replace-dataset \"SimplKitchenQual\")\n", " (ds-col/unique)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Combinations\n", "\n", "We create a set of simple linear combinations that derive from our semantic understanding of the dataset." 
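, "\n",
"\n",
"Each combination is an elementwise arithmetic expression over whole columns. A minimal plain-Clojure sketch of one such feature, a total bathroom count where half baths count as 0.5, might look like this. The vectors hold toy values and the var names are hypothetical.\n",
"\n",
"```clojure\n",
";; Toy columns, one vector per column.\n",
"(def bsmt-full-bath [1 0 1])\n",
"(def bsmt-half-bath [0 1 0])\n",
"(def full-bath [2 2 1])\n",
"(def half-bath [1 0 0])\n",
"\n",
";; Elementwise sum: full baths count 1, half baths count 0.5.\n",
"(def total-bath\n",
"  (mapv (fn [bf bh f h] (+ bf (* 0.5 bh) f (* 0.5 h)))\n",
"        bsmt-full-bath bsmt-half-bath full-bath half-bath))\n",
"\n",
"total-bath ;; => [3.5 2.5 2.0]\n",
"```"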
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train.csv [10 5]:\n", "\n", "| TotalBath | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath |\n", "|-----------+--------------+--------------+----------+----------|\n", "| 3.500 | 1.000 | 0.000 | 2.000 | 1.000 |\n", "| 2.500 | 0.000 | 1.000 | 2.000 | 0.000 |\n", "| 3.500 | 1.000 | 0.000 | 2.000 | 1.000 |\n", "| 2.000 | 1.000 | 0.000 | 1.000 | 0.000 |\n", "| 3.500 | 1.000 | 0.000 | 2.000 | 1.000 |\n", "| 2.500 | 1.000 | 0.000 | 1.000 | 1.000 |\n", "| 3.000 | 1.000 | 0.000 | 2.000 | 0.000 |\n", "| 3.500 | 1.000 | 0.000 | 2.000 | 1.000 |\n", "| 2.000 | 0.000 | 0.000 | 2.000 | 0.000 |\n", "| 2.000 | 1.000 | 0.000 | 1.000 | 0.000 |\n", "\n", "train.csv [10 3]:\n", "\n", "| AllSF | GrLivArea | TotalBsmtSF |\n", "|----------+-----------+-------------|\n", "| 2566.000 | 1710.000 | 856.000 |\n", "| 2524.000 | 1262.000 | 1262.000 |\n", "| 2706.000 | 1786.000 | 920.000 |\n", "| 2473.000 | 1717.000 | 756.000 |\n", "| 3343.000 | 2198.000 | 1145.000 |\n", "| 2158.000 | 1362.000 | 796.000 |\n", "| 3380.000 | 1694.000 | 1686.000 |\n", "| 3197.000 | 2090.000 | 1107.000 |\n", "| 2726.000 | 1774.000 | 952.000 |\n", "| 2068.000 | 1077.000 | 991.000 |\n", "\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn linear-combinations\n", " [dataset]\n", " (-> dataset\n", " (dsp/m= \"OverallGrade\" #(dfn/* (col \"OverallQual\") (col \"OverallCond\")))\n", " ;; Overall quality of the garage\n", " (dsp/m= \"GarageGrade\" #(dfn/* (col \"GarageQual\") (col \"GarageCond\")))\n", " ;; Overall quality of the exterior\n", " (dsp/m= \"ExterGrade\"#(dfn/* (col \"ExterQual\") (col \"ExterCond\")))\n", " ;; Overall kitchen score\n", " (dsp/m= \"KitchenScore\" #(dfn/* (col \"KitchenAbvGr\") (col \"KitchenQual\")))\n", " ;; Overall fireplace score\n", " (dsp/m= \"FireplaceScore\" 
#(dfn/* (col \"Fireplaces\") (col \"FireplaceQu\")))\n", " ;; Overall garage score\n", " (dsp/m= \"GarageScore\" #(dfn/* (col \"GarageArea\") (col \"GarageQual\")))\n", " ;; Overall pool score\n", " (dsp/m= \"PoolScore\" #(dfn/* (col \"PoolArea\") (col \"PoolQC\")))\n", " ;; Simplified overall quality of the house\n", " (dsp/m= \"SimplOverallGrade\" #(dfn/* (col \"SimplOverallQual\")\n", " (col \"SimplOverallCond\")))\n", " ;; Simplified overall quality of the exterior\n", " (dsp/m= \"SimplExterGrade\" #(dfn/* (col \"SimplExterQual\") (col \"SimplExterCond\")))\n", " ;; Simplified overall pool score\n", " (dsp/m= \"SimplPoolScore\" #(dfn/* (col \"PoolArea\") (col \"SimplPoolQC\")))\n", " ;; Simplified overall garage score\n", " (dsp/m= \"SimplGarageScore\" #(dfn/* (col \"GarageArea\") (col \"SimplGarageQual\")))\n", " ;; Simplified overall fireplace score\n", " (dsp/m= \"SimplFireplaceScore\" #(dfn/* (col \"Fireplaces\") (col \"SimplFireplaceQu\")))\n", " ;; Simplified overall kitchen score\n", " (dsp/m= \"SimplKitchenScore\" #(dfn/* (col \"KitchenAbvGr\" )\n", " (col \"SimplKitchenQual\")))\n", " ;; Total number of bathrooms\n", " (dsp/m= \"TotalBath\" #(dfn/+ (col \"BsmtFullBath\") (dfn/* 0.5 (col \"BsmtHalfBath\"))\n", " (col \"FullBath\") (dfn/* 0.5 (col \"HalfBath\"))))\n", " ;; Total SF for house (incl. 
basement)\n", " (dsp/m= \"AllSF\" #(dfn/+ (col \"GrLivArea\") (col \"TotalBsmtSF\")))\n", " ;; Total SF for 1st + 2nd floors\n", " (dsp/m= \"AllFlrsSF\" #(dfn/+ (col \"1stFlrSF\") (col \"2ndFlrSF\")))\n", " ;; Total SF for porch\n", " (dsp/m= \"AllPorchSF\" #(dfn/+ (col \"OpenPorchSF\") (col \"EnclosedPorch\")\n", " (col \"3SsnPorch\") (col \"ScreenPorch\")))\n", " ;; Encode MasVnrType\n", " (dsp/string->number \"MasVnrType\" [\"None\" \"BrkCmn\" \"BrkFace\" \"CBlock\" \"Stone\"])\n", " (dsp/m= \"HasMasVnr\" #(dfn/not-eq (col \"MasVnrType\") 0))))\n", "\n", "\n", "(def linear-combined-ds (linear-combinations replace-dataset))\n", "\n", "\n", "\n", "(let [print-columns [\"TotalBath\" \"BsmtFullBath\" \"BsmtHalfBath\"\n", " \"FullBath\" \"HalfBath\"]]\n", " (println (ds/select linear-combined-ds print-columns (range 10))))\n", "\n", "(let [print-columns [\"AllSF\" \"GrLivArea\" \"TotalBsmtSF\"]]\n", " (println (ds/select linear-combined-ds print-columns (range 10))))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Table\n", "Let's check the correlations between the various columns and the target column (SalePrice). 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING - excluding columns with missing values:\n", " [{:column-name Electrical, :missing-count 1}]\n", "WARNING - excluding non-numeric columns:\n", " [MSZoning LandContour LotConfig Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle RoofMatl Exterior1st Exterior2nd Foundation Heating Electrical GarageType GarageFinish Fence MiscFeature SaleType SaleCondition]\n", "\n", "| :pandas | :tech.ml.dataset |\n", "|----------------------------+-----------------------------------------|\n", "| [\"SalePrice\" 1.0] | [\"SalePrice\" 1.0] |\n", "| [\"OverallQual\" 0.819] | [\"OverallQual\" 0.819240311620789] |\n", "| [\"AllSF\" 0.817] | [\"AllSF\" 0.8172719096463545] |\n", "| [\"AllFlrsSF\" 0.729] | [\"AllFlrsSF\" 0.7294213272894039] |\n", "| [\"GrLivArea\" 0.719] | [\"GrLivArea\" 0.7188444008280218] |\n", "| [\"SimplOverallQual\" 0.708] | [\"SimplOverallQual\" 0.70793366139543] |\n", "| [\"ExterQual\" 0.681] | [\"ExterQual\" 0.6809467113796699] |\n", "| [\"GarageCars\" 0.68] | [\"GarageCars\" 0.6804076538001473] |\n", "| [\"TotalBath\" 0.673] | [\"TotalBath\" 0.6729288592505422] |\n", "| [\"KitchenQual\" 0.667] | [\"KitchenQual\" 0.6671738265720056] |\n", "| [\"GarageScore\" 0.657] | [\"GarageScore\" 0.6568216258465022] |\n", "| [\"GarageArea\" 0.655] | [\"GarageArea\" 0.6552115300468117] |\n", "| [\"TotalBsmtSF\" 0.642] | [\"TotalBsmtSF\" 0.6415527990410921] |\n", "| [\"SimplExterQual\" 0.636] | [\"SimplExterQual\" 0.6355504445439201] |\n", "| [\"SimplGarageScore\" 0.631] | [\"SimplGarageScore\" 0.6308022446817723] |\n", "| [\"BsmtQual\" 0.615] | [\"BsmtQual\" 0.6152245192784745] |\n", "| [\"1stFlrSF\" 0.614] | [\"1stFlrSF\" 0.61374181150233] |\n", "| [\"SimplKitchenQual\" 0.61] | [\"SimplKitchenQual\" 0.6101423001972696] |\n", "| [\"OverallGrade\" 0.604] | [\"OverallGrade\" 0.6042910598186415] |\n", "| [\"SimplBsmtQual\" 
0.594] | [\"SimplBsmtQual\" 0.5936507179796586] |\n", "WARNING - excluding columns with missing values:\n", " [{:column-name Electrical, :missing-count 1}]\n", "WARNING - excluding non-numeric columns:\n", " [MSZoning LandContour LotConfig Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle RoofMatl Exterior1st Exterior2nd Foundation Heating Electrical GarageType GarageFinish Fence MiscFeature SaleType SaleCondition]\n" ] }, { "data": { "text/plain": [ "#'user/tech-ml-correlations" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def article-correlations\n", " ;;Default for pandas is pearson.\n", " ;; Find most important features relative to target\n", " (->> {\"SalePrice\" 1.000\n", " \"OverallQual\" 0.819\n", " \"AllSF\" 0.817\n", " \"AllFlrsSF\" 0.729\n", " \"GrLivArea\" 0.719\n", " \"SimplOverallQual\" 0.708\n", " \"ExterQual\" 0.681\n", " \"GarageCars\" 0.680\n", " \"TotalBath\" 0.673\n", " \"KitchenQual\" 0.667\n", " \"GarageScore\" 0.657\n", " \"GarageArea\" 0.655\n", " \"TotalBsmtSF\" 0.642\n", " \"SimplExterQual\" 0.636\n", " \"SimplGarageScore\" 0.631\n", " \"BsmtQual\" 0.615\n", " \"1stFlrSF\" 0.614\n", " \"SimplKitchenQual\" 0.610\n", " \"OverallGrade\" 0.604\n", " \"SimplBsmtQual\" 0.594\n", " \"FullBath\" 0.591\n", " \"YearBuilt\" 0.589\n", " \"ExterGrade\" 0.587\n", " \"YearRemodAdd\" 0.569\n", " \"FireplaceQu\" 0.547\n", " \"GarageYrBlt\" 0.544\n", " \"TotRmsAbvGrd\" 0.533\n", " \"SimplOverallGrade\" 0.527\n", " \"SimplKitchenScore\" 0.523\n", " \"FireplaceScore\" 0.518\n", " \"SimplBsmtCond\" 0.204\n", " \"BedroomAbvGr\" 0.204\n", " \"AllPorchSF\" 0.199\n", " \"LotFrontage\" 0.174\n", " \"SimplFunctional\" 0.137\n", " \"Functional\" 0.136\n", " \"ScreenPorch\" 0.124\n", " \"SimplBsmtFinType2\" 0.105\n", " \"Street\" 0.058\n", " \"3SsnPorch\" 0.056\n", " \"ExterCond\" 0.051\n", " \"PoolArea\" 0.041\n", " \"SimplPoolScore\" 0.040\n", " \"SimplPoolQC\" 0.040\n", " \"PoolScore\" 0.040\n", " 
\"PoolQC\" 0.038\n", " \"BsmtFinType2\" 0.016\n", " \"Utilities\" 0.013\n", " \"BsmtFinSF2\" 0.006\n", " \"BsmtHalfBath\" -0.015\n", " \"MiscVal\" -0.020\n", " \"SimplOverallCond\" -0.028\n", " \"YrSold\" -0.034\n", " \"OverallCond\" -0.037\n", " \"LowQualFinSF\" -0.038\n", " \"LandSlope\" -0.040\n", " \"SimplExterCond\" -0.042\n", " \"KitchenAbvGr\" -0.148\n", " \"EnclosedPorch\" -0.149\n", " \"LotShape\" -0.286\n", " }\n", " (sort-by second >)\n", " ))\n", "\n", "(def tech-ml-correlations (get (ds/correlation-table\n", " linear-combined-ds\n", " :pearson)\n", " \"SalePrice\"))\n", "\n", "(pp/print-table (map #(zipmap [:pandas :tech.ml.dataset]\n", " [%1 %2])\n", " (take 20 article-correlations)\n", " (take 20 tech-ml-correlations)))\n", "\n", "(def tech-ml-correlations (get (ds/correlation-table \n", " linear-combined-ds \n", " :pearson) \n", " \"SalePrice\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Polynomial Combinations\n", "\n", "We now extend the power of our linear models to be effectively polynomial models for a subset of the columns. We do this using the correlation table to indicate which columns are worth it (the author used the top 10)." 
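, "\n",
"\n",
"For a single value, the derived features are just the square, the cube, and the square root. A minimal plain-Clojure sketch follows; `polynomial-features` is a hypothetical helper, not part of the library, and the keys only mirror the `-s2`, `-s3`, and `-sqrt` column suffixes used in this notebook.\n",
"\n",
"```clojure\n",
";; Derive squared, cubed, and square-root features from one numeric value.\n",
"(defn polynomial-features\n",
"  [x]\n",
"  (let [d (double x)]\n",
"    {:s2 (Math/pow d 2.0)\n",
"     :s3 (Math/pow d 3.0)\n",
"     :sqrt (Math/sqrt d)}))\n",
"\n",
"(polynomial-features 4)\n",
";; => {:s2 16.0, :s3 64.0, :sqrt 2.0}\n",
"```\n",
"\n",
"Adding these columns lets a model that is linear in its inputs fit curved relationships in the original feature."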
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train.csv [10 4]:\n", "\n", "| OverallQual | OverallQual-s2 | OverallQual-s3 | OverallQual-sqrt |\n", "|-------------+----------------+----------------+------------------|\n", "| 7.000 | 49.000 | 343.000 | 2.646 |\n", "| 6.000 | 36.000 | 216.000 | 2.449 |\n", "| 7.000 | 49.000 | 343.000 | 2.646 |\n", "| 7.000 | 49.000 | 343.000 | 2.646 |\n", "| 8.000 | 64.000 | 512.000 | 2.828 |\n", "| 5.000 | 25.000 | 125.000 | 2.236 |\n", "| 8.000 | 64.000 | 512.000 | 2.828 |\n", "| 7.000 | 49.000 | 343.000 | 2.646 |\n", "| 7.000 | 49.000 | 343.000 | 2.646 |\n", "| 5.000 | 25.000 | 125.000 | 2.236 |\n", "\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn polynomial-combinations\n", " [dataset correlation-table]\n", " (let [correlation-colnames (->> correlation-table\n", " (drop 1)\n", " (take 10)\n", " (map first))]\n", " (->> correlation-colnames\n", " (reduce (fn [dataset colname]\n", " (-> dataset\n", " (dsp/m= (str colname \"-s2\") #(dfn/pow (col colname) 2))\n", " (dsp/m= (str colname \"-s3\") #(dfn/pow (col colname) 3))\n", " (dsp/m= (str colname \"-sqrt\") #(dfn/sqrt (col colname)))))\n", " dataset))))\n", "\n", "(def poly-data (-> (polynomial-combinations linear-combined-ds tech-ml-correlations)\n", " dsp/string->number))\n", "\n", "\n", "(println (ds/select poly-data\n", " [\"OverallQual\"\n", " \"OverallQual-s2\"\n", " \"OverallQual-s3\"\n", " \"OverallQual-sqrt\"]\n", " (range 10)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numeric Vs. Categorical\n", "\n", "The article considers anything non-numeric to be categorical. This is a point on which the tech.ml.dataset system differs. For tech, any column can be considered categorical and the underlying datatype does not change this definition. 
Earlier, the article converted numeric columns to strings to mark them as categorical, whereas we just set metadata.\n", "\n", "This, along with parsing differences between tablesaw and pandas, leads to different outcomes in the next section." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "numeric-features 97\n", "categorical-features 45\n", "inference targets (SalePrice)\n", "({:name \"MoSold\",\n", " :size 1456,\n", " :datatype :float64,\n", " :column-type :feature}\n", " {:name \"CentralAir\",\n", " :size 1456,\n", " :datatype :float64,\n", " :column-type :feature})\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def numerical-features (cf/numeric-and-non-categorical-and-not-target poly-data))\n", "(def categorical-features (dsp/with-ds poly-data\n", " (cf/and #(cf/not cf/target?)\n", " #(cf/not numerical-features))))\n", "\n", "\n", "(println \"numeric-features\" (count numerical-features))\n", "\n", "(println \"categorical-features\" (count categorical-features))\n", "\n", "(println \"inference targets\" (cf/target? 
poly-data))\n", "\n", ";;These are the categorical features I printed out when using pandas.\n", "(pp/pprint (->> (c-set/difference\n", " (set [\"MSSubClass\", \"MSZoning\", \"Alley\", \"LandContour\", \"LotConfig\",\n", " \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\",\n", " \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\",\n", " \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\",\n", " \"CentralAir\",\n", " \"Electrical\", \"GarageType\", \"GarageFinish\", \"Fence\",\n", " \"MiscFeature\",\n", " \"MoSold\", \"SaleType\", \"SaleCondition\"])\n", " (set categorical-features))\n", " (map (comp ds-col/metadata (partial ds/column poly-data)))))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nil\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn fix-all-missing\n", " [dataset]\n", " (-> dataset\n", " ;;Fix any remaining numeric columns by using the median.\n", " (dsp/replace-missing cf/numeric? #(dfn/median (col)))\n", " ;;Fix any string columns by using 'NA'.\n", " (dsp/replace-missing cf/string? \"NA\")\n", " (dsp/string->number)))\n", "\n", "\n", "(def missing-fixed (fix-all-missing poly-data))\n", "\n", "(pp/pprint (ds/columns-with-missing-seq missing-fixed))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training And Viewing Results\n", "\n", "Let's set up a simple gridsearch and view the errors and residuals." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gridsearching dataset train.csv model :smile.regression/lasso\n", "Gridsearching dataset train.csv model :xgboost/regression\n", "Gridsearching dataset train.csv model :smile.regression/ridge\n", "Got 3 Trained results\n" ] }, { "data": { "text/html": [ "

Missing Fixed. Predictions and Residuals plots for: :smile.regression/lasso-0.1181, :smile.regression/ridge-0.1207, :xgboost/regression-0.1400
" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn render-results\n", " [title gridsearch-results]\n", " [:div\n", " [:h3 title]\n", " (vega-viz/accuracy-graph gridsearch-results :y-scale [0.10, 0.20])])\n", "\n", "\n", "(defn train-regressors\n", " [dataset-name dataset loss-fn & [options]]\n", " (let [base-gridsearch-systems [:smile.regression/lasso\n", " :xgboost/regression\n", " :smile.regression/ridge]\n", " trained-results (ml-regression/train-regressors\n", " dataset options\n", " :loss-fn loss-fn\n", " :gridsearch-regression-systems base-gridsearch-systems)]\n", " (println \"Got\" (count trained-results) \"Trained results\")\n", " (vec trained-results)))\n", "\n", "\n", "(defn train-graph-regressors\n", " [dataset-name dataset loss-fn & [options]]\n", " (let [trained-results (train-regressors dataset-name dataset loss-fn options)]\n", " (->> (apply concat [(render-results dataset-name trained-results)]\n", " (->> trained-results\n", " (sort-by :average-loss)\n", " (map (fn [model-result]\n", " [[:div\n", " [:h3 (format \"%s-%.4f\"\n", " (get-in model-result [:options :model-type])\n", " (:average-loss model-result))]\n", " [:div\n", " [:span\n", " [:h4 \"Predictions\"]\n", " (vega-viz/graph-regression-verification-results\n", " model-result :target-key :predictions\n", " :y-scale [10 14]\n", " :x-scale [10 14])]\n", " [:span\n", " [:h4 \"Residuals\"]\n", " (vega-viz/graph-regression-verification-results\n", " model-result :target-key :residuals\n", " :y-scale [10 14]\n", " :x-scale [-1 1])]]]]))))\n", " (into [:div]))))\n", "\n", "(oz/view! (train-graph-regressors \"Missing Fixed\" missing-fixed loss/rmse))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Skew\n", "\n", "Here is where things go a bit awry. We attempt to fix skew. The attempted fix barely reduces the actual skew in the dataset. We will talk about what went wrong. 
We also begin running models at each stage to see what effect these steps have.\n", "\n", "We start setting the target in the options for the pipeline. This allows the rest of the system downstream (training) to automatically infer the feature columns." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pre-fix skew counts 70\n", "Post-fix skew counts 43\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn skew-column-filter\n", " [& [dataset]]\n", " (dsp/with-ds (cf/check-dataset dataset)\n", " (cf/and cf/numeric?\n", " #(cf/not \"SalePrice\")\n", " #(cf/not cf/categorical?)\n", " (fn []\n", " (cf/> #(dfn/abs (dfn/skewness (col)))\n", " 0.5)))))\n", "\n", "(def skew-fixed (-> (dsp/m= missing-fixed\n", " skew-column-filter\n", " #(dfn/log1p (col)))))\n", "\n", "(println \"Pre-fix skew counts\" (count (skew-column-filter missing-fixed)))\n", "\n", "(println \"Post-fix skew counts\" (count (skew-column-filter skew-fixed)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That didn't work. Or at least it barely did. What happened?" 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "| :column-name | :before-skew | :after-skew | :before-mean | :after-mean |\n", "|-------------------+--------------+-------------+-----------------+-------------|\n", "| AllSF-s2 | 1.743 | -0.514 | 7117066.152 | 15.603 |\n", "| SimplBsmtFinType2 | 0.799 | -2.014 | 1.019 | 0.693 |\n", "| ExterQual-s2 | 1.153 | 0.533 | 11.832 | 2.506 |\n", "| GarageGrade | -2.087 | -3.507 | 8.391 | 2.155 |\n", "| PoolArea | 17.523 | 17.006 | 2.056 | 0.022 |\n", "| SimplExterQual | 0.547 | 0.547 | 1.368 | 0.842 |\n", "| BsmtFinSF2 | 4.249 | 2.519 | 46.677 | 0.657 |\n", "| YearBuilt | -0.610 | -0.638 | 1971.185 | 7.587 |\n", "| OverallQual-s2 | 0.819 | -0.696 | 38.946 | 3.591 |\n", "| LowQualFinSF | 8.999 | 7.450 | 5.861 | 0.100 |\n", "| OverallQual-s3 | 1.410 | -0.775 | 260.425 | 5.345 |\n", "| AllSF | 0.652 | -0.513 | 2557.161 | 7.802 |\n", "| SimplBsmtCond | 0.827 | -2.020 | 1.019 | 0.694 |\n", "| SimplBsmtFinType1 | -0.509 | -1.143 | 1.512 | 0.893 |\n", "| GarageCars-s2 | 1.024 | -0.502 | 3.670 | 1.377 |\n", "| ExterQual-sqrt | 0.650 | 0.532 | 1.836 | 1.041 |\n", "| GarageScore-s3 | 5.273 | -3.536 | 4719134529.977 | 20.486 |\n", "| SimplPoolQC | 18.880 | 18.004 | 0.005 | 0.003 |\n", "| AllPorchSF | 2.009 | -0.510 | 86.757 | 3.107 |\n", "| ExterQual-s3 | 1.538 | 0.501 | 42.491 | 3.654 |\n", "| KitchenAbvGr | 4.481 | 3.863 | 1.047 | 0.712 |\n", "| GarageScore-s2 | 2.406 | -3.536 | 2425464.701 | 13.658 |\n", "| CentralAir | -3.524 | -3.524 | 0.935 | 0.648 |\n", "| 3SsnPorch | 10.290 | 7.724 | 3.419 | 0.086 |\n", "| SimplExterGrade | 1.538 | 0.811 | 1.497 | 0.887 |\n", "| EnclosedPorch | 3.084 | 2.108 | 22.014 | 0.700 |\n", "| MiscVal | 24.443 | 5.163 | 43.609 | 0.234 |\n", "| AllSF-s3 | 3.255 | -0.514 | 21441766385.821 | 23.405 |\n", "| PoolScore | 20.863 | 17.178 | 4.916 | 0.024 |\n", "| SimplGarageQual | -2.111 | -3.274 | 0.956 | 0.659 |\n", 
"| BsmtHalfBath | 4.129 | 3.956 | 0.057 | 0.039 |\n", "| GarageCars-sqrt | -1.839 | -2.635 | 1.271 | 0.801 |\n", "| SimplFunctional | -4.348 | -4.961 | 3.917 | 1.590 |\n", "| BsmtFinSF1 | 0.745 | -0.616 | 436.991 | 4.220 |\n", "| GarageScore-sqrt | -1.496 | -3.558 | 35.822 | 3.440 |\n", "| SimplGarageCond | -2.612 | -3.470 | 0.952 | 0.658 |\n", "| ScreenPorch | 4.116 | 3.145 | 15.102 | 0.412 |\n", "| SimplHeatingQC | -0.732 | -0.732 | 1.672 | 0.965 |\n", "| BsmtUnfSF | 0.922 | -2.183 | 566.990 | 5.646 |\n", "| HalfBath | 0.684 | 0.574 | 0.381 | 0.262 |\n", "| SimplPoolScore | 19.744 | 17.074 | 3.310 | 0.023 |\n", "| SimplExterCond | 2.627 | 2.627 | 1.102 | 0.735 |\n", "| SimplGarageScore | 0.650 | -3.458 | 478.343 | 5.813 |\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ ";; I apologize for the formatting. This is a poor replacement for emacs with paredit\n", "(let [before-columns (set (skew-column-filter missing-fixed))\n", " after-columns (set (skew-column-filter skew-fixed))\n", " check-columns (c-set/intersection before-columns after-columns)]\n", " (->> check-columns\n", " (map (fn [colname]\n", " (let [{before-min :min\n", " before-max :max\n", " before-mean :mean\n", " before-skew :skew} \n", " (-> (ds/column missing-fixed colname)\n", " (ds-col/stats [:min :max :mean :skew]))\n", " {after-min :min\n", " after-max :max\n", " after-mean :mean\n", " after-skew :skew} \n", " (-> (ds/column skew-fixed colname)\n", " (ds-col/stats [:min :max :mean :skew]))]\n", " {:column-name colname\n", " :before-skew before-skew\n", " :after-skew after-skew\n", " :before-mean before-mean\n", " :after-mean after-mean})))\n", " (print-table [:column-name \n", " :before-skew :after-skew\n", " :before-mean :after-mean])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe you can see the issue now. For positive skew and small means, the log1p fix has very little effect. 
For very large values, it can overcorrect and push the skew negative. And for columns that already have negative skew, it makes the skew worse.\n", "\n", "No easy fixes here today, but a combined method that tries several versions of the skew fix and keeps the best one could eventually figure it all out in an automated way.\n", "\n", "In any case, let's see some actual results:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gridsearching dataset train.csv model :smile.regression/lasso\n", "Gridsearching dataset train.csv model :xgboost/regression\n", "Gridsearching dataset train.csv model :smile.regression/ridge\n", "Got 3 Trained results\n" ] }, { "data": { "text/html": [ "

Skew Fixed. Predictions and Residuals plots for: :smile.regression/ridge-0.1496, :smile.regression/lasso-0.1508, :xgboost/regression-0.1591
" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(oz/view! (train-graph-regressors \"Skew Fixed\" skew-fixed loss/rmse))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## std-scaler\n", "\n", "There are two scaling methods so far in the tech.ml.dataset system.\n", "\n", "* **range-scaler** - scale a column such that its min/max equal the min/max of a target range. The range defaults to [-1 1].\n", "* **std-scaler** - scale a column such that its mean = 0 and its variance (and therefore stddev) = 1." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before std-scaler\n", "\n", "| :column-name | :mean | :variance |\n", "|--------------+----------+------------|\n", "| LotFrontage | 4.204 | 0.117 |\n", "| LotArea | 9.108 | 0.264 |\n", "| YearBuilt | 7.587 | 0.000 |\n", "| YearRemodAdd | 1984.819 | 426.511 |\n", "| MasVnrArea | 2.124 | 6.894 |\n", "| BsmtFinSF1 | 4.220 | 8.941 |\n", "| BsmtFinSF2 | 0.657 | 3.412 |\n", "| BsmtUnfSF | 5.646 | 3.445 |\n", "| TotalBsmtSF | 1050.659 | 169872.334 |\n", "| CentralAir | 0.648 | 0.029 |\n", "\n", "\n", "After std-scaler\n", "\n", "| :column-name | :mean | :variance |\n", "|--------------+------------------------+--------------------|\n", "| LotFrontage | 1.1962027024720671E-15 | 0.9999999999999976 |\n", "| LotArea | -6.583561534762089E-16 | 0.9999999999999989 |\n", "| YearBuilt | -2.521208974194011E-14 | 1.0000000000000033 |\n", "| YearRemodAdd | 4.7892093221604965E-15 | 0.9999999999999986 |\n", "| MasVnrArea | 4.882527776804556E-16 | 0.9999999999999808 |\n", "| BsmtFinSF1 | -7.175273806128243E-17 | 1.0000000000000044 |\n", "| BsmtFinSF2 | 4.880101207143528E-18 | 1.0000000000000318 |\n", "| BsmtUnfSF | -2.019904390269258E-16 | 1.0000000000000002 |\n", "| TotalBsmtSF | 6.283130304197315E-16 | 0.999999999999997 |\n", "| CentralAir | 1.4251420556486386E-16 | 0.9999999999999899 |\n" ] }, { "data": { "text/plain": [ "nil" ] }, 
"execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(def poly-std-scale-ds (dsp/std-scale missing-fixed))\n", "\n", "(def std-scale-ds (dsp/std-scale skew-fixed))\n", "\n", "\n", "\n", "(println \"Before std-scaler\")\n", "\n", "(->> (ds/select skew-fixed (take 10 numerical-features) :all)\n", " (ds/columns)\n", " (map (fn [col]\n", " (merge (ds-col/stats col [:mean :variance])\n", " {:column-name (ds-col/column-name col)})))\n", " (print-table [:column-name :mean :variance]))\n", "\n", "(println \"\\n\\nAfter std-scaler\")\n", "\n", "(->> (ds/select std-scale-ds (take 10 numerical-features) :all)\n", " (ds/columns)\n", " (map (fn [col]\n", " (merge (ds-col/stats col [:mean :variance])\n", " {:column-name (ds-col/column-name col)})))\n", " (pp/print-table [:column-name :mean :variance]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Models\n", "\n", "We now train our prepared data across a range of models." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gridsearching dataset train.csv model :smile.regression/lasso\n", "Gridsearching dataset train.csv model :xgboost/regression\n", "Gridsearching dataset train.csv model :smile.regression/ridge\n", "Got 3 Trained results\n", "Gridsearching dataset train.csv model :smile.regression/lasso\n", "Gridsearching dataset train.csv model :xgboost/regression\n", "Gridsearching dataset train.csv model :smile.regression/ridge\n", "Got 3 Trained results\n" ] }, { "data": { "text/html": [ "

Final Result-No Skew Fix. Predictions and Residuals plots for: :smile.regression/ridge-0.1143, :smile.regression/lasso-0.1170, :xgboost/regression-0.1371
Final Result-Skew Fix. Predictions and Residuals plots for: :smile.regression/ridge-0.1466, :smile.regression/lasso-0.1470, :xgboost/regression-0.1557
" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(oz/view! [:div (train-graph-regressors \"Final Result-No Skew Fix\" poly-std-scale-ds loss/rmse)\n", " (train-graph-regressors \"Final Result-Skew Fix\" skew-fixed loss/rmse)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Going To Production\n", "\n", "You won't ever see this talked about in notebooks, and that is unfair to the rest of the organization, but at some point you have to take everything above and go to production with it.\n", "\n", "Without getting into too much detail, we show how to build a production pipeline using the tech system. In essence, you can capture context while processing the training dataset and then use this context to make building the inference\n", "pipeline just a bit easier.\n", "\n", "There is quite a bit of ephemeral data used during the above dataset processing. Sometimes we do a string->number \n", "conversion and we don't specify precisely how to map the values. We had std-scaler, which recorded means and variances for all of the systems. We had a correlation table that we referenced to build out column augmentations.\n", "\n", "We can't make going to production automatic, but we can at least help a bit in this area." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING - excluding columns with missing values:\n", " [{:column-name Electrical, :missing-count 1}]\n", "WARNING - excluding non-numeric columns:\n", " [MSZoning LandContour LotConfig Neighborhood Condition1 Condition2 BldgType HouseStyle RoofStyle RoofMatl Exterior1st Exterior2nd Foundation Heating Electrical GarageType GarageFinish Fence MiscFeature SaleType SaleCondition]\n", "train.csv [10 4]:\n", "\n", "| OverallQual | OverallQual-s2 | OverallQual-s3 | OverallQual-sqrt |\n", "|-------------+----------------+----------------+------------------|\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 6.000 | -0.178 | -0.262 | -0.013 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 8.000 | 1.422 | 1.405 | 1.320 |\n", "| 5.000 | -0.806 | -0.774 | -0.764 |\n", "| 8.000 | 1.422 | 1.405 | 1.320 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 5.000 | -0.806 | -0.774 | -0.764 |\n", "\n", "train.csv [10 4]:\n", "\n", "| OverallQual | OverallQual-s2 | OverallQual-s3 | OverallQual-sqrt |\n", "|-------------+----------------+----------------+------------------|\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 6.000 | -0.178 | -0.262 | -0.013 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 8.000 | 1.422 | 1.405 | 1.320 |\n", "| 5.000 | -0.806 | -0.774 | -0.764 |\n", "| 8.000 | 1.422 | 1.405 | 1.320 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 7.000 | 0.565 | 0.454 | 0.677 |\n", "| 5.000 | -0.806 | -0.774 | -0.764 |\n", "\n" ] }, { "data": { "text/plain": [ "nil" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(defn data-pipeline\n", " \"Now you have a model and you want to go to production.\"\n", " [dataset training?]\n", " (let [sale-price-col (when training?\n", " 
(dsp/without-recording\n", " (-> dataset\n", " ;;Sale price is originally an integer\n", " (dsp/m= \"SalePrice\" #(-> (dsp/col)\n", " (dtype/->reader :float64)\n", " dfn/log1p))\n", " (ds/column \"SalePrice\"))))\n", "\n", " dataset (if training?\n", " (ds/remove-columns dataset [\"SalePrice\"])\n", " dataset)\n", " dataset\n", " (-> dataset\n", " (dsp/remove-columns [\"Id\"])\n", " (dsp/->datatype)\n", " more-categorical\n", " initial-missing-entries\n", " str->number-pipeline\n", " simplifications\n", " linear-combinations\n", " (dsp/store-variables #(hash-map :correlation-table\n", " (-> (ds/add-column % sale-price-col)\n", " (ds/correlation-table :pearson)\n", " (get \"SalePrice\"))))\n", " (polynomial-combinations (dsp/read-var :correlation-table))\n", " fix-all-missing\n", " dsp/std-scale)]\n", " (if training?\n", " (-> (ds/add-column dataset sale-price-col)\n", " (ds/set-inference-target \"SalePrice\"))\n", " dataset)))\n", "\n", "\n", "\n", "(def inference-pipeline-data (dsp/pipeline-train-context\n", " (data-pipeline src-dataset true)))\n", "\n", "(def pipeline-train-dataset (:dataset inference-pipeline-data))\n", "\n", "\n", "(def inference-pipeline-context (:context inference-pipeline-data))\n", "\n", "\n", ";;At inference time we wouldn't have the saleprice column\n", "(def test-inference-src-dataset (dsp/remove-columns src-dataset [\"SalePrice\"]))\n", "\n", "\n", ";;Now we can build the same dataset easily using context built during\n", ";;the training system. 
This means any string tables generated or any range\n", ";;k-means, stdscale, etc are all in the context.\n", "(def pipeline-inference-dataset (:dataset\n", " (dsp/pipeline-inference-context\n", " inference-pipeline-context\n", " (data-pipeline test-inference-src-dataset false))))\n", "\n", "\n", "(println (ds/select pipeline-train-dataset [\"OverallQual\"\n", " \"OverallQual-s2\"\n", " \"OverallQual-s3\"\n", " \"OverallQual-sqrt\"]\n", " (range 10)))\n", "\n", "\n", "(println (ds/select pipeline-inference-dataset [\"OverallQual\"\n", " \"OverallQual-s2\"\n", " \"OverallQual-s3\"\n", " \"OverallQual-sqrt\"]\n", " (range 10)))\n" ] } ], "metadata": { "kernelspec": { "display_name": "Clojure", "language": "clojure", "name": "clojure" }, "language_info": { "file_extension": ".clj", "mimetype": "text/x-clojure", "name": "clojure", "version": "1.10.0" } }, "nbformat": 4, "nbformat_minor": 2 }