{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/alan-turing-institute/MLJ.jl/master?filepath=binder%2FMLJ_demo.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Julia\"\n", "\n", "# A taste of the Julia programming language and the MLJ machine learning toolbox" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This first cell instantiates a Julia project environment, reproducing a collection of mutually compatible packages for use in this demonstration:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m\u001b[1mActivating\u001b[22m\u001b[39m environment at `~/Dropbox/Julia7/MLJ/MLJ/binder/Project.toml`\n" ] } ], "source": [ "using Pkg\n", "Pkg.activate(@__DIR__)\n", "Pkg.instantiate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lightning encounter with Julia programming language\n", "\n", "###### Julia related content prepared by [@ablaom](https://github.com/ablaom)\n", "\n", "Interacting with Julia at the REPL, or in a notebook, feels very much\n", "\n", "the same as python, MATLAB or R:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world!" ] } ], "source": [ "print(\"Hello world!\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2 + 2" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(42.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple dispatch\n", "\n", "You will never see anything like `A.add(B)` in Julia because Julia\n", "is not a traditional object-oriented language. In Julia, function and\n", "structure are kept separate, with the help of abstract types and\n", "multiple dispatch, as we explain next\n", "In addition to regular concrete types, such as `Float64` and\n", "`String`, Julia has a built-in heirarchy of *abstract* types. 
These\n", "generally have subtypes but no instances:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(42)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Signed" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "supertype(Int64)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Integer" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "supertype(Signed)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Array{Any,1}:\n", " Bool \n", " Signed \n", " Unsigned" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subtypes(Integer)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bool <: Integer # is Bool a subtype of Integer?" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bool <: String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Julia, which is optionally typed, one uses type annotations to\n", "adapt the behaviour of functions to their types. If we define" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "divide (generic function with 1 method)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(x, y) = x / y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "then `divide(x, y)` will make sense whenever `x / y` makes sense (for\n", "the built-in function `/`). 
For example, we can use it to divide two\n", "integers, or two matrices:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(1, 2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2×2 Array{Float64,2}:\n", " 1.0 -0.0\n", " 9.0 -2.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide([1 2; 3 4], [1 2; 3 7])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To vary the behaviour for specific types, we add type annotations:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(x::Integer, y::Integer) = floor(x/y)\n", "divide(x::String, y::String) = join([x, y], \" / \")\n", "divide(1, 2)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Hello / World!\"" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(\"Hello\", \"World!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of `Float64` the original \"fallback\" method still\n", "applies:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(1.0, 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User-defined types\n", "\n", "Users can define their own abstract types and composite types:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "describe (generic function with 2 methods)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abstract type Organism end\n", "\n", "struct Animal <: Organism\n", " name::String\n", " is_herbivore::Bool\n", "end\n", "\n", "struct Plant <: Organism\n", " name::String\n", " is_flowering::Bool\n", "end\n", "\n", "describe(o::Organism) = string(o.name) # fall-back method\n", "function describe(p::Plant)\n", " if p.is_flowering\n", " text = \" is a flowering plant.\"\n", " else\n", " text = \" is a non-flowering plant.\"\n", " end\n", " return p.name*text\n", "end" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Elephant\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(Animal(\"Elephant\", true))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Fern is a non-flowering plant.\"" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(Plant(\"Fern\", false))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more on multiple dispatch, see this [blog post](http://www.stochasticlifestyle.com/type-dispatch-design-post-object-oriented-programming-julia/) by [Christopher Rackauckas](http://www.chrisrackauckas.com/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic differentiation\n", "\n", "Differentiation of almost arbitrary programs with respect to their inputs. 
([source](https://render.githubusercontent.com/view/ipynb?commit=89317894e2e5370a80e45d52db8a4055a4fdecd6&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6d6174626573616e636f6e2f454d455f4a756c69615f776f726b73686f702f383933313738393465326535333730613830653435643532646238613430353561346664656364362f315f496e74726f64756374696f6e2e6970796e62&nwo=matbesancon%2FEME_Julia_workshop&path=1_Introduction.ipynb&repository_id=270611906&repository_type=Repository#Automatic-differentiation) by [@matbesancon](https://github.com/matbesancon))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sqrt_babylonian (generic function with 1 method)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using ForwardDiff\n", "\n", "function sqrt_babylonian(s)\n", " x = s / 2\n", " while abs(x^2 - s) > 0.001\n", " x = (x + s/x) / 2\n", " end\n", " x\n", "end" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.123901414519125e-6" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sqrt_babylonian(2) - sqrt(2)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ForwardDiff.derivative(sqrt_babylonian, 2) = 0.353541906958862\n", "ForwardDiff.derivative(sqrt, 2) = 0.35355339059327373\n" ] } ], "source": [ "@show ForwardDiff.derivative(sqrt_babylonian, 2);\n", "@show ForwardDiff.derivative(sqrt, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unitful computations\n", "\n", "Physicists' dreams finally come true. ([source](https://render.githubusercontent.com/view/ipynb?commit=89317894e2e5370a80e45d52db8a4055a4fdecd6&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6d6174626573616e636f6e2f454d455f4a756c69615f776f726b73686f702f383933313738393465326535333730613830653435643532646238613430353561346664656364362f315f496e74726f64756374696f6e2e6970796e62&nwo=matbesancon%2FEME_Julia_workshop&path=1_Introduction.ipynb&repository_id=270611906&repository_type=Repository#Unitful-computations) by [@matbesancon](https://github.com/matbesancon))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Precompiling Unitful [1986cc42-f94f-5a68-af5c-568840ba703d]\n", "└ @ Base loading.jl:1273\n" ] } ], "source": [ "using Unitful\n", "using Unitful: J, kg, m, s" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0 kg m² s⁻²" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "3J + 1kg * (1m / 1s)^2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"MLJ\"\n", "\n", "# MLJ\n", "\n", "MLJ (Machine Learning in Julia) is a toolbox written in Julia\n", "providing a common interface and meta-algorithms for selecting,\n", "tuning, evaluating, composing and comparing machine learning models\n", "written in Julia and other languages. In particular, MLJ wraps a large\n", "number of [scikit-learn](https://scikit-learn.org/stable/) models. 
\n", "\n", "## Key goals\n", "\n", "* Offer a consistent way to use, compose and tune machine learning\n", " models in Julia,\n", "\n", "* Promote the improvement of the Julia ML/Stats ecosystem by making it\n", " easier to use models from a wide range of packages,\n", "\n", "* Unlock performance gains by exploiting Julia's support for\n", " parallelism, automatic differentiation, GPU, optimisation etc.\n", "\n", "\n", "## Key features\n", "\n", "* Data agnostic, train models on any data supported by the\n", " [Tables.jl](https://github.com/JuliaData/Tables.jl) interface,\n", "\n", "* Extensive support for model composition (*pipelines* and *learning\n", " networks*),\n", "\n", "* Convenient syntax to tune and evaluate (composite) models.\n", "\n", "* Consistent interface to handle probabilistic predictions.\n", "\n", "* Extensible [tuning\n", " interface](https://github.com/alan-turing-institute/MLJTuning.jl),\n", " to support growing number of optimization strategies, and designed\n", " to play well with model composition.\n", "\n", "\n", "More information is available from the [MLJ design paper](https://github.com/alan-turing-institute/MLJ.jl/blob/master/paper/paper.md)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how to genearate the full list of models supported by MLJ:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "142-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:\n", " (name = ARDRegressor, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostClassifier, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostRegressor, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... ) \n", " (name = AffinityPropagation, package_name = ScikitLearn, ... ) \n", " (name = AgglomerativeClustering, package_name = ScikitLearn, ... ) \n", " (name = BaggingClassifier, package_name = ScikitLearn, ... ) \n", " (name = BaggingRegressor, package_name = ScikitLearn, ... ) \n", " (name = BayesianLDA, package_name = MultivariateStats, ... ) \n", " (name = BayesianLDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianQDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianRidgeRegressor, package_name = ScikitLearn, ... ) \n", " (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )\n", " ⋮ \n", " (name = Standardizer, package_name = MLJModels, ... ) \n", " (name = StaticSurrogate, package_name = MLJBase, ... ) \n", " (name = SubspaceLDA, package_name = MultivariateStats, ... ) \n", " (name = TheilSenRegressor, package_name = ScikitLearn, ... ) \n", " (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )\n", " (name = UnivariateDiscretizer, package_name = MLJModels, ... ) \n", " (name = UnivariateStandardizer, package_name = MLJModels, ... ) \n", " (name = UnsupervisedSurrogate, package_name = MLJBase, ... ) \n", " (name = WrappedFunction, package_name = MLJBase, ... ) \n", " (name = XGBoostClassifier, package_name = XGBoost, ... ) \n", " (name = XGBoostCount, package_name = XGBoost, ... ) \n", " (name = XGBoostRegressor, package_name = XGBoost, ... 
) " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MLJ\n", "models()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performance evaluation\n", "\n", "The following example shows how to evaluate the performance of supervised learning model in MLJ. We'll start by loading a canned data set that is very well-known:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "X, y = @load_iris;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here `X` is a table of input features, and `y` the target observations (iris species)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can inspect a list of models that apply immediately to this data:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "42-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:\n", " (name = AdaBoostClassifier, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... ) \n", " (name = BaggingClassifier, package_name = ScikitLearn, ... ) \n", " (name = BayesianLDA, package_name = MultivariateStats, ... ) \n", " (name = BayesianLDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianQDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... ) \n", " (name = ConstantClassifier, package_name = MLJModels, ... ) \n", " (name = DecisionTreeClassifier, package_name = DecisionTree, ... ) \n", " (name = DeterministicConstantClassifier, package_name = MLJModels, ... )\n", " (name = DummyClassifier, package_name = ScikitLearn, ... ) \n", " (name = EvoTreeClassifier, package_name = EvoTrees, ... ) \n", " (name = ExtraTreesClassifier, package_name = ScikitLearn, ... ) \n", " ⋮ \n", " (name = ProbabilisticSGDClassifier, package_name = ScikitLearn, ... ) \n", " (name = RandomForestClassifier, package_name = DecisionTree, ... ) \n", " (name = RandomForestClassifier, package_name = ScikitLearn, ... ) \n", " (name = RidgeCVClassifier, package_name = ScikitLearn, ... ) \n", " (name = RidgeClassifier, package_name = ScikitLearn, ... ) \n", " (name = SGDClassifier, package_name = ScikitLearn, ... ) \n", " (name = SVC, package_name = LIBSVM, ... ) \n", " (name = SVMClassifier, package_name = ScikitLearn, ... ) \n", " (name = SVMLinearClassifier, package_name = ScikitLearn, ... ) \n", " (name = SVMNuClassifier, package_name = ScikitLearn, ... ) \n", " (name = SubspaceLDA, package_name = MultivariateStats, ... ) \n", " (name = XGBoostClassifier, package_name = XGBoost, ... 
) " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "models(matching(X, y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll choose one and invoke the `@load` macro, which simultaneously loads the code for the chosen model, and instantiates the model, using default hyper-parameters:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(\n", " max_depth = -1,\n", " min_samples_leaf = 1,\n", " min_samples_split = 2,\n", " min_purity_increase = 0.0,\n", " n_subfeatures = -1,\n", " n_trees = 10,\n", " sampling_fraction = 0.7,\n", " pdf_smoothing = 0.0)\u001b[34m @709\u001b[39m" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree_model = @load RandomForestClassifier pkg=DecisionTree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can evaluate it's performance using, say, 6-fold cross-validation, and the `cross_entropy` performance measure:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[33mEvaluating over 6 folds: 100%[=========================] Time: 0:00:05\u001b[39m\n" ] }, { "data": { "text/plain": [ "┌\u001b[0m───────────────\u001b[0m┬\u001b[0m───────────────\u001b[0m┬\u001b[0m─────────────────────────────────────────────\u001b[0m┐\u001b[0m\n", "│\u001b[0m\u001b[22m _.measure \u001b[0m│\u001b[0m\u001b[22m _.measurement \u001b[0m│\u001b[0m\u001b[22m _.per_fold \u001b[0m│\u001b[0m\n", "├\u001b[0m───────────────\u001b[0m┼\u001b[0m───────────────\u001b[0m┼\u001b[0m─────────────────────────────────────────────\u001b[0m┤\u001b[0m\n", "│\u001b[0m cross_entropy \u001b[0m│\u001b[0m 0.0893 \u001b[0m│\u001b[0m [0.14, 0.0216, 0.104, 0.109, 0.0582, 0.103] \u001b[0m│\u001b[0m\n", "└\u001b[0m───────────────\u001b[0m┴\u001b[0m───────────────\u001b[0m┴\u001b[0m─────────────────────────────────────────────\u001b[0m┘\u001b[0m\n", "_.per_observation = [[[2.22e-16, 1.2, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 0.105], [2.22e-16, 0.357, ..., 0.357], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 0.105]]]\n" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "evaluate(tree_model, X, y, resampling=CV(nfolds=6, shuffle=true), measure=cross_entropy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit and predict\n", "\n", "We'll now evaluate the peformance of our model by hand, but using a simple holdout set, to illustate a typical `fit!` and `predict` workflow. \n", "\n", "First note that a *model* in MLJ is an object that only serves as a container for the hyper-parameters of the model, and that's all. 
A *machine* is an object binding a model to some data, and is where *learned* parameters are stored (among other things):" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mMachine{RandomForestClassifier} @047\u001b[39m trained 0 times.\n", " args: \n", " 1:\t\u001b[34mSource @200\u001b[39m ⏎ `Table{AbstractArray{Continuous,1}}`\n", " 2:\t\u001b[34mSource @688\u001b[39m ⏎ `AbstractArray{Multiclass{3},1}`\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = machine(tree_model, X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting the data\n", "\n", "To split the data into training and testing sets, you can use the function `partition` to obtain indices for the data points assigned to each:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Array{Int64,1}:\n", " 27\n", " 54\n", " 150" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train, test = partition(eachindex(y), 0.7, shuffle=true) \n", "test[1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fit the machine, you can use the function `fit!`, specifying the rows to be used for training:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Training \u001b[34mMachine{RandomForestClassifier} @047\u001b[39m.\n", "└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/2UxSl/src/machines.jl:317\n" ] }, { "data": { "text/plain": [ "\u001b[34mMachine{RandomForestClassifier} @047\u001b[39m trained 1 time.\n", " args: \n", " 1:\t\u001b[34mSource @200\u001b[39m ⏎ `Table{AbstractArray{Continuous,1}}`\n", " 2:\t\u001b[34mSource @688\u001b[39m ⏎ `AbstractArray{Multiclass{3},1}`\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit!(tree, rows=train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this modifies the machine, which now contains the trained parameters of the random forest. You can inspect the result of the fitting with the `fitted_params` method:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(forest = Ensemble of Decision Trees\n", "Trees: 10\n", "Avg Leaves: 6.4\n", "Avg Depth: 4.8,)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fitted_params(tree) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now use the machine to make predictions with the `predict` function, specifying the rows to be used for the prediction:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Warning: `predict(mach)` and `predict(mach, rows=...)` are deprecated. Data or nodes should be explictly specified, as in `predict(mach, X)`. 
\n", "│ caller = ip:0x0\n", "└ @ Core :-1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ŷ[1] = UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)\n" ] }, { "data": { "text/plain": [ "UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ŷ = predict(tree, rows=test)\n", "@show ŷ[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the output is probabilistic, effectively a vector with a score for each class. You could get the mode by using the `mode` function on `ŷ` or using `predict_mode`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Warning: `predict_mode(mach)` and `predict_mode(mach, rows=...)` are deprecated. Data or nodes should be explictly specified, as in `predict_mode(mach, X)`. \n", "│ caller = ip:0x0\n", "└ @ Core :-1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ȳ[1] = \"setosa\"\n", "mode(ŷ[1]) = \"setosa\"\n" ] }, { "data": { "text/plain": [ "CategoricalArrays.CategoricalValue{String,UInt32} \"setosa\"" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ȳ = predict_mode(tree, rows=test)\n", "@show ȳ[1]\n", "@show mode(ŷ[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To measure the discrepancy between ŷ and y you could use the average cross entropy:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0777" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mce = cross_entropy(ŷ, y[test]) |> mean\n", "round(mce, digits=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A more advanced example\n", "\n", "As in other frameworks, MLJ also supports a variety of unsupervised models for pre-processing data, reducing dimensionality, etc. It also provides a [wrapper](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/) for tuning model hyper-parameters in various ways. Data transformations, and supervised models are then typically combined into linear [pipelines](https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/#Linear-pipelines-1). However, a more advanced feature of MLJ not common in other frameworks allows you to combine models in more complicated ways. We give a simple demonstration of that next.\n", "\n", "We start by loading the model code we'll need:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "@load RidgeRegressor pkg=MultivariateStats\n", "@load RandomForestRegressor pkg=DecisionTree;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to define \"learning network\" - a kind of blueprint for the new composite model type. 
Later we \"export\" the network as a new stand-alone model type.\n", "\n", "Our learing network will:\n", "\n", "- standarizes the input data\n", "\n", "- learn and apply a Box-Cox transformation to the target variable\n", "\n", "- blend the predictions of two supervised learning models - a ridge regressor and a random forest regressor; we'll blend using a simple average (for a more sophisticated stacking example, see [here](https://alan-turing-institute.github.io/DataScienceTutorials.jl/getting-started/stacking/))\n", "\n", "- apply the *inverse* Box-Cox transformation to this blended prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic idea is to proceed as if one were composing the various steps \"by hand\", but to wrap the training data in \"source nodes\" first. In place of production data, one typically uses some dummy data, to test the network as it is built. When the learning network is \"exported\" as a new stand-alone model type, it will no longer be bound to any data. You bind the exported model to production data when your're ready to use your new model type (just like you would with any other MLJ model).\n", "\n", "There is no need to `fit!` the machines you create, as this will happen automatically when you *call* the final node in the network (assuming you provide the dummy data).\n", "\n", "*Input layer*" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mSource @647\u001b[39m ⏎ `AbstractArray{Continuous,1}`" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# define some synthetic data:\n", "X, y = make_regression(100)\n", "y = abs.(y)\n", "\n", "test, train = partition(eachindex(y), 0.8);\n", "\n", "# wrap as source nodes:\n", "Xs = source(X)\n", "ys = source(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*First layer and target transformation*" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode{Machine{UnivariateBoxCoxTransformer}} @033\u001b[39m\n", " args:\n", " 1:\t\u001b[34mSource @647\u001b[39m\n", " transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{UnivariateBoxCoxTransformer} @291\u001b[39m\u001b[22m, \n", " \u001b[34mSource @647\u001b[39m)" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ " formula:\n" ] } ], "source": [ "std_model = Standardizer()\n", "stand = machine(std_model, Xs)\n", "W = MLJ.transform(stand, Xs)\n", "\n", "box_model = UnivariateBoxCoxTransformer()\n", "box = machine(box_model, ys)\n", "z = MLJ.transform(box, ys)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Second layer*" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode{Nothing} @404\u001b[39m\n", " args:\n", " 1:\t\u001b[34mNode{Nothing} @096\u001b[39m\n", " 2:\t\u001b[34mNode{Nothing} @501\u001b[39m\n", " +(\n", " #118(\n", " predict(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{RidgeRegressor} @194\u001b[39m\u001b[22m, \n", " transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{Standardizer} @178\u001b[39m\u001b[22m, \n", " \u001b[34mSource @446\u001b[39m))),\n", " #118(\n", " predict(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{RandomForestRegressor} @202\u001b[39m\u001b[22m, \n", " transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{Standardizer} @178\u001b[39m\u001b[22m, \n", " 
\u001b[34mSource @446\u001b[39m))))" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ " formula:\n" ] } ], "source": [ "ridge_model = RidgeRegressor(lambda=0.1)\n", "ridge = machine(ridge_model, W, z)\n", "\n", "forest_model = RandomForestRegressor(n_trees=50)\n", "forest = machine(forest_model, W, z)\n", "\n", "ẑ = 0.5*predict(ridge, W) + 0.5*predict(forest, W)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Output*" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode{Machine{UnivariateBoxCoxTransformer}} @855\u001b[39m\n", " args:\n", " 1:\t\u001b[34mNode{Nothing} @404\u001b[39m\n", " inverse_transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{UnivariateBoxCoxTransformer} @291\u001b[39m\u001b[22m, \n", " +(\n", " #118(\n", " predict(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{RidgeRegressor} @194\u001b[39m\u001b[22m, \n", " transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{Standardizer} @178\u001b[39m\u001b[22m, \n", " \u001b[34mSource @446\u001b[39m))),\n", " #118(\n", " predict(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{RandomForestRegressor} @202\u001b[39m\u001b[22m, \n", " transform(\n", " \u001b[0m\u001b[1m\u001b[34mMachine{Standardizer} @178\u001b[39m\u001b[22m, \n", " \u001b[34mSource @446\u001b[39m)))))" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ " formula:\n" ] } ], "source": [ "ŷ = inverse_transform(box, ẑ)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No fitting has been done thus far; we have just defined a sequence of operations. We can test the network by fitting the final prediction node and then calling it to retrieve the prediction:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Training \u001b[34mMachine{UnivariateBoxCoxTransformer} @291\u001b[39m.\n", "└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/2UxSl/src/machines.jl:317\n", "┌ Info: Training \u001b[34mMachine{Standardizer} @178\u001b[39m.\n", "└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/2UxSl/src/machines.jl:317\n", "┌ Info: Training \u001b[34mMachine{RidgeRegressor} @194\u001b[39m.\n", "└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/2UxSl/src/machines.jl:317\n", "┌ Info: Training \u001b[34mMachine{RandomForestRegressor} @202\u001b[39m.\n", "└ @ MLJBase /Users/anthony/.julia/packages/MLJBase/2UxSl/src/machines.jl:317\n" ] }, { "data": { "text/plain": [ "4-element Array{Float64,1}:\n", " 0.6921862297324471\n", " 1.6808322228314643\n", " 1.457825054812906 \n", " 4.189887949048443 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit!(ŷ);\n", "ŷ()[1:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To \"export\" the network as a new stand-alone model type, we can use a macro:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "@from_network machine(Deterministic(), Xs, ys, predict=ŷ) begin\n", " mutable struct CompositeModel\n", " rgs1 = ridge_model\n", " rgs2 = forest_model\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an instance of our new type:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CompositeModel(\n", " rgs1 = 
RidgeRegressor(\n", " lambda = 0.1),\n", " rgs2 = RandomForestRegressor(\n", " max_depth = -1,\n", " min_samples_leaf = 1,\n", " min_samples_split = 2,\n", " min_purity_increase = 0.0,\n", " n_subfeatures = -1,\n", " n_trees = 50,\n", " sampling_fraction = 0.7,\n", " pdf_smoothing = 0.0))\u001b[34m @810\u001b[39m" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "composite = CompositeModel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we made our model mutable, we could swap in different regressors.\n", "\n", "For now, we'll evaluate this model on the famous Boston data set:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[33mEvaluating over 6 folds: 100%[=========================] Time: 0:00:00\u001b[39m\n" ] }, { "data": { "text/plain": [ "┌\u001b[0m───────────\u001b[0m┬\u001b[0m───────────────\u001b[0m┬\u001b[0m──────────────────────────────────────\u001b[0m┐\u001b[0m\n", "│\u001b[0m\u001b[22m _.measure \u001b[0m│\u001b[0m\u001b[22m _.measurement \u001b[0m│\u001b[0m\u001b[22m _.per_fold \u001b[0m│\u001b[0m\n", "├\u001b[0m───────────\u001b[0m┼\u001b[0m───────────────\u001b[0m┼\u001b[0m──────────────────────────────────────\u001b[0m┤\u001b[0m\n", "│\u001b[0m rms \u001b[0m│\u001b[0m 4.03 \u001b[0m│\u001b[0m [3.72, 2.79, 3.77, 6.18, 3.59, 3.27] \u001b[0m│\u001b[0m\n", "│\u001b[0m mae \u001b[0m│\u001b[0m 2.49 \u001b[0m│\u001b[0m [2.52, 1.96, 2.55, 3.04, 2.54, 2.36] \u001b[0m│\u001b[0m\n", "└\u001b[0m───────────\u001b[0m┴\u001b[0m───────────────\u001b[0m┴\u001b[0m──────────────────────────────────────\u001b[0m┘\u001b[0m\n", "_.per_observation = [missing, missing]\n" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = @load_boston\n", "evaluate(composite, X, y, resampling=CV(nfolds=6, shuffle=true), measures=[rms, mae])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check out more [Data Science Tutorials in Julia](https://alan-turing-institute.github.io/DataScienceTutorials.jl/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.3.0", "language": "julia", "name": "julia-1.3" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.3.0" } }, "nbformat": 4, "nbformat_minor": 4 }