{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/darenasc/mlj-tutorials/master?filepath=README.ipynb)\n", "\n", "## Lightning encounter with the Julia programming language\n", "\n", "###### Julia-related content prepared by [@ablaom](https://github.com/ablaom)\n", "\n", "Interacting with Julia at the REPL, or in a notebook, feels very much\n", "the same as Python, MATLAB or R:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world!" ] } ], "source": [ "print(\"Hello world!\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "2 + 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(42.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Just-in-time compilation\n", "\n", "Here's a function used in generating the famous Mandelbrot set,\n", "which looks pretty much the same in Python, MATLAB or R:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "mandel (generic function with 1 method)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function mandel(z)\n", "    c = z\n", "    maxiter = 80\n", "    for n in 1:maxiter\n", "        if abs(z) > 2\n", "            return n-1\n", "        end\n", "        z = z^2 + c\n", "    end\n", "    return maxiter\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In particular, notice the absence of type annotations. 
The crucial difference is what happens when you call this function:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.012019 seconds (29.06 k allocations: 1.662 MiB)\n" ] }, { "data": { "text/plain": [ "1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time mandel(1.2) # time call on a Float64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This first call is actually pretty slow, slower than Python. However, trying again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.000003 seconds (4 allocations: 160 bytes)\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time mandel(3.4) # time on another Float64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thousands of times faster, second time around! What happened?\n", "\n", "When you call `mandel(1.2)` in Python, say, the defining code\n", "is interpreted each time. When you call `mandel(1.2)` in Julia for\n", "the first time, Julia inspects the type of the argument, namely `Float64`,\n", "and using this information *compiles* an efficient type-specific\n", "version of `mandel`, which it caches for use in any subsequent call\n", "*on the same type*. 
Indeed if we call `mandel` on a new type, a new\n", "compilation will be needed:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.041186 seconds (75.59 k allocations: 4.152 MiB)\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time mandel(1.0 + 5.0im)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0.000009 seconds (6 allocations: 224 bytes)\n" ] }, { "data": { "text/plain": [ "0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time mandel(2.0 + 0.5im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since plotting the Mandelbrot set means calling `mandel` millions of\n", "times on the same type, the advantage of just-in-time compilation is\n", "obvious." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Figure(PyObject
)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "PyObject " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using PyPlot\n", "\n", "plt.imshow([mandel(x + y * im) for y = -1:0.001:1, x = -2:0.001:1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple dispatch\n", "\n", "You will never see anything like `A.add(B)` in Julia because Julia\n", "is not a traditional object-oriented language. In Julia, function and\n", "structure are kept separate, with the help of abstract types and\n", "multiple dispatch, as we explain next.\n", "\n", "In addition to regular concrete types, such as `Float64` and\n", "`String`, Julia has a built-in hierarchy of *abstract* types. These\n", "generally have subtypes but no instances:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typeof(42)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Signed" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "supertype(Int64)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Integer" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "supertype(Signed)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Array{Any,1}:\n", " Bool \n", " Signed \n", " Unsigned" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "subtypes(Integer)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"Bool <: Integer # is Bool a subtype of Integer?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "false" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Bool <: String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Julia, which is optionally typed, one uses type annotations to\n", "adapt the behaviour of functions to their types. If we define" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "divide (generic function with 1 method)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(x, y) = x / y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "then `divide(x, y)` will make sense whenever `x / y` makes sense (for\n", "the built-in function `/`). For example, we can use it to divide two\n", "integers, or two matrices:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(1, 2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2×2 Array{Float64,2}:\n", " 1.0 0.0\n", " 9.0 -2.0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide([1 2; 3 4], [1 2; 3 7])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To vary the behaviour for specific types we make type annotatations:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(x::Integer, y::Integer) = floor(x/y)\n", "divide(x::String, y::String) = join([x, y], \" / \")\n", "divide(1, 2)" ] }, { "cell_type": "code", "execution_count": 20, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Hello / World!\"" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(\"Hello\", \"World!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of `Float64` the original \"fallback\" method still\n", "applies:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "divide(1.0, 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User-defined types\n", "\n", "Users can define their own abstract types and composite types:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "describe (generic function with 2 methods)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abstract type Organism end\n", "\n", "struct Animal <: Organism\n", " name::String\n", " is_hervibore::Bool\n", "end\n", "\n", "struct Plant <: Organism\n", " name::String\n", " is_flowering::Bool\n", "end\n", "\n", "describe(o::Organism) = string(o.name) # fall-back method\n", "function describe(p::Plant)\n", "\n", " if p.is_flowering\n", " text = \" is a flowering plant.\"\n", " else\n", " text = \" is a non-flowering plant.\"\n", " end\n", " return p.name*text\n", "\n", "end" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Elephant\"" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(Animal(\"Elephant\", true))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Fern is a non-flowering plant.\"" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "describe(Plant(\"Fern\", false))" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## Type inference and multiple dispatch\n", "\n", "*Type inference* is the process of identifying the types of the arguments to dispatch the right method.\n", "\n", "Blogpost about [type dispatch](http://www.stochasticlifestyle.com/type-dispatch-design-post-object-oriented-programming-julia/) by [Christopher Rackauckas](http://www.chrisrackauckas.com/)." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "function_x (generic function with 2 methods)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function function_x(x::String)\n", " println(\"this is a string: $x\")\n", "end\n", "\n", "function function_x(x::Int)\n", " println(\"$(x^2) is the square of $x\")\n", "end" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a string: a string\n", "4 is the square of 2\n" ] } ], "source": [ "# each call to the function_x() will dispatch the corresponding method depending on the parameter's type\n", "function_x(\"a string\")\n", "function_x(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic differentiation\n", "\n", "Differentiation of almost arbitrary programs with respect to their input. 
([source]( https://render.githubusercontent.com/view/ipynb?commit=89317894e2e5370a80e45d52db8a4055a4fdecd6&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6d6174626573616e636f6e2f454d455f4a756c69615f776f726b73686f702f383933313738393465326535333730613830653435643532646238613430353561346664656364362f315f496e74726f64756374696f6e2e6970796e62&nwo=matbesancon%2FEME_Julia_workshop&path=1_Introduction.ipynb&repository_id=270611906&repository_type=Repository#Automatic-differentiation) by [@matbesancon](https://github.com/matbesancon))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sqrt_babylonian (generic function with 1 method)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using ForwardDiff\n", "\n", "function sqrt_babylonian(s)\n", " x = s / 2\n", " while abs(x^2 - s) > 0.001\n", " x = (x + s/x) / 2\n", " end\n", " x\n", "end" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.123901414519125e-6" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sqrt_babylonian(2) - sqrt(2)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ForwardDiff.derivative(sqrt_babylonian, 2) = 0.353541906958862\n", "ForwardDiff.derivative(sqrt, 2) = 0.35355339059327373\n" ] } ], "source": [ "@show ForwardDiff.derivative(sqrt_babylonian, 2);\n", "@show ForwardDiff.derivative(sqrt, 2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unitful computations\n", "Physicists' dreams finally made true. 
([source](https://render.githubusercontent.com/view/ipynb?commit=89317894e2e5370a80e45d52db8a4055a4fdecd6&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6d6174626573616e636f6e2f454d455f4a756c69615f776f726b73686f702f383933313738393465326535333730613830653435643532646238613430353561346664656364362f315f496e74726f64756374696f6e2e6970796e62&nwo=matbesancon%2FEME_Julia_workshop&path=1_Introduction.ipynb&repository_id=270611906&repository_type=Repository#Unitful-computations) by [@matbesancon](https://github.com/matbesancon))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "using Unitful\n", "using Unitful: J, kg, m, s" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0 kg m² s⁻²" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "3J + 1kg * (1m / 1s)^2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MLJ\n", "\n", "MLJ (Machine Learning in Julia) is a toolbox written in Julia providing a common interface and meta-algorithms for selecting, tuning, evaluating, composing and comparing machine learning models written in Julia and other languages. MLJ is released under the MIT license and is sponsored by the [Alan Turing Institute](https://www.turing.ac.uk/)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The MLJ Universe\n", "\n", "The functionality of MLJ is distributed over a number of repositories\n", "illustrated in the dependency chart below.\n", "\n", "[MLJ](https://github.com/alan-turing-institute/MLJ) * [MLJBase](https://github.com/alan-turing-institute/MLJBase.jl) * [MLJModelInterface](https://github.com/alan-turing-institute/MLJModelInterface.jl) * [MLJModels](https://github.com/alan-turing-institute/MLJModels.jl) * [MLJTuning](https://github.com/alan-turing-institute/MLJTuning.jl) * [MLJLinearModels](https://github.com/alan-turing-institute/MLJLinearModels.jl) * [MLJFlux](https://github.com/alan-turing-institute/MLJFlux.jl) * [MLJTutorials](https://github.com/alan-turing-institute/MLJTutorials) * [MLJScientificTypes](https://github.com/alan-turing-institute/MLJScientificTypes.jl) * [ScientificTypes](https://github.com/alan-turing-institute/ScientificTypes.jl)\n", "\n", "\n", "
\n", "\n", "*Dependency chart for MLJ repositories. Repositories with dashed\n", "connections do not currently exist but are planned/proposed.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MLJ provides access to a wide variety of machine learning models. For the most up-to-date list of available models, call `models()`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Recompiling stale cache file /Users/darenasc/.julia/compiled/v1.2/MLJ/rAU56.ji for MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]\n", "└ @ Base loading.jl:1240\n" ] }, { "data": { "text/plain": [ "132-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:\n", " (name = ARDRegressor, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostClassifier, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostRegressor, package_name = ScikitLearn, ... ) \n", " (name = AdaBoostStumpClassifier, package_name = DecisionTree, ... ) \n", " (name = AffinityPropagation, package_name = ScikitLearn, ... ) \n", " (name = AgglomerativeClustering, package_name = ScikitLearn, ... ) \n", " (name = BaggingClassifier, package_name = ScikitLearn, ... ) \n", " (name = BaggingRegressor, package_name = ScikitLearn, ... ) \n", " (name = BayesianLDA, package_name = MultivariateStats, ... ) \n", " (name = BayesianLDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianQDA, package_name = ScikitLearn, ... ) \n", " (name = BayesianRidgeRegressor, package_name = ScikitLearn, ... ) \n", " (name = BayesianSubspaceLDA, package_name = MultivariateStats, ... 
)\n", " ⋮ \n", " (name = SVMRegressor, package_name = ScikitLearn, ... ) \n", " (name = SpectralClustering, package_name = ScikitLearn, ... ) \n", " (name = Standardizer, package_name = MLJModels, ... ) \n", " (name = StaticTransformer, package_name = MLJBase, ... ) \n", " (name = SubspaceLDA, package_name = MultivariateStats, ... ) \n", " (name = TheilSenRegressor, package_name = ScikitLearn, ... ) \n", " (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )\n", " (name = UnivariateDiscretizer, package_name = MLJModels, ... ) \n", " (name = UnivariateStandardizer, package_name = MLJModels, ... ) \n", " (name = XGBoostClassifier, package_name = XGBoost, ... ) \n", " (name = XGBoostCount, package_name = XGBoost, ... ) \n", " (name = XGBoostRegressor, package_name = XGBoost, ... ) " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MLJ\n", "models()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit, predict, transform\n", "\n", "The following example is using the `fit()`, `predict()`, and `transform()` functions of MLJ." 
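, "\n", "\n",
"Before turning to MLJ itself, the division of labour between fitting and transforming can be sketched in plain Julia. This is only an illustration of the pattern and not MLJ's actual API: the names `ToyStandardizer`, `toy_fit` and `toy_transform` are hypothetical, and only the `Statistics` standard library is assumed.\n",
"\n",
"```julia\n",
"using Statistics  # standard library: mean, std\n",
"\n",
"struct ToyStandardizer end  # a model-like object that stores no data\n",
"\n",
"# \"fit\" learns parameters from training data and returns them\n",
"toy_fit(::ToyStandardizer, x::AbstractVector{<:Real}) = (mu = mean(x), sigma = std(x))\n",
"\n",
"# \"transform\" applies previously learned parameters to (possibly new) data\n",
"toy_transform(params, x::AbstractVector{<:Real}) = (x .- params.mu) ./ params.sigma\n",
"\n",
"params = toy_fit(ToyStandardizer(), [1.0, 2.0, 3.0])\n",
"toy_transform(params, [2.0, 4.0])  # uses mu = 2.0 and sigma = 1.0\n",
"```\n",
"\n",
"The learned parameters live apart from the model object, which is roughly the role a machine plays in the MLJ workflow."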
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import Statistics\n", "using PrettyPrinting\n", "using StableRNGs" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "X, y = @load_iris;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also load the `DecisionTreeClassifier`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(\n", " max_depth = -1,\n", " min_samples_leaf = 1,\n", " min_samples_split = 2,\n", " min_purity_increase = 0.0,\n", " n_subfeatures = 0,\n", " post_prune = false,\n", " merge_purity_threshold = 1.0,\n", " pdf_smoothing = 0.0,\n", " display_depth = 5)\u001b[34m @ 1…27\u001b[39m" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@load DecisionTreeClassifier\n", "tree_model = DecisionTreeClassifier()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MLJ Machine\n", "\n", "In MLJ, a *model* is an object that only serves as a container for the hyperparameters of the model. A *machine* is an object wrapping both a model and data and can contain information on the *trained* model; it does *not* fit the model by itself. However, it does check that the model is compatible with the scientific type of the data and will warn you otherwise." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mMachine{DecisionTreeClassifier} @ 1…24\u001b[39m\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree = machine(tree_model, X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A machine is used for both supervised and unsupervised models. 
In this tutorial we give an example for the supervised model first and then go on with the unsupervised case.\n", "\n", "## Training and testing a supervised model\n", "\n", "Now that you've declared the model you'd like to consider and the data, we are left with the standard training and testing step for a supervised learning algorithm.\n", "\n", "## Splitting the data\n", "\n", "To split the data into a training and testing set, you can use the function `partition` to obtain indices for data points that should be considered either as training or testing data:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3-element Array{Int64,1}:\n", " 39\n", " 54\n", " 9" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = StableRNG(566)\n", "train, test = partition(eachindex(y), 0.7, shuffle=true, rng=rng)\n", "test[1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting and testing the machine\n", "\n", "To fit the machine, you can use the function `fit!` specifying the rows to be used for the training:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Training \u001b[34mMachine{DecisionTreeClassifier} @ 1…24\u001b[39m.\n", "└ @ MLJBase /Users/darenasc/.julia/packages/MLJBase/Cb9AY/src/machines.jl:187\n" ] }, { "data": { "text/plain": [ "\u001b[34mMachine{DecisionTreeClassifier} @ 1…24\u001b[39m\n" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit!(tree, rows=train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this **modifies** the machine which now contains the trained parameters of the decision tree. 
You can inspect the result of the fitting with the `fitted_params` method:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(tree = Decision Tree\n", "Leaves: 5\n", "Depth: 4,\n", " encoding = Dict(\"virginica\" => 0x00000003,\n", " \"setosa\" => 0x00000001,\n", " \"versicolor\" => 0x00000002))" ] } ], "source": [ "fitted_params(tree) |> pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `fitresult` will vary from model to model though classifiers will usually give out a tuple with the first element corresponding to the fitting and the second one keeping track of how classes are named (so that predictions can be appropriately named).\n", "\n", "You can now use the machine to make predictions with the `predict` function specifying rows to be used for the prediction:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ŷ[1] = UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)\n" ] }, { "data": { "text/plain": [ "UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ŷ = predict(tree, rows=test)\n", "@show ŷ[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the output is probabilistic, effectively a vector with a score for each class. 
You could get the mode by using the `mode` function on `ŷ` or using `predict_mode`:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ȳ[1] = \"setosa\"\n", "mode(ŷ[1]) = \"setosa\"\n" ] }, { "data": { "text/plain": [ "CategoricalArrays.CategoricalValue{String,UInt32} \"setosa\"" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ȳ = predict_mode(tree, rows=test)\n", "@show ȳ[1]\n", "@show mode(ŷ[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To measure the discrepancy between ŷ and y you could use the average cross entropy:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.4029" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mce = cross_entropy(ŷ, y[test]) |> mean\n", "round(mce, digits=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# [Check out MLJ example with TreeParzen.jl](TreeParzen_example.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# A more advanced example" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Info: Recompiling stale cache file /Users/darenasc/.julia/compiled/v1.2/MultivariateStats/l7I74.ji for MultivariateStats [6f286f6a-111f-5878-ab1e-185364afe411]\n", "└ @ Base loading.jl:1240\n" ] }, { "data": { "text/plain": [ "RidgeRegressor(\n", " lambda = 1.0)\u001b[34m @ 1…26\u001b[39m" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MLJ\n", "using StableRNGs\n", "import DataFrames\n", "@load RidgeRegressor pkg=MultivariateStats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we will show how to generate a model from a network; there are two approaches:\n", "\n", "* using the 
`@from_network` macro\n", "* writing the model in full\n", "\n", "The first approach should usually be the one considered as it's simpler.\n", "\n", "Generating a model from a network allows subsequent composition of that network with other tasks and tuning of that network.\n", "\n", "### Using the @from_network macro\n", "\n", "Let's define a simple network.\n", "\n", "*Input layer*" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mSource{:target} @ 1…62\u001b[39m\n" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = StableRNG(6616) # for reproducibility\n", "x1 = rand(rng, 300)\n", "x2 = rand(rng, 300)\n", "x3 = rand(rng, 300)\n", "y = exp.(x1 - x2 - 2x3 + 0.1*rand(rng, 300))\n", "X = DataFrames.DataFrame(x1=x1, x2=x2, x3=x3)\n", "train, test = partition(eachindex(y), 0.8); # first 80% of rows for training\n", "\n", "Xs = source(X)\n", "ys = source(y, kind=:target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*First layer*" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode @ 5…77\u001b[39m = transform(\u001b[0m\u001b[1m5…00\u001b[22m, \u001b[34m1…62\u001b[39m)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std_model = Standardizer()\n", "stand = machine(std_model, Xs)\n", "W = MLJ.transform(stand, Xs)\n", "\n", "box_model = UnivariateBoxCoxTransformer()\n", "box = machine(box_model, ys)\n", "z = MLJ.transform(box, ys)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Second layer*" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode @ 1…44\u001b[39m = predict(\u001b[0m\u001b[1m4…03\u001b[22m, transform(\u001b[0m\u001b[1m7…37\u001b[22m, \u001b[34m6…82\u001b[39m))" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"ridge_model = RidgeRegressor(lambda=0.1)\n", "ridge = machine(ridge_model, W, z)\n", "ẑ = predict(ridge, W)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Output*" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[34mNode @ 2…37\u001b[39m = inverse_transform(\u001b[0m\u001b[1m5…00\u001b[22m, predict(\u001b[0m\u001b[1m4…03\u001b[22m, transform(\u001b[0m\u001b[1m7…37\u001b[22m, \u001b[34m6…82\u001b[39m)))" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ŷ = inverse_transform(box, ẑ)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No fitting has been done thus far, we have just defined a sequence of operations.\n", "\n", "To form a model out of that network is easy using the `@from_network` macro:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "@from_network CompositeModel(std=std_model, box=box_model,\n", " ridge=ridge_model) <= ŷ;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The macro defines a constructor CompositeModel and attributes a name to the different models; the ordering / connection between the nodes is inferred from `ŷ` via the `<= ŷ`.\n", "\n", "**Note**: had the model been probabilistic (e.g. `RidgeClassifier`) you would have needed to add `is_probabilistic=true` at the end." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0136" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cm = machine(CompositeModel(), X, y)\n", "res = evaluate!(cm, resampling=Holdout(fraction_train=0.8, rng=51),\n", " measure=rms)\n", "round(res.measurement[1], sigdigits=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check out more [Data Science tutorials in Julia](https://alan-turing-institute.github.io/DataScienceTutorials.jl/)." 
] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.2.0", "language": "julia", "name": "julia-1.2" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 4 }