{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lightning tour of MLJ" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*For a more elementary introduction to MLJ, see [Getting\n", "Started](https://juliaai.github.io/MLJ.jl/dev/getting_started/).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note.** Be sure this file has not been separated from the\n", "accompanying Project.toml and Manifest.toml files, which should not\n", "should be altered unless you know what you are doing. Using them,\n", "the following code block instantiates a julia environment with a tested\n", "bundle of packages known to work with the rest of the script:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Activating project at `~/GoogleDrive/Julia/MLJ/MLJ/examples/lightning_tour`\n" ] } ], "source": [ "using Pkg\n", "Pkg.activate(@__DIR__)\n", "Pkg.instantiate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming Julia 1.7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In MLJ a *model* is just a container for hyperparameters, and that's\n", "all. Here we will apply several kinds of model composition before\n", "binding the resulting \"meta-model\" to data in a *machine* for\n", "evaluation, using cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading and instantiating a gradient tree-boosting model:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ Info: For silent loading, specify `verbosity=0`. 
\n", "import EvoTrees ✔\n" ] }, { "data": { "text/plain": [ "EvoTreeRegressor(\n", " loss = EvoTrees.Linear(),\n", " nrounds = 10,\n", " λ = 0.0,\n", " γ = 0.0,\n", " η = 0.1,\n", " max_depth = 2,\n", " min_weight = 1.0,\n", " rowsample = 1.0,\n", " colsample = 1.0,\n", " nbins = 64,\n", " α = 0.5,\n", " metric = :mse,\n", " rng = Random.MersenneTwister(123),\n", " device = \"cpu\")" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MLJ\n", "MLJ.color_off()\n", "\n", "Booster = @load EvoTreeRegressor # loads code defining a model type\n", "booster = Booster(max_depth=2) # specify hyperparameter at construction" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EvoTreeRegressor(\n", " loss = EvoTrees.Linear(),\n", " nrounds = 50,\n", " λ = 0.0,\n", " γ = 0.0,\n", " η = 0.1,\n", " max_depth = 2,\n", " min_weight = 1.0,\n", " rowsample = 1.0,\n", " colsample = 1.0,\n", " nbins = 64,\n", " α = 0.5,\n", " metric = :mse,\n", " rng = Random.MersenneTwister(123),\n", " device = \"cpu\")" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "booster.nrounds=50 # or mutate post facto\n", "booster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model is an example of an iterative model. As is stands, the\n", "number of iterations `nrounds` is fixed." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Composition 1: Wrapping the model to make it \"self-iterating\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a new model that automatically learns the number of iterations,\n", "using the `NumberSinceBest(3)` criterion, as applied to an\n", "out-of-sample `l1` loss:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DeterministicIteratedModel(\n", " model = EvoTreeRegressor(\n", " loss = EvoTrees.Linear(),\n", " nrounds = 50,\n", " λ = 0.0,\n", " γ = 0.0,\n", " η = 0.1,\n", " max_depth = 2,\n", " min_weight = 1.0,\n", " rowsample = 1.0,\n", " colsample = 1.0,\n", " nbins = 64,\n", " α = 0.5,\n", " metric = :mse,\n", " rng = Random.MersenneTwister(123),\n", " device = \"cpu\"),\n", " controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],\n", " resampling = Holdout(\n", " fraction_train = 0.8,\n", " shuffle = false,\n", " rng = Random._GLOBAL_RNG()),\n", " measure = LPLoss(p = 1),\n", " weights = nothing,\n", " class_weights = nothing,\n", " operation = MLJModelInterface.predict,\n", " retrain = true,\n", " check_measure = true,\n", " iteration_parameter = nothing,\n", " cache = true)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MLJIteration\n", "iterated_booster = IteratedModel(model=booster,\n", " resampling=Holdout(fraction_train=0.8),\n", " controls=[Step(2), NumberSinceBest(3), NumberLimit(300)],\n", " measure=l1,\n", " retrain=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Composition 2: Preprocess the input features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the model with categorical feature encoding:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DeterministicPipeline(\n", " continuous_encoder = ContinuousEncoder(\n", " drop_last = 
false,\n", " one_hot_ordered_factors = false),\n", " deterministic_iterated_model = DeterministicIteratedModel(\n", " model = EvoTreeRegressor{Float64,…},\n", " controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],\n", " resampling = Holdout,\n", " measure = LPLoss(p = 1),\n", " weights = nothing,\n", " class_weights = nothing,\n", " operation = MLJModelInterface.predict,\n", " retrain = true,\n", " check_measure = true,\n", " iteration_parameter = nothing,\n", " cache = true),\n", " cache = true)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = ContinuousEncoder |> iterated_booster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Composition 3: Wrapping the model to make it \"self-tuning\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we define a hyperparameter range for optimization of a\n", "(nested) hyperparameter:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NumericRange(1 ≤ deterministic_iterated_model.model.max_depth ≤ 10; origin=5.5, unit=4.5)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max_depth_range = range(pipe,\n", " :(deterministic_iterated_model.model.max_depth),\n", " lower = 1,\n", " upper = 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can wrap the pipeline model in an optimization strategy to make\n", "it \"self-tuning\":" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DeterministicTunedModel(\n", " model = DeterministicPipeline(\n", " continuous_encoder = ContinuousEncoder,\n", " deterministic_iterated_model = DeterministicIteratedModel{EvoTreeRegressor{Float64,…}},\n", " cache = true),\n", " tuning = RandomSearch(\n", " bounded = Distributions.Uniform,\n", " positive_unbounded = Distributions.Gamma,\n", " other = Distributions.Normal,\n", " rng = 
Random._GLOBAL_RNG()),\n", " resampling = CV(\n", " nfolds = 3,\n", " shuffle = true,\n", " rng = Random.MersenneTwister(456)),\n", " measure = LPLoss(p = 1),\n", " weights = nothing,\n", " operation = nothing,\n", " range = NumericRange(1 ≤ deterministic_iterated_model.model.max_depth ≤ 10; origin=5.5, unit=4.5),\n", " selection_heuristic = MLJTuning.NaiveSelection(nothing),\n", " train_best = true,\n", " repeats = 1,\n", " n = 50,\n", " acceleration = CPUThreads{Int64}(5),\n", " acceleration_resampling = CPU1{Nothing}(nothing),\n", " check_measure = true,\n", " cache = true)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "self_tuning_pipe = TunedModel(model=pipe,\n", " tuning=RandomSearch(),\n", " ranges = max_depth_range,\n", " resampling=CV(nfolds=3, rng=456),\n", " measure=l1,\n", " acceleration=CPUThreads(),\n", " n=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binding to data and evaluating performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loading a selection of features and labels from the Ames\n", "House Price dataset:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X, y = @load_reduced_ames;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Binding the \"self-tuning\" pipeline model to data in a *machine* (which\n", "will additionally store *learned* parameters):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Machine{DeterministicTunedModel{RandomSearch,…},…} trained 0 times; caches data\n", " model: MLJTuning.DeterministicTunedModel{RandomSearch, MLJBase.DeterministicPipeline{NamedTuple{(:continuous_encoder, :deterministic_iterated_model), Tuple{Unsupervised, Deterministic}}, MLJModelInterface.predict}}\n", " args: \n", " 1:\tSource @512 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Multiclass{15}}, 
AbstractVector{Multiclass{25}}, AbstractVector{OrderedFactor{10}}}}`\n", " 2:\tSource @129 ⏎ `AbstractVector{Continuous}`\n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mach = machine(self_tuning_pipe, X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluating the \"self-tuning\" pipeline model's performance using 5-fold\n", "cross-validation (implies multiple layers of nested resampling):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ Info: Performing evaluations using 5 threads.\n", "\rEvaluating over 5 folds: 40%[==========> ] ETA: 0:08:27\u001b[K\rEvaluating over 5 folds: 60%[===============> ] ETA: 0:04:11\u001b[K\rEvaluating over 5 folds: 80%[====================> ] ETA: 0:01:35\u001b[K\rEvaluating over 5 folds: 100%[=========================] Time: 0:07:23\u001b[K\n" ] }, { "data": { "text/plain": [ "PerformanceEvaluation object with these fields:\n", " measure, measurement, operation, per_fold,\n", " per_observation, fitted_params_per_fold,\n", " report_per_fold, train_test_pairs\n", "Extract:\n", "┌───────────────┬─────────────┬───────────┬───────────────────────────────────────────────┐\n", "│\u001b[22m measure \u001b[0m│\u001b[22m measurement \u001b[0m│\u001b[22m operation \u001b[0m│\u001b[22m per_fold \u001b[0m│\n", "├───────────────┼─────────────┼───────────┼───────────────────────────────────────────────┤\n", "│ LPLoss(p = 1) │ 16800.0 │ predict │ [16500.0, 16300.0, 16300.0, 16600.0, 18600.0] │\n", "│ LPLoss(p = 2) │ 6.65e8 │ predict │ [6.14e8, 6.3e8, 5.98e8, 6.17e8, 8.68e8] │\n", "└───────────────┴─────────────┴───────────┴───────────────────────────────────────────────┘\n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "evaluate!(mach,\n", " measures=[l1, l2],\n", " resampling=CV(nfolds=5, rng=123),\n", " acceleration=CPUThreads())" ] }, { 
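"cell_type": "markdown", "metadata": {}, "source": [ "As a sketch of inspecting the tuning outcome (this assumes the machine has\n", "first been trained on all the data with `fit!`), the optimal model found by\n", "the random search can be recovered via `fitted_params` and `report`:\n", "\n", "```julia\n", "fit!(mach)                              # train, including the tuning search\n", "fitted_params(mach).best_model          # pipeline with the winning `max_depth`\n", "report(mach).best_history_entry         # its out-of-sample measurement\n", "```" ] }, {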
"cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.7.1", "language": "julia", "name": "julia-1.7" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.7.1" } }, "nbformat": 4, "nbformat_minor": 3 }