{ "cells": [ { "cell_type": "markdown", "id": "594dfbd3", "metadata": {}, "source": [ "# ⭐ Scaling Machine Learning in Three Week course ⭐\n", "\n", "## Intro to MLflow\n", "\n", "In this exercise, you will use:\n", "* MLflow\n", "* Tracking of runs and experiments\n", "* The MLflow CLI\n", "* ElasticNet from scikit-learn\n", "* Training a simple model to understand MLflow's tracking capabilities.\n", "\n", "\n", "This exercise is part of the [Scaling Machine Learning with Spark book](https://learning.oreilly.com/library/view/scaling-machine-learning/9781098106812/)\n", "available on the O'Reilly platform or on [Amazon](https://amzn.to/3WgHQvd)." ] }, { "cell_type": "code", "execution_count": 1, "id": "b89b3a00", "metadata": {}, "outputs": [], "source": [ "# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality\n", "# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.\n", "# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.\n", "\n", "import os\n", "import warnings\n", "import sys\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import ElasticNet\n", "from urllib.parse import urlparse\n", "import mlflow\n", "import mlflow.sklearn\n", "\n", "import logging" ] }, { "cell_type": "code", "execution_count": 2, "id": "ab0d501e", "metadata": {}, "outputs": [], "source": [ "logging.basicConfig(level=logging.WARN)\n", "logger = logging.getLogger(__name__)" ] }, { "cell_type": "markdown", "id": "811b51c7", "metadata": {}, "source": [ "## Set eval metrics\n", "We are using RMSE, MAE, and R2.\n", "\n", "\n", "rmse - Root Mean Squared Error\n", "\n", "mae - Mean Absolute Error\n", "\n", "**RMSE and MAE** - Lower values of MAE, MSE, and RMSE imply higher accuracy of a regression 
model.\n", "\n", "> In our case, ElasticNet is part of the linear regression family, where the x (input) and y (output) are assumed to have a linear relationship.\n", "\n", "\n", "\n", "**r2** - A higher value of R squared is considered desirable. R squared and adjusted R squared are used to explain how well the independent variables in the linear regression model explain the variability in the dependent variable.\n", "\n", "### MAE\n", "Mean Absolute Error - In the context of machine learning, absolute error refers to the magnitude of the difference between the prediction of an observation and the true value of that observation. MAE is the average of these absolute errors over all observations.\n", "\n", "![text](../figures/mae.jpeg)" ] }, { "cell_type": "markdown", "id": "58cb4edd", "metadata": {}, "source": [ "### RMSE\n", "Root Mean Squared Error is the square root of the average squared difference between the values predicted by a model and the actual values, so large errors are penalized more heavily than small ones. \n", "\n", "It provides an estimate of how well the model is able to predict the target value (accuracy).\n" ] }, { "cell_type": "markdown", "id": "834128c7", "metadata": {}, "source": [ "### R2 or R Square\n", "\n", "Statistical measure that represents the goodness of fit of a regression model. \n", "\n", "The ideal value for r-square is **1**. 
\n", "\n", "The closer the value of r-square is to 1, the better the model fits the data.\n", "\n", "![text](../figures/rsquare.jpeg)" ] }, { "cell_type": "code", "execution_count": 3, "id": "0b3e8c28", "metadata": {}, "outputs": [], "source": [ "def eval_metrics(actual, pred):\n", "    rmse = np.sqrt(mean_squared_error(actual, pred))\n", "    mae = mean_absolute_error(actual, pred)\n", "    r2 = r2_score(actual, pred)\n", "    return rmse, mae, r2" ] }, { "cell_type": "code", "execution_count": 4, "id": "f6c65f3d", "metadata": {}, "outputs": [], "source": [ "# Read the wine-quality csv file from path\n", "csv_path = (\n", "    \"../datasets/winequality-red.csv\"\n", ")\n", "try:\n", "    data = pd.read_csv(csv_path, sep=\";\")\n", "except Exception as e:\n", "    logger.exception(\n", "        \"Error: %s\", e)" ] }, { "cell_type": "code", "execution_count": 5, "id": "343dae80", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>fixed acidity</th><th>volatile acidity</th><th>citric acid</th><th>residual sugar</th><th>chlorides</th><th>free sulfur dioxide</th><th>total sulfur dioxide</th><th>density</th><th>pH</th><th>sulphates</th><th>alcohol</th><th>quality</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>7.4</td><td>0.700</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.99780</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td></tr>\n",
" <tr><th>1</th><td>7.8</td><td>0.880</td><td>0.00</td><td>2.6</td><td>0.098</td><td>25.0</td><td>67.0</td><td>0.99680</td><td>3.20</td><td>0.68</td><td>9.8</td><td>5</td></tr>\n",
" <tr><th>2</th><td>7.8</td><td>0.760</td><td>0.04</td><td>2.3</td><td>0.092</td><td>15.0</td><td>54.0</td><td>0.99700</td><td>3.26</td><td>0.65</td><td>9.8</td><td>5</td></tr>\n",
" <tr><th>3</th><td>11.2</td><td>0.280</td><td>0.56</td><td>1.9</td><td>0.075</td><td>17.0</td><td>60.0</td><td>0.99800</td><td>3.16</td><td>0.58</td><td>9.8</td><td>6</td></tr>\n",
" <tr><th>4</th><td>7.4</td><td>0.700</td><td>0.00</td><td>1.9</td><td>0.076</td><td>11.0</td><td>34.0</td><td>0.99780</td><td>3.51</td><td>0.56</td><td>9.4</td><td>5</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>1594</th><td>6.2</td><td>0.600</td><td>0.08</td><td>2.0</td><td>0.090</td><td>32.0</td><td>44.0</td><td>0.99490</td><td>3.45</td><td>0.58</td><td>10.5</td><td>5</td></tr>\n",
" <tr><th>1595</th><td>5.9</td><td>0.550</td><td>0.10</td><td>2.2</td><td>0.062</td><td>39.0</td><td>51.0</td><td>0.99512</td><td>3.52</td><td>0.76</td><td>11.2</td><td>6</td></tr>\n",
" <tr><th>1596</th><td>6.3</td><td>0.510</td><td>0.13</td><td>2.3</td><td>0.076</td><td>29.0</td><td>40.0</td><td>0.99574</td><td>3.42</td><td>0.75</td><td>11.0</td><td>6</td></tr>\n",
" <tr><th>1597</th><td>5.9</td><td>0.645</td><td>0.12</td><td>2.0</td><td>0.075</td><td>32.0</td><td>44.0</td><td>0.99547</td><td>3.57</td><td>0.71</td><td>10.2</td><td>5</td></tr>\n",
" <tr><th>1598</th><td>6.0</td><td>0.310</td><td>0.47</td><td>3.6</td><td>0.067</td><td>18.0</td><td>42.0</td><td>0.99549</td><td>3.39</td><td>0.66</td><td>11.0</td><td>6</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>1599 rows × 12 columns</p>\n", "</div>" ], "text/plain": [ " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", "0 7.4 0.700 0.00 1.9 0.076 \n", "1 7.8 0.880 0.00 2.6 0.098 \n", "2 7.8 0.760 0.04 2.3 0.092 \n", "3 11.2 0.280 0.56 1.9 0.075 \n", "4 7.4 0.700 0.00 1.9 0.076 \n", "... ... ... ... ... ... \n", "1594 6.2 0.600 0.08 2.0 0.090 \n", "1595 5.9 0.550 0.10 2.2 0.062 \n", "1596 6.3 0.510 0.13 2.3 0.076 \n", "1597 5.9 0.645 0.12 2.0 0.075 \n", "1598 6.0 0.310 0.47 3.6 0.067 \n", "\n", " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", "0 11.0 34.0 0.99780 3.51 0.56 \n", "1 25.0 67.0 0.99680 3.20 0.68 \n", "2 15.0 54.0 0.99700 3.26 0.65 \n", "3 17.0 60.0 0.99800 3.16 0.58 \n", "4 11.0 34.0 0.99780 3.51 0.56 \n", "... ... ... ... ... ... \n", "1594 32.0 44.0 0.99490 3.45 0.58 \n", "1595 39.0 51.0 0.99512 3.52 0.76 \n", "1596 29.0 40.0 0.99574 3.42 0.75 \n", "1597 32.0 44.0 0.99547 3.57 0.71 \n", "1598 18.0 42.0 0.99549 3.39 0.66 \n", "\n", " alcohol quality \n", "0 9.4 5 \n", "1 9.8 5 \n", "2 9.8 5 \n", "3 9.8 6 \n", "4 9.4 5 \n", "... ... ... \n", "1594 10.5 5 \n", "1595 11.2 6 \n", "1596 11.0 6 \n", "1597 10.2 5 \n", "1598 11.0 6 \n", "\n", "[1599 rows x 12 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data" ] }, { "cell_type": "markdown", "id": "0af64eb2", "metadata": {}, "source": [ "## Creating train and test sets and parameters" ] }, { "cell_type": "code", "execution_count": 6, "id": "5792a81a", "metadata": {}, "outputs": [], "source": [ " # Split the data into training and test sets. 
(0.75, 0.25) split.\n", "train, test = train_test_split(data)\n", "\n", "# The predicted column is \"quality\" which is a scalar from [3, 9]\n", "train_x = train.drop([\"quality\"], axis=1)\n", "test_x = test.drop([\"quality\"], axis=1)\n", "train_y = train[[\"quality\"]]\n", "test_y = test[[\"quality\"]]\n", "\n", "# Try 1 or 0.5\n", "alpha = 0.5\n", "# Try 1 or 0.5\n", "l1_ratio = 0.5" ] }, { "cell_type": "markdown", "id": "45c4edf2", "metadata": {}, "source": [ "## MLflow" ] }, { "cell_type": "markdown", "id": "d87109c3", "metadata": {}, "source": [ "It's time to learn more about MLflow. \n", "To gain hands-on experience, let's go over the following steps, which the available code snippets walk you through. I encourage you to experiment and try running different variations of the model by changing the code.\n", "\n", "1. Create an experiment\n", "2. Try multiple runs within an experiment\n", "3. Collect metrics \n", "4. Explore the experiments directory and runs" ] }, { "cell_type": "code", "execution_count": 7, "id": "0705cb2d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Elasticnet model (alpha=0.500000, l1_ratio=0.500000):\n", " RMSE: 0.7872150893245666\n", " MAE: 0.6382297731300293\n", " R2: 0.11845002046973885\n" ] } ], "source": [ "run_id = 0\n", "\n",
"with mlflow.start_run() as run:\n", "    run_id = run.info.run_id\n", "    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)\n", "    lr.fit(train_x, train_y)\n", "\n", "    predicted_qualities = lr.predict(test_x)\n", "\n", "    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)\n", "\n", "    print(\"Elasticnet model (alpha={:f}, l1_ratio={:f}):\".format(alpha, l1_ratio))\n", "    print(\" RMSE: %s\" % rmse)\n", "    print(\" MAE: %s\" % mae)\n", "    print(\" R2: %s\" % r2)\n", "\n", "    mlflow.log_param(\"alpha\", alpha)\n", "    mlflow.log_param(\"l1_ratio\", l1_ratio)\n", "    mlflow.log_metric(\"rmse\", rmse)\n", "    mlflow.log_metric(\"r2\", r2)\n", "    mlflow.log_metric(\"mae\", mae)\n", "\n", "    tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme\n", "\n", "    # Model registry does not work with file store\n", "    if tracking_url_type_store != \"file\":\n", "\n", "        # Register the model\n", "        # There are other ways to use the Model Registry, which depends on the use case,\n", "        # please refer to the doc for more information:\n", "        # https://mlflow.org/docs/latest/model-registry.html#api-workflow\n", "        mlflow.sklearn.log_model(lr, \"model\", registered_model_name=\"ElasticnetWineModel\")\n", "    else:\n", "        mlflow.sklearn.log_model(lr, \"model\")" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "id": "d77901c4", "metadata": {}, "source": [ "## 🚀🚀🚀 Great! \n", "\n", "Now, let's go back to the **mlruns** folder in your Jupyter environment and go over the project there to better understand the structure of an MLflow experiment! \n", "\n", "You will see experiment 0 and all the runs within it: \n", "![image.png](attachment:image.png)" ] }, { "cell_type": "markdown", "id": "98578c49", "metadata": {}, "source": [ "#### Question - which experiment did we run?\n", "Looking at our mlruns directory, there is a folder named 0.\n", "0 is the experiment ID.\n", "Let's have a look at our experiment details." 
] }, { "cell_type": "code", "execution_count": 8, "id": "4eed53b4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: Default\n", "Artifact Location: file:///home/jovyan/notebooks/mlruns/0\n", "Lifecycle_stage: active\n" ] } ], "source": [ "experiment_id = \"0\"\n", "experiment = mlflow.get_experiment(experiment_id)\n", "print(\"Name: {}\".format(experiment.name))\n", "print(\"Artifact Location: {}\".format(experiment.artifact_location))\n", "print(\"Lifecycle_stage: {}\".format(experiment.lifecycle_stage))" ] }, { "cell_type": "markdown", "id": "a7c3a2d1", "metadata": {}, "source": [ "You can also run the following command from the terminal:\n", " \n", "> ```mlflow experiments list```" ] }, { "cell_type": "markdown", "id": "186f8c0c", "metadata": {}, "source": [ "#### How about run details?\n", "We can investigate those through code as well! \n" ] }, { "cell_type": "code", "execution_count": 9, "id": "f0e49c36", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlflow.tracking import MlflowClient\n", "\n", "client = MlflowClient()\n", "data = client.get_run(run_id).data\n", "data" ] }, { "cell_type": "code", "execution_count": 10, "id": "d1ade77c", "metadata": {}, "outputs": [], "source": [ "mlflow.end_run()" ] }, { "cell_type": "markdown", "id": "7f0e1512", "metadata": {}, "source": [ "## Well Done! 
👏👏👏\n", "## You just finished: Intro to MLflow\n", "## Next: Intro to PySpark" ] }, { "cell_type": "code", "execution_count": null, "id": "41a91d38", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }