{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "d700a42a-7611-4726-a3c2-6b788682dfab", "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "# Install the necessary dependencies\n", "\n", "import os\n", "import sys\n", "!{sys.executable} -m pip install --quiet seaborn pandas scikit-learn numpy matplotlib jupyterlab_myst ipython" ] }, { "cell_type": "markdown", "id": "5357e111", "metadata": { "tags": [ "remove-cell" ] }, "source": [ "---\n", "license:\n", " code: MIT\n", " content: CC-BY-4.0\n", "github: https://github.com/ocademy-ai/machine-learning\n", "venue: By Ocademy\n", "open_access: true\n", "bibliography:\n", " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n", "---" ] }, { "cell_type": "markdown", "id": "6195d6f2", "metadata": {}, "source": [ "# Overview\n", "\n", "Moving Machine Learning models into production is as important as building them, and sometimes even harder. Maintaining data quality and model accuracy over time are just two of the challenges. To productionize a system end to end, its components and designs need to be identified, from defining the problem to serving the model as a service.\n", "\n", "According to Algorithmia's 2020 State of Enterprise ML report, 55% of businesses working on ML models have yet to get them into production.
" ] }, { "cell_type": "markdown", "id": "b13ac488", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/the%202020%20State%20of%20Enterprise%20ML%20by%20Algorithmia%20based%20on%20750%20businesses.png\n", "---\n", "name: the 2020 State of Enterprise ML by Algorithmia based on 750 businesses\n", "---\n", "the 2020 State of Enterprise ML by Algorithmia based on 750 businesses\n", ":::" ] }, { "cell_type": "markdown", "id": "294cbab7", "metadata": {}, "source": [ "This is because many problems and challenges lie between the theoretical study of a model and its actual production deployment:\n", "\n", "(1) **POC** to production gap:\n", "\n", "There is a huge gap between a Proof of Concept (POC) and the final product or service deployed in production. ML code makes up only a tiny fraction of a complete machine learning system, and the surrounding infrastructure it requires is large and complex. Closing this gap may also involve challenges in technology, resources, security, stability, and other areas.\n" ] }, { "cell_type": "markdown", "id": "2bd5ba4a", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/the%20portion%20of%20ML%20code%20that%20is%20part%20of%20a%20complete%20ML%20system.jpg\n", "---\n", "name: the portion of ML code that is part of a complete ML system\n", "---\n", "the portion of ML code that is part of a complete ML system\n", ":::" ] }, { "cell_type": "markdown", "id": "92c348bc", "metadata": {}, "source": [ "(2) Data drift and concept drift:\n", "\n", "Models do not last forever and sometimes degrade over time, even if the data itself is of good quality. 
Sometimes, model performance degrades because the data itself changes:\n", "\n", "**Data drift** usually means that the distribution of the input data (**x**) changes. The trained model no longer fits the new data, so its performance declines. For example, an e-commerce platform builds a model that predicts how likely users are to make a purchase, so that it can push personalized offers. Initially, the model is trained and applied on user data from spontaneous paid search. When the platform launches a new advertising campaign, the influx of users attracted by the advertisements may not match the patterns the model learned.\n" ] }, { "cell_type": "markdown", "id": "23a0c31b", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/Continuous%20blue%20squares%20in%20the%20Data%20stream%20indicate%20the%20start%20of%20a%20Data%20Drift.jpg\n", "---\n", "name: Continuous blue squares in the Data stream indicate the start of a Data Drift\n", "---\n", "Continuous blue squares in the Data stream indicate the start of a Data Drift\n", ":::\n" ] }, { "cell_type": "markdown", "id": "fb685489", "metadata": {}, "source": [ "**Concept drift** usually means that the mapping between input and output (**x->y**) changes, so the pattern learned by the model is no longer valid. What changes is not the data itself but the statistical properties of the target domain over time; in other words, the \"world has changed\". 
Sometimes these changes happen very quickly or even unexpectedly, as with the COVID-19 outbreak, a Black Swan event that dramatically increased the demand for gowns and masks as government policies changed. Sometimes the change is slow: for example, customers' online shopping preferences shift with their personal interests, merchants' reputations, and service types.\n", "\n", "These changes affect the performance of the model and cause serious problems when a project is put into practice, so the model needs to be monitored and continuously deployed." ] }, { "cell_type": "markdown", "id": "02964563", "metadata": {}, "source": [ "This chapter combines the foundational concepts of Machine Learning with the functional expertise of modern software development and engineering to help you develop production-ready Machine Learning knowledge.\n", "\n", "Productionization of a Machine Learning solution is not a one-time thing. The solution is improved continuously through an iterative process." ] }, { "cell_type": "markdown", "id": "f9a75a94", "metadata": {}, "source": [ "> \"Machine learning is a highly iterative process: you may try many dozens of ideas before finding one that you're satisfied with.\"\n", "-- Andrew Ng" ] }, { "cell_type": "markdown", "id": "218ca90a", "metadata": {}, "source": [ "The Machine Learning lifecycle, also known as MLOps (Machine Learning Operations), can be mapped onto the traditional software development process. A better understanding of Machine Learning will help you think about how to incorporate machine learning, including models, into your software development processes.\n", "\n", "A Machine Learning lifecycle consists of the following major phases:\n", "\n", "- problem framing,\n", "- data engineering,\n", "- model training & evaluation,\n", "- deployment,\n", "- maintenance." 
] }, { "cell_type": "markdown", "id": "65943c92", "metadata": {}, "source": [ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/machine-learning-lifecycle.svg\n", "---\n", "name: Machine Learning Lifecycle\n", "---\n", "Machine Learning Lifecycle\n", ":::" ] }, { "cell_type": "markdown", "id": "8ac7baf9", "metadata": {}, "source": [ "In the sections below, we will walk through the Machine Learning lifecycle components with a real-world example." ] }, { "cell_type": "markdown", "id": "358a61e7", "metadata": {}, "source": [ "## Problem framing\n", "\n", "To bring a Machine Learning solution to production successfully, the first step is to define a valuable business objective and translate it into a problem that Machine Learning can solve.\n", "\n", "**[COVID-19](https://en.wikipedia.org/wiki/COVID-19) Projections** is an artificial intelligence solution to accurately forecast infections, deaths, and recovery timelines of the COVID-19/coronavirus pandemic in the US and globally. By the end of April 2020, it was cited by the Centers for Disease Control & Prevention (CDC) as one of the first models to “help inform public health decision-making”." ] }, { "cell_type": "markdown", "id": "0f3eaf89", "metadata": {}, "source": [ ">\"I began estimating true infections in November 2020 because I couldn’t find any good models that were doing that in real-time during a critical moment in the pandemic (though there were 30+ models for forecasting deaths)... My goal when I started covid19-projections.com was to create the most accurate COVID-19 model.\"\n", "-- Youyang Gu, creator of covid19-projections.com" ] }, { "cell_type": "markdown", "id": "46692a2c", "metadata": {}, "source": [ "There have been three separate iterations of the covid19-projections.com model: Death Forecasts,\n", "Infections Estimates, and Vaccination Projections. 
We will use the [Death Forecasting model](https://covid19-projections.com/model-details/) as an example to explore how to frame a Machine Learning problem.\n", "\n", "Let's start by answering some basic problem-framing questions:\n", "\n", "1. What are the inputs?\n", " 1. a time-series table of death data with geographic and demographic information. For the United States, for example, each row of the data needs to have **a number of deaths $x$ at date $y$ in the region $z$**.\n", "2. What are the outputs?\n", " 1. **a number of deaths $x'$ at a given future date $y'$ in region $z'$**.\n", "3. What are the metrics to measure the success of the project? Such as,\n", " 1. projection accuracy, precision, etc. - comparing with existing Machine Learning models and real-world data,\n", " 2. model inference speed - comparing with existing Machine Learning models,\n", " 3. etc.\n", "4. What are the system architecture and required infrastructure?\n", " 1. a data pipeline to refresh the input data regularly,\n", " 2. a Machine Learning pipeline to regularly iterate the model by using the latest input data,\n", " 3. an event schedule module to manage the system communication and collaboration,\n", " 4. and a website that shows the projected results and is accessible in real time.\n", "5. Any other questions? Such as,\n", " 1. is the data generally available and easy to access,\n", " 2. what are the existing solutions,\n", " 3. etc." 
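, "\n", "\n", "As a rough illustration of the input framing in question 1, the sketch below turns one wide row (one column per date, a common layout for time-series tables) into records of **deaths $x$ at date $y$ in region $z$**. The `DeathRecord` class, the `wide_to_records` helper, the column names, and the sample values are assumptions made for this example; they are not taken from the covid19-projections.com code.\n", "\n", "
```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical identifier columns; every other key is treated as a date column.
ID_KEYS = ('Admin2', 'Province_State')


@dataclass(frozen=True)
class DeathRecord:
    region: str   # z: here built as 'county, state'
    day: date     # y: the reporting date
    deaths: int   # x: the death count reported for that date


def wide_to_records(row):
    # Turn one wide row (one key per date) into one record per date.
    region = row['Admin2'] + ', ' + row['Province_State']
    records = []
    for key, value in row.items():
        if key in ID_KEYS:
            continue
        # Assumes date keys look like month/day/two-digit-year, e.g. 6/25/22.
        day = datetime.strptime(key, '%m/%d/%y').date()
        records.append(DeathRecord(region, day, int(value)))
    # Sort by date so the records form a time series.
    return sorted(records, key=lambda record: record.day)


# Illustrative sample row, not real ingestion code.
sample_row = {
    'Admin2': 'Autauga',
    'Province_State': 'Alabama',
    '6/25/22': 217,
    '6/26/22': 217,
}

records = wide_to_records(sample_row)
for record in records:
    print(record)
```
"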
] }, { "cell_type": "markdown", "id": "2d191a1b", "metadata": {}, "source": [ "## Data engineering" ] }, { "cell_type": "markdown", "id": "bbdfacdf", "metadata": {}, "source": [ "### Data ingestion" ] }, { "cell_type": "markdown", "id": "b3fc3be7", "metadata": {}, "source": [ "The COVID-19 Projections Death Forecasting model uses the daily death totals provided by [Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series), which experts consider to be the “gold standard” reference data. It sometimes also uses US testing data from the [COVID Tracking Project](https://covidtracking.com/) in its research and graphs. Below is a sample of the CSSE data." ] }, { "cell_type": "code", "execution_count": 2, "id": "5fbb347d", "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "data": { "text/html": [ "
|   | UID | iso2 | iso3 | code3 | FIPS | Admin2 | Province_State | Country_Region | Lat | Long_ | ... | 6/17/22 | 6/18/22 | 6/19/22 | 6/20/22 | 6/21/22 | 6/22/22 | 6/23/22 | 6/24/22 | 6/25/22 | 6/26/22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 84001001 | US | USA | 840 | 1001.0 | Autauga | Alabama | US | 32.539527 | -86.644082 | ... | 217 | 217 | 217 | 217 | 217 | 217 | 217 | 217 | 217 | 217 |
| 1 | 84001003 | US | USA | 840 | 1003.0 | Baldwin | Alabama | US | 30.727750 | -87.722071 | ... | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 | 683 |
| 2 | 84001005 | US | USA | 840 | 1005.0 | Barbour | Alabama | US | 31.868263 | -85.387129 | ... | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 |
| 3 | 84001007 | US | USA | 840 | 1007.0 | Bibb | Alabama | US | 32.996421 | -87.125115 | ... | 105 | 105 | 105 | 105 | 105 | 105 | 105 | 105 | 105 | 105 |
| 4 | 84001009 | US | USA | 840 | 1009.0 | Blount | Alabama | US | 33.982109 | -86.567906 | ... | 245 | 245 | 245 | 245 | 245 | 245 | 245 | 245 | 245 | 245 |

5 rows × 899 columns
\n", "