{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<table style=\"border: none\" align=\"left\">\n", " <tr style=\"border: none\">\n", " <th style=\"border: none\"><font face=\"verdana\" size=\"5\" color=\"black\"><b>Best heart drug prediction using Watson Machine Learning</b></font></th>\n", " <th style=\"border: none\"><img src=\"https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true\" alt=\"Watson Machine Learning icon\" height=\"40\" width=\"40\"></th>\n", " </tr> \n", " <tr style=\"border: none\">\n", " <td style=\"border: none\"><img src=\"https://github.com/pmservice/wml-sample-models/raw/master/spark/drug-selection/images/learning_banner-05.png\" width=\"600\" alt=\"Icon\"></td>\n", " </tr>\n", "</table>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook contains the steps and code to train a model, deploy it, and start scoring new data. It introduces commands for getting data, persisting the model to the Watson Machine Learning repository, deploying the model, and scoring.\n", "\n", "Some familiarity with Python is helpful. This notebook uses Python 3 and Apache Spark 2.1.\n", "\n", "You will use the data set published on git, **drug_feedback_data.csv**, which contains anonymous patient records. 
Use the details of this data set to predict the best drug to treat heart disease.\n", "\n", "## Learning goals\n", "\n", "This notebook teaches you how to:\n", "- Publish a sample model in the Watson Machine Learning (WML) repository\n", "\n", "You will also learn how to use the WML API to:\n", "- Deploy a model for online scoring \n", "\n", "\n", "## Contents\n", "\n", "This notebook contains the following parts:\n", "\n", "1.\t[Set up the environment](#setup)\n", "2.\t[Create the Spark machine learning model](#model)\n", "3.\t[Store the model](#load)\n", "4.\t[Deploy and score](#score)\n", "5.\t[Summary and next steps](#summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"setup\"></a>\n", "## 1. Set up the environment\n", "\n", "Before you use the sample code in this notebook, you must perform the following setup tasks:\n", "\n", "- Create a [Watson Machine Learning (WML) Service](https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/) instance (a free plan is offered and information about how to create the instance is [here](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html)).\n", "- Create a [Spark Service](https://console.ng.bluemix.net/catalog/services/spark/) instance (an entry plan is offered).\n", "- Create a [Db2 Warehouse on Cloud Service](https://console.bluemix.net/catalog/services/db2-warehouse-on-cloud/) instance (an entry plan is offered).\n", "- Create the **DRUG_TRAIN_DATA_UPDATED** table in **Db2 Warehouse on Cloud**. \n", " + Download the [drug_train_data_updated.csv](https://raw.githubusercontent.com/pmservice/wml-sample-models/master/spark/drug-selection/data/drug_train_data_updated.csv) file from the git repository.\n", " + Click the **Open the console** icon to get started with **Db2 Warehouse on Cloud**.\n", " + Select **Load Data** and the **Desktop** load type.\n", " + **Drag and drop** the previously downloaded file and click **Next**.\n", " + Select the **Schema** to import the data into and click **New Table**. 
\n", " + Enter the name **DRUG_TRAIN_DATA_UPDATED** for the **new table**, then click **Next** to finish the data import.\n", " + Use `;` as the **field separator**.\n", " + Click **Next** to create a table with the uploaded data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"model\"></a>\n", "## 2. Create the Spark machine learning model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will learn how to prepare data, create an Apache Spark machine learning pipeline, and train a model.\n", "\n", "- [2.1 Load the training data from Db2 Warehouse on Cloud](#load)\n", "- [2.2 Prepare the data](#prep)\n", "- [2.3 Create the pipeline](#pipe)\n", "- [2.4 Train the model](#train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Load the training data from Db2 Warehouse on Cloud<a id=\"load\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cells to load the DRUG_TRAIN_DATA_UPDATED table content into a Spark DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enter your authentication data as required. \n", "\n", "**Tip:** The authentication information can be found under the **Service Credentials** tab of the Db2 Warehouse on Cloud service instance created in IBM Cloud. Click **New credential** to create credentials if you do not have any." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db2_service_credentials = {\n", " \"port\": 50000,\n", " \"db\": \"BLUDB\",\n", " \"username\": \"*****\",\n", " \"ssljdbcurl\": \"jdbc:db2://dashdb-entry-yp-dal10-01.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;\",\n", " \"host\": \"dashdb-entry-yp-dal10-01.services.dal.bluemix.net\",\n", " \"https_url\": \"https://dashdb-entry-yp-dal10-01.services.dal.bluemix.net:8443\",\n", " \"dsn\": \"***\",\n", " \"hostname\": \"dashdb-entry-yp-dal10-01.services.dal.bluemix.net\",\n", " \"jdbcurl\": \"***\",\n", " \"ssldsn\": \"***\",\n", " \"uri\": \"***\",\n", " \"password\": \"***\"\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db2_credentials = {\n", " 'driver': 'com.ibm.db2.jcc.DB2Driver',\n", " 'jdbcurl': db2_service_credentials['jdbcurl'],\n", " 'user': db2_service_credentials['username'],\n", " 'password': db2_service_credentials['password']\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tablename = \"{schema}.{table}\".format(schema=db2_credentials['user'], table='DRUG_TRAIN_DATA_UPDATED')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DRUG_TRAIN_DATA_UPDATED_data = spark.read.jdbc(db2_credentials['jdbcurl'], table=tablename, properties=db2_credentials)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DRUG_TRAIN_DATA_UPDATED_data.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DRUG column is the target/label column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Prepare the data<a id=\"prep\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this subsection you will split your data into two data sets: \n", "- Train data set\n", "- Test data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split the data 80/20; the fixed seed (24) makes the split reproducible.\n", "(train_data, test_data) = DRUG_TRAIN_DATA_UPDATED_data.randomSplit([0.8, 0.2], 24)\n", "\n", "print(\"Number of records for training: \" + str(train_data.count()))\n", "print(\"Number of records for evaluation: \" + str(test_data.count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, your data has been successfully split into two data sets:\n", " - The train data set, which is the larger group, is used for training.\n", " - The test data set is used for model evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Create the pipeline<a id=\"pipe\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this subsection you will create an Apache Spark machine learning pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, import the Apache Spark machine learning packages that will be needed in the subsequent steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler\n", "from pyspark.ml.classification import DecisionTreeClassifier\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "from pyspark.ml import Pipeline, Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following step, use the StringIndexer transformer to convert all the string fields to numeric ones." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stringIndexer_sex = StringIndexer(inputCol = 'SEX', outputCol = 'SEX_IX')\n", "stringIndexer_bp = StringIndexer(inputCol = 'BP', outputCol = 'BP_IX')\n", "stringIndexer_chol = StringIndexer(inputCol = 'CHOLESTEROL', outputCol = 'CHOL_IX')\n", "stringIndexer_label = StringIndexer(inputCol=\"DRUG\", outputCol=\"label\").fit(DRUG_TRAIN_DATA_UPDATED_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a feature vector by combining all the features together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorAssembler_features = VectorAssembler(inputCols=[\"AGE\", \"SEX_IX\", \"BP_IX\", \"CHOL_IX\", \"NA\", \"K\"], outputCol=\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define the estimators you want to use for classification. Decision Tree is used in the following example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt = DecisionTreeClassifier(labelCol=\"label\", featuresCol=\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, convert the indexed labels back to the original labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labelConverter = IndexToString(inputCol=\"prediction\", outputCol=\"predictedLabel\", labels=stringIndexer_label.labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the pipeline. A pipeline consists of transformers and an estimator." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_dt = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_bp, stringIndexer_chol, vectorAssembler_features, dt, labelConverter])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Train the model<a id=\"train\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can train your Decision Tree model by using the previously defined pipeline and the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = pipeline_dt.fit(train_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now check the accuracy of your model. Use the test data to evaluate it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = model.transform(test_data)\n", "evaluatorDT = MulticlassClassificationEvaluator(labelCol=\"label\", predictionCol=\"prediction\", metricName=\"accuracy\")\n", "accuracy = evaluatorDT.evaluate(predictions)\n", "\n", "print(\"Accuracy = %g\" % accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can tune your model now to achieve better accuracy. To keep this example simple, the tuning section is omitted." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "<a id=\"load\"></a>\n", "## 3. Store the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will learn how to store the sample model in the Watson Machine Learning repository by using the repository client." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, install and import the client library." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf $PIP_BUILD/watson-machine-learning-client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install watson-machine-learning-client --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Apache Spark 2.1 is required." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from watson_machine_learning_client import WatsonMachineLearningAPIClient\n", "import json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Authenticate to the Watson Machine Learning service on IBM Cloud.\n", "\n", "**Tip**: Authentication information (your credentials) can be found in the <a href=\"https://console.bluemix.net/docs/services/service_credentials.html#service_credentials\" target=\"_blank\" rel=\"noopener noreferrer\">Service Credentials</a> tab of the service instance that you created on IBM Cloud. \n", "\n", "If you cannot see the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. \n", "\n", "**Action**: Enter your Watson Machine Learning service instance credentials here.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wml_credentials = {\n", " \"url\": \"https://us-south.ml.cloud.ibm.com\",\n", " \"username\": \"***\",\n", " \"password\": \"***\",\n", " \"instance_id\": \"***\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the WatsonMachineLearningAPIClient." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = WatsonMachineLearningAPIClient(wml_credentials)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare the metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: If the accuracy value falls below the threshold value, retraining is required." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare the additional information to be saved as the model's metadata:\n", "* TRAINING_DATA_REFERENCE\n", "* OUTPUT_DATA_SCHEMA\n", "* EVALUATION_METHOD: **multiclass**\n", "* EVALUATION_METRICS name: **accuracy** (metric name used to evaluate the model)\n", "* EVALUATION_METRICS value: **0.87** (accuracy value calculated a few steps above)\n", "* EVALUATION_METRICS threshold: **0.8** (if the accuracy after evaluation using feedback data is below this threshold, auto-retraining is triggered)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: All required fields can be found on the Service Credentials tab of the Db2 Warehouse on Cloud service instance created in IBM Cloud." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data_reference = {\n", " \"name\": \"DRUG feedback\",\n", " \"connection\": db2_service_credentials,\n", " \"source\": {\n", " \"tablename\": \"DRUG_TRAIN_DATA_UPDATED\",\n", " \"type\": \"dashdb\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the OUTPUT_DATA_SCHEMA." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data_schema = train_data.schema\n", "label_field = next(f for f in train_data_schema.fields if f.name == \"DRUG\")\n", "label_field.metadata['values'] = stringIndexer_label.labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the modeling roles in the OUTPUT_DATA_SCHEMA." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.types import ArrayType, DoubleType, StringType, StructType\n", "\n", "# Keep every input column except the DRUG label.\n", "input_fields = filter(lambda f: f.name != \"DRUG\", train_data_schema.fields)\n", "\n", "output_data_schema = StructType(list(input_fields)). \\\n", " add(\"prediction\", DoubleType(), True, {'modeling_role': 'prediction'}). \\\n", " add(\"predictedLabel\", StringType(), True, {'modeling_role': 'decoded-target', 'values': stringIndexer_label.labels}). \\\n", " add(\"probability\", ArrayType(DoubleType()), True, {'modeling_role': 'probability'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add all the information to the model meta props." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_props = {\n", " client.repository.ModelMetaNames.NAME: \"drug-selection\",\n", " client.repository.ModelMetaNames.TRAINING_DATA_REFERENCE: training_data_reference,\n", " client.repository.ModelMetaNames.OUTPUT_DATA_SCHEMA: output_data_schema.jsonValue(),\n", " client.repository.ModelMetaNames.EVALUATION_METHOD: \"multiclass\",\n", " client.repository.ModelMetaNames.EVALUATION_METRICS: [\n", " {\n", " \"name\": \"accuracy\",\n", " \"value\": accuracy,\n", " \"threshold\": 0.8\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "published_model_details = client.repository.store_model(model=model, meta_props=model_props, training_data=train_data, pipeline=pipeline_dt)\n", "model_uid = client.repository.get_model_uid(published_model_details)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available props." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"score\"></a>\n", "## 4. Deploy and score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can deploy the previously stored model as a web service." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "deployment_details = client.deployments.create(model_uid, 'best-drug model deployment')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next step, get the scoring endpoint." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scoring_endpoint = client.deployments.get_scoring_url(deployment_details)\n", "print(scoring_endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scoring_payload = {\n", " \"fields\": [\"AGE\", \"SEX\", \"BP\", \"CHOLESTEROL\",\"NA\",\"K\"],\n", " \"values\": [[20.0, \"F\", \"HIGH\", \"HIGH\", 0.71, 0.07], [55.0, \"M\", \"LOW\", \"HIGH\", 0.71, 0.07]]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score the model using the sample scoring records and `scoring_endpoint`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score = client.deployments.score(scoring_endpoint, scoring_payload)\n", "\n", "print(str(score))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "<a id=\"summary\"></a>\n", "## 5. Summary and next steps " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "You successfully completed this notebook! \n", " \n", "You learned how to train, store, and serve a model. \n", "Check out our next notebook: [Data Mart configuration and usage with ibm-ai-openscale python package](https://github.com/pmservice/ai-openscale-sample-notebooks/blob/master/Data%20Mart%20configuration%20and%20usage.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Authors\n", "\n", "**Lukasz Cmielowski**, PhD, is an Automation Architect and Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge.\n", "\n", "**Maria Oleszkiewicz**, MSc, is a developer who took part in building the WML API client used in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright © 2018 IBM. This notebook and its source code are released under the terms of the MIT License." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3.5 with Spark", "language": "python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 1 }