This notebook introduces commands for getting data, model persistence to the Watson Machine Learning repository, model deployment, and scoring.\n", "\n", "Some familiarity with Python is helpful. This notebook uses Python 3 and Apache Spark 2.1.\n", "\n", "You will use the data set published on git, **drug_feedback_data.csv**, which contains anonymous information about patient records. Use the details of this data set to predict the best drug to treat heart disease.\n", "\n", "## Learning goals\n", "\n", "This notebook teaches you how to:\n", "- Publish a sample model in the Watson Machine Learning (WML) repository\n", "\n", "You will also learn how to use the WML API to:\n", "- Deploy a model for online scoring \n", "\n", "\n", "## Contents\n", "\n", "This notebook contains the following parts:\n", "\n", "1.\t[Set up the environment](#setup)\n", "2.\t[Create the Spark machine learning model](#model)\n", "3.\t[Store the model](#load)\n", "4.\t[Deploy and score](#score)\n", "5.\t[Summary and next steps](#summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"setup\"></a>\n", "## 1. Set up the environment\n", "\n", "Before you use the sample code in this notebook, you must perform the following setup tasks:\n", "\n", "- Create a [Watson Machine Learning (WML) Service](https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/) instance (a free plan is offered and information about how to create the instance is [here](https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html)).\n", "- Create a [Spark Service](https://console.ng.bluemix.net/catalog/services/spark/) instance (an entry plan is offered).\n", "- Create a [Db2 Warehouse on Cloud Service](https://console.bluemix.net/catalog/services/db2-warehouse-on-cloud/) instance (an entry plan is offered).\n", "- Create the **DRUG_TRAIN_DATA_UPDATED** table in **Db2 Warehouse on Cloud**. 
\n", " + Download the [drug_train_data_updated.csv](https://raw.githubusercontent.com/pmservice/wml-sample-models/master/spark/drug-selection/data/drug_train_data_updated.csv) file from the git repository.\n", " + Click the **Open the console** icon to get started with **Db2 Warehouse on Cloud**.\n", " + Select **Load Data** and the **Desktop** load type.\n", " + **Drag and drop** the previously downloaded file and click **Next**.\n", " + Select the **Schema** to import the data into and click **New Table**. \n", " + Enter the name **DRUG_TRAIN_DATA_UPDATED** for the **new table**, then click **Next** to finish the data import.\n", " + Use `;` as the **field separator**.\n", " + Click **Next** to create a table with the uploaded data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"model\"></a>\n", "## 2. Create the Spark machine learning model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will learn how to prepare data, create an Apache Spark machine learning pipeline, and train a model.\n", "\n", "- [2.1 Load the training data from Db2 Warehouse on Cloud](#load)\n", "- [2.2 Prepare the data](#prep)\n", "- [2.3 Create the pipeline](#pipe)\n", "- [2.4 Train the model](#train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Load the training data from Db2 Warehouse on Cloud<a id=\"load\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following cells to load the DRUG_TRAIN_DATA_UPDATED table content into a Spark DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enter your authentication data as required. \n", "\n", "**Tip:** The authentication information can be found under the **Service Credentials** tab of the Db2 Warehouse on Cloud service instance that you created in IBM Cloud. Click **New credential** to create credentials if you do not have any."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db2_service_credentials = {\n", " \"port\": 50000,\n", " \"db\": \"BLUDB\",\n", " \"username\": \"*****\",\n", " \"ssljdbcurl\": \"jdbc:db2://dashdb-entry-yp-dal10-01.services.dal.bluemix.net:50001/BLUDB:sslConnection=true;\",\n", " \"host\": \"dashdb-entry-yp-dal10-01.services.dal.bluemix.net\",\n", " \"https_url\": \"https://dashdb-entry-yp-dal10-01.services.dal.bluemix.net:8443\",\n", " \"dsn\": \"***\",\n", " \"hostname\": \"dashdb-entry-yp-dal10-01.services.dal.bluemix.net\",\n", " \"jdbcurl\": \"***\",\n", " \"ssldsn\": \"***\",\n", " \"uri\": \"***\",\n", " \"password\": \"***\"\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "db2_credentials = {\n", " 'driver': 'com.ibm.db2.jcc.DB2Driver',\n", " 'jdbcurl': db2_service_credentials['jdbcurl'],\n", " 'user': db2_service_credentials['username'],\n", " 'password': db2_service_credentials['password']\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tablename = \"{schema}.{table}\".format(schema=db2_credentials['user'], table='DRUG_TRAIN_DATA_UPDATED')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DRUG_TRAIN_DATA_UPDATED_data = spark.read.jdbc(db2_credentials['jdbcurl'], table=tablename, properties=db2_credentials)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DRUG_TRAIN_DATA_UPDATED_data.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DRUG column is the target/label column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Prepare the data<a id=\"prep\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this subsection you will split your data into two data sets: \n", "- Train data set\n", "- Test data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(train_data, test_data) = DRUG_TRAIN_DATA_UPDATED_data.randomSplit([0.8, 0.2], 24)\n", "\n", "print(\"Number of records for training: \" + str(train_data.count()))\n", "print(\"Number of records for evaluation: \" + str(test_data.count()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, your data has been successfully split into two data sets:\n", " - The train data set, which is the larger group, is used for training.\n", " - The test data set is used for model evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Create the pipeline<a id=\"pipe\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this subsection you will create an Apache Spark machine learning pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, import the Apache Spark machine learning packages that will be needed in the subsequent steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler\n", "from pyspark.ml.classification import DecisionTreeClassifier\n", "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "from pyspark.ml import Pipeline, Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following step, use the StringIndexer transformer to convert all the string fields to numeric ones."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stringIndexer_sex = StringIndexer(inputCol = 'SEX', outputCol = 'SEX_IX')\n", "stringIndexer_bp = StringIndexer(inputCol = 'BP', outputCol = 'BP_IX')\n", "stringIndexer_chol = StringIndexer(inputCol = 'CHOLESTEROL', outputCol = 'CHOL_IX')\n", "stringIndexer_label = StringIndexer(inputCol=\"DRUG\", outputCol=\"label\").fit(DRUG_TRAIN_DATA_UPDATED_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a feature vector by combining all the features together." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorAssembler_features = VectorAssembler(inputCols=[\"AGE\", \"SEX_IX\", \"BP_IX\", \"CHOL_IX\", \"NA\", \"K\"], outputCol=\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define the estimators you want to use for classification. Decision Tree is used in the following example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt = DecisionTreeClassifier(labelCol=\"label\", featuresCol=\"features\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, convert the indexed labels back to the original labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labelConverter = IndexToString(inputCol=\"prediction\", outputCol=\"predictedLabel\", labels=stringIndexer_label.labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the pipeline. A pipeline consists of transformers and an estimator." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_dt = Pipeline(stages=[stringIndexer_label, stringIndexer_sex, stringIndexer_bp, stringIndexer_chol, vectorAssembler_features, dt, labelConverter])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Train the model<a id=\"train\"></a>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you can train your Decision Tree model by using the previously defined pipeline and train data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = pipeline_dt.fit(train_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check your model accuracy now. Use test data to evaluate the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = model.transform(test_data)\n", "evaluatorDT = MulticlassClassificationEvaluator(labelCol=\"label\", predictionCol=\"prediction\", metricName=\"accuracy\")\n", "accuracy = evaluatorDT.evaluate(predictions)\n", "\n", "print(\"Accuracy = %g\" % accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can tune your model now to achieve better accuracy. To keep this example simple, the tuning section is omitted." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "<a id=\"load\"></a>\n", "## 3. Store the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will learn how to store a sample model in the Watson Machine Learning repository by using the repository client." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, install and import the client library."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf $PIP_BUILD/watson-machine-learning-client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install watson-machine-learning-client --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Apache Spark 2.1 is required." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from watson_machine_learning_client import WatsonMachineLearningAPIClient\n", "import json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Authenticate to the Watson Machine Learning service on IBM Cloud.\n", "\n", "**Tip**: Authentication information (your credentials) can be found in the <a href=\"https://console.bluemix.net/docs/services/service_credentials.html#service_credentials\" target=\"_blank\" rel=\"noopener noreferrer\">Service Credentials</a> tab of the service instance that you created on IBM Cloud. \n", "\n", "If you cannot see the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. \n", "\n", "**Action**: Enter your Watson Machine Learning service instance credentials here.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wml_credentials={\n", " \"url\": \"https://us-south.ml.cloud.ibm.com\",\n", " \"username\": \"***\",\n", " \"password\": \"***\",\n", " \"instance_id\": \"***\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the WatsonMachineLearningAPIClient."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = WatsonMachineLearningAPIClient(wml_credentials)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare the metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: If the accuracy value falls below the threshold value, a retraining action is required." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare the additional information to be saved as the model's metadata:\n", "* TRAINING_DATA_REFERENCE\n", "* OUTPUT_DATA_SCHEMA\n", "* EVALUATION_METHOD: **multiclass**\n", "* EVALUATION_METRICS name: **accuracy** (metric name used to evaluate the model)\n", "* EVALUATION_METRICS value: **0.87** (the accuracy value calculated a few steps above)\n", "* EVALUATION_METRICS threshold: **0.8** (if the accuracy after evaluation using feedback data is below this threshold, auto-retraining is triggered)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: All required fields can be found on the **Service Credentials** tab of the Db2 Warehouse on Cloud service instance that you created in IBM Cloud."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data_reference = {\n", " \"name\": \"DRUG feedback\",\n", " \"connection\": db2_service_credentials,\n", " \"source\": {\n", " \"tablename\": \"DRUG_TRAIN_DATA_UPDATED\",\n", " \"type\": \"dashdb\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define OUTPUT_DATA_SCHEMA." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data_schema = train_data.schema\n", "label_field = next(f for f in train_data_schema.fields if f.name == \"DRUG\")\n", "label_field.metadata['values'] = stringIndexer_label.labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up modeling roles in OUTPUT_DATA_SCHEMA." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.types import *\n", "\n", "input_fields = filter(lambda f: f.name != \"DRUG\", train_data_schema.fields)\n", "\n", "output_data_schema = StructType(list(input_fields)). \\\n", " add(\"prediction\", DoubleType(), True, {'modeling_role': 'prediction'}). \\\n", " add(\"predictedLabel\", StringType(), True, {'modeling_role': 'decoded-target', 'values': stringIndexer_label.labels}). \\\n", " add(\"probability\", ArrayType(DoubleType()), True, {'modeling_role': 'probability'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add all the information to model meta props."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_props = {\n", " client.repository.ModelMetaNames.NAME: \"drug-selection\",\n", " client.repository.ModelMetaNames.TRAINING_DATA_REFERENCE: training_data_reference,\n", " client.repository.ModelMetaNames.OUTPUT_DATA_SCHEMA: output_data_schema.jsonValue(),\n", " client.repository.ModelMetaNames.EVALUATION_METHOD: \"multiclass\",\n", " client.repository.ModelMetaNames.EVALUATION_METRICS: [\n", " {\n", " \"name\": \"accuracy\",\n", " \"value\": accuracy,\n", " \"threshold\": 0.8\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Store the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "published_model_details = client.repository.store_model(model=model, meta_props=model_props, training_data=train_data, pipeline=pipeline_dt)\n", "model_uid = client.repository.get_model_uid(published_model_details)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available props." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"score\"></a>\n", "## 4. Deploy and score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can deploy the previously stored model as a web service." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "deployment_details = client.deployments.create(model_uid, 'best-drug model deployment')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next step, get the scoring endpoint."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scoring_endpoint = client.deployments.get_scoring_url(deployment_details)\n", "print(scoring_endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scoring_payload = {\n", " \"fields\": [\"AGE\", \"SEX\", \"BP\", \"CHOLESTEROL\",\"NA\",\"K\"],\n", " \"values\": [[20.0, \"F\", \"HIGH\", \"HIGH\", 0.71, 0.07], [55.0, \"M\", \"LOW\", \"HIGH\", 0.71, 0.07]]\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score the model using the sample scoring records and `scoring_endpoint`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "score = client.deployments.score(scoring_endpoint, scoring_payload)\n", "\n", "print(str(score))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "<a id=\"summary\"></a>\n", "## 5. Summary and next steps " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "You successfully completed this notebook! \n", " \n", "You learned how to serve a trained model. \n", "Check out our next notebook: [Data Mart configuration and usage with ibm-ai-openscale python package](https://github.com/pmservice/ai-openscale-sample-notebooks/blob/master/Data%20Mart%20configuration%20and%20usage.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Authors\n", "\n", "**Lukasz Cmielowski**, PhD, is an Automation Architect and Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge.\n", "\n", "**Maria Oleszkiewicz**, MSc, is a developer who took part in building the WML API client used in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright © 2018 IBM. 