{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Op Titanic Simple Sample\n", "\n", "Derived from https://github.com/salesforce/TransmogrifAI/tree/master/helloworld/notebooks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we describe a very simple TransmogrifAI workflow for predicting survivors in the often-cited Titanic dataset. The code for building and applying the Titanic model can be found here: Titanic Code, and the data can be found here: [Titanic Data](https://github.com/salesforce/op/blob/master/helloworld/src/main/resources/TitanicDataset/TitanicPassengersTrainData.csv).\n", "\n", "First we need to load transmogrifai and Spark Mllib jars\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[32mimport \u001b[39m\u001b[36m$ivy.$ \n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36m$ivy.$ \n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36m$ivy.$ \n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36m$ivy.$ \n", "\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op._\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.features._\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.features.types._\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.evaluators.Evaluators\u001b[39m" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import $ivy.`org.apache.spark::spark-sql:2.3.3`\n", "import $ivy.`org.apache.spark::spark-mllib:2.3.3`\n", "import $ivy.`sh.almond::almond-spark:0.5.0`\n", "import $ivy.`com.salesforce.transmogrifai::transmogrifai-core:0.5.1`\n", "\n", "import com.salesforce.op._\n", "import com.salesforce.op.features._\n", "import com.salesforce.op.features.types._\n", "import com.salesforce.op.evaluators.Evaluators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also want avoid too extensive logging and long outputs in our notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[32mimport \u001b[39m\u001b[36morg.apache.log4j.{Level, Logger}\n", "\u001b[39m" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import org.apache.log4j.{Level, Logger}\n", "Logger.getLogger(\"org.apache.spark\").setLevel(Level.WARN)\n", "Logger.getLogger(\"breeze\").setLevel(Level.WARN)\n", "Logger.getLogger(\"com.salesforce.op\").setLevel(Level.WARN)\n", "\n", "repl.pprinter() = repl.pprinter().copy(defaultHeight = 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate a SparkSession." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading spark-stubs\n", "Getting spark JARs\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).\n", "log4j:WARN Please initialize the log4j system properly.\n", "log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Creating SparkSession\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "19/04/04 23:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n" ] }, { "data": { "text/html": [ "Spark UI" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\u001b[32mimport \u001b[39m\u001b[36morg.apache.spark.sql._\n", "\n", "\u001b[39m\n", "\u001b[36mspark\u001b[39m: \u001b[32mSparkSession\u001b[39m = org.apache.spark.sql.SparkSession@2aea6bbc" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import org.apache.spark.sql._\n", "\n", "implicit val spark = {\n", " NotebookSparkSession.builder()\n", " .progress(false)\n", " .master(\"local[*]\")\n", " .appName(\"TitanicPrediction\")\n", " .getOrCreate()\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us create a case class to describe the schema for the data:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defined \u001b[32mclass\u001b[39m \u001b[36mPassenger\u001b[39m\n", "\u001b[36mpassengerTypeTag\u001b[39m: \u001b[32mreflect\u001b[39m.\u001b[32mruntime\u001b[39m.\u001b[32mpackage\u001b[39m.\u001b[32muniverse\u001b[39m.\u001b[32mWeakTypeTag\u001b[39m[\u001b[32mPassenger\u001b[39m] = TypeTag[Helper.this.Passenger]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// Needed for now for case classes defined within Ammonite. Won't be necessary in future versions of Spark.\n", "// See https://github.com/alexarchambault/ammonite-spark/issues/19 and https://github.com/apache/spark/pull/23607\n", "org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);\n", "case class Passenger(\n", " id: Int,\n", " survived: Int,\n", " pClass: Option[Int],\n", " name: Option[String],\n", " sex: Option[String],\n", " age: Option[Double],\n", " sibSp: Option[Int],\n", " parCh: Option[Int],\n", " ticket: Option[String],\n", " fare: Option[Double],\n", " cabin: Option[String],\n", " embarked: Option[String]\n", ")\n", "// Required to make sure the String representation of the case class doesn't change in later cells.\n", "implicit val passengerTypeTag = scala.reflect.runtime.universe.weakTypeTag[Passenger]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, `Numeric`, `Vector`, `List`, `Set`, `Map`.\n", "In addition, it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported, see the Type Hierarchy and Automatic Feature Engineering section of the documentation.\n", "\n", "Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described in the documentation. 
However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).\n", "\n", "When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36msurvived\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mRealNN\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"survived\"\u001b[39m,\n", " true,\n", " FeatureGeneratorStage_000000000001,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"RealNN_000000000001\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mpClass\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"pClass\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000002,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"PickList_000000000002\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mname\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mText\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"name\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000003,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"Text_000000000003\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36msex\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"sex\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000004,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"PickList_000000000004\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mage\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mReal\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000005,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"Real_000000000005\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36msibSp\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mIntegral\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"sibSp\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000006,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"Integral_000000000006\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mparCh\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mIntegral\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"parCh\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000007,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"Integral_000000000007\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mticket\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"ticket\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000008,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"PickList_000000000008\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36mfare\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mReal\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"fare\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000009,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"Real_000000000009\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", 
"\u001b[36mcabin\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"cabin\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_00000000000a,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"PickList_00000000000a\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")\n", "\u001b[36membarked\u001b[39m: \u001b[32mFeature\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"embarked\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_00000000000b,\n", " \u001b[33mList\u001b[39m(),\n", " \u001b[32m\"PickList_00000000000b\"\u001b[39m,\n", " \u001b[33mList\u001b[39m()\n", ")" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val survived = FeatureBuilder.RealNN[Passenger].extract(_.survived.toRealNN).asResponse\n", "val pClass = FeatureBuilder.PickList[Passenger].extract(_.pClass.map(_.toString).toPickList).asPredictor\n", "val name = FeatureBuilder.Text[Passenger].extract(_.name.toText).asPredictor\n", "val sex = FeatureBuilder.PickList[Passenger].extract(_.sex.map(_.toString).toPickList).asPredictor\n", "val age = FeatureBuilder.Real[Passenger].extract(_.age.toReal).asPredictor\n", "val sibSp = FeatureBuilder.Integral[Passenger].extract(_.sibSp.toIntegral).asPredictor\n", "val parCh = FeatureBuilder.Integral[Passenger].extract(_.parCh.toIntegral).asPredictor\n", "val ticket = FeatureBuilder.PickList[Passenger].extract(_.ticket.map(_.toString).toPickList).asPredictor\n", "val fare = FeatureBuilder.Real[Passenger].extract(_.fare.toReal).asPredictor\n", "val cabin = FeatureBuilder.PickList[Passenger].extract(_.cabin.map(_.toString).toPickList).asPredictor\n", "val embarked = FeatureBuilder.PickList[Passenger].extract(_.embarked.map(_.toString).toPickList).asPredictor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the raw features have been defined, we go ahead and define how we would like to manipulate them via Stages (Transformers and Estimators). A TransmogrifAI Stage takes one or more Features, and returns a new Feature. TransmogrifAI provides numerous handy short cuts for specifying common feature manipulations. For basic arithmetic operations, you can just use “+”, “-“, “*” and “/”. In addition, shortcuts like “normalize”, “pivot” and “map” are also available." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36mfamilySize\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mReal\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"parCh-sibSp_2-stagesApplied_Real_00000000000d\"\u001b[39m,\n", " false,\n", " UnaryLambdaTransformer_00000000000d,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"parCh-sibSp_1-stagesApplied_Real_00000000000c\"\u001b[39m,\n", " false,\n", " BinaryLambdaTransformer_00000000000c,\n", "...\n", "\u001b[36mestimatedCostOfTickets\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mReal\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"fare-parCh-sibSp_3-stagesApplied_Real_00000000000e\"\u001b[39m,\n", " false,\n", " BinaryLambdaTransformer_00000000000e,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"parCh-sibSp_2-stagesApplied_Real_00000000000d\"\u001b[39m,\n", " false,\n", " UnaryLambdaTransformer_00000000000d,\n", "...\n", "\u001b[36mpivotedSex\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mOPVector\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"sex_1-stagesApplied_OPVector_00000000000f\"\u001b[39m,\n", " false,\n", " OpSetVectorizer_00000000000f,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"sex\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000004,\n", "...\n", "\u001b[36mnormedAge\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mRealNN\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age_2-stagesApplied_RealNN_000000000011\"\u001b[39m,\n", " false,\n", " OpScalarStandardScaler_000000000011,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age_1-stagesApplied_RealNN_000000000010\"\u001b[39m,\n", " false,\n", " FillMissingWithMean_000000000010,\n", "...\n", "\u001b[36mageGroup\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mPickList\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age_1-stagesApplied_PickList_000000000012\"\u001b[39m,\n", " false,\n", " UnaryLambdaTransformer_000000000012,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age\"\u001b[39m,\n", " false,\n", " FeatureGeneratorStage_000000000005,\n", "..." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val familySize = sibSp + parCh + 1\n", "val estimatedCostOfTickets = familySize * fare\n", "val pivotedSex = sex.pivot()\n", "val normedAge = age.fillMissingWithMean().zNormalize()\n", "val ageGroup = age.map[PickList](_.value.map(v => if (v > 18) \"adult\" else \"child\").toPickList)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See [Creating Shortcuts for Transformers and Estimators](https://docs.transmogrif.ai/en/stable/developer-guide#creating-shortcuts-for-transformers-and-estimators) for more documentation on how shortcuts for stages can be created. We now define a Feature of type `Vector`, that is a vector representation of all the features we would like to use as predictors in our workflow." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36mpassengerFeatures\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mOPVector\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age-cabin-embarked-fare-name-pClass-parCh-sex-sibSp-ticket_10-stagesApplied_OPVector_000000000017\"\u001b[39m,\n", " false,\n", " VectorsCombiner_000000000017,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"parCh-sibSp_1-stagesApplied_OPVector_000000000013\"\u001b[39m,\n", " false,\n", "..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val passengerFeatures = Seq(\n", " pClass, name, age, sibSp, parCh, ticket,\n", " cabin, embarked, familySize, estimatedCostOfTickets,\n", " pivotedSex, ageGroup\n", ").transmogrify()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the `VectorsCombiner` `Transformer` (short-hand `Seq(....).combine()`) if the user desires to have complete control over feature engineering.\n", "\n", "The next stage applies another powerful AutoML Estimator — the `SanityChecker`. The `SanityChecker` applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection Stage of TransmogrifAI:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36msanityCheck\u001b[39m: \u001b[32mBoolean\u001b[39m = true\n", "\u001b[36mfinalFeatures\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mOPVector\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age-cabin-embarked-fare-name-pClass-parCh-sex-sibSp-survived-ticket_11-stagesApplied_OPVector_000000000018\"\u001b[39m,\n", " false,\n", " SanityChecker_000000000018,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"survived\"\u001b[39m,\n", " true,\n", "..." 
] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val sanityCheck = true\n", "val finalFeatures = if (sanityCheck) survived.sanityCheck(passengerFeatures) else passengerFeatures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the `OpLogisticRegression` Estimator is applied to derive a new triplet of Features which are essentially probabilities and predictions returned by the logistic regression algorithm:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._\n", "\n", "\u001b[39m\n", "\u001b[36mprediction\u001b[39m: \u001b[32mFeatureLike\u001b[39m[\u001b[32mPrediction\u001b[39m] = \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"age-cabin-embarked-fare-name-pClass-parCh-sex-sibSp-survived-ticket_12-stagesApplied_Prediction_000000000023\"\u001b[39m,\n", " true,\n", " ModelSelector_000000000023,\n", " \u001b[33mWrappedArray\u001b[39m(\n", " \u001b[33mFeature\u001b[39m(\n", " \u001b[32m\"survived\"\u001b[39m,\n", " true,\n", "..." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import com.salesforce.op.stages.impl.classification.BinaryClassificationModelSelector\n", "import com.salesforce.op.stages.impl.classification.BinaryClassificationModelsToTry._\n", "\n", "val prediction = BinaryClassificationModelSelector.withTrainValidationSplit(\n", " modelTypesToUse = Seq(OpLogisticRegression)\n", ").setInput(survived, finalFeatures).getOutput()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36mevaluator\u001b[39m: \u001b[32mevaluators\u001b[39m.\u001b[32mOpBinaryClassificationEvaluator\u001b[39m = OpBinaryClassificationEvaluator_000000000024" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val evaluator = Evaluators.BinaryClassification().setLabelCol(survived).setPredictionCol(prediction)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that everything we’ve done so far has been purely at the level of definitions. We have defined how we would like to extract our raw features from data of type `Passenger`, and we have defined how we would like to manipulate them. In order to actually manifest the data described by these features, we need to add them to a workflow and attach a data source to the workflow." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[32mimport \u001b[39m\u001b[36mspark.implicits._ // Needed for Encoders for the Passenger case class\n", "\u001b[39m\n", "\u001b[32mimport \u001b[39m\u001b[36mcom.salesforce.op.readers.DataReaders\n", "\n", "\u001b[39m\n", "\u001b[36mtrainFilePath\u001b[39m: \u001b[32mString\u001b[39m = \u001b[32m\"datasets/TitanicDataset/TitanicPassengersTrainData.csv\"\u001b[39m\n", "\u001b[36mtrainDataReader\u001b[39m: \u001b[32mreaders\u001b[39m.\u001b[32mCSVProductReader\u001b[39m[\u001b[32mPassenger\u001b[39m] = com.salesforce.op.readers.CSVProductReader@e8a32a1" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spark.implicits._ // Needed for Encoders for the Passenger case class\n", "import com.salesforce.op.readers.DataReaders\n", "\n", "val trainFilePath = \"datasets/TitanicDataset/TitanicPassengersTrainData.csv\"\n", "// Define a way to read data into our Passenger class from our CSV file\n", "val trainDataReader = DataReaders.Simple.csvCase[Passenger](\n", " path = Option(trainFilePath),\n", " key = _.id.toString\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[36mworkflow\u001b[39m: \u001b[32mOpWorkflow\u001b[39m = com.salesforce.op.OpWorkflow@7ef03307" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val workflow = new OpWorkflow()\n", " .setResultFeatures(survived, prediction)\n", " .setReader(trainDataReader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we now call `train` on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features survived, prediction, rawPrediction, and prob, fitting all the estimators on the training data in the process. Calling score on the fitted workflow then transforms the underlying training data to produce a DataFrame with the all the features manifested. The score method can optionally be passed an evaluator that produces metrics." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "19/04/04 23:30:08 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.\n", "19/04/04 23:30:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS\n", "19/04/04 23:30:08 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Summary:\n", "Evaluated OpLogisticRegression model using Train Validation Split and area under precision-recall metric.\n", "Evaluated 8 OpLogisticRegression models with area under precision-recall metric between [0.7258579350211893, 0.8158704337161141].\n", "+--------------------------------------------------------+\n", "| Selected Model - OpLogisticRegression |\n", "+--------------------------------------------------------+\n", "| Model Param | Value |\n", "+------------------+-------------------------------------+\n", "| aggregationDepth | 2 |\n", "| elasticNetParam | 0.5 |\n", "| family | auto |\n", "| fitIntercept | true |\n", "| maxIter | 50 |\n", "| modelType | OpLogisticRegression |\n", "| name | OpLogisticRegression_00000000001c_7 |\n", "| regParam | 0.2 |\n", "| standardization | true |\n", "| tol | 1.0E-6 |\n", "| uid | OpLogisticRegression_00000000001c |\n", "+------------------+-------------------------------------+\n", "+------------------------------------------------------------------------+\n", "| Model Evaluation Metrics |\n", "+------------------------------------------------------------------------+\n", "| Metric Name | Training Set Value | Hold Out Set Value |\n", "+-----------------------------+---------------------+--------------------+\n", "| area under ROC | 0.8454730495722972 | 0.8350683148991541 |\n", "| area under precision-recall | 0.8250338838777734 | 0.7387030767861703 |\n", "| error | 0.21508034610630408 | 0.1951219512195122 |\n", "| f1 | 0.71 | 0.7142857142857143 |\n", "| false negative | 100.0 | 9.0 |\n", "| false positive | 74.0 | 7.0 |\n", "| precision | 0.7421602787456446 | 0.7407407407407407 |\n", "| recall | 0.6805111821086262 | 0.6896551724137931 |\n", "| true negative | 422.0 | 46.0 |\n", "| true positive | 213.0 | 20.0 |\n", "+-----------------------------+---------------------+--------------------+\n", "+-------------------------------------------------------------------------------+\n", "| Top Model Insights |\n", "+-------------------------------------------------------------------------------+\n", "| Top Positive Correlations | Correlation Value |\n", "+--------------------------------------------------------+----------------------+\n", "| sex(sex = Female) | 0.5408676879258657 |\n", "| name | 0.33895808575240116 |\n", "| cabin(cabin = other) | 0.3142575853287733 |\n", "| pClass(pClass = 1) | 0.2829126682258749 |\n", "| embarked(embarked = C) | 0.19616299633692874 |\n", "| age(age_1-stagesApplied_PickList_000000000012 = Child) | 0.11682855241012707 |\n", "| pClass(pClass = 2) | 0.1162019601454126 |\n", "| parCh | 0.11141839753358806 |\n", "| fare | 0.11141839753358806 |\n", "| sibSp | 0.11141839753358806 |\n", "| embarked(embarked = null) | 0.06266815370525775 |\n", "| age(age_1-stagesApplied_PickList_000000000012 = Adult) | 0.003828185766210828 |\n", "| embarked(embarked = Q) | -0.01423449991531148 |\n", "| age | -0.07606557384744762 |\n", "| age(age = null) | -0.1116519063188936 |\n", 
"+--------------------------------------------------------+----------------------+\n", "+------------------------------------------------------------------------------+\n", "| Top Negative Correlations | Correlation Value |\n", "+-------------------------------------------------------+----------------------+\n", "| sex(sex = Male) | -0.5408676879258653 |\n", "| name | -0.5320277005898669 |\n", "| pClass(pClass = 3) | -0.33865161163109314 |\n", "| cabin(cabin = null) | -0.31425758532877285 |\n", "| embarked(embarked = S) | -0.17044508720497387 |\n", "| age(age_1-stagesApplied_PickList_000000000012 = null) | -0.1116519063188936 |\n", "| sibSp | -0.04103199324784826 |\n", "| parCh | 0.014173619278117376 |\n", "+-------------------------------------------------------+----------------------+\n", "+----------------------------------------------------------------------------------------+\n", "| Top Contributions | Contribution Value |\n", "+------------------------------------------------------------------+---------------------+\n", "| name | 0.43686641838492407 |\n", "| sex(sex = Male) | 0.41412991086601963 |\n", "| sex(sex = Female) | 0.4127484203214077 |\n", "| pClass(pClass = 3) | 0.21862002466078745 |\n", "| cabin(cabin = null) | 0.08833226022369427 |\n", "| cabin(cabin = other) | 0.08794630568886878 |\n", "| pClass(pClass = null) | 0.0 |\n", "| pClass(pClass = other) | 0.0 |\n", "| pClass(pClass = 2) | 0.0 |\n", "| pClass(pClass = 1) | 0.0 |\n", "| parCh(fare-parCh-sibSp_3-stagesApplied_Real_00000000000e = null) | 0.0 |\n", "| parCh | 0.0 |\n", "| parCh(parCh-sibSp_2-stagesApplied_Real_00000000000d = null) | 0.0 |\n", "| parCh(parCh = null) | 0.0 |\n", "| embarked(embarked = null) | 0.0 |\n", "+------------------------------------------------------------------+---------------------+\n", "+--------------------------------------------------------------------------+\n", "| Top CramersV | CramersV |\n", "+----------------------------------------------------+---------------------+\n", "| sex | 0.5408676879258657 |\n", "| pClass | 0.34975140363576573 |\n", "| cabin | 0.31425758532877324 |\n", "| embarked | 0.20797606343621572 |\n", "| age_1-stagesApplied_PickList_000000000012 | 0.1469944783048929 |\n", "| age | 0.1116519063188937 |\n", "| parCh | 0.0 |\n", "| ticket | 0.0 |\n", "| name | 0.0 |\n", "| fare-parCh-sibSp_3-stagesApplied_Real_00000000000e | 0.0 |\n", "| parCh-sibSp_2-stagesApplied_Real_00000000000d | 0.0 |\n", "| sibSp | 0.0 |\n", "+----------------------------------------------------+---------------------+\n" ] }, { "data": { "text/plain": [ "\u001b[36mfittedWorkflow\u001b[39m: \u001b[32mOpWorkflowModel\u001b[39m = com.salesforce.op.OpWorkflowModel@3ce84a7c" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val fittedWorkflow = workflow.train()\n", "println(\"Summary:\\n\" + fittedWorkflow.summaryPretty())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After model has been fitted we use `scoreAndEvaluate()` function to evaluate the metrics" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scoring the model:\n", "=================\n", "Transformed dataframe columns:\n", "--------------------------\n", "key\n", "survived\n", "age-cabin-embarked-fare-name-pClass-parCh-sex-sibSp-survived-ticket_12-stagesApplied_Prediction_000000000023\n", "Metrics:\n", "------------\n", "{\n", " \"Precision\" : 0.7420382165605095,\n", " \"Recall\" : 
0.6812865497076024,\n", " \"F1\" : 0.7103658536585367,\n", " \"AuROC\" : 0.8442463170677148,\n", " \"AuPR\" : 0.8175464116852524,\n", " \"Error\" : 0.2132435465768799,\n", " \"TP\" : 233.0,\n", " \"TN\" : 468.0,\n", " \"FP\" : 81.0,\n", " \"FN\" : 109.0,\n", " \"thresholds\" : [ 0.6337362389928601, 0.5919422915099323, 0.581676424427552, 0.53827013167045, 0.4307972287346466, 0.38820165599662904, 0.3781943782287164, 0.33771406493189465, 0.3283922601119171, 0.29075001557187263, 0.2820965571896635, 0.24780299079760443, 0.20938927696917828 ],\n", " \"precisionByThreshold\" : [ 0.9560439560439561, 0.9470588235294117, 0.9375, 0.7420382165605095, 0.7423312883435583, 0.7217391304347827, 0.7225433526011561, 0.696236559139785, 0.6355748373101953, 0.5334507042253521, 0.5305410122164049, 0.38288288288288286, 0.3838383838383838 ],\n", " \"recallByThreshold\" : [ 0.2543859649122807, 0.47076023391812866, 0.4824561403508772, 0.6812865497076024, 0.7076023391812866, 0.7280701754385965, 0.7309941520467836, 0.7573099415204678, 0.8567251461988304, 0.8859649122807017, 0.8888888888888888, 0.9941520467836257, 1.0 ],\n", " \"falsePositiveRateByThreshold\" : [ 0.007285974499089253, 0.01639344262295082, 0.020036429872495445, 0.14754098360655737, 0.15300546448087432, 0.17486338797814208, 0.17486338797814208, 0.2058287795992714, 0.30601092896174864, 0.48269581056466304, 0.4899817850637523, 0.9981785063752276, 1.0 ]\n", "}\n" ] }, { "data": { "text/plain": [ "\u001b[36mdataframe\u001b[39m: \u001b[32mDataFrame\u001b[39m = [key: string, survived: double ... 1 more field]\n", "\u001b[36mmetrics\u001b[39m: \u001b[32mevaluators\u001b[39m.\u001b[32mEvaluationMetrics\u001b[39m = \u001b[33mBinaryClassificationMetrics\u001b[39m(\n", " \u001b[32m0.7420382165605095\u001b[39m,\n", " \u001b[32m0.6812865497076024\u001b[39m,\n", " \u001b[32m0.7103658536585367\u001b[39m,\n", " \u001b[32m0.8442463170677148\u001b[39m,\n", " \u001b[32m0.8175464116852524\u001b[39m,\n", " \u001b[32m0.2132435465768799\u001b[39m,\n", " \u001b[32m233.0\u001b[39m,\n", " \u001b[32m468.0\u001b[39m,\n", "..." ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "println(\"Scoring the model:\\n=================\")\n", "val (dataframe, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)\n", "\n", "println(\"Transformed dataframe columns:\\n--------------------------\")\n", "dataframe.columns.foreach(println)\n", "\n", "println(\"Metrics:\\n------------\")\n", "println(metrics)" ] } ], "metadata": { "kernelspec": { "display_name": "Scala (2.11)", "language": "scala", "name": "scala211" }, "language_info": { "codemirror_mode": "text/x-scala", "file_extension": ".scala", "mimetype": "text/x-scala", "name": "scala", "nbconvert_exporter": "script", "version": "2.11.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }