{ "cells": [ { "cell_type": "markdown", "id": "fdfcf286", "metadata": {}, "source": [ "# PyCaret Fugue Integration\n", "\n", "[Fugue](https://github.com/fugue-project/fugue) is a low-code unified interface for different computing frameworks such as Spark, Dask and Pandas. PyCaret is using Fugue to support distributed computing scenarios.\n", "\n", "# Hello World\n", "\n", "# Classification\n", "\n", "Let's start with the most standard example, the code is exactly the same as the local version, there is no magic." ] }, { "cell_type": "code", "execution_count": 1, "id": "398b0e09", "metadata": { "scrolled": true }, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0Session id4292
1TargetPurchase
2Target typeBinary
3Target mappingCH: 0, MM: 1
4Original data shape(1070, 19)
5Transformed data shape(1070, 19)
6Transformed train set shape(748, 19)
7Transformed test set shape(322, 19)
8Ordinal features1
9Numeric features17
10Categorical features1
11PreprocessTrue
12Imputation typesimple
13Numeric imputationmean
14Categorical imputationconstant
15Maximum one-hot encoding5
16Encoding methodNone
17Fold GeneratorStratifiedKFold
18Fold Number10
19CPU Jobs1
20Use GPUFalse
21Log ExperimentFalse
22Experiment Nameclf-default-name
23USI9c46
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.classification import *\n", "\n", "setup(data=get_data(\"juice\", verbose=False), target = 'Purchase', n_jobs=1)\n", "\n", "test_models = models().index.tolist()[:5]" ] }, { "cell_type": "markdown", "id": "37b1957a", "metadata": {}, "source": [ "`compare_model` is also exactly the same if you don't want to use a distributed system" ] }, { "cell_type": "code", "execution_count": 2, "id": "c8cc5a40", "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.83300.89750.75320.80970.77910.64510.64750.3270
dtDecision Tree Classifier0.77150.76250.72240.70580.71060.52240.52560.0780
nbNaive Bayes0.76080.83370.78020.66930.71790.51290.52060.0780
knnK Neighbors Classifier0.75940.79890.60930.73230.66200.47820.48560.1080
svmSVM - Linear Kernel0.48810.00000.75900.33460.46280.06150.10610.0590
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/26 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.83300.89750.75320.80970.77910.64510.64750.214
dtDecision Tree Classifier0.77150.76250.72240.70580.71060.52240.52560.078
nbNaive Bayes0.76080.83370.78020.66930.71790.51290.52060.209
knnK Neighbors Classifier0.75940.79890.60930.73230.66200.47820.48560.134
svmSVM - Linear Kernel0.48810.00000.75900.33460.46280.06150.10610.058
\n", "" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "lr Logistic Regression 0.8330 0.8975 0.7532 0.8097 0.7791 \n", "dt Decision Tree Classifier 0.7715 0.7625 0.7224 0.7058 0.7106 \n", "nb Naive Bayes 0.7608 0.8337 0.7802 0.6693 0.7179 \n", "knn K Neighbors Classifier 0.7594 0.7989 0.6093 0.7323 0.6620 \n", "svm SVM - Linear Kernel 0.4881 0.0000 0.7590 0.3346 0.4628 \n", "\n", " Kappa MCC TT (Sec) \n", "lr 0.6451 0.6475 0.214 \n", "dt 0.5224 0.5256 0.078 \n", "nb 0.5129 0.5206 0.209 \n", "knn 0.4782 0.4856 0.134 \n", "svm 0.0615 0.1061 0.058 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=4292, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False),\n", " DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " random_state=4292, splitter='best')]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "compare_models(include=test_models, n_select=2, parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "markdown", "id": "3953dc74", "metadata": {}, "source": [ "In order to use Spark as the execution engine, you must have access to a Spark cluster, and you must have a `SparkSession`, let's initialize a local Spark session" ] }, { "cell_type": "code", "execution_count": 5, "id": "998bd694", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder.getOrCreate()" ] }, { "cell_type": "markdown", "id": "0f5d91d6", "metadata": {}, "source": [ "Now just change 
`parallel` to a `FugueBackend` wrapping this session object, and the same code runs on Spark. Keep in mind this is a toy case: in a real setting, you need a `SparkSession` that points to an actual Spark cluster to benefit from Spark's distributed compute" ] }, { "cell_type": "code", "execution_count": 6, "id": "87834c91", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.83300.89750.75320.80970.77910.64510.64750.678
dtDecision Tree Classifier0.77150.76250.72240.70580.71060.52240.52560.208
nbNaive Bayes0.76080.83370.78020.66930.71790.51290.52060.213
knnK Neighbors Classifier0.75940.79890.60930.73230.66200.47820.48560.573
svmSVM - Linear Kernel0.48810.00000.75900.33460.46280.06150.10610.059
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "lr Logistic Regression 0.8330 0.8975 0.7532 0.8097 0.7791 \n", "dt Decision Tree Classifier 0.7715 0.7625 0.7224 0.7058 0.7106 \n", "nb Naive Bayes 0.7608 0.8337 0.7802 0.6693 0.7179 \n", "knn K Neighbors Classifier 0.7594 0.7989 0.6093 0.7323 0.6620 \n", "svm SVM - Linear Kernel 0.4881 0.0000 0.7590 0.3346 0.4628 \n", "\n", " Kappa MCC TT (Sec) \n", "lr 0.6451 0.6475 0.678 \n", "dt 0.5224 0.5256 0.208 \n", "nb 0.5129 0.5206 0.213 \n", "knn 0.4782 0.4856 0.573 \n", "svm 0.0615 0.1061 0.059 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=4292, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False),\n", " DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " random_state=4292, splitter='best')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2, parallel=FugueBackend(spark))" ] }, { "cell_type": "markdown", "id": "c490458a", "metadata": {}, "source": [ "In the end, you can `pull` to get the metrics table" ] }, { "cell_type": "code", "execution_count": 7, "id": "f74ca178", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.83300.89750.75320.80970.77910.64510.64750.678
dtDecision Tree Classifier0.77150.76250.72240.70580.71060.52240.52560.208
nbNaive Bayes0.76080.83370.78020.66930.71790.51290.52060.213
knnK Neighbors Classifier0.75940.79890.60930.73230.66200.47820.48560.573
svmSVM - Linear Kernel0.48810.00000.75900.33460.46280.06150.10610.059
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "lr Logistic Regression 0.8330 0.8975 0.7532 0.8097 0.7791 \n", "dt Decision Tree Classifier 0.7715 0.7625 0.7224 0.7058 0.7106 \n", "nb Naive Bayes 0.7608 0.8337 0.7802 0.6693 0.7179 \n", "knn K Neighbors Classifier 0.7594 0.7989 0.6093 0.7323 0.6620 \n", "svm SVM - Linear Kernel 0.4881 0.0000 0.7590 0.3346 0.4628 \n", "\n", " Kappa MCC TT (Sec) \n", "lr 0.6451 0.6475 0.678 \n", "dt 0.5224 0.5256 0.208 \n", "nb 0.5129 0.5206 0.213 \n", "knn 0.4782 0.4856 0.573 \n", "svm 0.0615 0.1061 0.059 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "76a1c5be", "metadata": {}, "source": [ "# Regression\n", "\n", "It follows the same pattern as classification." ] }, { "cell_type": "code", "execution_count": 7, "id": "917c6ac4", "metadata": { "scrolled": true }, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0Session id3514
1Targetcharges
2Target typeRegression
3Data shape(1338, 10)
4Train data shape(936, 10)
5Test data shape(402, 10)
6Ordinal features2
7Numeric features3
8Categorical features3
9PreprocessTrue
10Imputation typesimple
11Numeric imputationmean
12Categorical imputationconstant
13Maximum one-hot encoding5
14Encoding methodNone
15Fold GeneratorKFold
16Fold Number10
17CPU Jobs1
18Use GPUFalse
19Log ExperimentFalse
20Experiment Namereg-default-name
21USI478f
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.regression import *\n", "\n", "setup(data=get_data(\"insurance\", verbose=False), target = 'charges', n_jobs=1)\n", "\n", "test_models = models().index.tolist()[:5]" ] }, { "cell_type": "markdown", "id": "4356758c", "metadata": {}, "source": [ "`compare_model` is also exactly the same if you don't want to use a distributed system" ] }, { "cell_type": "code", "execution_count": 8, "id": "bf87f67b", "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelMAEMSERMSER2RMSLEMAPETT (Sec)
larLeast Angle Regression4215.375036942784.90916056.65120.74120.59440.43010.0540
lrLinear Regression4216.069236946939.17746057.01150.74120.59560.43030.1540
lassoLasso Regression4216.076636944721.46846056.80510.74120.59430.43030.0590
ridgeRidge Regression4226.726436949983.84126057.12500.74130.59230.43190.0550
enElastic Net7260.003590321787.12189448.80410.38610.72170.89810.0540
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/26 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMAEMSERMSER2RMSLEMAPETT (Sec)
larLeast Angle Regression4215.37503.694278e+076056.65120.74120.59440.43010.055
lrLinear Regression4216.06923.694694e+076057.01150.74120.59560.43030.054
lassoLasso Regression4216.07663.694472e+076056.80510.74120.59430.43030.056
ridgeRidge Regression4226.72643.694998e+076057.12500.74130.59230.43190.111
enElastic Net7260.00359.032179e+079448.80410.38610.72170.89810.236
\n", "" ], "text/plain": [ " Model MAE MSE RMSE R2 \\\n", "lar Least Angle Regression 4215.3750 3.694278e+07 6056.6512 0.7412 \n", "lr Linear Regression 4216.0692 3.694694e+07 6057.0115 0.7412 \n", "lasso Lasso Regression 4216.0766 3.694472e+07 6056.8051 0.7412 \n", "ridge Ridge Regression 4226.7264 3.694998e+07 6057.1250 0.7413 \n", "en Elastic Net 7260.0035 9.032179e+07 9448.8041 0.3861 \n", "\n", " RMSLE MAPE TT (Sec) \n", "lar 0.5944 0.4301 0.055 \n", "lr 0.5956 0.4303 0.054 \n", "lasso 0.5943 0.4303 0.056 \n", "ridge 0.5923 0.4319 0.111 \n", "en 0.7217 0.8981 0.236 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[Lars(copy_X=True, eps=2.220446049250313e-16, fit_intercept=True, fit_path=True,\n", " jitter=None, n_nonzero_coefs=500, normalize='deprecated',\n", " precompute='auto', random_state=3514, verbose=False),\n", " LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,\n", " normalize='deprecated', positive=False)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "compare_models(include=test_models, n_select=2, sort=\"MAE\", parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "markdown", "id": "38ad1ddb", "metadata": {}, "source": [ "In order to use Spark as the execution engine, you must have access to a Spark cluster, and you must have a `SparkSession`, let's initialize a local Spark session" ] }, { "cell_type": "code", "execution_count": 10, "id": "8221c7c3", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder.getOrCreate()" ] }, { "cell_type": "markdown", "id": "1ad84f4b", "metadata": {}, "source": [ "Now just change `parallel_backend` to this session object, you make it run on Spark. You must understand this is a toy case. 
In a real setting, you need a `SparkSession` that points to an actual Spark cluster to benefit from Spark's distributed compute" ] }, { "cell_type": "code", "execution_count": 12, "id": "2ce39e6d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMAEMSERMSER2RMSLEMAPETT (Sec)
larLeast Angle Regression4215.37503.694278e+076056.65120.74120.59440.43010.098
lrLinear Regression4216.06923.694694e+076057.01150.74120.59560.43030.100
lassoLasso Regression4216.07663.694472e+076056.80510.74120.59430.43030.094
ridgeRidge Regression4226.72643.694998e+076057.12500.74130.59230.43190.053
enElastic Net7260.00359.032179e+079448.80410.38610.72170.89810.092
\n", "
" ], "text/plain": [ " Model MAE MSE RMSE R2 \\\n", "lar Least Angle Regression 4215.3750 3.694278e+07 6056.6512 0.7412 \n", "lr Linear Regression 4216.0692 3.694694e+07 6057.0115 0.7412 \n", "lasso Lasso Regression 4216.0766 3.694472e+07 6056.8051 0.7412 \n", "ridge Ridge Regression 4226.7264 3.694998e+07 6057.1250 0.7413 \n", "en Elastic Net 7260.0035 9.032179e+07 9448.8041 0.3861 \n", "\n", " RMSLE MAPE TT (Sec) \n", "lar 0.5944 0.4301 0.098 \n", "lr 0.5956 0.4303 0.100 \n", "lasso 0.5943 0.4303 0.094 \n", "ridge 0.5923 0.4319 0.053 \n", "en 0.7217 0.8981 0.092 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[Lars(copy_X=True, eps=2.220446049250313e-16, fit_intercept=True, fit_path=True,\n", " jitter=None, n_nonzero_coefs=500, normalize='deprecated',\n", " precompute='auto', random_state=3514, verbose=False),\n", " LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,\n", " normalize='deprecated', positive=False)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2, sort=\"MAE\", parallel=FugueBackend(spark))" ] }, { "cell_type": "markdown", "id": "789fd969", "metadata": {}, "source": [ "In the end, you can `pull` to get the metrics table" ] }, { "cell_type": "code", "execution_count": 13, "id": "ecdd02a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMAEMSERMSER2RMSLEMAPETT (Sec)
larLeast Angle Regression4215.37503.694278e+076056.65120.74120.59440.43010.098
lrLinear Regression4216.06923.694694e+076057.01150.74120.59560.43030.100
lassoLasso Regression4216.07663.694472e+076056.80510.74120.59430.43030.094
ridgeRidge Regression4226.72643.694998e+076057.12500.74130.59230.43190.053
enElastic Net7260.00359.032179e+079448.80410.38610.72170.89810.092
\n", "
" ], "text/plain": [ " Model MAE MSE RMSE R2 \\\n", "lar Least Angle Regression 4215.3750 3.694278e+07 6056.6512 0.7412 \n", "lr Linear Regression 4216.0692 3.694694e+07 6057.0115 0.7412 \n", "lasso Lasso Regression 4216.0766 3.694472e+07 6056.8051 0.7412 \n", "ridge Ridge Regression 4226.7264 3.694998e+07 6057.1250 0.7413 \n", "en Elastic Net 7260.0035 9.032179e+07 9448.8041 0.3861 \n", "\n", " RMSLE MAPE TT (Sec) \n", "lar 0.5944 0.4301 0.098 \n", "lr 0.5956 0.4303 0.100 \n", "lasso 0.5943 0.4303 0.094 \n", "ridge 0.5923 0.4319 0.053 \n", "en 0.7217 0.8981 0.092 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "981a9c79", "metadata": {}, "source": [ "As you see, the results from the distributed versions can be different from your local versions. In the later sections, we will show how to make them identical.\n", "\n", "# Time Series\n", "\n", "It follows the same pattern as classification.\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "ac63eb2e", "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0session_id42
1TargetNumber of airline passengers
2ApproachUnivariate
3Exogenous VariablesNot Present
4Original data shape(144, 1)
5Transformed data shape(144, 1)
6Transformed train set shape(132, 1)
7Transformed test set shape(12, 1)
8Rows with missing values0.0%
9Fold GeneratorExpandingWindowSplitter
10Fold Number3
11Enforce Prediction IntervalFalse
12Seasonal Period(s) Tested12
13Seasonality PresentTrue
14Seasonalities Detected[12]
15Primary Seasonality12
16Target Strictly PositiveTrue
17Target White NoiseNo
18Recommended d1
19Recommended Seasonal D1
20PreprocessFalse
21CPU Jobs-1
22Use GPUFalse
23Log ExperimentFalse
24Experiment Namets-default-name
25USI49cf
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.time_series import *\n", "\n", "exp = TSForecastingExperiment()\n", "exp.setup(data=get_data('airline', verbose=False), fh=12, fold=3, fig_kwargs={'renderer': 'notebook'}, session_id=42)\n", "\n", "test_models = exp.models().index.tolist()[:5]" ] }, { "cell_type": "code", "execution_count": 15, "id": "cbb457fe", "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelMASERMSSEMAERMSEMAPESMAPER2TT (Sec)
arimaARIMA0.68300.673520.006922.21990.05010.05070.86770.3200
snaiveSeasonal Naive Forecaster1.14791.094533.361135.91390.08320.08790.60720.0200
polytrendPolynomial Trend Forecaster1.65231.920248.630163.42990.11700.1216-0.07840.0167
naiveNaive Forecaster2.35992.761269.027891.03220.15690.1792-1.22161.0600
grand_meansGrand Means Forecaster5.53065.2596162.4117173.64920.40000.5075-7.04621.2700
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/27 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMASERMSSEMAERMSEMAPESMAPER2TT (Sec)
arimaARIMA0.6830.673520.006922.21990.05010.05070.86770.1267
snaiveSeasonal Naive Forecaster1.14791.094533.361135.91390.08320.08790.60720.0367
polytrendPolynomial Trend Forecaster1.65231.920248.630163.42990.1170.1216-0.07840.0133
naiveNaive Forecaster2.35992.761269.027891.03220.15690.1792-1.22160.0200
grand_meansGrand Means Forecaster5.53065.2596162.4117173.64920.40.5075-7.04620.0233
\n", "" ], "text/plain": [ " Model MASE RMSSE MAE RMSE \\\n", "arima ARIMA 0.683 0.6735 20.0069 22.2199 \n", "snaive Seasonal Naive Forecaster 1.1479 1.0945 33.3611 35.9139 \n", "polytrend Polynomial Trend Forecaster 1.6523 1.9202 48.6301 63.4299 \n", "naive Naive Forecaster 2.3599 2.7612 69.0278 91.0322 \n", "grand_means Grand Means Forecaster 5.5306 5.2596 162.4117 173.6492 \n", "\n", " MAPE SMAPE R2 TT (Sec) \n", "arima 0.0501 0.0507 0.8677 0.1267 \n", "snaive 0.0832 0.0879 0.6072 0.0367 \n", "polytrend 0.117 0.1216 -0.0784 0.0133 \n", "naive 0.1569 0.1792 -1.2216 0.0200 \n", "grand_means 0.4 0.5075 -7.0462 0.0233 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[ARIMA(maxiter=50, method='lbfgs', order=(1, 0, 0), out_of_sample_size=0,\n", " scoring='mse', scoring_args=None, seasonal_order=(0, 1, 0, 12),\n", " start_params=None, suppress_warnings=False, trend=None,\n", " with_intercept=True),\n", " NaiveForecaster(sp=12, strategy='last', window_length=None),\n", " PolynomialTrendForecaster(degree=1, regressor=None, with_intercept=True)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "best_baseline_models = exp.compare_models(include=test_models, n_select=3, parallel=FugueBackend(\"dask\"))\n", "best_baseline_models" ] }, { "cell_type": "code", "execution_count": 17, "id": "45e191f9", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder.getOrCreate()" ] }, { "cell_type": "code", "execution_count": 18, "id": "ed579ca3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMASERMSSEMAERMSEMAPESMAPER2TT (Sec)
naiveNaive Forecaster2.35992.761269.027891.03220.15690.1792-1.22162.5600
grand_meansGrand Means Forecaster5.53065.2596162.4117173.64920.40.5075-7.04622.5267
\n", "
" ], "text/plain": [ " Model MASE RMSSE MAE RMSE \\\n", "naive Naive Forecaster 2.3599 2.7612 69.0278 91.0322 \n", "grand_means Grand Means Forecaster 5.5306 5.2596 162.4117 173.6492 \n", "\n", " MAPE SMAPE R2 TT (Sec) \n", "naive 0.1569 0.1792 -1.2216 2.5600 \n", "grand_means 0.4 0.5075 -7.0462 2.5267 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[NaiveForecaster(sp=1, strategy='last', window_length=None),\n", " NaiveForecaster(sp=1, strategy='mean', window_length=None)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "best_baseline_models = exp.compare_models(include=test_models[:2], n_select=3, parallel=FugueBackend(spark))\n", "best_baseline_models" ] }, { "cell_type": "code", "execution_count": 19, "id": "3eb73043", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMASERMSSEMAERMSEMAPESMAPER2TT (Sec)
naiveNaive Forecaster2.35992.761269.027891.03220.15690.1792-1.22162.5600
grand_meansGrand Means Forecaster5.53065.2596162.4117173.64920.40.5075-7.04622.5267
\n", "
" ], "text/plain": [ " Model MASE RMSSE MAE RMSE \\\n", "naive Naive Forecaster 2.3599 2.7612 69.0278 91.0322 \n", "grand_means Grand Means Forecaster 5.5306 5.2596 162.4117 173.6492 \n", "\n", " MAPE SMAPE R2 TT (Sec) \n", "naive 0.1569 0.1792 -1.2216 2.5600 \n", "grand_means 0.4 0.5075 -7.0462 2.5267 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp.pull()" ] }, { "cell_type": "markdown", "id": "c910b81c", "metadata": {}, "source": [ "# A more practical case\n", "\n", "The above examples are pure toys, to make things work perfectly in a distributed system you must be careful about a few things\n", "\n", "# Use a lambda instead of a dataframe in setup\n", "\n", "If you directly provide a dataframe in `setup`, this dataset will need to be sent to all worker nodes. If the dataframe is 1G, you have 100 workers, then it is possible your dirver machine will need to send out up to 100G data (depending on specific framework's implementation), then this data transfer becomes a bottleneck itself. Instead, if you provide a lambda function, it doesn't change the local compute scenario, but the driver will only send the function reference to workers, and each worker will be responsible to load the data by themselves, so there is no heavy traffic on the driver side.\n", "\n", "# Be deterministic\n", "\n", "You should always use `session_id` to make the distributed compute deterministic.\n", "\n", "# Set n_jobs\n", "\n", "It is important to be explicit on n_jobs when you want to run something distributedly, so it will not overuse the local/remote resources. This can also avoid resrouce contention, and make the compute faster." 
] }, { "cell_type": "code", "execution_count": 1, "id": "1d76ddae", "metadata": { "scrolled": true }, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0Session id0
1TargetPurchase
2Target typeBinary
3Target mappingCH: 0, MM: 1
4Original data shape(1070, 19)
5Transformed data shape(1070, 19)
6Transformed train set shape(748, 19)
7Transformed test set shape(322, 19)
8Ordinal features1
9Numeric features17
10Categorical features1
11PreprocessTrue
12Imputation typesimple
13Numeric imputationmean
14Categorical imputationconstant
15Maximum one-hot encoding5
16Encoding methodNone
17Fold GeneratorStratifiedKFold
18Fold Number10
19CPU Jobs1
20Use GPUFalse
21Log ExperimentFalse
22Experiment Nameclf-default-name
23USIae18
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.classification import *\n", "\n", "setup(data_func=lambda: get_data(\"juice\", verbose=False, profile=False), target = 'Purchase', session_id=0, n_jobs=1);" ] }, { "cell_type": "markdown", "id": "2fc80912", "metadata": {}, "source": [ "# Set the appropriate batch_size\n", "\n", "`batch_size` parameter helps adjust between load balence and overhead. For each batch, setup will be called only once. So\n", "\n", "| Choice |Load Balance|Overhead|Best Scenario|\n", "|---|---|---|---|\n", "|Smaller batch size|Better|Worse|`training time >> data loading time` or `models ~= workers`|\n", "|Larger batch size|Worse|Better|`training time << data loading time` or `models >> workers`|\n", "\n", "The default value is set to `1`, meaning we want the best load balance.\n", "\n", "# Display progress\n", "\n", "In development, you can enable visual effect by `display_remote=True`, but meanwhile you must also enable [Fugue Callback](https://fugue-tutorials.readthedocs.io/tutorials/advanced/rpc.html) so that the driver can monitor worker progress. But it is recommended to turn off display in production." ] }, { "cell_type": "code", "execution_count": 9, "id": "9775c4f4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYDUMMY2TT (Sec)
ridgeRidge Classifier0.83830.00000.78020.80850.78960.65850.66370.00.00.099
ldaLinear Discriminant Analysis0.83290.89860.77010.80440.78240.64720.65220.01.00.132
lrLogistic Regression0.83030.89590.75300.80530.77480.63910.64330.01.00.271
gbcGradient Boosting Classifier0.81950.89820.75620.78700.76560.61930.62600.01.00.263
lightgbmLight Gradient Boosting Machine0.80470.88280.74920.75850.74820.58930.59500.01.00.128
adaAda Boost Classifier0.79680.87890.73260.74990.73880.57270.57510.01.00.178
rfRandom Forest Classifier0.79550.87310.72560.75000.73380.56820.57270.01.00.243
dtDecision Tree Classifier0.77950.77110.73280.71680.72010.53890.54410.01.00.082
etExtra Trees Classifier0.77140.84790.69510.72130.70380.51830.52250.01.00.214
nbNaive Bayes0.76210.82550.72550.68250.70090.50390.50740.01.00.080
knnK Neighbors Classifier0.75280.80530.62310.72080.66420.47030.47700.01.00.083
qdaQuadratic Discriminant Analysis0.65100.63490.45460.76170.44260.23770.30860.01.00.077
dummyDummy Classifier0.60960.50000.00000.00000.00000.00000.00000.01.00.072
svmSVM - Linear Kernel0.56770.00000.26900.20770.19010.02900.03960.00.00.201
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. \\\n", "ridge Ridge Classifier 0.8383 0.0000 0.7802 0.8085 \n", "lda Linear Discriminant Analysis 0.8329 0.8986 0.7701 0.8044 \n", "lr Logistic Regression 0.8303 0.8959 0.7530 0.8053 \n", "gbc Gradient Boosting Classifier 0.8195 0.8982 0.7562 0.7870 \n", "lightgbm Light Gradient Boosting Machine 0.8047 0.8828 0.7492 0.7585 \n", "ada Ada Boost Classifier 0.7968 0.8789 0.7326 0.7499 \n", "rf Random Forest Classifier 0.7955 0.8731 0.7256 0.7500 \n", "dt Decision Tree Classifier 0.7795 0.7711 0.7328 0.7168 \n", "et Extra Trees Classifier 0.7714 0.8479 0.6951 0.7213 \n", "nb Naive Bayes 0.7621 0.8255 0.7255 0.6825 \n", "knn K Neighbors Classifier 0.7528 0.8053 0.6231 0.7208 \n", "qda Quadratic Discriminant Analysis 0.6510 0.6349 0.4546 0.7617 \n", "dummy Dummy Classifier 0.6096 0.5000 0.0000 0.0000 \n", "svm SVM - Linear Kernel 0.5677 0.0000 0.2690 0.2077 \n", "\n", " F1 Kappa MCC DUMMY DUMMY2 TT (Sec) \n", "ridge 0.7896 0.6585 0.6637 0.0 0.0 0.099 \n", "lda 0.7824 0.6472 0.6522 0.0 1.0 0.132 \n", "lr 0.7748 0.6391 0.6433 0.0 1.0 0.271 \n", "gbc 0.7656 0.6193 0.6260 0.0 1.0 0.263 \n", "lightgbm 0.7482 0.5893 0.5950 0.0 1.0 0.128 \n", "ada 0.7388 0.5727 0.5751 0.0 1.0 0.178 \n", "rf 0.7338 0.5682 0.5727 0.0 1.0 0.243 \n", "dt 0.7201 0.5389 0.5441 0.0 1.0 0.082 \n", "et 0.7038 0.5183 0.5225 0.0 1.0 0.214 \n", "nb 0.7009 0.5039 0.5074 0.0 1.0 0.080 \n", "knn 0.6642 0.4703 0.4770 0.0 1.0 0.083 \n", "qda 0.4426 0.2377 0.3086 0.0 1.0 0.077 \n", "dummy 0.0000 0.0000 0.0000 0.0 1.0 0.072 \n", "svm 0.1901 0.0290 0.0396 0.0 0.0 0.201 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/14 [00:00\n", "Scorer make_scorer(score_dummy, greater_is_better=False)\n", "Target pred\n", "Args {}\n", "Greater is Better False\n", "Multiclass True\n", "Custom True\n", "Name: mydummy, 
dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def score_dummy(y_true, y_pred, axis=0):\n", " return 0.0\n", "\n", "add_metric(id = 'mydummy',\n", " name = 'DUMMY',\n", " score_func = score_dummy,\n", " target = 'pred',\n", " greater_is_better = False,\n", " )" ] }, { "cell_type": "markdown", "id": "7ccaa531", "metadata": {}, "source": [ "Adding a function in a class instance is also ok, but make sure all member variables in the class are serializable." ] }, { "cell_type": "code", "execution_count": 4, "id": "83576a2d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYTT (Sec)
dtDecision Tree Classifier0.77950.77110.73280.71680.72010.53890.54410.00.240
lrLogistic Regression0.83030.89590.75300.80530.77480.63910.64330.00.306
nbNaive Bayes0.76210.82550.72550.68250.70090.50390.50740.00.130
knnK Neighbors Classifier0.75280.80530.62310.72080.66420.47030.47700.00.097
svmSVM - Linear Kernel0.56770.00000.26900.20770.19010.02900.03960.00.102
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "dt Decision Tree Classifier 0.7795 0.7711 0.7328 0.7168 0.7201 \n", "lr Logistic Regression 0.8303 0.8959 0.7530 0.8053 0.7748 \n", "nb Naive Bayes 0.7621 0.8255 0.7255 0.6825 0.7009 \n", "knn K Neighbors Classifier 0.7528 0.8053 0.6231 0.7208 0.6642 \n", "svm SVM - Linear Kernel 0.5677 0.0000 0.2690 0.2077 0.1901 \n", "\n", " Kappa MCC DUMMY TT (Sec) \n", "dt 0.5389 0.5441 0.0 0.240 \n", "lr 0.6391 0.6433 0.0 0.306 \n", "nb 0.5039 0.5074 0.0 0.130 \n", "knn 0.4703 0.4770 0.0 0.097 \n", "svm 0.0290 0.0396 0.0 0.102 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " random_state=0, splitter='best'),\n", " LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=0, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_models = models().index.tolist()[:5]\n", "compare_models(include=test_models, n_select=2, sort=\"DUMMY\", parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "code", "execution_count": 5, "id": "04d5e7c9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYTT (Sec)
dtDecision Tree Classifier0.77950.77110.73280.71680.72010.53890.54410.00.240
lrLogistic Regression0.83030.89590.75300.80530.77480.63910.64330.00.306
nbNaive Bayes0.76210.82550.72550.68250.70090.50390.50740.00.130
knnK Neighbors Classifier0.75280.80530.62310.72080.66420.47030.47700.00.097
svmSVM - Linear Kernel0.56770.00000.26900.20770.19010.02900.03960.00.102
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "dt Decision Tree Classifier 0.7795 0.7711 0.7328 0.7168 0.7201 \n", "lr Logistic Regression 0.8303 0.8959 0.7530 0.8053 0.7748 \n", "nb Naive Bayes 0.7621 0.8255 0.7255 0.6825 0.7009 \n", "knn K Neighbors Classifier 0.7528 0.8053 0.6231 0.7208 0.6642 \n", "svm SVM - Linear Kernel 0.5677 0.0000 0.2690 0.2077 0.1901 \n", "\n", " Kappa MCC DUMMY TT (Sec) \n", "dt 0.5389 0.5441 0.0 0.240 \n", "lr 0.6391 0.6433 0.0 0.306 \n", "nb 0.5039 0.5074 0.0 0.130 \n", "knn 0.4703 0.4770 0.0 0.097 \n", "svm 0.0290 0.0396 0.0 0.102 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "code", "execution_count": 6, "id": "8f1d99c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name DUMMY2\n", "Display Name DUMMY2\n", "Score Function \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYDUMMY2TT (Sec)
dtDecision Tree Classifier0.77950.77110.73280.71680.72010.53890.54410.01.00.237
lrLogistic Regression0.83030.89590.75300.80530.77480.63910.64330.01.00.399
nbNaive Bayes0.76210.82550.72550.68250.70090.50390.50740.01.00.077
knnK Neighbors Classifier0.75280.80530.62310.72080.66420.47030.47700.01.00.082
svmSVM - Linear Kernel0.56770.00000.26900.20770.19010.02900.03960.00.00.104
\n", "" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "dt Decision Tree Classifier 0.7795 0.7711 0.7328 0.7168 0.7201 \n", "lr Logistic Regression 0.8303 0.8959 0.7530 0.8053 0.7748 \n", "nb Naive Bayes 0.7621 0.8255 0.7255 0.6825 0.7009 \n", "knn K Neighbors Classifier 0.7528 0.8053 0.6231 0.7208 0.6642 \n", "svm SVM - Linear Kernel 0.5677 0.0000 0.2690 0.2077 0.1901 \n", "\n", " Kappa MCC DUMMY DUMMY2 TT (Sec) \n", "dt 0.5389 0.5441 0.0 1.0 0.237 \n", "lr 0.6391 0.6433 0.0 1.0 0.399 \n", "nb 0.5039 0.5074 0.0 1.0 0.077 \n", "knn 0.4703 0.4770 0.0 1.0 0.082 \n", "svm 0.0290 0.0396 0.0 0.0 0.104 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n", " max_depth=None, max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " random_state=0, splitter='best'),\n", " LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=0, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2, sort=\"DUMMY2\", parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "code", "execution_count": 8, "id": "ee4e174b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYDUMMY2TT (Sec)
dtDecision Tree Classifier0.77950.77110.73280.71680.72010.53890.54410.01.00.237
lrLogistic Regression0.83030.89590.75300.80530.77480.63910.64330.01.00.399
nbNaive Bayes0.76210.82550.72550.68250.70090.50390.50740.01.00.077
knnK Neighbors Classifier0.75280.80530.62310.72080.66420.47030.47700.01.00.082
svmSVM - Linear Kernel0.56770.00000.26900.20770.19010.02900.03960.00.00.104
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "dt Decision Tree Classifier 0.7795 0.7711 0.7328 0.7168 0.7201 \n", "lr Logistic Regression 0.8303 0.8959 0.7530 0.8053 0.7748 \n", "nb Naive Bayes 0.7621 0.8255 0.7255 0.6825 0.7009 \n", "knn K Neighbors Classifier 0.7528 0.8053 0.6231 0.7208 0.6642 \n", "svm SVM - Linear Kernel 0.5677 0.0000 0.2690 0.2077 0.1901 \n", "\n", " Kappa MCC DUMMY DUMMY2 TT (Sec) \n", "dt 0.5389 0.5441 0.0 1.0 0.237 \n", "lr 0.6391 0.6433 0.0 1.0 0.399 \n", "nb 0.5039 0.5074 0.0 1.0 0.077 \n", "knn 0.4703 0.4770 0.0 1.0 0.082 \n", "svm 0.0290 0.0396 0.0 0.0 0.104 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "c7e34629", "metadata": {}, "source": [ "# Notes\n", "\n", "# Spark settings\n", "\n", "It is highly recommended to have only 1 worker on each Spark executor, so the worker can fully utilize all cpus (set `spark.task.cpus`). Also when you do this you should explicitly set `n_jobs` in `setup` to the number of cpus of each executor.\n", "\n", "```python\n", "executor_cores = 4\n", "\n", "spark = SparkSession.builder.config(\"spark.task.cpus\", executor_cores).config(\"spark.executor.cores\", executor_cores).getOrCreate()\n", "\n", "setup(data=get_data(\"juice\", verbose=False, profile=False), target = 'Purchase', session_id=0, n_jobs=executor_cores)\n", "\n", "compare_models(n_select=2, parallel=FugueBackend(spark))\n", "```\n", "\n", "# Databricks\n", "\n", "On Databricks, `spark` is the magic variable representing a SparkSession. But there is no difference to use. 
You do exactly the same thing as before:\n", "\n", "```python\n", "compare_models(parallel=FugueBackend(spark))\n", "```\n", "\n", "But on Databricks, visualization is difficult, so it may be a good idea to do two things:\n", "\n", "* Set `verbose` to False in `setup`\n", "* Set `display_remote` to False in `FugueBackend`\n", "\n", "# Dask\n", "\n", "Dask has pseudo-distributed modes, such as the default (multi-thread) mode and the multi-process mode. The default mode works fine (although tasks actually run sequentially), while the multi-process mode does not currently work for PyCaret because it interferes with PyCaret's global variables. On the other hand, any Spark execution mode works fine.\n", "\n", "# Local Parallelization\n", "\n", "For practical use with non-trivial data and models, local parallelization (the easiest way is to use local Dask as the backend, as shown above) normally has no performance advantage, because it is easy to overload the CPUs during training and increase resource contention. The value of local parallelization is to verify the code and gain confidence that the distributed environment will deliver the expected result in much less time.\n", "\n", "# How to develop\n", "\n", "Distributed systems are powerful, but you must follow some good practices to use them:\n", "\n", "1. **From small to large:** start with a small set of data; for example, in `compare_models` limit the models to a small number of cheap ones, and when you have verified they work, switch to a larger model collection.\n", "2. **From local to distributed:** follow this sequence: verify small data locally, then verify small data distributed, then verify large data distributed. The current design makes the transition seamless. You can do these steps sequentially: `parallel=None` -> `parallel=FugueBackend()` -> `parallel=FugueBackend(spark)`. 
In the second step, you can also use a local SparkSession or local Dask." ] }, { "cell_type": "code", "execution_count": null, "id": "ee7d43a6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }
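Putting those two practices together, the development loop might look like the sketch below. It is illustrative only — it assumes pycaret, fugue, dask, and pyspark are installed, that `FugueBackend` is importable from `pycaret.parallel` (the PyCaret 3 location), and that a real cluster is reachable for the last step — so treat it as a template rather than runnable code:

```python
from pyspark.sql import SparkSession
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models
from pycaret.parallel import FugueBackend

# Step 1: small data, a few cheap models, no parallelism -- verify the logic.
setup(data_func=lambda: get_data("juice", verbose=False, profile=False),
      target="Purchase", session_id=0, n_jobs=1)
compare_models(include=["lr", "dt", "nb"], parallel=None)

# Step 2: the same small workload through a local backend -- verify distribution.
compare_models(include=["lr", "dt", "nb"], parallel=FugueBackend("dask"))

# Step 3: the full model list on a real Spark cluster.
spark = SparkSession.builder.getOrCreate()
compare_models(parallel=FugueBackend(spark))
```

Each step reuses the same `setup` call, so a failure that appears only in step 2 or 3 points at the distribution layer rather than the modeling code.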