{ "cells": [ { "cell_type": "markdown", "id": "fdfcf286", "metadata": {}, "source": [ "# PyCaret Fugue Integration\n", "\n", "[Fugue](https://github.com/fugue-project/fugue) is a low-code unified interface for different computing frameworks such as Spark, Dask and Pandas. PyCaret is using Fugue to support distributed computing scenarios.\n", "\n", "# Hello World\n", "\n", "# Classification\n", "\n", "Let's start with the most standard example, the code is exactly the same as the local version, there is no magic." ] }, { "cell_type": "code", "execution_count": 2, "id": "398b0e09", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0session_id5517
1TargetPurchase
2Target TypeBinary
3Label EncodedCH: 0, MM: 1
4Original Data(1070, 19)
5Missing ValuesFalse
6Numeric Features13
7Categorical Features5
8Ordinal FeaturesFalse
9High Cardinality FeaturesFalse
10High Cardinality MethodNone
11Transformed Train Set(748, 17)
12Transformed Test Set(322, 17)
13Shuffle Train-TestTrue
14Stratify Train-TestFalse
15Fold GeneratorStratifiedKFold
16Fold Number10
17CPU Jobs1
18Use GPUFalse
19Log ExperimentFalse
20Experiment Nameclf-default-name
21USIb06e
22Imputation Typesimple
23Iterative Imputation IterationNone
24Numeric Imputermean
25Iterative Imputation Numeric ModelNone
26Categorical Imputerconstant
27Iterative Imputation Categorical ModelNone
28Unknown Categoricals Handlingleast_frequent
29NormalizeFalse
30Normalize MethodNone
31TransformationFalse
32Transformation MethodNone
33PCAFalse
34PCA MethodNone
35PCA ComponentsNone
36Ignore Low VarianceFalse
37Combine Rare LevelsFalse
38Rare Level ThresholdNone
39Numeric BinningFalse
40Remove OutliersFalse
41Outliers ThresholdNone
42Remove MulticollinearityFalse
43Multicollinearity ThresholdNone
44Remove Perfect CollinearityTrue
45ClusteringFalse
46Clustering IterationNone
47Polynomial FeaturesFalse
48Polynomial DegreeNone
49Trignometry FeaturesFalse
50Polynomial ThresholdNone
51Group FeaturesFalse
52Feature SelectionFalse
53Feature Selection Methodclassic
54Features Selection ThresholdNone
55Feature InteractionFalse
56Feature RatioFalse
57Interaction ThresholdNone
58Fix ImbalanceFalse
59Fix Imbalance MethodSMOTE
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.classification import *\n", "\n", "setup(data=get_data(\"juice\"), target = 'Purchase', n_jobs=1)\n", "\n", "test_models = models().index.tolist()[:5]" ] }, { "cell_type": "markdown", "id": "37b1957a", "metadata": {}, "source": [ "`compare_model` is also exactly the same if you don't want to use a distributed system" ] }, { "cell_type": "code", "execution_count": 3, "id": "c8cc5a40", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.83950.89820.73990.83630.78330.65650.66140.1390
nbNaive Bayes0.76460.83870.78460.67760.72440.52190.52910.0080
dtDecision Tree Classifier0.74870.74200.68480.67960.67990.47340.47570.0100
knnK Neighbors Classifier0.70850.75080.58200.64170.60750.37700.38020.0110
svmSVM - Linear Kernel0.55780.00000.61380.46590.43450.13440.16480.0100
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=5517, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False),\n", " GaussianNB(priors=None, var_smoothing=1e-09)]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2)" ] }, { "cell_type": "markdown", "id": "86aa67d8", "metadata": {}, "source": [ "Now let's make it distributed, as a toy case, on dask. The only thing changed is an additional parameter `parallel_backend`" ] }, { "cell_type": "code", "execution_count": 4, "id": "e7e649ce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=5517, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False),\n", " GaussianNB(priors=None, var_smoothing=1e-09)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "compare_models(include=test_models, n_select=2, parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "markdown", "id": "3953dc74", "metadata": {}, "source": [ "In order to use Spark as the execution engine, you must have access to a Spark cluster, and you must have a `SparkSession`, let's initialize a local Spark session" ] }, { "cell_type": "code", "execution_count": 6, "id": "998bd694", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder.getOrCreate()" ] }, { "cell_type": "markdown", "id": "0f5d91d6", "metadata": {}, "source": [ "Now just change `parallel_backend` to this session object, you make it run on Spark. You must understand this is a toy case. In the real situation, you need to have a SparkSession pointing to a real Spark cluster to enjoy the power of Spark" ] }, { "cell_type": "code", "execution_count": 7, "id": "87834c91", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=4418, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False),\n", " GaussianNB(priors=None, var_smoothing=1e-09)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2, parallel=FugueBackend(spark))" ] }, { "cell_type": "markdown", "id": "c490458a", "metadata": {}, "source": [ "In the end, you can `pull` to get the metrics table" ] }, { "cell_type": "code", "execution_count": 8, "id": "f74ca178", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
lrLogistic Regression0.82760.89050.74200.81410.77320.63510.64010.384
nbNaive Bayes0.76740.83940.76740.67570.71740.52130.52580.015
dtDecision Tree Classifier0.75940.75490.69700.68970.69110.49460.49670.040
knnK Neighbors Classifier0.72850.77160.60520.67500.63670.42140.42390.012
svmSVM - Linear Kernel0.51620.00000.56550.26740.35050.05000.05760.020
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "lr Logistic Regression 0.8276 0.8905 0.7420 0.8141 0.7732 \n", "nb Naive Bayes 0.7674 0.8394 0.7674 0.6757 0.7174 \n", "dt Decision Tree Classifier 0.7594 0.7549 0.6970 0.6897 0.6911 \n", "knn K Neighbors Classifier 0.7285 0.7716 0.6052 0.6750 0.6367 \n", "svm SVM - Linear Kernel 0.5162 0.0000 0.5655 0.2674 0.3505 \n", "\n", " Kappa MCC TT (Sec) \n", "lr 0.6351 0.6401 0.384 \n", "nb 0.5213 0.5258 0.015 \n", "dt 0.4946 0.4967 0.040 \n", "knn 0.4214 0.4239 0.012 \n", "svm 0.0500 0.0576 0.020 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "76a1c5be", "metadata": {}, "source": [ "# Regression\n", "\n", "It's follows the same pattern as classification." ] }, { "cell_type": "code", "execution_count": 9, "id": "917c6ac4", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0session_id4045
1Targetcharges
2Original Data(1338, 7)
3Missing ValuesFalse
4Numeric Features2
5Categorical Features4
6Ordinal FeaturesFalse
7High Cardinality FeaturesFalse
8High Cardinality MethodNone
9Transformed Train Set(936, 14)
10Transformed Test Set(402, 14)
11Shuffle Train-TestTrue
12Stratify Train-TestFalse
13Fold GeneratorKFold
14Fold Number10
15CPU Jobs1
16Use GPUFalse
17Log ExperimentFalse
18Experiment Namereg-default-name
19USId080
20Imputation Typesimple
21Iterative Imputation IterationNone
22Numeric Imputermean
23Iterative Imputation Numeric ModelNone
24Categorical Imputerconstant
25Iterative Imputation Categorical ModelNone
26Unknown Categoricals Handlingleast_frequent
27NormalizeFalse
28Normalize MethodNone
29TransformationFalse
30Transformation MethodNone
31PCAFalse
32PCA MethodNone
33PCA ComponentsNone
34Ignore Low VarianceFalse
35Combine Rare LevelsFalse
36Rare Level ThresholdNone
37Numeric BinningFalse
38Remove OutliersFalse
39Outliers ThresholdNone
40Remove MulticollinearityFalse
41Multicollinearity ThresholdNone
42Remove Perfect CollinearityTrue
43ClusteringFalse
44Clustering IterationNone
45Polynomial FeaturesFalse
46Polynomial DegreeNone
47Trignometry FeaturesFalse
48Polynomial ThresholdNone
49Group FeaturesFalse
50Feature SelectionFalse
51Feature Selection Methodclassic
52Features Selection ThresholdNone
53Feature InteractionFalse
54Feature RatioFalse
55Interaction ThresholdNone
56Transform TargetFalse
57Transform Target Methodbox-cox
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.datasets import get_data\n", "from pycaret.regression import *\n", "\n", "setup(data=get_data(\"insurance\"), target = 'charges', n_jobs=1)\n", "\n", "test_models = models().index.tolist()[:5]" ] }, { "cell_type": "markdown", "id": "4356758c", "metadata": {}, "source": [ "`compare_model` is also exactly the same if you don't want to use a distributed system" ] }, { "cell_type": "code", "execution_count": 12, "id": "bf87f67b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 ModelMAEMSERMSER2RMSLEMAPETT (Sec)
lassoLasso Regression4121.955636109634.60005980.61140.73760.54630.42430.0130
ridgeRidge Regression4134.413236105753.40005980.28800.73760.54530.42680.0120
lrLinear Regression4122.649736115891.40005981.17520.73750.54720.42430.0080
enElastic Net7122.393387174564.00009313.89340.36740.74210.93440.0100
larLeast Angle Regression7305.26471287737542.077416591.0408-9.75220.64500.85880.0120
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,\n", " normalize=False, positive=False, precompute=False, random_state=4045,\n", " selection='cyclic', tol=0.0001, warm_start=False),\n", " Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,\n", " normalize=False, random_state=4045, solver='auto', tol=0.001)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2)" ] }, { "cell_type": "markdown", "id": "8cc73849", "metadata": {}, "source": [ "Now let's make it distributed, as a toy case, on dask. The only thing changed is an additional parameter `parallel_backend`" ] }, { "cell_type": "code", "execution_count": 13, "id": "ee333586", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,\n", " normalize=False, positive=False, precompute=False, random_state=4045,\n", " selection='cyclic', tol=0.0001, warm_start=False),\n", " Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,\n", " normalize=False, random_state=4045, solver='auto', tol=0.001)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pycaret.parallel import FugueBackend\n", "\n", "compare_models(include=test_models, n_select=2, parallel=FugueBackend(\"dask\"))" ] }, { "cell_type": "markdown", "id": "38ad1ddb", "metadata": {}, "source": [ "In order to use Spark as the execution engine, you must have access to a Spark cluster, and you must have a `SparkSession`, let's initialize a local Spark session" ] }, { "cell_type": "code", "execution_count": 14, "id": "8221c7c3", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder.getOrCreate()" ] }, { "cell_type": "markdown", "id": "1ad84f4b", "metadata": {}, "source": [ "Now just change `parallel_backend` to this session object, you make it run on Spark. You must understand this is a toy case. In the real situation, you need to have a SparkSession pointing to a real Spark cluster to enjoy the power of Spark" ] }, { "cell_type": "code", "execution_count": 15, "id": "2ce39e6d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "[Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,\n", " normalize=False, positive=False, precompute=False, random_state=7138,\n", " selection='cyclic', tol=0.0001, warm_start=False),\n", " LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compare_models(include=test_models, n_select=2, parallel=FugueBackend(spark))" ] }, { "cell_type": "markdown", "id": "789fd969", "metadata": {}, "source": [ "In the end, you can `pull` to get the metrics table" ] }, { "cell_type": "code", "execution_count": 16, "id": "ecdd02a4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelMAEMSERMSER2RMSLEMAPETT (Sec)
lassoLasso Regression4240.98473.703576e+076063.90520.74780.59590.43290.015
lrLinear Regression4211.76143.722926e+076058.17080.74000.58220.42110.021
larLeast Angle Regression4403.09123.944249e+076243.09430.73170.57580.42890.020
ridgeRidge Regression4152.40583.682102e+076037.51010.71420.57220.42630.018
enElastic Net7406.38229.128549e+079497.01260.36460.74750.94720.030
\n", "
" ], "text/plain": [ " Model MAE MSE RMSE R2 \\\n", "lasso Lasso Regression 4240.9847 3.703576e+07 6063.9052 0.7478 \n", "lr Linear Regression 4211.7614 3.722926e+07 6058.1708 0.7400 \n", "lar Least Angle Regression 4403.0912 3.944249e+07 6243.0943 0.7317 \n", "ridge Ridge Regression 4152.4058 3.682102e+07 6037.5101 0.7142 \n", "en Elastic Net 7406.3822 9.128549e+07 9497.0126 0.3646 \n", "\n", " RMSLE MAPE TT (Sec) \n", "lasso 0.5959 0.4329 0.015 \n", "lr 0.5822 0.4211 0.021 \n", "lar 0.5758 0.4289 0.020 \n", "ridge 0.5722 0.4263 0.018 \n", "en 0.7475 0.9472 0.030 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "c910b81c", "metadata": {}, "source": [ "As you see, the results from the distributed versions can be different from your local versions. In the next section, we will show how to make them identical.\n", "\n", "# A more practical case\n", "\n", "The above examples are pure toys, to make things work perfectly in a distributed system you must be careful about a few things\n", "\n", "# Use a lambda instead of a dataframe in setup\n", "\n", "If you directly provide a dataframe in `setup`, this dataset will need to be sent to all worker nodes. If the dataframe is 1G, you have 100 workers, then it is possible your dirver machine will need to send out up to 100G data (depending on specific framework's implementation), then this data transfer becomes a bottleneck itself. Instead, if you provide a lambda function, it doesn't change the local compute scenario, but the driver will only send the function reference to workers, and each worker will be responsible to load the data by themselves, so there is no heavy traffic on the driver side.\n", "\n", "# Be deterministic\n", "\n", "You should always use `session_id` to make the distributed compute deterministic, otherwise, for the exactly same logic you could get drastically different selection for each run.\n", "\n", "# Set n_jobs\n", "\n", "It is important to be explicit on n_jobs when you want to run something distributedly, so it will not overuse the local/remote resources. This can also avoid resrouce contention, and make the compute faster." 
] }, { "cell_type": "code", "execution_count": 17, "id": "1d76ddae", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 DescriptionValue
0session_id0
1TargetPurchase
2Target TypeBinary
3Label EncodedCH: 0, MM: 1
4Original Data(1070, 19)
5Missing ValuesFalse
6Numeric Features13
7Categorical Features5
8Ordinal FeaturesFalse
9High Cardinality FeaturesFalse
10High Cardinality MethodNone
11Transformed Train Set(748, 17)
12Transformed Test Set(322, 17)
13Shuffle Train-TestTrue
14Stratify Train-TestFalse
15Fold GeneratorStratifiedKFold
16Fold Number10
17CPU Jobs1
18Use GPUFalse
19Log ExperimentFalse
20Experiment Nameclf-default-name
21USIcc4a
22Imputation Typesimple
23Iterative Imputation IterationNone
24Numeric Imputermean
25Iterative Imputation Numeric ModelNone
26Categorical Imputerconstant
27Iterative Imputation Categorical ModelNone
28Unknown Categoricals Handlingleast_frequent
29NormalizeFalse
30Normalize MethodNone
31TransformationFalse
32Transformation MethodNone
33PCAFalse
34PCA MethodNone
35PCA ComponentsNone
36Ignore Low VarianceFalse
37Combine Rare LevelsFalse
38Rare Level ThresholdNone
39Numeric BinningFalse
40Remove OutliersFalse
41Outliers ThresholdNone
42Remove MulticollinearityFalse
43Multicollinearity ThresholdNone
44Remove Perfect CollinearityTrue
45ClusteringFalse
46Clustering IterationNone
47Polynomial FeaturesFalse
48Polynomial DegreeNone
49Trignometry FeaturesFalse
50Polynomial ThresholdNone
51Group FeaturesFalse
52Feature SelectionFalse
53Feature Selection Methodclassic
54Features Selection ThresholdNone
55Feature InteractionFalse
56Feature RatioFalse
57Interaction ThresholdNone
58Fix ImbalanceFalse
59Fix Imbalance MethodSMOTE
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pycaret.classification import *\n", "\n", "setup(data=lambda: get_data(\"juice\", verbose=False, profile=False), target = 'Purchase', session_id=0, n_jobs=1);" ] }, { "cell_type": "markdown", "id": "2fc80912", "metadata": {}, "source": [ "# Set the appropriate batch_size\n", "\n", "`batch_size` parameter helps adjust between load balence and overhead. For each batch, setup will be called only once. So\n", "\n", "| Choice |Load Balance|Overhead|Best Scenario|\n", "|---|---|---|---|\n", "|Smaller batch size|Better|Worse|`training time >> data loading time` or `models ~= workers`|\n", "|Larger batch size|Worse|Better|`training time << data loading time` or `models >> workers`|\n", "\n", "The default value is set to `1`, meaning we want the best load balance.\n", "\n", "# Display progress\n", "\n", "In development, you can enable visual effect by `display_remote=True`, but meanwhile you must also enable [Fugue Callback](https://fugue-tutorials.readthedocs.io/tutorials/advanced/rpc.html) so that the driver can monitor worker progress. But it is recommended to turn off display in production." ] }, { "cell_type": "code", "execution_count": 18, "id": "9775c4f4", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7c88aa829a914e658437a5732dfb497d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "IntProgress(value=0, description='Processing: ', max=16)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCTT (Sec)
ldaLinear Discriminant Analysis0.83280.89490.75850.79850.77350.64160.64640.016
lrLogistic Regression0.82750.89640.72650.81050.75890.62600.63440.185
ridgeRidge Classifier0.82750.00000.74790.79710.76540.62990.63660.011
catboostCatBoost Classifier0.82210.89670.75850.77550.76240.62090.62540.779
gbcGradient Boosting Classifier0.81950.88550.75100.77600.75940.61540.61930.113
rfRandom Forest Classifier0.80480.87920.74080.74830.73970.58430.58890.171
adaAda Boost Classifier0.80210.86680.70140.76390.72750.57290.57760.090
lightgbmLight Gradient Boosting Machine0.79940.87750.72990.74440.73310.57300.57680.051
xgboostExtreme Gradient Boosting0.79410.87290.72280.73530.72480.56090.56490.258
etExtra Trees Classifier0.78200.85090.71220.72140.71010.53650.54280.148
dtDecision Tree Classifier0.77780.76460.70470.70980.70480.52700.52940.009
nbNaive Bayes0.76740.83400.73690.67760.70310.51290.51730.008
knnK Neighbors Classifier0.70730.76460.54470.62750.57920.35790.36270.011
svmSVM - Linear Kernel0.64030.00000.11070.14390.10470.06880.08200.010
dummyDummy Classifier0.62430.50000.00000.00000.00000.00000.00000.005
qdaQuadratic Discriminant Analysis0.58530.56760.43950.32360.34740.10350.11710.008
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. \\\n", "lda Linear Discriminant Analysis 0.8328 0.8949 0.7585 0.7985 \n", "lr Logistic Regression 0.8275 0.8964 0.7265 0.8105 \n", "ridge Ridge Classifier 0.8275 0.0000 0.7479 0.7971 \n", "catboost CatBoost Classifier 0.8221 0.8967 0.7585 0.7755 \n", "gbc Gradient Boosting Classifier 0.8195 0.8855 0.7510 0.7760 \n", "rf Random Forest Classifier 0.8048 0.8792 0.7408 0.7483 \n", "ada Ada Boost Classifier 0.8021 0.8668 0.7014 0.7639 \n", "lightgbm Light Gradient Boosting Machine 0.7994 0.8775 0.7299 0.7444 \n", "xgboost Extreme Gradient Boosting 0.7941 0.8729 0.7228 0.7353 \n", "et Extra Trees Classifier 0.7820 0.8509 0.7122 0.7214 \n", "dt Decision Tree Classifier 0.7778 0.7646 0.7047 0.7098 \n", "nb Naive Bayes 0.7674 0.8340 0.7369 0.6776 \n", "knn K Neighbors Classifier 0.7073 0.7646 0.5447 0.6275 \n", "svm SVM - Linear Kernel 0.6403 0.0000 0.1107 0.1439 \n", "dummy Dummy Classifier 0.6243 0.5000 0.0000 0.0000 \n", "qda Quadratic Discriminant Analysis 0.5853 0.5676 0.4395 0.3236 \n", "\n", " F1 Kappa MCC TT (Sec) \n", "lda 0.7735 0.6416 0.6464 0.016 \n", "lr 0.7589 0.6260 0.6344 0.185 \n", "ridge 0.7654 0.6299 0.6366 0.011 \n", "catboost 0.7624 0.6209 0.6254 0.779 \n", "gbc 0.7594 0.6154 0.6193 0.113 \n", "rf 0.7397 0.5843 0.5889 0.171 \n", "ada 0.7275 0.5729 0.5776 0.090 \n", "lightgbm 0.7331 0.5730 0.5768 0.051 \n", "xgboost 0.7248 0.5609 0.5649 0.258 \n", "et 0.7101 0.5365 0.5428 0.148 \n", "dt 0.7048 0.5270 0.5294 0.009 \n", "nb 0.7031 0.5129 0.5173 0.008 \n", "knn 0.5792 0.3579 0.3627 0.011 \n", "svm 0.1047 0.0688 0.0820 0.010 \n", "dummy 0.0000 0.0000 0.0000 0.005 \n", "qda 0.3474 0.1035 0.1171 0.008 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "[LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,\n", " solver='svd', store_covariance=False, tol=0.0001),\n", " LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=1000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=0, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fconf = {\n", " \"fugue.rpc.server\": \"fugue.rpc.flask.FlaskRPCServer\", # keep this value\n", " \"fugue.rpc.flask_server.host\": \"0.0.0.0\", # the driver ip address workers can access\n", " \"fugue.rpc.flask_server.port\": \"3333\", # the open port on the dirver\n", " \"fugue.rpc.flask_server.timeout\": \"2 sec\", # the timeout for worker to talk to driver\n", "}\n", "\n", "be = FugueBackend(\"dask\", fconf, display_remote=True, batch_size=3, top_only=False)\n", "compare_models(n_select=2, parallel=be)" ] }, { "cell_type": "markdown", "id": "d697e56c", "metadata": {}, "source": [ "# Custom Metrics\n", "\n", "You can add custom metrics like before. But in order to make the scorer distributable, it must be serializable. A common function should be fine, but if inside the function, it is using some global variables that are not serializable (for example an `RLock` object), it can cause issues. So try to make the custom function independent from global variables." 
] }, { "cell_type": "code", "execution_count": 19, "id": "2614b869", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name DUMMY\n", "Display Name DUMMY\n", "Score Function \n", "Scorer make_scorer(score_dummy, needs_proba=True, err...\n", "Target pred_proba\n", "Args {}\n", "Greater is Better True\n", "Multiclass True\n", "Custom True\n", "Name: mydummy, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def score_dummy(y_true, y_prob, axis=0):\n", " return 0.0\n", "\n", "add_metric(id = 'mydummy',\n", " name = 'DUMMY',\n", " score_func = score_dummy,\n", " target = 'pred_proba',\n", " greater_is_better = True,\n", " )" ] }, { "cell_type": "markdown", "id": "7ccaa531", "metadata": {}, "source": [ "Adding a function in a class instance is also ok, but make sure all member variables in the class are serializable." ] }, { "cell_type": "code", "execution_count": 20, "id": "83576a2d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "[GaussianNB(priors=None, var_smoothing=1e-09),\n", " KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", " weights='uniform')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_models = models().index.tolist()[:5]\n", "compare_models(include=test_models, n_select=2, sort=\"DUMMY\", parallel=FugueBackend(spark))" ] }, { "cell_type": "code", "execution_count": 21, "id": "04d5e7c9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYTT (Sec)
nbNaive Bayes0.76740.83400.73690.67760.70310.51290.51730.00.015
knnK Neighbors Classifier0.70730.76460.54470.62750.57920.35790.36270.00.032
svmSVM - Linear Kernel0.64030.00000.11070.14390.10470.06880.08200.00.011
lrLogistic Regression0.82750.89640.72650.81050.75890.62600.63440.00.433
dtDecision Tree Classifier0.77780.76460.70470.70980.70480.52700.52940.00.020
\n", "
" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "nb Naive Bayes 0.7674 0.8340 0.7369 0.6776 0.7031 \n", "knn K Neighbors Classifier 0.7073 0.7646 0.5447 0.6275 0.5792 \n", "svm SVM - Linear Kernel 0.6403 0.0000 0.1107 0.1439 0.1047 \n", "lr Logistic Regression 0.8275 0.8964 0.7265 0.8105 0.7589 \n", "dt Decision Tree Classifier 0.7778 0.7646 0.7047 0.7098 0.7048 \n", "\n", " Kappa MCC DUMMY TT (Sec) \n", "nb 0.5129 0.5173 0.0 0.015 \n", "knn 0.3579 0.3627 0.0 0.032 \n", "svm 0.0688 0.0820 0.0 0.011 \n", "lr 0.6260 0.6344 0.0 0.433 \n", "dt 0.5270 0.5294 0.0 0.020 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "code", "execution_count": 22, "id": "8f1d99c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name DUMMY2\n", "Display Name DUMMY2\n", "Score Function \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ModelAccuracyAUCRecallPrec.F1KappaMCCDUMMYDUMMY2TT (Sec)
knnK Neighbors Classifier0.70730.76460.54470.62750.57920.35790.36270.01.00.011
dtDecision Tree Classifier0.77780.76460.70470.70980.70480.52700.52940.01.00.010
nbNaive Bayes0.76740.83400.73690.67760.70310.51290.51730.01.00.008
lrLogistic Regression0.82750.89640.72650.81050.75890.62600.63440.01.00.192
svmSVM - Linear Kernel0.64030.00000.11070.14390.10470.06880.08200.00.00.011
\n", "" ], "text/plain": [ " Model Accuracy AUC Recall Prec. F1 \\\n", "knn K Neighbors Classifier 0.7073 0.7646 0.5447 0.6275 0.5792 \n", "dt Decision Tree Classifier 0.7778 0.7646 0.7047 0.7098 0.7048 \n", "nb Naive Bayes 0.7674 0.8340 0.7369 0.6776 0.7031 \n", "lr Logistic Regression 0.8275 0.8964 0.7265 0.8105 0.7589 \n", "svm SVM - Linear Kernel 0.6403 0.0000 0.1107 0.1439 0.1047 \n", "\n", " Kappa MCC DUMMY DUMMY2 TT (Sec) \n", "knn 0.3579 0.3627 0.0 1.0 0.011 \n", "dt 0.5270 0.5294 0.0 1.0 0.010 \n", "nb 0.5129 0.5173 0.0 1.0 0.008 \n", "lr 0.6260 0.6344 0.0 1.0 0.192 \n", "svm 0.0688 0.0820 0.0 0.0 0.011 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pull()" ] }, { "cell_type": "markdown", "id": "c7e34629", "metadata": {}, "source": [ "# Notes\n", "\n", "# Spark settings\n", "\n", "It is highly recommended to have only 1 worker on each Spark executor, so the worker can fully utilize all cpus (set `spark.task.cpus`). Also when you do this you should explicitly set `n_jobs` in `setup` to the number of cpus of each executor.\n", "\n", "```python\n", "executor_cores = 4\n", "\n", "spark = SparkSession.builder.config(\"spark.task.cpus\", executor_cores).config(\"spark.executor.cores\", executor_cores).getOrCreate()\n", "\n", "setup(data=get_data(\"juice\", verbose=False, profile=False), target = 'Purchase', session_id=0, n_jobs=executor_cores)\n", "\n", "compare_models(n_select=2, parallel=FugueBackend(spark))\n", "```\n", "\n", "# Databricks\n", "\n", "On Databricks, `spark` is the magic variable representing a SparkSession. But there is no difference to use. You do the exactly same thing as before:\n", "\n", "```python\n", "compare_models(parallel=FugueBackend(spark))\n", "```\n", "\n", "But Databricks, the visualization is difficult, so it may be a good idea to do two things:\n", "\n", "* Set `verbose` to False in `setup`\n", "* Set `display_remote` to False in `FugueBackend`\n", "\n", "# Dask\n", "\n", "Dask has fake distributed modes such as the default (multi-thread) and multi-process modes. The default mode will just work fine (but they are actually running sequentially), and multi-process doesn't work for PyCaret for now because it messes up with PyCaret's global variables. On the other hand, any Spark execution mode will just work fine.\n", "\n", "# Local Parallelization\n", "\n", "For practical use where you try non-trivial data and models, local parallelization (The eaiest way is to use local Dask as backend as shown above) normally doesn't have performance advantage. Because it's very easy to overload the CPUS on training, increasing the contention of resources. The value of local parallelization is to verify the code and give you confidence that the distributed environment will provide the expected result with much shorter time.\n", "\n", "# How to develop \n", "\n", "Distributed systems are powerful but you must follow some good practices to use them:\n", "\n", "1. **From small to large:** initially, you must start with a small set of data, for example in `compare_model` limit the models you want to try to a small number of cheap models, and when you verify they work, you can change to a larger model collection.\n", "2. **From local to distributed:** you should follow this sequence: verify small data locally then verify small data distributedly and then verify large data distributedly. The current design makes the transition seamless. 
You can do these steps sequentially: `parallel=None` -> `parallel=FugueBackend()` -> `parallel=FugueBackend(spark)`. In the second step, you can also use a local `SparkSession` or local Dask." ] }, { "cell_type": "code", "execution_count": null, "id": "ee7d43a6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }