{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import time\n", "from pprint import pprint\n", "\n", "import matplotlib.pyplot as plt\n", "import nivapy3 as nivapy\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sn\n", "import utils\n", "from matplotlib import colors\n", "from sklearn import model_selection\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LogisticRegression, RidgeClassifier\n", "from sklearn.model_selection import RandomizedSearchCV, train_test_split\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.svm import SVC\n", "\n", "sn.set(style=\"ticks\")\n", "plt.style.use(\"ggplot\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing models and choosing hyperparameters\n", "\n", "This notebook provides a basic framework for:\n", "\n", " 1. Tuning the hyperparameters of individual classification algorithms, and \n", " 2. Comparing the performance of various different algorithms (with different hyperparameters)\n", " \n", "Scikit-Learn provides a wide range of algorithms for **supervised classification** (see [here](https://scikit-learn.org/stable/supervised_learning.html) for an overview). It should be possible to use any of Sklearn's **classification** (not regression) algorithms in this notebook. I've imported seven suggested algorithms in the code above, but feel free to add others if you wish.\n", "\n", "**Tip:** If you're trying out a new algorithm, start off with a **small training dataset** (maybe 10,000-ish rows) and, if it works, scale things up. If you go straight for millions of rows, you'll likely get lots of crashes (plus 5 to 10 minute delays as the cluster recovers before you can restart).\n", "\n", "**Note:** To run this notebook in its present form, you will need to **sign in to the JupyterHub on one of the larger machines**: either \"High memory\" (48 GB RAM; 8 CPUs) or \"High memory & high CPU\" (60 GB RAM; 16 CPUs). For long-running computations, please remember to **check progress periodically and shut down your server when you're finished**. (This doesn't mean you need to be tied to your computer; you can shut down your server pretty easily from your phone)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Set user options" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Specify area(s) from which to build train/test dataset\n", "area_list = [1, 3, 6]\n", "\n", "# Specify area for 'final' test image (should NOT be in the list above)\n", "pred_area = 2\n", "\n", "# Number of samples to use for training and testing\n", "train_size = 1e4\n", "test_size = 1e4\n", "\n", "# For repeatability of random numbers\n", "seed = 42\n", "\n", "# Scoring metric for model comparison\n", "metric = \"accuracy\"\n", "\n", "# Number of folds for CV\n", "n_folds = 10\n", "\n", "# Whether to equalise images using linear histogram stretch\n", "equalise = False\n", "\n", "assert (\n", " pred_area not in area_list\n", "), \"'pred_area' should be an independent test i.e. 
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [
 "# Define colourbar\n",
 "# Class codes for \"truth\" datasets\n",
 "class_codes = {\n",
 "    -1: \"Other\",\n",
 "    0: \"No data\",\n",
 "    1: \"Brown algae\",\n",
 "    2: \"Green algae\",\n",
 "    3: \"Red algae\",\n",
 "    4: \"Eelgrass\",\n",
 "    5: \"Rock\",\n",
 "    6: \"Sand\",\n",
 "    7: \"Lichen\",\n",
 "    8: \"Lichen (2)\",\n",
 "    9: \"Terrestrial vegetation\",\n",
 "    10: \"Beach/shingle\",\n",
 "}\n",
 "\n",
 "# Define colours for classes. Ordered -1 to 10 (i.e. same as in 'class_codes', above)\n",
 "cmap = colors.ListedColormap(\n",
 "    [\n",
 "        \"lightcyan\",\n",
 "        \"black\",\n",
 "        \"tan\",\n",
 "        \"lawngreen\",\n",
 "        \"red\",\n",
 "        \"none\",\n",
 "        \"lightgrey\",\n",
 "        \"gold\",\n",
 "        \"magenta\",\n",
 "        \"none\",\n",
 "        \"mediumseagreen\",\n",
 "        \"orange\",\n",
 "    ]\n",
 ")\n",
 "bounds = np.arange(-1.5, 11.5)\n",
 "norm = colors.BoundaryNorm(bounds, cmap.N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "## 2. Build dataset for training and evaluation\n",
 "\n",
 "We begin by reading all the data from several different areas, so that our training dataset has a mixture of pixels from different parts of the study region. The three areas chosen above together have ~19 million pixels, which is too large and slow for the kind of experimentation we want to do here. I've therefore set both `train_size` and `test_size` to 10,000 rows (`1e4`), in line with the tip at the top of the notebook. This should be much more manageable, while still providing a reasonable first impression of performance.\n",
 "\n",
 "We also read in a fourth image, which is kept separate and used for final testing and visualisation.\n",
 "\n",
 "You are welcome to **use data from other areas** and **choose different scoring metrics** (see Sklearn's documentation). Also **consider setting `equalise=True`** in the options above. This should have no effect on \"non-parametric\" methods (like random forest) but could affect some of the other classifiers." ] },
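{ "cell_type": "markdown", "metadata": {}, "source": [
 "One caveat before splitting: the classes here are strongly imbalanced (the evaluation report in Section 3.4 has thousands of 'Brown algae' pixels but only a few dozen 'Red algae'). If that becomes a problem, `train_test_split` accepts a `stratify` argument that preserves the class proportions in both subsets. A minimal sketch of the variant call (illustrative only - the cell below uses a plain shuffled split):\n",
 "\n",
 "```python\n",
 "X_train, X_eval, y_train, y_eval = train_test_split(\n",
 "    df[cols],\n",
 "    df[\"y\"],\n",
 "    train_size=int(train_size),\n",
 "    test_size=int(test_size),\n",
 "    random_state=seed,\n",
 "    stratify=df[\"y\"],  # keep class proportions equal in both subsets\n",
 ")\n",
 "```" ] },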
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of samples in dataset: 18.87 million.\n", "Number of samples in training dataset: 0.01 million.\n", "Number of samples in evaluation dataset: 0.01 million.\n" ] } ], "source": [ "# Read train/test image data\n", "df_list = []\n", "for area in area_list:\n", " df = utils.image_to_sample_df(area, equalise=equalise)\n", " df_list.append(df)\n", "\n", "df = pd.concat(df_list, axis=\"rows\")\n", "del df_list\n", "\n", "# Split into training and evaluation\n", "cols = [str(i) for i in range(1, 9)]\n", "X_train, X_eval, y_train, y_eval = train_test_split(\n", " df[cols],\n", " df[\"y\"],\n", " train_size=int(train_size),\n", " test_size=int(test_size),\n", " random_state=seed,\n", " shuffle=True,\n", ")\n", "\n", "assert len(X_train) == len(y_train)\n", "assert len(X_eval) == len(y_eval)\n", "\n", "print(f\"Total number of samples in dataset: {(len(df) / 1E6):.2f} million.\")\n", "print(f\"Number of samples in training dataset: {(len(X_train) / 1E6):.2f} million.\")\n", "print(f\"Number of samples in evaluation dataset: {(len(X_eval) / 1E6):.2f} million.\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of pixels in test image: 13.32 million.\n" ] } ], "source": [ "# Read bands for 'final' test image\n", "pred_df = utils.image_to_sample_df(pred_area, equalise=equalise, dropna=False)\n", "del pred_df[\"y\"]\n", "\n", "# Read manually classified \"truth\" datset\n", "man_path = f\"/home/jovyan/shared/drones/frisk_oslofjord/raster/ne_akeroya_5cm_area_{pred_area}_man_class.tif\"\n", "man_img, ndv, man_epsg, extent = nivapy.spatial.read_raster(man_path, band_no=1)\n", "\n", "print(f\"Total number of pixels in test image: {(len(pred_df) / 1E6):.2f} million.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Explore hyperparameters for an individual model\n", "\n", "At the top of this notebook I've imported seven classifiers as a starting point (feel free to change them if you like):\n", "\n", " * LogisticRegression\n", " * LinearDiscriminantAnalysis\n", " * RidgeClassifier\n", " * GaussianNB\n", " * KNeighborsClassifier\n", " * DecisionTreeClassifier\n", " * RandomForestClassifier\n", "\n", "The first of these is essentially the same as the method you implemented with Guri (which I think was multinomial logistic regression?). This is a linear approach, as is `LinearDiscriminantAnalysis` and the `RidgeClassifier`. `GaussianNB` is a very simple method, but it can be effective in some circumstances. The `KNeighborsClassifier` is also conceptually simple, but often surprisingly powerful. The last two are tree-based classifiers, where the `RandomForestClassifier` is just an ensemble of `DecisionTreeClassifiers`. These two - together with KNN - are non-parametric methods, which makes them both flexible and powerful.\n", "\n", "As a starting point, I recommend reading the online documentation for each of these models. Most of the algorithms will have a general introduction (linked from [here](https://scikit-learn.org/stable/supervised_learning.html)) plus a more detailed page describing the API and hyperparameter options (e.g. [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the `RandomForestClassifier`). For each algorithm, aim to:\n", "\n", " 1. 
{ "cell_type": "markdown", "metadata": {}, "source": [
 "### 3.1. Select the base model\n",
 "\n",
 "Change the code below to match the model you want to work with." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [
 "# Create the base model to tune\n",
 "# model = RandomForestClassifier(n_jobs=-1)\n",
 "\n",
 "model = SVC(kernel=\"rbf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### 3.2. Choose hyperparameters to tune\n",
 "\n",
 "Change the code below to create a \"grid\" of hyperparameters to search. These hyperparameters must match the options documented in the API for each algorithm (see above). The commented-out grid below is for the `RandomForestClassifier`; the active grid matches the `SVC` selected in Section 3.1.\n",
 "\n",
 "Be conservative with the number of combinations you choose to begin with, until you get an idea of how long each model takes to fit (e.g. start with a grid of just one or two parameter combinations and work from there). Bear in mind that the combinations multiply quickly and runtimes can become large - see the note after the next cell." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'C': [10, 100, 200], 'gamma': [5e-05, 0.0001, 'auto', 'scale']}\n" ] } ], "source": [
 "## Number of trees in random forest\n",
 "# n_estimators = [10]\n",
 "#\n",
 "## Number of features to consider at every split\n",
 "# max_features = [\"auto\"]  # \"sqrt\"\n",
 "#\n",
 "## Maximum number of levels in tree\n",
 "# max_depth = [2, 6, 10, 20]\n",
 "#\n",
 "## Create the random grid\n",
 "# random_grid = {\n",
 "#     \"max_features\": max_features,\n",
 "#     \"max_depth\": max_depth,\n",
 "#     \"n_estimators\": n_estimators,\n",
 "# }\n",
 "\n",
 "# Regularisation strength (C) and RBF kernel coefficient (gamma) for the SVC\n",
 "C = [10, 100, 200]\n",
 "gamma = [0.00005, 0.0001, \"auto\", \"scale\"]\n",
 "random_grid = {\n",
 "    \"C\": C,\n",
 "    \"gamma\": gamma,\n",
 "}\n",
 "\n",
 "pprint(random_grid)" ] },
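{ "cell_type": "markdown", "metadata": {}, "source": [
 "A quick sanity check on the size of this grid: 3 values of `C` × 4 values of `gamma` = 12 candidate combinations. The randomised search below asks for `n_iter=50` samples, but with only 12 distinct combinations available it simply runs all 12 (and warns about it). With 10-fold cross-validation, that makes 12 × 10 = 120 model fits in total." ] },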
{ "cell_type": "markdown", "metadata": {}, "source": [
 "### 3.3. Train model\n",
 "\n",
 "The code below trains the model multiple times, exploring the hyperparameter grid." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 12 candidates, totalling 120 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [
 "/opt/conda/lib/python3.8/site-packages/sklearn/model_selection/_search.py:278: UserWarning: The total space of parameters 12 is smaller than n_iter=50. Running 12 iterations. For exhaustive searches, use GridSearchCV.\n",
 "  warnings.warn(\n",
 "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.\n",
 "[Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 27.4s\n",
 "[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 1.8min finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.16 s, sys: 121 ms, total: 2.28 s\n", "Wall time: 1min 52s\n" ] }, { "data": { "text/plain": [
 "RandomizedSearchCV(cv=10, estimator=SVC(), n_iter=50, n_jobs=-1,\n",
 "                   param_distributions={'C': [10, 100, 200],\n",
 "                                        'gamma': [5e-05, 0.0001, 'auto',\n",
 "                                                  'scale']},\n",
 "                   random_state=42, verbose=2)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [
 "%%time\n",
 "\n",
 "# Random search of parameters, using cross-validation\n",
 "model_random = RandomizedSearchCV(\n",
 "    estimator=model,\n",
 "    param_distributions=random_grid,\n",
 "    n_iter=50,\n",
 "    cv=n_folds,\n",
 "    verbose=2,\n",
 "    random_state=seed,\n",
 "    n_jobs=-1,\n",
 ")\n",
 "\n",
 "# Fit the random search model\n",
 "model_random.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### 3.4. Evaluate model\n",
 "\n",
 "The code below picks the \"best\" model from among the ones specified in your hyperparameter grid and makes predictions for the testing dataset. **Make a note of the \"best\" hyperparameter values for each model**, so you can use them again later.\n",
 "\n",
 "The code also prints model skill statistics. It's probably a good idea to modify this to **save the output for each model**." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
 "The 'best' model identified has the following parameters:\n",
 "\n",
 "SVC(C=200, gamma=5e-05) \n",
 "\n",
 "CPU times: user 791 ms, sys: 1.12 ms, total: 792 ms\n",
 "Wall time: 791 ms\n" ] } ], "source": [
 "%%time\n",
 "\n",
 "# Predict classes for the evaluation data using the 'best' model found\n",
 "best = model_random.best_estimator_\n",
 "best_preds = best.predict(X_eval)\n",
 "\n",
 "print(\"The 'best' model identified has the following parameters:\\n\")\n",
 "print(best, \"\\n\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
 "Classification report:\n",
 "                        precision    recall  f1-score   support\n",
 "\n",
 "           Brown algae       0.94      0.95      0.95      3921\n",
 "           Green algae       0.54      0.35      0.42        84\n",
 "             Red algae       0.82      0.20      0.33        44\n",
 "                  Rock       0.86      0.94      0.90      3416\n",
 "                  Sand       0.86      0.88      0.87      1234\n",
 "                Lichen       0.71      0.41      0.52        58\n",
 "Terrestrial vegetation       0.87      0.80      0.83       652\n",
 "         Beach/shingle       0.82      0.54      0.65       591\n",
 "\n",
 "              accuracy                           0.89     10000\n",
 "             macro avg       0.80      0.63      0.68     10000\n",
 "          weighted avg       0.89      0.89      0.89     10000\n",
 "\n",
 "Classification accuracy: 0.892400\n" ] } ], "source": [
 "# Skill summary\n",
 "# Only use relevant labels from the training dataset\n",
 "class_labels = [1, 2, 3, 5, 6, 7, 9, 10]\n",
 "class_names = [class_codes[i] for i in class_labels]\n",
 "\n",
 "utils.classification_report(\n",
 "    y_eval,\n",
 "    best_preds,\n",
 "    class_labels,\n",
 "    class_names,\n",
 ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [
 "# Tidy up to save memory\n",
 "del model, model_random" ] },
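{ "cell_type": "markdown", "metadata": {}, "source": [
 "As the warning above points out, a grid this small (12 combinations) can just as well be searched exhaustively. Below is a minimal sketch of the equivalent `GridSearchCV` call, reusing `random_grid` from Section 3.2; expect a runtime similar to the randomised search above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "from sklearn.model_selection import GridSearchCV\n",
 "\n",
 "# Exhaustive equivalent of the randomised search above (12 x 10 = 120 fits)\n",
 "model_grid = GridSearchCV(\n",
 "    estimator=SVC(kernel=\"rbf\"),\n",
 "    param_grid=random_grid,\n",
 "    scoring=metric,\n",
 "    cv=n_folds,\n",
 "    n_jobs=-1,\n",
 ")\n",
 "# model_grid.fit(X_train, y_train)  # uncomment to run" ] },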
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## 4. Compare multiple models\n",
 "\n",
 "Section 3 should give a good idea of which hyperparameters work best for each algorithm. In this section, you can take the best model of each kind and compare them against one another.\n",
 "\n",
 "At present, the code below uses the **default** hyperparameters for each model (except for the random forest, which we've already worked with and therefore know something about). Where possible, I've set up **parallel processing** (by setting `n_jobs=-1`), but note that not all algorithms support this.\n",
 "\n",
 "Ideally, you should set the hyperparameters for each of the models below in a similar way, based on your results above." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
 "Fitting RandomForestClassifier(max_depth=20, n_estimators=10, n_jobs=-1)\n",
 " Cross-validating...\n",
 " Refitting and predicting test image...\n",
 "Fitting SVC(C=200, gamma=5e-05)\n",
 " Cross-validating...\n",
 " Refitting and predicting test image...\n",
 "Done.\n" ] }, { "data": { "text/html": [
| \n", " | accuracy | \n", "train_time_per_fold_s | \n", "pred_time_s | \n", "
|---|---|---|---|
| model | \n", "\n", " | \n", " | \n", " |
| SVC | \n", "0.8894 | \n", "0.414523 | \n", "1050.339782 | \n", "
| RF | \n", "0.8644 | \n", "0.037790 | \n", "7.524469 | \n", "