{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Model Complexity Influence\n\nDemonstrate how model complexity influences both prediction accuracy and\ncomputational performance.\n\nWe will be using two datasets:\n - `diabetes_dataset` for regression.\n This dataset consists of 10 measurements taken from diabetes patients.\n The task is to predict disease progression;\n - `20newsgroups_dataset` for classification. This dataset consists of\n newsgroup posts. The task is to predict which topic (out of 20 topics)\n the post is about.\n\nWe will study the influence of model complexity on three different estimators:\n - :class:`~sklearn.linear_model.SGDClassifier` (for classification data)\n which implements stochastic gradient descent learning;\n\n - :class:`~sklearn.svm.NuSVR` (for regression data) which implements\n Nu support vector regression;\n\n - :class:`~sklearn.ensemble.GradientBoostingRegressor` (for regression data)\n which builds an additive model in a forward stage-wise fashion. Note that\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor` is much faster\n than :class:`~sklearn.ensemble.GradientBoostingRegressor` starting with\n intermediate datasets (`n_samples >= 10_000`), which is not the case for\n this example.\n\n\nWe vary the model complexity through the choice of relevant model\nparameters in each of our selected models. 
Next, we will measure the influence\non both computational performance (latency) and predictive power (MSE or\nHamming Loss).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause\n\nimport time\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn import datasets\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import hamming_loss, mean_squared_error\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.svm import NuSVR\n\n# Initialize random generator\nnp.random.seed(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n\nFirst we load both datasets.\n\n
We use\n :func:`~sklearn.datasets.fetch_20newsgroups_vectorized` to download the 20\n newsgroups dataset. It returns ready-to-use features.
``X`` of the 20 newsgroups dataset is a sparse matrix, while ``X``\n of the diabetes dataset is a NumPy array.