{ "cells": [ { "cell_type": "markdown", "id": "9ca5e171", "metadata": {}, "source": [ "# Random forests intro and regression\n", "\n", "Random forests are powerful machine learning algorithms used for supervised classification and regression. Random forests work by averaging the predictions of multiple randomized decision trees. A single decision tree tends to overfit, so combining many of them reduces the effect of overfitting.\n", "\n", "Random forests are a type of ensemble model. More about ensemble models in the next notebook.\n", "\n", "Unlike many other learning algorithms, random forests provide a way to estimate the importance of each feature, and this is implemented in scikit-learn." ] }, { "cell_type": "markdown", "id": "99058d12", "metadata": { "tags": [] }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "df278c5f", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "import pytest\n", "import ipytest\n", "import unittest\n", "\n", "ipytest.autoconfig()" ] }, { "cell_type": "markdown", "id": "35699c38", "metadata": { "tags": [] }, "source": [ "## Loading the data\n", "\n", "In this regression task with random forests, we will use the Machine CPU (Central Processing Unit) dataset, which is available at [OpenML](https://www.openml.org/t/5492).\n", "\n", "If you are reading this, it's very likely that you know what a CPU is, or that you have thought about it once (or many times) when you were buying your computer. 
In this notebook, we will predict the relative performance of the CPU given the following data: \n", "\n", "* MYCT: machine cycle time in nanoseconds (integer)\n", "* MMIN: minimum main memory in kilobytes (integer)\n", "* MMAX: maximum main memory in kilobytes (integer)\n", "* CACH: cache memory in kilobytes (integer)\n", "* CHMIN: minimum channels in units (integer)\n", "* CHMAX: maximum channels in units (integer)\n", "* PRP: published relative performance (integer) (target variable)" ] }, { "cell_type": "code", "execution_count": null, "id": "67905ee6", "metadata": {}, "outputs": [], "source": [ "# Let's hide warnings\n", "\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d3083946", "metadata": {}, "outputs": [], "source": [ "machine_data = pd.read_csv(\"../../../assets/data/machine_cup.csv\")" ] }, { "cell_type": "code", "execution_count": null, "id": "46c09852", "metadata": {}, "outputs": [], "source": [ "type(machine_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "da794464-2da8-4114-a3a9-c20c6db713f7", "metadata": {}, "outputs": [], "source": [ "machine_data.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "e216742c-5603-47e1-879d-9bcfeb2b44e9", "metadata": {}, "outputs": [], "source": [ "machine_data.head()" ] }, { "cell_type": "markdown", "id": "acda9039-bd82-4ebd-97e8-e32b9510638f", "metadata": { "tags": [] }, "source": [ "## Tasks and roles" ] }, { "cell_type": "markdown", "id": "c492fd93", "metadata": { "tags": [] }, "source": [ "### Task 1: Exploratory analysis" ] }, { "cell_type": "markdown", "id": "e9906083", "metadata": {}, "source": [ "Before doing exploratory analysis, let's get the training and test data. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "98d74144", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train_data, test_data = train_test_split(machine_data, test_size=0.2, random_state=20)\n", "print(\n", "    \"The size of training data is: {} \\nThe size of testing data is: {}\".format(\n", "        len(train_data), len(test_data)\n", "    )\n", ")" ] }, { "cell_type": "markdown", "id": "c3fc640e", "metadata": { "tags": [] }, "source": [ "#### Part 1: The histogram" ] }, { "cell_type": "code", "execution_count": null, "id": "61118a4c-dea4-4ce2-bdc3-3a52fae5e1e8", "metadata": {}, "outputs": [], "source": [ "def df_hist(df):\n", "    # Plot one histogram per numeric column; skip missing or empty frames\n", "    if df is not None and not df.empty:\n", "        df.hist(bins=50, figsize=(15, 10))" ] }, { "cell_type": "code", "execution_count": null, "id": "04856278", "metadata": {}, "outputs": [], "source": [ "df_hist(train_data)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "d76527cf-1527-4bc5-a9bc-f30619ec3cc3", "metadata": { "tags": [] }, "source": [ "
pandas.DataFrame.hist.\n",
"\n",
"seaborn.pairplot.\n",
"\n",
"