{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Principal Component Analysis (PCA) on Iris Dataset\n\nThis example shows a well known decomposition technique known as Principal Component\nAnalysis (PCA) on the\n[Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).\n\nThis dataset is made of 4 features: sepal length, sepal width, petal length, petal\nwidth. We use PCA to project this 4 feature space into a 3-dimensional space.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading the Iris dataset\n\nThe Iris dataset is directly available as part of scikit-learn. It can be loaded\nusing the :func:`~sklearn.datasets.load_iris` function. With the default parameters,\na :class:`~sklearn.utils.Bunch` object is returned, containing the data, the\ntarget values, the feature names, and the target names.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import load_iris\n\niris = load_iris(as_frame=True)\nprint(iris.keys())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Plot of pairs of features of the Iris dataset\n\nLet's first plot the pairs of features of the Iris dataset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import seaborn as sns\n\n# Rename classes using the iris target names\niris.frame[\"target\"] = iris.target_names[iris.target]\n_ = sns.pairplot(iris.frame, hue=\"target\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each data point on each scatter plot refers to one of the 150 iris flowers\nin the dataset, with the color indicating their respective type\n(Setosa, Versicolor, and Virginica).\n\nYou can already see a pattern regarding the Setosa type, which is\neasily identifiable based on its short and wide sepal. Only\nconsidering these two dimensions, sepal width and length, there's still\noverlap between the Versicolor and Virginica types.\n\nThe diagonal of the plot shows the distribution of each feature. We observe\nthat the petal width and the petal length are the most discriminant features\nfor the three types.\n\n## Plot a PCA representation\nLet's apply a Principal Component Analysis (PCA) to the iris dataset\nand then plot the irises across the first three PCA dimensions.\nThis will allow us to better differentiate among the three types!\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\n# unused but required import for doing 3d projections with matplotlib < 3.2\nimport mpl_toolkits.mplot3d  # noqa: F401\n\nfrom sklearn.decomposition import PCA\n\nfig = plt.figure(1, figsize=(8, 6))\nax = fig.add_subplot(111, projection=\"3d\", elev=-150, azim=110)\n\nX_reduced = PCA(n_components=3).fit_transform(iris.data)\nscatter = ax.scatter(\n    X_reduced[:, 0],\n    X_reduced[:, 1],\n    X_reduced[:, 2],\n    c=iris.target,\n    s=40,\n)\n\nax.set(\n    title=\"First three PCA dimensions\",\n    xlabel=\"1st Eigenvector\",\n    ylabel=\"2nd Eigenvector\",\n    zlabel=\"3rd Eigenvector\",\n)\nax.xaxis.set_ticklabels([])\nax.yaxis.set_ticklabels([])\nax.zaxis.set_ticklabels([])\n\n# Add a legend\nlegend1 = ax.legend(\n    scatter.legend_elements()[0],\n    iris.target_names.tolist(),\n    loc=\"upper right\",\n    title=\"Classes\",\n)\nax.add_artist(legend1)\n\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "PCA will create 3 new features that are a linear combination of the 4 original\nfeatures. In addition, this transformation maximizes the variance. With this\ntransformation, we see that we can identify each species using only the first feature\n(i.e., first eigenvector).\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.21"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}