{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Iris flower dataset\n", "\n", "The [iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) is a common dataset used in machine learning.\n", "\n", "It was introduced by Ronald Fisher in 1936. It contains the petal length, petal width, sepal length and sepal width of 150 iris flowers from 3 different species.\n", "\n", "The dataset was downloaded from [Kaggle](https://www.kaggle.com/uciml/iris).\n", "\n", "To run this example, you need to install AutoClassWrapper:\n", "```bash\n", "$ python3 -m pip install autoclasswrapper\n", "```\n", "\n", "[AutoClass C](https://ti.arc.nasa.gov/tech/rse/synthesis-projects-applications/autoclass/autoclass-c/) also needs to be installed locally and be available in your `PATH`.\n", "\n", "Here is a quick solution for a Linux Bash shell:\n", "```bash\n", "wget https://ti.arc.nasa.gov/m/project/autoclass/autoclass-c-3-3-6.tar.gz\n", "tar zxvf autoclass-c-3-3-6.tar.gz\n", "rm -f autoclass-c-3-3-6.tar.gz\n", "export PATH=$PATH:$(pwd)/autoclass-c\n", "\n", "# if you use a 64-bit operating system,\n", "# you also need to install the standard 32-bit C libraries:\n", "# sudo apt-get install -y libc6-i386\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python: 3.7.1 | packaged by conda-forge | (default, Feb 26 2019, 04:48:14) \n", "[GCC 7.3.0]\n", "matplotlib: 3.0.3\n", "numpy: 1.16.2\n", "pandas: 0.24.1\n", "AutoClassWrapper: 1.4.1\n" ] } ], "source": [ "from pathlib import Path\n", "import sys\n", "import time\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "from matplotlib.lines import Line2D\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "\n", "print(\"Python:\", sys.version)\n", "print(\"matplotlib:\", matplotlib.__version__)\n", "print(\"numpy:\", np.__version__)\n", "print(\"pandas:\", 
pd.__version__)\n", "\n", "import autoclasswrapper as wrapper\n", "print(\"AutoClassWrapper:\", wrapper.__version__)\n", "\n", "# Compare version tuples so that every release from 3.6 onwards passes\n", "# (checking major and minor separately would reject e.g. 4.0).\n", "if sys.version_info < (3, 6):\n", "    sys.exit(\"Need Python>=3.6\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset preparation" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th></tr>\n",
"<tr><th>Id</th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n",
"<tr><th>2</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n",
"<tr><th>3</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>Iris-setosa</td></tr>\n",
"<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>Iris-setosa</td></tr>\n",
"<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species\n", "Id \n", "1 5.1 3.5 1.4 0.2 Iris-setosa\n", "2 4.9 3.0 1.4 0.2 Iris-setosa\n", "3 4.7 3.2 1.3 0.2 Iris-setosa\n", "4 4.6 3.1 1.5 0.2 Iris-setosa\n", "5 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"iris.csv\", index_col=\"Id\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>count</th><th>unique</th><th>top</th><th>freq</th><th>mean</th><th>std</th><th>min</th><th>25%</th><th>50%</th><th>75%</th><th>max</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>SepalLengthCm</th><td>150</td><td>NaN</td><td>NaN</td><td>NaN</td><td>5.84333</td><td>0.828066</td><td>4.3</td><td>5.1</td><td>5.8</td><td>6.4</td><td>7.9</td></tr>\n",
"<tr><th>SepalWidthCm</th><td>150</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.054</td><td>0.433594</td><td>2</td><td>2.8</td><td>3</td><td>3.3</td><td>4.4</td></tr>\n",
"<tr><th>PetalLengthCm</th><td>150</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.75867</td><td>1.76442</td><td>1</td><td>1.6</td><td>4.35</td><td>5.1</td><td>6.9</td></tr>\n",
"<tr><th>PetalWidthCm</th><td>150</td><td>NaN</td><td>NaN</td><td>NaN</td><td>1.19867</td><td>0.763161</td><td>0.1</td><td>0.3</td><td>1.3</td><td>1.8</td><td>2.5</td></tr>\n",
"<tr><th>Species</th><td>150</td><td>3</td><td>Iris-versicolor</td><td>50</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " count unique top freq mean std min 25% \\\n", "SepalLengthCm 150 NaN NaN NaN 5.84333 0.828066 4.3 5.1 \n", "SepalWidthCm 150 NaN NaN NaN 3.054 0.433594 2 2.8 \n", "PetalLengthCm 150 NaN NaN NaN 3.75867 1.76442 1 1.6 \n", "PetalWidthCm 150 NaN NaN NaN 1.19867 0.763161 0.1 0.3 \n", "Species 150 3 Iris-versicolor 50 NaN NaN NaN NaN \n", "\n", " 50% 75% max \n", "SepalLengthCm 5.8 6.4 7.9 \n", "SepalWidthCm 3 3.3 4.4 \n", "PetalLengthCm 4.35 5.1 6.9 \n", "PetalWidthCm 1.3 1.8 2.5 \n", "Species NaN NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='all').T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add discrete values\n", "\n", "Apart from the iris species, the data in this dataset are numerical values only.\n", "\n", "To demonstrate the ability of AutoClass C to handle discrete values, we will convert the `PetalWidthCm` column to discrete categorical values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def categorize(value):\n", "    # Bin a petal width (in cm) into one of three categories.\n", "    if value <= 0.75:\n", "        return \"small\"\n", "    elif 0.75 < value <= 1.75:\n", "        return \"medium\"\n", "    elif 1.75 < value:\n", "        return \"large\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th><th>PetalWidthCat</th></tr>\n",
"<tr><th>Id</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>2</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>3</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species \\\n", "Id \n", "1 5.1 3.5 1.4 0.2 Iris-setosa \n", "2 4.9 3.0 1.4 0.2 Iris-setosa \n", "3 4.7 3.2 1.3 0.2 Iris-setosa \n", "4 4.6 3.1 1.5 0.2 Iris-setosa \n", "5 5.0 3.6 1.4 0.2 Iris-setosa \n", "\n", " PetalWidthCat \n", "Id \n", "1 small \n", "2 small \n", "3 small \n", "4 small \n", "5 small " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"PetalWidthCat\"] = df[\"PetalWidthCm\"].apply(categorize)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add missing values\n", "\n", "To demonstrate the ability of AutoClass C to handle missing values, we will delete some values." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th><th>PetalWidthCat</th></tr>\n",
"<tr><th>Id</th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>NaN</td><td>3.5</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>2</th><td>4.9</td><td>NaN</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>3</th><td>4.7</td><td>3.2</td><td>NaN</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species \\\n", "Id \n", "1 NaN 3.5 1.4 0.2 Iris-setosa \n", "2 4.9 NaN 1.4 0.2 Iris-setosa \n", "3 4.7 3.2 NaN 0.2 Iris-setosa \n", "4 4.6 3.1 1.5 0.2 Iris-setosa \n", "5 5.0 3.6 1.4 0.2 Iris-setosa \n", "\n", " PetalWidthCat \n", "Id \n", "1 small \n", "2 small \n", "3 small \n", "4 small \n", "5 small " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1, \"SepalLengthCm\"] = np.nan\n", "df.loc[2, \"SepalWidthCm\"] = np.nan\n", "df.loc[3, \"PetalLengthCm\"] = np.nan\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the dataset in two different files: one with real values and the other with discrete values (column `PetalWidthCat`).\n", "Missing values must be encoded as empty fields, with nothing between the two tab separators." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Id\tSepalLengthCm\tSepalWidthCm\tPetalLengthCm\n", "1\t\t3.5\t1.4\n", "2\t4.9\t\t1.4\n", "3\t4.7\t3.2\t\n", "4\t4.6\t3.1\t1.5\n", "5\t5.0\t3.6\t1.4\n", "6\t5.4\t3.9\t1.7\n", "7\t4.6\t3.4\t1.4\n", "8\t5.0\t3.4\t1.5\n", "9\t4.4\t2.9\t1.4\n" ] } ], "source": [ "df.drop([\"Species\", \"PetalWidthCm\", \"PetalWidthCat\"], axis=1).to_csv(\"iris_real.tsv\", sep=\"\\t\", header=True)\n", "!head iris_real.tsv" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Id\tPetalWidthCat\n", "1\tsmall\n", "2\tsmall\n", "3\tsmall\n", "4\tsmall\n", "5\tsmall\n", "6\tsmall\n", "7\tsmall\n", "8\tsmall\n", "9\tsmall\n" ] } ], "source": [ "df[\"PetalWidthCat\"].to_csv(\"iris_discrete.tsv\", sep=\"\\t\", header=True)\n", "!head iris_discrete.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - prepare input files" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "2019-07-07 19:07:58 INFO Reading data file 'iris_real.tsv' as 'real scalar' with error 0.01\n", "2019-07-07 19:07:58 INFO Detected encoding: ascii\n", "2019-07-07 19:07:59 INFO Found 150 rows and 4 columns\n", "2019-07-07 19:07:59 DEBUG Checking column names\n", "2019-07-07 19:07:59 DEBUG Index name 'Id'\n", "2019-07-07 19:07:59 DEBUG Column name 'SepalLengthCm'\n", "2019-07-07 19:07:59 DEBUG Column name 'SepalWidthCm'\n", "2019-07-07 19:07:59 DEBUG Column name 'PetalLengthCm'\n", "2019-07-07 19:07:59 INFO Checking data format\n", "2019-07-07 19:07:59 INFO Column 'SepalLengthCm'\n", "2019-07-07 19:07:59 INFO count 149.000000\n", "2019-07-07 19:07:59 INFO mean 5.848322\n", "2019-07-07 19:07:59 INFO std 0.828594\n", "2019-07-07 19:07:59 INFO min 4.300000\n", "2019-07-07 19:07:59 INFO 50% 5.800000\n", "2019-07-07 19:07:59 INFO max 7.900000\n", "2019-07-07 19:07:59 INFO ---\n", "2019-07-07 19:07:59 INFO Column 'SepalWidthCm'\n", "2019-07-07 19:07:59 INFO count 149.000000\n", "2019-07-07 19:07:59 INFO mean 3.054362\n", "2019-07-07 19:07:59 INFO std 0.435034\n", "2019-07-07 19:07:59 INFO min 2.000000\n", "2019-07-07 19:07:59 INFO 50% 3.000000\n", "2019-07-07 19:07:59 INFO max 4.400000\n", "2019-07-07 19:07:59 INFO ---\n", "2019-07-07 19:07:59 INFO Column 'PetalLengthCm'\n", "2019-07-07 19:07:59 INFO count 149.000000\n", "2019-07-07 19:07:59 INFO mean 3.775168\n", "2019-07-07 19:07:59 INFO std 1.758720\n", "2019-07-07 19:07:59 INFO min 1.000000\n", "2019-07-07 19:07:59 INFO 50% 4.400000\n", "2019-07-07 19:07:59 INFO max 6.900000\n", "2019-07-07 19:07:59 INFO ---\n", "2019-07-07 19:07:59 INFO Reading data file 'iris_discrete.tsv' as 'discrete'\n", "2019-07-07 19:07:59 INFO Detected encoding: ascii\n", "2019-07-07 19:07:59 INFO Found 150 rows and 2 columns\n", "2019-07-07 19:07:59 DEBUG Checking column names\n", "2019-07-07 19:07:59 DEBUG Index name 'Id'\n", "2019-07-07 19:07:59 DEBUG Column name 'PetalWidthCat'\n", "2019-07-07 19:07:59 
INFO Checking data format\n", "2019-07-07 19:07:59 INFO Column 'PetalWidthCat': 3 different values\n", "2019-07-07 19:07:59 INFO Preparing input data\n", "2019-07-07 19:07:59 INFO Final dataframe has 150 lines and 5 columns\n", "2019-07-07 19:07:59 INFO Searching for missing values\n", "2019-07-07 19:07:59 WARNING Missing values found in column: SepalLengthCm\n", "2019-07-07 19:07:59 WARNING Missing values found in column: SepalWidthCm\n", "2019-07-07 19:07:59 WARNING Missing values found in column: PetalLengthCm\n", "2019-07-07 19:07:59 INFO Writing autoclass.db2 file\n", "2019-07-07 19:07:59 INFO If any, missing values will be encoded as '?'\n", "2019-07-07 19:07:59 DEBUG Writing autoclass.tsv file [for later use]\n", "2019-07-07 19:07:59 INFO Writing .hd2 file\n", "2019-07-07 19:07:59 INFO Writing .model file\n", "2019-07-07 19:07:59 INFO Writing .s-params file\n", "2019-07-07 19:07:59 INFO Writing .r-params file\n" ] } ], "source": [ "# Create object to prepare dataset.\n", "clust = wrapper.Input()\n", "\n", "# Load datasets from tsv files.\n", "clust.add_input_data(\"iris_real.tsv\", \"real scalar\")\n", "clust.add_input_data(\"iris_discrete.tsv\", \"discrete\")\n", "\n", "# Prepare input data:\n", "# - create a final dataframe\n", "# - merge datasets if multiple inputs\n", "clust.prepare_input_data()\n", "\n", "# Create files needed by AutoClass.\n", "clust.create_db2_file()\n", "clust.create_hd2_file()\n", "clust.create_model_file()\n", "# We use reproducible_run=True so that the results shown in this documentation can be reproduced.\n", "# Bear in mind that the authors of AutoClass C advise against this parameter for production runs.\n", "# For production, use clust.create_sparams_file() instead.\n", "clust.create_sparams_file(reproducible_run=True)\n", "clust.create_rparams_file()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - prepare run script & run autoclass" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", 
"text": [ "2019-07-07 19:08:02 INFO AutoClass C executable found in /home/pierre/.soft/bin/autoclass\n", "2019-07-07 19:08:02 INFO Writing run file\n", "2019-07-07 19:08:02 INFO AutoClass C executable found in /home/pierre/.soft/bin/autoclass\n", "2019-07-07 19:08:02 INFO AutoClass C version: AUTOCLASS C (version 3.3.6unx)\n", "2019-07-07 19:08:02 INFO Running clustering...\n" ] } ], "source": [ "# Clean previous status file and results if a classification has already been performed.\n", "!rm -f autoclass-run-* *.results-bin\n", "\n", "# Search autoclass in path.\n", "wrapper.search_autoclass_in_path()\n", "\n", "# Create object to run AutoClass.\n", "run = wrapper.Run()\n", "\n", "# Prepare run script.\n", "run.create_run_file()\n", "\n", "# Run AutoClass.\n", "run.run()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - parse and format results" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2019-07-07 19:08:05 INFO Extracting autoclass results\n", "2019-07-07 19:08:05 INFO Found 150 cases classified in 4 classes\n", "2019-07-07 19:08:05 INFO Aggregating input data\n", "2019-07-07 19:08:05 INFO Writing classes + probabilities .tsv file\n", "2019-07-07 19:08:05 INFO Writing .cdt file\n", "2019-07-07 19:08:05 INFO Writing .cdt file (with probabilities)\n", "2019-07-07 19:08:05 INFO Writing class statistics\n", "2019-07-07 19:08:05 INFO Writing dendrogram\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Wait until the 'autoclass-run-success' status file appears, printing the elapsed time.\n", "timer = 0\n", "step = 2\n", "while not Path(\"autoclass-run-success\").exists():\n", "    timer += step\n", "    sys.stdout.write(\"\\r\")\n", "    sys.stdout.write(f\"Time: {timer} sec.\")\n", "    sys.stdout.flush()\n", "    time.sleep(step)\n", "\n", "results = wrapper.Output()\n", "results.extract_results()\n", "results.aggregate_input_data()\n", "results.write_cdt()\n", "results.write_cdt(with_proba=True)\n", "results.write_class_stats()\n", "results.write_dendrogram()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, add the class number to the original dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>SepalLengthCm</th><th>SepalWidthCm</th><th>PetalLengthCm</th><th>PetalWidthCm</th><th>Species</th><th>PetalWidthCat</th><th>main-class</th></tr>\n",
"<tr><th>Id</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>NaN</td><td>3.5</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td><td>1</td></tr>\n",
"<tr><th>2</th><td>4.9</td><td>NaN</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td><td>1</td></tr>\n",
"<tr><th>3</th><td>4.7</td><td>3.2</td><td>NaN</td><td>0.2</td><td>Iris-setosa</td><td>small</td><td>1</td></tr>\n",
"<tr><th>4</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>Iris-setosa</td><td>small</td><td>1</td></tr>\n",
"<tr><th>5</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>Iris-setosa</td><td>small</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species \\\n", "Id \n", "1 NaN 3.5 1.4 0.2 Iris-setosa \n", "2 4.9 NaN 1.4 0.2 Iris-setosa \n", "3 4.7 3.2 NaN 0.2 Iris-setosa \n", "4 4.6 3.1 1.5 0.2 Iris-setosa \n", "5 5.0 3.6 1.4 0.2 Iris-setosa \n", "\n", " PetalWidthCat main-class \n", "Id \n", "1 small 1 \n", "2 small 1 \n", "3 small 1 \n", "4 small 1 \n", "5 small 1 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_class = pd.read_csv(\"autoclass_out.tsv\", sep=\"\\t\", index_col=\"Id\")\n", "df = pd.concat([df, df_class[\"main-class\"]], axis=1, join=\"outer\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Compute class distribution for iris species" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th>main-class</th><th>1</th><th>2</th><th>3</th><th>4</th></tr>\n",
"<tr><th>Species</th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Iris-setosa</th><td>50</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>Iris-versicolor</th><td>0</td><td>23</td><td>0</td><td>27</td></tr>\n",
"<tr><th>Iris-virginica</th><td>0</td><td>16</td><td>32</td><td>2</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ "main-class 1 2 3 4\n", "Species \n", "Iris-setosa 50 0 0 0\n", "Iris-versicolor 0 23 0 27\n", "Iris-virginica 0 16 32 2" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, index=[\"Species\"], columns=[\"main-class\"], values=[], aggfunc=len, fill_value=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The setosa species is found only in cluster 1. Note that the missing values did not interfere with the classification of the first 3 flowers as setosa.\n", "\n", "The versicolor species is split between clusters 2 (23 flowers) and 4 (27 flowers).\n", "\n", "The virginica species is found mainly in cluster 3 (32 flowers) but also in clusters 2 (16 flowers) and 4 (2 flowers).\n", "\n", "AutoClass C automatically determines the optimal number of classes. It is always a good idea to analyse the final results to check whether some clusters can be merged (for instance clusters 2 and 4).\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }