{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Luokittelu - K-Means Cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Klassinen esimerkki luokittelusta on kurjenmiekkojen (iris) luokittelu kolmeen lajiin (setosa, versicolor, virginica) \n", "terä- (petal) ja verholehtien (sepal) koon mukaan. Seuraavassa kokeilen lajien tunnistamista ilman opetusdataa.\n", "\n", "

The idea behind the K-Means clustering method

\n", "\n", "Menetelmän tarkoituksena on löytää datasta K-kappaletta ryhmiä (klustereita, segmenttejä). Ryhmät muodostetaan ryhmäkeskusten ympärille." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "#Vaikuttaa kaavioiden ulkoasuun:\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Esimerkkiaineisto löytyy seaborn-kirjastosta:\n", "iris = sns.load_dataset('iris')\n", "iris.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Feature-matriisi on iris-data ilman species-muuttujaa:\n", "X = iris.drop('species', axis=1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[6.85 , 3.07368421, 5.74210526, 2.07105263],\n", " [5.006 , 3.428 , 1.462 , 0.246 ],\n", " [5.9016129 , 2.7483871 , 4.39354839, 1.43387097]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Gaussian naive bayes -mallin tuonti:\n", "from sklearn.cluster import KMeans\n", "\n", "#Mallin sovitus:\n", "malli = KMeans(n_clusters=3)\n", "malli.fit(X)\n", "\n", "#Ryhmien keskukset (sepal_length, sepal_width, petal_length, petal_width):\n", "malli.cluster_centers_" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "col_0 lkm\n", "K \n", "0 38\n", "1 50\n", "2 62" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Ryhmiin kuulumiset:\n", "X['K'] = malli.predict(X)\n", "pd.crosstab(X['K'], 'lkm')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "K 0 1 2\n", "sepal_length count 38.000000 50.000000 62.000000\n", " mean 6.850000 5.006000 5.901613\n", " std 0.494155 0.352490 0.466410\n", " min 6.100000 4.300000 4.900000\n", " 25% 6.425000 4.800000 5.600000\n", " 50% 6.700000 5.000000 5.900000\n", " 75% 7.200000 5.200000 6.200000\n", " max 7.900000 5.800000 7.000000\n", "sepal_width count 38.000000 50.000000 62.000000\n", " mean 3.073684 3.428000 2.748387\n", " std 0.290092 0.379064 0.296284\n", " min 2.500000 2.300000 2.000000\n", " 25% 2.925000 3.200000 2.500000\n", " 50% 3.000000 3.400000 2.800000\n", " 75% 3.200000 3.675000 3.000000\n", " max 3.800000 4.400000 3.400000\n", "petal_length count 38.000000 50.000000 62.000000\n", " mean 5.742105 1.462000 4.393548\n", " std 0.488590 0.173664 0.508895\n", " min 4.900000 1.000000 3.000000\n", " 25% 5.425000 1.400000 4.025000\n", " 50% 5.650000 1.500000 4.500000\n", " 75% 6.000000 1.575000 4.800000\n", " max 6.900000 1.900000 5.100000\n", "petal_width count 38.000000 50.000000 62.000000\n", " mean 2.071053 0.246000 1.433871\n", " std 0.279872 0.105386 0.297500\n", " min 1.400000 0.100000 1.000000\n", " 25% 1.825000 0.200000 1.300000\n", " 50% 2.100000 0.200000 1.400000\n", " 75% 2.300000 0.300000 1.575000\n", " max 2.500000 0.600000 2.400000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Tunnuslukuja ryhmittäin:\n", "X.groupby('K').describe().T" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
