{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a dataset that represents handwritten digits as 8x8 grids of color-intensity values. A description of the dataset is available at [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits) and in the \n", "[UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# We import a function for loading the dataset and call it.\n", "from sklearn.datasets import load_digits\n", "digits = load_digits()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sklearn.utils.Bunch" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The data and metadata are organized in a so-called \"Bunch\" object.\n", "type(digits)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['DESCR', 'data', 'images', 'target', 'target_names']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This Bunch has the following attributes.\n", "dir(digits)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _digits_dataset:\n", "\n", "Optical recognition of handwritten digits dataset\n", "--------------------------------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", " :Number of Instances: 5620\n", " :Number of Attributes: 64\n", " :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n", " :Missing Attribute Values: None\n", " :Creator: E. 
Alpaydin (alpaydin '@' boun.edu.tr)\n", " :Date: July; 1998\n", "\n", "This is a copy of the test set of the UCI ML hand-written digits datasets\n", "https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n", "\n", "The data set contains images of hand-written digits: 10 classes where\n", "each class refers to a digit.\n", "\n", "Preprocessing programs made available by NIST were used to extract\n", "normalized bitmaps of handwritten digits from a preprinted form. From a\n", "total of 43 people, 30 contributed to the training set and different 13\n", "to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n", "4x4 and the number of on pixels are counted in each block. This generates\n", "an input matrix of 8x8 where each element is an integer in the range\n", "0..16. This reduces dimensionality and gives invariance to small\n", "distortions.\n", "\n", "For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\n", "T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\n", "L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n", "1994.\n", "\n", ".. topic:: References\n", "\n", " - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n", " Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n", " Graduate Studies in Science and Engineering, Bogazici University.\n", " - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n", " - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n", " Linear dimensionalityreduction using relevance weighted LDA. School of\n", " Electrical and Electronic Engineering Nanyang Technological University.\n", " 2005.\n", " - Claudio Gentile. A New Approximate Maximal Margin Classification\n", " Algorithm. NIPS. 
2000.\n" ] } ], "source": [ "# Let's take a look at the description.\n", "print(digits.DESCR)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The actual data is stored in a numpy array.\n", "type(digits.data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 5., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 10., 0., 0.],\n", " [ 0., 0., 0., ..., 16., 9., 0.],\n", " ...,\n", " [ 0., 0., 1., ..., 6., 0., 0.],\n", " [ 0., 0., 2., ..., 12., 0., 0.],\n", " [ 0., 0., 10., ..., 12., 1., 0.]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's take a look at it.\n", "digits.data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1797, 64)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at the shape of the matrix: it is \n", "# a two-dimensional matrix with 1797 rows and 64 columns.\n", "# There are 1797 images and 64 features (8x8 grid cells).\n", "digits.data.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The target attribute is also a numpy array ...\n", "type(digits.target)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1797,)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ... 
but with only one dimension.\n", "digits.target.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, ..., 8, 9, 8])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each value corresponds to the written digit.\n", "digits.target" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The Bunch object also has the attribute \"target_names\".\n", "# Normally, each number in \"target\" is assigned a name here.\n", "# Since the targets really are the digits 0 - 9, this is not\n", "# really necessary in this case.\n", "digits.target_names" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1797" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This dataset additionally has an attribute \"images\".\n", "# It contains the intensity values of each written digit as an 8x8 matrix.\n", "len(digits.images)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0., 0., 5., 13., 9., 1., 0., 0.],\n", " [ 0., 0., 13., 15., 10., 15., 5., 0.],\n", " [ 0., 3., 15., 2., 0., 11., 8., 0.],\n", " [ 0., 4., 12., 0., 0., 8., 8., 0.],\n", " [ 0., 5., 8., 0., 0., 9., 8., 0.],\n", " [ 0., 4., 11., 0., 1., 12., 7., 0.],\n", " [ 0., 2., 14., 5., 10., 12., 0., 0.],\n", " [ 0., 0., 6., 13., 10., 0., 0., 0.]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at the first image, for example ...\n", "digits.images[0]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"array([[ 0., 0., 11., 12., 0., 0., 0., 0.],\n", " [ 0., 2., 16., 16., 16., 13., 0., 0.],\n", " [ 0., 3., 16., 12., 10., 14., 0., 0.],\n", " [ 0., 1., 16., 1., 12., 15., 0., 0.],\n", " [ 0., 0., 13., 16., 9., 15., 2., 0.],\n", " [ 0., 0., 0., 3., 0., 9., 11., 0.],\n", " [ 0., 0., 0., 0., 9., 15., 4., 0.],\n", " [ 0., 0., 9., 12., 13., 3., 0., 0.]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ... or the tenth image\n", "digits.images[9]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# We can also display the color intensities stored in this form\n", "# with matplotlib. Here, for example, for the\n", "# first 30 images (to show more, pass more than 3 rows\n", "# to subplots).\n", "import matplotlib.pyplot as plt\n", "fig, axes = plt.subplots(3, 10, figsize=(10, 5))\n", "for ax, img in zip(axes.ravel(), digits.images):\n", "    ax.imshow(img, cmap=plt.cm.gray_r)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# To train a classifier for a classification task\n", "# and later assess its quality, the dataset \n", "# (more precisely, the attributes \"data\" and \"target\") is split into\n", "# a training set (75%) and a test set (25%). The convention\n", "# is to use an uppercase X for the data matrix variable and a lowercase y\n", "# for the target vector.\n", "\n", "# Note - some of the following steps \n", "# depend on particular random states. 
To pin these\n", "# down and thus make the analysis reproducible,\n", "# you can use the random_state parameter and set it to\n", "# a number.\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    digits['data'], digits['target'], random_state=1)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1347, 64)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The shape of the two-dimensional training data matrix\n", "X_train.shape" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(450, 64)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The shape of the two-dimensional test data matrix\n", "X_test.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1347,)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The length of the training vector equals the number of \n", "# rows of the training matrix.\n", "y_train.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(450,)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The length of the test vector equals the number of \n", "# rows of the test matrix.\n", "y_test.shape" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# We will first work with a k-nearest-neighbors classifier\n", "# and load the class for it ...\n", "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# ... and create an object of it. 
Here we can specify the number of \n", "# neighbors to consider:\n", "knn_clf = KNeighborsClassifier(n_neighbors=1)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", " metric_params=None, n_jobs=None, n_neighbors=1, p=2,\n", " weights='uniform')" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now we train the classifier on the training data.\n", "# In scikit-learn, the \"fit\" method is used for this\n", "# regardless of the classifier.\n", "knn_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 5, 0, 7, 1, 0, 6, 1, 5, 4, 9, 2, 7, 8, 4, 6, 9, 3, 7, 4, 7, 1,\n", " 8, 6, 0, 9, 6, 1, 3, 7, 5, 9, 8, 3, 2, 8, 8, 1, 1, 0, 7, 9, 0, 0,\n", " 8, 7, 2, 7, 4, 3, 4, 3, 4, 0, 4, 7, 0, 5, 5, 5, 2, 1, 7, 0, 5, 1,\n", " 8, 3, 3, 4, 0, 3, 7, 4, 3, 4, 2, 9, 7, 3, 2, 5, 3, 4, 1, 5, 5, 2,\n", " 9, 2, 2, 2, 2, 7, 0, 8, 1, 7, 4, 2, 3, 8, 2, 3, 3, 0, 2, 9, 3, 2,\n", " 3, 2, 8, 1, 1, 9, 1, 2, 0, 4, 8, 5, 4, 4, 7, 6, 7, 6, 6, 1, 7, 5,\n", " 6, 3, 8, 3, 7, 1, 8, 5, 3, 4, 7, 8, 5, 0, 6, 0, 6, 3, 7, 6, 5, 6,\n", " 2, 2, 2, 3, 0, 7, 6, 5, 6, 4, 1, 0, 6, 0, 6, 4, 0, 9, 3, 8, 1, 2,\n", " 3, 1, 9, 0, 7, 6, 2, 9, 3, 5, 3, 4, 6, 3, 3, 7, 4, 9, 2, 7, 6, 1,\n", " 6, 8, 4, 0, 3, 1, 0, 9, 9, 9, 0, 1, 8, 6, 8, 0, 9, 5, 9, 8, 2, 3,\n", " 5, 3, 0, 8, 7, 4, 0, 3, 3, 3, 6, 3, 3, 2, 9, 1, 6, 9, 0, 4, 2, 2,\n", " 7, 9, 1, 6, 7, 6, 3, 9, 1, 9, 3, 4, 0, 6, 4, 8, 5, 3, 6, 3, 1, 4,\n", " 0, 4, 4, 8, 7, 9, 1, 5, 2, 7, 0, 9, 0, 4, 4, 0, 1, 0, 6, 4, 2, 8,\n", " 5, 0, 2, 6, 0, 1, 8, 2, 0, 9, 5, 6, 7, 0, 5, 0, 9, 1, 4, 7, 1, 7,\n", " 0, 6, 6, 8, 0, 2, 2, 6, 9, 9, 7, 5, 1, 7, 6, 4, 6, 1, 9, 4, 7, 1,\n", " 3, 7, 8, 1, 6, 9, 8, 3, 2, 4, 8, 7, 5, 5, 6, 9, 9, 8, 5, 0, 0, 4,\n", " 9, 3, 0, 4, 9, 4, 2, 5, 4, 9, 6, 4, 2, 6, 
0, 0, 5, 6, 7, 1, 9, 2,\n", " 5, 1, 5, 9, 8, 7, 7, 0, 6, 9, 3, 1, 9, 3, 9, 8, 7, 0, 2, 3, 9, 9,\n", " 2, 8, 1, 9, 3, 3, 0, 0, 7, 3, 8, 7, 9, 9, 7, 1, 0, 4, 5, 4, 1, 7,\n", " 3, 6, 5, 4, 9, 0, 5, 9, 1, 4, 5, 0, 4, 3, 4, 2, 3, 9, 0, 8, 7, 8,\n", " 6, 9, 4, 5, 7, 8, 3, 7, 8, 3])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Congratulations - we have built and trained our very first \n", "# classifier model. \n", "# We can now use it to classify new data (i.e. vectors of length 64 that\n", "# represent the 8x8 images) - in this\n", "# case, to predict which digit was written.\n", "#\n", "# We still have our test data available and can use the \"predict\" method\n", "# of the trained classifier to obtain the predictions.\n", "knn_clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9888888888888889" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Since we also know for the test set which digits should actually \n", "# come out, we can use the classifier's \"score\" method. \n", "# It performs the prediction and compares it with the \n", "# actual target values. 
In the end we get the mean accuracy, a value between \n", "# 0 (bad) and 1 (good).\n", "knn_clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9911111111111112" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now we run the same procedure (creating, training and testing)\n", "# for this classifier with 3 neighbors as the parameter.\n", "knn_clf_3 = KNeighborsClassifier(n_neighbors=3)\n", "knn_clf_3.fit(X_train, y_train)\n", "knn_clf_3.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# The nice thing about scikit-learn is that all classifiers \n", "# have the same methods. That is, other classifiers\n", "# also use fit, predict and score.\n", "#\n", "# Let's run a classification with a random forest classifier:\n", "from sklearn.ensemble import RandomForestClassifier\n", "random_forest_cfl = RandomForestClassifier(random_state=1)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", " criterion='gini', max_depth=None, max_features='auto',\n", " max_leaf_nodes=None, max_samples=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " n_jobs=None, oob_score=False, random_state=1, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_cfl.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.98" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_cfl.score(X_test, y_test)" ] 
}, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# We now do the same for a classification with an \n", "# artificial neural network (multi-layer perceptron). By default, \n", "# the network has one hidden layer with 100 nodes.\n", "from sklearn.neural_network import MLPClassifier" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,\n", " beta_2=0.999, early_stopping=False, epsilon=1e-08,\n", " hidden_layer_sizes=(100,), learning_rate='constant',\n", " learning_rate_init=0.001, max_fun=15000, max_iter=200,\n", " momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,\n", " power_t=0.5, random_state=1, shuffle=True, solver='adam',\n", " tol=0.0001, validation_fraction=0.1, verbose=False,\n", " warm_start=False)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mlpc = MLPClassifier(random_state=1)\n", "mlpc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9755555555555555" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mlpc.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9844444444444445" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can set the number of hidden layers and the number of nodes in them\n", "# as parameters (here 3 layers with 200, 100 and 20 nodes).\n", "# The whole thing can be written more compactly by\n", "# chaining the method calls directly.\n", "MLPClassifier(random_state=1, hidden_layer_sizes=(200, 100, 20)).fit(\n", "    X_train, y_train).score(X_test, y_test)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }