{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem 12\n", "\n", "Predict the grape variety a wine was made from, using the [results of chemical analyses](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data) ([description](http://archive.ics.uci.edu/ml/datasets/Wine) of the data), with [KNN](http://www.machinelearning.ru/wiki/index.php?title=Метод_k_ближайших_соседей_%28пример%29), the k-nearest-neighbors method, under three different metrics. Plot the error as a function of the number of neighbors k." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import requests\n", "import io\n", "import numpy as np\n", "import scipy\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the data" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'\n", "data = requests.get(url)\n", "# Stop early if the download failed\n", "assert data.status_code == 200" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting the dataset into training and test sets" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# Column 0 is the class label; the remaining 13 columns are the features\n", "dataset = np.genfromtxt(\n", "    io.StringIO(data.text),\n", "    delimiter=',',\n", "    dtype=[('class', np.int8), ('features', np.float64, (13,))]\n", ")\n", "\n", "x = [item[1] for item in dataset]\n", "y = [item[0] for item in dataset]\n", "# Hold out 33% of the samples for testing; the seed is fixed for reproducibility\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training the model" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "neighbors = range(1, 21)\n", "means = {}\n", 
"\n", "for metric in ('euclidean', 'manhattan', 'chebyshev'):\n", " current_means = []\n", " for k in neighbors:\n", " classifier = KNeighborsClassifier(n_neighbors=k, metric=metric)\n", " classifier.fit(x_train, y_train)\n", " prediction = classifier.predict(x_test)\n", " \n", " current_means.append(scipy.mean(prediction == y_test))\n", " \n", " means[metric] = current_means\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Построение графика" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for metric, current_means in means.items():\n", " errors = [1 - mean for mean in current_means]\n", " plt.plot(neighbors, errors, label=metric)\n", " \n", "plt.xticks(neighbors[1::2])\n", "plt.xlabel('Count of neighbors')\n", "plt.ylabel('Error')\n", "plt.legend(loc='lower right')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "В качестве метрик рассматривались:\n", "- Евклидова норма(euclidean)\n", "- Сумма модулей(manhattan)\n", "- Максимум модулей(chebyshev)\n", "\n", "Количество ближайших соседей neighbors варьировалось от 1 до 20\n", "\n", "### Вывод\n", "Метрика manhattan достаточно хорошо работает при neighbors >= 7" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }