{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis Minor, Group ИАД-2\n", "## Introduction: Python Refresher, 18/01/2017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "

1  Data Analysis Minor, Group ИАД-2
1.1  Introduction: Python Refresher, 18/01/2017
1.2  How do I install Jupyter Notebook at home?!
1.3  Can I write in Python 3?
1.4  Why do I need this course?
1.5  Will I become a Data Scientist?!
1.6  Reviewing pandas
1.6.1  US births
1.6.2  Wine quality
1.7  Reviewing NumPy
1.7.1  Exercises with vectors and matrices
1.7.2  Linear regression
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This [Jupyter Notebook](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) contains supporting instructions for the seminar and homework assignments. Throughout the course we will work mostly in \"notebooks\" like this one, though we may occasionally switch to other environments or tools.\n", "\n", "(I use Python 2.x.x, not 3.x.x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How do I install Jupyter Notebook at home?!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest and most reliable way is to use the ready-made [Anaconda](https://store.continuum.io/cshop/anaconda/) distribution, which includes practically all the modules and utilities we will need: IPython, NumPy, SciPy, Matplotlib and **Scikit-Learn**. Just follow the installer's instructions for your OS.\n", "\n", "I recommend reading this [post](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/): it covers various interesting notebook features that you may not have known about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Can I write in Python 3?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go right ahead. In our case the difference will be minimal, so code can easily be ported from one version to the other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why do I need this course?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This course should give you:\n", "* The core knowledge and skills used when working with data\n", "* An understanding of the basic methods of applied statistics and (oh my!) machine learning\n", "* The ability to formulate a problem and choose a method for solving it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Will I become a Data Scientist?!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The foundation will be laid. Beyond that, you need practice and your own motivation.\n", "\n", "What a DS should ideally be able to do:\n", "1. Data Exploration and Preparation\n", "2. Data Representation and Transformation\n", " 1. Modern Databases\n", " 2. Mathematical Representations\n", "3. Computing with Data\n", "4. Data Visualization and Presentation\n", "5. Data Modeling\n", " 1. Generative Modelling (Applied Statistics)\n", " 2. Predictive Modelling (ML)\n", "6. Domain Expertise (optional)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewing pandas" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (16,8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### US births" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download two datasets with information on births in the US: [Dataset 1](https://www.dropbox.com/s/4v743y3e25lz0an/US_births_1994-2003_CDC_NCHS.csv?dl=0), [Dataset 2](https://www.dropbox.com/s/3aoulbiuomamay6/US_births_2000-2014_SSA.csv?dl=0)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "US_births_1994-2003_CDC_NCHS.csv \u001b[31mseminar3.ipynb\u001b[m\u001b[m\r\n", "US_births_2000-2014_SSA.csv \u001b[31msetting_envrmt_old.ipynb\u001b[m\u001b[m\r\n", "Untitled.ipynb \u001b[31msol\u001b[m\u001b[m\r\n", "\u001b[31mda_with_matrices.ipynb\u001b[m\u001b[m \u001b[31msol~\u001b[m\u001b[m\r\n", "\u001b[31mhw1_old.ipynb\u001b[m\u001b[m \u001b[31mtemp.txt\u001b[m\u001b[m\r\n", "\u001b[31mintro.ipynb\u001b[m\u001b[m \u001b[31mtutorial_dataset.csv\u001b[m\u001b[m\r\n", 
"\u001b[31mseminar1_lns_old.ipynb\u001b[m\u001b[m \u001b[31mtutorial_dataset_2.csv\u001b[m\u001b[m\r\n", "\u001b[31mseminar2.ipynb\u001b[m\u001b[m\r\n" ] } ], "source": [ "!ls " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = pd.read_csv('US_births_1994-2003_CDC_NCHS.csv')\n", "df2 = pd.read_csv('US_births_2000-2014_SSA.csv')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
      year  month  date_of_month  day_of_week  births
3647  2003     12             27            6    8646
3648  2003     12             28            7    7645
3649  2003     12             29            1   12823
3650  2003     12             30            2   14438
3651  2003     12             31            3   12374
\n", "
" ], "text/plain": [ " year month date_of_month day_of_week births\n", "3647 2003 12 27 6 8646\n", "3648 2003 12 28 7 7645\n", "3649 2003 12 29 1 12823\n", "3650 2003 12 30 2 14438\n", "3651 2003 12 31 3 12374" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.tail()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   year  month  date_of_month  day_of_week  births
0  2000      1              1            6    9083
1  2000      1              2            7    8006
2  2000      1              3            1   11363
3  2000      1              4            2   13032
4  2000      1              5            3   12558
\n", "
" ], "text/plain": [ " year month date_of_month day_of_week births\n", "0 2000 1 1 6 9083\n", "1 2000 1 2 7 8006\n", "2 2000 1 3 1 11363\n", "3 2000 1 4 2 13032\n", "4 2000 1 5 3 12558" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do they differ? Combine the two tables in a way that preserves the integrity of the information." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 1) Check that the data for the overlapping period are almost \n", "# identical\n", "# 2) Combine the tables so that they cover the period \n", "# 1994-2014" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df1 = df1.rename(columns={'date_of_month': 'day'})" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   year  month  day  day_of_week  births        date
0  1994      1    1            6    8096  1994-01-01
1  1994      1    2            7    7772  1994-01-02
2  1994      1    3            1   10142  1994-01-03
3  1994      1    4            2   11248  1994-01-04
4  1994      1    5            3   11053  1994-01-05
\n", "
" ], "text/plain": [ " year month day day_of_week births date\n", "0 1994 1 1 6 8096 1994-01-01\n", "1 1994 1 2 7 7772 1994-01-02\n", "2 1994 1 3 1 10142 1994-01-03\n", "3 1994 1 4 2 11248 1994-01-04\n", "4 1994 1 5 3 11053 1994-01-05" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.loc[:, 'date'] = \\\n", "pd.to_datetime(df1.loc[:, ['year', 'month', 'day']])\n", "df1.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   year  month  day  day_of_week  births        date
0  2000      1    1            6    9083  2000-01-01
1  2000      1    2            7    8006  2000-01-02
2  2000      1    3            1   11363  2000-01-03
3  2000      1    4            2   13032  2000-01-04
4  2000      1    5            3   12558  2000-01-05
\n", "
" ], "text/plain": [ " year month day day_of_week births date\n", "0 2000 1 1 6 9083 2000-01-01\n", "1 2000 1 2 7 8006 2000-01-02\n", "2 2000 1 3 1 11363 2000-01-03\n", "3 2000 1 4 2 13032 2000-01-04\n", "4 2000 1 5 3 12558 2000-01-05" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df2.rename(columns={'date_of_month': 'day'})\n", "df2.loc[:, 'date'] = \\\n", "pd.to_datetime(df2.loc[:, ['year', 'month', 'day']])\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = df1.set_index('date')\n", "df2 = df2.set_index('date')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
            year  month  day  day_of_week  births
date
1994-01-01  1994      1    1            6    8096
1994-01-02  1994      1    2            7    7772
1994-01-03  1994      1    3            1   10142
1994-01-04  1994      1    4            2   11248
1994-01-05  1994      1    5            3   11053
\n", "
" ], "text/plain": [ " year month day day_of_week births\n", "date \n", "1994-01-01 1994 1 1 6 8096\n", "1994-01-02 1994 1 2 7 7772\n", "1994-01-03 1994 1 3 1 10142\n", "1994-01-04 1994 1 4 2 11248\n", "1994-01-05 1994 1 5 3 11053" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
            year  month  day  day_of_week  births
date
2000-01-01  2000      1    1            6    9083
2000-01-02  2000      1    2            7    8006
2000-01-03  2000      1    3            1   11363
2000-01-04  2000      1    4            2   13032
2000-01-05  2000      1    5            3   12558
\n", "
" ], "text/plain": [ " year month day day_of_week births\n", "date \n", "2000-01-01 2000 1 1 6 9083\n", "2000-01-02 2000 1 2 7 8006\n", "2000-01-03 2000 1 3 1 11363\n", "2000-01-04 2000 1 4 2 13032\n", "2000-01-05 2000 1 5 3 12558" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "result = df1.join(df2, how='inner', \n", " lsuffix='_df1', rsuffix='_df2')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
            year_df1  month_df1  day_df1  day_of_week_df1  births_df1  year_df2  month_df2  day_df2  day_of_week_df2  births_df2
date
2003-12-27      2003         12       27                6        8646      2003         12       27                6        8785
2003-12-28      2003         12       28                7        7645      2003         12       28                7        7763
2003-12-29      2003         12       29                1       12823      2003         12       29                1       13125
2003-12-30      2003         12       30                2       14438      2003         12       30                2       14700
2003-12-31      2003         12       31                3       12374      2003         12       31                3       12540
\n", "
" ], "text/plain": [ " year_df1 month_df1 day_df1 day_of_week_df1 births_df1 \\\n", "date \n", "2003-12-27 2003 12 27 6 8646 \n", "2003-12-28 2003 12 28 7 7645 \n", "2003-12-29 2003 12 29 1 12823 \n", "2003-12-30 2003 12 30 2 14438 \n", "2003-12-31 2003 12 31 3 12374 \n", "\n", " year_df2 month_df2 day_df2 day_of_week_df2 births_df2 \n", "date \n", "2003-12-27 2003 12 27 6 8785 \n", "2003-12-28 2003 12 28 7 7763 \n", "2003-12-29 2003 12 29 1 13125 \n", "2003-12-30 2003 12 30 2 14700 \n", "2003-12-31 2003 12 31 3 12540 " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.tail()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-223.48459958932239" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compare the birth counts from the two sources\n", "result.loc[:, ['births_df1', 'births_df2']]\n", "(result.births_df1 - result.births_df2).mean()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 1461.000000\n", "mean -223.484600\n", "std 68.774771\n", "min -438.000000\n", "25% -271.000000\n", "50% -231.000000\n", "75% -170.000000\n", "max -60.000000\n", "dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(result.births_df1 - result.births_df2).describe()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df3 = pd.concat([df1, df2.loc['2004-01-01':, :]])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2007-09-10 1\n", "2000-09-26 1\n", "2000-09-28 1\n", "1997-12-03 1\n", "1996-08-22 1\n", "2005-09-22 1\n", "2009-11-14 1\n", "2002-06-28 1\n", "2003-01-25 1\n", "1995-10-11 1\n", "1999-03-15 1\n", "2001-06-26 1\n", "2004-11-28 1\n", "2008-05-02 
1\n", "1998-07-02 1\n", "2005-12-17 1\n", "2003-06-29 1\n", "1997-05-14 1\n", "1998-12-04 1\n", "2009-12-01 1\n", "2013-11-21 1\n", "2007-08-20 1\n", "2000-01-08 1\n", "2014-07-28 1\n", "1995-06-15 1\n", "2014-12-25 1\n", "2009-04-28 1\n", "1995-05-06 1\n", "1996-07-24 1\n", "2014-06-26 1\n", " ..\n", "1998-01-03 1\n", "1994-08-01 1\n", "2000-10-01 1\n", "2005-07-24 1\n", "2007-11-01 1\n", "1997-08-30 1\n", "1998-09-13 1\n", "1996-10-30 1\n", "2003-11-06 1\n", "1994-07-19 1\n", "1995-01-06 1\n", "1998-06-10 1\n", "2001-11-12 1\n", "2004-02-24 1\n", "2007-07-29 1\n", "2008-09-18 1\n", "2012-02-21 1\n", "1996-06-05 1\n", "1994-02-18 1\n", "2010-09-12 1\n", "2013-08-15 1\n", "2006-10-18 1\n", "2001-05-25 1\n", "2011-05-13 1\n", "2014-04-27 1\n", "2012-10-24 1\n", "2003-08-30 1\n", "2009-11-21 1\n", "2008-03-31 1\n", "2006-03-22 1\n", "Name: date, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that no date is repeated\n", "df3.index.value_counts().head()\n", "# The dates are unique!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the number of children born on the 6th, 13th and 20th of each month, broken down by day of the week.\n", "\n", "Does Friday the 13th stand out in any way?" 
] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Build a table of mean births by day of week for the 6th\n", "idx = df3.loc[:, 'day'] == 6\n", "b6 = df3.loc[idx, :].groupby('day_of_week').births.mean()\n", "\n", "# And the same for the other two dates\n", "idx = df3.loc[:, 'day'] == 13\n", "b13 = df3.loc[idx, :].groupby('day_of_week').births.mean()\n", "\n", "idx = df3.loc[:, 'day'] == 20\n", "b20 = df3.loc[idx, :].groupby('day_of_week').births.mean()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "day_of_week\n", "1 -597.400000\n", "2 -448.805556\n", "3 -79.028571\n", "4 -167.131579\n", "5 -237.342857\n", "6 -71.361111\n", "7 -19.189189\n", "Name: births, dtype: float64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b6-b20" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "day_of_week\n", "1 520.142857\n", "2 491.722222\n", "3 403.742857\n", "4 507.052632\n", "5 911.342857\n", "6 93.805556\n", "7 81.891892\n", "Name: births, dtype: float64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b20-b13" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wine quality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) with information about the characteristics of wine and its quality." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* What does an object in this dataset represent? How many objects are there?\n", "* Which features describe the objects? How many features are there?\n", "* Which feature is the target?\n", "* What are the value ranges of the features?\n", "* Are there any missing values?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which features influence the target variable the most?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a new column `quality_cat` that takes the value `\"good\"` if `quality > 5` and `\"bad\"` otherwise.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot histograms of the `alcohol` feature for the groups with `quality_cat == \"good\"` and `quality_cat == \"bad\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you come up with a rule for classifying wine as good or bad based on the plot above? Let this be our first model :)\n", "\n", "Write a function `brute_clf_train()` that iterates over threshold values of the `alcohol` feature and finds the \"optimal\" one (by the way, what does optimal mean here?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function `brute_clf_predict()` that, given the value of the `alcohol` feature and the threshold found above, reports the quality of the wine.\n", "\n", "It should also print the number of \"errors\" on the current dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check how well our model generalizes to other data.\n", "\n", "* Load another [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)\n", "* Apply the same manipulations to the features\n", "* Use our simple model to predict quality on the new data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reviewing NumPy" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Exercises with vectors and matrices" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### Linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download [file 1](https://www.dropbox.com/s/kg9px9v3xfysak9/tutorial_dataset.csv?dl=0) and [file 2](https://www.dropbox.com/s/f87gm612o144emx/tutorial_dataset_2.csv?dl=0) into the folder containing this notebook. Use the `loadtxt` function from the `numpy` module to load the tabular data from one of the files. Set `y = D[:, 0]` and `X = D[:, 1:]`.\n", "\n", "We will now use one magic formula to build a linear regression model. Where this formula comes from is something we will learn in later classes.\n", "\n", "In matrix form, the linear regression model looks like this: $\\hat{y} = X\\hat{\\beta}$, where\n", "\n", "$$ \\hat{\\beta} = (X^T X)^{-1} X^T y $$\n", "The residuals of the model are computed as\n", "$$ \\text{res} = y - \\hat{y} $$\n", "\n", "So, once again:\n", "\n", "1. Load the data\n", "2. Estimate the weights $\\beta$ using the formula\n", "3. Make a plot with the residuals on the Y axis and $\\hat{y}$ on the X axis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# load data (file 1 is saved as tutorial_dataset.csv)\n", "D = np.loadtxt('tutorial_dataset.csv', \n", " skiprows=1, \n", " delimiter=',')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Write your code here\n", "#\n", "#\n", "#" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": false, "sideBar": true, "threshold": 6, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true }, "toc_position": { "height": "924px", "left": 
"0px", "right": "1622.67px", "top": "108px", "width": "212px" } }, "nbformat": 4, "nbformat_minor": 0 }