{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Майнор по Анализу Данных, Группа ИАД-2\n",
"## Введение, вспоминаем Python 18/01/2017"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Этот [Jupyter Notebook](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) содержит вспомогательные указания для выполнения семинарских и домашних заданий. В течение курса мы будем преимущественно работать в подобных \"тетрадках\", но может быть иногда будем переключаться на другие среды\\средства.\n",
"\n",
"(Я использую Python версии 2.x.x, а не 3.x.x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Как установить Jupyter Notebook у себя дома?!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Самый простой и надежный способ - воспользоваться готовым дистрибутивом [Anaconda](https://store.continuum.io/cshop/anaconda/), включающий в себе практически все необходимые модули и утилиты, которые нам понадобятся - IPython, NumPy, SciPy, Matplotlib и **Scikit-Learn**. Просто следуйте указаниям установщика для вашей ОС.\n",
"\n",
"Рекомендую ознакомиться с этим [постом](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) - там приводятся различные интересные возможности \"тетрадок\" о которых вы возможно не знали."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Можно ли писать на Python 3?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Пишите, ради бога. В нашем случае разница будет минимальна, поэтому код можно легко перевести из одной версии в другую."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Зачем мне нужен этот курс?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Данный курс должен дать вам:\n",
"* Основные знания и навыки используемые при работе с данными\n",
"* Понимание базовых методов прикладной статистики и (о боже!) машинного обучения\n",
"* Умение поставить задачу и выбрать метод для ее решения"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Я стану Data Scientist'ом?!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Фундамент будет заложен. А дальше нужна практика и ваша собственная мотивация.\n",
"\n",
"Что желательно уметь делать, будучи DS:\n",
"1. Data Exploration and Preparation\n",
"2. Data Representation and Transformation\n",
" 1. Modern Databases\n",
" 2. Mathematical Representations\n",
"3. Computing with Data\n",
"4. Data Visualization and Presentation\n",
"5. Data Modeling\n",
" 1. Generative Modelling (Applied Statistics)\n",
" 2. Predictive Modelling (ML)\n",
"6. Domain Expertise (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Вспоминаем pandas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline\n",
"\n",
"plt.style.use('ggplot')\n",
"plt.rcParams['figure.figsize'] = (16,8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Рождаемость в США"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Загрузите два набора данных с информацией о рождаемости в США: [Набор 1](https://www.dropbox.com/s/4v743y3e25lz0an/US_births_1994-2003_CDC_NCHS.csv?dl=0), [Набор 2](https://www.dropbox.com/s/3aoulbiuomamay6/US_births_2000-2014_SSA.csv?dl=0)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"US_births_1994-2003_CDC_NCHS.csv \u001b[31mseminar3.ipynb\u001b[m\u001b[m\r\n",
"US_births_2000-2014_SSA.csv \u001b[31msetting_envrmt_old.ipynb\u001b[m\u001b[m\r\n",
"Untitled.ipynb \u001b[31msol\u001b[m\u001b[m\r\n",
"\u001b[31mda_with_matrices.ipynb\u001b[m\u001b[m \u001b[31msol~\u001b[m\u001b[m\r\n",
"\u001b[31mhw1_old.ipynb\u001b[m\u001b[m \u001b[31mtemp.txt\u001b[m\u001b[m\r\n",
"\u001b[31mintro.ipynb\u001b[m\u001b[m \u001b[31mtutorial_dataset.csv\u001b[m\u001b[m\r\n",
"\u001b[31mseminar1_lns_old.ipynb\u001b[m\u001b[m \u001b[31mtutorial_dataset_2.csv\u001b[m\u001b[m\r\n",
"\u001b[31mseminar2.ipynb\u001b[m\u001b[m\r\n"
]
}
],
"source": [
"!ls "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = pd.read_csv('US_births_1994-2003_CDC_NCHS.csv')\n",
"df2 = pd.read_csv('US_births_2000-2014_SSA.csv')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" date_of_month | \n",
" day_of_week | \n",
" births | \n",
"
\n",
" \n",
" \n",
" \n",
" 3647 | \n",
" 2003 | \n",
" 12 | \n",
" 27 | \n",
" 6 | \n",
" 8646 | \n",
"
\n",
" \n",
" 3648 | \n",
" 2003 | \n",
" 12 | \n",
" 28 | \n",
" 7 | \n",
" 7645 | \n",
"
\n",
" \n",
" 3649 | \n",
" 2003 | \n",
" 12 | \n",
" 29 | \n",
" 1 | \n",
" 12823 | \n",
"
\n",
" \n",
" 3650 | \n",
" 2003 | \n",
" 12 | \n",
" 30 | \n",
" 2 | \n",
" 14438 | \n",
"
\n",
" \n",
" 3651 | \n",
" 2003 | \n",
" 12 | \n",
" 31 | \n",
" 3 | \n",
" 12374 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month date_of_month day_of_week births\n",
"3647 2003 12 27 6 8646\n",
"3648 2003 12 28 7 7645\n",
"3649 2003 12 29 1 12823\n",
"3650 2003 12 30 2 14438\n",
"3651 2003 12 31 3 12374"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.tail()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" date_of_month | \n",
" day_of_week | \n",
" births | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 1 | \n",
" 1 | \n",
" 6 | \n",
" 9083 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 8006 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 11363 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 1 | \n",
" 4 | \n",
" 2 | \n",
" 13032 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 12558 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month date_of_month day_of_week births\n",
"0 2000 1 1 6 9083\n",
"1 2000 1 2 7 8006\n",
"2 2000 1 3 1 11363\n",
"3 2000 1 4 2 13032\n",
"4 2000 1 5 3 12558"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Чем они отличаются? Соедините 2 таблицы, так, чтобы соблюсти целостность информации."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 1) Проверьте, что данные за общий период почти \n",
"# не отличаются\n",
"# 2) Объедините таблицы, чтобы они охватывали период \n",
"# 1994-2014"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df1 = df1.rename(columns={'date_of_month': 'day'})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" day | \n",
" day_of_week | \n",
" births | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1994 | \n",
" 1 | \n",
" 1 | \n",
" 6 | \n",
" 8096 | \n",
" 1994-01-01 | \n",
"
\n",
" \n",
" 1 | \n",
" 1994 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 7772 | \n",
" 1994-01-02 | \n",
"
\n",
" \n",
" 2 | \n",
" 1994 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 10142 | \n",
" 1994-01-03 | \n",
"
\n",
" \n",
" 3 | \n",
" 1994 | \n",
" 1 | \n",
" 4 | \n",
" 2 | \n",
" 11248 | \n",
" 1994-01-04 | \n",
"
\n",
" \n",
" 4 | \n",
" 1994 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 11053 | \n",
" 1994-01-05 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month day day_of_week births date\n",
"0 1994 1 1 6 8096 1994-01-01\n",
"1 1994 1 2 7 7772 1994-01-02\n",
"2 1994 1 3 1 10142 1994-01-03\n",
"3 1994 1 4 2 11248 1994-01-04\n",
"4 1994 1 5 3 11053 1994-01-05"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.loc[:, 'date'] = \\\n",
"pd.to_datetime(df1.loc[:, ['year', 'month', 'day']])\n",
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" day | \n",
" day_of_week | \n",
" births | \n",
" date | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2000 | \n",
" 1 | \n",
" 1 | \n",
" 6 | \n",
" 9083 | \n",
" 2000-01-01 | \n",
"
\n",
" \n",
" 1 | \n",
" 2000 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 8006 | \n",
" 2000-01-02 | \n",
"
\n",
" \n",
" 2 | \n",
" 2000 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 11363 | \n",
" 2000-01-03 | \n",
"
\n",
" \n",
" 3 | \n",
" 2000 | \n",
" 1 | \n",
" 4 | \n",
" 2 | \n",
" 13032 | \n",
" 2000-01-04 | \n",
"
\n",
" \n",
" 4 | \n",
" 2000 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 12558 | \n",
" 2000-01-05 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month day day_of_week births date\n",
"0 2000 1 1 6 9083 2000-01-01\n",
"1 2000 1 2 7 8006 2000-01-02\n",
"2 2000 1 3 1 11363 2000-01-03\n",
"3 2000 1 4 2 13032 2000-01-04\n",
"4 2000 1 5 3 12558 2000-01-05"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df2.rename(columns={'date_of_month': 'day'})\n",
"df2.loc[:, 'date'] = \\\n",
"pd.to_datetime(df2.loc[:, ['year', 'month', 'day']])\n",
"df2.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df1 = df1.set_index('date')\n",
"df2 = df2.set_index('date')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" day | \n",
" day_of_week | \n",
" births | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1994-01-01 | \n",
" 1994 | \n",
" 1 | \n",
" 1 | \n",
" 6 | \n",
" 8096 | \n",
"
\n",
" \n",
" 1994-01-02 | \n",
" 1994 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 7772 | \n",
"
\n",
" \n",
" 1994-01-03 | \n",
" 1994 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 10142 | \n",
"
\n",
" \n",
" 1994-01-04 | \n",
" 1994 | \n",
" 1 | \n",
" 4 | \n",
" 2 | \n",
" 11248 | \n",
"
\n",
" \n",
" 1994-01-05 | \n",
" 1994 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 11053 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month day day_of_week births\n",
"date \n",
"1994-01-01 1994 1 1 6 8096\n",
"1994-01-02 1994 1 2 7 7772\n",
"1994-01-03 1994 1 3 1 10142\n",
"1994-01-04 1994 1 4 2 11248\n",
"1994-01-05 1994 1 5 3 11053"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year | \n",
" month | \n",
" day | \n",
" day_of_week | \n",
" births | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2000-01-01 | \n",
" 2000 | \n",
" 1 | \n",
" 1 | \n",
" 6 | \n",
" 9083 | \n",
"
\n",
" \n",
" 2000-01-02 | \n",
" 2000 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 8006 | \n",
"
\n",
" \n",
" 2000-01-03 | \n",
" 2000 | \n",
" 1 | \n",
" 3 | \n",
" 1 | \n",
" 11363 | \n",
"
\n",
" \n",
" 2000-01-04 | \n",
" 2000 | \n",
" 1 | \n",
" 4 | \n",
" 2 | \n",
" 13032 | \n",
"
\n",
" \n",
" 2000-01-05 | \n",
" 2000 | \n",
" 1 | \n",
" 5 | \n",
" 3 | \n",
" 12558 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month day day_of_week births\n",
"date \n",
"2000-01-01 2000 1 1 6 9083\n",
"2000-01-02 2000 1 2 7 8006\n",
"2000-01-03 2000 1 3 1 11363\n",
"2000-01-04 2000 1 4 2 13032\n",
"2000-01-05 2000 1 5 3 12558"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.head()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"result = df1.join(df2, how='inner', \n",
" lsuffix='_df1', rsuffix='_df2')"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" year_df1 | \n",
" month_df1 | \n",
" day_df1 | \n",
" day_of_week_df1 | \n",
" births_df1 | \n",
" year_df2 | \n",
" month_df2 | \n",
" day_df2 | \n",
" day_of_week_df2 | \n",
" births_df2 | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2003-12-27 | \n",
" 2003 | \n",
" 12 | \n",
" 27 | \n",
" 6 | \n",
" 8646 | \n",
" 2003 | \n",
" 12 | \n",
" 27 | \n",
" 6 | \n",
" 8785 | \n",
"
\n",
" \n",
" 2003-12-28 | \n",
" 2003 | \n",
" 12 | \n",
" 28 | \n",
" 7 | \n",
" 7645 | \n",
" 2003 | \n",
" 12 | \n",
" 28 | \n",
" 7 | \n",
" 7763 | \n",
"
\n",
" \n",
" 2003-12-29 | \n",
" 2003 | \n",
" 12 | \n",
" 29 | \n",
" 1 | \n",
" 12823 | \n",
" 2003 | \n",
" 12 | \n",
" 29 | \n",
" 1 | \n",
" 13125 | \n",
"
\n",
" \n",
" 2003-12-30 | \n",
" 2003 | \n",
" 12 | \n",
" 30 | \n",
" 2 | \n",
" 14438 | \n",
" 2003 | \n",
" 12 | \n",
" 30 | \n",
" 2 | \n",
" 14700 | \n",
"
\n",
" \n",
" 2003-12-31 | \n",
" 2003 | \n",
" 12 | \n",
" 31 | \n",
" 3 | \n",
" 12374 | \n",
" 2003 | \n",
" 12 | \n",
" 31 | \n",
" 3 | \n",
" 12540 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year_df1 month_df1 day_df1 day_of_week_df1 births_df1 \\\n",
"date \n",
"2003-12-27 2003 12 27 6 8646 \n",
"2003-12-28 2003 12 28 7 7645 \n",
"2003-12-29 2003 12 29 1 12823 \n",
"2003-12-30 2003 12 30 2 14438 \n",
"2003-12-31 2003 12 31 3 12374 \n",
"\n",
" year_df2 month_df2 day_df2 day_of_week_df2 births_df2 \n",
"date \n",
"2003-12-27 2003 12 27 6 8785 \n",
"2003-12-28 2003 12 28 7 7763 \n",
"2003-12-29 2003 12 29 1 13125 \n",
"2003-12-30 2003 12 30 2 14700 \n",
"2003-12-31 2003 12 31 3 12540 "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result.tail()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-223.48459958932239"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Сравним рождаемости\n",
"result.loc[:, ['births_df1', 'births_df2']]\n",
"(result.births_df1 - result.births_df2).mean()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"count 1461.000000\n",
"mean -223.484600\n",
"std 68.774771\n",
"min -438.000000\n",
"25% -271.000000\n",
"50% -231.000000\n",
"75% -170.000000\n",
"max -60.000000\n",
"dtype: float64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(result.births_df1 - result.births_df2).describe()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df3 = df1.append(df2.loc['2004-01-01':, :])"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2007-09-10 1\n",
"2000-09-26 1\n",
"2000-09-28 1\n",
"1997-12-03 1\n",
"1996-08-22 1\n",
"2005-09-22 1\n",
"2009-11-14 1\n",
"2002-06-28 1\n",
"2003-01-25 1\n",
"1995-10-11 1\n",
"1999-03-15 1\n",
"2001-06-26 1\n",
"2004-11-28 1\n",
"2008-05-02 1\n",
"1998-07-02 1\n",
"2005-12-17 1\n",
"2003-06-29 1\n",
"1997-05-14 1\n",
"1998-12-04 1\n",
"2009-12-01 1\n",
"2013-11-21 1\n",
"2007-08-20 1\n",
"2000-01-08 1\n",
"2014-07-28 1\n",
"1995-06-15 1\n",
"2014-12-25 1\n",
"2009-04-28 1\n",
"1995-05-06 1\n",
"1996-07-24 1\n",
"2014-06-26 1\n",
" ..\n",
"1998-01-03 1\n",
"1994-08-01 1\n",
"2000-10-01 1\n",
"2005-07-24 1\n",
"2007-11-01 1\n",
"1997-08-30 1\n",
"1998-09-13 1\n",
"1996-10-30 1\n",
"2003-11-06 1\n",
"1994-07-19 1\n",
"1995-01-06 1\n",
"1998-06-10 1\n",
"2001-11-12 1\n",
"2004-02-24 1\n",
"2007-07-29 1\n",
"2008-09-18 1\n",
"2012-02-21 1\n",
"1996-06-05 1\n",
"1994-02-18 1\n",
"2010-09-12 1\n",
"2013-08-15 1\n",
"2006-10-18 1\n",
"2001-05-25 1\n",
"2011-05-13 1\n",
"2014-04-27 1\n",
"2012-10-24 1\n",
"2003-08-30 1\n",
"2009-11-21 1\n",
"2008-03-31 1\n",
"2006-03-22 1\n",
"Name: date, dtype: int64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Проверим, что даты не повторяются\n",
"df3.index.value_counts().head()\n",
"# Даты уникальны!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Найдите количество детей, рождающихся 6, 13 и 20 числа каждого месяца с учетом дня недели.\n",
"\n",
"Выделяется ли как-то пятница 13?"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Сделаем таблицу для 6 числа\n",
"idx = df3.loc[:, 'day'] == 6\n",
"b6 = df3.loc[idx, :].groupby('day_of_week').births.mean()\n",
"\n",
"# И для всех остальных\n",
"idx = df3.loc[:, 'day'] == 13\n",
"b13 = df3.loc[idx, :].groupby('day_of_week').births.mean()\n",
"\n",
"idx = df3.loc[:, 'day'] == 20\n",
"b20 = df3.loc[idx, :].groupby('day_of_week').births.mean()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"day_of_week\n",
"1 -597.400000\n",
"2 -448.805556\n",
"3 -79.028571\n",
"4 -167.131579\n",
"5 -237.342857\n",
"6 -71.361111\n",
"7 -19.189189\n",
"Name: births, dtype: float64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b6-b20"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"day_of_week\n",
"1 520.142857\n",
"2 491.722222\n",
"3 403.742857\n",
"4 507.052632\n",
"5 911.342857\n",
"6 93.805556\n",
"7 81.891892\n",
"Name: births, dtype: float64"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b20-b13"
]
},
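{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three per-day tables above could also be built in one pass. The following is only a sketch on made-up data: `toy` is a hypothetical stand-in for `df3` with a constant birth count, and `table` is a name introduced here for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for df3: one row per day, constant birth count\n",
"dates = pd.date_range('2000-01-01', '2001-12-31')\n",
"toy = pd.DataFrame({'day': dates.day,\n",
"                    'day_of_week': dates.dayofweek + 1,  # 1 = Monday, as in the data\n",
"                    'births': 100}, index=dates)\n",
"\n",
"# Mean births by day of week (rows) for the 6th, 13th and 20th (columns)\n",
"subset = toy[toy['day'].isin([6, 13, 20])]\n",
"table = subset.groupby(['day_of_week', 'day'])['births'].mean().unstack('day')\n",
"```\n",
"\n",
"With the real data, the cell `table.loc[5, 13]` would hold the Friday the 13th average, which can be compared with the neighbouring columns in the same row."
]
},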
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Качество вина"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Загрузите [датасет](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) с информацией о характеристиках вина и его качестве."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Что из себя представляет объект в этом наборе данных? Сколько их?\n",
"* Какие признаки описывают объекты? Сколько их?\n",
"* Какой признак является целевым?\n",
"* Каковы их области значений?\n",
"* Есть ли пропуски?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Какие признаки больше всего влияют на целевую переменную?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Создайте новый столбец `quality_cat`, которая будет иметь значение `\"good\"` если `quality > 5` и `\"bad\"` - иначе.
"
]
},
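{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to build such a column is `np.where`. This is a sketch on a hypothetical toy table `wine` rather than the real file, whose values are made up here. When reading the actual UCI file, the fields appear to be separated by `;`, so `pd.read_csv(..., sep=';')` may be needed.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy table standing in for the wine data\n",
"wine = pd.DataFrame({'alcohol': [9.4, 10.5, 11.2, 9.8],\n",
"                     'quality': [5, 6, 7, 4]})\n",
"\n",
"# 'good' where quality > 5, 'bad' otherwise\n",
"wine['quality_cat'] = np.where(wine['quality'] > 5, 'good', 'bad')\n",
"```"
]
},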
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Нарисуйте гистрограммы признака alcohol в группах с `quality_cat == \"good\"` и `quality_cat == \"bad\"`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Можете ли вы придумать правило для классификации вина на хорошее и плохое по рисунку выше? Пусть это будет нашей первой моделью)\n",
"\n",
"Напишите функцию `brute_clf_train()` которая бы перебирала пороговое значение по признаку `alcohol` и находило бы \"оптимальное\" (кстати, что значит оптимальное?)"
]
},
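{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible `brute_clf_train()`, under two assumptions that the task leaves open: \"optimal\" is taken to mean the fewest misclassified wines, and the rule is `alcohol > threshold` means good. The argument names and the toy data are hypothetical.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def brute_clf_train(alcohol, is_good):\n",
"    # Try every observed alcohol value as a threshold and keep the one\n",
"    # with the fewest errors for the rule 'alcohol > threshold => good'\n",
"    alcohol = np.asarray(alcohol, dtype=float)\n",
"    is_good = np.asarray(is_good, dtype=bool)\n",
"    best_threshold, best_errors = None, None\n",
"    for threshold in np.unique(alcohol):\n",
"        errors = int(np.sum((alcohol > threshold) != is_good))\n",
"        if best_errors is None or errors < best_errors:\n",
"            best_threshold, best_errors = threshold, errors\n",
"    return best_threshold, best_errors\n",
"\n",
"# Hypothetical toy data\n",
"alcohol = [9.0, 9.5, 10.0, 10.5, 11.0, 12.0]\n",
"is_good = [False, False, False, True, True, True]\n",
"threshold, n_errors = brute_clf_train(alcohol, is_good)\n",
"```\n",
"\n",
"Other definitions of optimal, for example weighting the two kinds of error differently, would only change the line that computes `errors`."
]
},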
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Напишите функцию `brute_clf_predict()` которая бы по значению признака `alcohol` и найденному выше порогу говорила какое качество у вина.\n",
"\n",
"А заодно выводила бы количество \"ошибок\" на текущем наборе данных"
]
},
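{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible shape for `brute_clf_predict()`, again only as a sketch: the signature, the optional `quality_cat` argument and the toy inputs are assumptions made for illustration, and the rule is the same `alcohol > threshold` means good.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def brute_clf_predict(alcohol, threshold, quality_cat=None):\n",
"    # Label wines with alcohol above the threshold as 'good'\n",
"    alcohol = np.asarray(alcohol, dtype=float)\n",
"    predicted = np.where(alcohol > threshold, 'good', 'bad')\n",
"    n_errors = None\n",
"    if quality_cat is not None:\n",
"        # Count disagreements with the known labels\n",
"        n_errors = int(np.sum(predicted != np.asarray(quality_cat)))\n",
"        print('errors: %d' % n_errors)\n",
"    return predicted, n_errors\n",
"\n",
"# Hypothetical toy inputs\n",
"predicted, n_errors = brute_clf_predict([9.0, 10.5, 12.0], 10.0,\n",
"                                        ['bad', 'bad', 'good'])\n",
"```"
]
},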
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Проверим, как обобщается наша модель на другие данные.\n",
"\n",
"* Загрузите другой [датасет](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)\n",
"* Выполните те же панипуляции с признаками\n",
"* Используйте нашу простейшую модель для предсказания качества на новых данных"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Вспоминаем NumPy"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Упражнения с векторами и матрицами"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Линейная регрессия"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Загрузите [файл 1](https://www.dropbox.com/s/kg9px9v3xfysak9/tutorial_dataset.csv?dl=0) и [файл 2](https://www.dropbox.com/s/f87gm612o144emx/tutorial_dataset_2.csv?dl=0) в папку с тетрадкой. С помощью функции `loadtxt` в модуле `numpy` загрузите табличные данные одного из файлов. Присвойте y = D[:,0] а X = D[:, 1:].\n",
"\n",
"Сейчас мы воспользуемся одной магической формулой и построим модель линейной регрессии. Откуда эта формула берется мы узнаем на следующих занятиях.\n",
"\n",
"Модель линейной регрессии в матричном виде выглядит так: $\\hat{y} = X\\hat{\\beta}$, где\n",
"\n",
"$$ \\hat{\\beta} = (X^T X)^{-1} X^T y $$\n",
"Остатки модели рассчитываются как\n",
"$$ \\text{res} = y - \\hat{y} $$\n",
"\n",
"Итак, еще раз:\n",
"\n",
"1. Загрузите данные\n",
"2. Оцените веса $\\beta$ с помощью формулы\n",
"3. Постройте график, на котором по оси Y: остатки, а по оси X: $\\hat{y}$"
]
},
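{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the steps above, assuming synthetic data in place of the course files. The names `rng` and `true_beta` and the generated `D` are hypothetical and only mimic the layout `y = D[:, 0]`, `X = D[:, 1:]`. The sketch solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $\\hat{\\beta}$ whenever $X^T X$ is invertible.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical synthetic data: first column is y, the rest are features\n",
"rng = np.random.RandomState(0)\n",
"X = rng.normal(size=(100, 3))\n",
"true_beta = np.array([1.5, -2.0, 0.5])\n",
"y = X.dot(true_beta) + 0.1 * rng.normal(size=100)\n",
"D = np.column_stack([y, X])\n",
"\n",
"# Split the table the same way as in the task\n",
"y = D[:, 0]\n",
"X = D[:, 1:]\n",
"\n",
"# beta_hat = (X^T X)^{-1} X^T y, computed by solving a linear system\n",
"beta_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))\n",
"\n",
"# Fitted values and residuals\n",
"y_hat = X.dot(beta_hat)\n",
"res = y - y_hat\n",
"```\n",
"\n",
"Plotting `res` against `y_hat`, for example with `plt.scatter(y_hat, res)`, then gives the residual plot asked for in step 3."
]
},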
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# load data\n",
"D = np.loadtxt('tutorial_dataset_1.csv', \n",
" skiprows=1, \n",
" delimiter=',')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Write your code here\n",
"#\n",
"#\n",
"#"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
},
"nav_menu": {},
"toc": {
"navigate_menu": true,
"number_sections": false,
"sideBar": true,
"threshold": 6,
"toc_cell": false,
"toc_section_display": "block",
"toc_window_display": true
},
"toc_position": {
"height": "924px",
"left": "0px",
"right": "1622.67px",
"top": "108px",
"width": "212px"
}
},
"nbformat": 4,
"nbformat_minor": 0
}