{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"center\" src=\"https://raw.githubusercontent.com/FUlyankin/Parsers/master/images%20/cats.jpg\" height=\"1200\" width=\"1200\"> \n",
    "\n",
    "# Домашка №3: основы статистики\n",
    "\n",
    "\n",
    "У всех нас есть датасет по контакту. В нём лежит информация про всех нас. Эту информацию надо как следует проанализировать. Именно этим мы и подолжаем заниматься.\n",
    "\n",
    "Грамотно расчитывайте свои силы и делайте тот объём заданий, который позволит вам получить желаемую оценку :) \n",
    "\n",
    "__Важно:__ за циклы в любом из пунктов вы получаете ноль баллов.\n",
    "\n",
    "----------------------\n",
    "\n",
    "Подгрузим данные и посмотрим на первые пять строчек из таблицы.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>city</th>\n",
       "      <th>country</th>\n",
       "      <th>first_name</th>\n",
       "      <th>home_town</th>\n",
       "      <th>in_hse_memes_group</th>\n",
       "      <th>is_bmm</th>\n",
       "      <th>is_closed</th>\n",
       "      <th>last_name</th>\n",
       "      <th>likes_memes</th>\n",
       "      <th>uid</th>\n",
       "      <th>...</th>\n",
       "      <th>photo_month_mean</th>\n",
       "      <th>photo_repost_cnt</th>\n",
       "      <th>photo_repost_max</th>\n",
       "      <th>photo_repost_mean</th>\n",
       "      <th>photo_repost_median</th>\n",
       "      <th>photo_text_len_cnt</th>\n",
       "      <th>photo_ava_change_cnt</th>\n",
       "      <th>photo_text_url_len_cnt</th>\n",
       "      <th>friends_from_course_cnt</th>\n",
       "      <th>friends_mail_from_course_pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Москва</td>\n",
       "      <td>Россия</td>\n",
       "      <td>Александра</td>\n",
       "      <td>Москва</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>Абашкова</td>\n",
       "      <td>60.0</td>\n",
       "      <td>182152789</td>\n",
       "      <td>...</td>\n",
       "      <td>1.333333</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>0.428571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Рязань</td>\n",
       "      <td>Россия</td>\n",
       "      <td>Анастасия</td>\n",
       "      <td>Рязань</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>Чуфистова</td>\n",
       "      <td>0.0</td>\n",
       "      <td>148020433</td>\n",
       "      <td>...</td>\n",
       "      <td>2.375000</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.105263</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0.281250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Москва</td>\n",
       "      <td>Россия</td>\n",
       "      <td>Александр</td>\n",
       "      <td>Омск</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>Головачев</td>\n",
       "      <td>0.0</td>\n",
       "      <td>138413935</td>\n",
       "      <td>...</td>\n",
       "      <td>1.400000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0.406250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Анна</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>Лобанова</td>\n",
       "      <td>0.0</td>\n",
       "      <td>366261055</td>\n",
       "      <td>...</td>\n",
       "      <td>4.166667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>45.0</td>\n",
       "      <td>0.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Россия</td>\n",
       "      <td>Алексей</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>Пузырный</td>\n",
       "      <td>21.0</td>\n",
       "      <td>111252392</td>\n",
       "      <td>...</td>\n",
       "      <td>3.181818</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>0.341463</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 98 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     city country  first_name home_town  in_hse_memes_group  is_bmm  \\\n",
       "0  Москва  Россия  Александра    Москва                True    True   \n",
       "1  Рязань  Россия   Анастасия    Рязань                True    True   \n",
       "2  Москва  Россия   Александр      Омск               False    True   \n",
       "3     NaN     NaN        Анна       NaN               False    True   \n",
       "4     NaN  Россия     Алексей       NaN                True    True   \n",
       "\n",
       "   is_closed  last_name  likes_memes        uid  ...  photo_month_mean  \\\n",
       "0      False   Абашкова         60.0  182152789  ...          1.333333   \n",
       "1      False  Чуфистова          0.0  148020433  ...          2.375000   \n",
       "2      False  Головачев          0.0  138413935  ...          1.400000   \n",
       "3      False   Лобанова          0.0  366261055  ...          4.166667   \n",
       "4      False   Пузырный         21.0  111252392  ...          3.181818   \n",
       "\n",
       "   photo_repost_cnt  photo_repost_max  photo_repost_mean  photo_repost_median  \\\n",
       "0               0.0               0.0           0.000000                  0.0   \n",
       "1               2.0               1.0           0.105263                  0.0   \n",
       "2               0.0               0.0           0.000000                  0.0   \n",
       "3               0.0               0.0           0.000000                  0.0   \n",
       "4               0.0               0.0           0.000000                  0.0   \n",
       "\n",
       "   photo_text_len_cnt  photo_ava_change_cnt  photo_text_url_len_cnt  \\\n",
       "0                 0.0                   0.0                     0.0   \n",
       "1                 0.0                   0.0                     0.0   \n",
       "2                 0.0                   0.0                     0.0   \n",
       "3                 0.0                   0.0                     0.0   \n",
       "4                 0.0                   0.0                     0.0   \n",
       "\n",
       "   friends_from_course_cnt  friends_mail_from_course_pct  \n",
       "0                     42.0                      0.428571  \n",
       "1                     32.0                      0.281250  \n",
       "2                     32.0                      0.406250  \n",
       "3                     45.0                      0.333333  \n",
       "4                     41.0                      0.341463  \n",
       "\n",
       "[5 rows x 98 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../data/vk_download/vk_main.csv', sep='\\t')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[1] У скольких человек с маркетинга нет авы?__\n",
    "\n",
    "Используй переменные `is_bmm` и `has_ava_dummy` (значение `True` можно воспринимать как $1$, а `False` — как $0$). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "98"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[(df.is_bmm == 1) & (df.has_ava_dummy == 0)].shape[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[2] Сколько парней/девушек с инстой/без указали в профиле, что знают английский?__  \n",
    "\n",
    "Эта информация лежит в переменной `english_dummy`. Значение $1$ означает, что английский язык указан в профиле. По аналогии в `instagram_dummy` $1$ означают, что человек оставил ссылку на свою инсту. __Проинтерпретируйте получившиеся цифры.__ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "male_dummy  instagram_dummy\n",
       "0           0                  0.126437\n",
       "            1                  0.214286\n",
       "1           0                  0.142857\n",
       "            1                  0.185185\n",
       "Name: english_dummy, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_agg = df.groupby(['male_dummy', 'instagram_dummy'])['english_dummy']\n",
    "df_agg.sum()/df_agg.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Люди, которые указали инстаграмм, указывают и знание английского языка. Этот вовсе не означает, что люди с инстаграммом лучше знают язык. Скорее всего, они просто более открыты миру и более подробно заполняют свой профиль в социальной сетке. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[1] Вывести имена людей, у которых на стенках больше всего эмодзи (топ 5%)__\n",
    "\n",
    " Все эмодзи, которые были оставлены на стенке у студента, лежат в колонке `wall_emoji_trace`.  Отсортируйте всех людей из этого топа по числу эмодзи. При работе с табличкой не забудьте заполнить пропуски."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>first_name</th>\n",
       "      <th>wall_emoji_cnt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Анастасия</td>\n",
       "      <td>1975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>231</th>\n",
       "      <td>Григорий</td>\n",
       "      <td>1914</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>296</th>\n",
       "      <td>Юлия</td>\n",
       "      <td>1904</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>Elizabeth</td>\n",
       "      <td>1456</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>176</th>\n",
       "      <td>Анжелика</td>\n",
       "      <td>1076</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>Алик</td>\n",
       "      <td>1058</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>161</th>\n",
       "      <td>Оля</td>\n",
       "      <td>751</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>214</th>\n",
       "      <td>Дуся</td>\n",
       "      <td>737</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>180</th>\n",
       "      <td>Аня</td>\n",
       "      <td>473</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>253</th>\n",
       "      <td>Стася</td>\n",
       "      <td>446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>Иван</td>\n",
       "      <td>441</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>395</th>\n",
       "      <td>Тимур</td>\n",
       "      <td>402</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>414</th>\n",
       "      <td>Дима</td>\n",
       "      <td>357</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>208</th>\n",
       "      <td>Ден</td>\n",
       "      <td>357</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>Дарья</td>\n",
       "      <td>309</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>209</th>\n",
       "      <td>Диана</td>\n",
       "      <td>306</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>369</th>\n",
       "      <td>Полина</td>\n",
       "      <td>293</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <td>Ксения</td>\n",
       "      <td>282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>154</th>\n",
       "      <td>Виктория</td>\n",
       "      <td>274</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>174</th>\n",
       "      <td>Андрей</td>\n",
       "      <td>239</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>375</th>\n",
       "      <td>Сандрик</td>\n",
       "      <td>223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>387</th>\n",
       "      <td>Соня</td>\n",
       "      <td>213</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    first_name  wall_emoji_cnt\n",
       "1    Анастасия            1975\n",
       "231   Григорий            1914\n",
       "296       Юлия            1904\n",
       "113  Elizabeth            1456\n",
       "176   Анжелика            1076\n",
       "196       Алик            1058\n",
       "161        Оля             751\n",
       "214       Дуся             737\n",
       "180        Аня             473\n",
       "253      Стася             446\n",
       "198       Иван             441\n",
       "395      Тимур             402\n",
       "414       Дима             357\n",
       "208        Ден             357\n",
       "199      Дарья             309\n",
       "209      Диана             306\n",
       "369     Полина             293\n",
       "261     Ксения             282\n",
       "154   Виктория             274\n",
       "174     Андрей             239\n",
       "375    Сандрик             223\n",
       "387       Соня             213"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['wall_emoji_cnt'] = df['wall_emoji_trace'].fillna('').apply(len)\n",
    "\n",
    "q = df['wall_emoji_cnt'].quantile(0.95)\n",
    "\n",
    "df.loc[df['wall_emoji_cnt'] >= q,\n",
    "       ['first_name', 'wall_emoji_cnt']].sort_values('wall_emoji_cnt', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[3] Давайте проанализируем колонку со средним кол-во фото в месяц (`photo_month_mean`)__\n",
    "\n",
    "* Постройте на одной картинке гистограмы для распределения этого показателя по разным полам. \n",
    "* Правда ли, что типичная девушка выкладывает значительно больше фотографий, чем типичный мужчина? (подумайте какой именно показатель типичности нужно выбрать для сравнения и обоснуйте почему)\n",
    "* Для какого пола показатель оказывается более непредсказуемым? (подумайте как именно корректно эту непредсказуемость оценить, обычное стандартное отклонение явно не подходит)\n",
    "\n",
    "Не забывайте подгрузить пакет `matplotlib`! __Все свои рассуждения пишите прямо по ходу кода! Нет рассуждений => нет баллов!__ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "\n",
      "text/plain": [
       "<Figure size 576x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "df[df.male_dummy == 1].photo_month_mean.hist(figsize=(8,4), alpha=0.3, label=\"male\", bins=50)\n",
    "df[df.male_dummy == 0].photo_month_mean.hist(figsize=(8,4), alpha=0.3, label=\"female\", bins=50)\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "У распределений длинные хвосты, есть выбросы в данных. Среднее чувствительно к выбросам, значит более корректно сравнивать \"типичных\" мальчиков и девочек с помощью медиан. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "male_dummy\n",
       "0    1.75\n",
       "1    1.50\n",
       "Name: photo_month_mean, dtype: float64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('male_dummy')['photo_month_mean'].agg('median')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Дисперсия, а значит и среднее квадратичное отклонение, чувствительны к выбросам. Чтобы полечить выборку от их тлетворного влияния, нужно сделать срез. Как отсечь выбросы? Ну например, можно считать выбросом всё, что пробивает $95\\%$ квантиль. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10.837499999999999"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q = df['photo_month_mean'].quantile(0.95)\n",
    "q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeMAAAD8CAYAAABEgMzCAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEsFJREFUeJzt3W+MXfV95/H3JxgTlyk11N0rB6w1UpErNqs4yyhKilTN8GcV2qrwoEKpditrhTT7oJumu5W6dKU+WGl3RaVV2zxYrWSFNJY26wmiibCiJi02TKtKXTZxMtsmdhGUhmIwdih446HeBJLvPphDOwWbe2d87vw8975fknXPOfO7937nC/jD73fOPTdVhSRJauc9rQuQJGnaGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNbdvMN9u1a1ft3bt3M9/yivD6669z7bXXti5jYtjP/tjLftnPfk1CP48fP/5KVf3YsHGbGsZ79+7lq1/96ma+5RVhaWmJubm51mVMDPvZH3vZL/vZr0noZ5LnRxnnMrUkSY0ZxpIkNTY0jJPsS7K85s93kvxKkhuSPJ7kme7x+s0oWJKkSTM0jKvq6araX1X7gduAvwW+ADwIHKuqW4Bj3b4kSVqn9S5T3wn8ZVU9D9wLHOqOHwLu67MwSZKmxXrD+GPA4W57UFWnu+2XgUFvVUmSNEVSVaMNTLYDLwH/pKrOJDlXVTvX/Py1qnrHeeMkC8ACwGAwuG1xcbGfyreQlZUVZmZmWpcxMexnf+xlv+xnvyahn/Pz88eranbYuPV8zvge4GtVdabbP5Nkd1WdTrIbOHuxJ1XVQeAgwOzsbG31z4xtxCR8Vu5KYj/7Yy/7ZT/7NU39XM8y9S/w90vUAEeAA932AeCxvoqSJGmajDQzTnItcDfwr9ccfgh4JMkDwPPA/f2X9+6Wjx4ePmiN/Xu6VfV994yhGkmSNmakMK6q14Effduxv2H16mpJknQZvAOXJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNjRTGSXYmeTTJXyQ5meQjSW5I8niSZ7rH68ddrCRJk2jUmfEngS9X1U8AHwBOAg8Cx6rqFuBYty9JktZpaBgn+RHgp4CHAarqe1V1DrgXONQNOwTcN64iJUmaZKPMjG8Gvg38bpKvJ/lUkmuBQVWd7sa8DAzGVaQkSZMsVfXuA5JZ4H8Bt1fVU0k+CXwH+HhV7Vwz7rWqesd54yQLwALAYDC4bXFxsbfiL5x/dV3jd2zftrpxzXW91TCKlZUVZmZmNvU9J5n97I+97Jf97Nck9HN+fv54Vc0OG7dthNc6BZyqqqe6/UdZPT98JsnuqjqdZDdw9mJPrqqDwEGA2dnZmpubG6X+kSwfPbyu8fv37Fjd2NdfDaNYWlqiz9972tnP/tjLftnPfk1TP4cuU1fVy8ALSfZ1h+4ETgBHgAPdsQPAY2OpUJKkCTfKzBjg48Bnk2wHngP+FatB/kiSB4DngfvHU6IkSZNtpDCuqmXgYmved/ZbjiRJ08c7cEmS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1Ni2UQYl+RZwHvg+8GZVzSa5AfgcsBf4FnB/Vb02njIlSZpc65kZz1fV/qqa7fYfBI5V1S3AsW5fkiSt0+UsU98LHOq2DwH3XX45kiRNn1HDuIA/THI8yUJ3bFBVp7vtl4FB79VJkjQFUlXDByU3VtWL
Sf4R8DjwceBIVe1cM+a1qrr+Is9dABYABoPBbYuLi70Vf+H8q+sav2N7d4r8mut6q2EUKysrzMzMbOp7TjL72R972S/72a9J6Of8/PzxNad3L2mkC7iq6sXu8WySLwAfAs4k2V1Vp5PsBs5e4rkHgYMAs7OzNTc3N+KvMNzy0cPrGr9/z47VjX391TCKpaUl+vy9p5397I+97Jf97Nc09XPoMnWSa5P88FvbwD8HvgEcAQ50ww4Aj42rSEmSJtkoM+MB8IUkb43/n1X15SRfAR5J8gDwPHD/+MqUJGlyDQ3jqnoO+MBFjv8NcOc4ipIkaZp4By5JkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhobOYyTXJXk60m+2O3fnOSpJM8m+VyS7eMrU5KkybWemfEngJNr9n8T+O2q+nHgNeCBPguTJGlajBTGSW4Cfgb4VLcf4A7g0W7IIeC+cRQoSdKkG3Vm/DvArwE/6PZ/FDhXVW92+6eAG3uuTZKkqbBt2IAkPwucrarjSebW+wZJFoAFgMFgwNLS0npf4pIuXLhmXePPfeuN1Y3T/dUwipWVlV5/72lnP/tjL/tlP/s1Tf0cGsbA7cDPJflp4L3AdcAngZ1JtnWz45uAFy/25Ko6CBwEmJ2drbm5uT7qBmD56OF1jd+/Z8fqxr7+ahjF0tISff7e085+9sde9st+9mua+jl0mbqqfr2qbqqqvcDHgCeq6l8ATwI/3w07ADw2tiolSZpgl/M5438P/Lskz7J6DvnhfkqSJGm6jLJM/XeqaglY6rafAz7Uf0mSJE0X78AlSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjQ8M4yXuT/O8k/yfJN5P8x+74zUmeSvJsks8l2T7+ciVJmjyjzIy/C9xRVR8A9gMfTfJh4DeB366qHwdeAx4YX5mSJE2uoWFcq1a63au7PwXcATzaHT8E3DeWCiVJmnAjnTNOclWSZeAs8Djwl8C5qnqzG3IKuHE8JUqSNNlSVaMPTnYCXwB+A/hMt0RNkj3Al6rq/Rd5zgKwADAYDG5bXFzso24ALpx/dV3jd2zftrpxzXW91TCKlZUVZmZmNvU9J5n97I+97Jf97Nck9HN+fv54Vc0OG7dtPS9aVeeSPAl8BNiZZFs3O74JePESzzkIHASYnZ2tubm59bzlu1o+enhd4/fv2bG6sa+/GkaxtLREn7/3tLOf/bGX/bKf/Zqmfo5yNfWPdTNikuwA7gZOAk8CP98NOwA8Nq4iJUmaZKPMjHcDh5JcxWp4P1JVX0xyAlhM8p+ArwMPj7FOSZIm1tAwrqo/Az54kePPAR8aR1GSJE0T78AlSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNbau22FudcsvnAPgle+fGfk5d906GFc5kiQBzowlSWrOMJYkqTHDWJKkxgxjSZIaM4wlSWrMMJYkqTHDWJKkxgxjSZIaM4wlSWrMMJYkqTHDWJKkxqbq3tRv2fXSE6MPvmrn6uO+e8ZTjCRp6jkzliSpsaFhnGRPkieTnEjyzSSf6I7fkOTxJM90j9ePv1xJkibPKDPjN4FfrapbgQ8Dv5TkVuBB4FhV3QIc6/YlSdI6DQ3jqjpdVV/rts8DJ4EbgXuBQ92wQ8B94ypSkqRJtq5zxkn2Ah8EngIGVXW6+9HLwKDXyiRJmhKpqtEGJjPAHwH/uao+n+RcVe1c8/PXquod542TLAALAIPB4LbFxcV+KgcunH+1t9e6lB3buwvOr7luw6+x
srLCzMxMTxXJfvbHXvbLfvZrEvo5Pz9/vKpmh40b6aNNSa4Gfg/4bFV9vjt8JsnuqjqdZDdw9mLPraqDwEGA2dnZmpubG+UtR7J89HBvr3Up+/fsWN3YN7fh11haWqLP33va2c/+2Mt+2c9+TVM/R7maOsDDwMmq+q01PzoCHOi2DwCP9V+eJEmTb5SZ8e3ALwJ/nmS5O/YfgIeAR5I8ADwP3D+eEiVJmmxDw7iq/gTIJX58Z7/lSJI0fbwDlyRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1NtK3Nk2z5RfOAfDK98+MNP6uW/1aZ0nS+jgzliSpMcNYkqTGXKYet6e/BN99Y/VxVPvuGV89kqQrjjNjSZIaM4wlSWrMZeqeHT3xD6+63vXSOS587xqWX3j9ouP379m5GWVJkq5gzowlSWrMMJYkqTGXqRt766Yia73bDUa8qYgkTR5nxpIkNTY0jJN8OsnZJN9Yc+yGJI8neaZ7vH68ZUqSNLlGmRl/Bvjo2449CByrqluAY92+JEnagKFhXFV/DLz6tsP3Aoe67UPAfT3XJUnS1NjoOeNBVZ3utl8GvKpIkqQNSlUNH5TsBb5YVe/v9s9V1c41P3+tqi563jjJArAAMBgMbltcXOyh7FUXzr99wj4+b1593Yaet+2N7/DGD97D1e/5Qc8VvdNGa3y7H37vlX2R/crKCjMzM63LmAj2sl/2s1+T0M/5+fnjVTU7bNxG/9Y9k2R3VZ1Oshs4e6mBVXUQOAgwOztbc3NzG3zLd1o+eri31xrmlff90w09b9dLT3D6wjXs3vHdnit6p43W+HZzV/jHp5aWlujz36NpZi/7ZT/7NU393Ogy9RHgQLd9AHisn3IkSZo+o3y06TDwp8C+JKeSPAA8BNyd5Bngrm5fkiRtwNBl6qr6hUv86M6ea5EkaSp5By5Jkhq7si+bvYLseumJ1iUMtZEaX3nfHWOoRJK0Hs6MJUlqzDCWJKkxw1iSpMYMY0mSGjOMJUlqzKupJeDoiTPrGn/XFX7LUElbizNjSZIaM4wlSWrMMJYkqTHDWJKkxgxjSZIa82pqbczTX/q7zeUXzo30lL7ug33h/735rlc/T/SVzmv6vtawfwYX6/1E90naYpwZS5LUmGEsSVJjLlNPuYt+7eJVOzfvvUYwsV/zeIkl581y9MSZoUv+l8NlcGl0zowlSWrMMJYkqTGXqfUOo14dfaUa17Lr5bzHNC7ZjvufwyT0tPd/jzZy6mPfPet/jnrnzFiSpMYuK4yTfDTJ00meTfJgX0VJkjRNNrxMneQq4L8BdwOngK8kOVJVJ/oqTtpsG73ie5jll8bysmO3kX5s1tXvR0+cWXd9+/d0nxQY09Ls+Ytcnf5uNe66xPFN/QTBRq/q73p4xZ6yuczfa7Ndzsz4Q8CzVfVcVX0PWATu7acsSZKmx+WE8Y3AC2v2T3XHJEnSOoz9auokC8BCt7uS5Olxv+cVaBfwSusiJoj97I+97Jf97Nck9PMfjzLocsL4RWDPmv2bumP/QFUdBA5exvtseUm+WlWzreuYFPazP/ayX/azX9PUz8tZpv4KcEuSm5NsBz4GHOmnLEmSpseGZ8ZV9WaSfwP8AXAV8Omq+mZvlUmSNCUu65xxVf0+8Ps91TLJpnqZfgzsZ3/sZb/sZ7+mpp+pqtY1SJI01bwdpiRJjRnGY5RkT5Ink5xI8s0kn2hd01aX5KokX0/yxda1bHVJdiZ5NMlfJDmZ5COta9qqkvzb7r/xbyQ5nOS9rWvaSpJ8OsnZJN9Yc+yGJI8neaZ7vL5ljeNmGI/Xm8CvVtWtwIeBX0pya+OatrpPACdbFzEhPgl8uap+AvgA9nVDktwI/DIwW1XvZ/WC1o+1rWrL+Qzw0bcdexA4VlW3AMe6/YllGI9RVZ2uqq912+dZ/cvOu5RtUJKbgJ8BPtW6lq0uyY8APwU8DFBV36uq
rf3dmW1tA3Yk2Qb8ELBF70beRlX9MfDq2w7fCxzqtg8B921qUZvMMN4kSfYCHwSealvJlvY7wK8BP2hdyAS4Gfg28Lvdsv+nklzbuqitqKpeBP4r8NfAaeD/VtUftq1qIgyq6nS3/TKw9b/A+l0YxpsgyQzwe8CvVNV3WtezFSX5WeBsVR1vXcuE2Ab8M+C/V9UHgdeZ8GXAcenOZd7L6v/gvA+4Nsm/bFvVZKnVj/1M9Ed/DOMxS3I1q0H82ar6fOt6trDbgZ9L8i1WvyHsjiT/o21JW9op4FRVvbVS8yir4az1uwv4q6r6dlW9AXwe+MnGNU2CM0l2A3SPZxvXM1aG8RglCavn5E5W1W+1rmcrq6pfr6qbqmovqxfHPFFVzj42qKpeBl5Isq87dCfgd5FvzF8DH07yQ91/83fixXB9OAIc6LYPAI81rGXsDOPxuh34RVZnccvdn59uXZTU+Tjw2SR/BuwH/kvjerakbnXhUeBrwJ+z+vfq1Nw5qg9JDgN/CuxLcirJA8BDwN1JnmF19eGhljWOm3fgkiSpMWfGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjf1/Q/wm2yqlkEQAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 576x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "df[df.photo_month_mean < q].groupby('male_dummy')['photo_month_mean'].hist(figsize=(8,4), alpha=0.3, bins=30);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>median</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>male_dummy</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.420398</td>\n",
       "      <td>1.815699</td>\n",
       "      <td>1.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.130265</td>\n",
       "      <td>1.811635</td>\n",
       "      <td>1.500000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                mean       std    median\n",
       "male_dummy                              \n",
       "0           2.420398  1.815699  1.666667\n",
       "1           2.130265  1.811635  1.500000"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df.photo_month_mean < q].groupby('male_dummy')['photo_month_mean'].agg(['mean', 'std', 'median'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Разброс между группами не различается. Если вы сразу догадались, что данные кишат выбросами и обрезали их, вы вполне себе можете теперь сравнивать типичных представителей по средним. Такой ответ тоже считается правильным. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[3] Немного сегментации__\n",
    "\n",
    "Давайте попробуем разбить всех людей из нашей таблички на равные сегменты по тому сколько у них есть друзей с потока (`friends_from_course_cnt`).  Для этого: \n",
    "\n",
    "* Заполните пропуски в переменной `friends_from_course_cnt`  нулями.\n",
    "* Найдите для переменной `friends_from_course_cnt` квантили уровня $0.25, 0.5, 0.75$. \n",
    "* Создайте колонку `segment` и заполните её по следующему принципу: \n",
    "    - Поставьте в неё $1$, если у человека друзей меньше, чем квантиль уровня $0.25$ \n",
    "    - Поставьте $2$, если у человека друзей от квантиля $0.25$ до $0.5$\n",
    "    - Поставьте $3$, если у человека друзей от квантиля $0.5$ до $0.75$\n",
    "    - Поставьте $4$, если у человека друзей от квантиля $0.75$\n",
    "    \n",
    "Получившиеся группы - это разные сегменты курса. Одни из них более социальные, другие менее социальные. Используя полученную колонку ответьте на вопросы: \n",
    "\n",
    "- Какое направление (БММ или УБ) оказалось более социальным? \n",
    "- Правда ли, что более социальные люди чаще лайкают мемы в вышкинской группе? \n",
    "- Чётко опишите все выводы, которые вы сделали. \n",
    "\n",
    "__Hint:__ для выявления группы для человека можно написать функцию, которая сравнит его число друзей со всеми квантилями и выдаст группу. Потом эту функцию можно применить к колонке с помощью `apply`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.friends_from_course_cnt.fillna(0, inplace=True)\n",
    "q = df.friends_from_course_cnt.quantile([0.25, 0.5, 0.75]).values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_segm(friends):\n",
    "    if friends <= q[0]:\n",
    "        return 1\n",
    "    elif (friends > q[0])&(friends <= q[1]):\n",
    "        return 2\n",
    "    elif (friends > q[1])&(friends <= q[2]):\n",
    "        return 3\n",
    "    else:\n",
    "        return 4\n",
    "    \n",
    "df['segment'] = df.friends_from_course_cnt.apply(find_segm)"
   ]
  },
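  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a possible cross-check (a sketch, not part of the original solution), the same segmentation can be reproduced without `apply` by passing the quantiles to `pd.cut` as bin edges. With the default `right=True` the bins are closed on the right, which matches the `<=` comparisons in `find_segm`. The name `segment_cut` is made up for illustration, and the sketch assumes the three quantiles in `q` are distinct, since `pd.cut` rejects duplicate edges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: vectorised segmentation with pd.cut.\n",
    "# Assumes q holds the 0.25, 0.5 and 0.75 quantiles computed above.\n",
    "bins = [-float('inf'), *q, float('inf')]\n",
    "\n",
    "# labels=False returns bin indices 0..3, so add 1 to get segments 1..4\n",
    "segment_cut = pd.cut(df['friends_from_course_cnt'], bins=bins, labels=False) + 1\n",
    "\n",
    "# Check whether both approaches assign every person to the same segment\n",
    "(segment_cut == df['segment']).all()"
   ]
  },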
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>segment</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>is_bmm</th>\n",
       "      <th>segment</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">False</th>\n",
       "      <th>1</th>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">True</th>\n",
       "      <th>1</th>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>42</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                segment\n",
       "is_bmm segment         \n",
       "False  1             63\n",
       "       2             68\n",
       "       3             60\n",
       "       4             72\n",
       "True   1             45\n",
       "       2             42\n",
       "       3             48\n",
       "       4             27"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Дальше есть какие-нибудь групп баи вроде таких и пояснения: \n",
    "df.groupby(['is_bmm', 'segment']).agg({'segment': 'count'})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>likes_memes</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>segment</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>22.753623</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>17.550459</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>28.121495</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>29.252525</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         likes_memes\n",
       "segment             \n",
       "1          22.753623\n",
       "2          17.550459\n",
       "3          28.121495\n",
       "4          29.252525"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.groupby('segment').agg({'likes_memes': 'mean'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Ваши выводы и их обоснование:__ \n",
    "\n",
    "- Если тут не будет написано ваших пояснений, за задание вы получаете ноль баллов!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__[n]  Удиви нас. Попробуй найти в данных какую-то классную особенность. Если у тебя это получится, мы поставим дополнительные баллы.__ Если вы найдёте полную фигню (сколько всего друзей у Маши или типа того), баллов не будет. Найденный факт реально должен выносить мозг и сносить крышу."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Попробуй меня на вкус!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Настрадался? Выскажи всё, что думаешь обо всём этом [в анонимке по третьему дз.](https://docs.google.com/forms/d/e/1FAIpQLSf5IFDJv8YsZDdkeLN4KXNU64zL9oXMtG5Rp36rsitOYOwYwQ/viewform) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}