{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<img align=\"center\" src=\"https://raw.githubusercontent.com/FUlyankin/Parsers/master/images%20/cats.jpg\" height=\"1200\" width=\"1200\"> \n", "\n", "# Homework 3: Basics of Statistics\n", "\n", "\n", "We all have a dataset collected from VK. It holds information about every one of us. This information needs to be analyzed properly, and that is exactly what we continue to do.\n", "\n", "Budget your effort wisely and do the amount of tasks that will get you the grade you want :) \n", "\n", "__Important:__ using loops in any of the tasks earns zero points.\n", "\n", "----------------------\n", "\n", "Let's load the data and look at the first five rows of the table.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>city</th>\n", " <th>country</th>\n", " <th>first_name</th>\n", " <th>home_town</th>\n", " <th>in_hse_memes_group</th>\n", " <th>is_bmm</th>\n", " <th>is_closed</th>\n", " <th>last_name</th>\n", " <th>likes_memes</th>\n", " <th>uid</th>\n", " <th>...</th>\n", " <th>photo_month_mean</th>\n", " <th>photo_repost_cnt</th>\n", " <th>photo_repost_max</th>\n", " <th>photo_repost_mean</th>\n", " <th>photo_repost_median</th>\n", " <th>photo_text_len_cnt</th>\n", " <th>photo_ava_change_cnt</th>\n", " <th>photo_text_url_len_cnt</th>\n", " <th>friends_from_course_cnt</th>\n", " <th>friends_mail_from_course_pct</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Москва</td>\n", " <td>Россия</td>\n", " <td>Александра</td>\n", " 
<td>Москва</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>Абашкова</td>\n", " <td>60.0</td>\n", " <td>182152789</td>\n", " <td>...</td>\n", " <td>1.333333</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.000000</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>42.0</td>\n", " <td>0.428571</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Рязань</td>\n", " <td>Россия</td>\n", " <td>Анастасия</td>\n", " <td>Рязань</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>Чуфистова</td>\n", " <td>0.0</td>\n", " <td>148020433</td>\n", " <td>...</td>\n", " <td>2.375000</td>\n", " <td>2.0</td>\n", " <td>1.0</td>\n", " <td>0.105263</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>32.0</td>\n", " <td>0.281250</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Москва</td>\n", " <td>Россия</td>\n", " <td>Александр</td>\n", " <td>Омск</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>Головачев</td>\n", " <td>0.0</td>\n", " <td>138413935</td>\n", " <td>...</td>\n", " <td>1.400000</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.000000</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>32.0</td>\n", " <td>0.406250</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>Анна</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>Лобанова</td>\n", " <td>0.0</td>\n", " <td>366261055</td>\n", " <td>...</td>\n", " <td>4.166667</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.000000</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>45.0</td>\n", " <td>0.333333</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>NaN</td>\n", " <td>Россия</td>\n", " <td>Алексей</td>\n", " <td>NaN</td>\n", " <td>True</td>\n", " <td>True</td>\n", " 
<td>False</td>\n", " <td>Пузырный</td>\n", " <td>21.0</td>\n", " <td>111252392</td>\n", " <td>...</td>\n", " <td>3.181818</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.000000</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>41.0</td>\n", " <td>0.341463</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 98 columns</p>\n", "</div>" ], "text/plain": [ " city country first_name home_town in_hse_memes_group is_bmm \\\n", "0 Москва Россия Александра Москва True True \n", "1 Рязань Россия Анастасия Рязань True True \n", "2 Москва Россия Александр Омск False True \n", "3 NaN NaN Анна NaN False True \n", "4 NaN Россия Алексей NaN True True \n", "\n", " is_closed last_name likes_memes uid ... photo_month_mean \\\n", "0 False Абашкова 60.0 182152789 ... 1.333333 \n", "1 False Чуфистова 0.0 148020433 ... 2.375000 \n", "2 False Головачев 0.0 138413935 ... 1.400000 \n", "3 False Лобанова 0.0 366261055 ... 4.166667 \n", "4 False Пузырный 21.0 111252392 ... 
3.181818 \n", "\n", " photo_repost_cnt photo_repost_max photo_repost_mean photo_repost_median \\\n", "0 0.0 0.0 0.000000 0.0 \n", "1 2.0 1.0 0.105263 0.0 \n", "2 0.0 0.0 0.000000 0.0 \n", "3 0.0 0.0 0.000000 0.0 \n", "4 0.0 0.0 0.000000 0.0 \n", "\n", " photo_text_len_cnt photo_ava_change_cnt photo_text_url_len_cnt \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " friends_from_course_cnt friends_mail_from_course_pct \n", "0 42.0 0.428571 \n", "1 32.0 0.281250 \n", "2 32.0 0.406250 \n", "3 45.0 0.333333 \n", "4 41.0 0.341463 \n", "\n", "[5 rows x 98 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/vk_download/vk_main.csv', sep='\\t')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[1] How many marketing students have no avatar?__\n", "\n", "Use the variables `is_bmm` and `has_ava_dummy` (`True` can be treated as $1$ and `False` as $0$). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# shape[0] is the number of rows (people); shape[1] would be the number of columns\n", "df[(df.is_bmm == 1) & (df.has_ava_dummy == 0)].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[2] How many guys/girls with and without Instagram stated in their profile that they know English?__ \n", "\n", "This information is stored in the variable `english_dummy`. A value of $1$ means that English is listed in the profile. Likewise, $1$ in `instagram_dummy` means that the person left a link to their Instagram. 
__Interpret the resulting numbers.__ " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "male_dummy instagram_dummy\n", "0 0 0.126437\n", " 1 0.214286\n", "1 0 0.142857\n", " 1 0.185185\n", "Name: english_dummy, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Share of people who list English within each (gender, Instagram) group\n", "df_agg = df.groupby(['male_dummy', 'instagram_dummy'])['english_dummy']\n", "df_agg.sum()/df_agg.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numbers above are shares, not counts. In both genders, people who link their Instagram list English more often: about $21\\%$ versus $13\\%$ among women (`male_dummy = 0`) and about $19\\%$ versus $14\\%$ among men. This does not mean that people with Instagram know the language better. Most likely they are simply more open to the world and fill in their social network profile in more detail. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[1] Print the names of the people with the most emoji on their walls (top 5%)__\n", "\n", " All emoji left on a student's wall are stored in the column `wall_emoji_trace`. Sort everyone in this top group by the number of emoji. When working with the table, don't forget to fill in the missing values."
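 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
The steps in this task (fill the gaps, count the characters, cut at the 95% quantile, sort) can be sketched on toy data. This is only an illustration under stated assumptions: the `toy` frame, its names and its emoji strings are hypothetical, and the vectorized `.str.len()` is used as one loop-free way to count characters.

```python
import pandas as pd

# Toy data (hypothetical names and traces); None marks a wall without emoji
toy = pd.DataFrame({
    'first_name': ['A', 'B', 'C', 'D'],
    'wall_emoji_trace': ['😀😀😀', None, '😀', '😀😀'],
})

# Missing traces become empty strings, so their length is 0 instead of NaN
toy['wall_emoji_cnt'] = toy['wall_emoji_trace'].fillna('').str.len()

# Keep the rows at or above the 95% quantile, largest count first
q = toy['wall_emoji_cnt'].quantile(0.95)
top = toy.loc[toy['wall_emoji_cnt'] >= q, ['first_name', 'wall_emoji_cnt']]
top = top.sort_values('wall_emoji_cnt', ascending=False)
print(top)
```

Note that the length of a string counts code points, so an emoji built from several code points would be counted more than once; whether that matters depends on how the trace column was collected.
"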
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>first_name</th>\n", " <th>wall_emoji_cnt</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>Анастасия</td>\n", " <td>1975</td>\n", " </tr>\n", " <tr>\n", " <th>231</th>\n", " <td>Григорий</td>\n", " <td>1914</td>\n", " </tr>\n", " <tr>\n", " <th>296</th>\n", " <td>Юлия</td>\n", " <td>1904</td>\n", " </tr>\n", " <tr>\n", " <th>113</th>\n", " <td>Elizabeth</td>\n", " <td>1456</td>\n", " </tr>\n", " <tr>\n", " <th>176</th>\n", " <td>Анжелика</td>\n", " <td>1076</td>\n", " </tr>\n", " <tr>\n", " <th>196</th>\n", " <td>Алик</td>\n", " <td>1058</td>\n", " </tr>\n", " <tr>\n", " <th>161</th>\n", " <td>Оля</td>\n", " <td>751</td>\n", " </tr>\n", " <tr>\n", " <th>214</th>\n", " <td>Дуся</td>\n", " <td>737</td>\n", " </tr>\n", " <tr>\n", " <th>180</th>\n", " <td>Аня</td>\n", " <td>473</td>\n", " </tr>\n", " <tr>\n", " <th>253</th>\n", " <td>Стася</td>\n", " <td>446</td>\n", " </tr>\n", " <tr>\n", " <th>198</th>\n", " <td>Иван</td>\n", " <td>441</td>\n", " </tr>\n", " <tr>\n", " <th>395</th>\n", " <td>Тимур</td>\n", " <td>402</td>\n", " </tr>\n", " <tr>\n", " <th>414</th>\n", " <td>Дима</td>\n", " <td>357</td>\n", " </tr>\n", " <tr>\n", " <th>208</th>\n", " <td>Ден</td>\n", " <td>357</td>\n", " </tr>\n", " <tr>\n", " <th>199</th>\n", " <td>Дарья</td>\n", " <td>309</td>\n", " </tr>\n", " <tr>\n", " <th>209</th>\n", " <td>Диана</td>\n", " <td>306</td>\n", " </tr>\n", " <tr>\n", " <th>369</th>\n", " <td>Полина</td>\n", " <td>293</td>\n", " </tr>\n", " <tr>\n", " 
<th>261</th>\n", " <td>Ксения</td>\n", " <td>282</td>\n", " </tr>\n", " <tr>\n", " <th>154</th>\n", " <td>Виктория</td>\n", " <td>274</td>\n", " </tr>\n", " <tr>\n", " <th>174</th>\n", " <td>Андрей</td>\n", " <td>239</td>\n", " </tr>\n", " <tr>\n", " <th>375</th>\n", " <td>Сандрик</td>\n", " <td>223</td>\n", " </tr>\n", " <tr>\n", " <th>387</th>\n", " <td>Соня</td>\n", " <td>213</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " first_name wall_emoji_cnt\n", "1 Анастасия 1975\n", "231 Григорий 1914\n", "296 Юлия 1904\n", "113 Elizabeth 1456\n", "176 Анжелика 1076\n", "196 Алик 1058\n", "161 Оля 751\n", "214 Дуся 737\n", "180 Аня 473\n", "253 Стася 446\n", "198 Иван 441\n", "395 Тимур 402\n", "414 Дима 357\n", "208 Ден 357\n", "199 Дарья 309\n", "209 Диана 306\n", "369 Полина 293\n", "261 Ксения 282\n", "154 Виктория 274\n", "174 Андрей 239\n", "375 Сандрик 223\n", "387 Соня 213" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Missing traces become empty strings, so their length is 0\n", "df['wall_emoji_cnt'] = df['wall_emoji_trace'].fillna('').apply(len)\n", "\n", "q = df['wall_emoji_cnt'].quantile(0.95)\n", "\n", "df.loc[df['wall_emoji_cnt'] >= q,\n", "       ['first_name', 'wall_emoji_cnt']].sort_values('wall_emoji_cnt', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[3] Let's analyze the column with the average number of photos per month (`photo_month_mean`)__\n", "\n", "* Plot histograms of this metric for each gender in a single figure. \n", "* Is it true that a typical woman posts substantially more photos than a typical man? (think about which measure of typicality to use for the comparison and justify your choice)\n", "* For which gender is the metric less predictable? (think about how to measure this unpredictability correctly; the plain standard deviation clearly will not do)\n", "\n", "Don't forget to import the `matplotlib` package! 
__Write all your reasoning right alongside the code! No reasoning => no points!__ " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 576x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[df.male_dummy == 1].photo_month_mean.hist(figsize=(8,4), alpha=0.3, label=\"male\", bins=50)\n", "df[df.male_dummy == 0].photo_month_mean.hist(figsize=(8,4), alpha=0.3, label=\"female\", bins=50)\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distributions have long tails, and the data contain outliers. The mean is sensitive to outliers, so it is more appropriate to compare \"typical\" boys and girls by their medians. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "male_dummy\n", "0 1.75\n", "1 1.50\n", "Name: photo_month_mean, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('male_dummy')['photo_month_mean'].agg('median')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variance, and therefore the standard deviation, is sensitive to outliers. To cure the sample of their harmful influence, we need to trim it. How do we cut off the outliers? For example, we can treat everything above the $95\\%$ quantile as an outlier. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10.837499999999999" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "q = df['photo_month_mean'].quantile(0.95)\n", "q" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeMAAAD8CAYAAABEgMzCAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEsFJREFUeJzt3W+MXfV95/H3JxgTlyk11N0rB6w1UpErNqs4yyhKilTN8GcV2qrwoEKpditrhTT7oJumu5W6dKU+WGl3RaVV2zxYrWSFNJY26wmiibCiJi02TKtKXTZxMtsmdhGUhmIwdih446HeBJLvPphDOwWbe2d87vw8975fknXPOfO7937nC/jD73fOPTdVhSRJauc9rQuQJGnaGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNbdvMN9u1a1ft3bt3M9/yivD6669z7bXXti5jYtjP/tjLftnPfk1CP48fP/5KVf3YsHGbGsZ79+7lq1/96ma+5RVhaWmJubm51mVMDPvZH3vZL/vZr0noZ5LnRxnnMrUkSY0ZxpIkNTY0jJPsS7K85s93kvxKkhuSPJ7kme7x+s0oWJKkSTM0jKvq6araX1X7gduAvwW+ADwIHKuqW4Bj3b4kSVqn9S5T3wn8ZVU9D9wLHOqOHwLu67MwSZKmxXrD+GPA4W57UFWnu+2XgUFvVUmSNEVSVaMNTLYDLwH/pKrOJDlXVTvX/Py1qnrHeeMkC8ACwGAwuG1xcbGfyreQlZUVZmZmWpcxMexnf+xlv+xnvyahn/Pz88eranbYuPV8zvge4GtVdabbP5Nkd1WdTrIbOHuxJ1XVQeAgwOzsbG31z4xtxCR8Vu5KYj/7Yy/7ZT/7NU39XM8y9S/w90vUAEeAA932AeCxvoqSJGmajDQzTnItcDfwr9ccfgh4JMkDwPPA/f2X9+6Wjx4ePmiN/Xu6VfV994yhGkmSNmakMK6q14Effduxv2H16mpJknQZvAOXJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNjRTGSXYmeTTJXyQ5meQjSW5I8niSZ7rH68ddrCRJk2jUmfEngS9X1U8AHwBOAg8Cx6rqFuBYty9JktZpaBgn+RHgp4CHAarqe1V1DrgXONQNOwTcN64iJUmaZKPMjG8Gvg38bpKvJ/lUkmuBQVWd7sa8DAzGVaQkSZMsVfXuA5JZ4H8Bt1fVU0k+CXwH+HhV7Vwz7rWqesd54yQLwALAYDC4bXFxsbfiL5x/dV3jd2zftrpxzXW91TCKlZUVZmZmNvU9J5n97I+97Jf97Nck9HN+fv54Vc0OG7dthNc6BZyqqqe6/UdZPT98JsnuqjqdZDdw9mJPrqqDwEGA2dnZmpubG6X+kSwfPbyu8fv37Fjd2NdfDaNYWlqiz9972tnP/tjLftnPfk1TP4cuU1fVy8ALSfZ1h+4ETgBHgAPdsQPAY2OpUJKkCTfKzB
jg48Bnk2wHngP+FatB/kiSB4DngfvHU6IkSZNtpDCuqmXgYmved/ZbjiRJ08c7cEmS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1JhhLElSY4axJEmNGcaSJDVmGEuS1Ni2UQYl+RZwHvg+8GZVzSa5AfgcsBf4FnB/Vb02njIlSZpc65kZz1fV/qqa7fYfBI5V1S3AsW5fkiSt0+UsU98LHOq2DwH3XX45kiRNn1HDuIA/THI8yUJ3bFBVp7vtl4FB79VJkjQFUlXDByU3VtWLSf4R8DjwceBIVe1cM+a1qrr+Is9dABYABoPBbYuLi70Vf+H8q+sav2N7d4r8mut6q2EUKysrzMzMbOp7TjL72R972S/72a9J6Of8/PzxNad3L2mkC7iq6sXu8WySLwAfAs4k2V1Vp5PsBs5e4rkHgYMAs7OzNTc3N+KvMNzy0cPrGr9/z47VjX391TCKpaUl+vy9p5397I+97Jf97Nc09XPoMnWSa5P88FvbwD8HvgEcAQ50ww4Aj42rSEmSJtkoM+MB8IUkb43/n1X15SRfAR5J8gDwPHD/+MqUJGlyDQ3jqnoO+MBFjv8NcOc4ipIkaZp4By5JkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhozjCVJaswwliSpMcNYkqTGDGNJkhobOYyTXJXk60m+2O3fnOSpJM8m+VyS7eMrU5KkybWemfEngJNr9n8T+O2q+nHgNeCBPguTJGlajBTGSW4Cfgb4VLcf4A7g0W7IIeC+cRQoSdKkG3Vm/DvArwE/6PZ/FDhXVW92+6eAG3uuTZKkqbBt2IAkPwucrarjSebW+wZJFoAFgMFgwNLS0npf4pIuXLhmXePPfeuN1Y3T/dUwipWVlV5/72lnP/tjL/tlP/s1Tf0cGsbA7cDPJflp4L3AdcAngZ1JtnWz45uAFy/25Ko6CBwEmJ2drbm5uT7qBmD56OF1jd+/Z8fqxr7+ahjF0tISff7e085+9sde9st+9mua+jl0mbqqfr2qbqqqvcDHgCeq6l8ATwI/3w07ADw2tiolSZpgl/M5438P/Lskz7J6DvnhfkqSJGm6jLJM/XeqaglY6rafAz7Uf0mSJE0X78AlSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjQ8M4yXuT/O8k/yfJN5P8x+74zUmeSvJsks8l2T7+ciVJmjyjzIy/C9xRVR8A9gMfTfJh4DeB366qHwdeAx4YX5mSJE2uoWFcq1a63au7PwXcATzaHT8E3DeWCiVJmnAjnTNOclWSZeAs8Djwl8C5qnqzG3IKuHE8JUqSNNlSVaMPTnYCXwB+A/hMt0RNkj3Al6rq/Rd5zgKwADAYDG5bXFzso24ALpx/dV3jd2zftrpxzXW91TCKlZUVZmZmNvU9J5n97I+97Jf97Nck9HN+fv54Vc0OG7dtPS9aVeeSPAl8BNiZZFs3O74JePESzzkIHASYnZ2tubm59bzlu1o+enhd4/fv2bG6sa+/GkaxtLREn7/3tLOf/bGX/bKf/Zqmfo5yNfWPdTNikuwA7gZOAk8CP98NOwA8Nq4iJUmaZKPMjHcDh5JcxWp4P1JVX0xyAlhM8p+ArwMPj7FOSZIm1tAwrqo/Az54kePPAR8aR1GSJE0T78AlSVJjhrEkSY0ZxpIkNWYYS5LUmGEsSVJjhrEkSY0ZxpIkNbau22FudcsvnAPgle+fGfk5d906GFc5ki
QBzowlSWrOMJYkqTHDWJKkxgxjSZIaM4wlSWrMMJYkqTHDWJKkxgxjSZIaM4wlSWrMMJYkqTHDWJKkxqbq3tRv2fXSE6MPvmrn6uO+e8ZTjCRp6jkzliSpsaFhnGRPkieTnEjyzSSf6I7fkOTxJM90j9ePv1xJkibPKDPjN4FfrapbgQ8Dv5TkVuBB4FhV3QIc6/YlSdI6DQ3jqjpdVV/rts8DJ4EbgXuBQ92wQ8B94ypSkqRJtq5zxkn2Ah8EngIGVXW6+9HLwKDXyiRJmhKpqtEGJjPAHwH/uao+n+RcVe1c8/PXquod542TLAALAIPB4LbFxcV+KgcunH+1t9e6lB3buwvOr7luw6+xsrLCzMxMTxXJfvbHXvbLfvZrEvo5Pz9/vKpmh40b6aNNSa4Gfg/4bFV9vjt8JsnuqjqdZDdw9mLPraqDwEGA2dnZmpubG+UtR7J89HBvr3Up+/fsWN3YN7fh11haWqLP33va2c/+2Mt+2c9+TVM/R7maOsDDwMmq+q01PzoCHOi2DwCP9V+eJEmTb5SZ8e3ALwJ/nmS5O/YfgIeAR5I8ADwP3D+eEiVJmmxDw7iq/gTIJX58Z7/lSJI0fbwDlyRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1NtK3Nk2z5RfOAfDK98+MNP6uW/1aZ0nS+jgzliSpMcNYkqTGXKYet6e/BN99Y/VxVPvuGV89kqQrjjNjSZIaM4wlSWrMZeqeHT3xD6+63vXSOS587xqWX3j9ouP379m5GWVJkq5gzowlSWrMMJYkqTGXqRt766Yia73bDUa8qYgkTR5nxpIkNTY0jJN8OsnZJN9Yc+yGJI8neaZ7vH68ZUqSNLlGmRl/Bvjo2449CByrqluAY92+JEnagKFhXFV/DLz6tsP3Aoe67UPAfT3XJUnS1NjoOeNBVZ3utl8GvKpIkqQNSlUNH5TsBb5YVe/v9s9V1c41P3+tqi563jjJArAAMBgMbltcXOyh7FUXzr99wj4+b1593Yaet+2N7/DGD97D1e/5Qc8VvdNGa3y7H37vlX2R/crKCjMzM63LmAj2sl/2s1+T0M/5+fnjVTU7bNxG/9Y9k2R3VZ1Oshs4e6mBVXUQOAgwOztbc3NzG3zLd1o+eri31xrmlff90w09b9dLT3D6wjXs3vHdnit6p43W+HZzV/jHp5aWlujz36NpZi/7ZT/7NU393Ogy9RHgQLd9AHisn3IkSZo+o3y06TDwp8C+JKeSPAA8BNyd5Bngrm5fkiRtwNBl6qr6hUv86M6ea5EkaSp5By5Jkhq7si+bvYLseumJ1iUMtZEaX3nfHWOoRJK0Hs6MJUlqzDCWJKkxw1iSpMYMY0mSGjOMJUlqzKupJeDoiTPrGn/XFX7LUElbizNjSZIaM4wlSWrMMJYkqTHDWJKkxgxjSZIa82pqbczTX/q7zeUXzo30lL7ug33h/735rlc/T/SVzmv6vtawfwYX6/1E90naYpwZS5LUmGEsSVJjLlNPuYt+7eJVOzfvvUYwsV/zeIkl581y9MSZoUv+l8NlcGl0zowlSWrMMJYkqTGXqfUOo14dfaUa17Lr5bzHNC7ZjvufwyT0tPd/jzZy6mPfPet/jnrnzFiSpMYuK4yTfDTJ00meTfJgX0VJkjRNNrxMneQq4L8BdwOngK8kOVJVJ/oqTtpsG73ie5jll8bysmO3kX5s1tXvR0+cWXd9+/d0nxQY09Ls+Ytcnf5uNe66xPFN/QTBRq/q73p4xZ6yuczfa7Ndzsz4Q8CzVfVcVX0PWATu7acsSZKmx+WE8Y3AC2v2T3XHJEnSOoz9auokC8BCt7uS5Olxv+cVaBfwSusiJoj97I+97Jf97Nck9PMfjzLocsL4RWDPmv2bumP/QFUdBA5exvtseUm+WlWzreuYFPazP/ayX/azX9PUz8tZpv4KcEuSm5NsBz4GHOmnLEmSpseGZ8
ZV9WaSfwP8AXAV8Omq+mZvlUmSNCUu65xxVf0+8Ps91TLJpnqZfgzsZ3/sZb/sZ7+mpp+pqtY1SJI01bwdpiRJjRnGY5RkT5Ink5xI8s0kn2hd01aX5KokX0/yxda1bHVJdiZ5NMlfJDmZ5COta9qqkvzb7r/xbyQ5nOS9rWvaSpJ8OsnZJN9Yc+yGJI8neaZ7vL5ljeNmGI/Xm8CvVtWtwIeBX0pya+OatrpPACdbFzEhPgl8uap+AvgA9nVDktwI/DIwW1XvZ/WC1o+1rWrL+Qzw0bcdexA4VlW3AMe6/YllGI9RVZ2uqq912+dZ/cvOu5RtUJKbgJ8BPtW6lq0uyY8APwU8DFBV36uqrf3dmW1tA3Yk2Qb8ELBF70beRlX9MfDq2w7fCxzqtg8B921qUZvMMN4kSfYCHwSealvJlvY7wK8BP2hdyAS4Gfg28Lvdsv+nklzbuqitqKpeBP4r8NfAaeD/VtUftq1qIgyq6nS3/TKw9b/A+l0YxpsgyQzwe8CvVNV3WtezFSX5WeBsVR1vXcuE2Ab8M+C/V9UHgdeZ8GXAcenOZd7L6v/gvA+4Nsm/bFvVZKnVj/1M9Ed/DOMxS3I1q0H82ar6fOt6trDbgZ9L8i1WvyHsjiT/o21JW9op4FRVvbVS8yir4az1uwv4q6r6dlW9AXwe+MnGNU2CM0l2A3SPZxvXM1aG8RglCavn5E5W1W+1rmcrq6pfr6qbqmovqxfHPFFVzj42qKpeBl5Isq87dCfgd5FvzF8DH07yQ91/83fixXB9OAIc6LYPAI81rGXsDOPxuh34RVZnccvdn59uXZTU+Tjw2SR/BuwH/kvjerakbnXhUeBrwJ+z+vfq1Nw5qg9JDgN/CuxLcirJA8BDwN1JnmF19eGhljWOm3fgkiSpMWfGkiQ1ZhhLktSYYSxJUmOGsSRJjRnGkiQ1ZhhLktSYYSxJUmOGsSRJjf1/Q/wm2yqlkEQAAAAASUVORK5CYII=\n", "text/plain": [ "<Figure size 576x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df[df.photo_month_mean < q].groupby('male_dummy')['photo_month_mean'].hist(figsize=(8,4), alpha=0.3, bins=30);" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>mean</th>\n", " <th>std</th>\n", " <th>median</th>\n", " </tr>\n", " <tr>\n", " <th>male_dummy</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " 
<td>2.420398</td>\n", " <td>1.815699</td>\n", " <td>1.666667</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2.130265</td>\n", " <td>1.811635</td>\n", " <td>1.500000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " mean std median\n", "male_dummy \n", "0 2.420398 1.815699 1.666667\n", "1 2.130265 1.811635 1.500000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.photo_month_mean < q].groupby('male_dummy')['photo_month_mean'].agg(['mean', 'std', 'median'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The spread is nearly the same in both groups: after trimming, the standard deviations are about $1.82$ and $1.81$. If you realized right away that the data are full of outliers and trimmed them, you can now compare typical representatives by their means. That answer also counts as correct. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[3] A bit of segmentation__\n", "\n", "Let's try to split everyone in our table into equal segments by how many friends from the course they have (`friends_from_course_cnt`). To do this: \n", "\n", "* Fill the missing values in the variable `friends_from_course_cnt` with zeros.\n", "* Find the quantiles of level $0.25, 0.5, 0.75$ for the variable `friends_from_course_cnt`. \n", "* Create a column `segment` and fill it according to the following rule: \n", " - Put $1$ in it if the person has fewer friends than the $0.25$ quantile \n", " - Put $2$ if the person's number of friends is between the $0.25$ and $0.5$ quantiles\n", " - Put $3$ if the person's number of friends is between the $0.5$ and $0.75$ quantiles\n", " - Put $4$ if the person's number of friends is above the $0.75$ quantile\n", " \n", "The resulting groups are different segments of the course. Some of them are more social, others less so. Using the new column, answer the questions: \n", "\n", "- Which track (BMM or UB) turned out to be more social? \n", "- Is it true that more social people like memes in the HSE group more often? 
\n", "- State clearly all the conclusions you have drawn. \n", "\n", "__Hint:__ to determine a person's group, you can write a function that compares their number of friends with all the quantiles and returns the group. You can then apply this function to the column with `apply`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Assign the result back instead of using inplace=True on an attribute\n", "df['friends_from_course_cnt'] = df['friends_from_course_cnt'].fillna(0)\n", "q = df['friends_from_course_cnt'].quantile([0.25, 0.5, 0.75]).values" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def find_segm(friends):\n", "    # Upper bounds are inclusive: a value equal to a quantile goes to the lower segment\n", "    if friends <= q[0]:\n", "        return 1\n", "    elif friends <= q[1]:\n", "        return 2\n", "    elif friends <= q[2]:\n", "        return 3\n", "    else:\n", "        return 4\n", "\n", "df['segment'] = df['friends_from_course_cnt'].apply(find_segm)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th></th>\n", " <th>segment</th>\n", " </tr>\n", " <tr>\n", " <th>is_bmm</th>\n", " <th>segment</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th rowspan=\"4\" valign=\"top\">False</th>\n", " <th>1</th>\n", " <td>63</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>68</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>60</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>72</td>\n", " </tr>\n", " <tr>\n", " <th rowspan=\"4\" valign=\"top\">True</th>\n", " <th>1</th>\n", " <td>45</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>42</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " 
<td>48</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>27</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " segment\n", "is_bmm segment \n", "False 1 63\n", " 2 68\n", " 3 60\n", " 4 72\n", "True 1 45\n", " 2 42\n", " 3 48\n", " 4 27" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Next come some group-bys like these, together with explanations:\n", "df.groupby(['is_bmm', 'segment']).agg({'segment': 'count'})" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>likes_memes</th>\n", " </tr>\n", " <tr>\n", " <th>segment</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>22.753623</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>17.550459</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>28.121495</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>29.252525</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " likes_memes\n", "segment \n", "1 22.753623\n", "2 17.550459\n", "3 28.121495\n", "4 29.252525" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby('segment').agg({'likes_memes': 'mean'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Your conclusions and their justification:__ \n", "\n", "- If you do not write your explanations here, you get zero points for the task!"
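 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
One possible way to compare the tracks is to look at segment shares within each track rather than raw counts, because the tracks differ in size. The sketch below runs on toy data: the `toy` frame and its values are hypothetical, and `pd.qcut` is swapped in for a hand-written segment function as an assumption of this example (its bins are also inclusive on the right).

```python
import pandas as pd

# Toy data (hypothetical): track flag and number of friends from the course
toy = pd.DataFrame({
    'is_bmm': [True, True, True, True, False, False, False, False],
    'friends_from_course_cnt': [1, 5, 9, 20, 3, 12, 30, 40],
})

# Quartile-based segments; labels=False returns codes 0..3, so add 1 to get 1..4
toy['segment'] = pd.qcut(toy['friends_from_course_cnt'], q=4, labels=False) + 1

# Share of each segment within a track; every row sums to 1
shares = pd.crosstab(toy['is_bmm'], toy['segment'], normalize='index')
print(shares)
```

A track with a larger share in segment 4 has relatively more highly connected students. On real data with many tied values `pd.qcut` may raise an error about duplicate bin edges; passing `duplicates='drop'` is one way around that, at the cost of fewer segments.
"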
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__[n] Surprise us. Try to find some cool pattern in the data. If you succeed, we will award extra points.__ If you find complete nonsense (how many friends Masha has in total, or something like that), there will be no points. The fact you find should be genuinely mind-blowing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Give me a try!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suffered enough? Tell us everything you think about all this [in the anonymous survey for homework 3.](https://docs.google.com/forms/d/e/1FAIpQLSf5IFDJv8YsZDdkeLN4KXNU64zL9oXMtG5Rp36rsitOYOwYwQ/viewform) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }