{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Homework for the lecture \"Data preprocessing\", Масляков Глеб." ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from time import time" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Task: improve the code from the slides where possible." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Example $1$ (slide $11$)\n", "Convert monetary amounts from the format \"string(sum\\$)\" to integers, \"integer(sum)\"." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Generating random data.\n", "\n", "$10$ million records with integer values from $0$ to $99999$." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'price($)': [str(i) + '$' for i in np.random.choice(a = range(int(1e5)), size = int(1e7))]})" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(10000000, 1)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>price($)</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1428$</td></tr>\n", "<tr><th>1</th><td>32695$</td></tr>\n", "<tr><th>2</th><td>19688$</td></tr>\n", "<tr><th>3</th><td>78519$</td></tr>\n", "<tr><th>4</th><td>80773$</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " price($)\n", "0 1428$\n", "1 32695$\n", "2 19688$\n", "3 78519$\n", "4 80773$" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Running time of the example from the lecture." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.51 s, sys: 470 ms, total: 7.98 s\n", "Wall time: 8.14 s\n" ] } ], "source": [ "%%time\n", "df['.'] = df['price($)'].apply(lambda x: int(x[:-1]))" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>price($)</th><th>.</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1428$</td><td>1428</td></tr>\n", "<tr><th>1</th><td>32695$</td><td>32695</td></tr>\n", "<tr><th>2</th><td>19688$</td><td>19688</td></tr>\n", "<tr><th>3</th><td>78519$</td><td>78519</td></tr>\n", "<tr><th>4</th><td>80773$</td><td>80773</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " price($) .\n", "0 1428$ 1428\n", "1 32695$ 32695\n", "2 19688$ 19688\n", "3 78519$ 78519\n", "4 80773$ 80773" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Removing the resulting column." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del df['.']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Optimized version." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.3 s, sys: 228 ms, total: 5.52 s\n", "Wall time: 5.53 s\n" ] } ], "source": [ "%%time\n", "df['.'] = df['price($)'].apply(lambda x: x.replace('$', '')).astype(int)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "###########" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This version has the further advantage that it also works when the dollar sign is at the start of the string." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "###########" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### The next slide deals with recoding string categorical features as numbers." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The issue is the line that recodes the words 'yes' and 'no' as the numbers $1$ and $0$. It can be implemented more efficiently."
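, "\n", "\n", "A minimal sketch of the two encodings side by side, on a five-row sample invented for illustration (the Series `answers` is hypothetical and not part of the lecture data). It assumes only the pandas methods `Series.map` and `Series.factorize`:

```python
import pandas as pd

# Tiny illustrative sample; the real benchmark uses far more rows.
answers = pd.Series(['no', 'yes', 'yes', 'no', 'yes'])

# Explicit dictionary lookup: the intended codes are stated directly.
mapped = answers.map({'yes': 1, 'no': 0})

# factorize(sort=True) numbers the unique labels in sorted order,
# so 'no' gets code 0 and 'yes' gets code 1.
codes, uniques = answers.factorize(sort=True)

print(mapped.tolist())
print(codes.tolist())
print(list(uniques))
```

The two results coincide here only because 'no' sorts before 'yes'. For labels whose sorted order differs from the intended codes, the explicit dictionary is the safer choice."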
] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Generating data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "$30$ million records of the form ['yes/no', 'warm/cool/cold']" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f1 = lambda x: 'yes' if x == 1 else 'no'\n", "def f2(x):\n", "    if x == 0:\n", "        return 'cool'\n", "    elif x == 1:\n", "        return 'cold'\n", "    else:\n", "        return 'warm'" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame({'ans': [f1(i) for i in np.random.choice(2, 30000000)],\n", "                   'weather': [f2(i) for i in np.random.choice(3, 30000000)]})" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>ans</th><th>weather</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>no</td><td>cool</td></tr>\n", "<tr><th>1</th><td>yes</td><td>warm</td></tr>\n", "<tr><th>2</th><td>yes</td><td>cool</td></tr>\n", "<tr><th>3</th><td>no</td><td>warm</td></tr>\n", "<tr><th>4</th><td>yes</td><td>warm</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ans weather\n", "0 no cool\n", "1 yes warm\n", "2 yes cool\n", "3 no warm\n", "4 yes warm" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Example from the lecture" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.03 s, sys: 264 ms, total: 2.29 s\n", "Wall time: 2.29 s\n" ] } ], "source": [ "%%time\n", "dct = {'yes': 1, 'no': 0} \n", "df['ans_coded'] = df['ans'].map(dct)" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>ans</th><th>weather</th><th>ans_coded</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>no</td><td>cool</td><td>0</td></tr>\n", "<tr><th>1</th><td>yes</td><td>warm</td><td>1</td></tr>\n", "<tr><th>2</th><td>yes</td><td>cool</td><td>1</td></tr>\n", "<tr><th>3</th><td>no</td><td>warm</td><td>0</td></tr>\n", "<tr><th>4</th><td>yes</td><td>warm</td><td>1</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ans weather ans_coded\n", "0 no cool 0\n", "1 yes warm 1\n", "2 yes cool 1\n", "3 no warm 0\n", "4 yes warm 1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove the result." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del df['ans_coded']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Optimized version, using the built-in pandas method factorize. With sort=True the codes follow the sorted order of the labels, so 'no' becomes $0$ and 'yes' becomes $1$, matching the dictionary above." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.3 s, sys: 350 ms, total: 1.65 s\n", "Wall time: 1.65 s\n" ] } ], "source": [ "%%time\n", "df['ans_coded'] = df['ans'].factorize(sort=True)[0]" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>ans</th><th>weather</th><th>ans_coded</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>no</td><td>cool</td><td>0</td></tr>\n", "<tr><th>1</th><td>yes</td><td>warm</td><td>1</td></tr>\n", "<tr><th>2</th><td>yes</td><td>cool</td><td>1</td></tr>\n", "<tr><th>3</th><td>no</td><td>warm</td><td>0</td></tr>\n", "<tr><th>4</th><td>yes</td><td>warm</td><td>1</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ans weather ans_coded\n", "0 no cool 0\n", "1 yes warm 1\n", "2 yes cool 1\n", "3 no warm 0\n", "4 yes warm 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Correcting values (slide $13$)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The next slide proposes extracting the upper and lower pressure from a record of the form string(upper/lower)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Generating data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "$2$ million records. Upper pressure from $0$ to $199$, lower pressure from $0$ to $149$." ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pressure = [str(np.random.choice(200)) + '/' + str(np.random.choice(150)) for _ in range(2000000)]\n", "df = pd.DataFrame(pressure, columns = ['давление'])" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>давление</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>112/5</td></tr>\n", "<tr><th>1</th><td>84/20</td></tr>\n", "<tr><th>2</th><td>128/63</td></tr>\n", "<tr><th>3</th><td>164/127</td></tr>\n", "<tr><th>4</th><td>117/53</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " давление\n", "0 112/5\n", "1 84/20\n", "2 128/63\n", "3 164/127\n", "4 117/53" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Example from the lecture" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.29 s, sys: 321 ms, total: 3.61 s\n", "Wall time: 3.65 s\n" ] } ], "source": [ "%%time\n", "tmp = df['давление'].str.split('/')\n", "df['в.давл.'] = tmp.apply(lambda x: x[0]) \n", "df['н.давл.'] = tmp.apply(lambda x: x[1])" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>давление</th><th>в.давл.</th><th>н.давл.</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>112/5</td><td>112</td><td>5</td></tr>\n", "<tr><th>1</th><td>84/20</td><td>84</td><td>20</td></tr>\n", "<tr><th>2</th><td>128/63</td><td>128</td><td>63</td></tr>\n", "<tr><th>3</th><td>164/127</td><td>164</td><td>127</td></tr>\n", "<tr><th>4</th><td>117/53</td><td>117</td><td>53</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " давление в.давл. н.давл.\n", "0 112/5 112 5\n", "1 84/20 84 20\n", "2 128/63 128 63\n", "3 164/127 164 127\n", "4 117/53 117 53" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove the results." ] },
{ "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del df['в.давл.']\n", "del df['н.давл.']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Optimized version. Instead of calling apply twice, convert the split result to a list and pass it straight to pd.DataFrame." ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.38 s, sys: 214 ms, total: 2.6 s\n", "Wall time: 2.65 s\n" ] } ], "source": [ "%%time\n", "df[['в.давл.', 'н.давл.']] = pd.DataFrame(df['давление'].str.split('/', n=1).tolist(), columns = ['Давление_в','Давление_н'])" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del df" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Filling missing values with the mean (slide 19)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The NaNs need to be filled with mean values. The lecture proposes three options: the mean over the whole sample; the mean over the training sample; or filling gaps in the training sample with the training mean and gaps in the test sample with the test mean." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The code for the third option is the one that can be improved." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Generating data\n", "\n", "$6,000,000$ records. Area values range from $0$ to $199$. Train and test each make up $50\\%$ of the data (as on the slide). The share of NaNs is $\\frac{1}{3}$ (also as on the slide)." ] },
{ "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.DataFrame(np.random.choice(200, size = (6000000, 4)), columns = ['площадь', 'площадь 1', 'площадь 2', 'площадь 3'])\n", "x = np.array(['train'] * 3000000 + ['test'] * 3000000)\n", "np.random.shuffle(x)\n", "df['data'] = x\n", "ind = np.arange(6000000)\n", "np.random.shuffle(ind)\n", "df.loc[ind[:2000000], 'площадь'] = np.nan" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>площадь</th><th>площадь 1</th><th>площадь 2</th><th>площадь 3</th><th>data</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>21.0</td><td>180</td><td>158</td><td>74</td><td>train</td></tr>\n", "<tr><th>1</th><td>49.0</td><td>158</td><td>83</td><td>94</td><td>test</td></tr>\n", "<tr><th>2</th><td>NaN</td><td>76</td><td>167</td><td>60</td><td>train</td></tr>\n", "<tr><th>3</th><td>125.0</td><td>47</td><td>81</td><td>56</td><td>train</td></tr>\n", "<tr><th>4</th><td>NaN</td><td>54</td><td>96</td><td>123</td><td>test</td></tr>\n", "<tr><th>5</th><td>35.0</td><td>20</td><td>69</td><td>168</td><td>test</td></tr>\n", "<tr><th>6</th><td>2.0</td><td>56</td><td>152</td><td>136</td><td>train</td></tr>\n", "<tr><th>7</th><td>NaN</td><td>197</td><td>19</td><td>171</td><td>train</td></tr>\n", "<tr><th>8</th><td>NaN</td><td>116</td><td>54</td><td>38</td><td>train</td></tr>\n", "<tr><th>9</th><td>47.0</td><td>77</td><td>101</td><td>67</td><td>test</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " площадь площадь 1 площадь 2 площадь 3 data\n", "0 21.0 180 158 74 train\n", "1 49.0 158 83 94 test\n", "2 NaN 76 167 60 train\n", "3 125.0 47 81 56 train\n", "4 NaN 54 96 123 test\n", "5 35.0 20 69 168 test\n", "6 2.0 56 152 136 train\n", "7 NaN 197 19 171 train\n", "8 NaN 116 54 38 train\n", "9 47.0 77 101 67 test" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Example from the lecture. The code is very verbose." ] },
{ "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.91 s, sys: 265 ms, total: 4.18 s\n", "Wall time: 4.18 s\n" ] } ], "source": [ "%%time\n", "df.loc[df['data'] == 'train', 'площадь'] = df[df['data'] == 'train']['площадь'].fillna(df[df['data'] == 'train']['площадь'].mean())\n", "df.loc[df['data'] == 'test', 'площадь'] = df[df['data'] == 'test']['площадь'].fillna(df[df['data'] == 'test']['площадь'].mean())" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>площадь</th><th>площадь 1</th><th>площадь 2</th><th>площадь 3</th><th>data</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>21.000000</td><td>180</td><td>158</td><td>74</td><td>train</td></tr>\n", "<tr><th>1</th><td>49.000000</td><td>158</td><td>83</td><td>94</td><td>test</td></tr>\n", "<tr><th>2</th><td>99.454553</td><td>76</td><td>167</td><td>60</td><td>train</td></tr>\n", "<tr><th>3</th><td>125.000000</td><td>47</td><td>81</td><td>56</td><td>train</td></tr>\n", "<tr><th>4</th><td>99.512947</td><td>54</td><td>96</td><td>123</td><td>test</td></tr>\n", "<tr><th>5</th><td>35.000000</td><td>20</td><td>69</td><td>168</td><td>test</td></tr>\n", "<tr><th>6</th><td>2.000000</td><td>56</td><td>152</td><td>136</td><td>train</td></tr>\n", "<tr><th>7</th><td>99.454553</td><td>197</td><td>19</td><td>171</td><td>train</td></tr>\n", "<tr><th>8</th><td>99.454553</td><td>116</td><td>54</td><td>38</td><td>train</td></tr>\n", "<tr><th>9</th><td>47.000000</td><td>77</td><td>101</td><td>67</td><td>test</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " площадь площадь 1 площадь 2 площадь 3 data\n", "0 21.000000 180 158 74 train\n", "1 49.000000 158 83 94 test\n", "2 99.454553 76 167 60 train\n", "3 125.000000 47 81 56 train\n", "4 99.512947 54 96 123 test\n", "5 35.000000 20 69 168 test\n", "6 2.000000 56 152 136 train\n", "7 99.454553 197 19 171 train\n", "8 99.454553 116 54 38 train\n", "9 47.000000 77 101 67 test" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Put the NaNs back." ] },
{ "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.loc[ind[:2000000], 'площадь'] = np.nan" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The natural approach is to group by the \"data\" column and use the transform function." ] },
{ "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.71 s, sys: 281 ms, total: 1.99 s\n", "Wall time: 2.02 s\n" ] } ], "source": [ "%%time\n", "df['площадь'] = df.groupby(\"data\")['площадь'].transform(lambda x: x.fillna(x.mean()))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "One line, and about twice as fast ($2.02$ s versus $4.18$ s wall time)." ] },
{ "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>площадь</th><th>площадь 1</th><th>площадь 2</th><th>площадь 3</th><th>data</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>21.000000</td><td>180</td><td>158</td><td>74</td><td>train</td></tr>\n", "<tr><th>1</th><td>49.000000</td><td>158</td><td>83</td><td>94</td><td>test</td></tr>\n", "<tr><th>2</th><td>99.454553</td><td>76</td><td>167</td><td>60</td><td>train</td></tr>\n", "<tr><th>3</th><td>125.000000</td><td>47</td><td>81</td><td>56</td><td>train</td></tr>\n", "<tr><th>4</th><td>99.512947</td><td>54</td><td>96</td><td>123</td><td>test</td></tr>\n", "<tr><th>5</th><td>35.000000</td><td>20</td><td>69</td><td>168</td><td>test</td></tr>\n", "<tr><th>6</th><td>2.000000</td><td>56</td><td>152</td><td>136</td><td>train</td></tr>\n", "<tr><th>7</th><td>99.454553</td><td>197</td><td>19</td><td>171</td><td>train</td></tr>\n", "<tr><th>8</th><td>99.454553</td><td>116</td><td>54</td><td>38</td><td>train</td></tr>\n", "<tr><th>9</th><td>47.000000</td><td>77</td><td>101</td><td>67</td><td>test</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " площадь площадь 1 площадь 2 площадь 3 data\n", "0 21.000000 180 158 74 train\n", "1 49.000000 158 83 94 test\n", "2 99.454553 76 167 60 train\n", "3 125.000000 47 81 56 train\n", "4 99.512947 54 96 123 test\n", "5 35.000000 20 69 168 test\n", "6 2.000000 56 152 136 train\n", "7 99.454553 197 19 171 train\n", "8 99.454553 116 54 38 train\n", "9 47.000000 77 101 67 test" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can try to speed this up even further." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Put the NaNs back." ] },
{ "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df.loc[ind[:2000000], 'площадь'] = np.nan" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Further optimization" ] },
{ "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 510 ms, sys: 58.7 ms, total: 569 ms\n", "Wall time: 569 ms\n" ] } ], "source": [ "%%time\n", "df.loc[df['площадь'].isnull(), 'площадь'] = df.groupby('data')['площадь'].transform('mean')" ] },
{ "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>площадь</th><th>площадь 1</th><th>площадь 2</th><th>площадь 3</th><th>data</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>21.000000</td><td>180</td><td>158</td><td>74</td><td>train</td></tr>\n", "<tr><th>1</th><td>49.000000</td><td>158</td><td>83</td><td>94</td><td>test</td></tr>\n", "<tr><th>2</th><td>99.454553</td><td>76</td><td>167</td><td>60</td><td>train</td></tr>\n", "<tr><th>3</th><td>125.000000</td><td>47</td><td>81</td><td>56</td><td>train</td></tr>\n", "<tr><th>4</th><td>99.512947</td><td>54</td><td>96</td><td>123</td><td>test</td></tr>\n", "<tr><th>5</th><td>35.000000</td><td>20</td><td>69</td><td>168</td><td>test</td></tr>\n", "<tr><th>6</th><td>2.000000</td><td>56</td><td>152</td><td>136</td><td>train</td></tr>\n", "<tr><th>7</th><td>99.454553</td><td>197</td><td>19</td><td>171</td><td>train</td></tr>\n", "<tr><th>8</th><td>99.454553</td><td>116</td><td>54</td><td>38</td><td>train</td></tr>\n", "<tr><th>9</th><td>47.000000</td><td>77</td><td>101</td><td>67</td><td>test</td></tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " площадь площадь 1 площадь 2 площадь 3 data\n", "0 21.000000 180 158 74 train\n", "1 49.000000 158 83 94 test\n", "2 99.454553 76 167 60 train\n", "3 125.000000 47 81 56 train\n", "4 99.512947 54 96 123 test\n", "5 35.000000 20 69 168 test\n", "6 2.000000 56 152 136 train\n", "7 99.454553 197 19 171 train\n", "8 99.454553 116 54 38 train\n", "9 47.000000 77 101 67 test" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Several times faster again ($569$ ms versus $2.02$ s wall time)." ] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }