{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Машинное обучение, ФКН ВШЭ // Семинар 4" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:18.309671Z", "start_time": "2019-09-28T19:10:14.654448Z" }, "cell_style": "center" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "from tqdm import tqdm\n", "from sklearn.datasets import fetch_20newsgroups\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Работа с текстовыми данными\n", "\n", "Как правило, модели машинного обучения действуют в предположении, что матрица \"объект-признак\" является вещественнозначной, поэтому при работе с текстами сперва для каждого из них необходимо составить его признаковое описание. Для этого широко используются техники векторизации, tf-idf и пр. Рассмотрим их на примере [датасета](https://www.dropbox.com/s/f9xsff8xluriy95/banki_responses.json.bz2?dl=0) отзывов о банках.\n", "\n", "Сперва загрузим данные:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:18.959288Z", "start_time": "2019-09-28T19:10:18.346983Z" } }, "outputs": [], "source": [ "data = fetch_20newsgroups(subset='all', categories=['comp.graphics', 'sci.med'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Данные содержат тексты новостей, которые надо классифицировать на разделы." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:18.987312Z", "start_time": "2019-09-28T19:10:18.964475Z" } }, "outputs": [ { "data": { "text/plain": [ "['comp.graphics', 'sci.med']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['target_names']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.012942Z", "start_time": "2019-09-28T19:10:18.997929Z" } }, "outputs": [], "source": [ "texts = data['data']\n", "target = data['target']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Например:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.060074Z", "start_time": "2019-09-28T19:10:19.019838Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'From: dyer@spdcc.com (Steve Dyer)\\nSubject: Re: Analgesics with Diuretics\\nOrganization: S.P. Dyer Computer Consulting, Cambridge MA\\n\\nIn article Lawrence Curcio writes:\\n>I sometimes see OTC preparations for muscle aches/back aches that\\n>combine aspirin with a diuretic.\\n\\nYou certainly do not see OTC preparations advertised as such.\\nThe only such ridiculous concoctions are nostrums for premenstrual\\nsyndrome, ostensibly to treat headache and \"bloating\" simultaneously.\\nThey\\'re worthless.\\n\\n>The idea seems to be to reduce\\n>inflammation by getting rid of fluid. Does this actually work? \\n\\nThat\\'s not the idea, and no, they don\\'t work.\\n\\n-- \\nSteve Dyer\\ndyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer\\n'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "texts[0]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.164963Z", "start_time": "2019-09-28T19:10:19.131795Z" } }, "outputs": [ { "data": { "text/plain": [ "'sci.med'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['target_names'][target[0]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bag-of-words\n", "\n", "Самый очевидный способ формирования признакового описания текстов — векторизация. Пусть у нас имеется коллекция текстов $D = \\{d_i\\}_{i=1}^l$ и словарь всех слов, встречающихся в выборке $V = \\{v_j\\}_{j=1}^d.$ В этом случае некоторый текст $d_i$ описывается вектором $(x_{ij})_{j=1}^d,$ где\n", "$$x_{ij} = \\sum_{v \\in d_i} [v = v_j].$$\n", "\n", "Таким образом, текст $d_i$ описывается вектором количества вхождений каждого слова из словаря в данный текст." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.833837Z", "start_time": "2019-09-28T19:10:19.175011Z" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vectorizer = CountVectorizer(encoding='utf8', min_df=1)\n", "_ = vectorizer.fit(texts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Результатом является разреженная матрица." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.851338Z", "start_time": "2019-09-28T19:10:19.837509Z" } }, "outputs": [ { "data": { "text/plain": [ "<1x32548 sparse matrix of type ''\n", "\twith 86 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer.transform(texts[:1])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:19.872060Z", "start_time": "2019-09-28T19:10:19.854688Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 86]\n", "[ 3905 3983 4143 4345 4665 4701 4712 5074 5176 5198 5242 5619\n", " 5870 6348 6984 7232 7630 8267 8451 8460 8682 8733 8916 9557\n", " 10811 10812 10901 10933 10971 11312 11488 13133 13226 13463 13866 14726\n", " 14806 15682 15805 15952 16147 18002 18031 18373 18740 18781 18790 18936\n", " 20420 21036 21164 21166 21494 21518 21622 21769 21839 21856 23589 23602\n", " 24556 24592 24803 25502 25513 26464 26474 27021 27398 27518 27940 28199\n", " 28286 28687 29187 29189 29264 29300 29500 29837 30702 31915 32005 32052\n", " 32095 32392]\n", "[2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 6 2 1 2 1 1 1 1\n", " 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1 3 2 1 2 1\n", " 2 3 2 1 3 1 1 2 2 1 1 1]\n" ] } ], "source": [ "print(vectorizer.transform(texts[:1]).indptr)\n", "print(vectorizer.transform(texts[:1]).indices)\n", "print(vectorizer.transform(texts[:1]).data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF\n", "\n", "Ещё один способ работы с текстовыми данными — [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) (**T**erm **F**requency–**I**nverse **D**ocument **F**requency). Рассмотрим коллекцию текстов $D$. Для каждого уникального слова $t$ из документа $d \\in D$ вычислим следующие величины:\n", "\n", "1. Term Frequency – количество вхождений слова в отношении к общему числу слов в тексте:\n", "$$\\text{tf}(t, d) = \\frac{n_{td}}{\\sum_{t \\in d} n_{td}},$$\n", "где $n_{td}$ — количество вхождений слова $t$ в текст $d$.\n", "1. Inverse Document Frequency\n", "$$\\text{idf}(t, D) = \\log \\frac{\\left| D \\right|}{\\left| \\{d\\in D: t \\in d\\} \\right|},$$\n", "где $\\left| \\{d\\in D: t \\in d\\} \\right|$ – количество текстов в коллекции, содержащих слово $t$.\n", "\n", "Тогда для каждой пары (слово, текст) $(t, d)$ вычислим величину:\n", "\n", "$$\\text{tf-idf}(t,d, D) = \\text{tf}(t, d)\\cdot \\text{idf}(t, D).$$\n", "\n", "Отметим, что значение $\\text{tf}(t, d)$ корректируется для часто встречающихся общеупотребимых слов при помощи значения $\\text{idf}(t, D)$." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:20.548074Z", "start_time": "2019-09-28T19:10:19.877266Z" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(encoding='utf8', min_df=1)\n", "_ = vectorizer.fit(texts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "На выходе получаем разреженную матрицу." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:20.569113Z", "start_time": "2019-09-28T19:10:20.550328Z" } }, "outputs": [ { "data": { "text/plain": [ "<1x32548 sparse matrix of type ''\n", "\twith 86 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer.transform(texts[:1])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:20.601343Z", "start_time": "2019-09-28T19:10:20.574776Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0 86]\n", "[32392 32095 32052 32005 31915 30702 29837 29500 29300 29264 29189 29187\n", " 28687 28286 28199 27940 27518 27398 27021 26474 26464 25513 25502 24803\n", " 24592 24556 23602 23589 21856 21839 21769 21622 21518 21494 21166 21164\n", " 21036 20420 18936 18790 18781 18740 18373 18031 18002 16147 15952 15805\n", " 15682 14806 14726 13866 13463 13226 13133 11488 11312 10971 10933 10901\n", " 10812 10811 9557 8916 8733 8682 8460 8451 8267 7630 7232 6984\n", " 6348 5870 5619 5242 5198 5176 5074 4712 4701 4665 4345 4143\n", " 3983 3905]\n", "[0.02775776 0.030364 0.10357777 0.10097852 0.05551552 0.08913878\n", " 0.0751644 0.05521512 0.02543534 0.07527191 0.05440145 0.04646966\n", " 0.07125018 0.0955189 0.01649287 0.12280349 0.25018259 0.0710195\n", " 0.09802838 0.05646637 0.09712269 0.10057076 0.09482619 0.08113136\n", " 0.04893556 0.09057384 0.23738007 0.11869004 0.18429156 0.12343474\n", " 0.01703927 0.04332515 0.12343474 0.01848065 0.05632337 0.12343474\n", " 0.03769659 0.0854585 0.06358581 0.07172143 0.09057384 0.12343474\n", " 0.0833942 0.10531547 0.08659639 0.08913878 0.01987357 0.08913878\n", " 0.1325496 0.08719619 0.07172143 0.06089811 0.01649287 0.04338328\n", " 0.09689049 0.04826154 0.48428276 0.03809625 0.03886054 0.03418115\n", " 0.11500976 0.11869004 0.10357777 0.08984071 0.12343474 0.04302873\n", " 0.09925063 0.06487218 0.16490298 0.06679843 0.0833942 0.03432907\n", " 0.11869004 0.02634071 0.05392398 0.10946036 0.03071828 0.03099918\n", " 0.02850602 0.15610804 0.03925604 0.10531547 0.0819996 0.10946036\n", " 0.05913563 0.23738007]\n" ] } ], "source": [ "print(vectorizer.transform(texts[:1]).indptr)\n", "print(vectorizer.transform(texts[:1]).indices)\n", "print(vectorizer.transform(texts[:1]).data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Заметим, что оба метода возвращают вектор длины 32548 (размер нашего словаря)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Заметим, что одно и то же слово может встречаться в различных формах (например, \"сотрудник\" и \"сотрудника\"), но описанные выше методы интерпретируют их как различные слова, что делает признаковое описание избыточным. Устранить эту проблему можно при помощи **лемматизации** и **стемминга**.\n", "\n", "### Стемминг\n", "\n", "[**Stemming**](https://en.wikipedia.org/wiki/Stemming) – это процесс нахождения основы слова. В результате применения данной процедуры однокоренные слова, как правило, преобразуются к одинаковому виду.\n", "\n", "**Примеры стемминга:**\n", "\n", "| Word | Stem |\n", "| ----------- |:-------------:|\n", "| вагон | вагон |\n", "| вагона | вагон |\n", "| вагоне | вагон |\n", "| вагонов | вагон |\n", "| вагоном | вагон |\n", "| вагоны | вагон |\n", "| важная | важн |\n", "| важнее | важн |\n", "| важнейшие | важн |\n", "| важнейшими | важн |\n", "| важничал | важнича |\n", "| важно | важн |\n", "\n", "[Snowball](http://snowball.tartarus.org/) – фрэймворк для написания алгоритмов стемминга. Алгоритмы стемминга отличаются для разных языков и используют знания о конкретном языке – списки окончаний для разных чистей речи, разных склонений и т.д. Пример алгоритма для русского языка – [Russian stemming](http://snowballstem.org/algorithms/russian/stemmer.html)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:21.109437Z", "start_time": "2019-09-28T19:10:20.604439Z" } }, "outputs": [], "source": [ "import nltk\n", "stemmer = nltk.stem.snowball.RussianStemmer()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:21.130340Z", "start_time": "2019-09-28T19:10:21.112772Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "машин обучен\n" ] } ], "source": [ "print(stemmer.stem(u'машинное'), stemmer.stem(u'обучение'))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.274844Z", "start_time": "2019-09-28T19:10:21.135760Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 1000/1000 [00:04<00:00, 242.80it/s]\n" ] } ], "source": [ "stemmer = nltk.stem.snowball.EnglishStemmer()\n", "\n", "def stem_text(text, stemmer):\n", " tokens = text.split()\n", " return ' '.join(map(lambda w: stemmer.stem(w), tokens))\n", "\n", "stemmed_texts = []\n", "for t in tqdm(texts[:1000]):\n", " stemmed_texts.append(stem_text(t, stemmer))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.290119Z", "start_time": "2019-09-28T19:10:25.279611Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: dyer@spdcc.com (Steve Dyer)\n", "Subject: Re: Analgesics with Diuretics\n", "Organization: S.P. Dyer Computer Consulting, Cambridge MA\n", "\n", "In article Lawrence Curcio writes:\n", ">I sometimes see OTC preparations for muscle aches/back aches that\n", ">combine aspirin with a diuretic.\n", "\n", "You certainly do not see OTC preparations advertised as such.\n", "The only such ridiculous concoctions are nostrums for premenstrual\n", "syndrome, ostensibly to treat headache and \"bloating\" simultaneously.\n", "They're worthless.\n", "\n", ">The idea seems to be to reduce\n", ">inflammation by getting rid of fluid. Does this actually work? \n", "\n", "That's not the idea, and no, they don't work.\n", "\n", "-- \n", "Steve Dyer\n", "dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer\n", "\n" ] } ], "source": [ "print(texts[0])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.306834Z", "start_time": "2019-09-28T19:10:25.294638Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "from: dyer@spdcc.com (steve dyer) subject: re: analges with diuret organization: s.p. dyer comput consulting, cambridg ma in articl lawrenc curcio writes: >i sometim see otc prepar for muscl aches/back ach that >combin aspirin with a diuretic. you certain do not see otc prepar advertis as such. the onli such ridicul concoct are nostrum for premenstru syndrome, ostens to treat headach and \"bloating\" simultaneously. they'r worthless. >the idea seem to be to reduc >inflamm by get rid of fluid. doe this actual work? that not the idea, and no, they don't work. -- steve dyer dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dy\n" ] } ], "source": [ "print(stemmed_texts[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Как видим, стеммер работает не очень быстро и запускать его для всей выборки достаточно накладно." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Лемматизация\n", "\n", "[Лемматизация](https://en.wikipedia.org/wiki/Lemmatisation) — процесс приведения слова к его нормальной форме (**лемме**):\n", "- для существительных — именительный падеж, единственное число;\n", "- для прилагательных — именительный падеж, единственное число, мужской род;\n", "- для глаголов, причастий, деепричастий — глагол в инфинитиве." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Например, для русского языка есть библиотека pymorphy2." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.442948Z", "start_time": "2019-09-28T19:10:25.311429Z" } }, "outputs": [], "source": [ "import pymorphy2\n", "morph = pymorphy2.MorphAnalyzer()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.468072Z", "start_time": "2019-09-28T19:10:25.447023Z" } }, "outputs": [ { "data": { "text/plain": [ "Parse(word='играющих', tag=OpencorporaTag('PRTF,impf,tran,pres,actv plur,gent'), normal_form='играть', score=0.16666666666666666, methods_stack=((, 'играющих', 303, 34),))" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "morph.parse('играющих')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Сравним работу стеммера и лемматизатора на примере:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.489016Z", "start_time": "2019-09-28T19:10:25.474203Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "игра\n" ] } ], "source": [ "stemmer = nltk.stem.snowball.RussianStemmer()\n", "print(stemmer.stem('играющих'))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:25.502206Z", "start_time": "2019-09-28T19:10:25.493051Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "играть\n" ] } ], "source": [ "print(morph.parse('играющих')[0].normal_form)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Трансформация признаков и целевой переменной" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Разберёмся, как может влиять трансформация признаков или целевой переменной на качество модели. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Логарифмирование \n", "\n", "Воспользуется датасетом с ценами на дома, с которым мы уже сталкивались ранее ([House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:27.600964Z", "start_time": "2019-09-28T19:10:25.507394Z" } }, "outputs": [], "source": [ "!wget https://www.dropbox.com/s/h5x86yvaf384vnt/train.csv" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:27.669451Z", "start_time": "2019-09-28T19:10:27.604504Z" } }, "outputs": [], "source": [ "data = pd.read_csv('train.csv')\n", "\n", "data = data.drop(columns=[\"Id\"])\n", "y = data[\"SalePrice\"]\n", "X = data.drop(columns=[\"SalePrice\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Посмотрим на распределение целевой переменной " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.386207Z", "start_time": "2019-09-28T19:10:27.671842Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12, 5))\n", "\n", "plt.subplot(1, 2, 1)\n", "sns.distplot(y, label='target')\n", "plt.title('target')\n", "\n", "plt.subplot(1, 2, 2)\n", "sns.distplot(data.GrLivArea, label='area')\n", "plt.title('area')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Видим, что распределения несимметричные с тяжёлыми правыми хвостами.\n", "\n", "Оставим только числовые признаки, пропуски заменим средним значением." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.464075Z", "start_time": "2019-09-28T19:10:28.401693Z" } }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=10)\n", "\n", "numeric_data = X_train.select_dtypes([np.number])\n", "numeric_data_mean = numeric_data.mean()\n", "numeric_features = numeric_data.columns\n", "\n", "X_train = X_train.fillna(numeric_data_mean)[numeric_features]\n", "X_test = X_test.fillna(numeric_data_mean)[numeric_features]" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-09-26T13:06:11.582867Z", "start_time": "2019-09-26T13:06:11.570901Z" } }, "source": [ "Если разбирать линейную регрессия с [вероятностной](https://github.com/esokolov/ml-course-hse/blob/master/2018-fall/seminars/sem04-linregr.pdf) точки зрения, то можно получить, что шум должен быть распределён нормально. Поэтому лучше, когда целевая переменная распределена также нормально.\n", "\n", "Если прологарифмировать целевую переменную, то её распределение станет больше похоже на нормальное:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.771830Z", "start_time": "2019-09-28T19:10:28.469382Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.distplot(np.log(y+1), label='target')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Сравним качество линейной регрессии в двух случаях:\n", "1. Целевая переменная без изменений.\n", "2. Целевая переменная прологарифмирована.\n", "\n", "Не забудем вернуть во втором случае взять экспоненту от предсказаний!" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.833250Z", "start_time": "2019-09-28T19:10:28.801394Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test RMSE = 32085.7681\n" ] } ], "source": [ "model = Ridge()\n", "model.fit(X_train, y_train)\n", "y_pred = model.predict(X_test)\n", "\n", "print(\"Test RMSE = %.4f\" % mean_squared_error(y_test, y_pred) ** 0.5)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.860025Z", "start_time": "2019-09-28T19:10:28.840089Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test RMSE = 26649.2742\n" ] } ], "source": [ "model = Ridge()\n", "model.fit(X_train, np.log(y_train+1))\n", "y_pred = np.exp(model.predict(X_test))-1\n", "\n", "print(\"Test RMSE = %.4f\" % mean_squared_error(y_test, y_pred) ** 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Попробуем аналогично логарифмировать один из признаков, имеющих также смещённое распределение (этот признак был вторым по важности!)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.883024Z", "start_time": "2019-09-28T19:10:28.866508Z" } }, "outputs": [], "source": [ "X_train.GrLivArea = np.log(X_train.GrLivArea + 1)\n", "X_test.GrLivArea = np.log(X_test.GrLivArea + 1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.915423Z", "start_time": "2019-09-28T19:10:28.887909Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test RMSE = 31893.8891\n" ] } ], "source": [ "model = Ridge()\n", "model.fit(X_train[numeric_features], y_train)\n", "y_pred = model.predict(X_test[numeric_features])\n", "\n", "print(\"Test RMSE = %.4f\" % mean_squared_error(y_test, y_pred) ** 0.5)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.946812Z", "start_time": "2019-09-28T19:10:28.919832Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test RMSE = 25935.0780\n" ] } ], "source": [ "model = Ridge()\n", "model.fit(X_train[numeric_features], np.log(y_train+1))\n", "y_pred = np.exp(model.predict(X_test[numeric_features]))-1\n", "\n", "print(\"Test RMSE = %.4f\" % mean_squared_error(y_test, y_pred) ** 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Как видим, преобразование признаков влияет слабее. Признаков много, а вклад размывается по всем. К тому же, проверять распределение множества признаков технически сложнее, чем одной целевой переменной." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Бинаризация" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Мы уже смотрели, как полиномиальные признаки могут помочь при восстановлении нелинейной зависимости линейной моделью. Альтернативный подход заключается в бинаризации признаков. Мы разбиваем ось значений одного из признаков на куски (бины) и добавляем для каждого куска-бина новый признак-индикатор попадения в этот бин." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:28.966252Z", "start_time": "2019-09-28T19:10:28.952975Z" } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "np.random.seed(36)\n", "X = np.random.uniform(0, 1, size=100)\n", "y = np.cos(1.5 * np.pi * X) + np.random.normal(scale=0.1, size=X.shape)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:29.287834Z", "start_time": "2019-09-28T19:10:28.972865Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(X, y)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:29.304489Z", "start_time": "2019-09-28T19:10:29.292850Z" } }, "outputs": [], "source": [ "X = X.reshape((-1, 1))\n", "thresholds = np.arange(0.2, 1.1, 0.2).reshape((1, -1))\n", "\n", "X_expand = np.hstack((\n", " X,\n", " ((X > thresholds[:, :-1]) & (X <= thresholds[:, 1:])).astype(int)))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:29.318935Z", "start_time": "2019-09-28T19:10:29.308660Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:29.345543Z", "start_time": "2019-09-28T19:10:29.322509Z" } }, "outputs": [ { "data": { "text/plain": [ "0.20553980048560808" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-np.mean(cross_val_score(\n", " LinearRegression(), X, y, cv=KFold(n_splits=3, random_state=123),\n", " scoring='neg_mean_squared_error'))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:29.378306Z", "start_time": "2019-09-28T19:10:29.349811Z" } }, "outputs": [ { "data": { "text/plain": [ "0.05580385745900118" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "-np.mean(cross_val_score(\n", " LinearRegression(), X_expand, y, cv=KFold(n_splits=3, random_state=123),\n", " scoring='neg_mean_squared_error'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Так линейная модель может лучше восстанавливать нелинейные зависимости." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Транзакционные данные" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Напоследок посмотрим, как можно извлекать признаки из транзакционных данных. \n", "\n", "Транзакционные данные характеризуются тем, что есть много строк, характеризующихся моментов времени и некоторым числом (суммой денег, например). При этом если это банк, то каждому человеку принадлежит не одна транзакция, а чаще всего надо предсказывать некоторые сущности для клиентов. Таким образом, надо получить признаки для пользователей из множества их транзакций. Этим мы и займёмся.\n", "\n", "Для примера возьмём данные [отсюда](https://www.kaggle.com/regivm/retailtransactiondata/). Задача детектирования фродовых клиентов." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:34.228625Z", "start_time": "2019-09-28T19:10:29.385812Z" } }, "outputs": [], "source": [ "!wget https://www.dropbox.com/s/zhgyiugyrzjs5xb/Retail_Data_Response.csv\n", "!wget https://www.dropbox.com/s/5xepi2t6d81s9o3/Retail_Data_Transactions.csv" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:34.314276Z", "start_time": "2019-09-28T19:10:34.231504Z" } }, "outputs": [], "source": [ "customers = pd.read_csv('Retail_Data_Response.csv')\n", "transactions = pd.read_csv('Retail_Data_Transactions.csv')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:34.339478Z", "start_time": "2019-09-28T19:10:34.316390Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idresponse
0CS11120
1CS11130
2CS11141
3CS11151
4CS11161
\n", "
" ], "text/plain": [ " customer_id response\n", "0 CS1112 0\n", "1 CS1113 0\n", "2 CS1114 1\n", "3 CS1115 1\n", "4 CS1116 1" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers.head()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:34.393258Z", "start_time": "2019-09-28T19:10:34.345560Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idtrans_datetran_amount
0CS529511-Feb-1335
1CS476815-Mar-1539
2CS212226-Feb-1352
3CS121716-Nov-1199
4CS185020-Nov-1378
\n", "
" ], "text/plain": [ " customer_id trans_date tran_amount\n", "0 CS5295 11-Feb-13 35\n", "1 CS4768 15-Mar-15 39\n", "2 CS2122 26-Feb-13 52\n", "3 CS1217 16-Nov-11 99\n", "4 CS1850 20-Nov-13 78" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:35.873166Z", "start_time": "2019-09-28T19:10:34.443594Z" } }, "outputs": [], "source": [ "transactions.trans_date = transactions.trans_date.apply(\n", " lambda x: datetime.datetime.strptime(x, '%d-%b-%y'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Посмотрим на распределение целевой переменной:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:35.891923Z", "start_time": "2019-09-28T19:10:35.875688Z" } }, "outputs": [ { "data": { "text/plain": [ "0.09398605461940732" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers.response.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Получаем примерно 1 к 9 положительных примеров. Если такие данные разбивать на части для кросс валидации, то может получиться так, что в одну из частей попадёт слишком мало положительных примеров, а в другую — наоборот. На случай такого неравномерного баланса классов есть StratifiedKFold, который бьёт данные так, чтобы баланс классов во всех частях был одинаковым." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:35.909700Z", "start_time": "2019-09-28T19:10:35.898951Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Когда строк на каждый объект много, можно считать различные статистики. Например, средние, минимальные и максимальные суммы, потраченные клиентом, количество транзакий, ..." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:36.003307Z", "start_time": "2019-09-28T19:10:35.914780Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idresponsemeanstdcountminmax
0CS1112067.46666719.7660121536105
1CS1113074.50000021.254102203698
2CS1114175.36842121.3416921937105
3CS1115175.40909118.1518962241104
4CS1116165.92307722.9400001340105
\n", "
" ], "text/plain": [ " customer_id response mean std count min max\n", "0 CS1112 0 67.466667 19.766012 15 36 105\n", "1 CS1113 0 74.500000 21.254102 20 36 98\n", "2 CS1114 1 75.368421 21.341692 19 37 105\n", "3 CS1115 1 75.409091 18.151896 22 41 104\n", "4 CS1116 1 65.923077 22.940000 13 40 105" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg_transactions = transactions.groupby('customer_id').tran_amount.agg(\n", " ['mean', 'std', 'count', 'min', 'max']).reset_index()\n", "\n", "data = pd.merge(customers, agg_transactions, how='left', on='customer_id')\n", "\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:36.210653Z", "start_time": "2019-09-28T19:10:36.008305Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n", "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n", "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "0.594866904556827" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "np.mean(cross_val_score(\n", " LogisticRegression(),\n", " X=data.drop(['customer_id', 'response'], axis=1),\n", " y=data.response,\n", " cv=StratifiedKFold(n_splits=3, random_state=123),\n", " scoring='roc_auc'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Но каждая транзакция снабжена датой! Можно посчитать статистики только по свежим транзакциям. Добавим их." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:36.235061Z", "start_time": "2019-09-28T19:10:36.218040Z" } }, "outputs": [ { "data": { "text/plain": [ "(Timestamp('2011-05-16 00:00:00'), Timestamp('2015-03-16 00:00:00'))" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.trans_date.min(), transactions.trans_date.max()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:36.858305Z", "start_time": "2019-09-28T19:10:36.240648Z" } }, "outputs": [], "source": [ "agg_transactions = transactions.loc[transactions.trans_date.apply(\n", " lambda x: x.year == 2014)].groupby('customer_id').tran_amount.agg(\n", " ['mean', 'std', 'count', 'min', 'max']).reset_index()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:36.886707Z", "start_time": "2019-09-28T19:10:36.860734Z" } }, "outputs": [], "source": [ "data = pd.merge(data, agg_transactions, how='left', on='customer_id', suffixes=('', '_2014'))\n", "data = data.fillna(0)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2019-09-28T19:10:37.020392Z", "start_time": "2019-09-28T19:10:36.889377Z" }, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n", "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n", "/Users/ekayumov/anaconda/envs/py36/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "0.6483871707242365" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(cross_val_score(\n", " LogisticRegression(),\n", " X=data.drop(['customer_id', 'response'], axis=1),\n", " y=data.response,\n", " cv=StratifiedKFold(n_splits=3, random_state=123),\n", " scoring='roc_auc'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Можно также считать дату первой и последней транзакциями пользователей, среднее время между транзакциями и прочее." ] } ], "metadata": { "kernelspec": { "display_name": "py36", "language": "python", "name": "py36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "position": { "height": "144px", "left": "792px", "right": "20px", "top": "166px", "width": "350px" }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }