{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Maxim Kashirin, ODS Slack nickname: Maxim Kashirin\n", " \n", "##
Prediction of changes in the ruble exchange rate based on Lenta.ru news" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Research plan**\n", " - Dataset and features description\n", " - Exploratory data analysis\n", " - Visual analysis of the features\n", " - Patterns, insights, peculiarities of data\n", " - Data preprocessing\n", " - Feature engineering and description\n", " - Cross-validation, hyperparameter tuning\n", " - Validation and learning curves\n", " - Prediction for hold-out and test samples\n", " - Model evaluation with metrics description\n", " - Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1. Dataset and features description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This individual project considers the task of predicting changes in the exchange rate of the Russian national currency based on news from Lenta.ru, one of the major Russian Internet publications. The original dataset contains news in Russian, but all the important parts are translated into English.\n", "Economic and geopolitical events affect the exchange rate, while the exchange rate itself affects people's lives. Predicting exchange rate changes can be useful for better management of a personal budget." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no ready-made dataset that fully satisfies the requirements, so it has to be assembled from two other datasets.\n", "\n", "**Lenta.ru news [dataset](https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta)**\n", "\n", "1. tags - news tag or subcategory, categorical;\n", "1. text - news text, text;\n", "1. title - news title, text;\n", "1. topic - news topic (category), categorical;\n", "1. url - news url (format: https://lenta.ru/news/year/month/day/slug/), text\n", "\n", "**Russian financial indicators (downloaded from [investing.com](https://www.investing.com/currencies/usd-rub-historical-data))**\n", "1. date - rate date, datetime 2001-2018;\n", "1. price - ruble to dollar rate (in rubles for one dollar), float;\n", "1. change% - percentage change from the previous day, float;\n", "\n", "\n", "A new dataset with news and exchange rates is then generated from them:\n", "\n", "1. date\n", "1. price\n", "1. tags\n", "1. text\n", "1. title\n", "1. topic\n", "1. url\n", "1. 
change% - target variable\n", "\n", "Precomputed data: https://goo.gl/jKQzDb\n", "\n", "Original data: https://goo.gl/yE8g6m" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "from sklearn.model_selection import KFold\n", "from urllib.parse import urlparse\n", "import datetime\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from collections import Counter\n", "import gc\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin with, to comply with the course requirement that the materials be in English, part of the dataset needs to be translated" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "news_df = pd.read_csv('./data/news_lenta_orig.csv')\n", "mapping = {\n", " 'Все': 'Все (All)',\n", " 'Политика': 'Политика (Politics)',\n", " 'Общество': 'Общество (Society)',\n", " 'Украина': 'Украина (Ukraine)',\n", " 'Происшествия': 'Происшествия (Accidents)',\n", " 'Футбол': 'Футбол (Football)',\n", " 'Госэкономика': 'Госэкономика (Government economy)',\n", " 'Кино': 'Кино (Movies)',\n", " 'Бизнес': 'Бизнес (Business)',\n", " 'Интернет': 'Интернет (WWW)',\n", " 'Наука': 'Наука (Science)',\n", " 'Следствие и суд': 'Следствие и суд (Investigation and trial)',\n", " 'Музыка': 'Музыка (Music)',\n", " 'Люди': 'Люди (People)',\n", " 'Преступность': 'Преступность (Crime)',\n", " 'Космос': 'Космос (Space)',\n", " 'События': 'События (Events)',\n", " 'Конфликты': 'Конфликты (Conflicts)',\n", " 'Coцсети': 'Coцсети (Social networks)',\n", " 'Летние виды': 'Летние виды (Summer sports)',\n", " 'ТВ и радио': 'ТВ и радио (TV and radio)',\n", " 'Деловой климат': 'Деловой климат (Business relationship)',\n", " 'Криминал': 'Криминал (Crime)',\n", " 'Явления': 'Явления (Phenomena)',\n", " 'Регионы': 'Регионы (Regions)',\n", " 'Гаджеты': 'Гаджеты (Gadgets)',\n", " 'Мир': 'Мир (World)',\n", " 'Бокс и ММА': 'Бокс и ММА (Boxing and MMA)',\n", " 'Игры': 'Игры (Games)',\n", " 'Звери': 'Звери (Wild)',\n", " 'Стиль': 'Стиль (Style)',\n", " 'Искусство': 'Искусство (Art)',\n", " 'Пресса': 'Пресса (Press)',\n", " 'Рынки': 'Рынки (Markets)',\n", " 'Зимние виды': 'Зимние виды (Winter sports)',\n", " 'Полиция и спецслужбы': 'Полиция и спецслужбы (Police and special services)',\n", " 'Кавказ': 'Кавказ (Caucasus)',\n", " 'Москва': 'Москва (Moscow)',\n", " 'Деньги': 'Деньги (Money)',\n", " 'Прибалтика': 'Прибалтика (Baltic)',\n", " 'Книги': 'Книги (Books)',\n", " 'Театр': 'Театр (Theatre)',\n", " 'Техника': 'Техника (Technology)',\n", " 'Средняя Азия': 'Средняя Азия (Middle Asia)',\n", " 'Мировой бизнес': 'Мировой бизнес (World Business)',\n", " 'Хоккей': 'Хоккей (Hockey)',\n", " 'Белоруссия': 'Белоруссия (Belorussia)',\n", " 'Движение': 'Движение (Movement)',\n", " 'ОИ-2018': 'ОИ-2018 (Olympic games 2018)',\n", " 'Оружие': 'Оружие (Weapon)',\n", " 'Инструменты': 'Инструменты (Tools)',\n", " 'Казахстан': 'Казахстан (Kazakhstan)',\n", " 'Достижения': 'Достижения (Achievements)',\n", " 'Софт': 'Софт (Software)',\n", " 'Россия': 'Россия (Russia)',\n", " 'Внешний вид': 'Внешний вид (Appearance)',\n", " 'Часы': 'Часы (Watch)',\n", " 'Мнения': 'Мнения (Opinions)',\n", " 'Вирусные ролики': 'Вирусные ролики (Viral)',\n", " 'Мемы': 'Мемы (Memes)',\n", " 'Еда': 'Еда (Food)',\n", " 'Молдавия': 'Молдавия (Moldavia)',\n", " 'Катастрофы': 'Катастрофы (Disasters)',\n", 
" 'Вещи': 'Вещи (clothes)',\n", " 'Реклама': 'Реклама (Advertisement)',\n", " 'Автобизнес': 'Автобизнес (Car business)',\n", " 'История': 'История (History)',\n", " 'Жизнь': 'Жизнь (Life)',\n", " 'Финансы компаний': 'Финансы компаний (Finance companies)',\n", " 'Авто': 'Авто (Cars)',\n", " 'Киберпреступность': 'Киберпреступность (Cybercrime)',\n", " 'Туризм': 'Туризм (Tourism)',\n", " 'Преступная Россия': 'Преступная Россия (Criminal Russia)',\n", " 'Первая мировая': 'Первая мировая (World War I)',\n", " 'Социальная сфера': 'Социальная сфера (Social Sphere)',\n", " 'Экология': 'Экология (Ecology)',\n", " 'Наследие': 'Наследие (Legacy)',\n", " 'Госрегулирование': 'Госрегулирование (Government regulation)',\n", " 'Производители': 'Производители (Manufacturers)',\n", " 'Вкусы': 'Вкусы (Taste)',\n", " 'ЧМ-2018': 'ЧМ-2018 (World cup 2018)',\n", " 'Аналитика рынка': 'Аналитика рынка (Market Analytics)',\n", " 'Фотография': 'Фотография (Photos)',\n", " 'Крым': 'Крым (Crimea)',\n", " 'Страноведение': 'Страноведение (Geography)',\n", " 'Выборы': 'Выборы (Elections)',\n", " 'Мировой опыт': 'Мировой опыт (World experience)',\n", " 'Вооружение': 'Вооружение (Armament)',\n", " 'Культпросвет': 'Культпросвет (Cultural enlightenment)',\n", " 'Инновации': 'Инновации (Innovation)'\n", "}\n", "news_df['tags'] = news_df['tags'].map(mapping)\n", "mapping = {\n", " 'Россия': 'Россия (Russia)',\n", " 'Мир': 'Мир (World)',\n", " 'Экономика': 'Экономика (Economy)',\n", " 'Спорт': 'Спорт (Sport)',\n", " 'Культура': 'Культура (Culture)',\n", " 'Бывший СССР': 'Бывший СССР (Former USSR)',\n", " 'Наука и техника': 'Наука и техника (Science and technology)',\n", " 'Интернет и СМИ': 'Интернет и СМИ (Internet and media)',\n", " 'Из жизни': 'Из жизни (Life stories)',\n", " 'Силовые структуры': 'Силовые структуры (Security or military services)',\n", " 'Бизнес': 'Бизнес (Business)',\n", " 'Ценности': 'Ценности (Values)',\n", " 'Путешествия': 'Путешествия (Travels)',\n", " '69-я параллель': '69-я параллель (69th parallel)',\n", " 'Крым': 'Крым (Crimea)',\n", " 'Культпросвет ': 'Культпросвет (Cultural enlightenment)',\n", " 'Легпром': 'Легпром (Light industry)',\n", " 'Библиотека': 'Библиотека (Library)',\n", " 'Дом': 'Дом (Home)',\n", " 'Оружие': 'Оружие (Weapon)',\n", " 'ЧМ-2014': 'ЧМ-2014 (World Cup 2014)',\n", " 'МедНовости': 'МедНовости (Medicine)',\n", " 'Сочи': 'Сочи (Sochi)'\n", "}\n", "news_df['topic'] = news_df['topic'].map(mapping)\n", "news_df.to_csv('./data/news_lenta.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The exchange rate needs a little preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "currency_df = pd.read_csv('./data/usd_orig.csv')\n", "currency_df.drop([\"open\",\"max\",\"min\"], inplace=True, axis=1)\n", "currency_df['date'] = pd.to_datetime(currency_df['date'])\n", "currency_df['price'] = currency_df['price'].apply(lambda x: x.replace(',', '.'))\n", "currency_df['change%'] = currency_df['change%'].apply(lambda x: x.replace(',', '.').replace('%', ''))\n", "currency_df.to_csv('./data/usd.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load currency exchange data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "currency_df = pd.read_csv('./data/usd.csv', parse_dates=['date', ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load Lenta.ru news" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "news_df = pd.read_csv('./data/news_lenta.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Original dataset doesn't contain 'publication date' feature but it can be extracted from url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def parse_date_from_url(x):\n", " path = urlparse(x).path.strip('/').split('/')\n", " return datetime.datetime(int(path[1]), int(path[2]), int(path[3]))\n", "\n", "news_df['date'] = news_df['url'].apply(parse_date_from_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can combine datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_df = currency_df.merge(news_df, on='date', how='inner', suffixes=('_currency', '_news'))\n", "# full_df.to_csv('./data/data.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete redundant variables to save memory (currency_df will need later)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del news_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2. Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see on dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'Number of rows: {full_df.shape[0]}')\n", "full_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in dataset has following types:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following time period is considered in dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f'Period: from {full_df[\"date\"].min().date()} to {full_df[\"date\"].max().date()}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see numeric features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_df[['price', 'change%',]].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explore the distribution of numerical values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col_label in ['price', 'change%']:\n", " col = full_df[col_label]\n", " col_mean = col.mean()\n", " _, normal_distribution_probability = stats.normaltest(col)\n", " skewness = stats.skew(col)\n", " print(col_label + ':')\n", " print(f' Mean: {col_mean}')\n", " print(f' Normal distribution probability: {normal_distribution_probability}')\n", " print(f' Skewness: {skewness}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Target variable \"change%\" means approximately equal to zero, target variable not distributed normally" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the tags that are used in dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.Series(full_df['tags'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore topics:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.Series(full_df['topic'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3. 
Visual analysis of the features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Target variable**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For simplicity, we will use the dataset with only the exchange rate\n", "target_variable_exploration_df = currency_df.copy()\n", "target_variable_exploration_df['year_month'] = target_variable_exploration_df['date']\\\n", " .apply(lambda x: datetime.datetime(x.year, x.month, 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tve_df_by_year_month = target_variable_exploration_df.groupby('year_month')\n", "\n", "mean_values = tve_df_by_year_month.mean()\n", "\n", "fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15,5))\n", "ax1.set_title('Rate change from previous day in percentages (target variable)')\n", "ax1.plot(\n", " mean_values.index, \n", " mean_values['change%'],\n", " mean_values.index, \n", " np.zeros(len(mean_values.index)))\n", "ax2.set_title('Ruble exchange rate (Rubles per USD)')\n", "ax2.plot(\n", " mean_values.index, \n", " mean_values['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Until 2008, the rate is stable\n", "* Then there is a sharp increase in the value of USD (see https://en.wikipedia.org/wiki/Great_Recession_in_Russia)\n", "* Until 2014 the rate is again stable, at around 30-35 rubles per dollar\n", "* Then comes the financial crisis (see https://en.wikipedia.org/wiki/Russian_financial_crisis_(2014%E2%80%932017)) and the value of the USD rises sharply again." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the data in the context of each year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_df_with_year = full_df.copy()\n", "full_df_with_year['year'] = full_df['date'].apply(lambda x: x.year)\n", "year_group = full_df_with_year.groupby('year')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Number of news articles per year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year_group['date'].count().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "News topics per year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year_group['topic'].value_counts().groupby(level=0).head(6).plot(kind='barh', figsize=(15,20))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "News tags per year:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tags_per_years = year_group['tags'].value_counts()\n", "tags_per_years.drop(['Все (All)'], level=1, inplace=True)\n", "tags_per_years.groupby(level=0).head(6).plot(kind='barh', figsize=(15,15))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gc.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Conclusion**\n", "\n", "* The exchange rate is unstable; there are drastic changes in its values;\n", "* The number of articles grew every year, with one exception: in 2014 there were fewer articles than in 2013 and 2015;\n", "* The most popular topics are 'Russia' and 'World';\n", "* The most popular tags are 'Society', 'Ukraine', 'Politics';\n", "* Until 2013 there are no tags." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 4. 
Patterns, insights, peculiarities of data " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the values of the target variable in different time slices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_variable_exploration_df['month'] = target_variable_exploration_df['date']\\\n", " .apply(lambda x: x.month)\n", "target_variable_exploration_df['day_of_month'] = target_variable_exploration_df['date']\\\n", " .apply(lambda x: x.day)\n", "target_variable_exploration_df['day_of_week'] = target_variable_exploration_df['date']\\\n", " .apply(lambda x: x.weekday())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tve_df_by_month = target_variable_exploration_df.groupby('month')\n", "tve_df_by_dom = target_variable_exploration_df.groupby('day_of_month')\n", "tve_df_by_dow = target_variable_exploration_df.groupby('day_of_week')\n", "mean_values = tve_df_by_month.mean(), tve_df_by_dom.mean(), tve_df_by_dow.mean()\n", "titles = 'month', 'day of month', 'day of week'\n", "\n", "fig, axis = plt.subplots(1, 3, figsize=(15,4))\n", "for i in range(3):\n", " axis[i].set_title(titles[i])\n", " axis[i].plot(\n", " mean_values[i].index, \n", " mean_values[i]['change%'],\n", " mean_values[i].index, \n", " np.zeros(len(mean_values[i].index)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the NLTK library and Mystem for text normalization. \n", "* NLTK is a leading platform for building Python programs that work with human language data. \n", "* Mystem performs morphological analysis of Russian text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download(\"stopwords\")\n", "from nltk.corpus import stopwords\n", "from pymystem3 import Mystem\n", "from string import punctuation, whitespace" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mystem = Mystem() \n", "# a set has O(1) lookup for the \"in\" operation \n", "russian_stopwords = set(stopwords.words(\"russian\"))\n", "ext_punctuation = set(punctuation + '«»')\n", "whitespace_set = set(whitespace + '\xa0')\n", "\n", "# text to normalized list of words\n", "def tokenize_text(text):\n", " tokens = mystem.lemmatize(text.lower())\n", " tokens = [token for token in tokens if token not in russian_stopwords\\\n", " and token not in whitespace_set\\\n", " and token.strip() not in ext_punctuation]\n", " return tokens\n", "\n", "# text to normalized text \n", "def preprocess_text(text):\n", " return \" \".join(tokenize_text(text))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will try to find the days on which the rate jumps were the strongest and analyze the news agenda."
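] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before that, a quick sanity check of the normalization pipeline on a made-up headline (a minimal illustration; the sample sentence is not from the dataset):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 'США ввели новые санкции против России' = 'The USA imposed new sanctions against Russia'\n", "# Mystem lemmatizes each word; stopwords, whitespace and punctuation are dropped\n", "print(preprocess_text('США ввели новые санкции против России'))"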
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abs_quantile = full_df['change%'].apply(np.abs).quantile(0.995)\n", "print(f'Analyze days with |change%| more than {abs_quantile}')\n", "change999_df = full_df[(full_df['change%'].apply(np.abs) > abs_quantile)]\n", "print(f'News count: {change999_df[\"topic\"].shape[0]}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "change999_df['date'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The strongest jumps of the rate occurred in the second half of 2009 and in December 2014" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "change999_df['topic'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "change999_df['tags'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, until 2012 there are no tags (everything is marked with the tag \"All\"), so the tags in 2009 cannot be assessed. However, in 2014 the \"Politics\" tag stands out more strongly than in the overall sample. The \"World\" topic takes 3rd place in the overall sample in 2009 and 4th place in 2014, while for the days in question it is in second place, a small margin behind the first. From this we can conclude that news in the \"World\" topic with the \"Politics\" tag may influence the exchange rate the most." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this data subset, let's find the most popular words contained in the titles" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "world_politics = change999_df[\n", " (change999_df['topic'] == 'Мир (World)') & (change999_df['tags'] == 'Политика (Politics)')]\n", "titles = world_politics['title'].apply(tokenize_text).values\n", "flat_list = [item.lower() for sublist in titles for item in sublist]\n", "word_counts = list(Counter(flat_list).items())\n", "word_counts_df = pd.DataFrame(word_counts, columns=['word', 'count']).sort_values(by=['count', 'word'], ascending=False)\n", "\n", "top10_word_df = word_counts_df.iloc[:10]\n", "# Word translations\n", "translation_series = pd.Series([\n", " 'sanctions', \n", " 'Russia',\n", " 'USA', \n", " 'against',\n", " 'president',\n", " 'call', \n", " 'Relation',\n", " 'Obama',\n", " 'Moscow',\n", " 'Cuba'\n", "], index=top10_word_df.index)\n", "\n", "top10_word_df = pd.concat([translation_series, top10_word_df], axis=1)\n", "top10_word_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most popular word turned out to be “sanctions”: it was repeated 12 times in the reviewed headlines, followed by the two sides of this confrontation, “USA” and “Russia”."
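] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional illustration (a small sketch, not part of the original analysis), the same counts can be drawn as a bar chart using the top10_word_df table built above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# horizontal bar chart of the top-10 word counts (Russian words on the y-axis)\n", "ax = top10_word_df.set_index('word')['count'].plot(kind='barh', figsize=(8, 4))\n", "ax.invert_yaxis() # most frequent word on top\n", "ax.set_xlabel('count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, consider the most popular words in the titles throughout the dataset."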
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_titles = full_df['title'].apply(tokenize_text).values\n", "flat_list = [item.lower() for sublist in all_titles for item in sublist]\n", "word_counts = list(Counter(flat_list).items())\n", "word_counts_df = pd.DataFrame(word_counts, columns=['word', 'count']).sort_values(by=['count', 'word'], ascending=False)\n", "\n", "top10_word_df = word_counts_df.iloc[:10]\n", "# Word translations\n", "translation_series = pd.Series([\n", " 'Russia', \n", " 'Russian',\n", " 'USA', \n", " 'new',\n", " 'year',\n", " 'Moscow', \n", " 'call',\n", " 'court',\n", " 'dollar',\n", " 'person'\n", "], index=top10_word_df.index)\n", "\n", "top10_word_df = pd.concat([translation_series, top10_word_df], axis=1)\n", "top10_word_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So across the dataset as a whole, \"sanctions\" is not an especially popular word." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also consider the number of articles in this topic/tag combination in different years." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_world_politics = full_df[(full_df['topic'] == 'Мир (World)') & (full_df['tags'] == 'Политика (Politics)')]\n", "year = all_world_politics['date'].apply(lambda x: x.year)\n", "all_world_politics_group = all_world_politics.groupby(year)\n", "all_world_politics_group['title'].count().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Conclusion**\n", "\n", "* Day of month, day of week and season can be useful as features. For example, in winter and summer the currency rate increases, and in the off-season, on the contrary, it decreases.\n", "* During sharp changes in the exchange rate, most publications were about sanctions against Russia; the most popular tag is \"Politics\" and the most popular topic is \"World\".\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del change999_df, world_politics, titles, flat_list, word_counts_df, all_world_politics, all_world_politics_group\n", "gc.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 5. Data preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill N/A values and separate the target variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y, X = full_df['change%'], full_df.drop('change%', axis=1)\n", "X['topic'] = X['topic'].fillna('Empty')\n", "X['tags'] = X['tags'].fillna('Empty')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform the categorical features with one-hot encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "one_hot_topics = OneHotEncoder().fit_transform(X[['topic']])\n", "one_hot_tags = OneHotEncoder().fit_transform(X[['tags']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 6. 
Feature engineering and description " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from scipy.sparse import csr_matrix, hstack, load_npz, save_npz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Text features processing**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform preprocessing of the text features, reducing the different word forms to a single lemma" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# preprocessing titles (takes a lot of time)\n", "#titles_normalized = X['title'].apply(preprocess_text)\n", "#titles_normalized.to_csv('./data/titles_normalized.csv', header=True, index_label='idx')\n", "\n", "# loading preprocessed titles\n", "titles_normalized = pd.read_csv('./data/titles_normalized.csv', index_col='idx')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# preprocessing texts (takes a lot of time)\n", "#text_normalized = X['text'].apply(preprocess_text)\n", "#text_normalized.to_csv('./data/text_normalized.csv', header=True, index_label='idx')\n", "\n", "# loading preprocessed texts\n", "text_normalized = pd.read_csv('./data/text_normalized.csv', index_col='idx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perform TF-IDF encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# vectorize titles and texts (takes a lot of time)\n", "# title_tfidf = TfidfVectorizer(max_features=100000).fit_transform(titles_normalized['title'])\n", "# save_npz('./data/title_tfidf.npz', title_tfidf)\n", "# text_tfidf = TfidfVectorizer(max_features=100000).fit_transform(text_normalized['text']) \n", "# save_npz('./data/text_tfidf.npz', text_tfidf)\n", "\n", "# loading vectorized data\n", "title_tfidf = load_npz('./data/title_tfidf.npz')\n", "text_tfidf = load_npz('./data/text_tfidf.npz') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add a \"world politics, sanctions-related\" feature, because such news is most frequent when the exchange rate changes sharply." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1 if the news is in the 'World' topic, tagged 'Politics', and mentions sanctions in the title\n", "wpsr = full_df.apply(lambda row: 1 if row['topic'] == 'Мир (World)'\n", " and row['tags'] == 'Политика (Politics)'\n", " and 'санкция' in row['title'] # 'санкция' = 'sanction'\n", " else 0,\n", " axis=1) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wpsr.value_counts()"
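] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick optional check (a sketch, not part of the original pipeline), compare the target statistics on sanctions-related news versus the rest:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# group the target by the binary wpsr flag; y is the 'change%' series defined in Part 5\n", "pd.DataFrame({'wpsr': wpsr, 'change%': y}).groupby('wpsr')['change%'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Date features processing**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will try to get useful data from the date of publication, based on the graphs obtained in Part 4 above."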
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "date_features_df = pd.DataFrame(index=full_df.index)\n", "\n", "# in summer and winter, an increase in the rate is observed, and in the off-season, a decrease\n", "date_features_df['winter-summer'] = full_df['date'].apply(lambda x: 1 if x.month in [1, 2, 5, 6, 7, 8, 12] else 0)\n", "date_features_df['off_season'] = full_df['date'].apply(lambda x: 1 if x.month in [3, 4, 9, 10] else 0)\n", "\n", "threshold = 1e-2\n", "# Select the days of the month on which the rate on average rises or falls\n", "pos_change, zero_change, neg_change = [], [], []\n", "for i, mean_change in enumerate(mean_values[1]['change%']):\n", " if mean_change > threshold:\n", " pos_change.append(i + 1) # day of month is 1-based\n", " elif -threshold <= mean_change <= threshold:\n", " zero_change.append(i + 1)\n", " else:\n", " neg_change.append(i + 1)\n", "pos_change, zero_change, neg_change = set(pos_change), set(zero_change), set(neg_change)\n", "\n", "date_features_df['pos_change_dom'] = full_df['date'].apply(lambda x: 1 if x.day in pos_change else 0)\n", "date_features_df['zero_change_dom'] = full_df['date'].apply(lambda x: 1 if x.day in zero_change else 0)\n", "date_features_df['neg_change_dom'] = full_df['date'].apply(lambda x: 1 if x.day in neg_change else 0)\n", "\n", "# Select the days of the week on which the rate on average goes up or down\n", "pos_change, zero_change, neg_change = [], [], []\n", "for i, mean_change in enumerate(mean_values[2]['change%']):\n", " if mean_change > threshold:\n", " pos_change.append(i) # weekday() is 0-based, so use i as is\n", " elif -threshold <= mean_change <= threshold:\n", " zero_change.append(i)\n", " else:\n", " neg_change.append(i)\n", "pos_change, zero_change, neg_change = set(pos_change), set(zero_change), set(neg_change)\n", "\n", "date_features_df['pos_change_dow'] = full_df['date'].apply(lambda x: 1 if x.weekday() in pos_change else 0)\n", "date_features_df['zero_change_dow'] = full_df['date'].apply(lambda x: 1 if x.weekday() in zero_change else 0)\n", "date_features_df['neg_change_dow'] = full_df['date'].apply(lambda x: 1 if x.weekday() in neg_change else 0)\n", "\n", "date_features_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(13, 7)) \n", "sns.heatmap(date_features_df.corr('spearman'), annot=True, fmt='.2f', cmap=\"YlGnBu\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting samples will consist of the following features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_sparse = csr_matrix(hstack([\n", " wpsr.values.reshape(-1, 1),\n", " date_features_df,\n", " one_hot_topics, \n", " one_hot_tags, \n", " title_tfidf, \n", " text_tfidf,\n", "]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "del titles_normalized, text_normalized,\\\n", " title_tfidf, text_tfidf, date_features_df,\\\n", " wpsr, one_hot_topics, one_hot_tags\n", "gc.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 7. Cross-validation, hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 8. 
Validation and learning curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Memory usage optimization\n", "del full_df, X\n", "gc.collect()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split, validation_curve, learning_curve, cross_val_score, GridSearchCV\n", "from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV\n", "from sklearn.metrics import mean_squared_error\n", "from vowpalwabbit.sklearn_vw import VWRegressor\n", "from sklearn.ensemble import BaggingRegressor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_with_err(x, data, **kwargs):\n", " mu, std = data.mean(1), data.std(1)\n", " lines = plt.plot(x, mu, '-', **kwargs)\n", " plt.fill_between(x, mu - std, mu + std, edgecolor='none',\n", " facecolor=lines[0].get_color(), alpha=0.2)\n", " \n", "def plot_validation_curve_gscv(gscv, param, logx=False):\n", " plot_func = plt.semilogx if logx else plt.plot\n", " res = gscv.cv_results_\n", " x = [p[param] for p in res['params']]\n", "\n", " # negate to convert neg-MSE back to MSE\n", " mu, std = -res['mean_train_score'], res['std_train_score']\n", " lines = plot_func(x, mu, label='train')\n", " plt.fill_between(x, mu - std, mu + std, edgecolor='none', facecolor=lines[0].get_color(), alpha=0.2)\n", "\n", " # negate to convert neg-MSE back to MSE\n", " mu, std = -res['mean_test_score'], res['std_test_score']\n", " lines = plot_func(x, mu, label='test')\n", " plt.fill_between(x, mu - std, mu + std, edgecolor='none', facecolor=lines[0].get_color(), alpha=0.2)\n", " plt.legend()\n", " plt.grid(True)\n", " \n", "def plot_learning_curve(reg, train_sizes, X, y):\n", " N_train, val_train, val_test = learning_curve(reg, X, y, \n", " train_sizes=train_sizes, \n", " cv=KFold(n_splits=3, shuffle=True, random_state=17),\n", " n_jobs=-1, scoring='neg_mean_squared_error')\n", " # negate to convert neg-MSE back to MSE\n", " plot_with_err(N_train, -val_train, label='training scores')\n", " plot_with_err(N_train, -val_test, label='validation scores')\n", " plt.xlabel('Training Set Size'); \n", " plt.ylabel('mse')\n", " plt.legend()\n", " plt.grid(True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X_sparse, y, test_size=0.25, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_sizes = np.linspace(0.05, 1, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. 
Ridge (linear model with L2 regularization and 'sag' solver)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "parameters = {'alpha': np.logspace(-8, 2, 10)}\n", "ridge_model = Ridge(random_state=17, solver='sag')\n", "ridge_gscv = GridSearchCV(ridge_model, parameters, \n", " cv=KFold(n_splits=3, shuffle=True, random_state=17),\n", " scoring='neg_mean_squared_error', n_jobs=-1,\n", " return_train_score=True, verbose=5)\n", "ridge_gscv.fit(X_train, y_train)\n", "# negate to convert neg-MSE back to MSE\n", "print(-ridge_gscv.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_validation_curve_gscv(ridge_gscv, 'alpha', True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(ridge_gscv.best_estimator_, train_sizes, X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Bagging (base model is Ridge)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "parameters = {'n_estimators': np.arange(2, 19, 4)}\n", "ridge_model_base = Ridge(random_state=17, solver='sag', alpha=ridge_gscv.best_params_['alpha'])\n", "bagging_model = BaggingRegressor(ridge_model_base, random_state=17, verbose=True)\n", "bagging_gscv = GridSearchCV(bagging_model, parameters, \n", " cv=KFold(n_splits=3, shuffle=True, random_state=17), scoring='neg_mean_squared_error', \n", " n_jobs=-1, return_train_score=True, verbose=5)\n", "bagging_gscv.fit(X_train, y_train)\n", "# negate to convert neg-MSE back to MSE\n", "print(-bagging_gscv.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_validation_curve_gscv(bagging_gscv, 'n_estimators')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(bagging_gscv.best_estimator_, train_sizes, X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bagging does not improve the result: as the graphs show, the scores barely change regardless of the number of models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Vowpal Wabbit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "parameters = {'l2': np.logspace(-8, -2, 6)}\n", "vw_model = VWRegressor(loss_function='squared', random_seed=17)\n", "vw_gscv = GridSearchCV(vw_model, parameters, \n", " cv=KFold(n_splits=3, shuffle=True, random_state=17),\n", " scoring='neg_mean_squared_error', n_jobs=-1,\n", " return_train_score=True, verbose=True)\n", "vw_gscv.fit(X_train, y_train)\n", "# negate to convert neg-MSE back to MSE\n", "print(-vw_gscv.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_validation_curve_gscv(vw_gscv, 'l2', logx=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(vw_gscv.best_estimator_, train_sizes, X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is almost the same as for Ridge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 9. 
Prediction for hold-out and test samples " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_best_model = ridge_gscv.best_estimator_\n", "ridge_pred = ridge_best_model.predict(X_test)\n", "mean_squared_error(y_test, ridge_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vw_best_model = vw_gscv.best_estimator_\n", "vw_pred = vw_best_model.predict(X_test)\n", "mean_squared_error(y_test, vw_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to blend the two predictions with a weighted average" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coeffs = np.arange(0.1, 0.91, 0.1)\n", "mse = [mean_squared_error(y_test, ridge_pred * c + vw_pred * (1 - c)) for c in coeffs]\n", "m = np.argmin(mse)\n", "coeffs[m], mse[m]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First 100 predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c = coeffs[m]\n", "prediction = ridge_pred * c + vw_pred * (1 - c)\n", "plt.figure(figsize=(20, 7))\n", "plt.plot(prediction[:100], \"g\", label=\"prediction\", linewidth=2.0)\n", "plt.plot(y_test[:100].values, label=\"actual\", linewidth=2.0)\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last 100 predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(20, 7))\n", "plt.plot(prediction[-100:], \"g\", label=\"prediction\", linewidth=2.0)\n", "plt.plot(y_test[-100:].values, label=\"actual\", linewidth=2.0)\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the comparison of predictions and true values, we can conclude that the model has failed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 10. Model evaluation with metrics description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three models were considered:\n", "1. Ridge\n", "1. Bagging\n", "1. Vowpal Wabbit\n", "\n", "The selected metric is MSE.\n", "\n", "MSE is appropriate here because it penalizes large errors strongly, and large errors are exactly what we want to avoid.\n", "\n", "An error of around ~0.8 is quite high, which indicates the poor quality of the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 11. Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During the individual project, the following was done:\n", "1. The necessary dataset was created;\n", "1. The values in the dataset, including the values of the target variable, were explored;\n", "1. The extreme values of the target variable were explored, along with which values in the dataset affect the target variable;\n", "1. An attempt was made to create new features;\n", "1. Several models were trained on these features;\n", "1. The results were unsuccessful.\n", "\n", "I tried several methods and approaches to the solution, but did not reach a positive result. I would be grateful for help in finding errors."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }