{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Open Machine Learning Course\n", "
Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First baseline in the Kaggle Inclass [competition](https://www.kaggle.com/c/how-good-is-your-medium-article) \"How good is your Medium article?\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "import json\n", "from tqdm import tqdm_notebook\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.metrics import mean_absolute_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code strips all HTML tags from article content." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from html.parser import HTMLParser\n", "\n", "class MLStripper(HTMLParser):\n", "    def __init__(self):\n", "        self.reset()\n", "        self.strict = False\n", "        self.convert_charrefs = True\n", "        self.fed = []\n", "    def handle_data(self, d):\n", "        # Collect raw text chunks, skipping the tags themselves\n", "        self.fed.append(d)\n", "    def get_data(self):\n", "        return ''.join(self.fed)\n", "\n", "def strip_tags(html):\n", "    s = MLStripper()\n", "    s.feed(html)\n", "    return s.get_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define two paths: one to the raw data (downloaded from the competition's page and ungzipped) and one to the processed data. Change these if you'd like to." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PATH_TO_RAW_DATA = '../../raw_data/'\n", "PATH_TO_PROCESSED_DATA = '../../processed_data/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume you have downloaded all the data from the competition's [page](https://www.kaggle.com/c/how-good-is-your-medium-article/data) into the PATH_TO_RAW_DATA folder and ungzipped the `.gz` files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -l $PATH_TO_RAW_DATA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A supplementary function to read a JSON line without crashing on invalid escape characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def read_json_line(line=None):\n", "    result = None\n", "    try:\n", "        result = json.loads(line)\n", "    except Exception as e:\n", "        # Find the offending character index from the error message:\n", "        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))\n", "        # Replace the offending character with a space and retry:\n", "        new_line = list(line)\n", "        new_line[idx_to_replace] = ' '\n", "        new_line = ''.join(new_line)\n", "        return read_json_line(line=new_line)\n", "    return result" ] }
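, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration, here is what these two helpers do on toy strings (made-up examples, not actual competition data):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy inputs, just to show the helpers' behavior\n", "print(strip_tags('<p>Hello, <b>Medium</b>!</p>'))  # Hello, Medium!\n", "print(read_json_line('{\"content\": \"ok\"}'))  # {'content': 'ok'}" ] }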
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def preprocess(path_to_inp_json_file, path_to_out_txt_file):\n", " with open(path_to_inp_json_file, encoding='utf-8') as inp_file, \\\n", " open(path_to_out_txt_file, 'w', encoding='utf-8') as out_file:\n", " for line in tqdm_notebook(inp_file):\n", " json_data = read_json_line(line)\n", " content = json_data['content'].replace('\\n', ' ').replace('\\r', ' ')\n", " content_no_html_tags = strip_tags(content)\n", " out_file.write(content_no_html_tags + '\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "preprocess(path_to_inp_json_file=os.path.join(PATH_TO_RAW_DATA, 'train.json'),\n", " path_to_out_txt_file=os.path.join(PATH_TO_PROCESSED_DATA, 'train_raw_content.txt'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "preprocess(path_to_inp_json_file=os.path.join(PATH_TO_RAW_DATA, 'test.json'),\n", " path_to_out_txt_file=os.path.join(PATH_TO_PROCESSED_DATA, 'test_raw_content.txt'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wc -l $PATH_TO_PROCESSED_DATA/*_raw_content.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use a linear model (`Ridge`) with a very simple feature extractor – `CountVectorizer`, meaning that we resort to the Bag-of-Words approach. For now, we are leaving only 50k features. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv = CountVectorizer(max_features=50000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "with open(os.path.join(PATH_TO_PROCESSED_DATA, 'train_raw_content.txt'), encoding='utf-8') as input_train_file:\n", " X_train = cv.fit_transform(input_train_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "with open(os.path.join(PATH_TO_PROCESSED_DATA, 'test_raw_content.txt'), encoding='utf-8') as input_test_file:\n", " X_test = cv.transform(input_test_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train.shape, X_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read targets from file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_target = pd.read_csv(os.path.join(PATH_TO_RAW_DATA, 'train_log1p_recommends.csv'), \n", " index_col='id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train = train_target['log_recommends'].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a 30%-holdout set. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_part_size = int(0.7 * train_target.shape[0])\n", "X_train_part = X_train[:train_part_size, :]\n", "y_train_part = y_train[:train_part_size]\n", "X_valid = X_train[train_part_size:, :]\n", "y_valid = y_train[train_part_size:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to fit a linear model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge = Ridge(random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "ridge.fit(X_train_part, y_train_part);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ridge_pred = ridge.predict(X_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot predictions and targets for the holdout set. Recall that these are #recommendations (= #claps) of Medium articles with the `np.log1p` transformation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(y_valid, bins=30, alpha=.5, color='red', label='true', range=(0,10));\n", "plt.hist(ridge_pred, bins=30, alpha=.5, color='green', label='pred', range=(0,10));\n", "plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the prediction is far from perfect, and we get MAE $\\approx$ 1.3 that corresponds to $\\approx$ 2.7 error in #recommendations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "valid_mae = mean_absolute_error(y_valid, ridge_pred)\n", "valid_mae, np.expm1(valid_mae)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, train the model on the full accessible training set, make predictions for the test set and form a submission file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "ridge.fit(X_train, y_train);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "ridge_test_pred = ridge.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def write_submission_file(prediction, filename,\n", " path_to_sample=os.path.join(PATH_TO_RAW_DATA, 'sample_submission.csv')):\n", " submission = pd.read_csv(path_to_sample, index_col='id')\n", " \n", " submission['log_recommends'] = prediction\n", " submission.to_csv(filename)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "write_submission_file(prediction=ridge_test_pred, \n", " filename='first_ridge.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this, you'll get 1.91185 on [public leaderboard](https://www.kaggle.com/c/how-good-is-your-medium-article/leaderboard). This is much higher than our validation MAE. This indicates that the target distribution in test set somewhat differs from that of the training set (recent Medium articles are more popular). This shouldn't confuse us as long as we see a correlation between local improvements and improvements on the leaderboard. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some ideas for improvement:\n", "- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on\n", "- You may not ignore HTML and extract some features from there\n", "- You'd better experiment with your validation scheme. 
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }