{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Data Challenge : Mangaki - September 2017
1.1  Reading data
1.2  First prediction model
1.3  Better prediction models
1.3.1  Using watched.csv
1.3.1.1  Mapping string ratings to the probability of seeing the movie
1.3.2  Using titles.csv
1.4  Evaluation from the data challenge platform
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Challenge : Mangaki - September 2017\n", "\n", "> - See [here for more information](http://universityofbigdata.net/competition/5085548788056064?lang=en).\n", "> - Author: [Lilian Besson](http://perso.crans.org/besson/).\n", "> - License: [MIT License](https://lbesson.mit-license.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "We have a few CSV files, let's start by reading them." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 lilian lilian 350K juin 27 15:10 titles.csv\r\n", "-rw-r--r-- 1 lilian lilian 3,2M juin 27 15:25 watched.csv\r\n", "-rw-r--r-- 1 lilian lilian 1010K juin 27 15:34 test.csv\r\n", "-rw-r--r-- 1 lilian lilian 124K juin 28 17:55 train.csv\r\n", "-rw-r--r-- 1 lilian lilian 2,4M juin 28 17:57 submission.csv\r\n" ] } ], "source": [ "!ls -larth *.csv" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'submission.csv' -> 'submission.csv.old'\r\n" ] } ], "source": [ "!cp -vf submission.csv submission.csv.old" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv(\"train.csv\")\n", "test = pd.read_csv(\"test.csv\")\n", "titles = pd.read_csv(\"titles.csv\")\n", "watched = pd.read_csv(\"watched.csv\")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['album', 'anime', 'manga'], dtype=object)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.unique(titles.category)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just 
to check they have correctly been read:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id rating\n", "0 50 4041 0\n", "1 508 1713 0\n", "2 1780 7053 1\n", "3 658 8853 0\n", "4 1003 9401 0" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "11112" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(1, 1982)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(2, 9884)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[:5]\n", "len(train)\n", "min(train['user_id']), max(train['user_id'])\n", "min(train['work_id']), max(train['work_id'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id\n", "0 486 1086\n", "1 1509 3296\n", "2 617 1086\n", "3 270 9648\n", "4 459 3647" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "100015" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 1982)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(2, 9884)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test[:5]\n", "len(test)\n", "min(test['user_id']), max(test['user_id'])\n", "min(test['work_id']), max(test['work_id'])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id rating\n", "0 717 8025 dislike\n", "1 1106 1027 neutral\n", "2 1970 3949 neutral\n", "3 1685 9815 like\n", "4 1703 3482 like" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "198970" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 1982)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "(0, 9896)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "watched[:5]\n", "len(watched)\n", "min(watched['user_id']), max(watched['user_id'])\n", "min(watched['work_id']), max(watched['work_id'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First prediction model\n", "\n", "- For each work, compute the empirical average `rating` over the users who rated it in the train data.\n", "- Then simply use this average as the prediction for that work in the test data, whatever the user." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "submission = test.copy()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "total_average_rating = train.rating.mean()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id\n", "0 486 1086\n", "1 1509 3296\n", "2 617 1086\n", "3 270 9648\n", "4 459 3647" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "100015" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission[:5]\n", "len(submission)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_rating\n", "2 0\n", "3 0\n", "4 0\n", "5 0\n", "9 0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2706" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_ratings = pd.DataFrame(data={'mean_rating': 0}, index=works_id)\n", "mean_ratings[:5]\n", "len(mean_ratings)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_rating\n", "2 0.230769\n", "3 NaN\n", "4 0.500000\n", "5 0.333333\n", "9 0.200000" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2706" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computed_means = pd.DataFrame(data={'mean_rating': train.groupby('work_id').mean()['rating']}, index=works_id)\n", "computed_means[:5]\n", "len(computed_means)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "mean_ratings.update(computed_means)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_rating\n", "2 0.230769\n", "3 0.000000\n", "4 0.500000\n", "5 0.333333\n", "9 0.200000\n", "22 0.000000\n", "23 0.000000\n", "24 0.000000\n", "27 0.333333\n", "28 0.333333" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "2706" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_ratings[:10]\n", "len(mean_ratings)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "submission = submission.join(mean_ratings, on='work_id')\n", "submission.rename(columns={'mean_rating': 'prob_willsee'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# safety net: replace any NaN left by the join with the global average rating\n", "submission.fillna(value=total_average_rating, inplace=True)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id prob_willsee\n", "0 486 1086 0.440000\n", "1 1509 3296 0.000000\n", "2 617 1086 0.440000\n", "3 270 9648 0.529412\n", "4 459 3647 0.333333\n", "5 41 3562 0.724138\n", "6 1780 9156 0.333333\n", "7 284 5502 0.500000\n", "8 1521 7250 0.500000\n", "9 130 8209 0.500000" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submission[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save it to `submission_naive1.csv`:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "submission.to_csv(\"submission_naive1.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 lilian lilian 2,0M sept. 24 14:40 submission_naive1.csv\r\n" ] } ], "source": [ "!ls -larth submission_naive1.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better prediction models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `watched.csv`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bonus data set `watched` can give a lot of information. It has about 200,000 entries, against about 100,000 in `test.csv`." 
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(100015, 198970)" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(test), len(watched)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['dislike', 'like', 'love', 'neutral']" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings = np.unique(watched.rating).tolist()\n", "ratings" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id rating\n", "0 717 8025 dislike\n", "1 1106 1027 neutral\n", "2 1970 3949 neutral\n", "3 1685 9815 like\n", "4 1703 3482 like" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "watched[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mapping string ratings to the probability of seeing the movie\n", "Using the `(user, work)` pairs of the train data that also appear in `watched`, we can learn to map the string ratings, i.e., `'dislike', 'neutral', 'like', 'love'`, to the probability of having seen the movie." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "watched.rename(columns={'rating': 'strrating'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id strrating\n", "0 717 8025 dislike\n", "1 1106 1027 neutral\n", "2 1970 3949 neutral\n", "3 1685 9815 like\n", "4 1703 3482 like" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "watched[:5]" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id work_id rating\n", "0 50 4041 0\n", "1 508 1713 0\n", "2 1780 7053 1\n", "3 658 8853 0\n", "4 1003 9401 0" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there pairs `(user, work)` for which both train data and watched data are available (i.e., both see/notsee and liked/disliked)?" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [user_id, work_id, rating, strrating]\n", "Index: []" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.merge(watched, on=['user_id', 'work_id'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And what about test data?" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [user_id, work_id, strrating]\n", "Index: []" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.merge(watched, on=['user_id', 'work_id'])" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id_x work_id user_id_y strrating\n", "0 486 1086 1120 neutral\n", "1 486 1086 1934 dislike\n", "2 486 1086 684 neutral\n", "3 486 1086 45 neutral\n", "4 486 1086 1245 neutral\n", "5 486 1086 1705 dislike\n", "6 486 1086 1671 like\n", "7 486 1086 1082 neutral\n", "8 486 1086 1672 dislike\n", "9 486 1086 1606 dislike\n", "10 486 1086 1392 dislike\n", "11 486 1086 1463 dislike\n", "12 486 1086 1668 neutral\n", "13 486 1086 657 neutral\n", "14 486 1086 466 dislike\n", "15 486 1086 114 neutral\n", "16 486 1086 1795 like\n", "17 486 1086 843 neutral\n", "18 486 1086 910 dislike\n", "19 486 1086 831 neutral\n", "20 486 1086 1416 dislike\n", "21 486 1086 356 neutral\n", "22 486 1086 1125 dislike\n", "23 486 1086 1587 dislike\n", "24 486 1086 1962 like\n", "25 486 1086 208 dislike\n", "26 486 1086 231 neutral\n", "27 486 1086 610 dislike\n", "28 486 1086 1333 like\n", "29 486 1086 181 neutral\n", "... ... ... ... ...\n", "17983485 1238 4595 1635 like\n", "17983486 1238 4595 1268 like\n", "17983487 1238 4595 1559 like\n", "17983488 1238 4595 561 dislike\n", "17983489 1238 4595 872 neutral\n", "17983490 1238 4595 1802 neutral\n", "17983491 1238 4595 297 like\n", "17983492 1238 4595 274 like\n", "17983493 1238 4595 968 neutral\n", "17983494 1238 4595 962 dislike\n", "17983495 425 4595 1635 like\n", "17983496 425 4595 1268 like\n", "17983497 425 4595 1559 like\n", "17983498 425 4595 561 dislike\n", "17983499 425 4595 872 neutral\n", "17983500 425 4595 1802 neutral\n", "17983501 425 4595 297 like\n", "17983502 425 4595 274 like\n", "17983503 425 4595 968 neutral\n", "17983504 425 4595 962 dislike\n", "17983505 802 4595 1635 like\n", "17983506 802 4595 1268 like\n", "17983507 802 4595 1559 like\n", "17983508 802 4595 561 dislike\n", "17983509 802 4595 872 neutral\n", "17983510 802 4595 1802 neutral\n", "17983511 802 4595 297 like\n", "17983512 802 4595 274 like\n", "17983513 802 4595 968 neutral\n", "17983514 802 4595 962 dislike\n", "\n", "[17983515 
rows x 4 columns]" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.merge(watched, on=['work_id'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No: neither the train data nor the test data shares a `(user, work)` pair with `watched`. So we can forget about the `user_id`, and instead learn how to map liked/disliked to see/notsee for each work." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " user_id_x work_id strrating user_id_y rating\n", "0 717 8025 dislike 863 0\n", "1 717 8025 dislike 329 0\n", "2 717 8025 dislike 1046 0\n", "3 717 8025 dislike 794 0\n", "4 717 8025 dislike 820 1" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_train = watched.merge(train, on='work_id')\n", "all_train[:5]" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "del all_train['user_id_x']\n", "del all_train['user_id_y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two `user_id` columns are deleted, as they are no longer needed." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " work_id strrating rating\n", "0 8025 dislike 0\n", "1 8025 dislike 0\n", "2 8025 dislike 0\n", "3 8025 dislike 0\n", "4 8025 dislike 1" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_train[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can first get the average rating of each work:" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "work_id\n", "2 0.230769\n", "4 0.500000\n", "5 0.333333\n", "9 0.200000\n", "23 0.000000\n", "24 0.000000\n", "27 0.333333\n", "28 0.333333\n", "33 0.250000\n", "48 0.400000\n", "Name: rating, dtype: float64" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_train.groupby('work_id').rating.mean()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table `all_train` now contains, for each work, a list of pairs `(strrating, rating)`.\n", "These could be condensed into a single mapping of the following form, for instance with hand-picked values:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "mapping_strrating_probwillsee = {\n", " 'dislike': 0,\n", " 'neutral': 0.50,\n", " 'like': 0.75,\n", " 'love': 1,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manually, for instance for one work:" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " work_id strrating rating\n", "0 8025 dislike 0\n", "1 8025 dislike 0\n", "2 8025 dislike 0\n", "3 8025 dislike 0\n", "4 8025 dislike 1\n", "5 8025 dislike 0\n", "6 8025 dislike 0\n", "7 8025 dislike 1\n", "8 8025 dislike 0\n", "9 8025 dislike 0\n", "10 8025 dislike 0\n", "11 8025 dislike 0\n", "12 8025 dislike 0\n", "13 8025 dislike 1\n", "14 8025 dislike 0\n", "15 8025 dislike 1\n", "16 8025 dislike 0\n", "17 8025 dislike 1\n", "18 8025 dislike 1\n", "19 8025 dislike 0\n", "20 8025 dislike 0\n", "21 8025 dislike 0\n", "22 8025 dislike 0\n", "23 8025 dislike 1\n", "24 8025 dislike 0\n", "25 8025 dislike 0\n", "26 8025 dislike 1\n", "27 8025 dislike 0\n", "28 8025 dislike 0\n", "29 8025 dislike 0\n", "... ... ... ...\n", "17754 8025 dislike 0\n", "17755 8025 dislike 0\n", "17756 8025 dislike 0\n", "17757 8025 dislike 0\n", "17758 8025 dislike 0\n", "17759 8025 dislike 1\n", "17760 8025 dislike 0\n", "17761 8025 dislike 1\n", "17762 8025 dislike 0\n", "17763 8025 dislike 1\n", "17764 8025 dislike 1\n", "17784 8025 dislike 0\n", "17785 8025 dislike 0\n", "17786 8025 dislike 0\n", "17787 8025 dislike 0\n", "17788 8025 dislike 1\n", "17789 8025 dislike 0\n", "17790 8025 dislike 0\n", "17791 8025 dislike 1\n", "17792 8025 dislike 0\n", "17793 8025 dislike 0\n", "17794 8025 dislike 0\n", "17795 8025 dislike 0\n", "17796 8025 dislike 0\n", "17797 8025 dislike 1\n", "17798 8025 dislike 0\n", "17799 8025 dislike 1\n", "17800 8025 dislike 0\n", "17801 8025 dislike 1\n", "17802 8025 dislike 1\n", "\n", "[4294 rows x 3 columns]" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')]" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.31578947368421051" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_train[all_train.work_id == 
8025].rating.mean()" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4294" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.31578947368421051" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating)\n", "all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating.mean()" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4598" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.31578947368421051" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating)\n", "all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating.mean()" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8151" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.31578947368421051" ] }, "execution_count": 141, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating)\n", "all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating.mean()" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "817" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.31578947368421051" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 
'love')].rating)\n", "all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's weird! The mean `rating` for work 8025 is exactly the same (about 0.316) whatever the value of `strrating` (dislike, neutral, like or love), so for this work the string rating does not seem to tell us anything about `rating`. This may come from how `all_train` pairs ratings with `strrating` rows for the same work, and is worth checking." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `titles.csv`\n", "I don't think I want to use the titles themselves, but grouping the works by category might help." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['album', 'anime', 'manga']" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories = np.unique(titles.category).tolist()\n", "categories" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There is 2808 work(s) in category 'manga'.\n", "There is 1 work(s) in category 'album'.\n", "There is 7088 work(s) in category 'anime'.\n" ] } ], "source": [ "# Count how many works fall in each category\n", "for cat in categories:\n", "    print(\"There is {:>5} work(s) in category '{}'.\".format(sum(titles.category == cat), cat))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `'album'` category contains a single work, so let's give it the same code as `'anime'`." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "# Map each category to an integer code; 'album' shares the code of 'anime'\n", "categories = {\n", "    'anime': 0,\n", "    'album': 0,\n", "    'manga': 1,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation from the data challenge platform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO!" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "196px", "width": "251px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }