{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"toc": "true"
},
"source": [
"# Table of Contents\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Challenge : Mangaki - September 2017\n",
"\n",
"> - See [here for more information](http://universityofbigdata.net/competition/5085548788056064?lang=en).\n",
"> - Author: [Lilian Besson](http://perso.crans.org/besson/).\n",
"> - License: [MIT License](https://lbesson.mit-license.org/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading data\n",
"We have a few CSV files, let start by reading them."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from tqdm import tqdm\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 lilian lilian 350K juin 27 15:10 titles.csv\r\n",
"-rw-r--r-- 1 lilian lilian 3,2M juin 27 15:25 watched.csv\r\n",
"-rw-r--r-- 1 lilian lilian 1010K juin 27 15:34 test.csv\r\n",
"-rw-r--r-- 1 lilian lilian 124K juin 28 17:55 train.csv\r\n",
"-rw-r--r-- 1 lilian lilian 2,4M juin 28 17:57 submission.csv\r\n"
]
}
],
"source": [
"!ls -larth *.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'submission.csv' -> 'submission.csv.old'\r\n"
]
}
],
"source": [
"!cp -vf submission.csv submission.csv.old"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"train = pd.read_csv(\"train.csv\")\n",
"test = pd.read_csv(\"test.csv\")\n",
"titles = pd.read_csv(\"titles.csv\")\n",
"watched = pd.read_csv(\"watched.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['album', 'anime', 'manga'], dtype=object)"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.unique(titles.category)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to check they have correctly been read:"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 50 | \n",
" 4041 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 508 | \n",
" 1713 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1780 | \n",
" 7053 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 658 | \n",
" 8853 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1003 | \n",
" 9401 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id rating\n",
"0 50 4041 0\n",
"1 508 1713 0\n",
"2 1780 7053 1\n",
"3 658 8853 0\n",
"4 1003 9401 0"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"11112"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(1, 1982)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(2, 9884)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[:5]\n",
"len(train)\n",
"min(train['user_id']), max(train['user_id'])\n",
"min(train['work_id']), max(train['work_id'])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 486 | \n",
" 1086 | \n",
"
\n",
" \n",
" 1 | \n",
" 1509 | \n",
" 3296 | \n",
"
\n",
" \n",
" 2 | \n",
" 617 | \n",
" 1086 | \n",
"
\n",
" \n",
" 3 | \n",
" 270 | \n",
" 9648 | \n",
"
\n",
" \n",
" 4 | \n",
" 459 | \n",
" 3647 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id\n",
"0 486 1086\n",
"1 1509 3296\n",
"2 617 1086\n",
"3 270 9648\n",
"4 459 3647"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"100015"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(0, 1982)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(2, 9884)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test[:5]\n",
"len(test)\n",
"min(test['user_id']), max(test['user_id'])\n",
"min(test['work_id']), max(test['work_id'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
"
\n",
" \n",
" 1 | \n",
" 1106 | \n",
" 1027 | \n",
" neutral | \n",
"
\n",
" \n",
" 2 | \n",
" 1970 | \n",
" 3949 | \n",
" neutral | \n",
"
\n",
" \n",
" 3 | \n",
" 1685 | \n",
" 9815 | \n",
" like | \n",
"
\n",
" \n",
" 4 | \n",
" 1703 | \n",
" 3482 | \n",
" like | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id rating\n",
"0 717 8025 dislike\n",
"1 1106 1027 neutral\n",
"2 1970 3949 neutral\n",
"3 1685 9815 like\n",
"4 1703 3482 like"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"198970"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(0, 1982)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"(0, 9896)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"watched[:5]\n",
"len(watched)\n",
"min(watched['user_id']), max(watched['user_id'])\n",
"min(watched['work_id']), max(watched['work_id'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First prediction model\n",
"\n",
"- For each movie, compute the empirical average `rating` of users who saw it, using data from the train data.\n",
"- And simply use this to predict for the other users in test data."
]
},
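{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going step by step, here is a minimal one-cell sketch of this baseline (an illustrative preview using the dataframes loaded above; the detailed, cell-by-cell version follows):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the baseline: average 'rating' per work on the train set,\n",
"# mapped onto the test rows, with the global train mean as a fallback for unseen works.\n",
"per_work_mean = train.groupby('work_id')['rating'].mean()\n",
"baseline = test.copy()\n",
"baseline['prob_willsee'] = baseline['work_id'].map(per_work_mean).fillna(train['rating'].mean())\n",
"baseline[:5]"
]
},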
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"submission = test.copy()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"total_average_rating = train.rating.mean()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 486 | \n",
" 1086 | \n",
"
\n",
" \n",
" 1 | \n",
" 1509 | \n",
" 3296 | \n",
"
\n",
" \n",
" 2 | \n",
" 617 | \n",
" 1086 | \n",
"
\n",
" \n",
" 3 | \n",
" 270 | \n",
" 9648 | \n",
"
\n",
" \n",
" 4 | \n",
" 459 | \n",
" 3647 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id\n",
"0 486 1086\n",
"1 1509 3296\n",
"2 617 1086\n",
"3 270 9648\n",
"4 459 3647"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"100015"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission[:5]\n",
"len(submission)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"works_id = np.unique(np.append(test.work_id.unique(), train.work_id.unique()))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_rating\n",
"2 0\n",
"3 0\n",
"4 0\n",
"5 0\n",
"9 0"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"2706"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_ratings = pd.DataFrame(data={'mean_rating': 0}, index=works_id)\n",
"mean_ratings[:5]\n",
"len(mean_ratings)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 0.230769 | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.200000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_rating\n",
"2 0.230769\n",
"3 NaN\n",
"4 0.500000\n",
"5 0.333333\n",
"9 0.200000"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"2706"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"computed_means = pd.DataFrame(data={'mean_rating': train.groupby('work_id').mean()['rating']}, index=works_id)\n",
"computed_means[:5]\n",
"len(computed_means)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"mean_ratings.update(computed_means)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean_rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 0.230769 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 9 | \n",
" 0.200000 | \n",
"
\n",
" \n",
" 22 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 23 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 24 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 27 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 28 | \n",
" 0.333333 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_rating\n",
"2 0.230769\n",
"3 0.000000\n",
"4 0.500000\n",
"5 0.333333\n",
"9 0.200000\n",
"22 0.000000\n",
"23 0.000000\n",
"24 0.000000\n",
"27 0.333333\n",
"28 0.333333"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"2706"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean_ratings[:10]\n",
"len(mean_ratings)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"submission = submission.join(mean_ratings, on='work_id')\n",
"submission.rename_axis({'mean_rating': 'prob_willsee'}, axis=\"columns\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# in case of mean on empty values\n",
"submission.fillna(value=total_average_rating, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" prob_willsee | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 486 | \n",
" 1086 | \n",
" 0.440000 | \n",
"
\n",
" \n",
" 1 | \n",
" 1509 | \n",
" 3296 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 2 | \n",
" 617 | \n",
" 1086 | \n",
" 0.440000 | \n",
"
\n",
" \n",
" 3 | \n",
" 270 | \n",
" 9648 | \n",
" 0.529412 | \n",
"
\n",
" \n",
" 4 | \n",
" 459 | \n",
" 3647 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 5 | \n",
" 41 | \n",
" 3562 | \n",
" 0.724138 | \n",
"
\n",
" \n",
" 6 | \n",
" 1780 | \n",
" 9156 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" 7 | \n",
" 284 | \n",
" 5502 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" 8 | \n",
" 1521 | \n",
" 7250 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" 9 | \n",
" 130 | \n",
" 8209 | \n",
" 0.500000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id prob_willsee\n",
"0 486 1086 0.440000\n",
"1 1509 3296 0.000000\n",
"2 617 1086 0.440000\n",
"3 270 9648 0.529412\n",
"4 459 3647 0.333333\n",
"5 41 3562 0.724138\n",
"6 1780 9156 0.333333\n",
"7 284 5502 0.500000\n",
"8 1521 7250 0.500000\n",
"9 130 8209 0.500000"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"submission[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let save it to `submission_naive1.csv`:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"submission.to_csv(\"submission_naive1.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-rw-r-- 1 lilian lilian 2,0M sept. 24 14:40 submission_naive1.csv\r\n"
]
}
],
"source": [
"!ls -larth submission_naive1.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Better predicted models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `watched.csv`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bonus data set `watched` can give a lot of information. There is 200000 entries in it and only 100000 in `test.csv`."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(100015, 198970)"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(test), len(watched)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['dislike', 'like', 'love', 'neutral']"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings = np.unique(watched.rating).tolist()\n",
"ratings"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
"
\n",
" \n",
" 1 | \n",
" 1106 | \n",
" 1027 | \n",
" neutral | \n",
"
\n",
" \n",
" 2 | \n",
" 1970 | \n",
" 3949 | \n",
" neutral | \n",
"
\n",
" \n",
" 3 | \n",
" 1685 | \n",
" 9815 | \n",
" like | \n",
"
\n",
" \n",
" 4 | \n",
" 1703 | \n",
" 3482 | \n",
" like | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id rating\n",
"0 717 8025 dislike\n",
"1 1106 1027 neutral\n",
"2 1970 3949 neutral\n",
"3 1685 9815 like\n",
"4 1703 3482 like"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"watched[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Maping string-rating to probability of seeing the movie\n",
"By using the train data `(user, work)` that are also in `watched`, we can learn to map string rating, i.e., `'dislike', 'neutral', 'like', 'love'`, to probability of having see the movie."
]
},
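{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this idea (an illustrative preview only; as it turns out below, we have to join on `work_id` alone, and the notebook does this step by step): join `watched` with `train` on `work_id` and look at the empirical frequency of `rating == 1` for each string rating. The `_str`/`_seen` suffixes are just names chosen for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: cross-join (per work) the string ratings of 'watched' with the 0/1 labels of 'train',\n",
"# then compute the empirical P(seen) for each string rating. This intermediate frame can be large.\n",
"preview = watched.merge(train, on='work_id', suffixes=('_str', '_seen'))\n",
"preview.groupby('rating_str')['rating_seen'].mean()"
]
},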
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"watched.rename_axis({'rating': 'strrating'}, axis=\"columns\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" strrating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
"
\n",
" \n",
" 1 | \n",
" 1106 | \n",
" 1027 | \n",
" neutral | \n",
"
\n",
" \n",
" 2 | \n",
" 1970 | \n",
" 3949 | \n",
" neutral | \n",
"
\n",
" \n",
" 3 | \n",
" 1685 | \n",
" 9815 | \n",
" like | \n",
"
\n",
" \n",
" 4 | \n",
" 1703 | \n",
" 3482 | \n",
" like | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id strrating\n",
"0 717 8025 dislike\n",
"1 1106 1027 neutral\n",
"2 1970 3949 neutral\n",
"3 1685 9815 like\n",
"4 1703 3482 like"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"watched[:5]"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 50 | \n",
" 4041 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 508 | \n",
" 1713 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1780 | \n",
" 7053 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 658 | \n",
" 8853 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1003 | \n",
" 9401 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id work_id rating\n",
"0 50 4041 0\n",
"1 508 1713 0\n",
"2 1780 7053 1\n",
"3 658 8853 0\n",
"4 1003 9401 0"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Is there pairs `(user, work)` for which both train data and watched data are available (i.e., both see/notsee and liked/disliked) ?"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" rating | \n",
" strrating | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [user_id, work_id, rating, strrating]\n",
"Index: []"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.merge(watched, on=['user_id', 'work_id'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And what about test data?"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" work_id | \n",
" strrating | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [user_id, work_id, strrating]\n",
"Index: []"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.merge(watched, on=['user_id', 'work_id'])"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id_x | \n",
" work_id | \n",
" user_id_y | \n",
" strrating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 486 | \n",
" 1086 | \n",
" 1120 | \n",
" neutral | \n",
"
\n",
" \n",
" 1 | \n",
" 486 | \n",
" 1086 | \n",
" 1934 | \n",
" dislike | \n",
"
\n",
" \n",
" 2 | \n",
" 486 | \n",
" 1086 | \n",
" 684 | \n",
" neutral | \n",
"
\n",
" \n",
" 3 | \n",
" 486 | \n",
" 1086 | \n",
" 45 | \n",
" neutral | \n",
"
\n",
" \n",
" 4 | \n",
" 486 | \n",
" 1086 | \n",
" 1245 | \n",
" neutral | \n",
"
\n",
" \n",
" 5 | \n",
" 486 | \n",
" 1086 | \n",
" 1705 | \n",
" dislike | \n",
"
\n",
" \n",
" 6 | \n",
" 486 | \n",
" 1086 | \n",
" 1671 | \n",
" like | \n",
"
\n",
" \n",
" 7 | \n",
" 486 | \n",
" 1086 | \n",
" 1082 | \n",
" neutral | \n",
"
\n",
" \n",
" 8 | \n",
" 486 | \n",
" 1086 | \n",
" 1672 | \n",
" dislike | \n",
"
\n",
" \n",
" 9 | \n",
" 486 | \n",
" 1086 | \n",
" 1606 | \n",
" dislike | \n",
"
\n",
" \n",
" 10 | \n",
" 486 | \n",
" 1086 | \n",
" 1392 | \n",
" dislike | \n",
"
\n",
" \n",
" 11 | \n",
" 486 | \n",
" 1086 | \n",
" 1463 | \n",
" dislike | \n",
"
\n",
" \n",
" 12 | \n",
" 486 | \n",
" 1086 | \n",
" 1668 | \n",
" neutral | \n",
"
\n",
" \n",
" 13 | \n",
" 486 | \n",
" 1086 | \n",
" 657 | \n",
" neutral | \n",
"
\n",
" \n",
" 14 | \n",
" 486 | \n",
" 1086 | \n",
" 466 | \n",
" dislike | \n",
"
\n",
" \n",
" 15 | \n",
" 486 | \n",
" 1086 | \n",
" 114 | \n",
" neutral | \n",
"
\n",
" \n",
" 16 | \n",
" 486 | \n",
" 1086 | \n",
" 1795 | \n",
" like | \n",
"
\n",
" \n",
" 17 | \n",
" 486 | \n",
" 1086 | \n",
" 843 | \n",
" neutral | \n",
"
\n",
" \n",
" 18 | \n",
" 486 | \n",
" 1086 | \n",
" 910 | \n",
" dislike | \n",
"
\n",
" \n",
" 19 | \n",
" 486 | \n",
" 1086 | \n",
" 831 | \n",
" neutral | \n",
"
\n",
" \n",
" 20 | \n",
" 486 | \n",
" 1086 | \n",
" 1416 | \n",
" dislike | \n",
"
\n",
" \n",
" 21 | \n",
" 486 | \n",
" 1086 | \n",
" 356 | \n",
" neutral | \n",
"
\n",
" \n",
" 22 | \n",
" 486 | \n",
" 1086 | \n",
" 1125 | \n",
" dislike | \n",
"
\n",
" \n",
" 23 | \n",
" 486 | \n",
" 1086 | \n",
" 1587 | \n",
" dislike | \n",
"
\n",
" \n",
" 24 | \n",
" 486 | \n",
" 1086 | \n",
" 1962 | \n",
" like | \n",
"
\n",
" \n",
" 25 | \n",
" 486 | \n",
" 1086 | \n",
" 208 | \n",
" dislike | \n",
"
\n",
" \n",
" 26 | \n",
" 486 | \n",
" 1086 | \n",
" 231 | \n",
" neutral | \n",
"
\n",
" \n",
" 27 | \n",
" 486 | \n",
" 1086 | \n",
" 610 | \n",
" dislike | \n",
"
\n",
" \n",
" 28 | \n",
" 486 | \n",
" 1086 | \n",
" 1333 | \n",
" like | \n",
"
\n",
" \n",
" 29 | \n",
" 486 | \n",
" 1086 | \n",
" 181 | \n",
" neutral | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 17983485 | \n",
" 1238 | \n",
" 4595 | \n",
" 1635 | \n",
" like | \n",
"
\n",
" \n",
" 17983486 | \n",
" 1238 | \n",
" 4595 | \n",
" 1268 | \n",
" like | \n",
"
\n",
" \n",
" 17983487 | \n",
" 1238 | \n",
" 4595 | \n",
" 1559 | \n",
" like | \n",
"
\n",
" \n",
" 17983488 | \n",
" 1238 | \n",
" 4595 | \n",
" 561 | \n",
" dislike | \n",
"
\n",
" \n",
" 17983489 | \n",
" 1238 | \n",
" 4595 | \n",
" 872 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983490 | \n",
" 1238 | \n",
" 4595 | \n",
" 1802 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983491 | \n",
" 1238 | \n",
" 4595 | \n",
" 297 | \n",
" like | \n",
"
\n",
" \n",
" 17983492 | \n",
" 1238 | \n",
" 4595 | \n",
" 274 | \n",
" like | \n",
"
\n",
" \n",
" 17983493 | \n",
" 1238 | \n",
" 4595 | \n",
" 968 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983494 | \n",
" 1238 | \n",
" 4595 | \n",
" 962 | \n",
" dislike | \n",
"
\n",
" \n",
" 17983495 | \n",
" 425 | \n",
" 4595 | \n",
" 1635 | \n",
" like | \n",
"
\n",
" \n",
" 17983496 | \n",
" 425 | \n",
" 4595 | \n",
" 1268 | \n",
" like | \n",
"
\n",
" \n",
" 17983497 | \n",
" 425 | \n",
" 4595 | \n",
" 1559 | \n",
" like | \n",
"
\n",
" \n",
" 17983498 | \n",
" 425 | \n",
" 4595 | \n",
" 561 | \n",
" dislike | \n",
"
\n",
" \n",
" 17983499 | \n",
" 425 | \n",
" 4595 | \n",
" 872 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983500 | \n",
" 425 | \n",
" 4595 | \n",
" 1802 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983501 | \n",
" 425 | \n",
" 4595 | \n",
" 297 | \n",
" like | \n",
"
\n",
" \n",
" 17983502 | \n",
" 425 | \n",
" 4595 | \n",
" 274 | \n",
" like | \n",
"
\n",
" \n",
" 17983503 | \n",
" 425 | \n",
" 4595 | \n",
" 968 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983504 | \n",
" 425 | \n",
" 4595 | \n",
" 962 | \n",
" dislike | \n",
"
\n",
" \n",
" 17983505 | \n",
" 802 | \n",
" 4595 | \n",
" 1635 | \n",
" like | \n",
"
\n",
" \n",
" 17983506 | \n",
" 802 | \n",
" 4595 | \n",
" 1268 | \n",
" like | \n",
"
\n",
" \n",
" 17983507 | \n",
" 802 | \n",
" 4595 | \n",
" 1559 | \n",
" like | \n",
"
\n",
" \n",
" 17983508 | \n",
" 802 | \n",
" 4595 | \n",
" 561 | \n",
" dislike | \n",
"
\n",
" \n",
" 17983509 | \n",
" 802 | \n",
" 4595 | \n",
" 872 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983510 | \n",
" 802 | \n",
" 4595 | \n",
" 1802 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983511 | \n",
" 802 | \n",
" 4595 | \n",
" 297 | \n",
" like | \n",
"
\n",
" \n",
" 17983512 | \n",
" 802 | \n",
" 4595 | \n",
" 274 | \n",
" like | \n",
"
\n",
" \n",
" 17983513 | \n",
" 802 | \n",
" 4595 | \n",
" 968 | \n",
" neutral | \n",
"
\n",
" \n",
" 17983514 | \n",
" 802 | \n",
" 4595 | \n",
" 962 | \n",
" dislike | \n",
"
\n",
" \n",
"
\n",
"
17983515 rows × 4 columns
\n",
"
"
],
"text/plain": [
" user_id_x work_id user_id_y strrating\n",
"0 486 1086 1120 neutral\n",
"1 486 1086 1934 dislike\n",
"2 486 1086 684 neutral\n",
"3 486 1086 45 neutral\n",
"4 486 1086 1245 neutral\n",
"5 486 1086 1705 dislike\n",
"6 486 1086 1671 like\n",
"7 486 1086 1082 neutral\n",
"8 486 1086 1672 dislike\n",
"9 486 1086 1606 dislike\n",
"10 486 1086 1392 dislike\n",
"11 486 1086 1463 dislike\n",
"12 486 1086 1668 neutral\n",
"13 486 1086 657 neutral\n",
"14 486 1086 466 dislike\n",
"15 486 1086 114 neutral\n",
"16 486 1086 1795 like\n",
"17 486 1086 843 neutral\n",
"18 486 1086 910 dislike\n",
"19 486 1086 831 neutral\n",
"20 486 1086 1416 dislike\n",
"21 486 1086 356 neutral\n",
"22 486 1086 1125 dislike\n",
"23 486 1086 1587 dislike\n",
"24 486 1086 1962 like\n",
"25 486 1086 208 dislike\n",
"26 486 1086 231 neutral\n",
"27 486 1086 610 dislike\n",
"28 486 1086 1333 like\n",
"29 486 1086 181 neutral\n",
"... ... ... ... ...\n",
"17983485 1238 4595 1635 like\n",
"17983486 1238 4595 1268 like\n",
"17983487 1238 4595 1559 like\n",
"17983488 1238 4595 561 dislike\n",
"17983489 1238 4595 872 neutral\n",
"17983490 1238 4595 1802 neutral\n",
"17983491 1238 4595 297 like\n",
"17983492 1238 4595 274 like\n",
"17983493 1238 4595 968 neutral\n",
"17983494 1238 4595 962 dislike\n",
"17983495 425 4595 1635 like\n",
"17983496 425 4595 1268 like\n",
"17983497 425 4595 1559 like\n",
"17983498 425 4595 561 dislike\n",
"17983499 425 4595 872 neutral\n",
"17983500 425 4595 1802 neutral\n",
"17983501 425 4595 297 like\n",
"17983502 425 4595 274 like\n",
"17983503 425 4595 968 neutral\n",
"17983504 425 4595 962 dislike\n",
"17983505 802 4595 1635 like\n",
"17983506 802 4595 1268 like\n",
"17983507 802 4595 1559 like\n",
"17983508 802 4595 561 dislike\n",
"17983509 802 4595 872 neutral\n",
"17983510 802 4595 1802 neutral\n",
"17983511 802 4595 297 like\n",
"17983512 802 4595 274 like\n",
"17983513 802 4595 968 neutral\n",
"17983514 802 4595 962 dislike\n",
"\n",
"[17983515 rows x 4 columns]"
]
},
"execution_count": 144,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test.merge(watched, on=['work_id'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No! So we can forget about the `user_id`, and we will learn how to map liked/disliked to see/notsee for each movie."
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id_x | \n",
" work_id | \n",
" strrating | \n",
" user_id_y | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
" 863 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
" 329 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
" 1046 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
" 794 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 717 | \n",
" 8025 | \n",
" dislike | \n",
" 820 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id_x work_id strrating user_id_y rating\n",
"0 717 8025 dislike 863 0\n",
"1 717 8025 dislike 329 0\n",
"2 717 8025 dislike 1046 0\n",
"3 717 8025 dislike 794 0\n",
"4 717 8025 dislike 820 1"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_train = watched.merge(train, on='work_id')\n",
"all_train[:5]"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"del all_train['user_id_x']\n",
"del all_train['user_id_y']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can delete the `user_id` axes."
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" work_id | \n",
" strrating | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" work_id strrating rating\n",
"0 8025 dislike 0\n",
"1 8025 dislike 0\n",
"2 8025 dislike 0\n",
"3 8025 dislike 0\n",
"4 8025 dislike 1"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_train[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can first get the average rating of each work:"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"work_id\n",
"2 0.230769\n",
"4 0.500000\n",
"5 0.333333\n",
"9 0.200000\n",
"23 0.000000\n",
"24 0.000000\n",
"27 0.333333\n",
"28 0.333333\n",
"33 0.250000\n",
"48 0.400000\n",
"Name: rating, dtype: float64"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_train.groupby('work_id').rating.mean()[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This table now contains, for each work, a list of mapping from `strrating` to `rating`.\n",
"It can be combined into a concise mapping, like in this form:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"mapping_strrating_probwillsee = {\n",
" 'dislike': 0,\n",
" 'neutral': 0.50,\n",
" 'like': 0.75,\n",
" 'love': 1,\n",
"}"
]
},
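{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of picking these values by hand, one could also estimate a global mapping from the merged data (a sketch of an alternative, not used in the rest of this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: empirical P(seen) for each string rating, aggregated over all works.\n",
"# These estimated values could replace the hand-picked ones above.\n",
"estimated_mapping = all_train.groupby('strrating')['rating'].mean().to_dict()\n",
"estimated_mapping"
]
},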
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Manually, for instance for one movie:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" work_id | \n",
" strrating | \n",
" rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 6 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 7 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 8 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 9 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 10 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 11 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 12 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 13 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 14 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 15 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 16 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 18 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 19 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 20 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 21 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 22 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 23 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 24 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 25 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 26 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 27 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 28 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 29 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 17754 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17755 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17756 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17757 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17758 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17759 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17760 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17761 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17762 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17763 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17764 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17784 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17785 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17786 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17787 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17788 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17789 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17790 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17791 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17792 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17793 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17794 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17795 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17796 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17797 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17798 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17799 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17800 | \n",
" 8025 | \n",
" dislike | \n",
" 0 | \n",
"
\n",
" \n",
" 17801 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
" 17802 | \n",
" 8025 | \n",
" dislike | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
4294 rows × 3 columns
\n",
"
"
],
"text/plain": [
" work_id strrating rating\n",
"0 8025 dislike 0\n",
"1 8025 dislike 0\n",
"2 8025 dislike 0\n",
"3 8025 dislike 0\n",
"4 8025 dislike 1\n",
"5 8025 dislike 0\n",
"6 8025 dislike 0\n",
"7 8025 dislike 1\n",
"8 8025 dislike 0\n",
"9 8025 dislike 0\n",
"10 8025 dislike 0\n",
"11 8025 dislike 0\n",
"12 8025 dislike 0\n",
"13 8025 dislike 1\n",
"14 8025 dislike 0\n",
"15 8025 dislike 1\n",
"16 8025 dislike 0\n",
"17 8025 dislike 1\n",
"18 8025 dislike 1\n",
"19 8025 dislike 0\n",
"20 8025 dislike 0\n",
"21 8025 dislike 0\n",
"22 8025 dislike 0\n",
"23 8025 dislike 1\n",
"24 8025 dislike 0\n",
"25 8025 dislike 0\n",
"26 8025 dislike 1\n",
"27 8025 dislike 0\n",
"28 8025 dislike 0\n",
"29 8025 dislike 0\n",
"... ... ... ...\n",
"17754 8025 dislike 0\n",
"17755 8025 dislike 0\n",
"17756 8025 dislike 0\n",
"17757 8025 dislike 0\n",
"17758 8025 dislike 0\n",
"17759 8025 dislike 1\n",
"17760 8025 dislike 0\n",
"17761 8025 dislike 1\n",
"17762 8025 dislike 0\n",
"17763 8025 dislike 1\n",
"17764 8025 dislike 1\n",
"17784 8025 dislike 0\n",
"17785 8025 dislike 0\n",
"17786 8025 dislike 0\n",
"17787 8025 dislike 0\n",
"17788 8025 dislike 1\n",
"17789 8025 dislike 0\n",
"17790 8025 dislike 0\n",
"17791 8025 dislike 1\n",
"17792 8025 dislike 0\n",
"17793 8025 dislike 0\n",
"17794 8025 dislike 0\n",
"17795 8025 dislike 0\n",
"17796 8025 dislike 0\n",
"17797 8025 dislike 1\n",
"17798 8025 dislike 0\n",
"17799 8025 dislike 1\n",
"17800 8025 dislike 0\n",
"17801 8025 dislike 1\n",
"17802 8025 dislike 1\n",
"\n",
"[4294 rows x 3 columns]"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')]"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.31578947368421051"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_train[all_train.work_id == 8025].rating.mean()"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4294"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0.31578947368421051"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating)\n",
"all_train[(all_train.work_id == 8025) & (all_train.strrating == 'dislike')].rating.mean()"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4598"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0.31578947368421051"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating)\n",
"all_train[(all_train.work_id == 8025) & (all_train.strrating == 'neutral')].rating.mean()"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8151"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0.31578947368421051"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating)\n",
"all_train[(all_train.work_id == 8025) & (all_train.strrating == 'like')].rating.mean()"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"817"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"0.31578947368421051"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating)\n",
"all_train[(all_train.work_id == 8025) & (all_train.strrating == 'love')].rating.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's weird!"
]
},
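{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of this explanation (a sketch added for illustration): within a single work, the conditional means per `strrating` all collapse to that work's overall mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# For one work, the per-strrating means are all equal to that work's overall mean,\n",
"# since 'rating' comes from the train side of the merge and does not depend on 'strrating'.\n",
"all_train[all_train.work_id == 8025].groupby('strrating')['rating'].mean()"
]
},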
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `titles.csv`\n",
"I don't think I want to use the titles, but clustering the works by categories could help, maybe."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['album', 'anime', 'manga']"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"categories = np.unique(titles.category).tolist()\n",
"categories"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There is 2808 work(s) in category 'manga'.\n",
"There is 1 work(s) in category 'album'.\n",
"There is 7088 work(s) in category 'anime'.\n"
]
}
],
"source": [
"for cat in categories:\n",
" print(\"There is {:>5} work(s) in category '{}'.\".format(sum(titles.category == cat), cat))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One category is alone, let rewrite it to `'anime'`."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"categories = {\n",
" 'anime': 0,\n",
" 'album': 0,\n",
" 'manga': 1,\n",
"}"
]
},
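{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, this dictionary could be used to attach a numeric category code to each work (a hypothetical next step, not pursued further here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical: encode each work's category with the numeric code defined above.\n",
"titles['category_code'] = titles['category'].map(categories)\n",
"titles['category_code'].value_counts()"
]
},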
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO !"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation from the data challenge platform"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO !"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
},
"toc": {
"colors": {
"hover_highlight": "#DAA520",
"running_highlight": "#FF0000",
"selected_highlight": "#FFD700"
},
"moveMenuLeft": true,
"nav_menu": {
"height": "196px",
"width": "251px"
},
"navigate_menu": true,
"number_sections": true,
"sideBar": true,
"threshold": 4,
"toc_cell": true,
"toc_section_display": "block",
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}