{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MovieLens 1M data pre-processing\n",
"\n",
"This notebook contains code used to load and pre-process the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/), consisting of a collection of 1 million movie ratings by different users along with side information about them, and taking extra information about the movies from the [tag genome dataset](https://grouplens.org/datasets/movielens/tag-genome/) which is taken from the larger [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/).\n",
"\n",
"This data is then used in a [usage guide](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb) for building recommender systems with the the [cmfrec](https://github.com/david-cortes/cmfrec) package.\n",
"\n",
"The user side information is enhanced with an external dataset about [US zip codes](http://federalgovernmentzipcodes.us/), [US states](http://www.fonz.net/blog/archives/2008/04/06/csv-of-states-and-state-abbreviations/), and [US geographical regions](https://www.infoplease.com/us/states/sizing-states), while the item information (tag genome) - a very high-dimensional dataset - is simplified by taking the first 50 principal components.\n",
"\n",
"### Loading the ratings data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" UserId | \n",
" ItemId | \n",
" Rating | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1193 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 661 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 914 | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 3408 | \n",
" 4 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 2355 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" UserId ItemId Rating\n",
"0 1 1193 5\n",
"1 1 661 3\n",
"2 1 914 3\n",
"3 1 3408 4\n",
"4 1 2355 5"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np, pandas as pd, re\n",
"\n",
"ratings = pd.read_table(\n",
" 'ml-1m/ratings.dat',\n",
" sep='::', engine='python',\n",
" names=['UserId','ItemId','Rating','Timestamp']\n",
")\n",
"ratings = ratings.drop(\"Timestamp\", axis=1)\n",
"ratings.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users: 6040\n",
"Number of items: 3706\n",
"Number of ratings: 1000209\n"
]
}
],
"source": [
"print(\"Number of users: %d\" % ratings[\"UserId\"].nunique())\n",
"print(\"Number of items: %d\" % ratings[\"ItemId\"].nunique())\n",
"print(\"Number of ratings: %d\" % ratings[\"Rating\"].count())"
]
},
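{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an extra sanity check (not part of the original pre-processing), the density of the ratings matrix - the fraction of all possible user-item pairs that actually have a rating - can be computed from the counts above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fraction of the full user-item matrix that is observed\n",
"n_users = ratings[\"UserId\"].nunique()\n",
"n_items = ratings[\"ItemId\"].nunique()\n",
"density = ratings.shape[0] / (n_users * n_items)\n",
"print(\"Density of the ratings matrix: %.2f%%\" % (100 * density))"
]
},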
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the movies titles"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ItemId | \n",
" title | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Toy Story (1995) | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Jumanji (1995) | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Grumpier Old Men (1995) | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Waiting to Exhale (1995) | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Father of the Bride Part II (1995) | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ItemId title\n",
"0 1 Toy Story (1995)\n",
"1 2 Jumanji (1995)\n",
"2 3 Grumpier Old Men (1995)\n",
"3 4 Waiting to Exhale (1995)\n",
"4 5 Father of the Bride Part II (1995)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movie_titles = pd.read_table(\n",
" 'ml-1m/movies.dat',\n",
" sep='::', engine='python', header=None, encoding='latin_1',\n",
" names=['ItemId', 'title', 'genres']\n",
")\n",
"movie_titles = movie_titles[['ItemId', 'title']]\n",
"\n",
"movie_titles.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"movie_id_to_title = {i.ItemId: i.title for i in movie_titles.itertuples()}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the tag genome"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ItemId | \n",
" tag1 | \n",
" tag2 | \n",
" tag3 | \n",
" tag4 | \n",
" tag5 | \n",
" tag6 | \n",
" tag7 | \n",
" tag8 | \n",
" tag9 | \n",
" ... | \n",
" tag1119 | \n",
" tag1120 | \n",
" tag1121 | \n",
" tag1122 | \n",
" tag1123 | \n",
" tag1124 | \n",
" tag1125 | \n",
" tag1126 | \n",
" tag1127 | \n",
" tag1128 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0.02875 | \n",
" 0.02375 | \n",
" 0.06250 | \n",
" 0.07575 | \n",
" 0.14075 | \n",
" 0.14675 | \n",
" 0.06350 | \n",
" 0.20375 | \n",
" 0.2020 | \n",
" ... | \n",
" 0.04050 | \n",
" 0.01425 | \n",
" 0.03050 | \n",
" 0.03500 | \n",
" 0.14125 | \n",
" 0.05775 | \n",
" 0.03900 | \n",
" 0.02975 | \n",
" 0.08475 | \n",
" 0.02200 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 0.04125 | \n",
" 0.04050 | \n",
" 0.06275 | \n",
" 0.08275 | \n",
" 0.09100 | \n",
" 0.06125 | \n",
" 0.06925 | \n",
" 0.09600 | \n",
" 0.0765 | \n",
" ... | \n",
" 0.05250 | \n",
" 0.01575 | \n",
" 0.01250 | \n",
" 0.02000 | \n",
" 0.12225 | \n",
" 0.03275 | \n",
" 0.02100 | \n",
" 0.01100 | \n",
" 0.10525 | \n",
" 0.01975 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 0.04675 | \n",
" 0.05550 | \n",
" 0.02925 | \n",
" 0.08700 | \n",
" 0.04750 | \n",
" 0.04775 | \n",
" 0.04600 | \n",
" 0.14275 | \n",
" 0.0285 | \n",
" ... | \n",
" 0.06275 | \n",
" 0.01950 | \n",
" 0.02225 | \n",
" 0.02300 | \n",
" 0.12200 | \n",
" 0.03475 | \n",
" 0.01700 | \n",
" 0.01800 | \n",
" 0.09100 | \n",
" 0.01775 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 0.03425 | \n",
" 0.03800 | \n",
" 0.04050 | \n",
" 0.03100 | \n",
" 0.06500 | \n",
" 0.03575 | \n",
" 0.02900 | \n",
" 0.08650 | \n",
" 0.0320 | \n",
" ... | \n",
" 0.05325 | \n",
" 0.02800 | \n",
" 0.01675 | \n",
" 0.03875 | \n",
" 0.18200 | \n",
" 0.07050 | \n",
" 0.01625 | \n",
" 0.01425 | \n",
" 0.08850 | \n",
" 0.01500 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0.04300 | \n",
" 0.05325 | \n",
" 0.03800 | \n",
" 0.04100 | \n",
" 0.05400 | \n",
" 0.06725 | \n",
" 0.02775 | \n",
" 0.07650 | \n",
" 0.0215 | \n",
" ... | \n",
" 0.05350 | \n",
" 0.02050 | \n",
" 0.01425 | \n",
" 0.02550 | \n",
" 0.19225 | \n",
" 0.02675 | \n",
" 0.01625 | \n",
" 0.01300 | \n",
" 0.08700 | \n",
" 0.01600 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 1129 columns
\n",
"
"
],
"text/plain": [
" ItemId tag1 tag2 tag3 tag4 tag5 tag6 tag7 \\\n",
"0 1 0.02875 0.02375 0.06250 0.07575 0.14075 0.14675 0.06350 \n",
"1 2 0.04125 0.04050 0.06275 0.08275 0.09100 0.06125 0.06925 \n",
"2 3 0.04675 0.05550 0.02925 0.08700 0.04750 0.04775 0.04600 \n",
"3 4 0.03425 0.03800 0.04050 0.03100 0.06500 0.03575 0.02900 \n",
"4 5 0.04300 0.05325 0.03800 0.04100 0.05400 0.06725 0.02775 \n",
"\n",
" tag8 tag9 ... tag1119 tag1120 tag1121 tag1122 tag1123 tag1124 \\\n",
"0 0.20375 0.2020 ... 0.04050 0.01425 0.03050 0.03500 0.14125 0.05775 \n",
"1 0.09600 0.0765 ... 0.05250 0.01575 0.01250 0.02000 0.12225 0.03275 \n",
"2 0.14275 0.0285 ... 0.06275 0.01950 0.02225 0.02300 0.12200 0.03475 \n",
"3 0.08650 0.0320 ... 0.05325 0.02800 0.01675 0.03875 0.18200 0.07050 \n",
"4 0.07650 0.0215 ... 0.05350 0.02050 0.01425 0.02550 0.19225 0.02675 \n",
"\n",
" tag1125 tag1126 tag1127 tag1128 \n",
"0 0.03900 0.02975 0.08475 0.02200 \n",
"1 0.02100 0.01100 0.10525 0.01975 \n",
"2 0.01700 0.01800 0.09100 0.01775 \n",
"3 0.01625 0.01425 0.08850 0.01500 \n",
"4 0.01625 0.01300 0.08700 0.01600 \n",
"\n",
"[5 rows x 1129 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies = pd.read_csv('ml-25m/movies.csv')\n",
"movies = movies[['movieId', 'title']]\n",
"movies = pd.merge(movies, movie_titles)\n",
"movies = movies[['movieId', 'ItemId']]\n",
"\n",
"tags = pd.read_csv('ml-25m/genome-scores.csv')\n",
"tags_wide = tags.pivot(index='movieId', columns='tagId', values='relevance')\n",
"tags_wide.columns=[\"tag\"+str(i) for i in tags_wide.columns]\n",
"\n",
"item_side_info = pd.merge(movies, tags_wide, how='inner', left_on='movieId', right_index=True)\n",
"item_side_info = item_side_info.drop('movieId', axis=1)\n",
"item_side_info.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dimensionality reduction for the tag genome through PCA"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ItemId | \n",
" pc1 | \n",
" pc2 | \n",
" pc3 | \n",
" pc4 | \n",
" pc5 | \n",
" pc6 | \n",
" pc7 | \n",
" pc8 | \n",
" pc9 | \n",
" ... | \n",
" pc41 | \n",
" pc42 | \n",
" pc43 | \n",
" pc44 | \n",
" pc45 | \n",
" pc46 | \n",
" pc47 | \n",
" pc48 | \n",
" pc49 | \n",
" pc50 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1.193171 | \n",
" 2.085621 | \n",
" 2.634135 | \n",
" 1.156088 | \n",
" 0.721649 | \n",
" 0.995436 | \n",
" 1.250474 | \n",
" -0.779532 | \n",
" 1.616702 | \n",
" ... | \n",
" -0.317134 | \n",
" -0.070338 | \n",
" -0.019553 | \n",
" 0.169051 | \n",
" 0.201415 | \n",
" -0.094831 | \n",
" -0.250461 | \n",
" -0.149919 | \n",
" -0.031735 | \n",
" -0.177708 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" -1.333533 | \n",
" 1.743796 | \n",
" 1.352161 | \n",
" 0.795724 | \n",
" -0.484175 | \n",
" 0.380645 | \n",
" 0.804462 | \n",
" -0.598527 | \n",
" 0.917250 | \n",
" ... | \n",
" 0.300060 | \n",
" -0.261956 | \n",
" 0.054457 | \n",
" 0.003863 | \n",
" 0.304605 | \n",
" -0.315796 | \n",
" 0.360203 | \n",
" 0.152770 | \n",
" 0.144790 | \n",
" -0.096549 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" -1.363395 | \n",
" -0.017107 | \n",
" 0.530395 | \n",
" -0.316202 | \n",
" 0.469430 | \n",
" 0.164630 | \n",
" 0.019083 | \n",
" 0.159188 | \n",
" -0.232969 | \n",
" ... | \n",
" 0.215020 | \n",
" -0.060682 | \n",
" -0.280852 | \n",
" 0.001087 | \n",
" 0.084960 | \n",
" -0.257190 | \n",
" -0.136963 | \n",
" -0.113914 | \n",
" 0.128352 | \n",
" -0.203658 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" -1.237840 | \n",
" -0.993731 | \n",
" 0.809815 | \n",
" -0.303009 | \n",
" -0.088991 | \n",
" -0.049621 | \n",
" -0.179544 | \n",
" -0.771278 | \n",
" -0.400499 | \n",
" ... | \n",
" 0.066207 | \n",
" 0.056054 | \n",
" -0.223027 | \n",
" 0.400157 | \n",
" 0.292300 | \n",
" 0.260936 | \n",
" -0.307608 | \n",
" -0.224141 | \n",
" 0.488955 | \n",
" 0.439189 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" -1.611499 | \n",
" -0.251899 | \n",
" 1.126443 | \n",
" -0.135702 | \n",
" 0.403340 | \n",
" 0.187289 | \n",
" 0.108451 | \n",
" -0.275341 | \n",
" -0.261142 | \n",
" ... | \n",
" 0.109560 | \n",
" -0.086042 | \n",
" -0.236327 | \n",
" 0.461589 | \n",
" 0.013350 | \n",
" -0.192557 | \n",
" -0.234025 | \n",
" -0.369643 | \n",
" -0.041060 | \n",
" -0.074656 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 51 columns
\n",
"
"
],
"text/plain": [
" ItemId pc1 pc2 pc3 pc4 pc5 pc6 \\\n",
"0 1 1.193171 2.085621 2.634135 1.156088 0.721649 0.995436 \n",
"1 2 -1.333533 1.743796 1.352161 0.795724 -0.484175 0.380645 \n",
"2 3 -1.363395 -0.017107 0.530395 -0.316202 0.469430 0.164630 \n",
"3 4 -1.237840 -0.993731 0.809815 -0.303009 -0.088991 -0.049621 \n",
"4 5 -1.611499 -0.251899 1.126443 -0.135702 0.403340 0.187289 \n",
"\n",
" pc7 pc8 pc9 ... pc41 pc42 pc43 pc44 \\\n",
"0 1.250474 -0.779532 1.616702 ... -0.317134 -0.070338 -0.019553 0.169051 \n",
"1 0.804462 -0.598527 0.917250 ... 0.300060 -0.261956 0.054457 0.003863 \n",
"2 0.019083 0.159188 -0.232969 ... 0.215020 -0.060682 -0.280852 0.001087 \n",
"3 -0.179544 -0.771278 -0.400499 ... 0.066207 0.056054 -0.223027 0.400157 \n",
"4 0.108451 -0.275341 -0.261142 ... 0.109560 -0.086042 -0.236327 0.461589 \n",
"\n",
" pc45 pc46 pc47 pc48 pc49 pc50 \n",
"0 0.201415 -0.094831 -0.250461 -0.149919 -0.031735 -0.177708 \n",
"1 0.304605 -0.315796 0.360203 0.152770 0.144790 -0.096549 \n",
"2 0.084960 -0.257190 -0.136963 -0.113914 0.128352 -0.203658 \n",
"3 0.292300 0.260936 -0.307608 -0.224141 0.488955 0.439189 \n",
"4 0.013350 -0.192557 -0.234025 -0.369643 -0.041060 -0.074656 \n",
"\n",
"[5 rows x 51 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca_obj = PCA(n_components = 50)\n",
"item_sideinfo_reduced = item_side_info.drop(\"ItemId\", axis=1)\n",
"item_sideinfo_pca = pca_obj.fit_transform(item_sideinfo_reduced)\n",
"\n",
"item_sideinfo_pca = pd.DataFrame(\n",
" item_sideinfo_pca,\n",
" columns=[\"pc\"+str(i+1) for i in range(item_sideinfo_pca.shape[1])]\n",
")\n",
"item_sideinfo_pca['ItemId'] = item_side_info[\"ItemId\"].to_numpy()\n",
"item_sideinfo_pca = item_sideinfo_pca[[\"ItemId\"] + item_sideinfo_pca.columns[:50].tolist()]\n",
"item_sideinfo_pca.head()"
]
},
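{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative check (not in the original pipeline), the share of the tag genome's total variance retained by the 50 components can be read off the fitted `pca_obj`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Share of the tag genome's total variance captured by the 50 components\n",
"print(\"Variance explained by 50 components: %.1f%%\" %\n",
"      (100 * pca_obj.explained_variance_ratio_.sum()))"
]
},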
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of items from MovieLens 1M with side info: 3080\n"
]
}
],
"source": [
"print(\"Number of items from MovieLens 1M with side info: %d\" %\n",
" ratings[\"ItemId\"][np.in1d(ratings[\"ItemId\"], item_sideinfo_pca[\"ItemId\"])].nunique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the states data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"zipcode_abbs = pd.read_csv(\"states.csv\", low_memory=False)\n",
"zipcode_abbs_dct = {z.State: z.Abbreviation for z in zipcode_abbs.itertuples()}\n",
"us_regs_table = [\n",
" ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),\n",
" ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),\n",
" ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),\n",
" ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),\n",
" ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),\n",
" ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')\n",
" ]\n",
"us_regs_table = [(x[0], [i.strip() for i in x[1].split(\",\")]) for x in us_regs_table]\n",
"us_regs_dct = dict()\n",
"for r in us_regs_table:\n",
" for s in r[1]:\n",
" us_regs_dct[zipcode_abbs_dct[s]] = r[0]"
]
},
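{
"cell_type": "markdown",
"metadata": {},
"source": [
"A couple of example lookups through the dictionaries just built (`zipcode_abbs_dct` maps state names to abbreviations, `us_regs_dct` maps abbreviations to regions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: state name -> abbreviation -> region\n",
"for state in ['Minnesota', 'Texas']:\n",
"    abbr = zipcode_abbs_dct[state]\n",
"    print(\"%s -> %s -> %s\" % (state, abbr, us_regs_dct[abbr]))"
]
},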
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the zip codes data"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Zipcode | \n",
" Region | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 501 | \n",
" Middle Atlantic | \n",
"
\n",
" \n",
" 1 | \n",
" 544 | \n",
" Middle Atlantic | \n",
"
\n",
" \n",
" 2 | \n",
" 601 | \n",
" UsOther | \n",
"
\n",
" \n",
" 3 | \n",
" 602 | \n",
" UsOther | \n",
"
\n",
" \n",
" 4 | \n",
" 603 | \n",
" UsOther | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Zipcode Region\n",
"0 501 Middle Atlantic\n",
"1 544 Middle Atlantic\n",
"2 601 UsOther\n",
"3 602 UsOther\n",
"4 603 UsOther"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zipcode_info = pd.read_csv(\"free-zipcode-database.csv\", low_memory=False)\n",
"zipcode_info = zipcode_info.groupby('Zipcode').first().reset_index()\n",
"zipcode_info.loc[lambda x: x[\"Country\"] != \"US\", 'State'] = 'UnknownOrNonUS'\n",
"zipcode_info['Region'] = zipcode_info['State'].copy()\n",
"zipcode_info.loc[lambda x: x[\"Country\"] == \"US\", \"Region\"] = (\n",
" zipcode_info\n",
" .loc[lambda x: x[\"Country\"] == \"US\"]\n",
" [\"Region\"]\n",
" .map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')\n",
")\n",
"zipcode_info = zipcode_info[['Zipcode', 'Region']]\n",
"zipcode_info.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the user demographic information"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" UserId | \n",
" Gender | \n",
" Age | \n",
" Occupation | \n",
" Zipcode | \n",
" Region | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" F | \n",
" 1 | \n",
" K-12 student | \n",
" 48067 | \n",
" Midwest | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" M | \n",
" 56 | \n",
" self-employed | \n",
" 70072 | \n",
" South | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" M | \n",
" 25 | \n",
" scientist | \n",
" 55117 | \n",
" Midwest | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" M | \n",
" 45 | \n",
" executive/managerial | \n",
" 2460 | \n",
" New England | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" M | \n",
" 25 | \n",
" writer | \n",
" 55455 | \n",
" Midwest | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" UserId Gender Age Occupation Zipcode Region\n",
"0 1 F 1 K-12 student 48067 Midwest\n",
"1 2 M 56 self-employed 70072 South\n",
"2 3 M 25 scientist 55117 Midwest\n",
"3 4 M 45 executive/managerial 2460 New England\n",
"4 5 M 25 writer 55455 Midwest"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"users = pd.read_table(\n",
" 'ml-1m/users.dat',\n",
" sep='::', engine='python', encoding='cp1252',\n",
" names=[\"UserId\", \"Gender\", \"Age\", \"Occupation\", \"Zipcode\"]\n",
")\n",
"users[\"Zipcode\"] = users[\"Zipcode\"].map(lambda x: int(re.sub(\"-.*\", \"\", x)))\n",
"users = pd.merge(users, zipcode_info, on='Zipcode', how='left')\n",
"users['Region'] = users[\"Region\"].fillna('UnknownOrNonUS')\n",
"\n",
"occupations = {\n",
" 0: \"\\\"other\\\" or not specified\",\n",
" 1: \"academic/educator\",\n",
" 2: \"artist\",\n",
" 3: \"clerical/admin\",\n",
" 4: \"college/grad student\",\n",
" 5: \"customer service\",\n",
" 6: \"doctor/health care\",\n",
" 7: \"executive/managerial\",\n",
" 8: \"farmer\",\n",
" 9: \"homemaker\",\n",
" 10: \"K-12 student\",\n",
" 11: \"lawyer\",\n",
" 12: \"programmer\",\n",
" 13: \"retired\",\n",
" 14: \"sales/marketing\",\n",
" 15: \"scientist\",\n",
" 16: \"self-employed\",\n",
" 17: \"technician/engineer\",\n",
" 18: \"tradesman/craftsman\",\n",
" 19: \"unemployed\",\n",
" 20: \"writer\"\n",
"}\n",
"users['Occupation'] = users[\"Occupation\"].map(occupations)\n",
"users['Age'] = users[\"Age\"].map(lambda x: str(x))\n",
"users.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" UserId | \n",
" Gender_F | \n",
" Gender_M | \n",
" Age_1 | \n",
" Age_18 | \n",
" Age_25 | \n",
" Age_35 | \n",
" Age_45 | \n",
" Age_50 | \n",
" Age_56 | \n",
" ... | \n",
" Occupation_unemployed | \n",
" Occupation_writer | \n",
" Region_Middle Atlantic | \n",
" Region_Midwest | \n",
" Region_New England | \n",
" Region_South | \n",
" Region_Southwest | \n",
" Region_UnknownOrNonUS | \n",
" Region_UsOther | \n",
" Region_West | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" ... | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 39 columns
\n",
"
"
],
"text/plain": [
" UserId Gender_F Gender_M Age_1 Age_18 Age_25 Age_35 Age_45 Age_50 \\\n",
"0 1 True False True False False False False False \n",
"1 2 False True False False False False False False \n",
"2 3 False True False False True False False False \n",
"3 4 False True False False False False True False \n",
"4 5 False True False False True False False False \n",
"\n",
" Age_56 ... Occupation_unemployed Occupation_writer \\\n",
"0 False ... False False \n",
"1 True ... False False \n",
"2 False ... False False \n",
"3 False ... False False \n",
"4 False ... False True \n",
"\n",
" Region_Middle Atlantic Region_Midwest Region_New England Region_South \\\n",
"0 False True False False \n",
"1 False False False True \n",
"2 False True False False \n",
"3 False False True False \n",
"4 False True False False \n",
"\n",
" Region_Southwest Region_UnknownOrNonUS Region_UsOther Region_West \n",
"0 False False False False \n",
"1 False False False False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 False False False False \n",
"\n",
"[5 rows x 39 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_side_info = pd.get_dummies(users[['UserId', 'Gender', 'Age', 'Occupation', 'Region']])\n",
"user_side_info.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users with demographic information: 6040\n"
]
}
],
"source": [
"print(\"Number of users with demographic information: %d\" %\n",
" user_side_info[\"UserId\"].nunique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving the data for usage in a different notebook"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"pickle.dump(ratings, open(\"ratings.p\", \"wb\"))\n",
"pickle.dump(item_sideinfo_pca, open(\"item_sideinfo_pca.p\", \"wb\"))\n",
"pickle.dump(user_side_info, open(\"user_side_info.p\", \"wb\"))\n",
"pickle.dump(movie_id_to_title, open(\"movie_id_to_title.p\", \"wb\"))"
]
}
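,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saved objects can then be loaded back in the other notebook along these lines (a sketch, using the file names from the cell above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the processed data back (e.g. at the start of the usage guide notebook)\n",
"with open(\"ratings.p\", \"rb\") as f:\n",
"    ratings = pickle.load(f)\n",
"with open(\"user_side_info.p\", \"rb\") as f:\n",
"    user_side_info = pickle.load(f)"
]
}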
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}