{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MovieLens 1M data pre-processing\n", "\n", "This notebook loads and pre-processes the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/), a collection of about 1 million movie ratings from different users together with side information about those users. Extra information about the movies comes from the [tag genome dataset](https://grouplens.org/datasets/movielens/tag-genome/), taken here from the larger [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/).\n", "\n", "The processed data is then used in a [usage guide](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb) for building recommender systems with the [cmfrec](https://github.com/david-cortes/cmfrec) package.\n", "\n", "The user side information is enriched with external datasets on [US zip codes](http://federalgovernmentzipcodes.us/), [US states](http://www.fonz.net/blog/archives/2008/04/06/csv-of-states-and-state-abbreviations/), and [US geographical regions](https://www.infoplease.com/us/states/sizing-states). The item information (the tag genome, a very high-dimensional dataset) is simplified by keeping only the first 50 principal components.\n", "\n", "### Loading the ratings data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UserId ItemId Rating\n", "0 1 1193 5\n", "1 1 661 3\n", "2 1 914 3\n", "3 1 3408 4\n", "4 1 2355 5" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np, pandas as pd, re\n", "\n", "ratings = pd.read_table(\n", " 'ml-1m/ratings.dat',\n", " sep='::', engine='python',\n", " names=['UserId','ItemId','Rating','Timestamp']\n", ")\n", "ratings = ratings.drop(\"Timestamp\", axis=1)\n", "ratings.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of users: 6040\n", "Number of items: 3706\n", "Number of ratings: 1000209\n" ] } ], "source": [ "print(\"Number of users: %d\" % ratings[\"UserId\"].nunique())\n", "print(\"Number of items: %d\" % ratings[\"ItemId\"].nunique())\n", "print(\"Number of ratings: %d\" % ratings[\"Rating\"].count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the movies titles" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ItemId title\n", "0 1 Toy Story (1995)\n", "1 2 Jumanji (1995)\n", "2 3 Grumpier Old Men (1995)\n", "3 4 Waiting to Exhale (1995)\n", "4 5 Father of the Bride Part II (1995)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_titles = pd.read_table(\n", " 'ml-1m/movies.dat',\n", " sep='::', engine='python', header=None, encoding='latin_1',\n", " names=['ItemId', 'title', 'genres']\n", ")\n", "movie_titles = movie_titles[['ItemId', 'title']]\n", "\n", "movie_titles.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "movie_id_to_title = {i.ItemId: i.title for i in movie_titles.itertuples()}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the tag genome" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ItemId tag1 tag2 tag3 tag4 tag5 tag6 tag7 \\\n", "0 1 0.02875 0.02375 0.06250 0.07575 0.14075 0.14675 0.06350 \n", "1 2 0.04125 0.04050 0.06275 0.08275 0.09100 0.06125 0.06925 \n", "2 3 0.04675 0.05550 0.02925 0.08700 0.04750 0.04775 0.04600 \n", "3 4 0.03425 0.03800 0.04050 0.03100 0.06500 0.03575 0.02900 \n", "4 5 0.04300 0.05325 0.03800 0.04100 0.05400 0.06725 0.02775 \n", "\n", " tag8 tag9 ... tag1119 tag1120 tag1121 tag1122 tag1123 tag1124 \\\n", "0 0.20375 0.2020 ... 0.04050 0.01425 0.03050 0.03500 0.14125 0.05775 \n", "1 0.09600 0.0765 ... 0.05250 0.01575 0.01250 0.02000 0.12225 0.03275 \n", "2 0.14275 0.0285 ... 0.06275 0.01950 0.02225 0.02300 0.12200 0.03475 \n", "3 0.08650 0.0320 ... 0.05325 0.02800 0.01675 0.03875 0.18200 0.07050 \n", "4 0.07650 0.0215 ... 0.05350 0.02050 0.01425 0.02550 0.19225 0.02675 \n", "\n", " tag1125 tag1126 tag1127 tag1128 \n", "0 0.03900 0.02975 0.08475 0.02200 \n", "1 0.02100 0.01100 0.10525 0.01975 \n", "2 0.01700 0.01800 0.09100 0.01775 \n", "3 0.01625 0.01425 0.08850 0.01500 \n", "4 0.01625 0.01300 0.08700 0.01600 \n", "\n", "[5 rows x 1129 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies = pd.read_csv('ml-25m/movies.csv')\n", "movies = movies[['movieId', 'title']]\n", "movies = pd.merge(movies, movie_titles)\n", "movies = movies[['movieId', 'ItemId']]\n", "\n", "tags = pd.read_csv('ml-25m/genome-scores.csv')\n", "tags_wide = tags.pivot(index='movieId', columns='tagId', values='relevance')\n", "tags_wide.columns=[\"tag\"+str(i) for i in tags_wide.columns]\n", "\n", "item_side_info = pd.merge(movies, tags_wide, how='inner', left_on='movieId', right_index=True)\n", "item_side_info = item_side_info.drop('movieId', axis=1)\n", "item_side_info.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dimensionality reduction for the tag genome through PCA" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": 
[ { "data": { "text/html": [ "
" ], "text/plain": [ " ItemId pc1 pc2 pc3 pc4 pc5 pc6 \\\n", "0 1 1.193171 2.085621 2.634135 1.156088 0.721649 0.995436 \n", "1 2 -1.333533 1.743796 1.352161 0.795724 -0.484175 0.380645 \n", "2 3 -1.363395 -0.017107 0.530395 -0.316202 0.469430 0.164630 \n", "3 4 -1.237840 -0.993731 0.809815 -0.303009 -0.088991 -0.049621 \n", "4 5 -1.611499 -0.251899 1.126443 -0.135702 0.403340 0.187289 \n", "\n", " pc7 pc8 pc9 ... pc41 pc42 pc43 pc44 \\\n", "0 1.250474 -0.779532 1.616702 ... -0.317134 -0.070338 -0.019553 0.169051 \n", "1 0.804462 -0.598527 0.917250 ... 0.300060 -0.261956 0.054457 0.003863 \n", "2 0.019083 0.159188 -0.232969 ... 0.215020 -0.060682 -0.280852 0.001087 \n", "3 -0.179544 -0.771278 -0.400499 ... 0.066207 0.056054 -0.223027 0.400157 \n", "4 0.108451 -0.275341 -0.261142 ... 0.109560 -0.086042 -0.236327 0.461589 \n", "\n", " pc45 pc46 pc47 pc48 pc49 pc50 \n", "0 0.201415 -0.094831 -0.250461 -0.149919 -0.031735 -0.177708 \n", "1 0.304605 -0.315796 0.360203 0.152770 0.144790 -0.096549 \n", "2 0.084960 -0.257190 -0.136963 -0.113914 0.128352 -0.203658 \n", "3 0.292300 0.260936 -0.307608 -0.224141 0.488955 0.439189 \n", "4 0.013350 -0.192557 -0.234025 -0.369643 -0.041060 -0.074656 \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca_obj = PCA(n_components = 50)\n", "item_sideinfo_reduced = item_side_info.drop(\"ItemId\", axis=1)\n", "item_sideinfo_pca = pca_obj.fit_transform(item_sideinfo_reduced)\n", "\n", "item_sideinfo_pca = pd.DataFrame(\n", " item_sideinfo_pca,\n", " columns=[\"pc\"+str(i+1) for i in range(item_sideinfo_pca.shape[1])]\n", ")\n", "item_sideinfo_pca['ItemId'] = item_side_info[\"ItemId\"].to_numpy()\n", "item_sideinfo_pca = item_sideinfo_pca[[\"ItemId\"] + item_sideinfo_pca.columns[:50].tolist()]\n", "item_sideinfo_pca.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "Number of items from MovieLens 1M with side info: 3080\n" ] } ], "source": [ "print(\"Number of items from MovieLens 1M with side info: %d\" %\n", " ratings[\"ItemId\"][np.in1d(ratings[\"ItemId\"], item_sideinfo_pca[\"ItemId\"])].nunique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the states data" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "zipcode_abbs = pd.read_csv(\"states.csv\", low_memory=False)\n", "zipcode_abbs_dct = {z.State: z.Abbreviation for z in zipcode_abbs.itertuples()}\n", "us_regs_table = [\n", " ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),\n", " ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),\n", " ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),\n", " ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),\n", " ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),\n", " ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')\n", " ]\n", "us_regs_table = [(x[0], [i.strip() for i in x[1].split(\",\")]) for x in us_regs_table]\n", "us_regs_dct = dict()\n", "for r in us_regs_table:\n", " for s in r[1]:\n", " us_regs_dct[zipcode_abbs_dct[s]] = r[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the zip codes data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Zipcode Region\n", "0 501 Middle Atlantic\n", "1 544 Middle Atlantic\n", "2 601 UsOther\n", "3 602 UsOther\n", "4 603 UsOther" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipcode_info = pd.read_csv(\"free-zipcode-database.csv\", low_memory=False)\n", "zipcode_info = zipcode_info.groupby('Zipcode').first().reset_index()\n", "zipcode_info.loc[lambda x: x[\"Country\"] != \"US\", 'State'] = 'UnknownOrNonUS'\n", "zipcode_info['Region'] = zipcode_info['State'].copy()\n", "zipcode_info.loc[lambda x: x[\"Country\"] == \"US\", \"Region\"] = (\n", " zipcode_info\n", " .loc[lambda x: x[\"Country\"] == \"US\"]\n", " [\"Region\"]\n", " .map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')\n", ")\n", "zipcode_info = zipcode_info[['Zipcode', 'Region']]\n", "zipcode_info.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the user demographic information" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UserId Gender Age Occupation Zipcode Region\n", "0 1 F 1 K-12 student 48067 Midwest\n", "1 2 M 56 self-employed 70072 South\n", "2 3 M 25 scientist 55117 Midwest\n", "3 4 M 45 executive/managerial 2460 New England\n", "4 5 M 25 writer 55455 Midwest" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users = pd.read_table(\n", " 'ml-1m/users.dat',\n", " sep='::', engine='python', encoding='cp1252',\n", " names=[\"UserId\", \"Gender\", \"Age\", \"Occupation\", \"Zipcode\"]\n", ")\n", "users[\"Zipcode\"] = users[\"Zipcode\"].map(lambda x: int(re.sub(\"-.*\", \"\", x)))\n", "users = pd.merge(users, zipcode_info, on='Zipcode', how='left')\n", "users['Region'] = users[\"Region\"].fillna('UnknownOrNonUS')\n", "\n", "occupations = {\n", " 0: \"\\\"other\\\" or not specified\",\n", " 1: \"academic/educator\",\n", " 2: \"artist\",\n", " 3: \"clerical/admin\",\n", " 4: \"college/grad student\",\n", " 5: \"customer service\",\n", " 6: \"doctor/health care\",\n", " 7: \"executive/managerial\",\n", " 8: \"farmer\",\n", " 9: \"homemaker\",\n", " 10: \"K-12 student\",\n", " 11: \"lawyer\",\n", " 12: \"programmer\",\n", " 13: \"retired\",\n", " 14: \"sales/marketing\",\n", " 15: \"scientist\",\n", " 16: \"self-employed\",\n", " 17: \"technician/engineer\",\n", " 18: \"tradesman/craftsman\",\n", " 19: \"unemployed\",\n", " 20: \"writer\"\n", "}\n", "users['Occupation'] = users[\"Occupation\"].map(occupations)\n", "users['Age'] = users[\"Age\"].map(lambda x: str(x))\n", "users.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UserId Gender_F Gender_M Age_1 Age_18 Age_25 Age_35 Age_45 Age_50 \\\n", "0 1 True False True False False False False False \n", "1 2 False True False False False False False False \n", "2 3 False True False False True False False False \n", "3 4 False True False False False False True False \n", "4 5 False True False False True False False False \n", "\n", " Age_56 ... Occupation_unemployed Occupation_writer \\\n", "0 False ... False False \n", "1 True ... False False \n", "2 False ... False False \n", "3 False ... False False \n", "4 False ... False True \n", "\n", " Region_Middle Atlantic Region_Midwest Region_New England Region_South \\\n", "0 False True False False \n", "1 False False False True \n", "2 False True False False \n", "3 False False True False \n", "4 False True False False \n", "\n", " Region_Southwest Region_UnknownOrNonUS Region_UsOther Region_West \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "\n", "[5 rows x 39 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_side_info = pd.get_dummies(users[['UserId', 'Gender', 'Age', 'Occupation', 'Region']])\n", "user_side_info.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of users with demographic information: 6040\n" ] } ], "source": [ "print(\"Number of users with demographic information: %d\" %\n", " user_side_info[\"UserId\"].nunique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving the data for usage in a different notebook" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "pickle.dump(ratings, open(\"ratings.p\", \"wb\"))\n", "pickle.dump(item_sideinfo_pca, open(\"item_sideinfo_pca.p\", \"wb\"))\n", 
"pickle.dump(user_side_info, open(\"user_side_info.p\", \"wb\"))\n", "pickle.dump(movie_id_to_title, open(\"movie_id_to_title.p\", \"wb\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 2 }