{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generator\n", "\n", "In this notebook we generate fake listening history for users of a music streaming service. \n", "\n", "The simulated data is uses the [last.fm 1K data set](http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html) as a source, using only the list of artists the user has listened to and the user names from this data set. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import random\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from datasketching.minhash import SimpleMinhash\n", "from datasketching.minhash import murmurmaker" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2\n", "3990444 user_000203 2008-03-12T01:04:14Z The Long Blondes\n", "7157816 user_000367 2007-09-06T17:22:11Z Bryan Adams\n", "9726142 user_000521 2009-02-17T16:52:23Z Panic At The Disco\n", "7301995 user_000377 2008-07-15T04:23:48Z Leonel Nunes\n", "5604797 user_000290 2008-08-15T13:29:49Z Prong\n", "11090790 user_000593 2007-04-05T13:59:55Z Fred Frith\n", "14016597 user_000743 2007-10-24T09:20:28Z American Music Club\n", "7782064 user_000412 2006-07-25T09:59:55Z The Saints\n", "8053680 user_000427 2006-01-28T06:30:27Z Nirvana\n", "12391956 user_000672 2009-02-28T05:05:46Z Fear Before The March Of Flames\n" ] } ], "source": [ "df = pd.read_parquet(\"data/music.parquet\") #load in the last.fm data set\n", "\n", "df = df.drop(df[df[\"2\"].str.len() > 60].index) # we remove long band names.\n", "\n", "print(df.sample(10, random_state=1))\n", "\n", "artists = df['2'].unique() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To save on memory we replace the artist names with integers. We save the dictionary which maps from artist names to integers to file, so that we can recover the artist names later. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dartists = {y:x+1 for x,y in enumerate(set(artists))}\n", "dartists_inv = {x+1:y for x,y in enumerate(set(artists))}\n", "import pickle\n", "f = open(\"data/dartists.pkl\",\"wb\")\n", "pickle.dump(dartists_inv,f)\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pseudo users are generated such that their listening history is a mixture of listening histories of 'similar' users in the last.fm data set, where similarity is determined by comparing the [MinHash](https://en.wikipedia.org/wiki/MinHash) signature of the users' listening history. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_minhash_sig(user_dat, nhash):\n", " mh = SimpleMinhash(nhash)\n", " for row in user_dat:\n", " mh.add(row)\n", " return mh\n", "\n", "def unique_artists(df):\n", " uniques = df['2'].unique()\n", " return [dartists[artist] for artist in uniques]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped_df = df.groupby(['0']) #group the data set by user name\n", "un_artists = grouped_df.apply(unique_artists) #identify all artists listened to by each user\n", "mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128) #compute MinHash signature\n", "\n", "users = df['0'].unique() \n", "dusers = {x+1:y for x,y in enumerate(sorted(set(users)))} #Generating dictionary of user names. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given a 'parent' user, x, from the last.fm data set, listening history for a new user, y, is simulated such that: \n", "\n", "1. y has listened to a random sample of 90% of the artists x has listened to,\n", "2. for 5 users 'similar' to x, y has listened to 2% of their listening history. \n", "\n", "\n", "The 5 'similar' users are chosen at random from the ten users with minhash signatures most similar to x. From these users' history, we remove all artists that x also listened to. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_users = pd.DataFrame( columns=['user', 'artist','plays']) \n", "ii = 0 \n", "kk = 0\n", "sv = 0\n", "for u in range(0, 992): \n", " print(u) \n", " x = mh_sigs[u]\n", " artists_listened = len(un_artists[u])\n", " to_sample = int(np.floor(artists_listened)*0.02)\n", " sim=[]\n", " for mh in range(0, 992):\n", " sim.append(mh_sigs[mh].similarity(mh_sigs[0]))\n", " \n", " similar = set(sorted(sim, reverse=True)[1:11]) # the ten largest similarities\n", " similar_users = ([i for i, e in enumerate(sim) if e in similar]) # extract the user values\n", " \n", " \n", " user_play_fr = grouped_df.get_group(dusers[(u+1)]).groupby(['2']).count()['1'].values\n", " \n", " \n", " for j in range(0, 50):\n", " ### make 50 new users for each user\n", " kk += 1 \n", " username = kk\n", " #print(username)\n", " selected = random.sample(similar_users, 5)\n", " listened = []\n", " for k in selected:\n", " possible = np.setdiff1d(un_artists[k], (list(un_artists[u])+listened))\n", " listened = listened + list(np.random.choice(un_artists[k], size = to_sample, replace = False))\n", " \n", " listened = listened + list(np.random.choice(un_artists[u], size=int(np.floor(artists_listened*0.9)), replace=False))\n", " \n", " ### now simulate user plays. \n", " user_plays = np.random.choice(user_play_fr, size=len(listened), replace = False)\n", " \n", " user_data = {'user':np.repeat(username,len(listened), axis=0) , 'artist':listened, 'plays':user_plays} \n", " user_df = pd.DataFrame(user_data) \n", " new_users = pd.concat([new_users, user_df])\n", " \n", " ii += 1\n", " if ii == 62:\n", " sv +=1\n", " ### write file to parquet every 20th user, and begin a new file\n", " filename='data/userdat'+str(sv)+'.parquet'\n", " print(filename)\n", " new_users.to_parquet(filename)\n", " ii = 0\n", " new_users = pd.DataFrame( columns=['user', 'artist','plays']) \n", " \n", " \n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }