## Data Generator

In this notebook we generate fake listening history for users of a music streaming service. 

The simulated data uses the [last.fm 1K data set](http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html) as a source, taking only the user names and the list of artists each user has listened to.

In [1]:
import random
import pandas as pd
import numpy as np

from datasketching.minhash import SimpleMinhash
from datasketching.minhash import murmurmaker

In [2]:
df = pd.read_parquet("data/music.parquet") #load in the last.fm data set

df = df.drop(df[df["2"].str.len() > 60].index) # we remove long band names.

print(df.sample(10, random_state=1))

artists = df['2'].unique() 

          0            1                     2
3990444   user_000203  2008-03-12T01:04:14Z  The Long Blondes
7157816   user_000367  2007-09-06T17:22:11Z  Bryan Adams
9726142   user_000521  2009-02-17T16:52:23Z  Panic At The Disco
7301995   user_000377  2008-07-15T04:23:48Z  Leonel Nunes
5604797   user_000290  2008-08-15T13:29:49Z  Prong
11090790  user_000593  2007-04-05T13:59:55Z  Fred Frith
14016597  user_000743  2007-10-24T09:20:28Z  American Music Club
7782064   user_000412  2006-07-25T09:59:55Z  The Saints
8053680   user_000427  2006-01-28T06:30:27Z  Nirvana
12391956  user_000672  2009-02-28T05:05:46Z  Fear Before The March Of Flames


To save on memory we replace the artist names with integers. We save the inverse dictionary, which maps the integers back to artist names, to file, so that we can recover the artist names later.

In [5]:
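import pickle

# Map each artist name to an integer ID (starting at 1)
dartists = {y: x+1 for x, y in enumerate(set(artists))}
# Build the inverse from the same dictionary so the two mappings always agree
dartists_inv = {v: k for k, v in dartists.items()}

# Save the integer -> artist name mapping so names can be recovered later
with open("data/dartists.pkl", "wb") as f:
    pickle.dump(dartists_inv, f)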
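
As a rough illustration of the encoding step, the sketch below builds the same kind of mapping on a few toy artist names and checks that the names survive a pickle round trip. The names `toy_artists`, `name_to_id` and `id_to_name` are made up for this example, the names are sorted only to make the IDs reproducible, and an in-memory buffer stands in for the file on disk.

```python
import io
import pickle

# Toy artist names standing in for df['2'] (illustrative only).
toy_artists = ["Nirvana", "Prong", "The Saints", "Nirvana"]

# Sort the unique names so the integer IDs are reproducible across runs.
name_to_id = {name: i + 1 for i, name in enumerate(sorted(set(toy_artists)))}
# Invert the same dictionary so both directions are guaranteed to agree.
id_to_name = {i: name for name, i in name_to_id.items()}

# Round-trip the inverse mapping through pickle (a buffer replaces the file).
buffer = io.BytesIO()
pickle.dump(id_to_name, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

encoded = [name_to_id[name] for name in toy_artists]  # names -> integers
decoded = [restored[i] for i in encoded]              # integers -> names
print(encoded, decoded)
```

Deriving the inverse from the forward dictionary, rather than enumerating the set a second time, avoids relying on the iteration order of two separately built sets.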

Pseudo users are generated such that their listening history is a mixture of listening histories of 'similar' users in the last.fm data set, where similarity is determined by comparing the [MinHash](https://en.wikipedia.org/wiki/MinHash) signature of the users' listening history. 

In [None]:
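def generate_minhash_sig(user_dat, nhash):
    # build a MinHash signature with nhash hash functions from a list of artist IDs
    mh = SimpleMinhash(nhash)
    for row in user_dat:
        mh.add(row)
    return mh

def unique_artists(df):
    # return the integer IDs of the distinct artists in a user's listening history
    uniques = df['2'].unique()
    return [dartists[artist] for artist in uniques]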
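
To make the idea behind the signatures concrete, here is a minimal, self-contained MinHash sketch. It does not use `datasketching.minhash.SimpleMinhash`, whose internals are not shown in this notebook. The functions `minhash_signature` and `estimated_jaccard` are hypothetical helpers written for this illustration, and salted BLAKE2b from the standard library is just one possible choice for the family of hash functions.

```python
import hashlib

def minhash_signature(items, nhash=128):
    """Return the minimum hash value of `items` under each of nhash salted hashes."""
    signature = []
    for seed in range(nhash):
        # A different salt per seed gives a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(str(item).encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree (estimates Jaccard similarity)."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Toy artist ID sets: a and b share half of their artists, c is disjoint from a.
user_a = set(range(0, 100))
user_b = set(range(50, 150))      # true Jaccard with a: 50 / 150
user_c = set(range(1000, 1100))   # true Jaccard with a: 0

sig_a = minhash_signature(user_a)
sig_b = minhash_signature(user_b)
sig_c = minhash_signature(user_c)

sim_ab = estimated_jaccard(sig_a, sig_b)
sim_ac = estimated_jaccard(sig_a, sig_c)
print(sim_ab, sim_ac)
```

The fraction of matching signature positions approximates the Jaccard similarity of the two artist sets, so users with heavily overlapping listening histories score close to 1 and users with nothing in common score close to 0.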

In [None]:
grouped_df = df.groupby(['0']) #group the data set by user name
un_artists = grouped_df.apply(unique_artists) #identify all artists listened to by each user
mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128) #compute MinHash signature

users = df['0'].unique() 
dusers = {x+1:y for x,y in enumerate(sorted(set(users)))} #Generating dictionary of user names. 

Given a 'parent' user, x, from the last.fm data set, listening history for a new user, y, is simulated such that: 

1. y has listened to a random sample of 90% of the artists x has listened to,
2. from each of 5 users 'similar' to x, y has listened to a random sample of artists, with the sample size set to 2% of the number of artists x has listened to.


The 5 'similar' users are chosen at random from the ten users whose MinHash signatures are most similar to that of x. Before sampling from these users' histories, we remove all artists that x also listened to.
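
The sampling rule above can be sketched on toy data as follows. This is an illustration under stated assumptions and not the notebook's own generator: `simulate_user`, `parent_artists` and `similar_artists` are hypothetical names, the artist IDs are invented, and NumPy's `default_rng` is used so that the example is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): the parent user x listened to artists 1..100,
# and each of ten 'similar' users listened to 200 partly overlapping artists.
parent_artists = np.arange(1, 101)
similar_artists = [np.arange(51 + 10 * i, 251 + 10 * i) for i in range(10)]

def simulate_user(parent, candidates, rng, keep_frac=0.9, borrow_frac=0.02, n_similar=5):
    """Mix 90% of the parent's artists with a few artists from similar users."""
    n_borrow = int(len(parent) * borrow_frac)
    # Pick 5 of the ten similar users at random.
    chosen = rng.choice(len(candidates), size=n_similar, replace=False)
    listened = []
    for idx in chosen:
        # Remove artists the parent listened to, and artists already borrowed.
        possible = np.setdiff1d(candidates[idx], list(parent) + listened)
        n_take = min(n_borrow, len(possible))
        listened += list(rng.choice(possible, size=n_take, replace=False))
    # Keep a random 90% of the parent's own artists.
    n_keep = int(len(parent) * keep_frac)
    listened += list(rng.choice(parent, size=n_keep, replace=False))
    return listened

new_user = simulate_user(parent_artists, similar_artists, rng)
from_parent = [a for a in new_user if a in set(parent_artists)]
print(len(new_user), len(from_parent))
```

With 100 parent artists this gives 90 artists kept from the parent plus 2 borrowed from each of the 5 selected users, with no artist appearing twice.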


In [None]:
new_users = pd.DataFrame(columns=['user', 'artist', 'plays'])
ii = 0  # parent users processed since the last file was written
kk = 0  # running ID for generated users
sv = 0  # number of files written so far
for u in range(0, 992):
    print(u)
    x = mh_sigs[u]
    artists_listened = len(un_artists[u])
    to_sample = int(np.floor(artists_listened * 0.02))
    sim = []
    for mh in range(0, 992):
        sim.append(mh_sigs[mh].similarity(x))  # compare every user with the parent x

    # the ten largest similarities, skipping the top value (x compared with itself)
    similar = set(sorted(sim, reverse=True)[1:11])
    similar_users = [i for i, e in enumerate(sim) if e in similar]  # extract the user values

    # number of plays per artist for the parent user
    user_play_fr = grouped_df.get_group(dusers[(u+1)]).groupby(['2']).count()['1'].values

    for j in range(0, 50):
        ### make 50 new users for each user
        kk += 1
        username = kk
        #print(username)
        selected = random.sample(similar_users, 5)
        listened = []
        for k in selected:
            # artists of the similar user that neither x nor the new user has so far
            possible = np.setdiff1d(un_artists[k], (list(un_artists[u]) + listened))
            n_take = min(to_sample, len(possible))
            listened = listened + list(np.random.choice(possible, size=n_take, replace=False))

        listened = listened + list(np.random.choice(un_artists[u], size=int(np.floor(artists_listened*0.9)), replace=False))

        ### now simulate user plays.
        user_plays = np.random.choice(user_play_fr, size=len(listened), replace=False)

        user_data = {'user': np.repeat(username, len(listened), axis=0), 'artist': listened, 'plays': user_plays}
        user_df = pd.DataFrame(user_data)
        new_users = pd.concat([new_users, user_df])

    ii += 1
    if ii == 62:
        sv += 1
        ### write file to parquet every 62 parent users, and begin a new file
        filename = 'data/userdat' + str(sv) + '.parquet'
        print(filename)
        new_users.to_parquet(filename)
        ii = 0
        new_users = pd.DataFrame(columns=['user', 'artist', 'plays'])
 
 
