{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Generator\n",
    "\n",
    "In this notebook we generate fake listening history for users of a music streaming service. \n",
    "\n",
    "The simulated data is uses the [last.fm 1K data set](http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html) as a source, using only the list of artists the user has listened to and the user names from this data set.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from datasketching.minhash import SimpleMinhash\n",
    "from datasketching.minhash import murmurmaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                    0                     1                                2\n",
      "3990444   user_000203  2008-03-12T01:04:14Z                 The Long Blondes\n",
      "7157816   user_000367  2007-09-06T17:22:11Z                      Bryan Adams\n",
      "9726142   user_000521  2009-02-17T16:52:23Z               Panic At The Disco\n",
      "7301995   user_000377  2008-07-15T04:23:48Z                     Leonel Nunes\n",
      "5604797   user_000290  2008-08-15T13:29:49Z                            Prong\n",
      "11090790  user_000593  2007-04-05T13:59:55Z                       Fred Frith\n",
      "14016597  user_000743  2007-10-24T09:20:28Z              American Music Club\n",
      "7782064   user_000412  2006-07-25T09:59:55Z                       The Saints\n",
      "8053680   user_000427  2006-01-28T06:30:27Z                          Nirvana\n",
      "12391956  user_000672  2009-02-28T05:05:46Z  Fear Before The March Of Flames\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_parquet(\"data/music.parquet\") #load in the last.fm data set\n",
    "\n",
    "df = df.drop(df[df[\"2\"].str.len() > 60].index) # we remove long band names.\n",
    "\n",
    "print(df.sample(10, random_state=1))\n",
    "\n",
    "artists = df['2'].unique() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To save on memory we replace the artist names with integers. We save the dictionary which maps from artist names to integers to file, so that we can recover the artist names later. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "dartists = {y:x+1 for x,y in enumerate(set(artists))}\n",
    "dartists_inv = {x+1:y for x,y in enumerate(set(artists))}\n",
    "import pickle\n",
    "f = open(\"data/dartists.pkl\",\"wb\")\n",
    "pickle.dump(dartists_inv,f)\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pseudo users are generated such that their listening history is a mixture of listening histories of 'similar' users in the last.fm data set, where similarity is determined by comparing the [MinHash](https://en.wikipedia.org/wiki/MinHash) signature of the users' listening history. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_minhash_sig(user_dat, nhash):\n",
    "    mh = SimpleMinhash(nhash)\n",
    "    for row in user_dat:\n",
    "        mh.add(row)\n",
    "    return mh\n",
    "\n",
    "def unique_artists(df):\n",
    "    uniques = df['2'].unique()\n",
    "    return [dartists[artist] for artist in uniques]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped_df = df.groupby(['0']) #group the data set by user name\n",
    "un_artists = grouped_df.apply(unique_artists) #identify all artists listened to by each user\n",
    "mh_sigs = un_artists.apply(generate_minhash_sig, nhash = 128) #compute MinHash signature\n",
    "\n",
    "users = df['0'].unique() \n",
    "dusers = {x+1:y for x,y in enumerate(sorted(set(users)))} #Generating dictionary of user names. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given a 'parent' user, x, from the last.fm data set, listening history for a new user, y, is simulated such that: \n",
    "\n",
    "1. y has listened to a random sample of 90% of the artists x has listened to,\n",
    "2. for 5 users 'similar' to x, y has listened to 2% of their listening history. \n",
    "\n",
    "\n",
    "The 5 'similar' users are chosen at random from the ten users with minhash signatures most similar to x. From these users' history, we remove all artists that x also listened to. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_users = pd.DataFrame( columns=['user', 'artist','plays'])    \n",
    "ii = 0 \n",
    "kk = 0\n",
    "sv = 0\n",
    "for u in range(0, 992):    \n",
    "    print(u) \n",
    "    x = mh_sigs[u]\n",
    "    artists_listened = len(un_artists[u])\n",
    "    to_sample = int(np.floor(artists_listened)*0.02)\n",
    "    sim=[]\n",
    "    for mh in range(0, 992):\n",
    "        sim.append(mh_sigs[mh].similarity(mh_sigs[0]))\n",
    "    \n",
    "    similar = set(sorted(sim, reverse=True)[1:11]) # the ten largest similarities\n",
    "    similar_users = ([i for i, e in enumerate(sim) if e in similar]) # extract the user values\n",
    "    \n",
    "    \n",
    "    user_play_fr = grouped_df.get_group(dusers[(u+1)]).groupby(['2']).count()['1'].values\n",
    "    \n",
    "    \n",
    "    for j in range(0, 50):\n",
    "        ### make 50 new users for each user\n",
    "        kk += 1 \n",
    "        username = kk\n",
    "        #print(username)\n",
    "        selected = random.sample(similar_users, 5)\n",
    "        listened = []\n",
    "        for k in selected:\n",
    "            possible = np.setdiff1d(un_artists[k], (list(un_artists[u])+listened))\n",
    "            listened = listened + list(np.random.choice(un_artists[k], size = to_sample, replace = False))\n",
    "            \n",
    "        listened = listened + list(np.random.choice(un_artists[u], size=int(np.floor(artists_listened*0.9)), replace=False))\n",
    "        \n",
    "        ### now simulate user plays. \n",
    "        user_plays = np.random.choice(user_play_fr, size=len(listened), replace = False)\n",
    "        \n",
    "        user_data = {'user':np.repeat(username,len(listened), axis=0) , 'artist':listened, 'plays':user_plays} \n",
    "        user_df = pd.DataFrame(user_data) \n",
    "        new_users = pd.concat([new_users, user_df])\n",
    "        \n",
    "    ii += 1\n",
    "    if ii == 62:\n",
    "        sv +=1\n",
    "        ### write file to parquet every 20th user, and begin a new file\n",
    "        filename='data/userdat'+str(sv)+'.parquet'\n",
    "        print(filename)\n",
    "        new_users.to_parquet(filename)\n",
    "        ii = 0\n",
    "        new_users = pd.DataFrame( columns=['user', 'artist','plays'])    \n",
    "        \n",
    "        \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}