# RePlay recommender models comparison

### Dataset
We will compare RePlay models on __MovieLens 1m__. 

### Dataset preprocessing
Ratings greater than or equal to 3 are treated as positive interactions.

### Data split
The dataset is split by date so that the last 20% of interactions are placed in the test part. Cold items and users are dropped.
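
The split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not RePlay's `DateSplitter` implementation: the `(user, item, timestamp)` tuples, the `split_by_date` helper and the toy log are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical interaction record: (user, item, timestamp).
Interaction = Tuple[int, int, int]


def split_by_date(
    log: List[Interaction], test_fraction: float = 0.2
) -> Tuple[List[Interaction], List[Interaction]]:
    """Put the latest `test_fraction` of interactions into test and drop cold users/items."""
    ordered = sorted(log, key=lambda row: row[2])
    n_test = max(1, round(len(ordered) * test_fraction))
    # Timestamp of the first interaction that belongs to the test part.
    test_start = ordered[len(ordered) - n_test][2]

    train = [row for row in ordered if row[2] < test_start]
    test = [row for row in ordered if row[2] >= test_start]

    # Cold users and items never occur in train, so a model fitted on train cannot score them.
    train_users: Set[int] = {user for user, _, _ in train}
    train_items: Set[int] = {item for _, item, _ in train}
    test = [row for row in test if row[0] in train_users and row[1] in train_items]
    return train, test


toy_log = [
    (1, 10, 100), (1, 11, 200), (2, 10, 300), (2, 12, 400), (3, 11, 500),
    (1, 12, 600), (2, 11, 700), (3, 10, 800), (3, 12, 900), (4, 10, 1000),
]
toy_train, toy_test = split_by_date(toy_log, test_fraction=0.2)
print(len(toy_train), toy_test)
```

In the toy log, user 4 appears only after the split point, so that interaction is removed from the test part as cold.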

### Predict
We will predict the top-10 most relevant films for each user.

### Metrics
Quality metrics used: __ndcg@k, map@k, mrr@k__ for k = 10 and __hitrate@k__ for k = 1, 5, 10.
Additional metrics used: __coverage@k__ and __surprisal@k__ for k = 10.
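
For reference, the ranking metrics can be sketched for a single user with binary relevance. These are the common textbook definitions, written as hypothetical helper functions; RePlay's own implementations may differ in details such as normalization. MAP is left out because its normalization varies between libraries.

```python
import math
from typing import Sequence, Set


def hit_rate_at_k(recommended: Sequence[int], relevant: Set[int], k: int) -> float:
    """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))


def mrr_at_k(recommended: Sequence[int], relevant: Set[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top-k, 0.0 if none."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(recommended: Sequence[int], relevant: Set[int], k: int) -> float:
    """DCG with binary gains divided by the DCG of an ideal top-k list."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the relevant items 3 and 1 sit at ranks 2 and 4.
recommended = [5, 3, 8, 1]
relevant = {3, 1}
print(hit_rate_at_k(recommended, relevant, 1))  # 0.0, the top item is not relevant
print(mrr_at_k(recommended, relevant, 4))       # 0.5, first hit at rank 2
print(ndcg_at_k(recommended, relevant, 4))
```

The per-user values are then averaged over all test users to obtain the reported metric.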

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import logging
import pandas as pd
import time

from pyspark.sql import functions as sf, types as st
from pyspark.sql.types import IntegerType

from replay.data_preparator import DataPreparator
from replay.experiment import Experiment
from replay.metrics import Coverage, HitRate, MRR, MAP, NDCG, Surprisal
from replay.models import (
    ALSWrap, 
    ADMMSLIM, 
    KNN,
    LightFMWrap, 
    MultVAE, 
    NeuroMF, 
    SLIM, 
    PopRec, 
    RandomRec, 
    Wilson, 
    Word2VecRec
)

from replay.models.base_rec import HybridRecommender
from replay.session_handler import State
from replay.splitters import DateSplitter
from replay.utils import get_log_info
from rs_datasets import MovieLens

The `State` object lets you pass an existing Spark session or create a new one, which is then used by all RePlay modules.

To create a session with custom ``spark.driver.memory`` and ``spark.sql.shuffle.partitions`` parameters, use the `get_spark_session` function from the `session_handler` module.

In [5]:
spark = State().session
spark



In [6]:
logger = logging.getLogger("replay")

In [7]:
K = 10
K_list_metrics = [1, 5, 10]
BUDGET = 20
SEED = 12345

## 0. Preprocessing <a name='data-preparator'></a>

### 0.1. Data loading

In [8]:
data = MovieLens("1m")
data.info()

ratings

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |

users

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |

items

| | item_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |


#### log preprocessing

In [10]:
preparator = DataPreparator()
log, _, _ = preparator(data.ratings, mapping={"relevance": "rating"})
print(get_log_info(log))



total lines: 1000209, total users: 6040, total items: 3706


                                                                                

In [11]:
# treat ratings >= 3 as positive feedback and set relevance = 1 for every positive interaction
only_positives_log = log.filter(sf.col('relevance') >= 3).withColumn('relevance', sf.lit(1))
only_positives_log.count()

836478

In [12]:
user_features=None
item_features=None

### 0.2. Data split

In [13]:
# train/test split 
train_spl = DateSplitter(
    test_start=0.2,
    drop_cold_items=True,
    drop_cold_users=True,
)
train, test = train_spl.split(only_positives_log)
print('train info:\n', get_log_info(train))
print('test info:\n', get_log_info(test))


train info:
 total lines: 669181, total users: 5397, total items: 3569




test info:
 total lines: 86542, total users: 1139, total items: 3279


                                                                                

In [14]:
# train/test split for hyperparameters selection
opt_train, opt_val = train_spl.split(train)
opt_train.count(), opt_val.count()



(535343, 24241)

In [15]:
# negative feedback (ratings < 3, relevance = 0) is used only by the Wilson model
only_negatives_log = log.filter(sf.col('relevance') < 3).withColumn('relevance', sf.lit(0.))
test_start = test.agg(sf.min('timestamp')).collect()[0][0]

# train with both positive and negative feedback
pos_neg_train=(train
              .withColumn('relevance', sf.lit(1))
              .union(only_negatives_log.filter(sf.col('timestamp') < test_start))
             )
pos_neg_train.count()

798993
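
The Wilson model needs both positive and negative feedback because it ranks items by a lower confidence bound on the share of positive ratings. A minimal sketch of the Wilson score lower bound follows. The `wilson_lower_bound` helper and the 95% confidence level are assumptions for illustration and are not taken from RePlay's code.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(positives: int, total: int, confidence: float = 0.95) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = positives / total
    centre = p_hat + z * z / (2 * total)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)


# Both items have 90% positive ratings, but the one with more ratings gets a higher bound.
few_ratings = wilson_lower_bound(positives=9, total=10)
many_ratings = wilson_lower_bound(positives=90, total=100)
print(few_ratings, many_ratings)
```

Ranking by this bound rather than by the raw share of positives penalizes items that have only a handful of ratings.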

In [16]:
train.show(2)

+---------+---------+--------+--------+
|relevance|timestamp|user_idx|item_idx|
+---------+---------+--------+--------+
|        1|975735012|     677|    1314|
|        1|975736432|     677|    1282|
+---------+---------+--------+--------+
only showing top 2 rows



# 1. Metrics definition

In [17]:
# experiment is used for metrics calculation
e = Experiment(test, {MAP(): K, NDCG(): K, HitRate(): K_list_metrics, Coverage(train): K, Surprisal(train): K, MRR(): K})
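
`Coverage` and `Surprisal` take the train log because both describe recommendations relative to what was seen in training. The sketch below shows one common way to define them. The helper names, the `(user, item)` toy log and the handling of unseen items are assumptions, and it presumes more than one user in the train log.

```python
import math
from typing import Dict, List, Set, Tuple


def coverage_at_k(recs: Dict[int, List[int]], train_items: Set[int], k: int) -> float:
    """Share of train items that appear in at least one user's top-k list."""
    recommended = {item for items in recs.values() for item in items[:k]}
    return len(recommended & train_items) / len(train_items)


def surprisal_at_k(recs: Dict[int, List[int]], train_log: List[Tuple[int, int]], k: int) -> float:
    """Mean normalized self-information of the top-k recommended items."""
    n_users = len({user for user, _ in train_log})
    users_per_item: Dict[int, int] = {}
    for _, item in set(train_log):  # count each (user, item) pair once
        users_per_item[item] = users_per_item.get(item, 0) + 1

    def weight(item: int) -> float:
        # Assumption: an item unseen in train counts as if one user interacted with it.
        share = users_per_item.get(item, 1) / n_users
        return -math.log2(share) / math.log2(n_users)

    per_user = [sum(weight(item) for item in items[:k]) / k for items in recs.values()]
    return sum(per_user) / len(per_user)


# Toy data: item 10 was seen by all four users, items 11 and 12 by one user each.
toy_train_log = [(1, 10), (2, 10), (3, 10), (4, 10), (1, 11), (2, 12)]
toy_recs = {1: [10, 12], 2: [10, 11]}
toy_items = {item for _, item in toy_train_log}
print(coverage_at_k(toy_recs, toy_items, k=2))
print(surprisal_at_k(toy_recs, toy_train_log, k=2))
```

Under these definitions a popularity-only recommender scores low on both metrics, while a uniform random recommender scores high on both, which matches the pattern in the result tables of this notebook.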

# 2. Models training

In [18]:
def fit_predict_add_res(name, model, experiment, train, suffix=''):
    """
    Run fit_predict for the `model`, measure time on fit_predict and evaluate metrics
    """
    start_time=time.time()
    
    fit_predict_params = {'log': train, 'k': K, 'users': test.select('user_idx').distinct()}
    if isinstance(model, Wilson):
        fit_predict_params['log'] = pos_neg_train

    if isinstance(model, HybridRecommender):
        fit_predict_params['item_features'] = item_features
        fit_predict_params['user_features'] = user_features
    
    pred = model.fit_predict(**fit_predict_params)
    # Spark is lazy: count() forces the prediction to be computed so the timing covers it
    pred.count()
    fit_predict_time = time.time() - start_time
    
    experiment.add_result(name + suffix, pred)
    experiment.results.loc[name + suffix, 'fit_pred_time'] = fit_predict_time
    
    print(experiment.results[['NDCG@{}'.format(K), 'MRR@{}'.format(K), 'Coverage@{}'.format(K), 'fit_pred_time']].sort_values('NDCG@{}'.format(K), ascending=False))

In [19]:
def full_pipeline(models, experiment, train, suffix='', budget=BUDGET):
    """
    For each model:
        - unless params == 'no_opt': run hyperparameter search (params=None searches the model's default borders),
          set the best params and save their values to `experiment`
        - pass the model to `fit_predict_add_res`
    """
    
    for name, (model, params) in models.items():
        model.logger.info(msg='{} started'.format(name))
        if params != 'no_opt':
            model.logger.info(msg='{} optimization started'.format(name))
            best_params = model.optimize(opt_train, 
                                         opt_val, 
                                         param_borders=params, 
                                         item_features=item_features,
                                         user_features=user_features,
                                         k=K, 
                                         budget=budget)
            model.set_params(**best_params)
            logger.info(msg='best params for {} are: {}'.format(name, best_params))
            experiment.results.loc[name + suffix, 'params'] = repr(best_params)
        
        logger.info(msg='{} fit_predict started'.format(name))
        fit_predict_add_res(name, model, experiment, train, suffix)        
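
To make the role of `param_borders` and `budget` concrete, here is a sketch of a budgeted search. It uses plain random search as a stand-in and is not how `model.optimize` works internally (the trial logs in this notebook come from Optuna). The `random_search` helper and `toy_objective` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]


def random_search(
    objective: Callable[[Params], float],
    param_borders: Dict[str, List[float]],
    budget: int,
    seed: int = 12345,
) -> Tuple[Optional[Params], float]:
    """Evaluate `budget` random candidates inside the borders and keep the best one."""
    rng = random.Random(seed)
    best_params: Optional[Params] = None
    best_value = float("-inf")
    for _ in range(budget):
        candidate = {
            name: rng.uniform(low, high) for name, (low, high) in param_borders.items()
        }
        value = objective(candidate)  # e.g. NDCG@K on the validation split
        if value > best_value:
            best_params, best_value = candidate, value
    return best_params, best_value


def toy_objective(params: Params) -> float:
    """Stand-in for a validation metric; it peaks at alpha = 3."""
    return -((params["alpha"] - 3.0) ** 2)


best_params, best_value = random_search(toy_objective, {"alpha": [-0.5, 100]}, budget=20)
print(best_params, best_value)
```

In the pipeline above, the quantity being maximized is computed by fitting on `opt_train` and evaluating on `opt_val`, so the test set is never touched during the search.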

## 2.1. Non-personalized models

In [20]:
non_personalized_models = {'Popular Recommender': [PopRec(), 'no_opt'], 
          'Random Recommender (uniform)': [RandomRec(seed=SEED, distribution='uniform'), 'no_opt'], 
          'Random Recommender (popularity-based)': [RandomRec(seed=SEED, distribution='popular_based'), {"alpha": [-0.5, 100]}],
          'Wilson Recommender': [Wilson(), 'no_opt']}

In [21]:
%%time
full_pipeline(non_personalized_models, e, train)

25-Feb-22 18:06:10, replay, INFO: Popular Recommender started
25-Feb-22 18:06:10, replay, INFO: Popular Recommender fit_predict started
25-Feb-22 18:07:09, replay, INFO: Rando

                      NDCG@10    MRR@10  Coverage@10  fit_pred_time
Popular Recommender  0.243783  0.390426     0.033903       16.93119


25-Feb-22 18:07:57, replay, INFO: Random Recommender (popularity-based) started
25-Feb-22 18:07:57, replay, INFO: Random Recommender (popularity-based) optimization started
[I 2022-02-25 18:07:57,597] A new study created in memory with name: no-name-0c8690d5-63e3-4ac4-a88e-4c0635389c0b


                               NDCG@10    MRR@10  Coverage@10  fit_pred_time
Popular Recommender           0.243783  0.390426     0.033903      16.931190
Random Recommender (uniform)  0.021725  0.054846     0.957691      11.719672


[I 2022-02-25 18:08:09,888] Trial 0 finished with value: 0.070029319068223 and parameters: {'distribution': 'popular_based', 'alpha': 0.0}. Best is trial 0 with value: 0.070029319068223.
[I 2022-02-25 18:08:24,813] Trial 1 finished with value: 0.05713612473413634 and parameters: {'distribution': 'popular_based', 'alpha': 75.03346685193002}. Best is trial 0 with value: 0.070029319068223.
[I 2022-02-25 18:08:36,593] Trial 2 finished with value: 0.05153169911235664 and parameters: {'distribution': 'popular_based', 'alpha': 97.54888710139787}. Best is trial 0 with value: 0.070029319068223.

                                        NDCG@10    MRR@10  Coverage@10  \
Popular Recommender                    0.243783  0.390426     0.033903   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   

                                       fit_pred_time  
Popular Recommender                        16.931190  
Random Recommender (popularity-based)      10.100435  
Random Recommender (uniform)               11.719672  



                                        NDCG@10    MRR@10  Coverage@10  \
Popular Recommender                    0.243783  0.390426     0.033903   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   

                                       fit_pred_time  
Popular Recommender                        16.931190  
Wilson Recommender                         16.660789  
Random Recommender (popularity-based)      10.100435  
Random Recommender (uniform)               11.719672  
CPU times: user 3.49 s, sys: 1.33 s, total: 4.82 s
Wall time: 6min 50s




In [22]:
e.results.sort_values('NDCG@10', ascending=False)

| | Coverage@10 | HitRate@1 | HitRate@5 | HitRate@10 | MAP@10 | MRR@10 | NDCG@10 | Surprisal@10 | fit_pred_time | params |
|---|---|---|---|---|---|---|---|---|---|---|
| Popular Recommender | 0.033903 | 0.28446 | 0.53029 | 0.645303 | 0.157301 | 0.390426 | 0.243783 | 0.118354 | 16.93119 | |
| Wilson Recommender | 0.017092 | 0.083406 | 0.34504 | 0.414399 | 0.045002 | 0.180976 | 0.092121 | 0.26219 | 16.660789 | |
| Random Recommender (popularity-based) | 0.760437 | 0.071115 | 0.261633 | 0.378402 | 0.027026 | 0.150434 | 0.066665 | 0.344784 | 10.100435 | {'distribution': 'popular_based', 'alpha': 24.... |
| Random Recommender (uniform) | 0.957691 | 0.017559 | 0.100088 | 0.167691 | 0.007332 | 0.054846 | 0.021725 | 0.538677 | 11.719672 | |


In [23]:
e.results.to_csv('res_21_rel_1.csv')

## 2.2. Personalized models without features

In [24]:
common_models = {
          'ADMM SLIM': [ADMMSLIM(seed=SEED), {"lambda_1": [1e-6, 10],
                                              "lambda_2": [1e-6, 1000]},],
          'Implicit ALS': [ALSWrap(seed=SEED), None], 
          'Explicit ALS': [ALSWrap(seed=SEED, implicit_prefs=False), None], 
          'KNN': [KNN(), None], 
          'LightFM': [LightFMWrap(random_state=SEED), {"no_components": [8, 512]}], 
          'SLIM': [SLIM(seed=SEED), None]}

In [25]:
%%time
full_pipeline(common_models, e, train)

25-Feb-22 18:13:01, replay, INFO: ADMM SLIM started
25-Feb-22 18:13:01, replay, INFO: ADMM SLIM optimization started
[I 2022-02-25 18:13:01,337] A new study created in memory with name: no-name-474fa5aa-13fa-4eb0-8877-90a6ada17610
[I 2022-02-25 18:13:27,023] Trial 0 finished with value: 0.21300281804105106 and parameters: {'lambda_1': 0.8417364694294401, 'lambda_2': 62.68159062953527}. Best is trial 0 with value: 0.21300281804105106.
[I 2022-02-25 18:13:47,243] Trial 1 finished with value: 0.16633606135193407 and parameters

[I 2022-02-25 18:19:01,726] Trial 12 finished with value: 0.19905606955492455 and parameters: {'lambda_1': 6.517023092744527, 'lambda_2': 805.8144516511958}. Best is trial 0 with value: 0.21300281804105106.
[I 2022-02-25 18:19:34,402] Trial 13 finished with value: 0.16177655469968685 and parameters: {'lambda_1': 0.7561910399408325, 'lambda_2': 5.183021744998351}. Best is trial 0 with value: 0.21300281804105106.
[I 2022-02-25 18:19:58,256] Trial 14 finished with value: 0.18577685563341298 and parameters: {'lambda_1': 9.373073583775676, 'lambda_2': 39.9159121546185}. Best is trial 0 with value: 0.21300281804105106.

                                        NDCG@10    MRR@10  Coverage@10  \
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   

                                       fit_pred_time  
Popular Recommender                        16.931190  
ADMM SLIM                                  56.977886  
Wilson Recommender                         16.660789  
Random Recommender (popularity-based)      10.100435  
Random Recommender (uniform)               11.719672  


[I 2022-02-25 18:24:53,988] Trial 0 finished with value: 0.2087745613888222 and parameters: {'rank': 10}. Best is trial 0 with value: 0.2087745613888222.

[I 2022-02-25 18:29:34,674] Trial 5 finished with value: 0.19340064264131435 and parameters: {'rank': 29}. Best is trial 0 with value: 0.2087745613888222.
[I 2022-02-25 18:30:01,211] Trial 6 finished with value: 0.19864077980817504 and parameters: {'rank': 25}. Best is trial 0 with value: 0.2087745613888222.

                                        NDCG@10    MRR@10  Coverage@10  \
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   

                                       fit_pred_time  
Implicit ALS                               32.185843  
Popular Recommender                        16.931190  
ADMM SLIM                                  56.977886  
Wilson Recommender                         16.660789  
Random Recommender (popularity-based)      10.100435  
Random Recommender (uniform)               11.719672  


[I 2022-02-25 18:40:58,997] Trial 0 finished with value: 0.008955289357265354 and parameters: {'rank': 10}. Best is trial 0 with value: 0.008955289357265354.
[I 2022-02-25 18:42:04,407] Trial 1 finished with value: 0.023477530684574626 and parameters: {'rank': 154}. Best is trial 1 with value: 0.023477530684574626.
[I 2022-02-25 18:44:06,220] Trial 2 finished with value: 0.019533663325742814 and parameters: {'rank': 252}. Best is trial 1 with value: 0.023477530684574626.
[I 2022-02-25 18:44:43,129] Trial 3 fi

[I 2022-02-25 18:51:07,353] Trial 12 finished with value: 0.028913571540665126 and parameters: {'rank': 26}. Best is trial 5 with value: 0.0351628297834589.
[I 2022-02-25 18:51:49,459] Trial 13 finished with value: 0.03223115300900205 and parameters: {'rank': 68}. Best is trial 5 with value: 0.0351628297834589.
[I 2022-02-25 18:52:37,976] Trial 14 finished with value: 0.02440488953053382 and parameters: {'rank': 87}. Best is trial 5 with value: 0.0351628297834589.

                                        NDCG@10    MRR@10  Coverage@10  \
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                       fit_pred_time  
Implicit ALS                               32.185843  
Popular Recommender                        16.931190  
ADMM SLIM                                  56.977886  
Wilson Recommender                         16.660789  
Random Recommender (popularity-based)      10.100435  
Random Recommender (uniform)               11.719672  
Explicit ALS          

[I 2022-02-25 19:00:47,700] Trial 0 finished with value: 0.20815145550561892 and parameters: {'num_neighbours': 10, 'shrink': 0}. Best is trial 0 with value: 0.20815145550561892.
[I 2022-02-25 19:01:02,302] Trial 1 finished with value: 0.23523974161744457 and parameters: {'num_neighbours': 75, 'shrink': 78}. Best is trial 1 with value: 0.23523974161744457.
[I 2022-02-25 19:01:17,643] Trial 2 finished with value: 0.21441020220440898 and parameters: {'num_neighbours': 16, 'shrink': 30}. Best is trial 1 with value: 0.23523974161744457.
[I 2022-02-25 19:01:32,878] Trial 3 finished with value: 0.22448744295760434 and parameters: {'num_neighbours': 44, 'shrink': 27}. Best is trial 1 with value: 0.23523974161744457.
[I 2022-02-25 19:01:43,331] Trial 4 finished with value: 0.23279789836463874 and parameters: {'num_neighbours': 82, 'shrink': 51}. Best is trial 1 with value: 0.23523974161744457.
[I 2022-02-25 19:02:01,535]



                                        NDCG@10    MRR@10  Coverage@10  \
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                       fit_pred_time  
KNN                                        36.018561  
Implicit ALS                               32.185843  
Popular Recommender                        16.931190  
ADMM SLIM                                  56.977886  
Wilson Recommender                         16.660789  
Ran

[I 2022-02-25 19:07:29,124] Trial 0 finished with value: 0.1875257299916022 and parameters: {'loss': 'warp', 'no_components': 128}. Best is trial 0 with value: 0.1875257299916022.
[I 2022-02-25 19:08:00,855] Trial 1 finished with value: 0.1710494878307369 and parameters: {'loss': 'warp', 'no_components': 267}. Best is trial 0 with value: 0.1875257299916022.
[I 2022-02-25 19:08:35,211] Trial 2 finished with value: 0.21059182723504485 and parameters: {'loss': 'warp', 'no_components': 19}. Best is trial 2 with value: 0.21059182723504485.

[I 2022-02-25 19:13:21,382] Trial 11 finished with value: 0.20817796681780582 and parameters: {'loss': 'warp', 'no_components': 30}. Best is trial 4 with value: 0.21570849200557923.
[I 2022-02-25 19:13:54,813] Trial 12 finished with value: 0.21646685479648656 and parameters: {'loss': 'warp', 'no_components': 8}. Best is trial 12 with value: 0.21646685479648656.
[I 2022-02-25 19:14:28,815] Trial 13 finished with value: 0.21974415173227532 and parameters: {'loss': 'warp', 'no_components': 9}. Best is trial 13 with value: 0.21974415173227532.

                                        NDCG@10    MRR@10  Coverage@10  \
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                       fit_pred_time  
LightFM                                    28.989394  
KNN                                        36.018561  
Implicit ALS                               32.185843  
Popular Recommender                    

[I 2022-02-25 19:20:47,377] Trial 0 finished with value: 0.18690087157310825 and parameters: {'beta': 0.01, 'lambda_': 0.01}. Best is trial 0 with value: 0.18690087157310825.
[I 2022-02-25 19:21:11,009] Trial 1 finished with value: 0.18594299730857067 and parameters: {'beta': 0.008860922345504325, 'lambda_': 0.0010236160434899768}. Best is trial 0 with value: 0.18690087157310825.
[I 2022-02-25 19:21:33,441] Trial 2 finished with value: 0.1914804215004526 and parameters: {'beta': 0.03399489931280087, 'lambda_': 1.0689126947930436e-05}. Best is trial 2 with value: 0.1914804215004526.

[I 2022-02-25 19:25:50,260] Trial 15 finished with value: 0.20300583504319952 and parameters: {'beta': 0.28289285694841343, 'lambda_': 0.0017039249402949794}. Best is trial 11 with value: 0.23527354383676072.
[I 2022-02-25 19:26:05,324] Trial 16 finished with value: 0.20343160064300064 and parameters: {'beta': 0.0006544018473466481, 'lambda_': 0.07281205873403812}. Best is trial 11 with value: 0.23527354383676072.
[I 2022-02-25 19:26:24,075] Trial 17 finished with value: 0.18172222659066103 and parameters: {'beta': 5.497105051749887e-06, 'lambda_': 0.0015014482291685223}. Best is trial 11 with value: 0.23527354383676072.

                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                       fit_pred_time  
SLIM                                       41.092457  
LightFM                                    28.989394  
KNN                 

                                                                                

In [26]:
e.results.sort_values('NDCG@10', ascending=False)

Unnamed: 0,Coverage@10,HitRate@1,HitRate@5,HitRate@10,MAP@10,MRR@10,NDCG@10,Surprisal@10,fit_pred_time,params
SLIM,0.063323,0.324846,0.583845,0.693591,0.176329,0.434489,0.270859,0.133323,41.092457,"{'beta': 4.65156643702147, 'lambda_': 0.000203..."
LightFM,0.156346,0.324846,0.581212,0.694469,0.170251,0.436674,0.267207,0.169012,28.989394,"{'loss': 'warp', 'no_components': 9}"
KNN,0.054077,0.302897,0.556629,0.648815,0.168665,0.412565,0.258407,0.138554,36.018561,"{'num_neighbours': 75, 'shrink': 78}"
Implicit ALS,0.131129,0.292362,0.562774,0.681299,0.16214,0.406855,0.253444,0.163824,32.185843,{'rank': 8}
Popular Recommender,0.033903,0.28446,0.53029,0.645303,0.157301,0.390426,0.243783,0.118354,16.93119,
ADMM SLIM,0.348837,0.258121,0.541703,0.647937,0.127043,0.373958,0.21648,0.221984,56.977886,"{'lambda_1': 0.8417364694294401, 'lambda_2': 6..."
Wilson Recommender,0.017092,0.083406,0.34504,0.414399,0.045002,0.180976,0.092121,0.26219,16.660789,
Random Recommender (popularity-based),0.760437,0.071115,0.261633,0.378402,0.027026,0.150434,0.066665,0.344784,10.100435,"{'distribution': 'popular_based', 'alpha': 24...."
Random Recommender (uniform),0.957691,0.017559,0.100088,0.167691,0.007332,0.054846,0.021725,0.538677,11.719672,
Explicit ALS,0.569908,0.017559,0.070237,0.124671,0.006534,0.041331,0.017995,0.540517,50.138072,{'rank': 32}
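To make the quality columns easier to read, here is a minimal plain-Python sketch of HitRate@k, MRR@k and NDCG@k for a single user with binary relevance. It is an illustration only: the function names and toy data are hypothetical, and RePlay's own metric implementations may differ in details such as averaging over users and tie handling.

```python
import math

def hit_rate_at_k(recs, relevant, k):
    """1.0 if any of the top-k recommendations is relevant, else 0.0."""
    return float(any(item in relevant for item in recs[:k]))

def mrr_at_k(recs, relevant, k):
    """Reciprocal rank of the first relevant item in the top-k, 0.0 if none."""
    for rank, item in enumerate(recs[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recs, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recs[:k], start=1)
        if item in relevant
    )
    # The ideal ranking puts all relevant items (at most k of them) first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

recs = ["a", "b", "c"]   # hypothetical ranked recommendations for one user
relevant = {"b", "d"}    # hypothetical test-set items for the same user

hr = hit_rate_at_k(recs, relevant, k=3)
rr = mrr_at_k(recs, relevant, k=3)
ndcg = ndcg_at_k(recs, relevant, k=3)
```

In this toy case the only hit sits at rank 2, so the reciprocal rank is 0.5, while NDCG is penalized both for the hit's position and for the missing second relevant item.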


In [27]:
e.results.to_csv('res_22_rel_1.csv')

## 2.3 Neural models

In [28]:
nets = {'MultVAE with default parameters': [MultVAE(), 'no_opt'],
        'NeuroMF with default parameters': [NeuroMF(), 'no_opt'], 
        'Word2Vec with default parameters': [Word2VecRec(seed=SEED), 'no_opt'],
        'MultVAE with optimized parameters': [MultVAE(), {"learning_rate": [0.001, 0.5],
                                   "dropout": [0, 0.5],
                                    "l2_reg": [1e-6, 5]
                                   }],
        'NeuroMF with optimized parameters': [NeuroMF(), {
                                    "learning_rate": [0.001, 0.5],
                                    "l2_reg": [1e-6, 5],
                                    "count_negative_sample": [1, 20]
                                    }],
        'Word2Vec with optimized parameters': [Word2VecRec(seed=SEED), None]}

25-Feb-22 19:29:58, replay, INFO: The model is neural network with non-distributed training
25-Feb-22 19:29:58, replay, INFO: The model is neural network with non-distributed training
25-Feb-22 19:29:58, replay, INFO: The model is neural network with non-distributed training
25-Feb-22 19:29:58, replay, INFO: The model is neural network with non-distributed training
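Each entry in `nets` pairs a model with a search-space specification. Judging by the logs in this notebook, `'no_opt'` appears to mean "fit with the constructor defaults", a dict gives `[low, high]` bounds for the listed hyperparameters, and `None` appears to fall back to the model's default search space. The helper below is a hypothetical sketch of that convention and is not part of RePlay or of `full_pipeline`.

```python
def describe_entry(name, spec):
    """Describe how a [model, search_space] entry is assumed to be interpreted."""
    _model, search_space = spec
    if search_space == "no_opt":
        return f"{name}: fit with constructor defaults, no tuning"
    if search_space is None:
        return f"{name}: tune over the model's default search space"
    # Otherwise we assume a dict mapping parameter name -> [low, high] bounds.
    bounds = ", ".join(
        f"{param} in [{low}, {high}]" for param, (low, high) in search_space.items()
    )
    return f"{name}: tune {bounds}"

# Placeholder objects stand in for real model instances in this sketch.
examples = {
    "default": [object(), "no_opt"],
    "bounded": [object(), {"dropout": [0, 0.5]}],
    "auto": [object(), None],
}
descriptions = {name: describe_entry(name, spec) for name, spec in examples.items()}
```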


In [29]:
%%time
full_pipeline(nets, e, train, budget=10)

25-Feb-22 19:30:02, replay, INFO: MultVAE with default parameters started
25-Feb-22 19:30:02, replay, INFO: MultVAE with default parameters fit_predict started
22/02/25 19:30:02 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:30:02 WARN CacheManager: Asked to cache already cached data.
2022-02-25 19:30:12,385 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
25-Feb-22 19:32:21, replay, INFO: NeuroMF with default parameters started
25-Feb-22 19:32:21, replay, INFO: NeuroMF with default parameters fit_predict started
22/02/25 19:32:21 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:32:21 WARN CacheManager: Asked to cache already cached data.


                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters        0.075648  0.120041     0.011488   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                       fit_pred_time  
SLIM                                       41.092457  
L

2022-02-25 19:35:45,008 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
25-Feb-22 19:42:15, replay, INFO: Word2Vec with default parameters started      
25-Feb-22 19:42:15, replay, INFO: Word2Vec with default parameters fit_predict started
22/02/25 19:42:15 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:42:15 WARN CacheManager: Asked to cache already cached data.


                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters        0.075648  0.120041     0.011488   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                           0.017995  0.041331     0.569908   

                                     

25-Feb-22 19:52:55, replay, INFO: MultVAE with optimized parameters started
25-Feb-22 19:52:55, replay, INFO: MultVAE with optimized parameters optimization started
[I 2022-02-25 19:52:55,977] A new study created in memory with name: no-name-406f77e5-8a2b-4499-ad1d-3dd200f5f28d
22/02/25 19:52:55 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:52:55 WARN CacheManager: Asked to cache already cached data.


                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Word2Vec with default parameters       0.139835  0.247189     0.139255   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters        0.075648  0.120041     0.011488   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)           0.021725  0.054846     0.957691   
Explicit ALS                          

2022-02-25 19:53:04,801 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[I 2022-02-25 19:53:19,996] Trial 0 finished with value: 0.18084826743321916 and parameters: {'learning_rate': 0.059780365208038436, 'epochs': 100, 'latent_dim': 200, 'hidden_dim': 600, 'dropout': 0.09979924534622853, 'anneal': 0.1, 'l2_reg': 0.25185273202603975, 'gamma': 0.99}. Best is trial 0 with value: 0.18084826743321916.
22/02/25 19:53:20 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:53:20 WARN CacheManager: Asked to cache already cached data.
2022-02-25 19:53:32,065 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[I 2022-02-25 19:53:49,494] Trial 1 finished with value: 0.20853843319338372 and parameters: {'learning_rate': 0.3583302116611014, 'epochs': 100, 'latent_dim': 200, 'hidden_dim': 600, 'dropout': 0.28709685309539446, 'anneal': 0.1, 'l2_reg': 1.666118292644979e-05, 'gamma': 0.99}. Best is trial 1

                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
MultVAE with optimized parameters      0.236728  0.378478     0.034744   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Word2Vec with default parameters       0.139835  0.247189     0.139255   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters        0.075648  0.120041     0.011488   
Random Recommender (popularity-based)  0.066665  0.150434     0.760437   
Random Recommender (uniform)          

22/02/25 19:59:45 WARN CacheManager: Asked to cache already cached data.
22/02/25 19:59:45 WARN CacheManager: Asked to cache already cached data.
[I 2022-02-25 20:10:41,181] Trial 0 finished with value: 0.2064613019160715 and parameters: {'embedding_gmf_dim': 128, 'embedding_mlp_dim': 128, 'learning_rate': 0.007286919999637349, 'l2_reg': 5.2988661733736925e-06, 'gamma': 0.99, 'count_negative_sample': 5}. Best is trial 0 with value: 0.2064613019160715.
22/02/25 20:10:41 WARN CacheManager: Asked to cache already cached data.
22/02/25 20:10:41 WARN CacheManager: Asked to cache already cached data.
2022-02-25 20:27:33,479 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[I 2022-02-25 20:28:14,765] Trial 1 finished with value: 0.17299232765028 and parameters: {'embedding_gmf_dim': 128, 'embedding_mlp_dim': 128, 'learning_rate': 0.1233331520732471, 'l2_reg': 4.803592936767287e-05, 'gamma': 0.99, 'count_negative_sample': 13}. Best is trial 

2022-02-25 22:10:21,855 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[I 2022-02-25 22:11:02,352] Trial 9 finished with value: 0.17543632527491465 and parameters: {'embedding_gmf_dim': 128, 'embedding_mlp_dim': 128, 'learning_rate': 0.060119225245243824, 'l2_reg': 0.02935821245727142, 'gamma': 0.99, 'count_negative_sample': 6}. Best is trial 6 with value: 0.22209780168477072.
25-Feb-22 22:11:02, replay, INFO: best params for NeuroMF with optimized parameters are: {'embedding_gmf_dim': 128, 'embedding_mlp_dim': 128, 'learning_rate': 0.01546369650648298, 'l2_reg': 0.3132798237751134, 'gamma': 0.99, 'count_negative_sample': 3}
25-Feb-22 22:11:02, replay, INFO: NeuroMF with optimized parameters fit_predict started
22/02/25 22:11:02 WARN CacheManager: Asked to cache already cached data.
22/02/25 22:11:02 WARN CacheManager: Asked to cache already cached data.
25-Feb-22 22:27:22, replay, INFO: Word2Vec with optimized parameters started    
25-Feb

                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
NeuroMF with optimized parameters      0.239874  0.397850     0.092463   
MultVAE with optimized parameters      0.236728  0.378478     0.034744   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Word2Vec with default parameters       0.139835  0.247189     0.139255   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters        0.075648  0.120041     0.011488   
Random Recommender (popularity-based) 

[I 2022-02-25 22:29:57,549] Trial 0 finished with value: 0.13781682129218153 and parameters: {'rank': 100, 'window_size': 1, 'use_idf': False}. Best is trial 0 with value: 0.13781682129218153.
22/02/25 22:29:57 WARN CacheManager: Asked to cache already cached data.
22/02/25 22:29:57 WARN CacheManager: Asked to cache already cached data.
[I 2022-02-25 22:35:13,468] Trial 1 finished with value: 0.0307612393780298 and parameters: {'rank': 193, 'window_size': 72, 'use_idf': True}. Best is trial 0 with value: 0.13781682129218153.
22/02/25 22:35:13 WARN CacheManager: Asked to cache already cached data.
22/02/25 22:35:13 WARN CacheManager: Asked to cache already cached data.
[I 2022-02-25 22:39:04,541] Trial 2 finished with value: 0.04790512088419334 and parameters: {'rank': 203, 'window_size': 37, 'use_idf': False}. Best is trial 0 with value: 0.13781682129218153.
22/02/25 22:39:04 WARN CacheManager: Asked to cache already cached data.
22/02/25 22:39:04

                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
Popular Recommender                    0.243783  0.390426     0.033903   
NeuroMF with optimized parameters      0.239874  0.397850     0.092463   
MultVAE with optimized parameters      0.236728  0.378478     0.034744   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Word2Vec with default parameters       0.139835  0.247189     0.139255   
Word2Vec with optimized parameters     0.139835  0.247189     0.139255   
Wilson Recommender                     0.092121  0.180976     0.017092   
MultVAE with default parameters       



In [30]:
e.results.sort_values('NDCG@10', ascending=False)

Model,Coverage@10,HitRate@1,HitRate@5,HitRate@10,MAP@10,MRR@10,NDCG@10,Surprisal@10,fit_pred_time,params
SLIM,0.063323,0.324846,0.583845,0.693591,0.176329,0.434489,0.270859,0.133323,41.092457,"{'beta': 4.65156643702147, 'lambda_': 0.000203..."
LightFM,0.156346,0.324846,0.581212,0.694469,0.170251,0.436674,0.267207,0.169012,28.989394,"{'loss': 'warp', 'no_components': 9}"
KNN,0.054077,0.302897,0.556629,0.648815,0.168665,0.412565,0.258407,0.138554,36.018561,"{'num_neighbours': 75, 'shrink': 78}"
Implicit ALS,0.131129,0.292362,0.562774,0.681299,0.16214,0.406855,0.253444,0.163824,32.185843,{'rank': 8}
Popular Recommender,0.033903,0.28446,0.53029,0.645303,0.157301,0.390426,0.243783,0.118354,16.93119,
NeuroMF with optimized parameters,0.092463,0.291484,0.542581,0.654083,0.150862,0.39785,0.239874,0.160814,657.766227,"{'embedding_gmf_dim': 128, 'embedding_mlp_dim'..."
MultVAE with optimized parameters,0.034744,0.273047,0.524144,0.643547,0.151765,0.378478,0.236728,0.121098,45.044486,"{'learning_rate': 0.015677916317796903, 'epoch..."
ADMM SLIM,0.348837,0.258121,0.541703,0.647937,0.127043,0.373958,0.21648,0.221984,56.977886,"{'lambda_1': 0.8417364694294401, 'lambda_2': 6..."
NeuroMF with default parameters,0.261138,0.216857,0.492537,0.622476,0.113413,0.336243,0.198796,0.222385,283.866455,
Word2Vec with default parameters,0.139255,0.147498,0.38367,0.500439,0.074579,0.247189,0.139835,0.237858,161.589033,


In [31]:
e.results.to_csv('res_23_rel_1.csv')

## 2.4 Models considering features

### 2.4.1 Item features preprocessing

In [32]:
%%time
preparator = DataPreparator()
log, _, item_features = preparator(data.ratings, item_features=data.items, mapping={"relevance": "rating"})

                                                                                

CPU times: user 566 ms, sys: 37.5 ms, total: 603 ms
Wall time: 4.07 s


In [33]:
item_features.show(2)

+----------------+--------------------+--------+
|           title|              genres|item_idx|
+----------------+--------------------+--------+
|Toy Story (1995)|Animation|Childre...|      29|
|  Jumanji (1995)|Adventure|Childre...|     393|
+----------------+--------------------+--------+
only showing top 2 rows



In [34]:
year = item_features.withColumn('year', sf.substring(sf.col('title'), -5, 4).astype(st.IntegerType())).select('item_idx', 'year')
year.show(2)

+--------+----+
|item_idx|year|
+--------+----+
|      29|1995|
|     393|1995|
+--------+----+
only showing top 2 rows
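The `substring(title, -5, 4)` call relies on titles ending in a pattern like `(1995)`: it starts five characters from the end and takes four characters. A plain-Python sketch of the same slicing is shown below; the helper name is hypothetical, and returning `None` for non-numeric text only approximates Spark's cast, which yields null in that case.

```python
def year_from_title(title):
    """Take the 4 characters that start 5 from the end and parse them as a year."""
    digits = title[-5:-1]
    # Titles that do not end in "(YYYY)" produce None instead of a year.
    return int(digits) if digits.isdigit() else None

toy_story_year = year_from_title("Toy Story (1995)")
missing_year = year_from_title("No year")
```

Titles without a trailing year give a null value after the cast, so it is worth checking for nulls before using `year` as a feature.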



In [35]:
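genres = (
    # NOTE: this renames the raw MovieLens item_id to item_idx without applying
    # the DataPreparator index mapping, so these ids may not line up with the
    # item_idx values used in `year` and in the log.
    spark.createDataFrame(data.items[["item_id", "genres"]].rename({'item_id': 'item_idx'}, axis=1))
    .select(
        "item_idx",
        # split() takes a regex, so the pipe is escaped; the raw string avoids an invalid "\|" escape.
        sf.split("genres", r"\|").alias("genres")
    )
)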

In [36]:
genres_list = (
    genres.select(sf.explode("genres").alias("genre"))
    .distinct().filter('genre <> "(no genres listed)"')
    .toPandas()["genre"].tolist()
)

In [37]:
genres_list

['Documentary',
 'Adventure',
 'Animation',
 'Comedy',
 'Thriller',
 'Sci-Fi',
 'Musical',
 'Horror',
 'Action',
 'Fantasy',
 'War',
 'Mystery',
 "Children's",
 'Drama',
 'Film-Noir',
 'Crime',
 'Western',
 'Romance']

In [38]:
item_features = genres
for genre in genres_list:
    item_features = item_features.withColumn(
        genre,
        sf.array_contains(sf.col("genres"), genre).astype(IntegerType())
    )
item_features = item_features.drop("genres").cache()
item_features.count()

3883
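The loop above builds one 0/1 column per genre, that is, a multi-hot encoding of the pipe-separated `genres` field. The same idea in plain Python, as a minimal sketch with a hypothetical helper and an illustrative genres string:

```python
def multi_hot(genres_field, vocabulary, sep="|"):
    """Map a separator-delimited genres string to {genre: 0 or 1} over a fixed vocabulary."""
    present = set(genres_field.split(sep))
    return {genre: int(genre in present) for genre in vocabulary}

# A shortened vocabulary for illustration; the notebook uses all 18 genres.
vocabulary = ["Animation", "Children's", "Comedy", "Drama"]
row = multi_hot("Animation|Children's|Comedy", vocabulary)
```

Using a fixed vocabulary keeps the column set identical for every item, which is what the per-genre `withColumn` loop achieves on the Spark side.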

In [39]:
item_features = item_features.join(year, on='item_idx', how='inner')
item_features.count()

3813

In [40]:
item_features.cache()

DataFrame[item_idx: int, Documentary: int, Adventure: int, Animation: int, Comedy: int, Thriller: int, Sci-Fi: int, Musical: int, Horror: int, Action: int, Fantasy: int, War: int, Mystery: int, Children's: int, Drama: int, Film-Noir: int, Crime: int, Western: int, Romance: int, year: int]

In [41]:
item_features.show(3)

+--------+-----------+---------+---------+------+--------+------+-------+------+------+-------+---+-------+----------+-----+---------+-----+-------+-------+----+
|item_idx|Documentary|Adventure|Animation|Comedy|Thriller|Sci-Fi|Musical|Horror|Action|Fantasy|War|Mystery|Children's|Drama|Film-Noir|Crime|Western|Romance|year|
+--------+-----------+---------+---------+------+--------+------+-------+------+------+-------+---+-------+----------+-----+---------+-----+-------+-------+----+
|      29|          0|        1|        0|     0|       0|     1|      0|     0|     0|      0|  0|      0|         0|    0|        0|    0|      0|      0|1995|
|     393|          0|        0|        0|     0|       0|     0|      0|     0|     1|      0|  0|      0|         0|    0|        0|    0|      0|      0|1995|
|     648|          0|        1|        0|     0|       0|     0|      0|     0|     1|      0|  0|      1|         0|    0|        0|    0|      0|      0|1995|
+--------+-----------+------

### 2.4.2 Model training

In [42]:
models_with_features = {'LightFM with item features': [LightFMWrap(random_state=SEED), {"no_components": [8, 512]}]}

In [43]:
%%time
full_pipeline(models_with_features, e, train)

25-Feb-22 23:12:34, replay, INFO: LightFM with item features started
25-Feb-22 23:12:34, replay, INFO: LightFM with item features optimization started
[I 2022-02-25 23:12:34,833] A new study created in memory with name: no-name-a86246ae-1ea9-4636-9278-763f3bf90129
22/02/25 23:12:34 WARN CacheManager: Asked to cache already cached data.
[I 2022-02-25 23:14:18,532] Trial 0 finished with value: 0.2022408950793562 and parameters: {'loss': 'warp', 'no_components': 128}. Best is trial 0 with value: 0.2022408950793562.
22/02/25 23:14:18 WARN CacheManager: Asked to cache already cached data.
22/02/25 23:14:18 WARN CacheManager: Asked to cache already cached data.
[I 2022-02-25 23:15:50,494] Trial 1 finished with value: 0.21235478041254502 and parameters: {'loss': 'warp', 'no_components': 63}. Best is trial 1 with value: 0.21235478041254502.
22/02/25 23:15:50 WARN CacheManager: Asked to cache already cached data.
22/02/25 23:15:50 WARN CacheManager: Asked 

                                        NDCG@10    MRR@10  Coverage@10  \
SLIM                                   0.270859  0.434489     0.063323   
LightFM                                0.267207  0.436674     0.156346   
KNN                                    0.258407  0.412565     0.054077   
Implicit ALS                           0.253444  0.406855     0.131129   
LightFM with item features             0.250395  0.403145     0.096105   
Popular Recommender                    0.243783  0.390426     0.033903   
NeuroMF with optimized parameters      0.239874  0.397850     0.092463   
MultVAE with optimized parameters      0.236728  0.378478     0.034744   
ADMM SLIM                              0.216480  0.373958     0.348837   
NeuroMF with default parameters        0.198796  0.336243     0.261138   
Word2Vec with default parameters       0.139835  0.247189     0.139255   
Word2Vec with optimized parameters     0.139835  0.247189     0.139255   
Wilson Recommender                    



In [44]:
e.results.sort_values('NDCG@10', ascending=False)

Model,Coverage@10,HitRate@1,HitRate@5,HitRate@10,MAP@10,MRR@10,NDCG@10,Surprisal@10,fit_pred_time,params
SLIM,0.063323,0.324846,0.583845,0.693591,0.176329,0.434489,0.270859,0.133323,41.092457,"{'beta': 4.65156643702147, 'lambda_': 0.000203..."
LightFM,0.156346,0.324846,0.581212,0.694469,0.170251,0.436674,0.267207,0.169012,28.989394,"{'loss': 'warp', 'no_components': 9}"
KNN,0.054077,0.302897,0.556629,0.648815,0.168665,0.412565,0.258407,0.138554,36.018561,"{'num_neighbours': 75, 'shrink': 78}"
Implicit ALS,0.131129,0.292362,0.562774,0.681299,0.16214,0.406855,0.253444,0.163824,32.185843,{'rank': 8}
LightFM with item features,0.096105,0.279192,0.567164,0.686567,0.158074,0.403145,0.250395,0.152727,59.903357,"{'loss': 'warp', 'no_components': 16}"
Popular Recommender,0.033903,0.28446,0.53029,0.645303,0.157301,0.390426,0.243783,0.118354,16.93119,
NeuroMF with optimized parameters,0.092463,0.291484,0.542581,0.654083,0.150862,0.39785,0.239874,0.160814,657.766227,"{'embedding_gmf_dim': 128, 'embedding_mlp_dim'..."
MultVAE with optimized parameters,0.034744,0.273047,0.524144,0.643547,0.151765,0.378478,0.236728,0.121098,45.044486,"{'learning_rate': 0.015677916317796903, 'epoch..."
ADMM SLIM,0.348837,0.258121,0.541703,0.647937,0.127043,0.373958,0.21648,0.221984,56.977886,"{'lambda_1': 0.8417364694294401, 'lambda_2': 6..."
NeuroMF with default parameters,0.261138,0.216857,0.492537,0.622476,0.113413,0.336243,0.198796,0.222385,283.866455,


In [45]:
e.results.to_csv('res_25_rel_1.csv')

In [48]:
df = e.results.drop([
    'NeuroMF with optimized parameters', 
    'MultVAE with default parameters', 
    'Word2Vec with optimized parameters'
]).rename(
    index={
        'Popular Recommender': 'PopRec', 
        'Random Recommender (uniform)': 'RandomRec (uniform)', 
        'Random Recommender (popularity-based)': 'RandomRec (popular)',
        'Wilson Recommender': 'Wilson', 'Implicit ALS': 'ALS (Implicit)', 'Explicit ALS': 'ALS (Explicit)',
        'NeuroMF with default parameters': 'NeuroMF', 'MultVAE with optimized parameters': 'MultVAE',
        'Word2Vec with default parameters': 'Word2Vec', 'LightFM with item features': 'LightFM (w/ feats)'
                }).sort_values('NDCG@10', ascending=False)
df

Model,Coverage@10,HitRate@1,HitRate@5,HitRate@10,MAP@10,MRR@10,NDCG@10,Surprisal@10,fit_pred_time,params
SLIM,0.063323,0.324846,0.583845,0.693591,0.176329,0.434489,0.270859,0.133323,41.092457,"{'beta': 4.65156643702147, 'lambda_': 0.000203..."
LightFM,0.156346,0.324846,0.581212,0.694469,0.170251,0.436674,0.267207,0.169012,28.989394,"{'loss': 'warp', 'no_components': 9}"
KNN,0.054077,0.302897,0.556629,0.648815,0.168665,0.412565,0.258407,0.138554,36.018561,"{'num_neighbours': 75, 'shrink': 78}"
ALS (Implicit),0.131129,0.292362,0.562774,0.681299,0.16214,0.406855,0.253444,0.163824,32.185843,{'rank': 8}
LightFM (w/ feats),0.096105,0.279192,0.567164,0.686567,0.158074,0.403145,0.250395,0.152727,59.903357,"{'loss': 'warp', 'no_components': 16}"
PopRec,0.033903,0.28446,0.53029,0.645303,0.157301,0.390426,0.243783,0.118354,16.93119,
MultVAE,0.034744,0.273047,0.524144,0.643547,0.151765,0.378478,0.236728,0.121098,45.044486,"{'learning_rate': 0.015677916317796903, 'epoch..."
ADMM SLIM,0.348837,0.258121,0.541703,0.647937,0.127043,0.373958,0.21648,0.221984,56.977886,"{'lambda_1': 0.8417364694294401, 'lambda_2': 6..."
NeuroMF,0.261138,0.216857,0.492537,0.622476,0.113413,0.336243,0.198796,0.222385,283.866455,
Word2Vec,0.139255,0.147498,0.38367,0.500439,0.074579,0.247189,0.139835,0.237858,161.589033,


In [49]:
df.index.name = 'Model'

In [50]:
df = df.round(3)[['HitRate@10', 'MAP@10', 'MRR@10', 'NDCG@10', 'Coverage@10', 'Surprisal@10', 'fit_pred_time']]
df = df.rename(columns={'HitRate@10': 'HitRate', 'MAP@10': 'MAP', 'MRR@10': 'MRR',
                        'NDCG@10': 'NDCG', 'Coverage@10': 'Coverage', 
                        'Surprisal@10': 'Surprisal'})
df.to_csv('res_1m.csv')

# 3. Results

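The best trade-off between quality and time came from commonly used models: SLIM, LightFM, KNN and implicit ALS reached NDCG@10 between 0.253 and 0.271, each with a `fit_pred_time` under 42. The tuned neural models did not surpass them: NeuroMF with optimized parameters reached NDCG@10 of 0.240 at a `fit_pred_time` of about 658, and MultVAE with optimized parameters reached 0.237.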

In [51]:
df.head()

Model,HitRate,MAP,MRR,NDCG,Coverage,Surprisal,fit_pred_time
SLIM,0.694,0.176,0.434,0.271,0.063,0.133,41.092
LightFM,0.694,0.17,0.437,0.267,0.156,0.169,28.989
KNN,0.649,0.169,0.413,0.258,0.054,0.139,36.019
ALS (Implicit),0.681,0.162,0.407,0.253,0.131,0.164,32.186
LightFM (w/ feats),0.687,0.158,0.403,0.25,0.096,0.153,59.903
