# Yelp star ratings classification and external factors

by <a href="http://keithqu.com">Keith Qu</a>

It may be a good idea to segregate the data by business type (restaurant, hardware store, etc.). It could be easier and less computationally intensive per category. But it would be interesting to find very general external features that can help determine star ratings for all businesses. These are factors that business owners have no direct control over, but knowing their effects can help with forming a plan to counteract negative customer sentiment.

Three stages: first we look at comment text alone, to see how accurately we can predict star rating from that. Then we add in weather effects. Since star ratings are highly subjective, users may be influenced by many things when choosing a rating. Finally, we'll see if the day of the week on which a review is written can help predict ratings. If these factors have an effect on sentiment, they will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.

The goal isn't so much to painstakingly tune a NN for the last bit of accuracy, but rather to see if adding one or two engineered features can yield a significant improvement regardless of model. Or if I can embarrass myself with a complete lack of improvement!

I picked weather and day of week since they are known to have effects on customer activity (how many customers visit), but let's see if they can also help predict ratings.

In [1]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

from keras.models import Sequential
from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,
                          Convolution1D, MaxPooling1D, Bidirectional,
                         GlobalMaxPooling1D, Embedding, BatchNormalization,
                         SpatialDropout1D)
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.optimizers import SGD

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from sklearn.preprocessing import LabelEncoder

from datetime import datetime

%matplotlib inline

Using TensorFlow backend.


In [2]:
PATH = "/d/data/yelpdata/dataset/"
#PATH = "d:\\data\\yelpdata\\dataset\\"
WEAT = f'{PATH}processed_weather/'

In [3]:
#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)
reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [4]:
reviews = reviews[['stars','text']]

In [5]:
# assign the result back; an inplace fillna on a selected column may not update the frame
reviews['text'] = reviews['text'].fillna('empty')

### Stage 1: Predicting star rating based on review text alone

In [6]:
# build the stopword set once; rebuilding it for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_up(t):
    t = t.strip().lower()
    words = t.split()
    
    # first get rid of the stopwords, or a lemmatized stopword might not
    # be recognized as a stopword
    
    imp_words = ' '.join(w for w in words if w not in STOP_WORDS)

    # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to
    # return only the base words (as opposed to stemming which can return
    # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing
    
    final_words = ''
    
    lemma = WordNetLemmatizer()
    for (w,tag) in pos_tag(word_tokenize(imp_words)):
        if tag.startswith('J'):
            final_words += ' '+ lemma.lemmatize(w, pos='a')
        elif tag.startswith('V'):
            final_words += ' '+ lemma.lemmatize(w, pos='v')
        elif tag.startswith('N'):
            final_words += ' '+ lemma.lemmatize(w, pos='n')
        elif tag.startswith('R'):
            final_words += ' '+ lemma.lemmatize(w, pos='r')
        else:
            final_words += ' '+ w
    
    return final_words

# what a great name. do_stuff

def do_stuff (df):
    text = df['text'].copy()
    
    text.replace(to_replace={r'[^\x00-\x7F]':' '},inplace=True,regex=True)
    text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)
    
    # Then lower case, tokenize and lemmatize

    # with over 600,000 entries, this is going to be one hell of a long apply...
    
    text = text.apply(lambda t:clean_up(t))
    return text

In [7]:
# bidirectional LSTM, as described by Zhou et al. (2016) http://www.aclweb.org/anthology/C16-1329
def lstm_model (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=500,weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    
    model.add(Dense(5,activation='softmax'))
    
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)     
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred    

In [8]:
# converging to a very conventional convolutional NN model to convert non-conversational text to star ratings
# uh... with a non-convex loss function
# an LSTM network could do better, but it would also take significantly longer to run
#
# (not actually using the CNN model here)

def cnn_model (X_train, y_train, test, val='no'):
    model=Sequential()
    model.add(Embedding(50000,128,input_length=500))
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Convolution1D(128,5,activation='relu'))
    model.add(MaxPooling1D(35))
    model.add(Flatten())
    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.2))
    
    model.add(Dense(5,activation='softmax'))
    
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)     
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=5)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=0.2)
    pred = model.predict(test)
    return pred

In [9]:
#data = do_stuff(reviews)

In [10]:
#data.to_csv(f'{PATH}review_on_processed_text.csv')

In [11]:
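# note: Series.from_csv was removed in pandas 1.0; on newer versions,
# pd.read_csv(..., index_col=0, header=None).iloc[:, 0] is a rough equivalent
data = pd.Series.from_csv(f'{PATH}review_on_processed_text.csv', index_col=0)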

In [12]:
stars = reviews['stars']

In [13]:
del reviews

In [14]:
enc = LabelEncoder()
enc.fit(stars)
y = enc.transform(stars)
dummy_y = np_utils.to_categorical(y)

In [15]:
data.fillna('empty', inplace=True)

In [16]:
tok = Tokenizer(num_words=50000)
tok.fit_on_texts(data)

sequenced = tok.texts_to_sequences(data)
padded = pad_sequences(sequenced,maxlen=500)

In [17]:
# getting the pretrained weight matrix
# based on https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout
# by which I mean it's pretty much just that...

EMBED_FILE = '/d/data/glove.42B.300d.txt'

def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBED_FILE))

embed_size = 300
max_features = 50000
maxlen = 500

# np.stack needs a real sequence; newer NumPy versions reject a dict view
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()

word_index = tok.word_index
nb_words = min(50000, len(word_index))
emb_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: emb_matrix[i] = embedding_vector

In [18]:
del embedding_vector, embeddings_index

In [15]:
X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2, random_state=202)

In [17]:
#del data,emb_mean,emb_std,embed_size

In [27]:
# normally 1 comes before 2, but... this just starts at 2
pred2 = lstm_model (X_train, y_train, X_test, val='yes')

Train on 405993 samples, validate on 101499 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


Maybe one more epoch would've been helpful, but for my purposes now that's fine.

In [29]:
roc_auc_score(y_test,pred2)

0.8826335410697205
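
Since `roc_auc_score` is given one-hot labels and per-class probabilities here, the number above is a one-vs-rest AUC averaged over the five classes (macro averaging is scikit-learn's default for this input). The sketch below reproduces that idea in plain NumPy on made-up toy data; the helper names `binary_auc` and `macro_ovr_auc` are hypothetical, and the pairwise comparison is only practical for small arrays.

```python
import numpy as np

def binary_auc(y_true, score):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    # Compare every positive with every negative; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def macro_ovr_auc(y_onehot, proba):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    per_class = [binary_auc(y_onehot[:, k], proba[:, k])
                 for k in range(y_onehot.shape[1])]
    return float(np.mean(per_class)), per_class

# Toy example: 6 samples, 3 classes (values are illustrative only)
y_onehot = np.eye(3)[[0, 0, 1, 1, 2, 2]]
proba = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.3, 0.4, 0.3],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.2, 0.6]])

macro, per_class = macro_ovr_auc(y_onehot, proba)
print(macro, per_class)
```

Read this way, a high AUC says that each class's probability ranks its own reviews above the others most of the time; it does not directly measure how many stars off a prediction is.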

In [30]:
preds2 = np.argmax(pred2, axis=1)

In [31]:
ys = np.argmax(y_test, axis=1)

In [32]:
print(classification_report(ys,preds2))

             precision    recall  f1-score   support

          0       0.77      0.74      0.75     15267
          1       0.49      0.39      0.43     13048
          2       0.52      0.45      0.49     22102
          3       0.54      0.64      0.59     39179
          4       0.70      0.69      0.70     37278

avg / total       0.61      0.61      0.61    126874



In [33]:
confusion_matrix (ys, preds2)

array([[11294,  2525,   705,   423,   320],
       [ 2516,  5032,  3991,  1252,   257],
       [  534,  2225, 10040,  8539,   764],
       [  213,   308,  4067, 25051,  9540],
       [  187,    78,   409, 10706, 25898]])

While 0.61 precision/recall isn't great, the 0.8826 AUC score is very very okay, one of the better kinds of okay. Also, the validation scores during training were very good, which is always helpful. A benefit, no doubt, of using all 630,000 reviews of businesses in the Toronto area.

The AUC score suggests that most of the predicted ratings are not too far off, and the confusion matrix shows that the vast majority of incorrect predictions are within 1 star of the actual rating. Additionally, 1 and 5 star ratings (classes 0 and 4 in the report) had the greatest precision and recall, so the model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation). If I had split this into a positive/negative binary problem, the accuracy would obviously be a lot higher (at a glance, over 86% if we consider 3 stars to be negative and over 90% if we consider it to be positive), but it is interesting to try to pick up on the subtle differences between, say, a 4 and a 5 star rating.
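
The at-a-glance figures can be checked directly from the Stage 1 confusion matrix. The following is a small sketch that copies the matrix from the output above; the helper `binary_accuracy` and its `first_positive` parameter are names made up for this illustration.

```python
import numpy as np

# Stage 1 confusion matrix, copied from the output above
# (rows = true class 0..4, columns = predicted class 0..4, i.e. 1 to 5 stars)
cm = np.array([[11294,  2525,   705,   423,   320],
               [ 2516,  5032,  3991,  1252,   257],
               [  534,  2225, 10040,  8539,   764],
               [  213,   308,  4067, 25051,  9540],
               [  187,    78,   409, 10706, 25898]])

total = cm.sum()

# Share of all predictions that land within one star of the true rating
i, j = np.indices(cm.shape)
within_one = cm[np.abs(i - j) <= 1].sum() / total

def binary_accuracy(cm, first_positive):
    """Accuracy after collapsing classes below `first_positive` into
    'negative' and the remaining classes into 'positive'."""
    neg = slice(0, first_positive)
    pos = slice(first_positive, None)
    correct = cm[neg, neg].sum() + cm[pos, pos].sum()
    return correct / cm.sum()

acc_3_negative = binary_accuracy(cm, first_positive=3)  # 1-3 stars negative
acc_3_positive = binary_accuracy(cm, first_positive=2)  # 3-5 stars positive

print(round(within_one, 3), round(acc_3_negative, 3), round(acc_3_positive, 3))
```

With this matrix the three values come out at roughly 0.957, 0.867 and 0.917. The binary figures only regroup the five-class predictions; a model trained on the binary problem itself might behave differently.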

Let's see if adding in weather and day of week can increase accuracy.

### Stage 2: weather effects

Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a star rating for a business. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different (the old problem of not knowing probabilities conditional on histories that haven't happened).

What we can do is see if the review text matches with the score, and if knowing the weather conditions can improve the accuracy of our star predictions.

In [24]:
reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [25]:
reviews_w = reviews_w[['stars','date','text']]

In [26]:
weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')



In [27]:
weather['Year'] = weather['Year'].astype(int)
weather['Month'] = weather['Month'].astype(int)
weather['Day'] = weather['Day'].astype(int)
weather['Temp (°C)'] = weather['Temp (°C)'].astype(float)

In [28]:
reviews_w['date'] = pd.to_datetime(reviews_w['date'])

In [29]:
reviews_w.head()

|   | stars | date | text |
|---|---|---|---|
| 0 | 4 | 2012-05-11 | Who would have guess that you would be able to... |
| 1 | 4 | 2015-10-27 | Always drove past this coffee house and wonder... |
| 2 | 3 | 2013-02-09 | Not bad!! Love that there is a gluten-free, ve... |
| 3 | 5 | 2016-04-06 | Love this place! Peggy is great with dogs and... |
| 4 | 4 | 2013-05-01 | This is currently my parents new favourite res... |


Let's get the temperature at noon (12:00), in the afternoon (16:00) and at night (20:00). Other possible features would be the number of hours described as raining or snowing, or more hourly temperature snapshots (like for 0:00 and 4:00). But 3 temperatures is already more than enough just to see if it'll work at all.

Note that this is the weather on the day the user wrote the review, not on the day they visited the business.

A few missing values, but interpolation should provide good estimates.

In [30]:
weather['Temp (°C)']=weather['Temp (°C)'].interpolate()

In [31]:
weather[weather['Temp (°C)'].isnull()]

Unnamed: 0,Date/Time,Year,Month,Day,Time,Data Quality,Temp (°C),Temp Flag,Dew Point Temp (°C),Dew Point Temp Flag,...,Wind Spd Flag,Visibility (km),Visibility Flag,Stn Press (kPa),Stn Press Flag,Hmdx,Hmdx Flag,Wind Chill,Wind Chill Flag,Weather
0.0,2006-01-01 00:00,2006,1,1,00:00,,,,,,...,,,,,,,,,,
1.0,2006-01-01 01:00,2006,1,1,01:00,,,,,,...,,,,,,,,,,
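
The two rows that are still null are the very first hours in the table. That is consistent with how pandas interpolates by default: it fills forward from the first valid value, so leading gaps are left alone. A toy sketch (the series values are made up):

```python
import numpy as np
import pandas as pd

# Toy temperature series with two leading gaps and one interior gap
temps = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])

# Default linear interpolation: the interior gap is filled,
# but the leading NaNs have no earlier value to start from
forward_only = temps.interpolate()

# Allowing both directions also fills the leading gaps
both_ways = temps.interpolate(limit_direction='both')

print(forward_only.tolist())
print(both_ways.tolist())
```

Since none of the reviews are likely to date from the first two hours of 2006, leaving those two rows empty should be harmless here.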


In [32]:
weather [(weather['Year'] == 2012) & (weather['Month'] == 5) & (weather['Day'] == 11) & (weather['Time'] == '09:00')]['Temp (°C)'].values[0]

15.8

In [33]:
def get_noon(d):
    year = d.year
    month = d.month
    day = d.day
    noon = "12:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == noon)]['Temp (°C)'].values[0])

In [34]:
def get_afternoon(d):
    year = d.year
    month = d.month
    day = d.day
    afternoon = "16:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == afternoon)]['Temp (°C)'].values[0])

In [35]:
def get_night(d):
    year = d.year
    month = d.month
    day = d.day
    night = "20:00"
    return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == night)]['Temp (°C)'].values[0])

Each of these lookups is slow: every call filters the entire hourly weather table with four boolean masks, and apply makes one such call per review. Joining the reviews to a per-date weather table would be much faster. But it's already done...
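
For comparison, here is a minimal sketch of a join-based lookup. It is not what was run for this notebook; the toy `weather` and `reviews_w` frames below only mimic the column names used above, and the `slots` mapping is a name introduced for the example.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values are made up)
weather = pd.DataFrame({
    'Year':  [2012] * 6,
    'Month': [5] * 6,
    'Day':   [11, 11, 11, 12, 12, 12],
    'Time':  ['12:00', '16:00', '20:00'] * 2,
    'Temp (°C)': [15.0, 18.5, 12.0, 16.0, 19.0, 13.5],
})
reviews_w = pd.DataFrame({
    'stars': [4, 5, 3],
    'date': pd.to_datetime(['2012-05-11', '2012-05-12', '2012-05-11']),
})

# Hours of interest and the feature names they map to
slots = {'12:00': 'noon', '16:00': 'afternoon', '20:00': 'night'}

# Keep only the three hours and build a proper date column
hourly = weather[weather['Time'].isin(list(slots))].copy()
hourly['date'] = pd.to_datetime(
    hourly[['Year', 'Month', 'Day']].rename(columns=str.lower))

# One row per date, one column per time slot
wide = (hourly.pivot(index='date', columns='Time', values='Temp (°C)')
              .rename(columns=slots)
              .reset_index())

# A single left join attaches all three temperatures to every review
augmented = reviews_w.merge(wide, on='date', how='left')
print(augmented)
```

The weather table is reshaped once, and the join then does the per-review matching in one vectorized step instead of one full-table scan per review.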

In [36]:
#reviews_w['noon'] = reviews_w['date'].apply(lambda d: get_noon(d))

In [37]:
#reviews_w['afternoon'] = reviews_w['date'].apply(lambda d: get_afternoon(d))

In [38]:
#reviews_w['night'] = reviews_w['date'].apply(lambda d: get_night(d))

In [39]:
#reviews_w.to_csv(f'{PATH}augmented_comments.csv')

In [18]:
reviews_w = pd.read_csv(f'{PATH}augmented_comments.csv', index_col=0)

In [19]:
new_features = reviews_w[['noon','afternoon','night']]

In [20]:
new_features_array = np.array(new_features)

In [21]:
def lstm_model2 (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=503, weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    
    model.add(Dense(5,activation='softmax'))
    
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)     
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred    

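The simplest possible way to add in the new features: just append them directly to the padded token sequences. One caveat is that the combined array still goes through the Embedding layer, so the three temperatures are looked up as if they were word indices rather than used as numeric values.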

In [22]:
XX = np.concatenate((padded,new_features_array),axis=1)
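
To see what this concatenation hands to the model, here is a toy NumPy sketch (all values made up). It assumes, as far as I know is the case in Keras, that the Embedding layer casts its input to integer ids. The second half shows one possible alternative, which is an assumption on my part and not what this notebook does: keep the token ids and the weather features as two separate inputs.

```python
import numpy as np

# Toy stand-ins: 2 reviews padded to 4 token ids, plus 3 temperatures each
padded = np.array([[0, 0, 12, 7],
                   [0, 3, 25, 9]])
temps = np.array([[15.8, 18.2, 11.0],
                  [-4.5, -2.1, -7.3]])

# Concatenating ints with floats upcasts the whole matrix to float64
XX = np.concatenate((padded, temps), axis=1)

# An embedding lookup needs integer ids, so temperatures get truncated:
# 15.8 becomes id 15 (read as the 15th most frequent word),
# and -4.5 becomes -4, which is not a valid token id at all
as_ids = XX.astype('int32')

# Alternative sketch: keep the inputs apart
token_ids = padded  # would feed the Embedding branch
# standardized temperatures, intended for a separate dense branch
weather_feats = (temps - temps.mean(axis=0)) / temps.std(axis=0)

print(XX.dtype, as_ids[:, 4:], weather_feats.shape)
```

With two inputs, the text branch and the numeric branch could be merged just before the final softmax layer, so the temperatures reach the classifier as numbers.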

In [23]:
del padded

In [24]:
X_train, X_test, y_train, y_test = train_test_split(XX, dummy_y, test_size=0.2, random_state=202)

In [25]:
pred3 = lstm_model2 (X_train, y_train, X_test, val='yes')

Train on 405993 samples, validate on 101499 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


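In [39]:
# this cell was run out of order; the evaluation cells below still need y_test,
# so only free the splits after they have run
# del X_train, X_test, y_train, y_test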

In [26]:
roc_auc_score(y_test,pred3)

0.8816730152568608

In [27]:
preds3 = np.argmax(pred3, axis=1)

In [28]:
ys = np.argmax(y_test, axis=1)

In [29]:
print(classification_report(ys,preds3))

             precision    recall  f1-score   support

          0       0.69      0.82      0.75     15267
          1       0.52      0.26      0.34     13048
          2       0.47      0.61      0.53     22102
          3       0.56      0.60      0.58     39179
          4       0.74      0.62      0.68     37278

avg / total       0.61      0.60      0.60    126874



In [30]:
confusion_matrix (ys, preds3)

array([[12551,  1335,  1027,   159,   195],
       [ 3768,  3333,  5228,   558,   161],
       [  963,  1441, 13517,  5613,   568],
       [  438,   202,  7674, 23675,  7190],
       [  415,    69,  1172, 12324, 23298]])

It doesn't seem like weather helps that much, at least not in this implementation. It's outright awful at recalling 2 star ratings. Maybe most of the weather effect has already gone into the comment itself, maybe the effect is insignificant, or maybe a change in implementing weather effects would help. Maybe I should look at how the weather differs from average rather than just a simple temperature.

For a few ratings, there seems to be a tradeoff between precision and recall across the two models, but I can't be sure how consistent that is.
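
The idea of looking at how the weather differs from average could be sketched as a temperature anomaly: subtract a typical value for that time of year from the observed temperature. The example below is only an illustration on made-up data; the column name `noon_anomaly` and the choice of a per-month baseline are assumptions.

```python
import pandas as pd

# Toy stand-in for the augmented reviews frame (values are made up)
reviews_w = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-10', '2015-01-12',
                            '2014-07-10', '2015-07-12']),
    'noon': [-8.0, -2.0, 24.0, 30.0],
})

# Baseline: mean noon temperature for each calendar month
month = reviews_w['date'].dt.month
baseline = reviews_w.groupby(month)['noon'].transform('mean')

# Anomaly: how much warmer (+) or colder (-) than usual for that month
reviews_w['noon_anomaly'] = reviews_w['noon'] - baseline
print(reviews_w)
```

In practice the baseline would be better computed from the full hourly weather table than from the review rows, since busy review days would otherwise weigh more heavily in the average.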

### Stage 3: Day of Week

I suspect this could be useful! But then I also suspected weather would be as well!

In [19]:
# at this point you'd think i would be smart enough to write a function that accepts
# a customizable input_length but obviously i'm not

def lstm_model3 (X_train, y_train,test, val='no'):
    model = Sequential()
    model.add(Embedding(50000,300,input_length=507, weights=[emb_matrix]))
    model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))
    model.add(MaxPooling1D(5))
    model.add(Dropout(0.2))
    
    model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))
    
    model.add(Dense(5,activation='softmax'))
    
    sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9)     
    model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])
    
    if val == 'no':
        model.fit(X_train,y_train,batch_size=128,epochs=3)
    else:
        model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)
    pred = model.predict(test)
    return pred    

In [20]:
reviews_d = pd.read_csv(f'{PATH}review_on.csv', index_col=0)

In [21]:
reviews_d = reviews_d['date']

In [22]:
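# vectorized weekday extraction (Monday=0 ... Sunday=6); avoids a Python loop
# over 600,000+ rows and does not depend on the index being 0..n-1
reviews_d = pd.to_datetime(reviews_d).dt.weekday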

In [23]:
enc = LabelEncoder()
enc.fit(reviews_d)
dow = enc.transform(reviews_d)
dummy_dow = np_utils.to_categorical(dow)
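
As a rough, self-contained illustration of the encoding, the sketch below turns a few dates (taken from the head of the review table shown earlier) into weekday numbers and one-hot rows using only pandas and NumPy. Indexing an identity matrix is my own shortcut here, not the notebook's method; it always yields seven columns, even if some weekday never appears in the data.

```python
import numpy as np
import pandas as pd

# Three review dates from the sample rows shown earlier
dates = pd.Series(['2012-05-11', '2015-10-27', '2013-02-09'])

# Monday=0 ... Sunday=6
dow = pd.to_datetime(dates).dt.weekday.to_numpy()

# One-hot rows: pick row k of a 7x7 identity matrix for weekday k
dummy_dow = np.eye(7, dtype='float32')[dow]

print(dow)
print(dummy_dow)
```

Seven one-hot columns appended to 500 padded tokens is where the input_length of 507 in lstm_model3 comes from.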

In [24]:
# note: padded was deleted in Stage 2, so the tokenizing/padding cell has to be re-run first
X3 = np.concatenate((padded,dummy_dow),axis=1)

In [25]:
# Remember when 16 gb of RAM was more than enough for pretty much anything?
del padded, dummy_dow, data

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X3, dummy_y, test_size=0.2, random_state=202)

In [28]:
pred4 = lstm_model3 (X_train, y_train, X_test, val='yes')

Train on 405993 samples, validate on 101499 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [29]:
roc_auc_score(y_test,pred4)

0.8835195453706615

In [30]:
preds4 = np.argmax(pred4, axis=1)

In [31]:
ys = np.argmax(y_test, axis=1)

In [32]:
print(classification_report(ys,preds4))

             precision    recall  f1-score   support

          0       0.77      0.75      0.76     15267
          1       0.51      0.37      0.43     13048
          2       0.53      0.47      0.50     22102
          3       0.54      0.67      0.60     39179
          4       0.72      0.65      0.68     37278

avg / total       0.61      0.61      0.61    126874



In [33]:
confusion_matrix (ys, preds4)

array([[11421,  2288,   701,   391,   466],
       [ 2622,  4869,  4012,  1218,   327],
       [  520,  2133, 10494,  8167,   788],
       [  186,   271,  4311, 26426,  7985],
       [  170,    52,   405, 12543, 24108]])

So again, not much difference from just looking at the comment text. A big takeaway from all of this is that 2 star ratings seem to be the most ambiguous, followed by 3 star ratings.

Also, the GloVe embeddings don't really seem to do much here except take up RAM. Having the pretrained weights seems to have a very slightly positive effect on accuracy, as long as the embedding layer is left trainable. 630,000+ reviews is a lot of text to train on, probably enough to get a very good picture of the semantic relationships in Yelp reviews.

It would probably make more sense to look at business types separately, especially restaurants. The kinds of things people talk about in a restaurant review would seem to be different from what they would write for a hardware store. Similarly, mood effects from weather or day of week could differ for different types of businesses. Or the effects may simply not be strong enough to matter.

While weather and day of week don't seem to have a huge behavioral effect when it comes to rating businesses (or the effects have already been expressed in the review text), it could be worth exploring other factors that might affect consumer perceptions. For example, reviewers might give more favorable ratings on holidays, and less favorable ratings if their favorite political candidate loses an election, if economic conditions worsen, if there has been a swine flu or mad cow outbreak, or if recent news events have been very negative.

If businesses can better understand the things affecting their customers' moods, they will be better equipped to counteract the kinds of negative sentiment that might hurt their ratings.