{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Yelp star ratings classification and external factors\n", "\n", "by Keith Qu\n", "\n", "It may be a good idea to segregate the data by business type (restaurant, hardware store, etc.). It could be easier and less computationally intensive per category. But it would be interesting to find very general external features that can help determine star ratings for all businesses. These are factors that business owners have no direct control over, but knowing their effects can help with forming a plan to counteract negative customer sentiment.\n", "\n", "Three stages: first we look at comment text alone, to see how accurately we can predict star rating based on that. Then we add in weather effects. Since star ratings are highly subjective, users may be influenced by many things when it comes to the rating. Finally, we'll see if the day of the week that a review is written on can help predict ratings. If these factors have an effect on sentiment, they will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.\n", "\n", "The goal isn't so much to painstakingly tune a NN for the last bit of accuracy, but rather to see if adding one or two engineered features can yield a significant improvement regardless of model. Or if I can embarrass myself with a complete lack of improvement!\n", "\n", "I picked weather and day of week since they are known to have effects on customer activity (how many customers visit), but let's see if they can also help predict ratings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/keith/anaconda/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n", "Using TensorFlow backend.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from nltk.corpus import stopwords\n", "from nltk import WordNetLemmatizer\n", "from nltk import pos_tag, word_tokenize\n", "\n", "from keras.models import Sequential\n", "from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,\n", " Convolution1D, MaxPooling1D, Bidirectional,\n", " GlobalMaxPooling1D, Embedding, BatchNormalization,\n", " SpatialDropout1D)\n", "from keras.preprocessing import sequence\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.utils import np_utils\n", "from keras.optimizers import SGD\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "from datetime import datetime\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "PATH = \"/d/data/yelpdata/dataset/\"\n", "#PATH = \"d:\\\\data\\\\yelpdata\\\\dataset\\\\\"\n", "WEAT = f'{PATH}processed_weather/'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)\n", "reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews = reviews[['stars','text']]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews['text'].fillna('empty', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Stage 1: Predicting star rating based on review text alone" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def clean_up(t):\n", " t = t.strip().lower()\n", " words = t.split()\n", " \n", " # first get rid of the stopwords, or a lemmatized stopword might not\n", " # be recognized as a stopword\n", " \n", " imp_words = ' '.join(w for w in words if w not in set(stopwords.words('english')))\n", "\n", " # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to\n", " # return only the base words (as opposed to stemming which can return\n", " # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing\n", " \n", " final_words = ''\n", " \n", " lemma = WordNetLemmatizer()\n", " for (w,tag) in pos_tag(word_tokenize(imp_words)):\n", " if tag.startswith('J'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='a')\n", " elif tag.startswith('V'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='v')\n", " elif tag.startswith('N'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='n')\n", " elif tag.startswith('R'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='r')\n", " else:\n", " final_words += ' '+ w\n", " \n", " return final_words\n", "\n", "# what a great name. do_stuff\n", "\n", "def do_stuff (df):\n", " text = df['text'].copy()\n", " \n", " text.replace(to_replace={r'[^\\x00-\\x7F]':' '},inplace=True,regex=True)\n", " text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)\n", " \n", " # Then lower case, tokenize and lemmatize\n", "\n", " # with over 600,000 entries, this is going to be one hell of a long apply...\n", " \n", " text = text.apply(lambda t:clean_up(t))\n", " return text" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# bidirectional LSTM, as described by Zhou et. al. 
(2016) http://www.aclweb.org/anthology/C16-1329\n", "def lstm_model (X_train, y_train,test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(50000,300,input_length=500,weights=[emb_matrix]))\n", " model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=3)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# converging to a very conventional convolutional NN model to convert non-conversational text to star ratings\n", "# uh... 
with a non-convex loss function\n", "# an LSTM network could do better, but it would also take significantly longer to run\n", "#\n", "# (not actually using the CNN model here)\n", "\n", "def cnn_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(50000,128,input_length=500))\n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(35))\n", " model.add(Flatten())\n", " model.add(Dense(128,activation='relu'))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#data = do_stuff(reviews)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#data.to_csv(f'{PATH}review_on_processed_text.csv')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.Series.from_csv(f'{PATH}review_on_processed_text.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "stars = reviews['stars']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del reviews" ] }, { "cell_type": "code", 
"execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "enc = LabelEncoder()\n", "enc.fit(stars)\n", "y = enc.transform(stars)\n", "dummy_y = np_utils.to_categorical(y)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data.fillna('empty', inplace=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "tok = Tokenizer(num_words=50000)\n", "tok.fit_on_texts(data)\n", "\n", "sequenced = tok.texts_to_sequences(data)\n", "padded = pad_sequences(sequenced,maxlen=500)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# getting the pretrained weight matrix\n", "# based on https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout\n", "# by which I mean it's pretty much just that...\n", "\n", "EMBED_FILE = '/d/data/glove.42B.300d.txt'\n", "\n", "def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\n", "embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBED_FILE))\n", "\n", "embed_size = 300\n", "max_features = 50000\n", "maxlen = 500\n", "\n", "all_embs = np.stack(embeddings_index.values())\n", "emb_mean,emb_std = all_embs.mean(), all_embs.std()\n", "\n", "word_index = tok.word_index\n", "nb_words = min(50000, len(word_index))\n", "emb_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n", "for word, i in word_index.items():\n", " if i >= max_features: continue\n", " embedding_vector = embeddings_index.get(word)\n", " if embedding_vector is not None: emb_matrix[i] = embedding_vector" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del embedding_vector, embeddings_index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2, 
random_state=202)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "#del data,emb_mean,emb_std,embed_size" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 405993 samples, validate on 101499 samples\n", "Epoch 1/3\n", "405993/405993 [==============================] - 1226s 3ms/step - loss: 1.0164 - acc: 0.5483 - val_loss: 0.9206 - val_acc: 0.5885\n", "Epoch 2/3\n", "405993/405993 [==============================] - 1215s 3ms/step - loss: 0.9203 - acc: 0.5910 - val_loss: 0.8964 - val_acc: 0.5967\n", "Epoch 3/3\n", "405993/405993 [==============================] - 1224s 3ms/step - loss: 0.8912 - acc: 0.6035 - val_loss: 0.8804 - val_acc: 0.6102\n" ] } ], "source": [ "# normally 1 comes before 2, but... this just starts at 2\n", "pred2 = lstm_model (X_train, y_train, X_test, val='yes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe one more epoch would've been helpful, but for my purposes now that's fine." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.8826335410697205" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y_test,pred2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "preds2 = np.argmax(pred2, axis=1)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ys = np.argmax(y_test, axis=1)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.77 0.74 0.75 15267\n", " 1 0.49 0.39 0.43 13048\n", " 2 0.52 0.45 0.49 22102\n", " 3 0.54 0.64 0.59 39179\n", " 4 0.70 0.69 0.70 37278\n", "\n", "avg / total 0.61 0.61 0.61 126874\n", "\n" ] } ], "source": [ "print(classification_report(ys,preds2))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[11294, 2525, 705, 423, 320],\n", " [ 2516, 5032, 3991, 1252, 257],\n", " [ 534, 2225, 10040, 8539, 764],\n", " [ 213, 308, 4067, 25051, 9540],\n", " [ 187, 78, 409, 10706, 25898]])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix (ys, preds2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While 0.61 precision/recall isn't great, the 0.8826 AUC score is very, very okay, one of the better kinds of okay. Also, the validation scores kept pace with the training scores (0.610 vs. 0.604 accuracy after the third epoch), which is always helpful. A benefit, no doubt, of using all 630,000+ reviews of businesses in the Toronto area.\n", "\n", "The AUC score suggests that most of the predicted ratings are not too far off, and the confusion matrix shows that the vast majority of incorrect scores are within 1 star of the actual rating. 
Additionally, 1 and 5 star ratings had the greatest precision and recall, so our model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation). If I had split this into a positive/negative binary problem, obviously the accuracy would be a lot higher (at a glance, over 86% if we consider 3 to be negative and over 90% if we consider it to be positive), but it is interesting to try to pick up on the subtle differences between, say, a 4 and a 5 star rating.\n", "\n", "Let's see if adding in weather and day of week can increase accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stage 2: weather effects\n", "\n", "Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a star rating for a business. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different (the old problem of not knowing probabilities conditional on histories that haven't happened).\n", "\n", "What we can do is see if the review text matches with the score, and if knowing the weather conditions can improve the accuracy of our star predictions." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w = reviews_w[['stars','date','text']]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\programdata\\anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:2728: DtypeWarning: Columns (12,16,20) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "weather['Year'] = weather['Year'].astype(int)\n", "weather['Month'] = weather['Month'].astype(int)\n", "weather['Day'] = weather['Day'].astype(int)\n", "weather['Temp (°C)'] = weather['Temp (°C)'].astype(float)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w['date'] = pd.to_datetime(reviews_w['date'])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " stars date text\n", "0 4 2012-05-11 Who would have guess that you would be able to...\n", "1 4 2015-10-27 Always drove past this coffee house and wonder...\n", "2 3 2013-02-09 Not bad!! Love that there is a gluten-free, ve...\n", "3 5 2016-04-06 Love this place! Peggy is great with dogs and...\n", "4 4 2013-05-01 This is currently my parents new favourite res..." ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reviews_w.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the temperature noon (12:00), afternoon (16:00) and night (20:00). Other possible features would be the number of hours described as raining or snowing, or adding in more hourly temperature snippets (like for 0:00 and 4:00). But 3 temperatures is already more than enough just to see if it'll work at all.\n", "\n", "It is noteable that this is the weather for the day that the user wrote the review rather than when they engaged the business." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few missing values, but interpolation should provide good estimates." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "weather['Temp (°C)']=weather['Temp (°C)'].interpolate()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date/Time Year Month Day Time Data Quality Temp (°C) \\\n", "0.0 2006-01-01 00:00 2006 1 1 00:00 NaN NaN \n", "1.0 2006-01-01 01:00 2006 1 1 01:00 NaN NaN \n", "\n", " Temp Flag Dew Point Temp (°C) Dew Point Temp Flag ... Wind Spd Flag \\\n", "0.0 NaN NaN NaN ... NaN \n", "1.0 NaN NaN NaN ... NaN \n", "\n", " Visibility (km) Visibility Flag Stn Press (kPa) Stn Press Flag Hmdx \\\n", "0.0 NaN NaN NaN NaN NaN \n", "1.0 NaN NaN NaN NaN NaN \n", "\n", " Hmdx Flag Wind Chill Wind Chill Flag Weather \n", "0.0 NaN NaN NaN NaN \n", "1.0 NaN NaN NaN NaN \n", "\n", "[2 rows x 25 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather[weather['Temp (°C)'].isnull()]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "15.8" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather [(weather['Year'] == 2012) & (weather['Month'] == 5) & (weather['Day'] == 11) & (weather['Time'] == '09:00')]['Temp (°C)'].values[0]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_noon(d):\n", " year = d.year\n", " month = d.month\n", " day = d.day\n", " noon = \"12:00\"\n", " return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == noon)]['Temp (°C)'].values[0])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_afternoon(d):\n", " year = d.year\n", " month = d.month\n", " day = d.day\n", " afternoon = \"16:00\"\n", " return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == afternoon)]['Temp (°C)'].values[0])" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_night(d):\n", " year = 
d.year\n", " month = d.month\n", " day = d.day\n", " night = \"20:00\"\n", " return (weather [(weather['Year'] == year) & (weather['Month'] == month) & (weather['Day'] == day) & (weather['Time'] == night)]['Temp (°C)'].values[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply is probably slower than manual iteration, since there is the overhead of calling the function, which then just performs iteration. But it's already done..." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#reviews_w['noon'] = reviews_w['date'].apply(lambda d: get_noon(d))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#reviews_w['afternoon'] = reviews_w['date'].apply(lambda d: get_afternoon(d))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#reviews_w['night'] = reviews_w['date'].apply(lambda d: get_night(d))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#reviews_w.to_csv(f'{PATH}augmented_comments.csv')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w = pd.read_csv(f'{PATH}augmented_comments.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "new_features = reviews_w[['noon','afternoon','night']]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "new_features_array = np.array(new_features)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def lstm_model2 (X_train, y_train,test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(50000,300,input_length=503, weights=[emb_matrix]))\n", " 
model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=3)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest possible way to add in the new features: just append them directly to the existing vectorized features. Note that the three extra columns then pass through the Embedding layer like any other token index, so the temperatures are not read as numeric values. This is a crude encoding." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "XX = np.concatenate((padded,new_features_array),axis=1)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "del padded" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(XX, dummy_y, test_size=0.2, random_state=202)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 405993 samples, validate on 101499 samples\n", "Epoch 1/3\n", "405993/405993 [==============================] - 1132s 3ms/step - loss: 1.0049 - acc: 0.5525 - val_loss: 0.9282 - val_acc: 0.5897\n", "Epoch 2/3\n", "405993/405993 [==============================] - 1095s 3ms/step - loss: 0.9188 - acc: 0.5904 - val_loss: 0.9006 - val_acc: 0.5972\n", "Epoch 3/3\n", "405993/405993 [==============================] - 1091s 3ms/step - loss: 0.8900 - acc: 0.6034 - val_loss: 0.8928 - val_acc: 0.6014\n" ] } ], "source": [ "pred3 = lstm_model2 
(X_train, y_train, X_test, val='yes')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del X_train, X_test, y_train, y_test" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8816730152568608" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y_test,pred3)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "preds3 = np.argmax(pred3, axis=1)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ys = np.argmax(y_test, axis=1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.69 0.82 0.75 15267\n", " 1 0.52 0.26 0.34 13048\n", " 2 0.47 0.61 0.53 22102\n", " 3 0.56 0.60 0.58 39179\n", " 4 0.74 0.62 0.68 37278\n", "\n", "avg / total 0.61 0.60 0.60 126874\n", "\n" ] } ], "source": [ "print(classification_report(ys,preds3))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[12551, 1335, 1027, 159, 195],\n", " [ 3768, 3333, 5228, 558, 161],\n", " [ 963, 1441, 13517, 5613, 568],\n", " [ 438, 202, 7674, 23675, 7190],\n", " [ 415, 69, 1172, 12324, 23298]])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix (ys, preds3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It doesn't seem like weather helps that much, at least not in this implementation. It's outright awful at recalling 2 star ratings. Maybe most of the weather effect has already gone into the comment itself, maybe the effect is insignificant, or maybe a change in implementing weather effects would help. 
Maybe I should look at how the weather differs from average rather than just a simple temperature.\n", "\n", "For a few ratings, there seems to be a tradeoff between precision and recall among the two models, but I can't be sure of how consistent that is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stage 3: Day of Week\n", "\n", "I suspect this could be useful! But then I also suspected weather would be as well!" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# at this point you'd think i would be smart enough to write a function that accepts\n", "# a customizable input_length but obviously i'm not\n", "\n", "def lstm_model3 (X_train, y_train,test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(50000,300,input_length=507, weights=[emb_matrix]))\n", " model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=3)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_d = pd.read_csv(f'{PATH}review_on.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_d = reviews_d['date']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for i,d in 
enumerate(reviews_d):\n", " reviews_d[i] = datetime.strptime(d, '%Y-%m-%d').weekday()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "enc = LabelEncoder()\n", "enc.fit(reviews_d)\n", "dow = enc.transform(reviews_d)\n", "dummy_dow = np_utils.to_categorical(dow)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "X3 = np.concatenate((padded,dummy_dow),axis=1)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Remember when 16 gb of RAM was more than enough for pretty much anything?\n", "del padded, dummy_dow, data" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X3, dummy_y, test_size=0.2, random_state=202)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 405993 samples, validate on 101499 samples\n", "Epoch 1/3\n", "405993/405993 [==============================] - 1114s 3ms/step - loss: 1.0068 - acc: 0.5516 - val_loss: 0.9247 - val_acc: 0.5856\n", "Epoch 2/3\n", "405993/405993 [==============================] - 1107s 3ms/step - loss: 0.9121 - acc: 0.5944 - val_loss: 0.8907 - val_acc: 0.6033\n", "Epoch 3/3\n", "405993/405993 [==============================] - 1147s 3ms/step - loss: 0.8840 - acc: 0.6064 - val_loss: 0.8784 - val_acc: 0.6111\n" ] } ], "source": [ "pred4 = lstm_model3 (X_train, y_train, X_test, val='yes')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8835195453706615" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y_test,pred4)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ 
"preds4 = np.argmax(pred4, axis=1)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ys = np.argmax(y_test, axis=1)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.77 0.75 0.76 15267\n", " 1 0.51 0.37 0.43 13048\n", " 2 0.53 0.47 0.50 22102\n", " 3 0.54 0.67 0.60 39179\n", " 4 0.72 0.65 0.68 37278\n", "\n", "avg / total 0.61 0.61 0.61 126874\n", "\n" ] } ], "source": [ "print(classification_report(ys,preds4))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[11421, 2288, 701, 391, 466],\n", " [ 2622, 4869, 4012, 1218, 327],\n", " [ 520, 2133, 10494, 8167, 788],\n", " [ 186, 271, 4311, 26426, 7985],\n", " [ 170, 52, 405, 12543, 24108]])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix (ys, preds4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So again, not much difference from just looking at the comment text. A big takeaway from all of this is that 2 star ratings seem to be the most ambiguous, followed by 3 star ratings.\n", "\n", "Also, the GLoVe embeddings don't really seem to do much here except take up RAM. Having the pretrained weights seems to have a very slightly positive effect on accuracy - as long as additional training is kept on. 630000+ reviews is a lot of text to train on, probably enough to get a very good picture of the semantic relationships in Yelp reviews.\n", "\n", "It would probably make more sense to look at business types separately, especially restaurants. The kinds of things people talk about in a restaurant review would seem to be different from what they would write for a hardware store. Similarly, mood effects from weather or day of week could differ for different types of businesses. 
Or the effects may simply not be strong enough to detect.\n", "\n", "While weather and day of week don't seem to have a huge behavioral effect when it comes to rating businesses (or the effects have already been expressed in the review text), it may still be worth exploring other factors that might affect consumer perceptions. For example, customers might give more favorable ratings on holidays, and less favorable ratings if their favorite political candidate loses an election, if economic conditions worsen, if there has been a swine flu or mad cow outbreak, or if recent news events have been very negative.\n", "\n", "If businesses can better understand the things affecting their customers' moods, they will be better equipped to counteract the kinds of negative sentiment that drag down their ratings." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }