{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Yelp star ratings classification and external factors\n", "\n", "by Keith Qu\n", "\n", "It may be a good idea to segregate the data by business type (restaurant, hardware store, etc.). It could be easier and less computationally intensive per category. But it would be interesting to find very general external features that can help determine star ratings for all businesses. These are factors that business owners have no direct control over, but knowing their effects can help with forming a plan to counteract negative customer sentiment.\n", "\n", "Three stages: first we look at comment text alone, to see how accurately we can predict the star rating from it. Then we add in weather effects. Since star ratings are highly subjective, users may be influenced by many things when it comes to the rating. Finally, we'll see if the day of the week that a review is written on can help predict ratings. If these factors have an effect on sentiment, they will undoubtedly affect the review text as well, but there may also be subtle additional effects on the star rating.\n", "\n", "The goal isn't so much to painstakingly tune a NN for the last bit of accuracy, but rather to see if adding one or two engineered features can yield a significant improvement regardless of model. Or if I can embarrass myself with a complete lack of improvement!\n", "\n", "I picked weather and day of week since they are known to have effects on customer activity (how many customers visit), but let's see if they can also help predict ratings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/keith/anaconda/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n", "Using TensorFlow backend.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from nltk.corpus import stopwords\n", "from nltk import WordNetLemmatizer\n", "from nltk import pos_tag, word_tokenize\n", "\n", "from keras.models import Sequential\n", "from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,\n", " Convolution1D, MaxPooling1D, Bidirectional,\n", " GlobalMaxPooling1D, Embedding, BatchNormalization,\n", " SpatialDropout1D)\n", "from keras.preprocessing import sequence\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.preprocessing.sequence import pad_sequences\n", "from keras.utils import np_utils\n", "from keras.optimizers import SGD\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "from datetime import datetime\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "PATH = \"/d/data/yelpdata/dataset/\"\n", "#PATH = \"d:\\\\data\\\\yelpdata\\\\dataset\\\\\"\n", "WEAT = f'{PATH}processed_weather/'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#businesses = pd.read_csv(f'{PATH}business_on.csv', index_col=0)\n", "reviews = pd.read_csv(f'{PATH}review_on.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews = reviews[['stars','text']]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews['text'].fillna('empty', inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Stage 1: Predicting star rating based on review text alone" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def clean_up(t):\n", " t = t.strip().lower()\n", " words = t.split()\n", " \n", " # first get rid of the stopwords, or a lemmatized stopword might not\n", " # be recognized as a stopword\n", " \n", " imp_words = ' '.join(w for w in words if w not in set(stopwords.words('english')))\n", "\n", " # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to\n", " # return only the base words (as opposed to stemming which can return\n", " # non-words). e.g. ponies -> poni with stemming, and pony with lemmatizing\n", " \n", " final_words = ''\n", " \n", " lemma = WordNetLemmatizer()\n", " for (w,tag) in pos_tag(word_tokenize(imp_words)):\n", " if tag.startswith('J'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='a')\n", " elif tag.startswith('V'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='v')\n", " elif tag.startswith('N'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='n')\n", " elif tag.startswith('R'):\n", " final_words += ' '+ lemma.lemmatize(w, pos='r')\n", " else:\n", " final_words += ' '+ w\n", " \n", " return final_words\n", "\n", "# what a great name. do_stuff\n", "\n", "def do_stuff (df):\n", " text = df['text'].copy()\n", " \n", " text.replace(to_replace={r'[^\\x00-\\x7F]':' '},inplace=True,regex=True)\n", " text.replace(to_replace={r'[^a-zA-Z]': ' '},inplace=True,regex=True)\n", " \n", " # Then lower case, tokenize and lemmatize\n", "\n", " # with over 600,000 entries, this is going to be one hell of a long apply...\n", " \n", " text = text.apply(lambda t:clean_up(t))\n", " return text" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# bidirectional LSTM, as described by Zhou et. al. 
(2016) http://www.aclweb.org/anthology/C16-1329\n", "def lstm_model (X_train, y_train,test, val='no'):\n", " model = Sequential()\n", " model.add(Embedding(50000,300,input_length=500,weights=[emb_matrix]))\n", " model.add(Convolution1D(filters=128, kernel_size=5, padding='same', activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Bidirectional(LSTM(128,dropout=0.1,recurrent_dropout=0.1)))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=3)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=3,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# converging to a very conventional convolutional NN model to convert non-conversational text to star ratings\n", "# uh... 
with a non-convex loss function\n", "# an LSTM network could do better, but it would also take significantly longer to run\n", "#\n", "# (not actually using the CNN model here)\n", "\n", "def cnn_model (X_train, y_train, test, val='no'):\n", " model=Sequential()\n", " model.add(Embedding(50000,128,input_length=500))\n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(5))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Convolution1D(128,5,activation='relu'))\n", " model.add(MaxPooling1D(35))\n", " model.add(Flatten())\n", " model.add(Dense(128,activation='relu'))\n", " model.add(Dropout(0.2))\n", " \n", " model.add(Dense(5,activation='softmax'))\n", " \n", " sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9) \n", " model.compile(loss='categorical_crossentropy',optimizer=sgd,metrics=['accuracy'])\n", " \n", " if val == 'no':\n", " model.fit(X_train,y_train,batch_size=128,epochs=5)\n", " else:\n", " model.fit(X_train,y_train,batch_size=128,epochs=5,validation_split=0.2)\n", " pred = model.predict(test)\n", " return pred" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#data = do_stuff(reviews)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#data.to_csv(f'{PATH}review_on_processed_text.csv')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = pd.Series.from_csv(f'{PATH}review_on_processed_text.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "stars = reviews['stars']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del reviews" ] }, { "cell_type": "code", 
"execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "enc = LabelEncoder()\n", "enc.fit(stars)\n", "y = enc.transform(stars)\n", "dummy_y = np_utils.to_categorical(y)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data.fillna('empty', inplace=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "tok = Tokenizer(num_words=50000)\n", "tok.fit_on_texts(data)\n", "\n", "sequenced = tok.texts_to_sequences(data)\n", "padded = pad_sequences(sequenced,maxlen=500)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# getting the pretrained weight matrix\n", "# based on https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout\n", "# by which I mean it's pretty much just that...\n", "\n", "EMBED_FILE = '/d/data/glove.42B.300d.txt'\n", "\n", "def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')\n", "embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBED_FILE))\n", "\n", "embed_size = 300\n", "max_features = 50000\n", "maxlen = 500\n", "\n", "all_embs = np.stack(embeddings_index.values())\n", "emb_mean,emb_std = all_embs.mean(), all_embs.std()\n", "\n", "word_index = tok.word_index\n", "nb_words = min(50000, len(word_index))\n", "emb_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))\n", "for word, i in word_index.items():\n", " if i >= max_features: continue\n", " embedding_vector = embeddings_index.get(word)\n", " if embedding_vector is not None: emb_matrix[i] = embedding_vector" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "del embedding_vector, embeddings_index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(padded, dummy_y, test_size=0.2, 
random_state=202)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "#del data,emb_mean,emb_std,embed_size" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 405993 samples, validate on 101499 samples\n", "Epoch 1/3\n", "405993/405993 [==============================] - 1226s 3ms/step - loss: 1.0164 - acc: 0.5483 - val_loss: 0.9206 - val_acc: 0.5885\n", "Epoch 2/3\n", "405993/405993 [==============================] - 1215s 3ms/step - loss: 0.9203 - acc: 0.5910 - val_loss: 0.8964 - val_acc: 0.5967\n", "Epoch 3/3\n", "405993/405993 [==============================] - 1224s 3ms/step - loss: 0.8912 - acc: 0.6035 - val_loss: 0.8804 - val_acc: 0.6102\n" ] } ], "source": [ "# normally 1 comes before 2, but... this just starts at 2\n", "pred2 = lstm_model (X_train, y_train, X_test, val='yes')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe one more epoch would've been helpful, but for my purposes now that's fine." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0.8826335410697205" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y_test,pred2)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "preds2 = np.argmax(pred2, axis=1)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ys = np.argmax(y_test, axis=1)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.77 0.74 0.75 15267\n", " 1 0.49 0.39 0.43 13048\n", " 2 0.52 0.45 0.49 22102\n", " 3 0.54 0.64 0.59 39179\n", " 4 0.70 0.69 0.70 37278\n", "\n", "avg / total 0.61 0.61 0.61 126874\n", "\n" ] } ], "source": [ "print(classification_report(ys,preds2))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[11294, 2525, 705, 423, 320],\n", " [ 2516, 5032, 3991, 1252, 257],\n", " [ 534, 2225, 10040, 8539, 764],\n", " [ 213, 308, 4067, 25051, 9540],\n", " [ 187, 78, 409, 10706, 25898]])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix (ys, preds2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While 0.61 precision/recall isn't great, the 0.8826 AUC score is very very okay, one of the better kinds of okay. Also, the validation scores during training were very good, which is always helpful. A benefit, no doubt, of using all 630,000 reviews of businesses in the Toronto area.\n", "\n", "The AUC score suggests that most of the predicted ratings are not too far off, and we can see that the vast majority of incorrect scores are within 1 star of the actual rating. 
Additionally, 1 and 5 star ratings had the greatest precision and recall, so our model is decent at picking up extreme sentiment (or the users are effusive in praise and unrestrained in condemnation). If I had split this into a positive/negative binary problem, obviously the accuracy would be a lot higher (at a glance, over 86% if we consider 3 to be negative and over 90% if we consider it to be positive), but it is interesting to try to pick up on the subtle differences between, say, a 4 and a 5 star rating.\n", "\n", "Let's see if adding in weather and relative price can increase accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stage 2: weather effects\n", "\n", "Star ratings are neither objective nor scientific. We humans often make bizarre, irrational and otherwise inconsistent choices due to many internal and external factors. Let's consider weather as one of the external factors, especially with regard to giving a star rating for a business. While good weather and a good mood might influence me to leave a more positive review as well as a higher star rating, there is really no way to know the sort of review I would have left had the weather been different (the old problem of not knowing probabilities conditional on histories that haven't happened).\n", "\n", "What we can do is see if the review text matches with the score, and if knowing the weather conditions can improve the accuracy of our star predictions." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w = pd.read_csv(f'{PATH}review_on.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w = reviews_w[['stars','date','text']]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\programdata\\anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:2728: DtypeWarning: Columns (12,16,20) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] } ], "source": [ "weather = pd.read_csv(f'{WEAT}all_weather.csv', index_col='Unnamed: 0')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "weather['Year'] = weather['Year'].astype(int)\n", "weather['Month'] = weather['Month'].astype(int)\n", "weather['Day'] = weather['Day'].astype(int)\n", "weather['Temp (°C)'] = weather['Temp (°C)'].astype(float)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "reviews_w['date'] = pd.to_datetime(reviews_w['date'])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | stars | \n", "date | \n", "text | \n", "
---|---|---|---|
0 | \n", "4 | \n", "2012-05-11 | \n", "Who would have guess that you would be able to... | \n", "
1 | \n", "4 | \n", "2015-10-27 | \n", "Always drove past this coffee house and wonder... | \n", "
2 | \n", "3 | \n", "2013-02-09 | \n", "Not bad!! Love that there is a gluten-free, ve... | \n", "
3 | \n", "5 | \n", "2016-04-06 | \n", "Love this place! Peggy is great with dogs and... | \n", "
4 | \n", "4 | \n", "2013-05-01 | \n", "This is currently my parents new favourite res... | \n", "
\n", " | Date/Time | \n", "Year | \n", "Month | \n", "Day | \n", "Time | \n", "Data Quality | \n", "Temp (°C) | \n", "Temp Flag | \n", "Dew Point Temp (°C) | \n", "Dew Point Temp Flag | \n", "... | \n", "Wind Spd Flag | \n", "Visibility (km) | \n", "Visibility Flag | \n", "Stn Press (kPa) | \n", "Stn Press Flag | \n", "Hmdx | \n", "Hmdx Flag | \n", "Wind Chill | \n", "Wind Chill Flag | \n", "Weather | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.0 | \n", "2006-01-01 00:00 | \n", "2006 | \n", "1 | \n", "1 | \n", "00:00 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
1.0 | \n", "2006-01-01 01:00 | \n", "2006 | \n", "1 | \n", "1 | \n", "01:00 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
2 rows × 25 columns
\n", "