{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Board Game Review Prediction\n", "> Predict a board game's average user rating from its game information using Linear Regression and Random Forest Regression. The data comes from BoardGameGeek and was collected with a scraper.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Machine_Learning]\n", "- image: images/bgr_heatmap.png" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Required Packages" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import numpy as np\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd\n", "import sklearn\n", "\n", "plt.rcParams['figure.figsize'] = (8, 8)\n", "\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version Check" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]\n", "Numpy: 1.18.1\n", "Matplotlib: 3.1.3\n", "Seaborn: 0.10.0\n", "Pandas: 1.0.1\n", "Scikit-learn: 0.22.1\n" ] } ], "source": [ "print('Python: {}'.format(sys.version))\n", "print('Numpy: {}'.format(np.__version__))\n", "print('Matplotlib: {}'.format(mpl.__version__))\n", "print('Seaborn: {}'.format(sns.__version__))\n", "print('Pandas: {}'.format(pd.__version__))\n", "print('Scikit-learn: {}'.format(sklearn.__version__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Load\n", "The raw dataset (`games.csv`) is available [here](https://raw.githubusercontent.com/ThaWeatherman/scrapers/master/boardgamegeek/games.csv)." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id type name \\\n", "0 12333 boardgame Twilight Struggle \n", "1 120677 boardgame Terra Mystica \n", "2 102794 boardgame Caverna: The Cave Farmers \n", "3 25613 boardgame Through the Ages: A Story of Civilization \n", "4 3076 boardgame Puerto Rico \n", "\n", " yearpublished minplayers maxplayers playingtime minplaytime \\\n", "0 2005.0 2.0 2.0 180.0 180.0 \n", "1 2012.0 2.0 5.0 150.0 60.0 \n", "2 2013.0 1.0 7.0 210.0 30.0 \n", "3 2006.0 2.0 4.0 240.0 240.0 \n", "4 2002.0 2.0 5.0 150.0 90.0 \n", "\n", " maxplaytime minage users_rated average_rating bayes_average_rating \\\n", "0 180.0 13.0 20113 8.33774 8.22186 \n", "1 150.0 12.0 14383 8.28798 8.14232 \n", "2 210.0 12.0 9262 8.28994 8.06886 \n", "3 240.0 12.0 13294 8.20407 8.05804 \n", "4 150.0 12.0 39883 8.14261 8.04524 \n", "\n", " total_owners total_traders total_wanters total_wishers total_comments \\\n", "0 26647 372 1219 5865 5347 \n", "1 16519 132 1586 6277 2526 \n", "2 12230 99 1476 5600 1700 \n", "3 14343 362 1084 5075 3378 \n", "4 44362 795 861 5414 9173 \n", "\n", " total_weights average_weight \n", "0 2562 3.4785 \n", "1 1423 3.8939 \n", "2 777 3.7761 \n", "3 1642 4.1590 \n", "4 5213 3.2943 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the data\n", "games = pd.read_csv('./dataset/games.csv')\n", "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id yearpublished minplayers maxplayers playingtime \\\n", "count 81312.000000 81309.000000 81309.000000 81309.000000 81309.000000 \n", "mean 72278.150138 1806.630668 1.992018 5.637703 51.634788 \n", "std 58818.237742 588.517834 0.931034 56.076890 345.699969 \n", "min 1.000000 -3500.000000 0.000000 0.000000 0.000000 \n", "25% 21339.750000 1984.000000 2.000000 2.000000 8.000000 \n", "50% 43258.000000 2003.000000 2.000000 4.000000 30.000000 \n", "75% 128836.500000 2010.000000 2.000000 6.000000 60.000000 \n", "max 184451.000000 2018.000000 99.000000 11299.000000 60120.000000 \n", "\n", " minplaytime maxplaytime minage users_rated average_rating \\\n", "count 81309.000000 81309.000000 81309.000000 81312.000000 81312.000000 \n", "mean 49.276833 51.634788 6.983975 161.886585 4.212144 \n", "std 334.483934 345.699969 5.035138 1145.978126 3.056551 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 10.000000 8.000000 0.000000 0.000000 0.000000 \n", "50% 30.000000 30.000000 8.000000 2.000000 5.265620 \n", "75% 60.000000 60.000000 12.000000 16.000000 6.718777 \n", "max 60120.000000 60120.000000 120.000000 53680.000000 10.000000 \n", "\n", " bayes_average_rating total_owners total_traders total_wanters \\\n", "count 81312.000000 81312.000000 81312.000000 81312.000000 \n", "mean 1.157632 262.502509 9.236423 12.688890 \n", "std 2.340033 1504.536693 39.757408 60.764207 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 1.000000 0.000000 0.000000 \n", "50% 0.000000 7.000000 0.000000 0.000000 \n", "75% 0.000000 51.000000 2.000000 3.000000 \n", "max 8.221860 73188.000000 1395.000000 1586.000000 \n", "\n", " total_wishers total_comments total_weights average_weight \n", "count 81312.000000 81312.000000 81312.000000 81312.000000 \n", "mean 42.719144 49.290031 16.488009 0.908083 \n", "std 239.292628 284.862853 115.980285 1.176002 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 
1.000000 1.000000 0.000000 0.000000 \n", "75% 7.000000 9.000000 2.000000 1.916700 \n", "max 6402.000000 11798.000000 5996.000000 5.000000 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games.describe()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['id', 'type', 'name', 'yearpublished', 'minplayers', 'maxplayers',\n", " 'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated',\n", " 'average_rating', 'bayes_average_rating', 'total_owners',\n", " 'total_traders', 'total_wanters', 'total_wishers', 'total_comments',\n", " 'total_weights', 'average_weight'],\n", " dtype='object')\n", "(81312, 20)\n" ] } ], "source": [ "print(games.columns)\n", "print(games.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to predict `average_rating`. However, many rows have a rating of 0 (the 25th percentile of both `average_rating` and `users_rated` is 0), so those rows should be removed before modeling." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": 
"\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Make a histogram of all the ratings in the average_rating column\n", "plt.hist(games['average_rating']);" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id 318\n", "type boardgame\n", "name Looney Leo\n", "yearpublished 0\n", "minplayers 0\n", "maxplayers 0\n", "playingtime 0\n", "minplaytime 0\n", "maxplaytime 0\n", "minage 0\n", "users_rated 0\n", "average_rating 0\n", "bayes_average_rating 0\n", "total_owners 0\n", "total_traders 0\n", "total_wanters 0\n", "total_wishers 1\n", "total_comments 0\n", "total_weights 0\n", "average_weight 0\n", "Name: 13048, dtype: object\n" ] } ], "source": [ "# Print the first row of all the games with zero scores\n", "print(games[games['average_rating'] == 0].iloc[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This game has no user ratings (`users_rated` is 0) and nearly every other field is 0 as well, so the row carries no useful information for predicting ratings." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "id 12333\n", "type boardgame\n", "name Twilight Struggle\n", "yearpublished 2005\n", "minplayers 2\n", "maxplayers 2\n", "playingtime 180\n", "minplaytime 180\n", "maxplaytime 180\n", "minage 13\n", "users_rated 20113\n", "average_rating 8.33774\n", "bayes_average_rating 8.22186\n", "total_owners 26647\n", "total_traders 372\n", "total_wanters 1219\n", "total_wishers 5865\n", "total_comments 5347\n", "total_weights 2562\n", "average_weight 3.4785\n", "Name: 0, dtype: object\n" ] } ], "source": [ "# Print the first row of games with scores greater than 0\n", "print(games[games['average_rating'] > 0].iloc[0])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Remove any rows without user ratings\n", "games = games[games['users_rated'] > 0]\n", "\n", "# Remove any rows with missing values (reassign instead of inplace to avoid modifying a filtered copy)\n", "games = games.dropna(axis=0)\n", "\n", "# Make a histogram of all the average ratings\n", "plt.hist(games['average_rating']);" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Correlation matrix\n", "corrmat = games.corr()\n", "\n", "fig = plt.figure(figsize=(12, 9))\n", "sns.heatmap(corrmat, vmax=.8, square=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess Dataset" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Get all the columns from the dataFrame\n", "columns = games.columns.tolist()\n", "\n", "# Filter the columns to remove data we don't want\n", "columns = [c for c in columns if c not in ['bayes_average_rating', 'average_rating', 'type', 'name', 'id']]\n", "\n", "# Store the variable we'll be predicting on\n", "target = 'average_rating'" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(45515, 20)\n", "(11379, 20)\n" ] } ], "source": [ "# Generate training and test datasets\n", "train = games.sample(frac=0.8, random_state=1)\n", "test = games.loc[~games.index.isin(train.index)]\n", "\n", "# Print shapes\n", "print(train.shape)\n", "print(test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build Linear Regression Model" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Initialize the model class\n", "lr = LinearRegression()\n", "\n", "# Fit the model with training data\n", "lr.fit(train[columns], train[target])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE : 2.0788190326293257\n" ] } ], "source": [ "# 
Generate predictions for the test set\n", "predictions = lr.predict(test[columns])\n", "\n", "# Compute error between actual values and test predictions\n", "print('MSE : {}'.format(mean_squared_error(test[target], predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build Non-Linear Regression (RandomForestRegressor) Model" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " max_samples=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=10,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None, oob_score=False,\n", " random_state=1, verbose=0, warm_start=False)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "# Initialize the model class\n", "rfr = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=1)\n", "\n", "# Fit the model with training data\n", "rfr.fit(train[columns], train[target])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE : 1.4458560046071653\n" ] } ], "source": [ "# Generate predictions for the test set\n", "predictions = rfr.predict(test[columns])\n", "\n", "# Compute error between actual values and test predictions\n", "print('MSE : {}'.format(mean_squared_error(test[target], predictions)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The random forest (non-linear) model reaches a lower test MSE (about 1.45) than the linear regression model (about 2.08), so it fits this dataset better."
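, "\n", "
To put the two scores on the rating scale, the MSE values printed above can be converted to RMSE and compared directly. This is a minimal sketch that uses only those two printed numbers; the helper names `rmse` and `relative_reduction` are illustrative assumptions and are not part of the original notebook.

```python
import math

# MSE values copied from the notebook outputs above
mse_lr = 2.0788190326293257   # LinearRegression
mse_rfr = 1.4458560046071653  # RandomForestRegressor


def rmse(mse):
    # Hypothetical helper: square root of the MSE, in rating points
    return math.sqrt(mse)


def relative_reduction(baseline, candidate):
    # Hypothetical helper: fraction by which candidate improves on baseline
    return (baseline - candidate) / baseline


print('RMSE (linear regression): {:.3f}'.format(rmse(mse_lr)))
print('RMSE (random forest):     {:.3f}'.format(rmse(mse_rfr)))
print('MSE reduction: {:.1%}'.format(relative_reduction(mse_lr, mse_rfr)))
```

Because RMSE is in the same units as `average_rating`, it is easier to read than MSE when judging how far off a typical prediction is.
"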
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validate the Models on a Single Test Sample" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "yearpublished 2008.0000\n", "minplayers 1.0000\n", "maxplayers 5.0000\n", "playingtime 200.0000\n", "minplaytime 100.0000\n", "maxplaytime 200.0000\n", "minage 12.0000\n", "users_rated 15774.0000\n", "total_owners 16429.0000\n", "total_traders 205.0000\n", "total_wanters 1343.0000\n", "total_wishers 5149.0000\n", "total_comments 3458.0000\n", "total_weights 1450.0000\n", "average_weight 3.7531\n", "Name: 14, dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test[columns].iloc[1]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[9.20860328]\n", "[7.85532168]\n" ] } ], "source": [ "# Make predictions with both models\n", "rating_lr = lr.predict(test[columns].iloc[1].values.reshape(1, -1))\n", "rating_rfr = rfr.predict(test[columns].iloc[1].values.reshape(1, -1))\n", "\n", "# Print out the predictions\n", "print(rating_lr)\n", "print(rating_rfr)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.99115\n" ] } ], "source": [ "# Actual value\n", "print(test[target].iloc[1])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }