{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Wellness Data\n", "\n", "The original wellness data of athletes consists of 19 columns. Our aim for this dataset is to come up with a quantitative summary measure(s). For complete understanding as to what each variable means, please refer to the [CodeBook](https://www.dropbox.com/s/c5v4hig2afq881h/3.%20DataFest%202019%20-%20Codebook.xlsx?dl=0). \n", "\n", "We'll first look at the dataset in depth, to possibly merge highly correlated measures or come up with new measures by transforming the existing ones. The summary of our wellness dataset is as follows:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 5011 entries, 0 to 5010\n", "Data columns (total 19 columns):\n", "Date 5011 non-null object\n", "PlayerID 5011 non-null int64\n", "Fatigue 5011 non-null int64\n", "Soreness 5011 non-null int64\n", "Desire 5011 non-null int64\n", "Irritability 5011 non-null int64\n", "BedTime 5011 non-null object\n", "WakeTime 5011 non-null object\n", "SleepHours 5011 non-null float64\n", "SleepQuality 5011 non-null int64\n", "MonitoringScore 5011 non-null int64\n", "Pain 5011 non-null object\n", "Illness 5011 non-null object\n", "Menstruation 4995 non-null object\n", "Nutrition 4174 non-null object\n", "NutritionAdjustment 4266 non-null object\n", "USGMeasurement 4843 non-null object\n", "USG 629 non-null float64\n", "TrainingReadiness 5011 non-null object\n", "dtypes: float64(2), int64(7), object(10)\n", "memory usage: 743.9+ KB\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from factor_analyzer import FactorAnalyzer\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "dfWell = pd.read_csv('https://www.dropbox.com/s/170bc3dimgn8ru8/wellness.csv?dl=1')\n", "dfWell.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there are some null values in our dataset. After close inspection, we decided to fill the null values of Menstruation, Nutrition and NutritionAdjustment with their modes and drop the USG and USGMeasurement columns as they were ~85% null. The resulting dataset looks as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 5011 entries, 0 to 5010\n", "Data columns (total 19 columns):\n", "Date 5011 non-null object\n", "PlayerID 5011 non-null int64\n", "Fatigue 5011 non-null int64\n", "Soreness 5011 non-null int64\n", "Desire 5011 non-null int64\n", "Irritability 5011 non-null int64\n", "BedTime 5011 non-null object\n", "WakeTime 5011 non-null object\n", "SleepHours 5011 non-null float64\n", "SleepQuality 5011 non-null int64\n", "MonitoringScore 5011 non-null int64\n", "Pain 5011 non-null object\n", "Illness 5011 non-null object\n", "Menstruation 5011 non-null object\n", "Nutrition 5011 non-null object\n", "NutritionAdjustment 5011 non-null object\n", "USGMeasurement 4843 non-null object\n", "USG 629 non-null float64\n", "TrainingReadiness 5011 non-null object\n", "dtypes: float64(2), int64(7), object(10)\n", "memory usage: 743.9+ KB\n" ] } ], "source": [ "# As Menstruation, Nutrition and Nutrition Adjustment are categorical, fill NA with their Modes\n", "dfWell.Menstruation.fillna(dfWell.Menstruation.mode()[0],inplace=True)\n", "dfWell.Nutrition.fillna(dfWell.Nutrition.mode()[0],inplace=True)\n", "dfWell.NutritionAdjustment.fillna(dfWell.NutritionAdjustment.mode()[0],inplace=True)\n", "dfWell.info()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Date</th>\n", " <th>PlayerID</th>\n", " <th>Fatigue</th>\n", " <th>Soreness</th>\n", " <th>Desire</th>\n", " <th>Irritability</th>\n", " <th>BedTime</th>\n", " <th>WakeTime</th>\n", " <th>SleepHours</th>\n", " <th>SleepQuality</th>\n", " <th>MonitoringScore</th>\n", " <th>Pain</th>\n", " <th>Illness</th>\n", " <th>Menstruation</th>\n", " <th>Nutrition</th>\n", " <th>NutritionAdjustment</th>\n", " <th>TrainingReadiness</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>2018-07-21</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>23:00:00</td>\n", " <td>07:00:00</td>\n", " <td>8.00</td>\n", " <td>2</td>\n", " <td>13</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>0%</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>2018-07-21</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>3</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>23:00:00</td>\n", " <td>07:00:00</td>\n", " <td>8.00</td>\n", " <td>4</td>\n", " <td>19</td>\n", " <td>Yes</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>0%</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>2018-07-21</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>22:30:00</td>\n", " <td>06:30:00</td>\n", " <td>8.00</td>\n", " <td>4</td>\n", " <td>19</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>100%</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>2018-07-21</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>00:30:00</td>\n", " <td>07:00:00</td>\n", " <td>6.50</td>\n", " <td>1</td>\n", " <td>15</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>95%</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>2018-07-21</td>\n", " <td>5</td>\n", " <td>5</td>\n", " <td>3</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>23:45:00</td>\n", " <td>07:00:00</td>\n", " <td>7.25</td>\n", " <td>4</td>\n", " <td>20</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>100%</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Date PlayerID Fatigue Soreness Desire Irritability BedTime \\\n", "0 2018-07-21 1 3 3 2 3 23:00:00 \n", "1 2018-07-21 2 4 3 4 4 23:00:00 \n", "2 2018-07-21 3 3 3 5 4 22:30:00 \n", "3 2018-07-21 4 2 3 5 4 00:30:00 \n", "4 2018-07-21 5 5 3 4 4 23:45:00 \n", "\n", " WakeTime SleepHours SleepQuality MonitoringScore Pain Illness \\\n", "0 07:00:00 8.00 2 13 No No \n", "1 07:00:00 8.00 4 19 Yes No \n", "2 06:30:00 8.00 4 19 No No \n", "3 07:00:00 6.50 1 15 No No \n", "4 07:00:00 7.25 4 20 No No \n", "\n", " Menstruation Nutrition NutritionAdjustment TrainingReadiness \n", "0 Yes Excellent Yes 0% \n", "1 Yes Excellent Yes 0% \n", "2 No Excellent Yes 100% \n", "3 Yes Excellent Yes 95% \n", "4 No Okay Yes 100% " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As USG measurement has almost 87% NA, it doesn't make sense to fill NA with random or mean values. \n", "# Hence, drop USG related columns\n", "dfWell = dfWell.drop(['USGMeasurement', 'USG'], axis=1)\n", "dfWell.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we drop the BedTime and WakeTime as this information is captured by the SleepHours already. We are also going to convert the TrainingReadiness and SleepTime to a scale of 1 to 7 to match all the other scales which are captured in the MonitoringScore. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Date</th>\n", " <th>PlayerID</th>\n", " <th>Fatigue</th>\n", " <th>Soreness</th>\n", " <th>Desire</th>\n", " <th>Irritability</th>\n", " <th>SleepHours</th>\n", " <th>SleepQuality</th>\n", " <th>MonitoringScore</th>\n", " <th>Pain</th>\n", " <th>Illness</th>\n", " <th>Menstruation</th>\n", " <th>Nutrition</th>\n", " <th>NutritionAdjustment</th>\n", " <th>TrainingReadiness</th>\n", " <th>SleepHoursScaled</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>2018-07-21</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>8.00</td>\n", " <td>2</td>\n", " <td>13</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>1.0</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>2018-07-21</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>3</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>8.00</td>\n", " <td>4</td>\n", " <td>19</td>\n", " <td>Yes</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>1.0</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>2018-07-21</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>8.00</td>\n", " <td>4</td>\n", " <td>19</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>7.0</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>2018-07-21</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>6.50</td>\n", " <td>1</td>\n", " <td>15</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>6.7</td>\n", " <td>2.586207</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>2018-07-21</td>\n", " <td>5</td>\n", " <td>5</td>\n", " <td>3</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>7.25</td>\n", " <td>4</td>\n", " <td>20</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>7.0</td>\n", " <td>2.793103</td>\n", " </tr>\n", " <tr>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <td>5006</td>\n", " <td>2017-08-01</td>\n", " <td>8</td>\n", " <td>6</td>\n", " <td>7</td>\n", " <td>7</td>\n", " <td>7</td>\n", " <td>6.50</td>\n", " <td>6</td>\n", " <td>33</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Poor</td>\n", " <td>No</td>\n", " <td>4.0</td>\n", " <td>2.586207</td>\n", " </tr>\n", " <tr>\n", " <td>5007</td>\n", " <td>2017-08-01</td>\n", " <td>10</td>\n", " <td>3</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>6.75</td>\n", " <td>3</td>\n", " <td>18</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>I Don't Know</td>\n", " <td>6.7</td>\n", " <td>2.655172</td>\n", " </tr>\n", " <tr>\n", " <td>5008</td>\n", " <td>2017-08-01</td>\n", " <td>12</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>4</td>\n", " <td>7.50</td>\n", " <td>3</td>\n", " <td>21</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Okay</td>\n", " <td>I Don't Know</td>\n", " <td>6.4</td>\n", " <td>2.862069</td>\n", " </tr>\n", " <tr>\n", " <td>5009</td>\n", " <td>2017-08-01</td>\n", " <td>13</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>7.50</td>\n", " <td>4</td>\n", " <td>20</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>7.0</td>\n", " <td>2.862069</td>\n", " </tr>\n", " <tr>\n", " <td>5010</td>\n", " <td>2017-08-01</td>\n", " <td>14</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>5</td>\n", " <td>8.00</td>\n", " <td>6</td>\n", " <td>24</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>7.0</td>\n", " <td>3.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5011 rows × 16 columns</p>\n", "</div>" ], "text/plain": [ " Date PlayerID Fatigue Soreness Desire Irritability \\\n", "0 2018-07-21 1 3 3 2 3 \n", "1 2018-07-21 2 4 3 4 4 \n", "2 2018-07-21 3 3 3 5 4 \n", "3 2018-07-21 4 2 3 5 4 \n", "4 2018-07-21 5 5 3 4 4 \n", "... ... ... ... ... ... ... \n", "5006 2017-08-01 8 6 7 7 7 \n", "5007 2017-08-01 10 3 4 4 4 \n", "5008 2017-08-01 12 5 4 5 4 \n", "5009 2017-08-01 13 4 4 4 4 \n", "5010 2017-08-01 14 4 4 5 5 \n", "\n", " SleepHours SleepQuality MonitoringScore Pain Illness Menstruation \\\n", "0 8.00 2 13 No No Yes \n", "1 8.00 4 19 Yes No Yes \n", "2 8.00 4 19 No No No \n", "3 6.50 1 15 No No Yes \n", "4 7.25 4 20 No No No \n", "... ... ... ... ... ... ... \n", "5006 6.50 6 33 No No No \n", "5007 6.75 3 18 No No No \n", "5008 7.50 3 21 No No Yes \n", "5009 7.50 4 20 No No No \n", "5010 8.00 6 24 No No No \n", "\n", " Nutrition NutritionAdjustment TrainingReadiness SleepHoursScaled \n", "0 Excellent Yes 1.0 3.000000 \n", "1 Excellent Yes 1.0 3.000000 \n", "2 Excellent Yes 7.0 3.000000 \n", "3 Excellent Yes 6.7 2.586207 \n", "4 Okay Yes 7.0 2.793103 \n", "... ... ... ... ... \n", "5006 Poor No 4.0 2.586207 \n", "5007 Excellent I Don't Know 6.7 2.655172 \n", "5008 Okay I Don't Know 6.4 2.862069 \n", "5009 Okay Yes 7.0 2.862069 \n", "5010 Okay Yes 7.0 3.000000 \n", "\n", "[5011 rows x 16 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We do not require Bed Time and Wake Time as well. We'll be using SleepHours instead\n", "dfWell = dfWell.drop(['BedTime', 'WakeTime'], axis=1)\n", "# Convert TrainingReadiness from String to fraction\n", "dfWell['TrainingReadiness'] = 1 + (dfWell['TrainingReadiness'].str.rstrip('%').astype('float') / 100.0 * 6)\n", "# Use MinMaxScaler to convert sleeptime to a number from 1 to 7\n", "from sklearn import preprocessing\n", "min_max_scaler = preprocessing.MinMaxScaler()\n", "dfWell[\"SleepHoursScaled\"] = 1 + (min_max_scaler.fit_transform(dfWell[[\"SleepHours\"]]) * 6)\n", "dfWell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We merge the MonitoringScore with TrainingReadiness and SleepHoursScaled to define a new measure called the WellnessScore. Our dataset looks a lot cleaner and concise now:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Date</th>\n", " <th>PlayerID</th>\n", " <th>Pain</th>\n", " <th>Illness</th>\n", " <th>Menstruation</th>\n", " <th>Nutrition</th>\n", " <th>NutritionAdjustment</th>\n", " <th>WellnessScore</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>2018-07-21</td>\n", " <td>1</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>17.000000</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>2018-07-21</td>\n", " <td>2</td>\n", " <td>Yes</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>23.000000</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>2018-07-21</td>\n", " <td>3</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>29.000000</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>2018-07-21</td>\n", " <td>4</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>24.286207</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>2018-07-21</td>\n", " <td>5</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>29.793103</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Date PlayerID Pain Illness Menstruation Nutrition \\\n", "0 2018-07-21 1 No No Yes Excellent \n", "1 2018-07-21 2 Yes No Yes Excellent \n", "2 2018-07-21 3 No No No Excellent \n", "3 2018-07-21 4 No No Yes Excellent \n", "4 2018-07-21 5 No No No Okay \n", "\n", " NutritionAdjustment WellnessScore \n", "0 Yes 17.000000 \n", "1 Yes 23.000000 \n", "2 Yes 29.000000 \n", "3 Yes 24.286207 \n", "4 Yes 29.793103 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a WellnessScore which is the sum of all the quantitative values\n", "dfWell[\"WellnessScore\"] = dfWell[\"MonitoringScore\"] + dfWell[\"TrainingReadiness\"] + dfWell[\"SleepHoursScaled\"]\n", "# Drop all the quantitative values as we'll be using only the WellnessScore from now on\n", "dfWell = dfWell.drop(['Fatigue','Soreness','Desire','Irritability','SleepHours','SleepQuality','MonitoringScore',\\\n", " 'TrainingReadiness','SleepHoursScaled'],axis=1)\n", "dfWell.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the data we have here isn't completely independent. Intuitively, doesn't our wellness today depend on how we felt yesterday or even a week before? So, we define a new score by applying exponential moving average to the WellnessScore. Now, each player's <b>'EWMScore'</b> will depend on a week of WellnessScores with exponentially decreasing weights as me move further back in time. The resulting sample of rows like this: " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Date</th>\n", " <th>PlayerID</th>\n", " <th>Pain</th>\n", " <th>Illness</th>\n", " <th>Menstruation</th>\n", " <th>Nutrition</th>\n", " <th>NutritionAdjustment</th>\n", " <th>WellnessScore</th>\n", " <th>EWMScore</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <td>0</td>\n", " <td>2017-08-01</td>\n", " <td>14</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>34.000000</td>\n", " <td>34.000000</td>\n", " </tr>\n", " <tr>\n", " <td>1</td>\n", " <td>2017-08-01</td>\n", " <td>2</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>35.824138</td>\n", " <td>35.824138</td>\n", " </tr>\n", " <tr>\n", " <td>2</td>\n", " <td>2017-08-01</td>\n", " <td>3</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>30.000000</td>\n", " <td>30.000000</td>\n", " </tr>\n", " <tr>\n", " <td>3</td>\n", " <td>2017-08-01</td>\n", " <td>5</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>31.596552</td>\n", " <td>31.596552</td>\n", " </tr>\n", " <tr>\n", " <td>4</td>\n", " <td>2017-08-01</td>\n", " <td>13</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>29.862069</td>\n", " <td>29.862069</td>\n", " </tr>\n", " <tr>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <td>5006</td>\n", " <td>2018-07-21</td>\n", " <td>14</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>28.206897</td>\n", " <td>28.557073</td>\n", " </tr>\n", " <tr>\n", " <td>5007</td>\n", " <td>2018-07-21</td>\n", " <td>15</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>34.837931</td>\n", " <td>35.176056</td>\n", " </tr>\n", " <tr>\n", " <td>5008</td>\n", " <td>2018-07-21</td>\n", " <td>16</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Okay</td>\n", " <td>Yes</td>\n", " <td>25.000000</td>\n", " <td>26.956113</td>\n", " </tr>\n", " <tr>\n", " <td>5009</td>\n", " <td>2018-07-21</td>\n", " <td>10</td>\n", " <td>No</td>\n", " <td>Slightly Off</td>\n", " <td>No</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>22.993103</td>\n", " <td>24.044602</td>\n", " </tr>\n", " <tr>\n", " <td>5010</td>\n", " <td>2018-07-21</td>\n", " <td>1</td>\n", " <td>No</td>\n", " <td>No</td>\n", " <td>Yes</td>\n", " <td>Excellent</td>\n", " <td>Yes</td>\n", " <td>17.000000</td>\n", " <td>20.872639</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5011 rows × 9 columns</p>\n", "</div>" ], "text/plain": [ " Date PlayerID Pain Illness Menstruation Nutrition \\\n", "0 2017-08-01 14 No No No Okay \n", "1 2017-08-01 2 No No No Excellent \n", "2 2017-08-01 3 No No No Excellent \n", "3 2017-08-01 5 No No No Excellent \n", "4 2017-08-01 13 No No No Okay \n", "... ... ... ... ... ... ... \n", "5006 2018-07-21 14 No No No Excellent \n", "5007 2018-07-21 15 No No No Excellent \n", "5008 2018-07-21 16 No No No Okay \n", "5009 2018-07-21 10 No Slightly Off No Excellent \n", "5010 2018-07-21 1 No No Yes Excellent \n", "\n", " NutritionAdjustment WellnessScore EWMScore \n", "0 Yes 34.000000 34.000000 \n", "1 Yes 35.824138 35.824138 \n", "2 Yes 30.000000 30.000000 \n", "3 Yes 31.596552 31.596552 \n", "4 Yes 29.862069 29.862069 \n", "... ... ... ... \n", "5006 Yes 28.206897 28.557073 \n", "5007 Yes 34.837931 35.176056 \n", "5008 Yes 25.000000 26.956113 \n", "5009 Yes 22.993103 24.044602 \n", "5010 Yes 17.000000 20.872639 \n", "\n", "[5011 rows x 9 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a Exponential Moving Average for 7 days span\n", "dfWell = dfWell.sort_values(by='Date')\n", "dfWell['EWMScore'] = dfWell.groupby('PlayerID')['WellnessScore'].transform(lambda x: x.ewm(span=7 , min_periods=1).mean())\n", "dfWell.reset_index(drop=True, inplace=True)\n", "dfWell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have the final cleaned and summarized wellness dataset which we'll be using for our model." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "dfWell.to_csv(\"Clean Data/WellnessMeasure.csv\", index = None, header=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }