In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,MultiComparison)

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/weilixiang/sta2453_project1/master/merged_data.csv")

### Variables Description

- PlayerID: An integer from 1 to 17 that uniquely identifies each player
- SessionType: A categorical variable describing the training session types: Skills, Mobility/Recovery, Game, Conditioning, Strength, Combat, Speed
- Duration: A float representing session length, ranging from 3.00 to 120.00
- RPE: An integer representing rate of perceived exertion (0-10 scale)
- SessionLoad: A float obtained by Duration * RPE
- DailyLoad: A float that is the sum of SessionLoad for a given day
- AcuteLoad: A float that is the average daily load over past 7 days
- ChronicLoad: A float that is the average daily load over past 30 days
- AcuteChronicRatio: A float that is calculated by AcuteLoad/ChronicLoad
- Load_Status: A categorical variable derived from AcuteChronicRatio. If the ratio is above 1.2, the status is high; if it is below 0.8, the status is recovering; otherwise it is normal
- GameID: An integer representing a unique game ID, there are 38 games in total
- game_date: A date that indicates the date of the game
- train_date: The date of the training session, which is one day before game day
- AccelImpulse: A float that is the max absolute value of change in speed divided by change in time
- AccelLoad: A float that indicates the max load detected by the accelerometer
- Speed: A float that is the max movement speed of the player, in meters per second
- PerformanceScore: A weighted average of AccelImpulse, AccelLoad, and Speed. This is the response variable in our model and indicates how well each player performed
- Outcome: A binary categorical variable; W indicates a win and L indicates a loss
- PointsDiff: An integer giving the difference in scores; a positive value is the margin of a win
- Pain: A binary categorical variable (Yes or No) indicating whether a player is in pain
- Illness: A categorical variable indicating whether a player feels ill; possible values are Yes, No, and Slightly Off
- Menstruation: A binary categorical variable (Yes or No) indicating whether a player is currently menstruating
- Nutrition: A binary categorical variable (Excellent or Okay)
- NutritionAdjustment: A binary categorical variable indicating whether the player made a nutrition adjustment that day
- EWMScore: A float measuring the exponential moving average of the wellness score
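
The load variables in the list above are derived from one another, and a small sketch may make the chain easier to follow. The toy data, the `Date` column, and the `load_status` helper below are hypothetical, and the rolling windows are assumed to include the current day and to use whatever history is available; the original dataset may have been built differently.

```python
import pandas as pd

# Hypothetical toy data for a single player: one session per day for 10 days.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Duration": [60.0] * 10,
    "RPE": [5, 5, 5, 5, 5, 5, 5, 8, 8, 8],
})

# SessionLoad = Duration * RPE
toy["SessionLoad"] = toy["Duration"] * toy["RPE"]

# DailyLoad = sum of SessionLoad within each calendar day
toy["DailyLoad"] = toy.groupby("Date")["SessionLoad"].transform("sum")

# AcuteLoad / ChronicLoad = mean DailyLoad over the last 7 / 30 days.
# Assumption: the window includes the current day and shrinks at the start.
toy["AcuteLoad"] = toy["DailyLoad"].rolling(7, min_periods=1).mean()
toy["ChronicLoad"] = toy["DailyLoad"].rolling(30, min_periods=1).mean()
toy["AcuteChronicRatio"] = toy["AcuteLoad"] / toy["ChronicLoad"]


def load_status(ratio):
    """Map an acute:chronic ratio to the three Load_Status labels."""
    if ratio > 1.2:
        return "high"
    if ratio < 0.8:
        return "recovering"
    return "normal"


toy["Load_Status"] = toy["AcuteChronicRatio"].apply(load_status)
print(toy[["Date", "DailyLoad", "AcuteLoad", "ChronicLoad", "Load_Status"]])
```

Because several of these columns are functions of each other, they are likely to be strongly correlated when they enter the same model.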

In [3]:
import statsmodels.api as sm

import statsmodels.formula.api as smf

full_md = smf.mixedlm("PerformanceScore  ~ SessionType +  Duration + RPE + SessionLoad + \
    DailyLoad + AcuteLoad + ChronicLoad + AcuteChronicRatio + Load_Status + \
        Outcome + PointsDiff + Pain + Illness + Menstruation + \
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = full_md.fit()
print(mdf.summary())

                    Mixed Linear Model Regression Results
Model:                 MixedLM      Dependent Variable:      PerformanceScore
No. Observations:      562          Method:                  REML            
No. Groups:            17           Scale:                   58.6021         
Min. group size:       3            Likelihood:              -1956.4791      
Max. group size:       80           Converged:               Yes             
Mean group size:       33.1                                                  
-----------------------------------------------------------------------------
                                  Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------------
Intercept                         50.766    8.679  5.849 0.000  33.755 67.777
SessionType[T.Game]               -0.501    2.385 -0.210 0.834  -5.175  4.173
SessionType[T.Mobility/Recovery]   0.331    2.774  0.119 0.905  -5.106  5.769
Sessio

As we can see, ChronicLoad has a p-value of 0.992, far above our 0.1 threshold, so we drop it from the next model. Keep in mind that the estimated residual variance (the Scale entry in the summary, fitted by REML) is about 58.6.
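
In the summary table, REML is the estimation method, while the number near 58.6 is the Scale, that is, the estimated residual variance. A minimal sketch of reading these quantities directly from a fitted result follows. It uses synthetic data (the `toy` frame and its columns `y`, `x`, and `group` are hypothetical) so that the true residual variance is known.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): 10 groups with 30 rows each,
# y = 50 + 2*x + group effect (sd 2) + noise (sd 3).
rng = np.random.default_rng(0)
n_groups, n_per = 10, 30
group = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
group_effect = rng.normal(scale=2.0, size=n_groups)[group]
y = 50 + 2 * x + group_effect + rng.normal(scale=3.0, size=n_groups * n_per)
toy = pd.DataFrame({"y": y, "x": x, "group": group})

# Random-intercept model; mixedlm fits by REML unless reml=False is passed.
result = smf.mixedlm("y ~ x", toy, groups=toy["group"]).fit()

print("Scale (residual variance):", result.scale)  # true value here is 3**2 = 9
print("Log-likelihood:", result.llf)
print("p-value for x:", result.pvalues["x"])
```

One caveat when tracking these numbers across models: REML log-likelihoods are not comparable between models with different fixed effects, so a likelihood-based comparison would normally refit with `fit(reml=False)`.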

In [5]:
md1 = smf.mixedlm("PerformanceScore  ~ SessionType +  Duration + RPE + SessionLoad + \
    DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
        Outcome + PointsDiff + Pain + Illness + Menstruation + \
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md1.fit()
print(mdf.summary())

                    Mixed Linear Model Regression Results
Model:                 MixedLM      Dependent Variable:      PerformanceScore
No. Observations:      562          Method:                  REML            
No. Groups:            17           Scale:                   58.4976         
Min. group size:       3            Likelihood:              -1953.0431      
Max. group size:       80           Converged:               Yes             
Mean group size:       33.1                                                  
-----------------------------------------------------------------------------
                                  Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------------
Intercept                         50.784    8.618  5.893 0.000  33.893 67.674
SessionType[T.Game]               -0.500    2.380 -0.210 0.834  -5.165  4.165
SessionType[T.Mobility/Recovery]   0.334    2.765  0.121 0.904  -5.086  5.754
Sessio

As we can see, none of the SessionType levels is significant (for example, Mobility/Recovery has a p-value of 0.904, far above 0.1), so we drop SessionType from the next model. The residual variance (Scale) is about 58.5.

In [6]:
md2 = smf.mixedlm("PerformanceScore  ~  Duration + RPE + SessionLoad + \
    DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
        Outcome + PointsDiff + Pain + Illness + Menstruation + \
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md2.fit()
print(mdf.summary())

                 Mixed Linear Model Regression Results
Model:               MixedLM    Dependent Variable:    PerformanceScore
No. Observations:    562        Method:                REML            
No. Groups:          17         Scale:                 58.6317         
Min. group size:     3          Likelihood:            -1963.8942      
Max. group size:     80         Converged:             Yes             
Mean group size:     33.1                                              
-----------------------------------------------------------------------
                            Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------
Intercept                   52.713    8.434  6.250 0.000  36.182 69.244
Load_Status[T.normal]       -6.701    2.449 -2.737 0.006 -11.500 -1.902
Load_Status[T.recovering]   -4.881    2.982 -1.637 0.102 -10.726  0.963
Outcome[T.W]                -2.659    1.315 -2.022 0.043  -5.236 -0.082
Pain[T.Ye

As we can see, Menstruation has a p-value of 0.903, far above 0.1, so we drop it from the next model. The residual variance (Scale) is about 58.6.

In [7]:
md3 = smf.mixedlm("PerformanceScore  ~  Duration + RPE + SessionLoad + \
    DailyLoad + AcuteLoad + AcuteChronicRatio + Load_Status + \
        Outcome + PointsDiff + Pain + Illness +\
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md3.fit()
print(mdf.summary())

                 Mixed Linear Model Regression Results
Model:               MixedLM    Dependent Variable:    PerformanceScore
No. Observations:    562        Method:                REML            
No. Groups:          17         Scale:                 58.5248         
Min. group size:     3          Likelihood:            -1965.0699      
Max. group size:     80         Converged:             Yes             
Mean group size:     33.1                                              
-----------------------------------------------------------------------
                            Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------
Intercept                   52.513    8.271  6.349 0.000  36.301 68.724
Load_Status[T.normal]       -6.693    2.445 -2.737 0.006 -11.486 -1.900
Load_Status[T.recovering]   -4.864    2.976 -1.635 0.102 -10.696  0.968
Outcome[T.W]                -2.674    1.308 -2.045 0.041  -5.237 -0.111
Pain[T.Ye

As we can see, RPE has a p-value of 0.195, which is above 0.1, so we drop it from the next model. The residual variance (Scale) is about 58.5.

In [8]:
md4 = smf.mixedlm("PerformanceScore  ~  Duration + SessionLoad +DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status +  Outcome + PointsDiff + Pain + Illness +\
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md4.fit()
print(mdf.summary())

                 Mixed Linear Model Regression Results
Model:               MixedLM    Dependent Variable:    PerformanceScore
No. Observations:    562        Method:                REML            
No. Groups:          17         Scale:                 58.6464         
Min. group size:     3          Likelihood:            -1965.4590      
Max. group size:     80         Converged:             Yes             
Mean group size:     33.1                                              
-----------------------------------------------------------------------
                            Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------
Intercept                   53.926    8.193  6.582 0.000  37.869 69.983
Load_Status[T.normal]       -6.552    2.445 -2.680 0.007 -11.344 -1.760
Load_Status[T.recovering]   -4.635    2.973 -1.559 0.119 -10.461  1.191
Outcome[T.W]                -2.662    1.309 -2.034 0.042  -5.228 -0.097
Pain[T.Ye

As we can see, Duration has a p-value of 0.379, which is above 0.1, so we drop it from the next model. The residual variance (Scale) is about 58.6.

In [9]:
md5 = smf.mixedlm("PerformanceScore  ~ SessionLoad +DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status +  Outcome + PointsDiff + Pain + Illness +\
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md5.fit()
print(mdf.summary())

                Mixed Linear Model Regression Results
Model:                MixedLM   Dependent Variable:   PerformanceScore
No. Observations:     562       Method:               REML            
No. Groups:           17        Scale:                58.6255         
Min. group size:      3         Likelihood:           -1963.0989      
Max. group size:      80        Converged:            Yes             
Mean group size:      33.1                                            
----------------------------------------------------------------------
                           Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
----------------------------------------------------------------------
Intercept                  55.145    8.072  6.832 0.000  39.324 70.966
Load_Status[T.normal]      -6.703    2.438 -2.749 0.006 -11.482 -1.924
Load_Status[T.recovering]  -4.845    2.963 -1.635 0.102 -10.652  0.961
Outcome[T.W]               -2.631    1.308 -2.011 0.044  -5.195 -0.066
Pain[T.Yes]            

As we can see, SessionLoad has a p-value of 0.186, which is above 0.1, so we drop it from the next model. The residual variance (Scale) is about 58.6.

In [10]:
md6 = smf.mixedlm("PerformanceScore  ~ DailyLoad + AcuteLoad + \
AcuteChronicRatio + Load_Status +  Outcome + PointsDiff + Pain + Illness +\
       Nutrition + NutritionAdjustment + EWMScore", df,groups = df["PlayerID"])
                 
mdf = md6.fit()
print(mdf.summary())

                 Mixed Linear Model Regression Results
Model:               MixedLM    Dependent Variable:    PerformanceScore
No. Observations:    562        Method:                REML            
No. Groups:          17         Scale:                 58.7169         
Min. group size:     3          Likelihood:            -1958.8323      
Max. group size:     80         Converged:             Yes             
Mean group size:     33.1                                              
-----------------------------------------------------------------------
                            Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------------------------
Intercept                   54.331    8.054  6.746 0.000  38.546 70.116
Load_Status[T.normal]       -6.640    2.440 -2.722 0.006 -11.422 -1.859
Load_Status[T.recovering]   -4.826    2.965 -1.628 0.104 -10.636  0.984
Outcome[T.W]                -2.610    1.309 -1.994 0.046  -5.176 -0.044
Pain[T.Ye

At this point, every remaining term either has a p-value below 0.1 or, for categorical variables, has at least one significant level. We stop here and use md6 as our final model.
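
The manual steps above amount to a backward elimination with a 0.1 threshold, where a categorical term is kept if any of its levels is significant. A rough sketch of that rule is given below. The helpers `term_pvalue` and `backward_eliminate` are hypothetical names, and the sketch always drops the term with the largest p-value, which may not match the exact order used in this notebook.

```python
def term_pvalue(term, pvalues):
    """Smallest p-value among the coefficients that belong to `term`.

    A categorical term such as Load_Status expands into coefficient names
    like 'Load_Status[T.normal]', so names starting with 'term[' also match.
    """
    matches = [p for name, p in pvalues.items()
               if name == term or name.startswith(term + "[")]
    return min(matches)


def backward_eliminate(fit_pvalues, terms, alpha=0.1):
    """Repeatedly drop the least significant term until all pass `alpha`.

    `fit_pvalues` is any callable that takes a list of terms, fits a model,
    and returns a mapping from coefficient name to p-value.
    """
    terms = list(terms)
    dropped = []
    while terms:
        pvalues = fit_pvalues(terms)
        worst = max(terms, key=lambda t: term_pvalue(t, pvalues))
        if term_pvalue(worst, pvalues) <= alpha:
            break  # every remaining term has at least one significant level
        terms.remove(worst)
        dropped.append(worst)
    return terms, dropped
```

To use this with the models in this notebook, `fit_pvalues` could be a small function that joins the terms into a formula such as `"PerformanceScore ~ " + " + ".join(terms)`, fits it with `smf.mixedlm(..., df, groups=df["PlayerID"])`, and returns the `pvalues` of the fitted result.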