Generating Results from a Complete Dataset
#Import Data (more libraries to come later in file)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
health=pd.read_csv("")
health
#Date in M/D/YR
#Active Calories measured in Cal
#Total Calories Burned = Active + Passive Calories Burned during the specified day
#Average Heart Rate in BPM
#Exercise Time measured in minutes
#Stand Time measured in hours
#Workout Duration in minutes
#Temperature in degrees Fahrenheit of the outside temperature (tied to the timezone/location of the workout)
#NOTE^ most strength workouts were indoors, others outdoors
#Calories Burned is the number of active calories burned in the workout session
#NOTE^ missing values for rows indicate that no workout sessions were recorded on that day
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Type | Workout Duration | Temperature | Calories Burned | Average Heart Rate | Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7/1/2023 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | Cardio | 21.0 | 75.0 | 147.0 | 108.0 | 6870 |
| 1 | 7/2/2023 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Strength Training | 53.0 | 66.0 | 260.0 | 108.0 | 8063 |
| 2 | 7/2/2023 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Basketball | 47.0 | 70.0 | 328.0 | 120.0 | 8063 |
| 3 | 7/3/2023 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | Cardio | 24.0 | 75.0 | 114.0 | 123.0 | 8673 |
| 4 | 7/4/2023 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | Cardio | 45.0 | 82.0 | 194.0 | 108.0 | 8718 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 3/31/2025 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | NaN | NaN | NaN | NaN | NaN | 7974 |
| 839 | 4/1/2025 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | Strength Training | 69.0 | 46.0 | 357.0 | 107.0 | 11714 |
| 840 | 4/2/2025 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | NaN | NaN | NaN | NaN | NaN | 9599 |
| 841 | 4/3/2025 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | NaN | NaN | NaN | NaN | NaN | 10290 |
| 842 | 4/4/2025 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | Strength Training | 65.0 | 64.0 | 352.0 | 107.0 | 12276 |
843 rows × 14 columns
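As a quick sanity check on the note above (a minimal sketch; column names come straight from the table), we can confirm that the missing values line up with days on which no workout was recorded:
#Count missing values per column
print(health.isna().sum())
#Rows with no recorded workout should be missing all of the workout-specific fields
no_workout_rows = health[health['Workout Type'].isna()]
print(no_workout_rows[['Workout Duration', 'Temperature', 'Calories Burned', 'Average Heart Rate']].notna().sum())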
#Heatmap of the average daily total calories burned, by year and month
import seaborn as sns
import calendar
#First, convert date to datetime
health['Date'] = pd.to_datetime(health['Date'])
#Then, extract month from date
health['Month'] = health['Date'].dt.month
health['Year'] = health['Date'].dt.year
monthly_avg = health.groupby(['Year', 'Month'])['Total Calories Burned'].mean().unstack()
#Rename month numbers to month names for readable axis labels
monthly_avg.columns = [calendar.month_name[i] for i in monthly_avg.columns]
#Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(monthly_avg, annot=monthly_avg, fmt=".1f", cmap="YlGnBu")
plt.title('Average Calories Burned Per Day by Month')
plt.xlabel('Month')
plt.ylabel('Year')
plt.show()
PART 1: Estimating Total Expected Calories Burned in a Day Given Certain Anticipated Fitness Stats
#Get Number of Workouts in a Day
health
dayswithworkout = health.copy()
dayswithworkout['Workout Count'] = health.groupby('Date')['Workout Duration'].transform('count')
dayswithworkout
#print(dayswithworkout['Workout Count'].value_counts())
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Type | Workout Duration | Temperature | Calories Burned | Average Heart Rate | Steps | Month | Year | Workout Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-07-01 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | Cardio | 21.0 | 75.0 | 147.0 | 108.0 | 6870 | 7 | 2023 | 1 |
| 1 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Strength Training | 53.0 | 66.0 | 260.0 | 108.0 | 8063 | 7 | 2023 | 2 |
| 2 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Basketball | 47.0 | 70.0 | 328.0 | 120.0 | 8063 | 7 | 2023 | 2 |
| 3 | 2023-07-03 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | Cardio | 24.0 | 75.0 | 114.0 | 123.0 | 8673 | 7 | 2023 | 1 |
| 4 | 2023-07-04 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | Cardio | 45.0 | 82.0 | 194.0 | 108.0 | 8718 | 7 | 2023 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 2025-03-31 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | NaN | NaN | NaN | NaN | NaN | 7974 | 3 | 2025 | 0 |
| 839 | 2025-04-01 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | Strength Training | 69.0 | 46.0 | 357.0 | 107.0 | 11714 | 4 | 2025 | 1 |
| 840 | 2025-04-02 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | NaN | NaN | NaN | NaN | NaN | 9599 | 4 | 2025 | 0 |
| 841 | 2025-04-03 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | NaN | NaN | NaN | NaN | NaN | 10290 | 4 | 2025 | 0 |
| 842 | 2025-04-04 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | Strength Training | 65.0 | 64.0 | 352.0 | 107.0 | 12276 | 4 | 2025 | 1 |
843 rows × 17 columns
But, if I had multiple workout sessions on the same day, the 'Workout Duration' values need to be summed, since each cell gives the length of a single workout. Once that sum is in place, we can keep just the first entry for each day, since it then encapsulates the day's totals for all the workout-based metrics.
Columns Not Appropriate for Predicting Total Calories Burned in a Day:
- 'Workout Type' : because we are only keeping one entry for each day (which essentially holds all workout data), we no longer have individual rows for each workout to assign a workout type to
- 'Average Heart Rate' : this is the average heart rate during a workout, and we are not currently using separate rows for each workout. This current dataframe is a date-specific, not workout-specific, set
- 'Temperature' : most of my workouts are indoors. Though my habits vary by time of year, and many soccer or basketball workouts are outside, the 'Month' column captures the seasonal effects already
- 'Calories Burned' : this is the active calories burned in a specific workout, which are summed for the day in 'Active Calories Burned' column already
#Sum Workout Duration for each day
#This is the total workout time for each day, not the total exercise time (which goes beyond just workouts)
dayswithworkout['Workout Duration'] = health.groupby('Date')['Workout Duration'].transform('sum')
#Only keep one entry per day
#Now a date-specific index, not a workout-specific index
daily_calories = dayswithworkout.drop_duplicates(subset='Date', keep='first')
daily_calories
#Drop columns that are not needed for this analysis
daily_calories=daily_calories.drop(columns=['Workout Type', 'Calories Burned', 'Average Heart Rate', 'Temperature'])
#Make NaN values 0
# Filling with 0 is OK b/c a 0 'Workout Duration' is valid for days with no recorded session; columns like 'Exercise Time' have no missing values, so they are unaffected
daily_calories = daily_calories.fillna(0)
daily_calories
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Duration | Steps | Month | Year | Workout Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-07-01 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | 21.0 | 6870 | 7 | 2023 | 1 |
| 1 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | 100.0 | 8063 | 7 | 2023 | 2 |
| 3 | 2023-07-03 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | 24.0 | 8673 | 7 | 2023 | 1 |
| 4 | 2023-07-04 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | 82.0 | 8718 | 7 | 2023 | 2 |
| 6 | 2023-07-05 | 542 | 2298 | 400 | 51 | 30 | 9 | 8 | 62.0 | 3622 | 7 | 2023 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 2025-03-31 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | 0.0 | 7974 | 3 | 2025 | 0 |
| 839 | 2025-04-01 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | 69.0 | 11714 | 4 | 2025 | 1 |
| 840 | 2025-04-02 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | 0.0 | 9599 | 4 | 2025 | 0 |
| 841 | 2025-04-03 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | 0.0 | 10290 | 4 | 2025 | 0 |
| 842 | 2025-04-04 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | 65.0 | 12276 | 4 | 2025 | 1 |
644 rows × 13 columns
Make a Correlation Matrix to gauge which variables are promising features for the upcoming model
#Correlation Matrix
#Compute the correlation matrix
correlation_matrix = daily_calories.corr()
# Display the correlation matrix
#print(correlation_matrix)
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
# Focus on correlations with 'Total Calories Burned', our target variable
correlation_with_target = correlation_matrix['Total Calories Burned'].sort_values(ascending=False)
print("Correlation with 'Total Calories Burned':")
print(correlation_with_target)
Correlation with 'Total Calories Burned':
Total Calories Burned     1.000000
Active Calories Burned    0.979000
Exercise Time             0.817898
Steps                     0.739849
Workout Duration          0.696781
Workout Count             0.520065
Stand Time                0.368875
Date                      0.218326
Active Calories Goal      0.148000
Year                      0.139821
Month                     0.065004
Exercise Time Goal             NaN
Stand Time Goal                NaN
Name: Total Calories Burned, dtype: float64
Strongest Correlations with Total Calories Burned:
- Active Calories Burned (0.98)
- Exercise Time (0.82)
- Steps (0.74)
- Workout Duration (0.70)
Features that would also be valuable given domain knowledge of my workouts:
- Month
- Workout Count
Features that are potentially redundant:
- Workout Duration & Exercise Time
*However, 'Exercise Time' also includes warmup/cooldown/stretching and other activity not done as part of a workout, so both are worth keeping (a quick overlap check is below)
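To gauge how much the two actually overlap, we can check their pairwise correlation directly (a small sketch on the same daily_calories frame):
#Quantify the overlap between the two potentially redundant features
duration_exercise_corr = daily_calories['Workout Duration'].corr(daily_calories['Exercise Time'])
print(f"Correlation between Workout Duration and Exercise Time: {duration_exercise_corr:.2f}")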
Ridge Regression Model:
Why Ridge regression: its L2 penalty shrinks coefficients, which helps prevent overfitting when features are highly correlated, as several of ours are. The model below uses the default regularization strength; a quick alpha-tuning sketch follows the results.
# Training the model using Ridge regression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
#Why we include these features:
# Using aggregate sum because we want to capture the total activity level for the day
#daily_calories = dayswithworkout.groupby('Date').agg({
# 'Steps': 'sum', #important b/c it is a measure of activity that doesn't require a workout session
# 'Exercise Time': 'sum', #important for measuring physical activity beyond just recorded workouts
# 'Stand Time': 'sum', #important metric for gauging general activity level during the day (aside from downtime)
# 'Month': 'sum', #most workouts were done indoors, but my personal workout type & duration tend to vary by time of season
# #due to the fact that I am a college student and my schedule changes by semester
# 'Total Calories Burned': 'sum'} #This is the target variable we are trying to predict
# ).reset_index()
# Feature Selection
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
# Add polynomial features (to get nonlinear relationships)
# Polynomial features can help capture interactions between features and nonlinear relationships
# For example, the relationship between exercise time and calories burned may not be linear
#Relatively small dataset, so less worried about added computational cost
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the Ridge regression model
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
# Make predictions on the training and testing sets
y_train_pred = ridge.predict(X_train_scaled)
y_test_pred = ridge.predict(X_test_scaled)
# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
# Evaluate the model on the testing set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
# Print the results
print(f'Training Mean Squared Error: {train_mse}')
print(f'Training R^2 Score: {train_r2}')
print(f'Test Mean Squared Error: {test_mse}')
print(f'Test R^2 Score: {test_r2}')
Training Mean Squared Error: 2236.3862952065692
Training R^2 Score: 0.9769335868896968
Test Mean Squared Error: 2657.7066421163663
Test R^2 Score: 0.9536452583279413
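Note that the model above uses Ridge's default regularization strength (alpha = 1.0). If we wanted the data to choose alpha instead, RidgeCV could be dropped in on the same scaled training split (a minimal sketch; the candidate alpha grid is an assumption):
#Optional: choose the Ridge regularization strength by cross-validation
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]) #candidate alphas are illustrative
ridge_cv.fit(X_train_scaled, y_train)
print(f'Selected alpha: {ridge_cv.alpha_}')
print(f'Test R^2 with tuned alpha: {ridge_cv.score(X_test_scaled, y_test):.4f}')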
The fit looks strong, but the model performs slightly better on the training data than on the test data (though not by much). Let's investigate the extent to which the model is overfitting:
#Residual Plot
residuals = y_test - y_test_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_test_pred, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted Total Calories Burned')
plt.ylabel('Residuals')
plt.show()
Notes on the residual analysis:
- seem to have a (relatively) even number of values in the positive and negative planes
- the absolute values of the residuals are pretty small (a prediction error of 50 calories is very tolerable when daily totals usually range from about 2300 to 3000)
- the residuals are spread out randomly (no obvious bias in a single direction)
#Calculate Relative Residuals
residuals = y_test - y_test_pred
# Calculate the average residual
mean_residual = np.mean(residuals)
# Calculate the relative residuals
relative_residuals = (residuals / y_test_pred) * 100
# Calculate the average relative residual
mean_relative_residual = np.mean(relative_residuals)
print(f'Mean Relative Residual: {mean_relative_residual}')
Mean Relative Residual: -0.20704110568879155
At a glance, a mean relative residual of roughly -0.21% indicates the model overestimates total calories burned for a day by about 0.21% on average (residuals are actual minus predicted, so a negative mean means predictions run slightly high). This is a very small error, and definitely acceptable in our context
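Because positive and negative residuals can cancel out in a mean, it is also worth checking the mean absolute percentage error, which reports the typical size of the error regardless of sign (a small sketch on the same test predictions):
#Mean absolute percentage error: typical size of the error, ignoring sign
mape = np.mean(np.abs((y_test - y_test_pred) / y_test)) * 100
print(f'Mean Absolute Percentage Error: {mape:.2f}%')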
#Feature Importance
# Note: some of the features are polynomial features, so the coefficients may not be directly interpretable
# Get the coefficients of the model
feature_names = poly.get_feature_names_out(input_features=['Active Calories Burned', 'Workout Count',
'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month'])
coefficients = ridge.coef_
# Create a DataFrame to display the feature importance
feature_importance = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
# Display the feature importance
#print(feature_importance)
# Plot the feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient')
plt.title('Feature Importance')
plt.show()
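Since the features were standardized, the coefficient magnitudes are roughly comparable, but the chart above ranks by signed value; ranking by absolute value makes the strongest drivers easier to spot (a small sketch on the same feature_importance frame):
#Rank features by coefficient magnitude (sign ignored) to surface the strongest drivers
top_features = feature_importance.reindex(feature_importance['Coefficient'].abs().sort_values(ascending=False).index)
print(top_features.head(10))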
#Actual vs Predicted Plot
plt.scatter(y_test, y_test_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted')
plt.show()
Another way to assess the Ridge regression, which assumes linearity, is to check whether the relationships between the variables do in fact appear linear (at least at first glance):
import seaborn as sns
import matplotlib.pyplot as plt
# Create pairplot for some key features and target variable
sns.pairplot(daily_calories[['Active Calories Burned', 'Workout Duration', 'Steps', 'Exercise Time', 'Total Calories Burned']])
plt.show()
Most of these relationships do appear linear, which is further evidence that the Ridge regression model is capturing them well
However, let's try another approach: nested cross-validation with hyperparameter tuning in the inner loop
We now switch to a Random Forest Regressor, in case nonlinear relationships are making an impact. To avoid redundancy, we remove the polynomial feature transformations and let the Random Forest and its hyperparameter tuning capture any nonlinearity:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Feature Selection
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Define Random Forest Regressor and hyperparameter grid
rf = RandomForestRegressor(random_state=42)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Define nested cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Use GridSearchCV for hyperparameter tuning in the inner loop
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv, scoring='r2', n_jobs=-1)
# Evaluate the model using cross_val_score in the outer loop
nested_scores = cross_val_score(grid_search, X_scaled, y, cv=outer_cv, scoring='r2')
# Print the results
print(f'Nested CV R^2 Scores: {nested_scores}')
print(f'Mean Nested CV R^2 Score: {np.mean(nested_scores)}')
Nested CV R^2 Scores: [0.94565796 0.97263246 0.96497095 0.96825753 0.96876938]
Mean Nested CV R^2 Score: 0.9640576559089986
The consistent R^2 scores across folds imply the model performs well on unseen data. Now let's refit with the optimal hyperparameters and evaluate on a held-out test split:
# Train the final model using the optimal parameters
grid_search.fit(X_scaled, y)
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')
# Split the data into training and testing sets for final evaluation
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the final model with the best parameters
final_model = RandomForestRegressor(**best_params, random_state=42)
final_model.fit(X_train, y_train)
# Evaluate the final model on the test set
y_test_pred = final_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
# Print the final evaluation results
print(f'Final Test Mean Squared Error: {test_mse}')
print(f'Final Test R^2 Score: {test_r2}')
print(f'Final Training R^2 Score: {final_model.score(X_train, y_train)}')
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Final Test Mean Squared Error: 3128.605471273497
Final Test R^2 Score: 0.9454320141597018
Final Training R^2 Score: 0.9929597911440643
NOTICE: the Test R^2 under the best parameters (0.945) is slightly lower than the Mean Nested CV R^2 (0.964).
BUT...the Training R^2 under the best parameters (0.993) is noticeably higher than both, which implies the hyperparameter tuning may be leading to some overfitting.
AS OF NOW, we feel confident trusting the results of the Ridge Regression. Despite a slightly higher training R^2 than test R^2, our residual analysis, mean relative residual, and exploration of variable relationships, paired with a considerably strong test R^2 score, imply the model is fitting the data well while mitigating overfitting.
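As one more check on that conclusion, the Ridge model can be cross-validated as a full pipeline, so the polynomial expansion and scaling are refit inside each fold rather than on the whole dataset (a minimal sketch reusing the X and y defined above):
#Cross-validate the full Ridge pipeline to confirm the train/test gap stays small across folds
from sklearn.pipeline import make_pipeline
ridge_pipeline = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), StandardScaler(), Ridge())
ridge_cv_scores = cross_val_score(ridge_pipeline, X, y, cv=5, scoring='r2')
print(f'Ridge 5-fold CV R^2 scores: {ridge_cv_scores}')
print(f'Mean Ridge CV R^2: {ridge_cv_scores.mean():.4f}')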
Below is a function that takes in anticipated activity levels for the day and outputs a prediction of total calorie count for the day!
#Function to get predictions from the Ridge regression model above
#Takes in array with anticipated activity levels for the day
#Outputs the predicted calories burned for that day
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
def predict_calories(active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month):
    X = np.array([active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month]).reshape(1, -1) #turn inputs into a single-row 2D array
    X_poly = poly.transform(X) #expand into the same degree-2 polynomial/interaction terms used in training
    X_scaled = scaler.transform(X_poly) #apply the StandardScaler fit during training (Z-score scaling)
    return ridge.predict(X_scaled)[0] #return the Ridge prediction as a scalar
We test this function on sample days NOT included in our original dataset:
In the below test for example, the actual total calories recorded by the Apple Watch with these statistics was 2357 Cal. This function, based on the Ridge Regression Model, guessed 2398.3 Cal. Similar accuracies were observed for subsequent test trials.
*NOTE: This function is intended for consideration of the entire day. It performs best at the day's end, when the statistics are finalized, rather than halfway through the day. Additionally, the model is trained on my data specifically (my body composition and fitness level affect my personal calorie burn).
I would love to expand this to be versatile for people of other weights, body types, and workout schedules.
#Example Test
#Plug in sample parameter values and see output expected Total Calories
active_calories = 593
workout_count = 0
workout_duration = 0
steps = 11346
exercise_time = 57
stand_time = 11
month = 6
predicted_calories = predict_calories(active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month)
print(f'Predicted calories burned: {predicted_calories:.2f} Cal')
Predicted calories burned: 2398.30 Cal
c:\Users\lminkow\anaconda3\Lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names warnings.warn(
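The warning above is cosmetic: poly was fit on a DataFrame with named columns, but the function passes a bare NumPy array. One way to silence it without changing the prediction (a hedged sketch, not a change to the function above) is to wrap the inputs in a DataFrame with the same column names:
#Pass a DataFrame with the training column names so PolynomialFeatures sees valid feature names
sample = pd.DataFrame([[active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month]],
                      columns=['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps',
                               'Exercise Time', 'Stand Time', 'Month'])
print(f'Predicted calories burned: {ridge.predict(scaler.transform(poly.transform(sample)))[0]:.2f} Cal')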
Part 2: Answering Questions about my Activity
How likely am I to hit my Move Goal (active calories) if I do not work out?
#Note: my Move Goal has evolved from 400 to 650 to 700 active calories for a day
no_workout = dayswithworkout[dayswithworkout['Workout Count'] == 0]
#days with no workout sessions
print(len(no_workout))
no_workout
#Days where I surpassed MOVE goal anyways
moved = no_workout[no_workout['Active Calories Burned'] >= no_workout['Active Calories Goal']]
print(len(moved))
#Likelihood of exceeding MOVE goal on days with no workout sessions
moved_ratio = len(moved) / len(no_workout) * 100
print(f'Likelihood of exceeding MOVE goal on days with no workout sessions: {moved_ratio:.2f}%')
174
35
Likelihood of exceeding MOVE goal on days with no workout sessions: 20.11%
What was my maximum Step count in a day, and how have my Steps changed over time?
# Maximum steps in a day
max_steps = health['Steps'].max()
# What day was it?
max_steps_day = health.loc[health['Steps'] == max_steps, 'Date'].values[0]
# Convert to a more readable date format if necessary
max_steps_day = pd.to_datetime(max_steps_day).strftime('%Y-%m-%d')
# Print the results
print(f'Maximum steps in a day: {max_steps} on {max_steps_day}')
Maximum steps in a day: 26503 on 2025-03-21
plt.figure(figsize=(12, 6))
plt.plot(health['Date'], health['Steps'], label='Steps')
plt.title('Number of Steps Over Time')
plt.xlabel('Date')
plt.ylabel('Steps')
plt.legend()
plt.show()
How do my steps change when I am traveling?
event_data=health.copy()
# Convert 'Date' column to datetime if not already done
event_data['Date'] = pd.to_datetime(event_data['Date'])
# Define events with proper datetime conversion
events = [
('Emory Move-In Day', pd.to_datetime('2023-08-19'), pd.to_datetime('2023-08-25'), 'skyblue'),
('Spring Break (Austin)', pd.to_datetime('2024-03-08'), pd.to_datetime('2024-03-15'), 'coral'),
('Summer (Chicago)', pd.to_datetime('2024-06-01'), pd.to_datetime('2024-08-05'), 'goldenrod'),
('Fall Break (SF)', pd.to_datetime('2024-10-12'), pd.to_datetime('2024-10-15'), 'mediumseagreen'),
('Spring Break (Disney & Miami)', pd.to_datetime('2025-03-08'), pd.to_datetime('2025-03-14'), 'slateblue'),
]
# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(event_data['Date'], event_data['Steps'], label='Steps', color='black')
plt.xticks(rotation=45)
# Add event spans
for event, start_date, end_date, color in events:
plt.axvspan(start_date, end_date, color=color, alpha=0.3, label=event)
plt.xlabel('Date')
plt.ylabel('Steps')
plt.title('Travel-Based Events')
# Avoid duplicate labels in the legend
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys(), loc='upper left')
# Show plot with tight layout
plt.grid(False)
plt.tight_layout()
plt.show()