Generating Results from a Complete Dataset
#Import Data (more libraries to come later in file)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
health=pd.read_csv("")
health
#Date in M/D/YR
#Active Calories measured in Cal
#Total Calories Burned = Active + Passive Calories Burned during the specified day
#Average Heart Rate in BPM
#Exercise Time measured in minutes
#Stand Time measured in hours
#Workout Duration in minutes
#Temperature in degrees Fahrenheit of the outside temperature (tied to the timezone/location of the workout)
#NOTE^ most strength workouts were indoors, others outdoors
#Calories Burned is the number of active calories burned in the workout session
#NOTE^ missing values for rows indicate that no workout sessions were recorded on that day
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Type | Workout Duration | Temperature | Calories Burned | Average Heart Rate | Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7/1/2023 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | Cardio | 21.0 | 75.0 | 147.0 | 108.0 | 6870 |
| 1 | 7/2/2023 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Strength Training | 53.0 | 66.0 | 260.0 | 108.0 | 8063 |
| 2 | 7/2/2023 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Basketball | 47.0 | 70.0 | 328.0 | 120.0 | 8063 |
| 3 | 7/3/2023 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | Cardio | 24.0 | 75.0 | 114.0 | 123.0 | 8673 |
| 4 | 7/4/2023 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | Cardio | 45.0 | 82.0 | 194.0 | 108.0 | 8718 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 3/31/2025 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | NaN | NaN | NaN | NaN | NaN | 7974 |
| 839 | 4/1/2025 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | Strength Training | 69.0 | 46.0 | 357.0 | 107.0 | 11714 |
| 840 | 4/2/2025 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | NaN | NaN | NaN | NaN | NaN | 9599 |
| 841 | 4/3/2025 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | NaN | NaN | NaN | NaN | NaN | 10290 |
| 842 | 4/4/2025 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | Strength Training | 65.0 | 64.0 | 352.0 | 107.0 | 12276 |
843 rows × 14 columns
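As a quick sanity check on the note above (a minimal sketch; column names come straight from the table), we can confirm that the missing values line up with days on which no workout was recorded:
#Count missing values per column
print(health.isna().sum())
#Rows with no recorded workout should be missing all of the workout-specific fields
no_workout_rows = health[health['Workout Type'].isna()]
print(no_workout_rows[['Workout Duration', 'Temperature', 'Calories Burned', 'Average Heart Rate']].notna().sum())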
#Heatmap of the average daily total calories burned, by year and month
import seaborn as sns
import calendar
#First, convert date to datetime
health['Date'] = pd.to_datetime(health['Date'])
#Then, extract month from date
health['Month'] = health['Date'].dt.month
health['Year'] = health['Date'].dt.year
monthly_avg = health.groupby(['Year', 'Month'])['Total Calories Burned'].mean().unstack()
#Rename month numbers to month names for readable axis labels
monthly_avg.columns = [calendar.month_name[i] for i in monthly_avg.columns]
#Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(monthly_avg, annot=monthly_avg, fmt=".1f", cmap="YlGnBu")
plt.title('Average Calories Burned Per Day by Month')
plt.xlabel('Month')
plt.ylabel('Year')
plt.show()
PART 1: Estimating Total Expected Calories Burned in a Day Given Certain Anticipated Fitness Stats
#Get Number of Workouts in a Day
health
dayswithworkout = health.copy()
dayswithworkout['Workout Count'] = health.groupby('Date')['Workout Duration'].transform('count')
dayswithworkout
#print(dayswithworkout['Workout Count'].value_counts())
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Type | Workout Duration | Temperature | Calories Burned | Average Heart Rate | Steps | Month | Year | Workout Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-07-01 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | Cardio | 21.0 | 75.0 | 147.0 | 108.0 | 6870 | 7 | 2023 | 1 |
| 1 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Strength Training | 53.0 | 66.0 | 260.0 | 108.0 | 8063 | 7 | 2023 | 2 |
| 2 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | Basketball | 47.0 | 70.0 | 328.0 | 120.0 | 8063 | 7 | 2023 | 2 |
| 3 | 2023-07-03 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | Cardio | 24.0 | 75.0 | 114.0 | 123.0 | 8673 | 7 | 2023 | 1 |
| 4 | 2023-07-04 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | Cardio | 45.0 | 82.0 | 194.0 | 108.0 | 8718 | 7 | 2023 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 2025-03-31 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | NaN | NaN | NaN | NaN | NaN | 7974 | 3 | 2025 | 0 |
| 839 | 2025-04-01 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | Strength Training | 69.0 | 46.0 | 357.0 | 107.0 | 11714 | 4 | 2025 | 1 |
| 840 | 2025-04-02 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | NaN | NaN | NaN | NaN | NaN | 9599 | 4 | 2025 | 0 |
| 841 | 2025-04-03 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | NaN | NaN | NaN | NaN | NaN | 10290 | 4 | 2025 | 0 |
| 842 | 2025-04-04 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | Strength Training | 65.0 | 64.0 | 352.0 | 107.0 | 12276 | 4 | 2025 | 1 |
843 rows × 17 columns
But, if I had multiple workout sessions on the same day, the 'Workout Duration' values need to be summed, since each cell gives the length of a single workout. Once that sum is in place, we can keep just the first entry for each day, since it then encapsulates the day's totals for all the workout-based metrics.
Columns Not Appropriate for Predicting Total Calories Burned in a Day:
- 'Workout Type' : because we are only keeping one entry for each day (which essentially holds all workout data), we no longer have individual rows for each workout to assign a workout type to
- 'Average Heart Rate' : this is the average heart rate during a workout, and we are not currently using separate rows for each workout. This current dataframe is a date-specific, not workout-specific, set
- 'Temperature' : most of my workouts are indoors. Though my habits vary by time of year, and many soccer or basketball workouts are outside, the 'Month' column captures the seasonal effects already
- 'Calories Burned' : this is the active calories burned in a specific workout, which are summed for the day in 'Active Calories Burned' column already
#Sum Workout Duration for each day
#This is the total workout time for each day, not the total exercise time (which goes beyond just workouts)
dayswithworkout['Workout Duration'] = health.groupby('Date')['Workout Duration'].transform('sum')
#Only keep one entry per day
#Now a date-specific index, not a workout-specific index
daily_calories = dayswithworkout.drop_duplicates(subset='Date', keep='first')
daily_calories
#Drop columns that are not needed for this analysis
daily_calories=daily_calories.drop(columns=['Workout Type', 'Calories Burned', 'Average Heart Rate', 'Temperature'])
#Make NaN values 0
# Filling with 0 is OK b/c a 0 'Workout Duration' is valid for days with no recorded session; columns like 'Exercise Time' have no missing values, so they are unaffected
daily_calories = daily_calories.fillna(0)
daily_calories
| | Date | Active Calories Burned | Total Calories Burned | Active Calories Goal | Exercise Time | Exercise Time Goal | Stand Time | Stand Time Goal | Workout Duration | Steps | Month | Year | Workout Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-07-01 | 518 | 2240 | 400 | 49 | 30 | 11 | 8 | 21.0 | 6870 | 7 | 2023 | 1 |
| 1 | 2023-07-02 | 954 | 2720 | 400 | 138 | 30 | 13 | 8 | 100.0 | 8063 | 7 | 2023 | 2 |
| 3 | 2023-07-03 | 520 | 2273 | 400 | 48 | 30 | 14 | 8 | 24.0 | 8673 | 7 | 2023 | 1 |
| 4 | 2023-07-04 | 796 | 2530 | 400 | 140 | 30 | 11 | 8 | 82.0 | 8718 | 7 | 2023 | 2 |
| 6 | 2023-07-05 | 542 | 2298 | 400 | 51 | 30 | 9 | 8 | 62.0 | 3622 | 7 | 2023 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 838 | 2025-03-31 | 526 | 2437 | 700 | 30 | 30 | 13 | 8 | 0.0 | 7974 | 3 | 2025 | 0 |
| 839 | 2025-04-01 | 912 | 2843 | 700 | 93 | 30 | 15 | 8 | 69.0 | 11714 | 4 | 2025 | 1 |
| 840 | 2025-04-02 | 616 | 2532 | 700 | 31 | 30 | 14 | 8 | 0.0 | 9599 | 4 | 2025 | 0 |
| 841 | 2025-04-03 | 581 | 2496 | 700 | 34 | 30 | 12 | 8 | 0.0 | 10290 | 4 | 2025 | 0 |
| 842 | 2025-04-04 | 894 | 2788 | 700 | 89 | 30 | 15 | 8 | 65.0 | 12276 | 4 | 2025 | 1 |
644 rows × 13 columns
Make a Correlation Matrix to gauge which variables are promising features for the upcoming model
#Correlation Matrix
#Compute the correlation matrix
correlation_matrix = daily_calories.corr()
# Display the correlation matrix
#print(correlation_matrix)
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
# Focus on correlations with 'Total Calories Burned', our target variable
correlation_with_target = correlation_matrix['Total Calories Burned'].sort_values(ascending=False)
print("Correlation with 'Total Calories Burned':")
print(correlation_with_target)
Correlation with 'Total Calories Burned':
Total Calories Burned     1.000000
Active Calories Burned    0.979000
Exercise Time             0.817898
Steps                     0.739849
Workout Duration          0.696781
Workout Count             0.520065
Stand Time                0.368875
Date                      0.218326
Active Calories Goal      0.148000
Year                      0.139821
Month                     0.065004
Exercise Time Goal             NaN
Stand Time Goal                NaN
Name: Total Calories Burned, dtype: float64
Strongest Correlations with Total Calories Burned:
- Active Calories Burned (0.98)
- Exercise Time (0.82)
- Steps (0.74)
- Workout Duration (0.70)
Features that would also be valuable given domain knowledge of my workouts:
- Month
- Workout Count
Features that are potentially redundant:
- Workout Duration & Exercise Time
*However, 'Exercise Time' also includes warmup/cooldown/stretching and other activity not done as part of a workout, so both are worth keeping (a quick overlap check is below)
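To gauge how much the two actually overlap, we can check their pairwise correlation directly (a small sketch on the same daily_calories frame):
#Quantify the overlap between the two potentially redundant features
duration_exercise_corr = daily_calories['Workout Duration'].corr(daily_calories['Exercise Time'])
print(f"Correlation between Workout Duration and Exercise Time: {duration_exercise_corr:.2f}")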
Ridge Regression Model:
Why Ridge regression: its L2 penalty shrinks coefficients, which helps prevent overfitting when features are highly correlated, as several of ours are. The model below uses the default regularization strength; a quick alpha-tuning sketch follows the results.
# Training the model using Ridge regression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
#Why we include these features:
# Using aggregate sum because we want to capture the total activity level for the day
#daily_calories = dayswithworkout.groupby('Date').agg({
# 'Steps': 'sum', #important b/c it is a measure of activity that doesn't require a workout session
# 'Exercise Time': 'sum', #important for measuring physical activity beyond just recorded workouts
# 'Stand Time': 'sum', #important metric for gauging general activity level during the day (aside from downtime)
# 'Month': 'sum', #most workouts were done indoors, but my personal workout type & duration tend to vary by time of season
# #due to the fact that I am a college student and my schedule changes by semester
# 'Total Calories Burned': 'sum'} #This is the target variable we are trying to predict
# ).reset_index()
# Feature Selection
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
# Add polynomial features (to get nonlinear relationships)
# Polynomial features can help capture interactions between features and nonlinear relationships
# For example, the relationship between exercise time and calories burned may not be linear
#Relatively small dataset, so less worried about added computational cost
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the Ridge regression model
ridge = Ridge()
ridge.fit(X_train_scaled, y_train)
# Make predictions on the training and testing sets
y_train_pred = ridge.predict(X_train_scaled)
y_test_pred = ridge.predict(X_test_scaled)
# Evaluate the model on the training set
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
# Evaluate the model on the testing set
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
# Print the results
print(f'Training Mean Squared Error: {train_mse}')
print(f'Training R^2 Score: {train_r2}')
print(f'Test Mean Squared Error: {test_mse}')
print(f'Test R^2 Score: {test_r2}')
Training Mean Squared Error: 2236.3862952065692
Training R^2 Score: 0.9769335868896968
Test Mean Squared Error: 2657.7066421163663
Test R^2 Score: 0.9536452583279413
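Note that the model above uses Ridge's default regularization strength (alpha = 1.0). If we wanted the data to choose alpha instead, RidgeCV could be dropped in on the same scaled training split (a minimal sketch; the candidate alpha grid is an assumption):
#Optional: choose the Ridge regularization strength by cross-validation
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]) #candidate alphas are illustrative
ridge_cv.fit(X_train_scaled, y_train)
print(f'Selected alpha: {ridge_cv.alpha_}')
print(f'Test R^2 with tuned alpha: {ridge_cv.score(X_test_scaled, y_test):.4f}')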
The fit looks strong, but the model performs slightly better on the training data than on the test data (though not by much). Let's investigate the extent to which the model is overfitting:
#Residual Plot
residuals = y_test - y_test_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_test_pred, residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted Total Calories Burned')
plt.ylabel('Residuals')
plt.show()
Notes on the residual analysis:
- seem to have a (relatively) even number of values in the positive and negative planes
- the absolute values of the residuals are pretty small (a prediction error of 50 calories is very tolerable when daily totals usually range from about 2300 to 3000)
- the residuals are spread out randomly (no obvious bias in a single direction)
#Calculate Relative Residuals
residuals = y_test - y_test_pred
# Calculate the average residual
mean_residual = np.mean(residuals)
# Calculate the relative residuals
relative_residuals = (residuals / y_test_pred) * 100
# Calculate the average relative residual
mean_relative_residual = np.mean(relative_residuals)
print(f'Mean Relative Residual: {mean_relative_residual}')
Mean Relative Residual: -0.20704110568879155
At a glance, a mean relative residual of roughly -0.21% indicates the model overestimates total calories burned for a day by about 0.21% on average (residuals are actual minus predicted, so a negative mean means predictions run slightly high). This is a very small error, and definitely acceptable in our context
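Because positive and negative residuals can cancel out in a mean, it is also worth checking the mean absolute percentage error, which reports the typical size of the error regardless of sign (a small sketch on the same test predictions):
#Mean absolute percentage error: typical size of the error, ignoring sign
mape = np.mean(np.abs((y_test - y_test_pred) / y_test)) * 100
print(f'Mean Absolute Percentage Error: {mape:.2f}%')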
#Feature Importance
# Note: some of the features are polynomial features, so the coefficients may not be directly interpretable
# Get the coefficients of the model
feature_names = poly.get_feature_names_out(input_features=['Active Calories Burned', 'Workout Count',
'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month'])
coefficients = ridge.coef_
# Create a DataFrame to display the feature importance
feature_importance = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)
# Display the feature importance
#print(feature_importance)
# Plot the feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient')
plt.title('Feature Importance')
plt.show()
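Since the features were standardized, the coefficient magnitudes are roughly comparable, but the chart above ranks by signed value; ranking by absolute value makes the strongest drivers easier to spot (a small sketch on the same feature_importance frame):
#Rank features by coefficient magnitude (sign ignored) to surface the strongest drivers
top_features = feature_importance.reindex(feature_importance['Coefficient'].abs().sort_values(ascending=False).index)
print(top_features.head(10))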
#Actual vs Predicted Plot
plt.scatter(y_test, y_test_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted')
plt.show()
Another way to assess the Ridge regression, which assumes linearity, is to check whether the relationships between the variables do in fact appear linear (at least at first glance):
import seaborn as sns
import matplotlib.pyplot as plt
# Create pairplot for some key features and target variable
sns.pairplot(daily_calories[['Active Calories Burned', 'Workout Duration', 'Steps', 'Exercise Time', 'Total Calories Burned']])
plt.show()
Most of these relationships do appear linear, which is further evidence that the Ridge regression model is capturing them well
However, let's try another approach: nested cross-validation with hyperparameter tuning in the inner loop
We now switch to a Random Forest Regressor, in case nonlinear relationships are making an impact. To avoid redundancy, we remove the polynomial feature transformations and let the Random Forest and its hyperparameter tuning capture any nonlinearity:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Feature Selection
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Define Random Forest Regressor and hyperparameter grid
rf = RandomForestRegressor(random_state=42)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Define nested cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Use GridSearchCV for hyperparameter tuning in the inner loop
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=inner_cv, scoring='r2', n_jobs=-1)
# Evaluate the model using cross_val_score in the outer loop
nested_scores = cross_val_score(grid_search, X_scaled, y, cv=outer_cv, scoring='r2')
# Print the results
print(f'Nested CV R^2 Scores: {nested_scores}')
print(f'Mean Nested CV R^2 Score: {np.mean(nested_scores)}')
Nested CV R^2 Scores: [0.94565796 0.97263246 0.96497095 0.96825753 0.96876938]
Mean Nested CV R^2 Score: 0.9640576559089986
The consistent R^2 scores across folds imply the model performs well on unseen data. Now let's refit with the optimal hyperparameters and evaluate on a held-out test split:
# Train the final model using the optimal parameters
grid_search.fit(X_scaled, y)
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')
# Split the data into training and testing sets for final evaluation
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train the final model with the best parameters
final_model = RandomForestRegressor(**best_params, random_state=42)
final_model.fit(X_train, y_train)
# Evaluate the final model on the test set
y_test_pred = final_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
# Print the final evaluation results
print(f'Final Test Mean Squared Error: {test_mse}')
print(f'Final Test R^2 Score: {test_r2}')
print(f'Final Training R^2 Score: {final_model.score(X_train, y_train)}')
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Final Test Mean Squared Error: 3128.605471273497
Final Test R^2 Score: 0.9454320141597018
Final Training R^2 Score: 0.9929597911440643
NOTICE: the Test R^2 under the best parameters (0.945) is slightly lower than the Mean Nested CV R^2 (0.964).
BUT...the Training R^2 under the best parameters (0.993) is noticeably higher than both, which implies the hyperparameter tuning may be leading to some overfitting.
AS OF NOW, we feel confident trusting the results of the Ridge Regression. Despite a slightly higher training R^2 than test R^2, our residual analysis, mean relative residual, and exploration of variable relationships, paired with a considerably strong test R^2 score, imply the model is fitting the data well while mitigating overfitting.
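As one more check on that conclusion, the Ridge model can be cross-validated as a full pipeline, so the polynomial expansion and scaling are refit inside each fold rather than on the whole dataset (a minimal sketch reusing the X and y defined above):
#Cross-validate the full Ridge pipeline to confirm the train/test gap stays small across folds
from sklearn.pipeline import make_pipeline
ridge_pipeline = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), StandardScaler(), Ridge())
ridge_cv_scores = cross_val_score(ridge_pipeline, X, y, cv=5, scoring='r2')
print(f'Ridge 5-fold CV R^2 scores: {ridge_cv_scores}')
print(f'Mean Ridge CV R^2: {ridge_cv_scores.mean():.4f}')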
Below is a function that takes in anticipated activity levels for the day and outputs a prediction of total calorie count for the day!
#Function to get predictions from the Ridge regression model above
#Takes in array with anticipated activity levels for the day
#Outputs the predicted calories burned for that day
X = daily_calories[['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps', 'Exercise Time', 'Stand Time', 'Month']]
y = daily_calories['Total Calories Burned']
def predict_calories(active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month):
    X = np.array([active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month]).reshape(1, -1) #turn inputs into a single-row 2D array
    X_poly = poly.transform(X) #expand into the same degree-2 polynomial/interaction terms used in training
    X_scaled = scaler.transform(X_poly) #apply the StandardScaler fit during training (Z-score scaling)
    return ridge.predict(X_scaled)[0] #return the Ridge prediction as a scalar
We test this function on sample days NOT included in our original dataset:
In the below test for example, the actual total calories recorded by the Apple Watch with these statistics was 2357 Cal. This function, based on the Ridge Regression Model, guessed 2398.3 Cal. Similar accuracies were observed for subsequent test trials.
*NOTE: This function is intended for consideration of the entire day. It performs best at the day's end, when the statistics are finalized, rather than halfway through the day. Additionally, the model is trained on my data specifically (my body composition and fitness level affect my personal calorie burn).
I would love to expand this to be versatile for people of other weights, body types, and workout schedules.
#Example Test
#Plug in sample parameter values and see output expected Total Calories
active_calories = 593
workout_count = 0
workout_duration = 0
steps = 11346
exercise_time = 57
stand_time = 11
month = 6
predicted_calories = predict_calories(active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month)
print(f'Predicted calories burned: {predicted_calories:.2f} Cal')
Predicted calories burned: 2398.30 Cal
c:\Users\lminkow\anaconda3\Lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names warnings.warn(
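The warning above is cosmetic: poly was fit on a DataFrame with named columns, but the function passes a bare NumPy array. One way to silence it without changing the prediction (a hedged sketch, not a change to the function above) is to wrap the inputs in a DataFrame with the same column names:
#Pass a DataFrame with the training column names so PolynomialFeatures sees valid feature names
sample = pd.DataFrame([[active_calories, workout_count, workout_duration, steps, exercise_time, stand_time, month]],
                      columns=['Active Calories Burned', 'Workout Count', 'Workout Duration', 'Steps',
                               'Exercise Time', 'Stand Time', 'Month'])
print(f'Predicted calories burned: {ridge.predict(scaler.transform(poly.transform(sample)))[0]:.2f} Cal')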
Part 2: Answering Questions about my Activity
How likely am I to hit my Move Goal (active calories) if I do not work out?
#Note: my Move Goal has evolved from 400 to 650 to 700 active calories for a day
no_workout = dayswithworkout[dayswithworkout['Workout Count'] == 0]
#days with no workout sessions
print(len(no_workout))
no_workout
#Days where I surpassed MOVE goal anyways
moved = no_workout[no_workout['Active Calories Burned'] >= no_workout['Active Calories Goal']]
print(len(moved))
#Likelihood of exceeding MOVE goal on days with no workout sessions
moved_ratio = len(moved) / len(no_workout) * 100
print(f'Likelihood of exceeding MOVE goal on days with no workout sessions: {moved_ratio:.2f}%')
174
35
Likelihood of exceeding MOVE goal on days with no workout sessions: 20.11%
What was my maximum Step count in a day, and how have my Steps changed over time?
# Maximum steps in a day
max_steps = health['Steps'].max()
# What day was it?
max_steps_day = health.loc[health['Steps'] == max_steps, 'Date'].values[0]
# Convert to a more readable date format if necessary
max_steps_day = pd.to_datetime(max_steps_day).strftime('%Y-%m-%d')
# Print the results
print(f'Maximum steps in a day: {max_steps} on {max_steps_day}')
Maximum steps in a day: 26503 on 2025-03-21
plt.figure(figsize=(12, 6))
plt.plot(health['Date'], health['Steps'], label='Steps')
plt.title('Number of Steps Over Time')
plt.xlabel('Date')
plt.ylabel('Steps')
plt.legend()
plt.show()
How do my steps change when I am traveling?
event_data=health.copy()
# Convert 'Date' column to datetime if not already done
event_data['Date'] = pd.to_datetime(event_data['Date'])
# Define events with proper datetime conversion
events = [
('Emory Move-In Day', pd.to_datetime('2023-08-19'), pd.to_datetime('2023-08-25'), 'skyblue'),
('Spring Break (Austin)', pd.to_datetime('2024-03-08'), pd.to_datetime('2024-03-15'), 'coral'),
('Summer (Chicago)', pd.to_datetime('2024-06-01'), pd.to_datetime('2024-08-05'), 'goldenrod'),
('Fall Break (SF)', pd.to_datetime('2024-10-12'), pd.to_datetime('2024-10-15'), 'mediumseagreen'),
('Spring Break (Disney & Miami)', pd.to_datetime('2025-03-08'), pd.to_datetime('2025-03-14'), 'slateblue'),
]
# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(event_data['Date'], event_data['Steps'], label='Steps', color='black')
plt.xticks(rotation=45)
# Add event spans
for event, start_date, end_date, color in events:
plt.axvspan(start_date, end_date, color=color, alpha=0.3, label=event)
plt.xlabel('Date')
plt.ylabel('Steps')
plt.title('Travel-Based Events')
# Avoid duplicate labels in the legend
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys(), loc='upper left')
# Show plot with tight layout
plt.grid(False)
plt.tight_layout()
plt.show()