---
name: mooc-analytics-guide
description: "Analyzing MOOC data, learning analytics, and online education metrics"
metadata:
  openclaw:
    emoji: "📈"
    category: "domains"
    subcategory: "education"
    keywords: ["mooc", "learning-analytics", "online-education", "edx", "coursera", "clickstream"]
    source: "wentor"
---

# MOOC Analytics Guide

A skill for analyzing Massive Open Online Course data, implementing learning analytics pipelines, and extracting actionable insights from online education platforms. Covers clickstream processing, engagement modeling, dropout prediction, and A/B testing for course design.

## Data Sources and Formats

### Common MOOC Data Schemas

MOOC platforms export several standard data types:

| Data Type | Description | Typical Format |
|-----------|-------------|----------------|
| Clickstream logs | Page views, video plays, pauses, seeks | JSON event logs |
| Forum posts | Discussion text, timestamps, thread structure | CSV/JSON |
| Grade records | Assignment scores, quiz attempts, certificates | CSV |
| Course structure | Module hierarchy, release dates, prerequisites | XML/JSON |
| Survey responses | Pre/post course surveys, demographics | CSV |

### Accessing Open MOOC Datasets

Several open datasets are available for research:

- **MOOCdb**: Standardized schema from MIT, includes clickstream, forum, and grade data
- **Stanford MOOCPosts**: Roughly 30,000 labeled forum posts for sentiment and urgency classification
- **Open University Learning Analytics (OULAD)**: Anonymized data for 30,000+ students across 7 courses
- **edX Research Data Exchange**: Available to institutional partners via application

```python
import pandas as pd

# Load OULAD dataset (publicly available); the CSV files are assumed
# to be in the working directory
students = pd.read_csv("studentInfo.csv")
assessments = pd.read_csv("assessments.csv")
interactions = pd.read_csv("studentVle.csv")

# Basic engagement metric: total clicks per student per course.
# "date" is a day offset relative to the start of the presentation,
# so nunique() counts the distinct days with any activity.
engagement = (
    interactions
    .groupby(["id_student", "code_module", "code_presentation"])
    .agg(total_clicks=("sum_click", "sum"), active_days=("date", "nunique"))
    .reset_index()
)
print(engagement.describe())
```

## Engagement and Retention Analysis

### Defining Engagement Metrics

Key metrics used in learning analytics research:

- **Session count**: Number of distinct learning sessions (gap-based, e.g., 30-min inactivity threshold)
- **Time on task**: Total seconds spent on content pages and videos
- **Video completion ratio**: Fraction of video duration actually watched
- **Forum participation rate**: Posts + replies per student per week
- **Assignment submission rate**: Fraction of graded assignments submitted on time
- **Regularity index**: Entropy of daily activity distribution (lower entropy = more regular)

```python
import numpy as np

def regularity_index(daily_counts: np.ndarray) -> float:
    """
    Compute regularity index based on normalized Shannon entropy.
    Lower values indicate more regular study patterns.
    daily_counts: array of click counts per day over the course.
    """
    daily_counts = np.asarray(daily_counts, dtype=float)
    total = daily_counts.sum()
    # Undefined with no activity; with a single day the normalizer log2(1) is 0
    if total == 0 or len(daily_counts) < 2:
        return float("nan")
    probs = daily_counts / total
    probs = probs[probs > 0]  # zero-activity days contribute nothing to entropy
    entropy = -np.sum(probs * np.log2(probs))
    max_entropy = np.log2(len(daily_counts))
    return round(float(entropy / max_entropy), 4)  # normalized [0, 1]
```

### Dropout Prediction

Predicting which learners will drop out is a central MOOC analytics task:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score

# Feature engineering: weekly aggregates
features = [
    "clicks_week", "video_time_week", "forum_posts_week",
    "assignments_submitted", "avg_score", "days_since_last_login",
    "regularity_index", "week_number"
]

# weekly_features is assumed to be a DataFrame with one row per learner-week.
# TimeSeriesSplit splits by row order, so sort chronologically first.
weekly_features = weekly_features.sort_values("week_number").reset_index(drop=True)
X = weekly_features[features]
y = weekly_features["dropped_next_week"]

# Time-aware cross-validation: each fold trains on earlier rows and
# tests on later ones (no future leakage)
tscv = TimeSeriesSplit(n_splits=5)
aucs = []
for train_idx, test_idx in tscv.split(X):
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4,
        learning_rate=0.1
    )
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict_proba(X.iloc[test_idx])[:, 1]
    aucs.append(roc_auc_score(y.iloc[test_idx], pred))

print(f"Mean AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

## Video Analytics

### Clickstream Processing for Video Events

Video interaction is the primary learning activity in MOOCs. Analyzing play, pause, seek, and speed-change events reveals learning patterns:

```python
import pandas as pd

def compute_video_metrics(events: pd.DataFrame) -> dict:
    """
    Process video clickstream events into engagement metrics.
    events: DataFrame for one learner and one video, with columns
        [user_id, video_id, event_type, timestamp,
         position_seconds, video_duration]
    """
    plays = events[events.event_type == "play"]
    pauses = events[events.event_type == "pause"]
    seeks = events[events.event_type == "seek"]

    total_duration = events.video_duration.iloc[0]
    watched_positions = set()
    for start in plays.position_seconds.astype(int):
        # Heuristic: credit a 10-second watch window per play event
        watched_positions.update(range(start, min(start + 10, int(total_duration))))

    return {
        "play_count": len(plays),
        "pause_count": len(pauses),
        "seek_count": len(seeks),
        "coverage_ratio": len(watched_positions) / max(total_duration, 1),
        # True when there is more than one play event (resumes after a pause count too)
        "replay_indicator": len(plays) > 1,
    }
```

### Optimal Video Length

Research findings on video engagement (Guo et al., 2014):

- Videos under 6 minutes have the highest engagement
- Informal talking-head videos outperform studio productions
- Tablet drawing (Khan Academy style) is more engaging than slides
- Pre-production planning matters more than production quality

## A/B Testing for Course Design

### Experimental Design in MOOCs

MOOCs provide large sample sizes ideal for randomized experiments:

1. **Unit of randomization**: Typically the learner, but can be section or cohort
2. **Outcome metrics**: Completion rate, quiz scores, time to completion, forum engagement
3. **Duration**: Run for at least one full module cycle (typically 1-2 weeks)
4. **Power analysis**: With 10,000+ enrollees split evenly across two arms, effects as small as about d=0.06 are detectable with 80% power at alpha=0.05

```python
from scipy.stats import norm

def mooc_power_analysis(effect_size: float, n_per_group: int,
                        alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample t-test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    # With equal groups, the standardized difference has standard error sqrt(2 / n)
    z_beta = effect_size * (n_per_group / 2) ** 0.5 - z_alpha
    power = norm.cdf(z_beta)
    return round(float(power), 4)

# Example: 5000 per group, small effect
print(mooc_power_analysis(0.1, 5000))  # ~0.999
```

## Tools and Platforms

- **edX Insights**: Built-in analytics dashboard for edX course teams
- **Google BigQuery** + **Coursera Research Exports**: SQL-based analysis at scale
- **Open edX**: Self-hosted platform with full database access (MySQL + MongoDB)
- **Learning Locker**: Open-source Learning Record Store (xAPI compliant)
- **MORF (MOOC Replication Framework)**: Docker-based reproducible analytics pipeline from University of Michigan

## Key References

- Guo, P.J., Kim, J., and Rubin, R. (2014). How video production affects student engagement: An empirical study of MOOC videos. *ACM L@S*.
- Gardner, J. and Brooks, C. (2018). Student success prediction in MOOCs. *User Modeling and User-Adapted Interaction*.
- Reich, J. and Ruipérez-Valiente, J.A. (2019). The MOOC pivot. *Science*.
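
## Example: Gap-Based Session Counting

The session count metric listed under Defining Engagement Metrics relies on an inactivity threshold to split a learner's events into sessions. The sketch below shows one way this could be computed with pandas. The function name `count_sessions` and the column names `user_id` and `timestamp` are assumptions for illustration, not part of any platform's export schema, and the 30-minute default simply mirrors the example threshold given earlier.

```python
import pandas as pd

def count_sessions(events: pd.DataFrame, gap_minutes: int = 30) -> pd.Series:
    """
    Count gap-based sessions per learner.
    events: DataFrame with (hypothetical) columns [user_id, timestamp],
        where timestamp is a datetime column.
    """
    ordered = events.sort_values(["user_id", "timestamp"])
    # Time elapsed since the same learner's previous event (NaT for the first event)
    gaps = ordered.groupby("user_id")["timestamp"].diff()
    # A session starts at a learner's first event or after a gap above the threshold
    new_session = gaps.isna() | (gaps > pd.Timedelta(minutes=gap_minutes))
    return new_session.groupby(ordered["user_id"]).sum().rename("session_count")

# Toy data: learner "a" has a 50-minute gap before the third event
demo = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:10",
        "2024-01-01 11:00", "2024-01-01 09:00",
    ]),
})
sessions = count_sessions(demo)
print(sessions)
```

On the toy data, learner `a` gets two sessions and learner `b` gets one. The threshold is a modeling choice: a shorter gap splits activity into more sessions, so it is worth checking how sensitive downstream metrics are to it.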