{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Education: Delta Liquid Clustering Demo\n", "\n", "\n", "## Overview\n", "\n", "\n", "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an education analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", "\n", "### What is Liquid Clustering?\n", "\n", "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", "\n", "- **Automatic optimization**: No manual tuning required\n", "- **Improved query performance**: Faster queries on clustered columns\n", "- **Reduced maintenance**: No need for manual repartitioning\n", "- **Adaptive clustering**: Adjusts as data patterns change\n", "\n", "### Use Case: Student Performance Analytics and Learning Management\n", "\n", "We'll analyze student learning data and academic performance metrics. Our clustering strategy will optimize for:\n", "\n", "- **Student-specific queries**: Fast lookups by student ID\n", "- **Time-based analysis**: Efficient filtering by academic period and assessment dates\n", "- **Performance patterns**: Quick aggregation by subject and learning outcomes\n", "\n", "### AIDP Environment Setup\n", "\n", "This notebook leverages the existing Spark session in your AIDP environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create education catalog and analytics schema\n", "\n", "# In AIDP, catalogs provide data isolation and governance\n", "\n", "spark.sql(\"CREATE CATALOG IF NOT EXISTS education\")\n", "\n", "spark.sql(\"CREATE SCHEMA IF NOT EXISTS education.analytics\")\n", "\n", "print(\"Education catalog and analytics schema created successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create Delta Table with Liquid Clustering\n", "\n", "### Table Design\n", "\n", "Our `student_assessments` table will store:\n", "\n", "- **student_id**: Unique student identifier\n", "- **assessment_date**: Date of assessment or assignment\n", "- **subject**: Academic subject area\n", "- **score**: Assessment score (0-100)\n", "- **grade_level**: Student grade level\n", "- **completion_time**: Time spent on assessment (minutes)\n", "- **engagement_score**: Student engagement metric (0-100)\n", "\n", "### Clustering Strategy\n", "\n", "We'll cluster by `student_id` and `assessment_date` because:\n", "\n", "- **student_id**: Students generate multiple assessments, grouping learning progress together\n", "- **assessment_date**: Time-based queries are critical for academic tracking, semester analysis, and intervention planning\n", "- This combination optimizes for both individual student monitoring and temporal academic performance analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Delta table with liquid clustering created successfully!\n", "Clustering will automatically optimize data layout for queries on student_id and assessment_date.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create Delta table with liquid clustering\n", "\n", "# CLUSTER BY defines the columns for automatic optimization\n", "\n", "spark.sql(\"\"\"\n", "\n", "CREATE TABLE IF NOT EXISTS education.analytics.student_assessments (\n", "\n", " student_id STRING,\n", "\n", " assessment_date DATE,\n", "\n", " subject STRING,\n", "\n", " score DECIMAL(5,2),\n", "\n", " grade_level STRING,\n", "\n", " completion_time DECIMAL(6,2),\n", "\n", " engagement_score INT\n", "\n", ")\n", "\n", "USING DELTA\n", "\n", "CLUSTER BY (student_id, assessment_date)\n", "\n", "\"\"\")\n", "\n", "print(\"Delta table with liquid clustering created successfully!\")\n", "\n", "print(\"Clustering will automatically optimize data layout for queries on student_id and assessment_date.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Generate Education Sample Data\n", "\n", "### Data Generation Strategy\n", "\n", "We'll create realistic student assessment data including:\n", "\n", "- **3,000 students** with multiple assessments over time\n", "- **Subjects**: Math, English, Science, History, Art, Physical Education\n", "- **Realistic performance patterns**: Learning curves, subject difficulty variations, engagement factors\n", "- **Grade levels**: K-12 with appropriate academic progression\n", "\n", "### Why This Data Pattern?\n", "\n", "This data simulates real education scenarios where:\n", "\n", "- Student performance varies by subject and time\n", "- Learning progress needs longitudinal tracking\n", "- Intervention strategies require early identification\n", "- Curriculum effectiveness drives teaching improvements\n", "- Standardized testing and reporting require temporal analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Generated 67233 student assessment records\n", "Sample record: {'student_id': 'STU000001', 'assessment_date': datetime.date(2024, 9, 25), 'subject': 'History', 'score': 47.58, 'grade_level': '5th Grade', 'completion_time': 71.9, 'engagement_score': 51}\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate sample student assessment data\n", "\n", "# Using fully qualified imports to avoid conflicts\n", "\n", "import random\n", "\n", "from datetime import datetime, timedelta\n", "\n", "\n", "# Define education data constants\n", "\n", "SUBJECTS = ['Math', 'English', 'Science', 'History', 'Art', 'Physical Education']\n", "\n", "GRADE_LEVELS = ['Kindergarten', '1st Grade', '2nd Grade', '3rd Grade', '4th Grade', '5th Grade', \n", " '6th Grade', '7th Grade', '8th Grade', '9th Grade', '10th Grade', '11th Grade', '12th Grade']\n", "\n", "# Base performance parameters by subject and grade level\n", "\n", "PERFORMANCE_PARAMS = {\n", "\n", " 'Math': {'base_score': 75, 'difficulty': 1.2, 'time_factor': 1.5},\n", "\n", " 'English': {'base_score': 78, 'difficulty': 1.0, 'time_factor': 1.2},\n", "\n", " 'Science': {'base_score': 72, 'difficulty': 1.3, 'time_factor': 1.4},\n", "\n", " 'History': {'base_score': 70, 'difficulty': 1.1, 'time_factor': 1.1},\n", "\n", " 'Art': {'base_score': 82, 'difficulty': 0.8, 'time_factor': 0.9},\n", "\n", " 'Physical Education': {'base_score': 85, 'difficulty': 0.7, 'time_factor': 0.8}\n", "\n", "}\n", "\n", "# Grade level adjustments\n", "\n", "GRADE_ADJUSTMENTS = {\n", "\n", " 'Kindergarten': 0.7, '1st Grade': 0.75, '2nd Grade': 0.8, '3rd Grade': 0.82,\n", "\n", " '4th Grade': 0.85, '5th Grade': 0.87, '6th Grade': 0.8, '7th Grade': 0.78,\n", "\n", " '8th Grade': 0.76, '9th Grade': 0.74, '10th Grade': 0.72, '11th Grade': 0.7, '12th Grade': 0.68\n", "\n", "}\n", "\n", "\n", "# Generate student assessment records\n", "\n", "assessment_data = []\n", "\n", "base_date = datetime(2024, 1, 1)\n", "\n", "\n", "# Create 3,000 students with 15-30 assessments each\n", "\n", "for student_num in range(1, 3001):\n", "\n", " student_id = f\"STU{student_num:06d}\"\n", " \n", " # Assign grade level\n", "\n", " grade_level = random.choice(GRADE_LEVELS)\n", "\n", " grade_factor = GRADE_ADJUSTMENTS[grade_level]\n", " \n", " # Each student gets 15-30 assessments over 12 months\n", "\n", " num_assessments = random.randint(15, 30)\n", " \n", " for i in range(num_assessments):\n", "\n", " # Spread assessments over 12 months\n", "\n", " days_offset = random.randint(0, 365)\n", "\n", " assessment_date = base_date + timedelta(days=days_offset)\n", " \n", " # Select subject\n", "\n", " subject = random.choice(SUBJECTS)\n", "\n", " params = PERFORMANCE_PARAMS[subject]\n", " \n", " # Calculate score with variations\n", "\n", " score_variation = random.uniform(0.7, 1.3)\n", "\n", " base_score = params['base_score'] * grade_factor / params['difficulty']\n", "\n", " score = round(min(100, max(0, base_score * score_variation)), 2)\n", " \n", " # Calculate completion time\n", "\n", " time_variation = random.uniform(0.8, 1.5)\n", "\n", " base_time = 45 * params['time_factor'] # 45 minutes base time\n", "\n", " completion_time = round(base_time * time_variation, 2)\n", " \n", " # Engagement score (affects performance)\n", "\n", " engagement_score = random.randint(40, 100)\n", "\n", " # Slightly adjust score based on engagement\n", "\n", " engagement_factor = engagement_score / 100.0\n", "\n", " score = round(min(100, score * (0.8 + 0.4 * engagement_factor)), 2)\n", " \n", " assessment_data.append({\n", "\n", " \"student_id\": student_id,\n", "\n", " \"assessment_date\": assessment_date.date(),\n", "\n", " \"subject\": subject,\n", "\n", " \"score\": float(score),\n", "\n", " \"grade_level\": grade_level,\n", "\n", " \"completion_time\": float(completion_time),\n", "\n", " \"engagement_score\": int(engagement_score)\n", "\n", " })\n", "\n", "\n", "\n", "print(f\"Generated {len(assessment_data)} student assessment records\")\n", "\n", "print(\"Sample record:\", assessment_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Insert Data Using PySpark\n", "\n", "### Data Insertion Strategy\n", "\n", "We'll use PySpark to:\n", "\n", "1. **Create DataFrame** from our generated data\n", "2. **Insert into Delta table** with liquid clustering\n", "3. **Verify the insertion** with a sample query\n", "\n", "### Why PySpark for Insertion?\n", "\n", "- **Distributed processing**: Handles large datasets efficiently\n", "- **Type safety**: Ensures data integrity\n", "- **Optimization**: Leverages Spark's query optimization\n", "- **Liquid clustering**: Automatically applies clustering during insertion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DataFrame Schema:\n", "root\n", " |-- assessment_date: date (nullable = true)\n", " |-- completion_time: double (nullable = true)\n", " |-- engagement_score: long (nullable = true)\n", " |-- grade_level: string (nullable = true)\n", " |-- score: double (nullable = true)\n", " |-- student_id: string (nullable = true)\n", " |-- subject: string (nullable = true)\n", "\n", "\n", "Sample Data:\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------------+---------------+----------------+-----------+-----+----------+-------+\n", "|assessment_date|completion_time|engagement_score|grade_level|score|student_id|subject|\n", "+---------------+---------------+----------------+-----------+-----+----------+-------+\n", "| 2024-09-25| 71.9| 51| 5th Grade|47.58| STU000001|History|\n", "| 2024-11-26| 59.47| 52| 5th Grade|70.08| STU000001|History|\n", "| 2024-12-17| 62.23| 89| 5th Grade|76.43| STU000001|English|\n", "| 2024-04-23| 75.37| 62| 5th Grade|40.01| STU000001| Math|\n", "| 2024-06-29| 73.66| 79| 5th Grade|38.55| STU000001|Science|\n", "+---------------+---------------+----------------+-----------+-----+----------+-------+\n", "only showing top 5 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Successfully inserted 67233 records into education.analytics.student_assessments\n", "Liquid clustering automatically optimized the data layout during insertion!\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Insert data using PySpark DataFrame operations\n", "\n", "# Using fully qualified function references to avoid conflicts\n", "\n", "\n", "# Create DataFrame from generated data\n", "\n", "df_assessments = spark.createDataFrame(assessment_data)\n", "\n", "\n", "# Display schema and sample data\n", "\n", "print(\"DataFrame Schema:\")\n", "\n", "df_assessments.printSchema()\n", "\n", "\n", "\n", "print(\"\\nSample Data:\")\n", "\n", "df_assessments.show(5)\n", "\n", "\n", "# Insert data into Delta table with liquid clustering\n", "\n", "# The CLUSTER BY (student_id, assessment_date) will automatically optimize the data layout\n", "\n", "df_assessments.write.mode(\"overwrite\").saveAsTable(\"education.analytics.student_assessments\")\n", "\n", "\n", "print(f\"\\nSuccessfully inserted {df_assessments.count()} records into education.analytics.student_assessments\")\n", "\n", "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Demonstrate Liquid Clustering Benefits\n", "\n", "### Query Performance Analysis\n", "\n", "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", "\n", "1. **Student assessment history** (clustered by student_id)\n", "2. **Time-based academic analysis** (clustered by assessment_date)\n", "3. **Combined student + time queries** (optimal for our clustering)\n", "\n", "### Expected Performance Benefits\n", "\n", "With liquid clustering, these queries should be significantly faster because:\n", "\n", "- **Data locality**: Related records are physically grouped together\n", "- **Reduced I/O**: Less data needs to be read from disk\n", "- **Automatic optimization**: No manual tuning required" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Query 1: Student Assessment History ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+---------------+------------------+-----+----------------+\n", "|student_id|assessment_date| subject|score|engagement_score|\n", "+----------+---------------+------------------+-----+----------------+\n", "| STU000001| 2024-12-31| History|66.76| 65|\n", "| STU000001| 2024-12-17| English|76.43| 89|\n", "| STU000001| 2024-12-04| History|45.62| 41|\n", "| STU000001| 2024-11-26| History|70.08| 52|\n", "| STU000001| 2024-11-17| Science| 68.0| 92|\n", "| STU000001| 2024-11-17| English| 56.2| 70|\n", "| STU000001| 2024-09-27| History| 40.9| 40|\n", "| STU000001| 2024-09-25| History|47.58| 51|\n", "| STU000001| 2024-09-23| Art| 96.8| 42|\n", "| STU000001| 2024-08-29| Art|100.0| 53|\n", "| STU000001| 2024-08-11| History|74.15| 100|\n", "| STU000001| 2024-07-26| English|51.46| 66|\n", "| STU000001| 2024-06-29| Science|38.55| 79|\n", "| STU000001| 2024-05-26|Physical Education|94.59| 83|\n", "| STU000001| 2024-04-23| Math|40.01| 62|\n", "| STU000001| 2024-04-23| Science|70.84| 100|\n", "| STU000001| 2024-04-20| Art|100.0| 50|\n", "| STU000001| 2024-03-09| Math|44.96| 95|\n", "| STU000001| 2024-02-07| English|76.07| 61|\n", "+----------+---------------+------------------+-----+----------------+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Records found: 19)\n", "\n", "=== Query 2: Recent Low Performance Issues ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------------+----------+-------+-----+------------+\n", "|assessment_date|student_id|subject|score| grade_level|\n", "+---------------+----------+-------+-----+------------+\n", "| 2024-10-27| STU002987|Science|26.09| 12th Grade|\n", "| 2024-12-04| STU002647|Science|26.43| 12th Grade|\n", "| 2024-06-10| STU001734|Science| 26.7| 11th Grade|\n", "| 2024-08-18| STU000133|Science|26.85|Kindergarten|\n", "| 2024-10-24| STU001404|Science|26.86|Kindergarten|\n", "| 2024-10-29| STU000258|Science|26.89|Kindergarten|\n", "| 2024-10-08| STU001735|Science| 27.0| 11th Grade|\n", "| 2024-08-09| STU001150|Science|27.03| 12th Grade|\n", "| 2024-12-27| STU002481|Science|27.07| 12th Grade|\n", "| 2024-09-19| STU001845|Science| 27.1| 11th Grade|\n", "| 2024-07-17| STU001826|Science|27.18| 12th Grade|\n", "| 2024-07-19| STU000634|Science|27.19| 12th Grade|\n", "| 2024-10-14| STU000273|Science| 27.2| 11th Grade|\n", "| 2024-11-24| STU001154|Science|27.22| 12th Grade|\n", "| 2024-09-09| STU001574|Science|27.26| 12th Grade|\n", "| 2024-06-06| STU000084|Science|27.33|Kindergarten|\n", "| 2024-10-31| STU000635|Science|27.38| 12th Grade|\n", "| 2024-06-28| STU001171|Science|27.43| 12th Grade|\n", "| 2024-11-18| STU000371|Science|27.58|Kindergarten|\n", "| 2024-10-19| STU000128|Science|27.58| 12th Grade|\n", "+---------------+----------+-------+-----+------------+\n", "only showing top 20 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Low performance issues found: 18954)\n", "\n", "=== Query 3: Student Performance Trends ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+---------------+------------------+-----+----------------+\n", "|student_id|assessment_date| subject|score|engagement_score|\n", "+----------+---------------+------------------+-----+----------------+\n", "| STU000001| 2024-04-20| Art|100.0| 50|\n", "| STU000001| 2024-04-23| Math|40.01| 62|\n", "| STU000001| 2024-04-23| Science|70.84| 100|\n", "| STU000001| 2024-05-26|Physical Education|94.59| 83|\n", "| STU000001| 2024-06-29| Science|38.55| 79|\n", "| STU000001| 2024-07-26| English|51.46| 66|\n", "| STU000001| 2024-08-11| History|74.15| 100|\n", "| STU000001| 2024-08-29| Art|100.0| 53|\n", "| STU000001| 2024-09-23| Art| 96.8| 42|\n", "| STU000001| 2024-09-25| History|47.58| 51|\n", "| STU000001| 2024-09-27| History| 40.9| 40|\n", "| STU000001| 2024-11-17| Science| 68.0| 92|\n", "| STU000001| 2024-11-17| English| 56.2| 70|\n", "| STU000001| 2024-11-26| History|70.08| 52|\n", "| STU000001| 2024-12-04| History|45.62| 41|\n", "| STU000001| 2024-12-17| English|76.43| 89|\n", "| STU000001| 2024-12-31| History|66.76| 65|\n", "| STU000002| 2024-04-03|Physical Education|89.18| 45|\n", "| STU000002| 2024-04-10|Physical Education|96.96| 76|\n", "| STU000002| 2024-04-26| History|73.81| 86|\n", "+----------+---------------+------------------+-----+----------------+\n", "only showing top 20 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Performance trend records found: 16972)\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Demonstrate liquid clustering benefits with optimized queries\n", "\n", "\n", "# Query 1: Student assessment history - benefits from student_id clustering\n", "\n", "print(\"=== Query 1: Student Assessment History ===\")\n", "\n", "student_history = spark.sql(\"\"\"\n", "\n", "SELECT student_id, assessment_date, subject, score, engagement_score\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "WHERE student_id = 'STU000001'\n", "\n", "ORDER BY assessment_date DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "student_history.show()\n", "\n", "print(f\"Records found: {student_history.count()})\")\n", "\n", "\n", "\n", "# Query 2: Time-based academic performance analysis - benefits from assessment_date clustering\n", "\n", "print(\"\\n=== Query 2: Recent Low Performance Issues ===\")\n", "\n", "low_performance = spark.sql(\"\"\"\n", "\n", "SELECT assessment_date, student_id, subject, score, grade_level\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "WHERE assessment_date >= '2024-06-01' AND score < 60\n", "\n", "ORDER BY score ASC, assessment_date DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "low_performance.show()\n", "\n", "print(f\"Low performance issues found: {low_performance.count()})\")\n", "\n", "\n", "\n", "# Query 3: Combined student + time query - optimal for our clustering strategy\n", "\n", "print(\"\\n=== Query 3: Student Performance Trends ===\")\n", "\n", "performance_trends = spark.sql(\"\"\"\n", "\n", "SELECT student_id, assessment_date, subject, score, engagement_score\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "WHERE student_id LIKE 'STU000%' AND assessment_date >= '2024-04-01'\n", "\n", "ORDER BY student_id, assessment_date\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "performance_trends.show()\n", "\n", "print(f\"Performance trend records found: {performance_trends.count()})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Analyze Clustering Effectiveness\n", "\n", "### Understanding the Impact\n", "\n", "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the education insights possible with this optimized structure.\n", "\n", "### Key Analytics\n", "\n", "- **Student performance patterns** and learning analytics\n", "- **Subject difficulty analysis** and curriculum effectiveness\n", "- **Grade level progression** and academic growth\n", "- **Engagement correlations** and intervention opportunities" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Student Performance Analysis ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+-----------------+---------+--------------+-------------------+-----------+\n", "|student_id|total_assessments|avg_score|avg_engagement|avg_completion_time|grade_level|\n", "+----------+-----------------+---------+--------------+-------------------+-----------+\n", "| STU000455| 19| 84.44| 68.79| 50.06| 4th Grade|\n", "| STU001973| 16| 83.68| 75.63| 58.71| 3rd Grade|\n", "| STU002604| 15| 82.76| 80.4| 53.96| 5th Grade|\n", "| STU001557| 18| 81.47| 76.22| 55.65| 5th Grade|\n", "| STU000551| 28| 81.44| 69.5| 51.03| 6th Grade|\n", "| STU002979| 20| 81.09| 76.25| 53.68| 5th Grade|\n", "| STU002271| 16| 80.99| 72.31| 51.99| 4th Grade|\n", "| STU002472| 23| 80.97| 75.39| 54.55| 5th Grade|\n", "| STU000742| 23| 80.71| 73.04| 52.63| 4th Grade|\n", "| STU002004| 20| 80.48| 75.4| 56.6| 1st Grade|\n", "+----------+-----------------+---------+--------------+-------------------+-----------+\n", "\n", "\n", "=== Subject Performance Analysis ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", "| subject|total_assessments|avg_score|avg_completion_time|avg_engagement|unique_students|\n", "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", "|Physical Education| 11190| 91.76| 41.46| 70.01| 2924|\n", "| Art| 11258| 82.84| 46.64| 69.98| 2940|\n", "| English| 11166| 64.32| 61.91| 70.0| 2929|\n", "| History| 11280| 52.59| 56.93| 69.84| 2924|\n", "| Math| 11117| 51.89| 77.41| 69.88| 2922|\n", "| Science| 11222| 45.91| 72.15| 69.75| 2919|\n", "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", "\n", "\n", "=== Grade Level Performance ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+------------+-----------------+---------+--------------+---------------+\n", "| grade_level|total_assessments|avg_score|avg_engagement|unique_students|\n", "+------------+-----------------+---------+--------------+---------------+\n", "|Kindergarten| 5674| 60.3| 70.51| 257|\n", "| 1st Grade| 5044| 64.17| 69.92| 223|\n", "| 2nd Grade| 5068| 66.62| 69.51| 231|\n", "| 3rd Grade| 4670| 68.79| 70.02| 206|\n", "| 4th Grade| 5406| 70.43| 70.21| 244|\n", "| 5th Grade| 5620| 71.76| 70.18| 249|\n", "| 6th Grade| 5108| 67.03| 69.96| 226|\n", "| 7th Grade| 4922| 65.79| 69.51| 227|\n", "| 8th Grade| 4514| 64.82| 70.0| 202|\n", "| 9th Grade| 5842| 63.55| 69.91| 255|\n", "| 10th Grade| 4978| 61.56| 69.53| 221|\n", "| 11th Grade| 5463| 60.14| 69.66| 239|\n", "| 12th Grade| 4924| 58.7| 69.82| 220|\n", "+------------+-----------------+---------+--------------+---------------+\n", "\n", "\n", "=== Engagement vs Performance Correlation ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+-----------------+----------------+---------+-------------------+\n", "| engagement_level|assessment_count|avg_score|avg_completion_time|\n", "+-----------------+----------------+---------+-------------------+\n", "| High Engagement| 23075| 69.0| 59.36|\n", "|Medium Engagement| 22090| 64.8| 59.37|\n", "| Low Engagement| 22068| 60.69| 59.44|\n", "+-----------------+----------------+---------+-------------------+\n", "\n", "\n", "=== Monthly Academic Trends ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+-------+-----------------+---------+--------------+---------------+\n", "| month|total_assessments|avg_score|avg_engagement|active_students|\n", "+-------+-----------------+---------+--------------+---------------+\n", "|2024-01| 5726| 65.27| 69.71| 2540|\n", "|2024-02| 5228| 64.93| 70.24| 2456|\n", "|2024-03| 5730| 65.3| 70.19| 2553|\n", "|2024-04| 5752| 64.78| 70.3| 2559|\n", "|2024-05| 5668| 65.37| 69.85| 2567|\n", "|2024-06| 5483| 64.27| 69.91| 2516|\n", "|2024-07| 5685| 64.65| 69.55| 2571|\n", "|2024-08| 5607| 64.65| 69.68| 2568|\n", "|2024-09| 5497| 64.68| 69.86| 2502|\n", "|2024-10| 5731| 65.0| 69.49| 2558|\n", "|2024-11| 5458| 64.92| 69.93| 2515|\n", "|2024-12| 5668| 64.84| 70.24| 2531|\n", "+-------+-----------------+---------+--------------+---------------+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Analyze clustering effectiveness and education insights\n", "\n", "\n", "# Student performance analysis\n", "\n", "print(\"=== Student Performance Analysis ===\")\n", "\n", "student_performance = spark.sql(\"\"\"\n", "\n", "SELECT student_id, COUNT(*) as total_assessments,\n", "\n", " ROUND(AVG(score), 2) as avg_score,\n", "\n", " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", "\n", " ROUND(AVG(completion_time), 2) as avg_completion_time,\n", "\n", " grade_level\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "GROUP BY student_id, grade_level\n", "\n", "ORDER BY avg_score DESC\n", "\n", "LIMIT 10\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "student_performance.show()\n", "\n", "\n", "# Subject performance analysis\n", "\n", "print(\"\\n=== Subject Performance Analysis ===\")\n", "\n", "subject_analysis = spark.sql(\"\"\"\n", "\n", "SELECT subject, COUNT(*) as total_assessments,\n", "\n", " ROUND(AVG(score), 2) as avg_score,\n", "\n", " ROUND(AVG(completion_time), 2) as avg_completion_time,\n", "\n", " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", "\n", " COUNT(DISTINCT student_id) as unique_students\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "GROUP BY subject\n", "\n", "ORDER BY avg_score DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "subject_analysis.show()\n", "\n", "\n", "# Grade level performance\n", "\n", "print(\"\\n=== Grade Level Performance ===\")\n", "\n", "grade_performance = spark.sql(\"\"\"\n", "\n", "SELECT \n", "\n", " grade_level, \n", "\n", " COUNT(*) AS total_assessments,\n", "\n", " ROUND(AVG(score), 2) AS avg_score,\n", "\n", " ROUND(AVG(engagement_score), 2) AS avg_engagement,\n", "\n", " COUNT(DISTINCT student_id) AS unique_students\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "GROUP BY grade_level\n", "\n", "ORDER BY \n", "\n", " CASE \n", "\n", " WHEN grade_level = 'Kindergarten' THEN 0\n", "\n", " ELSE CAST(REGEXP_REPLACE(grade_level, '[^0-9]', '') AS INT)\n", "\n", " END\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "grade_performance.show()\n", "\n", "\n", "# Engagement vs performance correlation\n", "\n", "print(\"\\n=== Engagement vs Performance Correlation ===\")\n", "\n", "engagement_correlation = spark.sql(\"\"\"\n", "\n", "SELECT \n", "\n", " CASE \n", "\n", " WHEN engagement_score >= 80 THEN 'High Engagement'\n", "\n", " WHEN engagement_score >= 60 THEN 'Medium Engagement'\n", "\n", " WHEN engagement_score >= 40 THEN 'Low Engagement'\n", "\n", " ELSE 'Very Low Engagement'\n", "\n", " END as engagement_level,\n", "\n", " COUNT(*) as assessment_count,\n", "\n", " ROUND(AVG(score), 2) as avg_score,\n", "\n", " ROUND(AVG(completion_time), 2) as avg_completion_time\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "GROUP BY \n", "\n", " CASE \n", "\n", " WHEN engagement_score >= 80 THEN 'High Engagement'\n", "\n", " WHEN engagement_score >= 60 THEN 'Medium Engagement'\n", "\n", " WHEN engagement_score >= 40 THEN 'Low Engagement'\n", "\n", " ELSE 'Very Low Engagement'\n", "\n", " END\n", "\n", "ORDER BY avg_score DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "engagement_correlation.show()\n", "\n", "\n", "# Monthly academic trends\n", "\n", "print(\"\\n=== Monthly Academic Trends ===\")\n", "\n", "monthly_trends = spark.sql(\"\"\"\n", "\n", "SELECT DATE_FORMAT(assessment_date, 'yyyy-MM') as month,\n", "\n", " COUNT(*) as total_assessments,\n", "\n", " ROUND(AVG(score), 2) as avg_score,\n", "\n", " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", "\n", " COUNT(DISTINCT student_id) as active_students\n", "\n", "FROM education.analytics.student_assessments\n", "\n", "GROUP BY DATE_FORMAT(assessment_date, 'yyyy-MM')\n", "\n", "ORDER BY month\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "monthly_trends.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7: Train Education Student Performance Prediction Model\n", "\n", "### Machine Learning for Education Business Improvement\n", "\n", "Now we'll train a machine learning model to predict student performance and identify students who may need early intervention. This model can help education institutions:\n", "\n", "- **Predict academic performance** before final assessments\n", "- **Identify at-risk students** early for targeted interventions\n", "- **Personalize learning plans** based on predicted performance\n", "- **Optimize resource allocation** for academic support programs\n", "\n", "### Model Approach\n", "\n", "We'll use a **Random Forest Classifier** to predict whether a student will achieve high performance (score > 85) based on:\n", "\n", "- Historical performance patterns and subject strengths\n", "- Engagement levels and time spent on assessments\n", "- Subject-specific performance and grade-level progression\n", "- Temporal patterns and learning consistency\n", "\n", "### Business Impact\n", "\n", "- **Early Intervention**: Identify struggling students before it's too late\n", "- **Resource Optimization**: Focus support where it's most needed\n", "- **Personalized Education**: Tailor learning experiences to student needs\n", "- **Improved Outcomes**: Better academic performance and graduation rates" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Created student performance features for 3000 students\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+--------------+-----+\n", "|high_performer|count|\n", "+--------------+-----+\n", "| 1| 127|\n", "| 0| 2873|\n", "+--------------+-----+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Prepare data for machine learning - create student-level performance features\n", "\n", "from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler\n", "from pyspark.ml.classification import RandomForestClassifier\n", "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", "from pyspark.ml import Pipeline\n", "import pyspark.sql.functions as F\n", "\n", "# Create student-level features for performance prediction\n", "student_features = spark.sql(\"\"\"\n", "SELECT \n", " student_id,\n", " COUNT(*) as total_assessments,\n", " ROUND(AVG(score), 2) as avg_score,\n", " ROUND(STDDEV(score), 2) as score_variability,\n", " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", " ROUND(AVG(completion_time), 2) as avg_completion_time,\n", " ROUND(STDDEV(completion_time), 2) as completion_time_variability,\n", " COUNT(DISTINCT subject) as subjects_attempted,\n", " COUNT(DISTINCT DATE_FORMAT(assessment_date, 'yyyy-MM')) as active_months,\n", " -- Subject-specific performance\n", " ROUND(AVG(CASE WHEN subject = 'Math' THEN score END), 2) as math_avg_score,\n", " ROUND(AVG(CASE WHEN subject = 'English' THEN score END), 2) as english_avg_score,\n", " ROUND(AVG(CASE WHEN subject = 'Science' THEN score END), 2) as science_avg_score,\n", " -- Performance trend (recent vs earlier)\n", " ROUND(AVG(CASE WHEN assessment_date >= '2024-07-01' THEN score END), 2) as recent_avg_score,\n", " ROUND(AVG(CASE WHEN assessment_date < '2024-07-01' THEN score END), 2) as earlier_avg_score,\n", " grade_level,\n", " -- Target: High performance (>75 average)\n", " CASE WHEN AVG(score) > 75 THEN 1 ELSE 0 END as high_performer\n", "FROM education.analytics.student_assessments\n", "GROUP BY student_id, grade_level\n", "\"\"\")\n", "\n", "# Fill null values from conditional aggregations\n", "student_features = student_features.fillna(0, subset=['math_avg_score', 'english_avg_score', 'science_avg_score', 'recent_avg_score', 'earlier_avg_score'])\n", "\n", "print(f\"Created student performance features for {student_features.count()} students\")\n", "student_features.groupBy(\"high_performer\").count().show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Training set: 2451 students\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Test set: 549 students\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Feature engineering for performance prediction\n", "\n", "# Create indexers for categorical features\n", "grade_indexer = StringIndexer(inputCol=\"grade_level\", outputCol=\"grade_level_index\")\n", "\n", "# Assemble features for the model\n", "feature_cols = [\"total_assessments\", \"avg_score\", \"score_variability\", \"avg_engagement\", \n", " \"avg_completion_time\", \"completion_time_variability\", \"subjects_attempted\", \n", " \"active_months\", \"math_avg_score\", \"english_avg_score\", \"science_avg_score\", \n", " \"recent_avg_score\", \"earlier_avg_score\", \"grade_level_index\"]\n", "\n", "assembler = VectorAssembler(\n", " inputCols=feature_cols,\n", " outputCol=\"features\"\n", ")\n", "\n", "# Scale features\n", "scaler = StandardScaler(inputCol=\"features\", outputCol=\"scaled_features\")\n", "\n", "# Create and train the model\n", "rf = RandomForestClassifier(\n", " labelCol=\"high_performer\", \n", " featuresCol=\"scaled_features\",\n", " numTrees=100,\n", " maxDepth=10\n", ")\n", "\n", "# Create pipeline\n", "pipeline = Pipeline(stages=[grade_indexer, assembler, scaler, rf])\n", "\n", "# Split data\n", "train_data, test_data = student_features.randomSplit([0.8, 0.2], seed=42)\n", "\n", "print(f\"Training set: {train_data.count()} students\")\n", "print(f\"Test set: {test_data.count()} students\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Training student performance prediction model...\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Model AUC: 0.9977\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+---------+--------------+--------------+----------+-----------+\n", "|student_id|avg_score|avg_engagement|high_performer|prediction|probability|\n", "+----------+---------+--------------+--------------+----------+-----------+\n", "| STU000003| 57.63| 77.0| 0| 0.0| [1.0,0.0]|\n", "| STU000007| 58.32| 70.53| 0| 0.0|[0.99,0.01]|\n", "| STU000009| 68.72| 71.0| 0| 0.0| [1.0,0.0]|\n", "| STU000014| 58.66| 69.5| 0| 0.0| [1.0,0.0]|\n", "| STU000020| 69.24| 73.31| 0| 0.0|[0.98,0.02]|\n", "| STU000024| 69.8| 68.11| 0| 0.0|[0.99,0.01]|\n", "| STU000030| 59.54| 69.87| 0| 0.0| [1.0,0.0]|\n", "| STU000036| 62.59| 71.44| 0| 0.0| [1.0,0.0]|\n", "| STU000046| 56.48| 71.64| 0| 0.0| [1.0,0.0]|\n", "| STU000047| 60.15| 71.32| 0| 0.0| [1.0,0.0]|\n", "+----------+---------+--------------+--------------+----------+-----------+\n", "only showing top 10 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+--------------+----------+-----+\n", "|high_performer|prediction|count|\n", "+--------------+----------+-----+\n", "| 1| 0.0| 8|\n", "| 0| 0.0| 518|\n", "| 1| 1.0| 23|\n", "+--------------+----------+-----+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Train the student performance prediction model\n", "\n", "print(\"Training student performance prediction model...\")\n", "model = pipeline.fit(train_data)\n", "\n", "# Make predictions\n", "predictions = model.transform(test_data)\n", "\n", "# Evaluate the model\n", "evaluator = BinaryClassificationEvaluator(labelCol=\"high_performer\", metricName=\"areaUnderROC\")\n", "auc = evaluator.evaluate(predictions)\n", "\n", "print(f\"Model AUC: {auc:.4f}\")\n", "\n", "# Show prediction results\n", "predictions.select(\"student_id\", \"avg_score\", \"avg_engagement\", \"high_performer\", \"prediction\", \"probability\").show(10)\n", "\n", "# Calculate confusion matrix\n", "confusion_matrix = predictions.groupBy(\"high_performer\", \"prediction\").count()\n", "confusion_matrix.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Feature Importance for Student Performance Prediction ===\n", "total_assessments: 0.0175\n", "avg_score: 0.5264\n", "score_variability: 0.0138\n", "avg_engagement: 0.0178\n", "avg_completion_time: 0.0344\n", "completion_time_variability: 0.0204\n", "subjects_attempted: 0.0029\n", "active_months: 0.0064\n", "math_avg_score: 0.0298\n", "english_avg_score: 0.0330\n", "science_avg_score: 0.0244\n", "recent_avg_score: 0.1263\n", "earlier_avg_score: 0.1353\n", "grade_level_index: 0.0118\n", "\n", "=== Business Impact Analysis ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Total test students: 549\n", "Students predicted as high performers: 23\n", "Percentage identified for advanced programs: 4.2%\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Students predicted as needing intervention: 526\n", "Percentage flagged for academic support: 95.8%\n", "\n", "Estimated intervention cost per student: $500\n", "Total intervention cost: $263,000\n", "Expected benefit from interventions: $1,315,000\n", "Intervention program ROI: 400.0%\n", "\n", "Additional value from advanced programs: $34,500\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Model Performance:\n", "Accuracy: 0.9854\n", "Precision: 1.0000\n", "Recall: 0.7419\n", "AUC: 0.9977\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Model interpretation and business insights\n", "\n", "# Feature importance (approximate)\n", "rf_model = model.stages[-1]\n", "feature_importance = rf_model.featureImportances\n", "feature_names = feature_cols\n", "\n", "print(\"=== Feature Importance for Student Performance Prediction ===\")\n", "for name, importance in zip(feature_names, feature_importance):\n", " print(f\"{name}: {importance:.4f}\")\n", "\n", "# Business impact analysis\n", "print(\"\\n=== Business Impact Analysis ===\")\n", "\n", "# Calculate potential impact of performance prediction\n", "high_performers_predicted = predictions.filter(\"prediction = 1\")\n", "students_identified = high_performers_predicted.count()\n", "total_test_students = test_data.count()\n", "\n", "print(f\"Total test students: {total_test_students}\")\n", "print(f\"Students predicted as high performers: {students_identified}\")\n", "print(f\"Percentage identified for advanced programs: {(students_identified/total_test_students)*100:.1f}%\")\n", "\n", "# At-risk students (predicted as low performers)\n", "at_risk_students = predictions.filter(\"prediction = 0\")\n", "at_risk_count = at_risk_students.count()\n", "\n", "print(f\"\\nStudents predicted as needing intervention: {at_risk_count}\")\n", "print(f\"Percentage flagged for academic support: {(at_risk_count/total_test_students)*100:.1f}%\")\n", "\n", "# Calculate intervention value\n", "intervention_cost_per_student = 500 # Cost of tutoring/mentoring per student\n", "intervention_effectiveness = 0.25 # Expected improvement in performance\n", "avg_student_value = 10000 # Value of improved student outcomes\n", "\n", "total_intervention_cost = at_risk_count * intervention_cost_per_student\n", "expected_benefit = at_risk_count * intervention_effectiveness * avg_student_value\n", "intervention_roi = (expected_benefit - total_intervention_cost) / total_intervention_cost * 100\n", "\n", "print(f\"\\nEstimated intervention cost per student: ${intervention_cost_per_student:,}\")\n", "print(f\"Total intervention cost: ${total_intervention_cost:,}\")\n", "print(f\"Expected benefit from interventions: ${expected_benefit:,.0f}\")\n", "print(f\"Intervention program ROI: {intervention_roi:.1f}%\")\n", "\n", "# Advanced program value for high performers\n", "advanced_program_benefit = students_identified * avg_student_value * 0.15 # 15% additional benefit\n", "print(f\"\\nAdditional value from advanced programs: ${advanced_program_benefit:,.0f}\")\n", "\n", "# Accuracy metrics\n", "accuracy = predictions.filter(\"high_performer = prediction\").count() / predictions.count()\n", "precision = predictions.filter(\"prediction = 1 AND high_performer = 1\").count() / predictions.filter(\"prediction = 1\").count() if predictions.filter(\"prediction = 1\").count() > 0 else 0\n", "recall = predictions.filter(\"prediction = 1 AND high_performer = 1\").count() / predictions.filter(\"high_performer = 1\").count() if predictions.filter(\"high_performer = 1\").count() > 0 else 0\n", "\n", "print(f\"\\nModel Performance:\")\n", "print(f\"Accuracy: {accuracy:.4f}\")\n", "print(f\"Precision: {precision:.4f}\")\n", "print(f\"Recall: {recall:.4f}\")\n", "print(f\"AUC: {auc:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways: Delta Liquid Clustering + ML in AIDP\n", "\n", "### What We Demonstrated\n", "\n", "1. **Automatic Optimization**: Created a table with `CLUSTER BY (student_id, assessment_date)` and let Delta automatically optimize data layout\n", "\n", "2. **Performance Benefits**: Queries on clustered columns (student_id, assessment_date) are significantly faster due to data locality\n", "\n", "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", "\n", "4. **Machine Learning Integration**: Trained a student performance prediction model using the optimized data\n", "\n", "5. **Real-World Use Case**: Education analytics where student performance tracking and learning intervention are critical\n", "\n", "### AIDP Advantages\n", "\n", "- **Unified Analytics**: Seamlessly integrates data optimization with ML\n", "- **Governance**: Catalog and schema isolation for education data\n", "- **Performance**: Optimized for both analytical queries and ML training\n", "- **Scalability**: Handles education-scale data volumes effortlessly\n", "\n", "### Business Benefits for Education\n", "\n", "1. **Early Intervention**: Identify struggling students before performance declines\n", "2. **Personalized Learning**: Tailor educational approaches to student needs\n", "3. **Resource Optimization**: Focus support where it's most effective\n", "4. **Academic Excellence**: Improve overall student performance and outcomes\n", "5. **Educational Equity**: Ensure all students receive appropriate support\n", "\n", "### Best Practices for Education Analytics\n", "\n", "1. **Choose clustering columns** based on your most common query patterns\n", "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", "3. **Consider cardinality** - high-cardinality columns work best\n", "4. **Monitor and adjust** as query patterns evolve\n", "5. **Combine with ML** for predictive analytics and automation\n", "\n", "### Next Steps\n", "\n", "- Explore other AIDP ML features like AutoML\n", "- Try liquid clustering with different column combinations\n", "- Scale up to larger education datasets\n", "- Integrate with real LMS systems and assessment platforms\n", "- Deploy models for real-time student performance monitoring\n", "\n", "This notebook demonstrates how Oracle AI Data Platform makes advanced education analytics accessible while maintaining enterprise-grade performance and governance." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }