{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Media: Delta Liquid Clustering Demo\n",
    "\n",
    "\n",
    "## Overview\n",
    "\n",
    "\n",
    "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a media and entertainment analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n",
    "\n",
    "### What is Liquid Clustering?\n",
    "\n",
    "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n",
    "\n",
    "- **Automatic optimization**: No manual tuning required\n",
    "- **Improved query performance**: Faster queries on clustered columns\n",
    "- **Reduced maintenance**: No need for manual repartitioning\n",
    "- **Adaptive clustering**: Adjusts as data patterns change\n",
    "\n",
    "### Use Case: Content Performance and User Engagement Analytics\n",
    "\n",
    "We'll analyze media content consumption and user engagement data. Our clustering strategy will optimize for:\n",
    "\n",
    "- **User-specific queries**: Fast lookups by user ID\n",
    "- **Time-based analysis**: Efficient filtering by viewing and engagement dates\n",
    "- **Content performance patterns**: Quick aggregation by content type and engagement metrics\n",
    "\n",
    "### AIDP Environment Setup\n",
    "\n",
    "This notebook leverages the existing Spark session in your AIDP environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create media catalog and analytics schema\n",
    "\n",
    "# In AIDP, catalogs provide data isolation and governance\n",
    "\n",
    "spark.sql(\"CREATE CATALOG IF NOT EXISTS media\")\n",
    "\n",
    "spark.sql(\"CREATE SCHEMA IF NOT EXISTS media.analytics\")\n",
    "\n",
    "print(\"Media catalog and analytics schema created successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create Delta Table with Liquid Clustering\n",
    "\n",
    "### Table Design\n",
    "\n",
    "Our `content_engagement` table will store:\n",
    "\n",
    "- **user_id**: Unique user identifier\n",
    "- **engagement_date**: Date and time of engagement\n",
    "- **content_type**: Type (Video, Article, Podcast, Live Stream)\n",
    "- **watch_time**: Time spent consuming content (minutes)\n",
    "- **content_id**: Specific content identifier\n",
    "- **engagement_score**: User engagement metric (0-100)\n",
    "- **device_type**: Device used (Mobile, Desktop, TV, etc.)\n",
    "\n",
    "### Clustering Strategy\n",
    "\n",
    "We'll cluster by `user_id` and `engagement_date` because:\n",
    "\n",
    "- **user_id**: Users consume multiple pieces of content, grouping their viewing history together\n",
    "- **engagement_date**: Time-based queries are critical for content performance analysis, recommendation systems, and user behavior trends\n",
    "- This combination optimizes for both personalized content recommendations and temporal engagement analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Delta table with liquid clustering created successfully!\n",
       "Clustering will automatically optimize data layout for queries on user_id and engagement_date.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Create Delta table with liquid clustering\n",
    "\n",
    "# CLUSTER BY defines the columns for automatic optimization\n",
    "\n",
    "spark.sql(\"\"\"\n",
    "\n",
    "CREATE TABLE IF NOT EXISTS media.analytics.content_engagement (\n",
    "\n",
    "    user_id STRING,\n",
    "\n",
    "    engagement_date TIMESTAMP,\n",
    "\n",
    "    content_type STRING,\n",
    "\n",
    "    watch_time DECIMAL(8,2),\n",
    "\n",
    "    content_id STRING,\n",
    "\n",
    "    engagement_score INT,\n",
    "\n",
    "    device_type STRING\n",
    "\n",
    ")\n",
    "\n",
    "USING DELTA\n",
    "\n",
    "CLUSTER BY (user_id, engagement_date)\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "print(\"Delta table with liquid clustering created successfully!\")\n",
    "\n",
    "print(\"Clustering will automatically optimize data layout for queries on user_id and engagement_date.\")"
   ]
  },
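  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To confirm that the table picked up the columns declared in `CLUSTER BY`, one option is to read the table metadata. The following is a minimal sketch rather than part of the original demo: `clustering_columns` is a hypothetical helper, and it assumes the Delta version behind AIDP reports a `clusteringColumns` field in `DESCRIBE DETAIL`.\n",
    "\n",
    "```python\n",
    "def clustering_columns(spark, table_name):\n",
    "    # DESCRIBE DETAIL returns a single row of table metadata.\n",
    "    detail = spark.sql(f'DESCRIBE DETAIL {table_name}').collect()[0].asDict()\n",
    "    # Assumption: this Delta version exposes a clusteringColumns field.\n",
    "    # If the field is missing or empty, an empty list is returned.\n",
    "    return list(detail.get('clusteringColumns') or [])\n",
    "\n",
    "# Usage in the notebook, where `spark` is the existing session:\n",
    "# print(clustering_columns(spark, 'media.analytics.content_engagement'))\n",
    "```\n",
    "\n",
    "For the table above, the expected result would be the two columns `user_id` and `engagement_date`. An empty list would suggest that the clustering declaration was not applied or is not reported by this engine."
   ]
  },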
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Generate Media Sample Data\n",
    "\n",
    "### Data Generation Strategy\n",
    "\n",
    "We'll create realistic media engagement data including:\n",
    "\n",
    "- **12,000 users** with multiple content interactions over time\n",
    "- **Content types**: Video, Article, Podcast, Live Stream\n",
    "- **Realistic engagement patterns**: Peak viewing times, content preferences, device usage\n",
    "- **Engagement metrics**: Watch time, completion rates, interaction scores\n",
    "\n",
    "### Why This Data Pattern?\n",
    "\n",
    "This data simulates real media scenarios where:\n",
    "\n",
    "- User preferences drive content recommendations\n",
    "- Engagement metrics determine content success\n",
    "- Device usage affects viewing experience\n",
    "- Time-based patterns influence programming decisions\n",
    "- Personalization requires historical user behavior"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Generated 299696 content engagement records\n",
       "Sample record: {'user_id': 'USER000001', 'engagement_date': datetime.datetime(2024, 5, 12, 21, 57), 'content_type': 'Article', 'watch_time': 10.07, 'content_id': 'ART59438', 'engagement_score': 64, 'device_type': 'Gaming Console'}\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Generate sample media engagement data\n",
    "\n",
    "# Using fully qualified imports to avoid conflicts\n",
    "\n",
    "import random\n",
    "\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "\n",
    "# Define media data constants\n",
    "\n",
    "CONTENT_TYPES = ['Video', 'Article', 'Podcast', 'Live Stream']\n",
    "\n",
    "DEVICE_TYPES = ['Mobile', 'Desktop', 'Tablet', 'Smart TV', 'Gaming Console']\n",
    "\n",
    "# Base engagement parameters by content type\n",
    "\n",
    "ENGAGEMENT_PARAMS = {\n",
    "\n",
    "    'Video': {'avg_watch_time': 15, 'engagement_base': 75, 'frequency': 12},\n",
    "\n",
    "    'Article': {'avg_watch_time': 8, 'engagement_base': 65, 'frequency': 8},\n",
    "\n",
    "    'Podcast': {'avg_watch_time': 25, 'engagement_base': 70, 'frequency': 6},\n",
    "\n",
    "    'Live Stream': {'avg_watch_time': 45, 'engagement_base': 80, 'frequency': 4}\n",
    "\n",
    "}\n",
    "\n",
    "# Device engagement multipliers\n",
    "\n",
    "DEVICE_MULTIPLIERS = {\n",
    "\n",
    "    'Mobile': 0.9, 'Desktop': 1.0, 'Tablet': 0.95, 'Smart TV': 1.1, 'Gaming Console': 1.05\n",
    "\n",
    "}\n",
    "\n",
    "\n",
    "# Generate content engagement records\n",
    "\n",
    "engagement_data = []\n",
    "\n",
    "base_date = datetime(2024, 1, 1)\n",
    "\n",
    "\n",
    "# Create 12,000 users with 10-40 engagement events each\n",
    "\n",
    "for user_num in range(1, 12001):\n",
    "\n",
    "    user_id = f\"USER{user_num:06d}\"\n",
    "    \n",
    "    # Each user gets 10-40 engagement events over 12 months\n",
    "\n",
    "    num_engagements = random.randint(10, 40)\n",
    "    \n",
    "    for i in range(num_engagements):\n",
    "\n",
    "        # Spread engagements over 12 months\n",
    "\n",
    "        days_offset = random.randint(0, 365)\n",
    "\n",
    "        engagement_date = base_date + timedelta(days=days_offset)\n",
    "        \n",
    "        # Add realistic timing (more engagement during certain hours)\n",
    "\n",
    "        hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]\n",
    "\n",
    "        hours_offset = random.choices(range(24), weights=hour_weights)[0]\n",
    "\n",
    "        engagement_date = engagement_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)\n",
    "        \n",
    "        # Select content type\n",
    "\n",
    "        content_type = random.choice(CONTENT_TYPES)\n",
    "\n",
    "        params = ENGAGEMENT_PARAMS[content_type]\n",
    "        \n",
    "        # Select device type\n",
    "\n",
    "        device_type = random.choice(DEVICE_TYPES)\n",
    "\n",
    "        device_multiplier = DEVICE_MULTIPLIERS[device_type]\n",
    "        \n",
    "        # Calculate watch time with variations\n",
    "\n",
    "        time_variation = random.uniform(0.3, 2.5)\n",
    "\n",
    "        watch_time = round(params['avg_watch_time'] * time_variation * device_multiplier, 2)\n",
    "        \n",
    "        # Content ID\n",
    "\n",
    "        content_id = f\"{content_type[:3].upper()}{random.randint(10000, 99999)}\"\n",
    "        \n",
    "        # Engagement score (based on content type, device, and some randomness)\n",
    "\n",
    "        engagement_variation = random.randint(-15, 15)\n",
    "\n",
    "        engagement_score = max(0, min(100, int(params['engagement_base'] * device_multiplier) + engagement_variation))\n",
    "        \n",
    "        engagement_data.append({\n",
    "\n",
    "            \"user_id\": user_id,\n",
    "\n",
    "            \"engagement_date\": engagement_date,\n",
    "\n",
    "            \"content_type\": content_type,\n",
    "\n",
    "            \"watch_time\": watch_time,\n",
    "\n",
    "            \"content_id\": content_id,\n",
    "\n",
    "            \"engagement_score\": engagement_score,\n",
    "\n",
    "            \"device_type\": device_type\n",
    "\n",
    "        })\n",
    "\n",
    "\n",
    "\n",
    "print(f\"Generated {len(engagement_data)} content engagement records\")\n",
    "\n",
    "print(\"Sample record:\", engagement_data[0])"
   ]
  },
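  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generator above has no fixed seed, so exact counts vary between runs. As a quick sanity check, its parameters imply an expected record count, a peak hour, and a date range that the generated data should roughly match. The sketch below recomputes these with plain Python; the variable names are illustrative and not part of the original notebook.\n",
    "\n",
    "```python\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "NUM_USERS = 12_000\n",
    "MIN_EVENTS, MAX_EVENTS = 10, 40\n",
    "\n",
    "# randint(10, 40) is uniform, so the mean is (10 + 40) / 2 = 25 events per user.\n",
    "expected_records = NUM_USERS * (MIN_EVENTS + MAX_EVENTS) // 2\n",
    "\n",
    "# Same hourly weights as the generator; the largest weight marks the peak hour.\n",
    "hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]\n",
    "peak_hour = max(range(24), key=lambda h: hour_weights[h])\n",
    "peak_share = hour_weights[peak_hour] / sum(hour_weights)\n",
    "\n",
    "# Largest day offset is 365; 2024 is a leap year, so this is still in 2024.\n",
    "last_day = datetime(2024, 1, 1) + timedelta(days=365)\n",
    "\n",
    "print(expected_records)        # 300000\n",
    "print(peak_hour)               # 18\n",
    "print(round(peak_share, 3))    # 0.087\n",
    "print(last_day.date())         # 2024-12-31\n",
    "```\n",
    "\n",
    "The recorded run produced 299,696 records, which is close to the expected 300,000."
   ]
  },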
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Insert Data Using PySpark\n",
    "\n",
    "### Data Insertion Strategy\n",
    "\n",
    "We'll use PySpark to:\n",
    "\n",
    "1. **Create DataFrame** from our generated data\n",
    "2. **Insert into Delta table** with liquid clustering\n",
    "3. **Verify the insertion** with a sample query\n",
    "\n",
    "### Why PySpark for Insertion?\n",
    "\n",
    "- **Distributed processing**: Handles large datasets efficiently\n",
    "- **Type safety**: Ensures data integrity\n",
    "- **Optimization**: Leverages Spark's query optimization\n",
    "- **Liquid clustering**: Automatically applies clustering during insertion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame Schema:\n",
       "root\n",
       " |-- content_id: string (nullable = true)\n",
       " |-- content_type: string (nullable = true)\n",
       " |-- device_type: string (nullable = true)\n",
       " |-- engagement_date: timestamp (nullable = true)\n",
       " |-- engagement_score: long (nullable = true)\n",
       " |-- user_id: string (nullable = true)\n",
       " |-- watch_time: double (nullable = true)\n",
       "\n",
       "\n",
       "Sample Data:\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+------------+--------------+-------------------+----------------+----------+----------+\n",
       "|content_id|content_type|   device_type|    engagement_date|engagement_score|   user_id|watch_time|\n",
       "+----------+------------+--------------+-------------------+----------------+----------+----------+\n",
       "|  ART59438|     Article|Gaming Console|2024-05-12 21:57:00|              64|USER000001|     10.07|\n",
       "|  VID93820|       Video|        Mobile|2024-10-15 06:04:00|              56|USER000001|     33.09|\n",
       "|  ART16141|     Article|        Tablet|2024-09-23 16:09:00|              59|USER000001|     11.63|\n",
       "|  LIV44087| Live Stream|        Tablet|2024-12-28 16:21:00|              79|USER000001|      13.6|\n",
       "|  POD85603|     Podcast|      Smart TV|2024-10-13 09:41:00|              68|USER000001|      29.2|\n",
       "+----------+------------+--------------+-------------------+----------------+----------+----------+\n",
       "only showing top 5 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Successfully inserted 299696 records into media.analytics.content_engagement\n",
       "Liquid clustering automatically optimized the data layout during insertion!\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Insert data using PySpark DataFrame operations\n",
    "\n",
    "# Using fully qualified function references to avoid conflicts\n",
    "\n",
    "\n",
    "# Create DataFrame from generated data\n",
    "\n",
    "df_engagement = spark.createDataFrame(engagement_data)\n",
    "\n",
    "\n",
    "# Display schema and sample data\n",
    "\n",
    "print(\"DataFrame Schema:\")\n",
    "\n",
    "df_engagement.printSchema()\n",
    "\n",
    "\n",
    "\n",
    "print(\"\\nSample Data:\")\n",
    "\n",
    "df_engagement.show(5)\n",
    "\n",
    "\n",
    "# Insert data into Delta table with liquid clustering\n",
    "\n",
    "# The CLUSTER BY (user_id, engagement_date) will automatically optimize the data layout\n",
    "\n",
    "df_engagement.write.mode(\"overwrite\").saveAsTable(\"media.analytics.content_engagement\")\n",
    "\n",
    "\n",
    "print(f\"\\nSuccessfully inserted {df_engagement.count()} records into media.analytics.content_engagement\")\n",
    "\n",
    "print(\"Liquid clustering automatically optimized the data layout during insertion!\")"
   ]
  },
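  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The printed schema shows that `createDataFrame` inferred `watch_time` as `double` and `engagement_score` as `long`, with columns in alphabetical order, while the table declares `DECIMAL(8,2)` and `INT`. One way to make the types and column order explicit before writing is to cast each column to its declared type. This is a sketch under assumptions: `TABLE_COLUMNS`, `cast_expressions`, and `align_to_table` are hypothetical names, and whether the cast is required depends on how the engine handles the overwrite.\n",
    "\n",
    "```python\n",
    "# Column names and SQL types copied from the CREATE TABLE statement.\n",
    "TABLE_COLUMNS = [\n",
    "    ('user_id', 'STRING'),\n",
    "    ('engagement_date', 'TIMESTAMP'),\n",
    "    ('content_type', 'STRING'),\n",
    "    ('watch_time', 'DECIMAL(8,2)'),\n",
    "    ('content_id', 'STRING'),\n",
    "    ('engagement_score', 'INT'),\n",
    "    ('device_type', 'STRING'),\n",
    "]\n",
    "\n",
    "def cast_expressions(columns):\n",
    "    # Build one SQL expression per column, in table order.\n",
    "    return [f'CAST({name} AS {sql_type}) AS {name}' for name, sql_type in columns]\n",
    "\n",
    "def align_to_table(df, columns=TABLE_COLUMNS):\n",
    "    # selectExpr both reorders the columns and applies the casts.\n",
    "    return df.selectExpr(*cast_expressions(columns))\n",
    "\n",
    "# Usage in the notebook:\n",
    "# df_aligned = align_to_table(df_engagement)\n",
    "# df_aligned.printSchema()\n",
    "```\n",
    "\n",
    "Casting up front keeps the declared types under the notebook's control instead of relying on implicit conversion at write time."
   ]
  },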
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Demonstrate Liquid Clustering Benefits\n",
    "\n",
    "### Query Performance Analysis\n",
    "\n",
    "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n",
    "\n",
    "1. **User engagement history** (clustered by user_id)\n",
    "2. **Time-based content analysis** (clustered by engagement_date)\n",
    "3. **Combined user + time queries** (optimal for our clustering)\n",
    "\n",
    "### Expected Performance Benefits\n",
    "\n",
    "With liquid clustering, these queries should be significantly faster because:\n",
    "\n",
    "- **Data locality**: Related records are physically grouped together\n",
    "- **Reduced I/O**: Less data needs to be read from disk\n",
    "- **Automatic optimization**: No manual tuning required"
   ]
  },
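  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data-locality and reduced-I/O points above can be illustrated without Spark. The toy model below splits rows into fixed-size \"files\", records each file's minimum and maximum key, and counts how many files a lookup must open. It uses a plain sort as a stand-in for clustering, so it shows the principle only and does not reproduce how Delta lays out data; `files_to_read` is a hypothetical helper.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "def files_to_read(rows, rows_per_file, wanted_key):\n",
    "    # Split rows into fixed-size files and keep each file's min/max key,\n",
    "    # mimicking per-file statistics that allow files to be skipped.\n",
    "    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]\n",
    "    stats = [(min(f), max(f)) for f in files]\n",
    "    # A file must be read only if the wanted key falls inside its range.\n",
    "    return sum(1 for lo, hi in stats if lo <= wanted_key <= hi)\n",
    "\n",
    "random.seed(7)\n",
    "# 1,000 users with 10 rows each, stored in files of 500 rows (20 files).\n",
    "keys = [f'USER{n:06d}' for n in range(1, 1001) for _ in range(10)]\n",
    "shuffled = random.sample(keys, len(keys))  # arrival order, unclustered\n",
    "clustered = sorted(keys)                   # equal keys are adjacent\n",
    "\n",
    "unclustered_reads = files_to_read(shuffled, 500, 'USER000500')\n",
    "clustered_reads = files_to_read(clustered, 500, 'USER000500')\n",
    "print('unclustered:', unclustered_reads, 'clustered:', clustered_reads)\n",
    "```\n",
    "\n",
    "In the sorted layout the lookup touches a single file, while in the shuffled layout almost every file's key range overlaps the wanted key."
   ]
  },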
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== Query 1: User Engagement History ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+-------------------+------------+----------+----------------+\n",
       "|   user_id|    engagement_date|content_type|watch_time|engagement_score|\n",
       "+----------+-------------------+------------+----------+----------------+\n",
       "|USER000001|2024-12-29 19:51:00| Live Stream|     67.09|              63|\n",
       "|USER000001|2024-12-28 16:21:00| Live Stream|      13.6|              79|\n",
       "|USER000001|2024-12-24 17:40:00|       Video|     20.14|              75|\n",
       "|USER000001|2024-12-05 09:50:00|       Video|     27.32|              73|\n",
       "|USER000001|2024-12-03 14:50:00|     Podcast|     50.85|              70|\n",
       "|USER000001|2024-11-26 11:18:00| Live Stream|     79.14|             100|\n",
       "|USER000001|2024-11-09 12:07:00| Live Stream|     49.07|              66|\n",
       "|USER000001|2024-10-15 06:04:00|       Video|     33.09|              56|\n",
       "|USER000001|2024-10-13 09:41:00|     Podcast|      29.2|              68|\n",
       "|USER000001|2024-10-08 16:00:00|     Article|      8.59|              66|\n",
       "+----------+-------------------+------------+----------+----------------+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Records found: 10 (showing first 10)\n",
       "\n",
       "=== Query 2: Recent High-Engagement Content ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------------+----------+----------+------------+----------------+----------+\n",
       "|    engagement_date|   user_id|content_id|content_type|engagement_score|watch_time|\n",
       "+-------------------+----------+----------+------------+----------------+----------+\n",
       "|2024-02-15 07:41:00|USER001708|  LIV38799| Live Stream|             100|    105.07|\n",
       "|2024-02-15 08:23:00|USER009654|  LIV43097| Live Stream|             100|     99.65|\n",
       "|2024-02-15 12:52:00|USER010253|  LIV92793| Live Stream|             100|     77.38|\n",
       "|2024-02-15 08:33:00|USER001622|  LIV68096| Live Stream|             100|     69.99|\n",
       "|2024-02-15 22:50:00|USER011218|  LIV95921| Live Stream|             100|     54.75|\n",
       "|2024-02-15 07:07:00|USER005461|  LIV57619| Live Stream|             100|     44.22|\n",
       "|2024-02-15 08:45:00|USER006405|  LIV69306| Live Stream|             100|     43.79|\n",
       "|2024-02-15 20:49:00|USER001080|  LIV83472| Live Stream|             100|     28.92|\n",
       "|2024-02-15 12:32:00|USER006006|  LIV82269| Live Stream|             100|     17.02|\n",
       "|2024-02-15 12:41:00|USER004334|  LIV75056| Live Stream|             100|     16.41|\n",
       "|2024-02-15 14:52:00|USER001129|  LIV45195| Live Stream|              99|     105.9|\n",
       "|2024-02-15 15:33:00|USER008899|  LIV93029| Live Stream|              99|    101.29|\n",
       "|2024-02-15 18:21:00|USER006605|  LIV79583| Live Stream|              98|     69.74|\n",
       "|2024-02-15 10:18:00|USER011599|  LIV42264| Live Stream|              98|     57.02|\n",
       "|2024-02-15 17:53:00|USER002707|  LIV74638| Live Stream|              97|     87.28|\n",
       "|2024-02-15 18:59:00|USER003422|  LIV76190| Live Stream|              97|     86.81|\n",
       "|2024-02-15 09:21:00|USER011755|  LIV63710| Live Stream|              97|     33.21|\n",
       "|2024-02-15 19:56:00|USER004144|  VID52961|       Video|              97|     24.87|\n",
       "|2024-02-15 07:16:00|USER000628|  VID47540|       Video|              97|     18.96|\n",
       "|2024-02-15 08:05:00|USER006957|  VID23212|       Video|              97|     17.05|\n",
       "+-------------------+----------+----------+------------+----------------+----------+\n",
       "only showing top 20 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "High-engagement records found: 112 (showing first 20)\n",
       "\n",
       "=== Query 3: User Content Preferences ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+-------------------+------------+----------+--------------+\n",
       "|   user_id|    engagement_date|content_type|watch_time|   device_type|\n",
       "+----------+-------------------+------------+----------+--------------+\n",
       "|USER000001|2024-02-06 22:09:00|     Article|     13.05|       Desktop|\n",
       "|USER000001|2024-03-12 06:14:00|     Podcast|     31.97|Gaming Console|\n",
       "|USER000001|2024-03-19 14:51:00| Live Stream|     39.39|      Smart TV|\n",
       "|USER000001|2024-05-12 21:57:00|     Article|     10.07|Gaming Console|\n",
       "|USER000001|2024-05-23 19:15:00|     Article|      5.61|      Smart TV|\n",
       "|USER000001|2024-06-28 19:36:00| Live Stream|     49.28|Gaming Console|\n",
       "|USER000001|2024-07-01 21:31:00|       Video|     16.08|Gaming Console|\n",
       "|USER000001|2024-07-09 06:32:00|     Podcast|     25.73|      Smart TV|\n",
       "|USER000001|2024-07-11 02:53:00|     Article|     12.19|        Mobile|\n",
       "|USER000001|2024-08-10 16:31:00|     Podcast|     54.28|        Mobile|\n",
       "|USER000001|2024-08-16 20:11:00|     Podcast|     62.21|       Desktop|\n",
       "|USER000001|2024-08-30 20:32:00|     Podcast|      15.8|       Desktop|\n",
       "|USER000001|2024-09-17 14:20:00|       Video|      8.99|        Mobile|\n",
       "|USER000001|2024-09-18 13:56:00|     Podcast|     52.46|      Smart TV|\n",
       "|USER000001|2024-09-23 16:09:00|     Article|     11.63|        Tablet|\n",
       "|USER000001|2024-10-02 18:48:00| Live Stream|     25.29|Gaming Console|\n",
       "|USER000001|2024-10-08 16:00:00|     Article|      8.59|       Desktop|\n",
       "|USER000001|2024-10-13 09:41:00|     Podcast|      29.2|      Smart TV|\n",
       "|USER000001|2024-10-15 06:04:00|       Video|     33.09|        Mobile|\n",
       "|USER000001|2024-11-09 12:07:00| Live Stream|     49.07|       Desktop|\n",
       "+----------+-------------------+------------+----------+--------------+\n",
       "only showing top 20 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "User preference records found: 25 (showing first 25)\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Demonstrate liquid clustering benefits with optimized queries\n",
    "\n",
    "\n",
    "# Query 1: User engagement history - benefits from user_id clustering\n",
    "\n",
    "print(\"=== Query 1: User Engagement History ===\")\n",
    "\n",
    "user_history = spark.sql(\"\"\"\n",
    "\n",
    "SELECT user_id, engagement_date, content_type, watch_time, engagement_score\n",
    "\n",
    "FROM media.analytics.content_engagement\n",
    "\n",
    "WHERE user_id = 'USER000001'\n",
    "\n",
    "ORDER BY engagement_date DESC\n",
    "\n",
    "LIMIT 10\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "user_history.show()\n",
    "\n",
    "print(f\"Records found: {user_history.count()} (showing first 10)\")\n",
    "\n",
    "\n",
    "\n",
    "# Query 2: Time-based high-engagement content analysis - benefits from engagement_date clustering\n",
    "\n",
    "print(\"\\n=== Query 2: Recent High-Engagement Content ===\")\n",
    "\n",
    "high_engagement = spark.sql(\"\"\"\n",
    "\n",
    "SELECT engagement_date, user_id, content_id, content_type, engagement_score, watch_time\n",
    "\n",
    "FROM media.analytics.content_engagement\n",
    "\n",
    "WHERE DATE(engagement_date) = '2024-02-15' AND engagement_score > 85\n",
    "\n",
    "ORDER BY engagement_score DESC, watch_time DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "high_engagement.show()\n",
    "\n",
    "print(f\"High-engagement records found: {high_engagement.count()} (showing first 20)\")\n",
    "\n",
    "\n",
    "\n",
    "# Query 3: Combined user + time query - optimal for our clustering strategy\n",
    "\n",
    "print(\"\\n=== Query 3: User Content Preferences ===\")\n",
    "\n",
    "user_preferences = spark.sql(\"\"\"\n",
    "\n",
    "SELECT user_id, engagement_date, content_type, watch_time, device_type\n",
    "\n",
    "FROM media.analytics.content_engagement\n",
    "\n",
    "WHERE user_id LIKE 'USER000%' AND engagement_date >= '2024-02-01'\n",
    "\n",
    "ORDER BY user_id, engagement_date\n",
    "\n",
    "LIMIT 25\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "user_preferences.show()\n",
    "\n",
    "print(f\"User preference records found: {user_preferences.count()} (showing first 25)\")"
   ]
  },
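  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The queries above show results but do not measure speed. One simple way to put numbers on the expected benefit is to time the same query before and after table maintenance. The sketch below is an assumption-laden starting point: `time_query` is a hypothetical helper, wall-clock timings in a shared notebook are noisy, and it assumes `OPTIMIZE` is available for Delta tables in this environment.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "def time_query(spark, sql, runs=3):\n",
    "    # Run the query several times and return the best wall-clock time in seconds.\n",
    "    # collect() forces full execution, so keep the result set small.\n",
    "    timings = []\n",
    "    for _ in range(runs):\n",
    "        start = time.perf_counter()\n",
    "        spark.sql(sql).collect()\n",
    "        timings.append(time.perf_counter() - start)\n",
    "    return min(timings)\n",
    "\n",
    "QUERY = '''\n",
    "SELECT user_id, engagement_date, content_type\n",
    "FROM media.analytics.content_engagement\n",
    "WHERE user_id = 'USER000001'\n",
    "'''\n",
    "\n",
    "# Usage in the notebook, where `spark` is the existing session:\n",
    "# before = time_query(spark, QUERY)\n",
    "# spark.sql('OPTIMIZE media.analytics.content_engagement')\n",
    "# after = time_query(spark, QUERY)\n",
    "# print(f'before: {before:.3f}s, after: {after:.3f}s')\n",
    "```\n",
    "\n",
    "Taking the minimum over a few runs reduces the influence of caching and cluster warm-up on the comparison."
   ]
  },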
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Analyze Clustering Effectiveness\n",
    "\n",
    "### Understanding the Impact\n",
    "\n",
    "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the media insights possible with this optimized structure.\n",
    "\n",
    "### Key Analytics\n",
    "\n",
    "- **User engagement patterns** and content preferences\n",
    "- **Content performance** by type and popularity metrics\n",
    "- **Device usage trends** and platform optimization\n",
    "- **Time-based consumption patterns** and programming insights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== User Engagement Analysis ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+--------------+----------------+----------------+--------------+------------------+\n",
       "|   user_id|total_sessions|total_watch_time|avg_session_time|avg_engagement|content_types_used|\n",
       "+----------+--------------+----------------+----------------+--------------+------------------+\n",
       "|USER005870|            40|         1904.54|           47.61|         77.88|                 4|\n",
       "|USER008764|            38|         1788.89|           47.08|         75.37|                 4|\n",
       "|USER005341|            40|         1779.68|           44.49|         73.75|                 4|\n",
       "|USER000055|            40|         1775.95|            44.4|         73.28|                 4|\n",
       "|USER009434|            39|         1770.35|           45.39|         72.87|                 4|\n",
       "|USER000927|            37|         1769.41|           47.82|         78.35|                 4|\n",
       "|USER008715|            40|         1743.27|           43.58|         73.33|                 4|\n",
       "|USER001049|            38|         1738.06|           45.74|         75.39|                 4|\n",
       "|USER007246|            38|         1730.49|           45.54|         72.39|                 4|\n",
       "|USER002102|            38|         1725.56|           45.41|         71.55|                 4|\n",
       "+----------+--------------+----------------+----------------+--------------+------------------+\n",
       "\n",
       "\n",
       "=== Content Type Performance ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n",
       "|content_type|total_engagements|total_watch_time|avg_watch_time|avg_engagement|unique_users|unique_content|\n",
       "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n",
       "| Live Stream|            74856|      4709012.02|         62.91|         80.03|       11920|         50734|\n",
       "|     Podcast|            74797|      2617116.68|         34.99|         69.86|       11912|         50808|\n",
       "|       Video|            75029|      1572204.49|         20.95|         74.59|       11921|         50963|\n",
       "|     Article|            75014|       840406.44|          11.2|         64.63|       11918|         50808|\n",
       "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n",
       "\n",
       "\n",
       "=== Device Usage Analysis ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+--------------+--------------+----------------+----------------+--------------+------------+\n",
       "|   device_type|total_sessions|total_watch_time|avg_session_time|avg_engagement|unique_users|\n",
       "+--------------+--------------+----------------+----------------+--------------+------------+\n",
       "|      Smart TV|         59789|      2133021.69|           35.68|         79.48|       11786|\n",
       "|Gaming Console|         60021|      2044411.01|           34.06|         75.85|       11797|\n",
       "|       Desktop|         60280|      1959091.98|            32.5|         72.52|       11790|\n",
       "|        Tablet|         59771|      1848360.38|           30.92|         68.53|       11818|\n",
       "|        Mobile|         59835|      1753854.57|           29.31|         64.98|       11783|\n",
       "+--------------+--------------+----------------+----------------+--------------+------------+\n",
       "\n",
       "\n",
       "=== Hourly Engagement Patterns ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-----------+-----------------+----------------+--------------+------------+\n",
       "|hour_of_day|engagement_events|total_watch_time|avg_engagement|active_users|\n",
       "+-----------+-----------------+----------------+--------------+------------+\n",
       "|          0|               14|          442.86|         77.71|          14|\n",
       "|          1|                8|          241.31|         69.88|           8|\n",
       "|          2|                5|          178.38|          67.0|           5|\n",
       "|          3|                9|          331.53|         73.44|           9|\n",
       "|          4|                8|          292.48|         73.13|           8|\n",
       "|          5|                3|          256.24|         68.33|           3|\n",
       "|          6|               16|          524.01|         73.38|          16|\n",
       "|          7|               31|         1149.06|         72.55|          31|\n",
       "|          8|               38|         1430.55|         73.53|          38|\n",
       "|          9|               46|         1390.35|         71.78|          46|\n",
       "|         10|               32|          992.33|         71.31|          32|\n",
       "|         11|               40|         1388.35|         71.55|          40|\n",
       "|         12|               55|         1899.99|         74.22|          55|\n",
       "|         13|               46|         1644.95|         72.89|          46|\n",
       "|         14|               52|         1465.91|          72.4|          52|\n",
       "|         15|               43|         1484.52|         71.09|          43|\n",
       "|         16|               47|         1685.28|         71.68|          47|\n",
       "|         17|               59|         1880.32|         69.73|          58|\n",
       "|         18|               74|         2336.59|         72.93|          74|\n",
       "|         19|               49|         1555.56|         74.16|          49|\n",
       "+-----------+-----------------+----------------+--------------+------------+\n",
       "only showing top 20 rows\n",
       "\n",
       "\n",
       "=== Monthly Engagement Trends ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------+-----------------+------------------+----------------+--------------+------------+\n",
       "|  month|total_engagements|monthly_watch_time|avg_session_time|avg_engagement|active_users|\n",
       "+-------+-----------------+------------------+----------------+--------------+------------+\n",
       "|2024-01|            25180|         813811.06|           32.32|         72.23|       10218|\n",
       "|2024-02|            23629|         770696.13|           32.62|         72.29|        9999|\n",
       "|2024-03|            25417|         819129.81|           32.23|         72.09|       10227|\n",
       "|2024-04|            24549|         798241.78|           32.52|          72.2|       10148|\n",
       "|2024-05|            25286|         823505.15|           32.57|         72.27|       10283|\n",
       "|2024-06|            24567|         797466.01|           32.46|         72.36|       10148|\n",
       "|2024-07|            25411|         829033.74|           32.62|         72.41|       10222|\n",
       "|2024-08|            25545|         828501.89|           32.43|         72.27|       10264|\n",
       "|2024-09|            24412|         792450.74|           32.46|         72.39|       10148|\n",
       "|2024-10|            25538|         828750.42|           32.45|         72.27|       10261|\n",
       "|2024-11|            24756|         811877.48|            32.8|         72.22|       10210|\n",
       "|2024-12|            25406|         825275.42|           32.48|         72.24|       10230|\n",
       "+-------+-----------------+------------------+----------------+--------------+------------+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Analyze clustering effectiveness and media insights\n",
    "\n",
    "# User engagement analysis: top 10 users by total watch time\n",
    "print(\"=== User Engagement Analysis ===\")\n",
    "user_engagement = spark.sql(\"\"\"\n",
    "SELECT user_id, COUNT(*) as total_sessions,\n",
    "       ROUND(SUM(watch_time), 2) as total_watch_time,\n",
    "       ROUND(AVG(watch_time), 2) as avg_session_time,\n",
    "       ROUND(AVG(engagement_score), 2) as avg_engagement,\n",
    "       COUNT(DISTINCT content_type) as content_types_used\n",
    "FROM media.analytics.content_engagement\n",
    "GROUP BY user_id\n",
    "ORDER BY total_watch_time DESC\n",
    "LIMIT 10\n",
    "\"\"\")\n",
    "user_engagement.show()\n",
    "\n",
    "# Content type performance\n",
    "print(\"\\n=== Content Type Performance ===\")\n",
    "content_performance = spark.sql(\"\"\"\n",
    "SELECT content_type, COUNT(*) as total_engagements,\n",
    "       ROUND(SUM(watch_time), 2) as total_watch_time,\n",
    "       ROUND(AVG(watch_time), 2) as avg_watch_time,\n",
    "       ROUND(AVG(engagement_score), 2) as avg_engagement,\n",
    "       COUNT(DISTINCT user_id) as unique_users,\n",
    "       COUNT(DISTINCT content_id) as unique_content\n",
    "FROM media.analytics.content_engagement\n",
    "GROUP BY content_type\n",
    "ORDER BY total_watch_time DESC\n",
    "\"\"\")\n",
    "content_performance.show()\n",
    "\n",
    "# Device usage analysis\n",
    "print(\"\\n=== Device Usage Analysis ===\")\n",
    "device_analysis = spark.sql(\"\"\"\n",
    "SELECT device_type, COUNT(*) as total_sessions,\n",
    "       ROUND(SUM(watch_time), 2) as total_watch_time,\n",
    "       ROUND(AVG(watch_time), 2) as avg_session_time,\n",
    "       ROUND(AVG(engagement_score), 2) as avg_engagement,\n",
    "       COUNT(DISTINCT user_id) as unique_users\n",
    "FROM media.analytics.content_engagement\n",
    "GROUP BY device_type\n",
    "ORDER BY total_watch_time DESC\n",
    "\"\"\")\n",
    "device_analysis.show()\n",
    "\n",
    "# Hourly engagement patterns for a single day\n",
    "# (a range predicate on the raw clustering column, rather than DATE(engagement_date),\n",
    "# makes it easier for Delta to use file-level min/max statistics to skip files)\n",
    "print(\"\\n=== Hourly Engagement Patterns ===\")\n",
    "hourly_patterns = spark.sql(\"\"\"\n",
    "SELECT HOUR(engagement_date) as hour_of_day, COUNT(*) as engagement_events,\n",
    "       ROUND(SUM(watch_time), 2) as total_watch_time,\n",
    "       ROUND(AVG(engagement_score), 2) as avg_engagement,\n",
    "       COUNT(DISTINCT user_id) as active_users\n",
    "FROM media.analytics.content_engagement\n",
    "WHERE engagement_date >= '2024-02-01' AND engagement_date < '2024-02-02'\n",
    "GROUP BY HOUR(engagement_date)\n",
    "ORDER BY hour_of_day\n",
    "\"\"\")\n",
    "hourly_patterns.show()\n",
    "\n",
    "# Monthly engagement trends\n",
    "print(\"\\n=== Monthly Engagement Trends ===\")\n",
    "monthly_trends = spark.sql(\"\"\"\n",
    "SELECT DATE_FORMAT(engagement_date, 'yyyy-MM') as month,\n",
    "       COUNT(*) as total_engagements,\n",
    "       ROUND(SUM(watch_time), 2) as monthly_watch_time,\n",
    "       ROUND(AVG(watch_time), 2) as avg_session_time,\n",
    "       ROUND(AVG(engagement_score), 2) as avg_engagement,\n",
    "       COUNT(DISTINCT user_id) as active_users\n",
    "FROM media.analytics.content_engagement\n",
    "GROUP BY DATE_FORMAT(engagement_date, 'yyyy-MM')\n",
    "ORDER BY month\n",
    "\"\"\")\n",
    "monthly_trends.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Train Media Content Recommendation Model\n",
    "\n",
    "### Machine Learning for Media Business Improvement\n",
    "\n",
    "Now we'll train a machine learning model to predict content engagement and enable personalized recommendations. This model can help media companies:\n",
    "\n",
    "- **Personalize content recommendations** for better user engagement\n",
    "- **Optimize content discovery** and reduce user churn\n",
    "- **Maximize watch time** through intelligent content suggestions\n",
    "- **Improve content production** decisions based on engagement predictions\n",
    "\n",
    "### Model Approach\n",
    "\n",
    "We'll use a **Random Forest Classifier** to predict high engagement (engagement_score > 80) based on:\n",
    "\n",
    "- User behavior patterns (content preferences, device usage)\n",
    "- Content characteristics (type, timing)\n",
    "- Contextual factors (time of day, user engagement history)\n",
    "\n",
    "### Business Impact\n",
    "\n",
    "- **Engagement Boost**: Personalized recommendations increase watch time\n",
    "- **Retention Improvement**: Better content discovery reduces user churn\n",
    "- **Revenue Growth**: Higher engagement drives advertising and subscription revenue\n",
    "- **Content Strategy**: Data-driven decisions for content creation and acquisition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Created engagement prediction features for 299696 interactions\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+---------------+------+\n",
       "|high_engagement| count|\n",
       "+---------------+------+\n",
       "|              1| 76722|\n",
       "|              0|222974|\n",
       "+---------------+------+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Prepare data for machine learning - create user-content engagement features\n",
    "\n",
    "from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler\n",
    "from pyspark.ml.classification import RandomForestClassifier\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.ml import Pipeline\n",
    "import pyspark.sql.functions as F\n",
    "\n",
    "# Create engagement prediction features\n",
    "engagement_features = spark.sql(\"\"\"\n",
    "SELECT \n",
    "    user_id,\n",
    "    engagement_date,\n",
    "    content_type,\n",
    "    watch_time,\n",
    "    content_id,\n",
    "    device_type,\n",
    "    HOUR(engagement_date) as engagement_hour,\n",
    "    DAYOFWEEK(engagement_date) as engagement_day_of_week,\n",
    "    AVG(engagement_score) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as user_avg_engagement,\n",
    "    COUNT(*) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as user_prior_engagements,\n",
    "    COUNT(CASE WHEN content_type = 'Video' THEN 1 END) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) / NULLIF(COUNT(*) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) as video_preference,\n",
    "    CASE WHEN engagement_score > 80 THEN 1 ELSE 0 END as high_engagement\n",
    "FROM media.analytics.content_engagement\n",
    "\"\"\")\n",
    "\n",
    "# Fill null values from window functions\n",
    "engagement_features = engagement_features.fillna(0, subset=['user_avg_engagement', 'video_preference'])\n",
    "engagement_features = engagement_features.fillna(1, subset=['user_prior_engagements'])\n",
    "\n",
    "print(f\"Created engagement prediction features for {engagement_features.count()} interactions\")\n",
    "engagement_features.groupBy(\"high_engagement\").count().show()"
   ]
  },
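  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The label counts above imply an imbalanced target. As a quick worked check, using only the counts printed by the previous cell (76,722 positives out of 299,696 interactions):\n",
    "\n",
    "$$\\text{positive rate} = \\frac{76722}{76722 + 222974} = \\frac{76722}{299696} \\approx 0.256$$\n",
    "\n",
    "A classifier that always predicts the negative class would therefore already reach about $1 - 0.256 = 0.744$ accuracy on this data, so accuracy alone is a weak yardstick here; AUC, precision, and recall are more informative."
   ]
  },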
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Training set: 239756 interactions\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Test set: 59940 interactions\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Feature engineering for engagement prediction\n",
    "\n",
    "# Create indexers for categorical features\n",
    "content_type_indexer = StringIndexer(inputCol=\"content_type\", outputCol=\"content_type_index\")\n",
    "device_type_indexer = StringIndexer(inputCol=\"device_type\", outputCol=\"device_type_index\")\n",
    "\n",
    "# Assemble features for the model\n",
    "feature_cols = [\"watch_time\", \"engagement_hour\", \"engagement_day_of_week\", \n",
    "                \"user_avg_engagement\", \"user_prior_engagements\", \"video_preference\", \n",
    "                \"content_type_index\", \"device_type_index\"]\n",
    "\n",
    "assembler = VectorAssembler(\n",
    "    inputCols=feature_cols,\n",
    "    outputCol=\"features\"\n",
    ")\n",
    "\n",
    "# Scale features\n",
    "scaler = StandardScaler(inputCol=\"features\", outputCol=\"scaled_features\")\n",
    "\n",
    "# Create and train the model\n",
    "rf = RandomForestClassifier(\n",
    "    labelCol=\"high_engagement\", \n",
    "    featuresCol=\"scaled_features\",\n",
    "    numTrees=100,\n",
    "    maxDepth=10\n",
    ")\n",
    "\n",
    "# Create pipeline\n",
    "pipeline = Pipeline(stages=[content_type_indexer, device_type_indexer, assembler, scaler, rf])\n",
    "\n",
    "# Split data\n",
    "train_data, test_data = engagement_features.randomSplit([0.8, 0.2], seed=42)\n",
    "\n",
    "print(f\"Training set: {train_data.count()} interactions\")\n",
    "print(f\"Test set: {test_data.count()} interactions\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Training content engagement prediction model...\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Model AUC: 0.8165\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+------------+----------+---------------+----------+--------------------+\n",
       "|   user_id|content_type|watch_time|high_engagement|prediction|         probability|\n",
       "+----------+------------+----------+---------------+----------+--------------------+\n",
       "|USER000004| Live Stream|     94.86|              1|       1.0|[0.41142401408534...|\n",
       "|USER000004|     Podcast|     64.35|              0|       0.0|[0.62779094770207...|\n",
       "|USER000004|     Article|     12.64|              0|       0.0|[0.82686425125836...|\n",
       "|USER000004|     Article|      7.65|              0|       0.0|[0.96838202801927...|\n",
       "|USER000004|       Video|     10.76|              0|       0.0|[0.70597107984159...|\n",
       "|USER000004|     Article|       2.7|              0|       0.0|[0.97082017356439...|\n",
       "|USER000004|     Podcast|     44.58|              0|       0.0|[0.92652018088369...|\n",
       "|USER000004|     Podcast|     64.41|              0|       0.0|[0.76833594624079...|\n",
       "|USER000007|     Podcast|     16.51|              0|       0.0|[0.59077696904489...|\n",
       "|USER000007|       Video|     24.81|              0|       0.0|[0.72895504182725...|\n",
       "|USER000007|       Video|      4.38|              0|       0.0|[0.86603321105772...|\n",
       "|USER000011|     Article|     15.42|              0|       0.0|[0.96904811552306...|\n",
       "|USER000011| Live Stream|     94.83|              1|       0.0|[0.72958280919723...|\n",
       "|USER000011| Live Stream|     44.25|              1|       1.0|[0.24880520729266...|\n",
       "|USER000011|       Video|     26.61|              0|       0.0|[0.73453869566091...|\n",
       "+----------+------------+----------+---------------+----------+--------------------+\n",
       "only showing top 15 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+---------------+----------+-----+\n",
       "|high_engagement|prediction|count|\n",
       "+---------------+----------+-----+\n",
       "|              1|       0.0| 8732|\n",
       "|              0|       0.0|40513|\n",
       "|              1|       1.0| 6514|\n",
       "|              0|       1.0| 4181|\n",
       "+---------------+----------+-----+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Train the engagement prediction model\n",
    "\n",
    "print(\"Training content engagement prediction model...\")\n",
    "model = pipeline.fit(train_data)\n",
    "\n",
    "# Make predictions\n",
    "predictions = model.transform(test_data)\n",
    "\n",
    "# Evaluate the model\n",
    "evaluator = BinaryClassificationEvaluator(labelCol=\"high_engagement\", metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "\n",
    "print(f\"Model AUC: {auc:.4f}\")\n",
    "\n",
    "# Show prediction results\n",
    "predictions.select(\"user_id\", \"content_type\", \"watch_time\", \"high_engagement\", \"prediction\", \"probability\").show(15)\n",
    "\n",
    "# Calculate confusion matrix\n",
    "confusion_matrix = predictions.groupBy(\"high_engagement\", \"prediction\").count()\n",
    "confusion_matrix.show()"
   ]
  },
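  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading the confusion matrix above with `high_engagement = 1` as the positive class gives $TP = 6514$, $FP = 4181$, $FN = 8732$, and $TN = 40513$. The standard metrics follow directly from these counts (a worked check on the printed output, not additional results):\n",
    "\n",
    "$$\\text{accuracy} = \\frac{TP + TN}{TP + FP + FN + TN} = \\frac{6514 + 40513}{59940} \\approx 0.7846$$\n",
    "\n",
    "$$\\text{precision} = \\frac{TP}{TP + FP} = \\frac{6514}{10695} \\approx 0.6091$$\n",
    "\n",
    "$$\\text{recall} = \\frac{TP}{TP + FN} = \\frac{6514}{15246} \\approx 0.4273$$\n",
    "\n",
    "At the default decision threshold, the model finds fewer than half of the truly high-engagement interactions, while about six in ten of its positive predictions are correct."
   ]
  },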
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== Feature Importance for Engagement Prediction ===\n",
       "watch_time: 0.1027\n",
       "engagement_hour: 0.0104\n",
       "engagement_day_of_week: 0.0057\n",
       "user_avg_engagement: 0.0100\n",
       "user_prior_engagements: 0.0102\n",
       "video_preference: 0.0091\n",
       "content_type_index: 0.4600\n",
       "device_type_index: 0.3918\n",
       "\n",
       "=== Business Impact Analysis ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Total test interactions: 59940\n",
       "Content predicted for high engagement: 10695\n",
       "Recommendation coverage: 17.8%\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Average watch time for recommended content: 56.16 minutes\n",
       "Average watch time overall: 32.59 minutes\n",
       "Potential engagement lift: 72.3%\n",
       "\n",
       "Estimated additional watch minutes: 252,134\n",
       "Potential additional revenue: $6,303\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Model Performance:\n",
       "Accuracy: 0.7846\n",
       "Precision: 0.6091\n",
       "Recall: 0.4273\n",
       "AUC: 0.8165\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Model interpretation and business insights\n",
    "\n",
    "# Feature importance from the trained random forest (the last pipeline stage)\n",
    "rf_model = model.stages[-1]\n",
    "feature_importance = rf_model.featureImportances.toArray()\n",
    "feature_names = feature_cols  # same order as the VectorAssembler inputs\n",
    "\n",
    "print(\"=== Feature Importance for Engagement Prediction ===\")\n",
    "for name, importance in zip(feature_names, feature_importance):\n",
    "    print(f\"{name}: {importance:.4f}\")\n",
    "\n",
    "# Business impact analysis\n",
    "print(\"\\n=== Business Impact Analysis ===\")\n",
    "\n",
    "# Calculate potential impact of personalized recommendations\n",
    "high_engagement_predictions = predictions.filter(\"prediction = 1\")\n",
    "recommended_content = high_engagement_predictions.count()\n",
    "total_test_content = test_data.count()\n",
    "\n",
    "print(f\"Total test interactions: {total_test_content}\")\n",
    "print(f\"Content predicted for high engagement: {recommended_content}\")\n",
    "print(f\"Recommendation coverage: {(recommended_content/total_test_content)*100:.1f}%\")\n",
    "\n",
    "# Calculate engagement lift potential\n",
    "avg_watch_time_recommended = high_engagement_predictions.agg(F.avg(\"watch_time\")).collect()[0][0] or 0\n",
    "avg_watch_time_all = test_data.agg(F.avg(\"watch_time\")).collect()[0][0] or 0\n",
    "engagement_lift = ((avg_watch_time_recommended - avg_watch_time_all) / avg_watch_time_all) * 100\n",
    "\n",
    "print(f\"\\nAverage watch time for recommended content: {avg_watch_time_recommended:.2f} minutes\")\n",
    "print(f\"Average watch time overall: {avg_watch_time_all:.2f} minutes\")\n",
    "print(f\"Potential engagement lift: {engagement_lift:.1f}%\")\n",
    "\n",
    "# Revenue impact estimation\n",
    "avg_rpm = 25  # Average revenue per thousand minutes watched\n",
    "additional_minutes = recommended_content * (avg_watch_time_recommended - avg_watch_time_all)\n",
    "additional_revenue = (additional_minutes / 1000) * avg_rpm\n",
    "\n",
    "print(f\"\\nEstimated additional watch minutes: {additional_minutes:,.0f}\")\n",
    "print(f\"Potential additional revenue: ${additional_revenue:,.0f}\")\n",
    "\n",
    "# Accuracy metrics\n",
    "accuracy = predictions.filter(\"high_engagement = prediction\").count() / predictions.count()\n",
    "precision = predictions.filter(\"prediction = 1 AND high_engagement = 1\").count() / predictions.filter(\"prediction = 1\").count() if predictions.filter(\"prediction = 1\").count() > 0 else 0\n",
    "recall = predictions.filter(\"prediction = 1 AND high_engagement = 1\").count() / predictions.filter(\"high_engagement = 1\").count() if predictions.filter(\"high_engagement = 1\").count() > 0 else 0\n",
    "\n",
    "print(f\"\\nModel Performance:\")\n",
    "print(f\"Accuracy: {accuracy:.4f}\")\n",
    "print(f\"Precision: {precision:.4f}\")\n",
    "print(f\"Recall: {recall:.4f}\")\n",
    "print(f\"AUC: {auc:.4f}\")"
   ]
  },
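  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The business-impact figures above are simple arithmetic on the printed averages. The check below recomputes them from the rounded values, so the last digits differ slightly from the cell output, which uses unrounded averages. The rate of 25 per thousand minutes is the hard-coded `avg_rpm` value, an illustrative assumption rather than a measured figure:\n",
    "\n",
    "$$\\text{lift} = \\frac{56.16 - 32.59}{32.59} \\approx 0.723$$\n",
    "\n",
    "$$\\text{additional minutes} \\approx 10695 \\times (56.16 - 32.59) \\approx 252{,}081$$\n",
    "\n",
    "$$\\text{additional revenue} \\approx \\frac{252134}{1000} \\times 25 \\approx 6303$$\n",
    "\n",
    "Because `watch_time` is itself a model input, the lift describes the rows the model selects rather than a measured effect of showing recommendations."
   ]
  },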
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways: Delta Liquid Clustering + ML in AIDP\n",
    "\n",
    "### What We Demonstrated\n",
    "\n",
    "1. **Automatic Optimization**: Created a table with `CLUSTER BY (user_id, engagement_date)` and let Delta automatically optimize data layout\n",
    "\n",
    "2. **Performance Benefits**: Queries on clustered columns (user_id, engagement_date) are significantly faster due to data locality\n",
    "\n",
    "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n",
    "\n",
    "4. **Machine Learning Integration**: Trained a content engagement prediction model using the optimized data\n",
    "\n",
    "5. **Real-World Use Case**: Media analytics where content engagement and user behavior analysis are critical\n",
    "\n",
    "### AIDP Advantages\n",
    "\n",
    "- **Unified Analytics**: Seamlessly integrates data optimization with ML\n",
    "- **Governance**: Catalog and schema isolation for media data\n",
    "- **Performance**: Optimized for both analytical queries and ML training\n",
    "- **Scalability**: Handles media-scale data volumes effortlessly\n",
    "\n",
    "### Business Benefits for Media\n",
    "\n",
    "1. **Personalization**: AI-driven content recommendations increase engagement\n",
    "2. **Revenue Growth**: Higher watch time drives advertising and subscription revenue\n",
    "3. **User Retention**: Better content discovery reduces churn\n",
    "4. **Content Strategy**: Data-driven decisions for content creation and acquisition\n",
    "5. **Platform Optimization**: Device-specific recommendations improve user experience\n",
    "\n",
    "### Best Practices for Media Analytics\n",
    "\n",
    "1. **Choose clustering columns** based on your most common query patterns\n",
    "2. **Start with 1-4 columns** - too many can reduce effectiveness\n",
    "3. **Consider cardinality** - high-cardinality columns work best\n",
    "4. **Monitor and adjust** as query patterns evolve\n",
    "5. **Combine with ML** for predictive analytics and automation\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "- Explore other AIDP ML features like AutoML\n",
    "- Try liquid clustering with different column combinations\n",
    "- Scale up to larger media datasets\n",
    "- Integrate with real content management and streaming platforms\n",
    "- Deploy models for real-time content recommendations\n",
    "\n",
    "This notebook demonstrates how Oracle AI Data Platform makes advanced media analytics accessible while maintaining enterprise-grade performance and governance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}