{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Telecommunications: Delta Liquid Clustering Demo\n",
    "\n",
    "\n",
    "## Overview\n",
    "\n",
    "\n",
    "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a telecommunications analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n",
    "\n",
    "### What is Liquid Clustering?\n",
    "\n",
    "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n",
    "\n",
    "- **Automatic optimization**: No manual tuning required\n",
    "- **Improved query performance**: Faster queries on clustered columns\n",
    "- **Reduced maintenance**: No need for manual repartitioning\n",
    "- **Adaptive clustering**: Adjusts as data patterns change\n",
    "\n",
    "### Use Case: Network Performance Monitoring and Customer Experience Analytics\n",
    "\n",
    "We'll analyze telecommunications network performance and customer usage data. Our clustering strategy will optimize for:\n",
    "\n",
    "- **Customer-specific queries**: Fast lookups by subscriber ID\n",
    "- **Time-based analysis**: Efficient filtering by call/service date\n",
    "- **Network performance patterns**: Quick aggregation by cell tower and service quality metrics\n",
    "\n",
    "### AIDP Environment Setup\n",
    "\n",
    "This notebook leverages the existing Spark session in your AIDP environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create telecommunications catalog and analytics schema\n",
    "\n",
    "# In AIDP, catalogs provide data isolation and governance\n",
    "\n",
    "spark.sql(\"CREATE CATALOG IF NOT EXISTS telecom\")\n",
    "\n",
    "spark.sql(\"CREATE SCHEMA IF NOT EXISTS telecom.analytics\")\n",
    "\n",
    "print(\"Telecommunications catalog and analytics schema created successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Create Delta Table with Liquid Clustering\n",
    "\n",
    "### Table Design\n",
    "\n",
    "Our `network_usage` table will store:\n",
    "\n",
    "- **subscriber_id**: Unique customer identifier\n",
    "- **usage_date**: Date and time of service usage\n",
    "- **service_type**: Type (Voice, Data, SMS, Streaming)\n",
    "- **data_volume**: Data consumed (GB)\n",
    "- **call_duration**: Call length (minutes)\n",
    "- **cell_tower_id**: Network cell tower identifier\n",
    "- **signal_quality**: Network signal strength (0-100)\n",
    "\n",
    "### Clustering Strategy\n",
    "\n",
    "We'll cluster by `subscriber_id` and `usage_date` because:\n",
    "\n",
    "- **subscriber_id**: Customers generate multiple service interactions, grouping their usage patterns together\n",
    "- **usage_date**: Time-based queries are critical for billing cycles, network planning, and customer behavior analysis\n",
    "- This combination optimizes for both customer analytics and temporal network performance monitoring"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Delta table with liquid clustering created successfully!\n",
       "Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Create Delta table with liquid clustering\n",
    "\n",
    "# CLUSTER BY defines the columns for automatic optimization\n",
    "\n",
    "spark.sql(\"\"\"\n",
    "\n",
    "CREATE TABLE IF NOT EXISTS telecom.analytics.network_usage (\n",
    "\n",
    "    subscriber_id STRING,\n",
    "\n",
    "    usage_date TIMESTAMP,\n",
    "\n",
    "    service_type STRING,\n",
    "\n",
    "    data_volume DECIMAL(10,3),\n",
    "\n",
    "    call_duration DECIMAL(8,2),\n",
    "\n",
    "    cell_tower_id STRING,\n",
    "\n",
    "    signal_quality INT\n",
    "\n",
    ")\n",
    "\n",
    "USING DELTA\n",
    "\n",
    "CLUSTER BY (subscriber_id, usage_date)\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "print(\"Delta table with liquid clustering created successfully!\")\n",
    "\n",
    "print(\"Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Generate Telecommunications Sample Data\n",
    "\n",
    "### Data Generation Strategy\n",
    "\n",
    "We'll create realistic telecommunications usage data including:\n",
    "\n",
    "- **10,000 subscribers** with multiple service interactions over time\n",
    "- **Service types**: Voice calls, Data usage, SMS, Video streaming\n",
    "- **Realistic usage patterns**: Peak hours, weekend vs weekday patterns, roaming\n",
    "- **Network infrastructure**: Multiple cell towers with varying signal quality\n",
    "\n",
    "### Why This Data Pattern?\n",
    "\n",
    "This data simulates real telecommunications scenarios where:\n",
    "\n",
    "- Customer usage varies by time of day and service type\n",
    "- Network performance impacts customer experience\n",
    "- Billing and service quality require temporal analysis\n",
    "- Capacity planning depends on usage patterns\n",
    "- Fraud detection needs real-time monitoring"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Generated 599492 network usage records\n",
       "Sample record: {'subscriber_id': 'SUB00000001', 'usage_date': datetime.datetime(2024, 12, 12, 20, 52), 'service_type': 'Data', 'data_volume': 0.753, 'call_duration': 0.0, 'cell_tower_id': 'TOWER_MIA_005', 'signal_quality': 64}\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Generate sample telecommunications usage data\n",
    "\n",
    "# Using fully qualified imports to avoid conflicts\n",
    "\n",
    "import random\n",
    "\n",
    "from datetime import datetime, timedelta\n",
    "\n",
    "\n",
    "# Define telecommunications data constants\n",
    "\n",
    "SERVICE_TYPES = ['Voice', 'Data', 'SMS', 'Streaming']\n",
    "\n",
    "CELL_TOWERS = ['TOWER_NYC_001', 'TOWER_LAX_002', 'TOWER_CHI_003', 'TOWER_HOU_004', 'TOWER_MIA_005', 'TOWER_SFO_006', 'TOWER_SEA_007']\n",
    "\n",
    "# Base usage parameters by service type\n",
    "\n",
    "USAGE_PARAMS = {\n",
    "\n",
    "    'Voice': {'avg_duration': 5.0, 'frequency': 8, 'data_volume': 0.0},\n",
    "\n",
    "    'Data': {'avg_duration': 0.0, 'frequency': 15, 'data_volume': 0.5},\n",
    "\n",
    "    'SMS': {'avg_duration': 0.0, 'frequency': 12, 'data_volume': 0.0},\n",
    "\n",
    "    'Streaming': {'avg_duration': 0.0, 'frequency': 6, 'data_volume': 2.0}\n",
    "\n",
    "}\n",
    "\n",
    "\n",
    "# Generate network usage records\n",
    "\n",
    "usage_data = []\n",
    "\n",
    "base_date = datetime(2024, 1, 1)\n",
    "\n",
    "\n",
    "# Create 10,000 subscribers with 20-100 usage events each\n",
    "\n",
    "for subscriber_num in range(1, 10001):\n",
    "\n",
    "    subscriber_id = f\"SUB{subscriber_num:08d}\"\n",
    "    \n",
    "    # Each subscriber gets 20-100 usage events over 12 months\n",
    "\n",
    "    num_events = random.randint(20, 100)\n",
    "    \n",
    "    for i in range(num_events):\n",
    "\n",
    "        # Spread usage events over 12 months\n",
    "\n",
    "        days_offset = random.randint(0, 365)\n",
    "\n",
    "        usage_date = base_date + timedelta(days=days_offset)\n",
    "        \n",
    "        # Add realistic timing (more usage during business hours and evenings)\n",
    "\n",
    "        hour_weights = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]\n",
    "\n",
    "        hours_offset = random.choices(range(24), weights=hour_weights)[0]\n",
    "\n",
    "        usage_date = usage_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)\n",
    "        \n",
    "        # Select service type\n",
    "\n",
    "        service_type = random.choice(SERVICE_TYPES)\n",
    "\n",
    "        params = USAGE_PARAMS[service_type]\n",
    "        \n",
    "        # Calculate usage metrics with variability\n",
    "\n",
    "        if service_type == 'Voice':\n",
    "\n",
    "            duration_variation = random.uniform(0.3, 3.0)\n",
    "\n",
    "            call_duration = round(params['avg_duration'] * duration_variation, 2)\n",
    "\n",
    "            data_volume = 0.0\n",
    "\n",
    "        elif service_type == 'Data':\n",
    "\n",
    "            data_variation = random.uniform(0.1, 5.0)\n",
    "\n",
    "            data_volume = round(params['data_volume'] * data_variation, 3)\n",
    "\n",
    "            call_duration = 0.0\n",
    "\n",
    "        elif service_type == 'SMS':\n",
    "\n",
    "            data_volume = 0.0\n",
    "\n",
    "            call_duration = 0.0\n",
    "\n",
    "        else:  # Streaming\n",
    "\n",
    "            data_variation = random.uniform(0.5, 8.0)\n",
    "\n",
    "            data_volume = round(params['data_volume'] * data_variation, 3)\n",
    "\n",
    "            call_duration = 0.0\n",
    "        \n",
    "        # Select cell tower and signal quality\n",
    "\n",
    "        cell_tower_id = random.choice(CELL_TOWERS)\n",
    "\n",
    "        # Signal quality varies by tower and time\n",
    "\n",
    "        base_signal = random.randint(60, 95)\n",
    "\n",
    "        signal_variation = random.randint(-15, 5)\n",
    "\n",
    "        signal_quality = max(0, min(100, base_signal + signal_variation))\n",
    "        \n",
    "        usage_data.append({\n",
    "\n",
    "            \"subscriber_id\": subscriber_id,\n",
    "\n",
    "            \"usage_date\": usage_date,\n",
    "\n",
    "            \"service_type\": service_type,\n",
    "\n",
    "            \"data_volume\": data_volume,\n",
    "\n",
    "            \"call_duration\": call_duration,\n",
    "\n",
    "            \"cell_tower_id\": cell_tower_id,\n",
    "\n",
    "            \"signal_quality\": signal_quality\n",
    "\n",
    "        })\n",
    "\n",
    "\n",
    "\n",
    "print(f\"Generated {len(usage_data)} network usage records\")\n",
    "\n",
    "print(\"Sample record:\", usage_data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Insert Data Using PySpark\n",
    "\n",
    "### Data Insertion Strategy\n",
    "\n",
    "We'll use PySpark to:\n",
    "\n",
    "1. **Create DataFrame** from our generated data\n",
    "2. **Insert into Delta table** with liquid clustering\n",
    "3. **Verify the insertion** with a sample query\n",
    "\n",
    "### Why PySpark for Insertion?\n",
    "\n",
    "- **Distributed processing**: Handles large datasets efficiently\n",
    "- **Type safety**: Ensures data integrity\n",
    "- **Optimization**: Leverages Spark's query optimization\n",
    "- **Liquid clustering**: Automatically applies clustering during insertion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame Schema:\n",
       "root\n",
       " |-- call_duration: double (nullable = true)\n",
       " |-- cell_tower_id: string (nullable = true)\n",
       " |-- data_volume: double (nullable = true)\n",
       " |-- service_type: string (nullable = true)\n",
       " |-- signal_quality: long (nullable = true)\n",
       " |-- subscriber_id: string (nullable = true)\n",
       " |-- usage_date: timestamp (nullable = true)\n",
       "\n",
       "\n",
       "Sample Data:\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n",
       "|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id|         usage_date|\n",
       "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n",
       "|          0.0|TOWER_MIA_005|      0.753|        Data|            64|  SUB00000001|2024-12-12 20:52:00|\n",
       "|         9.03|TOWER_NYC_001|        0.0|       Voice|            69|  SUB00000001|2024-12-18 09:32:00|\n",
       "|         5.75|TOWER_CHI_003|        0.0|       Voice|            58|  SUB00000001|2024-10-05 17:21:00|\n",
       "|          0.0|TOWER_HOU_004|     12.468|   Streaming|            70|  SUB00000001|2024-12-18 09:32:00|\n",
       "|          0.0|TOWER_NYC_001|      0.394|        Data|            63|  SUB00000001|2024-04-30 02:00:00|\n",
       "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n",
       "only showing top 5 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Successfully inserted 599492 records into telecom.analytics.network_usage\n",
       "Liquid clustering automatically optimized the data layout during insertion!\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Insert data using PySpark DataFrame operations\n",
    "\n",
    "# Using fully qualified function references to avoid conflicts\n",
    "\n",
    "\n",
    "# Create DataFrame from generated data\n",
    "\n",
    "df_usage = spark.createDataFrame(usage_data)\n",
    "\n",
    "\n",
    "# Display schema and sample data\n",
    "\n",
    "print(\"DataFrame Schema:\")\n",
    "\n",
    "df_usage.printSchema()\n",
    "\n",
    "\n",
    "\n",
    "print(\"\\nSample Data:\")\n",
    "\n",
    "df_usage.show(5)\n",
    "\n",
    "\n",
    "# Insert data into Delta table with liquid clustering\n",
    "\n",
    "# The CLUSTER BY (subscriber_id, usage_date) will automatically optimize the data layout\n",
    "\n",
    "df_usage.write.mode(\"overwrite\").saveAsTable(\"telecom.analytics.network_usage\")\n",
    "\n",
    "\n",
    "print(f\"\\nSuccessfully inserted {df_usage.count()} records into telecom.analytics.network_usage\")\n",
    "\n",
    "print(\"Liquid clustering automatically optimized the data layout during insertion!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Demonstrate Liquid Clustering Benefits\n",
    "\n",
    "### Query Performance Analysis\n",
    "\n",
    "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n",
    "\n",
    "1. **Subscriber usage history** (clustered by subscriber_id)\n",
    "2. **Time-based network analysis** (clustered by usage_date)\n",
    "3. **Combined subscriber + time queries** (optimal for our clustering)\n",
    "\n",
    "### Expected Performance Benefits\n",
    "\n",
    "With liquid clustering, these queries should be significantly faster because:\n",
    "\n",
    "- **Data locality**: Related records are physically grouped together\n",
    "- **Reduced I/O**: Less data needs to be read from disk\n",
    "- **Automatic optimization**: No manual tuning required"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== Query 1: Subscriber Usage History ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+-------------------+------------+-----------+-------------+--------------+\n",
       "|subscriber_id|         usage_date|service_type|data_volume|call_duration|signal_quality|\n",
       "+-------------+-------------------+------------+-----------+-------------+--------------+\n",
       "|  SUB00000001|2024-12-31 23:30:00|   Streaming|     15.036|          0.0|            70|\n",
       "|  SUB00000001|2024-12-28 19:26:00|         SMS|        0.0|          0.0|            68|\n",
       "|  SUB00000001|2024-12-28 00:56:00|       Voice|        0.0|         7.09|            89|\n",
       "|  SUB00000001|2024-12-26 08:25:00|       Voice|        0.0|        10.12|            87|\n",
       "|  SUB00000001|2024-12-18 09:32:00|       Voice|        0.0|         9.03|            69|\n",
       "|  SUB00000001|2024-12-18 09:32:00|   Streaming|     12.468|          0.0|            70|\n",
       "|  SUB00000001|2024-12-18 07:42:00|       Voice|        0.0|         5.89|            86|\n",
       "|  SUB00000001|2024-12-15 12:21:00|       Voice|        0.0|        11.78|            52|\n",
       "|  SUB00000001|2024-12-12 20:52:00|        Data|      0.753|          0.0|            64|\n",
       "|  SUB00000001|2024-11-25 13:25:00|         SMS|        0.0|          0.0|            86|\n",
       "|  SUB00000001|2024-11-07 08:51:00|         SMS|        0.0|          0.0|            76|\n",
       "|  SUB00000001|2024-11-03 20:38:00|   Streaming|     11.141|          0.0|            86|\n",
       "|  SUB00000001|2024-10-26 04:34:00|         SMS|        0.0|          0.0|            69|\n",
       "|  SUB00000001|2024-10-25 18:41:00|         SMS|        0.0|          0.0|            68|\n",
       "|  SUB00000001|2024-10-23 02:58:00|       Voice|        0.0|        13.37|            81|\n",
       "|  SUB00000001|2024-10-10 07:11:00|         SMS|        0.0|          0.0|            72|\n",
       "|  SUB00000001|2024-10-06 16:33:00|        Data|       0.15|          0.0|            92|\n",
       "|  SUB00000001|2024-10-05 17:21:00|       Voice|        0.0|         5.75|            58|\n",
       "|  SUB00000001|2024-10-01 22:15:00|   Streaming|      8.424|          0.0|            88|\n",
       "|  SUB00000001|2024-09-30 17:46:00|         SMS|        0.0|          0.0|            69|\n",
       "+-------------+-------------------+------------+-----------+-------------+--------------+\n",
       "only showing top 20 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Records found: 77\n",
       "\n",
       "=== Query 2: Recent Network Quality Issues ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------------+-------------+-------------+--------------+------------+\n",
       "|         usage_date|subscriber_id|cell_tower_id|signal_quality|service_type|\n",
       "+-------------------+-------------+-------------+--------------+------------+\n",
       "|2024-12-31 18:48:00|  SUB00008783|TOWER_LAX_002|            45|   Streaming|\n",
       "|2024-12-31 16:33:00|  SUB00003787|TOWER_MIA_005|            45|   Streaming|\n",
       "|2024-12-31 14:47:00|  SUB00003103|TOWER_CHI_003|            45|   Streaming|\n",
       "|2024-12-31 10:42:00|  SUB00007036|TOWER_SFO_006|            45|       Voice|\n",
       "|2024-12-30 15:57:00|  SUB00009393|TOWER_NYC_001|            45|       Voice|\n",
       "|2024-12-30 14:06:00|  SUB00003452|TOWER_MIA_005|            45|   Streaming|\n",
       "|2024-12-30 10:08:00|  SUB00007996|TOWER_CHI_003|            45|        Data|\n",
       "|2024-12-30 09:53:00|  SUB00001662|TOWER_LAX_002|            45|       Voice|\n",
       "|2024-12-29 20:16:00|  SUB00005675|TOWER_LAX_002|            45|         SMS|\n",
       "|2024-12-28 19:45:00|  SUB00001138|TOWER_CHI_003|            45|         SMS|\n",
       "|2024-12-28 14:19:00|  SUB00001952|TOWER_LAX_002|            45|   Streaming|\n",
       "|2024-12-26 20:22:00|  SUB00001295|TOWER_SFO_006|            45|        Data|\n",
       "|2024-12-26 08:53:00|  SUB00003791|TOWER_NYC_001|            45|        Data|\n",
       "|2024-12-26 08:11:00|  SUB00001829|TOWER_HOU_004|            45|        Data|\n",
       "|2024-12-25 18:06:00|  SUB00007149|TOWER_LAX_002|            45|   Streaming|\n",
       "|2024-12-25 17:55:00|  SUB00004022|TOWER_CHI_003|            45|       Voice|\n",
       "|2024-12-24 21:16:00|  SUB00000262|TOWER_CHI_003|            45|       Voice|\n",
       "|2024-12-24 14:31:00|  SUB00005090|TOWER_SFO_006|            45|   Streaming|\n",
       "|2024-12-24 13:52:00|  SUB00009977|TOWER_SFO_006|            45|   Streaming|\n",
       "|2024-12-23 20:05:00|  SUB00005866|TOWER_SFO_006|            45|        Data|\n",
       "+-------------------+-------------+-------------+--------------+------------+\n",
       "only showing top 20 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Network quality issues found: 6909\n",
       "\n",
       "=== Query 3: Subscriber Data Usage Trends ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+-------------------+------------+-----------+-------------+\n",
       "|subscriber_id|         usage_date|service_type|data_volume|call_duration|\n",
       "+-------------+-------------------+------------+-----------+-------------+\n",
       "|  SUB00000001|2024-04-07 14:03:00|   Streaming|       6.03|          0.0|\n",
       "|  SUB00000001|2024-04-11 10:25:00|       Voice|        0.0|        13.75|\n",
       "|  SUB00000001|2024-04-15 13:03:00|       Voice|        0.0|         7.19|\n",
       "|  SUB00000001|2024-04-17 19:44:00|         SMS|        0.0|          0.0|\n",
       "|  SUB00000001|2024-04-19 18:04:00|       Voice|        0.0|        10.23|\n",
       "|  SUB00000001|2024-04-26 19:42:00|   Streaming|      3.414|          0.0|\n",
       "|  SUB00000001|2024-04-30 02:00:00|        Data|      0.394|          0.0|\n",
       "|  SUB00000001|2024-04-30 06:16:00|         SMS|        0.0|          0.0|\n",
       "|  SUB00000001|2024-05-07 11:30:00|        Data|      0.577|          0.0|\n",
       "|  SUB00000001|2024-05-08 10:26:00|   Streaming|      6.758|          0.0|\n",
       "|  SUB00000001|2024-05-09 13:10:00|        Data|      0.388|          0.0|\n",
       "|  SUB00000001|2024-05-11 08:42:00|        Data|      1.548|          0.0|\n",
       "|  SUB00000001|2024-05-12 18:11:00|         SMS|        0.0|          0.0|\n",
       "|  SUB00000001|2024-05-28 06:28:00|         SMS|        0.0|          0.0|\n",
       "|  SUB00000001|2024-05-29 18:01:00|         SMS|        0.0|          0.0|\n",
       "|  SUB00000001|2024-06-04 16:43:00|        Data|      1.258|          0.0|\n",
       "|  SUB00000001|2024-06-05 05:18:00|        Data|      0.253|          0.0|\n",
       "|  SUB00000001|2024-06-08 13:06:00|       Voice|        0.0|         10.2|\n",
       "|  SUB00000001|2024-06-22 14:22:00|   Streaming|     14.625|          0.0|\n",
       "|  SUB00000001|2024-06-22 16:07:00|        Data|      0.912|          0.0|\n",
       "+-------------+-------------------+------------+-----------+-------------+\n",
       "only showing top 20 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Usage trend records found: 4454\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Demonstrate liquid clustering benefits with optimized queries\n",
    "\n",
    "\n",
    "# Query 1: Subscriber usage history - benefits from subscriber_id clustering\n",
    "\n",
    "print(\"=== Query 1: Subscriber Usage History ===\")\n",
    "\n",
    "subscriber_history = spark.sql(\"\"\"\n",
    "\n",
    "SELECT subscriber_id, usage_date, service_type, data_volume, call_duration, signal_quality\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "WHERE subscriber_id = 'SUB00000001'\n",
    "\n",
    "ORDER BY usage_date DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "subscriber_history.show()\n",
    "\n",
    "print(f\"Records found: {subscriber_history.count()}\")\n",
    "\n",
    "\n",
    "\n",
    "# Query 2: Time-based network quality analysis - benefits from usage_date clustering\n",
    "\n",
    "print(\"\\n=== Query 2: Recent Network Quality Issues ===\")\n",
    "\n",
    "network_quality = spark.sql(\"\"\"\n",
    "\n",
    "SELECT usage_date, subscriber_id, cell_tower_id, signal_quality, service_type\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "WHERE usage_date >= '2024-06-01' AND signal_quality < 50\n",
    "\n",
    "ORDER BY signal_quality ASC, usage_date DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "network_quality.show()\n",
    "\n",
    "print(f\"Network quality issues found: {network_quality.count()}\")\n",
    "\n",
    "\n",
    "\n",
    "# Query 3: Combined subscriber + time query - optimal for our clustering strategy\n",
    "\n",
    "print(\"\\n=== Query 3: Subscriber Data Usage Trends ===\")\n",
    "\n",
    "usage_trends = spark.sql(\"\"\"\n",
    "\n",
    "SELECT subscriber_id, usage_date, service_type, data_volume, call_duration\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "WHERE subscriber_id LIKE 'SUB000000%' AND usage_date >= '2024-04-01'\n",
    "\n",
    "ORDER BY subscriber_id, usage_date\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "usage_trends.show()\n",
    "\n",
    "print(f\"Usage trend records found: {usage_trends.count()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Analyze Clustering Effectiveness\n",
    "\n",
    "### Understanding the Impact\n",
    "\n",
    "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the telecommunications insights possible with this optimized structure.\n",
    "\n",
    "### Key Analytics\n",
    "\n",
    "- **Subscriber usage patterns** and data consumption analysis\n",
    "- **Network performance metrics** and signal quality trends\n",
    "- **Service type adoption** and usage distribution\n",
    "- **Cell tower utilization** and capacity planning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== Subscriber Usage Analysis ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+--------------+-------------+------------------+------------------+-------------+\n",
       "|subscriber_id|total_sessions|total_data_gb|total_call_minutes|avg_signal_quality|services_used|\n",
       "+-------------+--------------+-------------+------------------+------------------+-------------+\n",
       "|  SUB00003991|            96|       405.35|             99.46|             73.69|            4|\n",
       "|  SUB00008402|            94|      373.892|             196.9|             72.46|            4|\n",
       "|  SUB00002884|            96|       366.61|            202.38|             72.77|            4|\n",
       "|  SUB00001178|            94|      354.778|            205.55|             71.47|            4|\n",
       "|  SUB00007652|            99|      351.278|            121.24|              71.8|            4|\n",
       "|  SUB00006080|            80|      349.189|            102.71|             72.58|            4|\n",
       "|  SUB00002701|            96|      349.039|            173.09|             71.91|            4|\n",
       "|  SUB00006118|            96|      348.346|            116.05|             71.79|            4|\n",
       "|  SUB00002240|            97|      346.885|             183.9|             73.44|            4|\n",
       "|  SUB00004370|           100|       345.46|            214.57|             72.53|            4|\n",
       "|  SUB00007000|            91|      343.121|            187.22|             72.98|            4|\n",
       "|  SUB00001870|            89|      342.377|            130.75|             72.37|            4|\n",
       "|  SUB00009297|            84|      339.809|             96.16|             72.52|            4|\n",
       "|  SUB00009761|           100|      339.532|            170.73|             71.64|            4|\n",
       "|  SUB00001463|            93|      339.301|            204.39|             72.54|            4|\n",
       "|  SUB00006741|            99|      338.496|            170.65|             73.08|            4|\n",
       "|  SUB00005865|           100|       337.23|            241.68|             72.31|            4|\n",
       "|  SUB00000283|            97|      337.002|            215.66|              71.4|            4|\n",
       "|  SUB00005090|           100|       336.74|            214.14|             72.23|            4|\n",
       "|  SUB00000783|            98|      334.239|            172.57|             72.13|            4|\n",
       "+-------------+--------------+-------------+------------------+------------------+-------------+\n",
       "only showing top 20 rows\n",
       "\n",
       "\n",
       "=== Service Type Usage Patterns ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+------------+-----------+-------------+------------------+------------------+------------------+\n",
       "|service_type|total_usage|total_data_gb|total_call_minutes|avg_signal_quality|unique_subscribers|\n",
       "+------------+-----------+-------------+------------------+------------------+------------------+\n",
       "|         SMS|     150136|          0.0|               0.0|             72.52|              9999|\n",
       "|   Streaming|     150049|  1275253.862|               0.0|             72.52|             10000|\n",
       "|        Data|     149947|   190729.154|               0.0|             72.47|              9999|\n",
       "|       Voice|     149360|          0.0|         1230155.2|             72.48|              9999|\n",
       "+------------+-----------+-------------+------------------+------------------+------------------+\n",
       "\n",
       "\n",
       "=== Cell Tower Performance ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+-----------------+------------------+------------------+-------------+------------------+\n",
       "|cell_tower_id|total_connections|unique_subscribers|avg_signal_quality|total_data_gb|total_call_minutes|\n",
       "+-------------+-----------------+------------------+------------------+-------------+------------------+\n",
       "|TOWER_LAX_002|            86180|              9959|             72.53|   209550.744|          176764.7|\n",
       "|TOWER_HOU_004|            85831|              9956|             72.48|   211175.342|          175565.2|\n",
       "|TOWER_MIA_005|            85690|              9954|             72.47|   210204.599|         177086.58|\n",
       "|TOWER_SFO_006|            85587|              9966|             72.47|   209599.497|         175356.27|\n",
       "|TOWER_SEA_007|            85580|              9966|             72.54|   209181.485|          175116.0|\n",
       "|TOWER_CHI_003|            85318|              9969|             72.48|   209784.082|         174256.42|\n",
       "|TOWER_NYC_001|            85306|              9953|             72.51|   206487.267|         176010.03|\n",
       "+-------------+-----------------+------------------+------------------+-------------+------------------+\n",
       "\n",
       "\n",
       "=== Hourly Usage Patterns ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-----------+------------+--------------+------------+------------------+\n",
       "|hour_of_day|usage_events|data_volume_gb|call_minutes|avg_signal_quality|\n",
       "+-----------+------------+--------------+------------+------------------+\n",
       "|          0|        4716|     11146.131|      9561.2|             72.55|\n",
       "|          1|        4647|     11412.267|     9520.85|             72.53|\n",
       "|          2|        4818|     11734.839|     9706.25|             72.63|\n",
       "|          3|        4828|      12269.49|    10174.36|             72.44|\n",
       "|          4|        4697|     11388.678|     9634.72|             72.76|\n",
       "|          5|        9411|     22767.134|    19881.45|             72.51|\n",
       "|          6|       18906|     46540.893|    39919.24|             72.49|\n",
       "|          7|       28338|     68891.571|    57707.75|             72.34|\n",
       "|          8|       37902|     92559.901|    78153.24|             72.53|\n",
       "|          9|       33009|     81097.378|     68248.8|             72.49|\n",
       "|         10|       28797|      72193.79|    58398.59|             72.53|\n",
       "|         11|       38065|      92732.12|    77693.57|              72.5|\n",
       "|         12|       42829|    103465.221|    87559.16|             72.52|\n",
       "|         13|       38098|     92053.008|     80046.4|             72.39|\n",
       "|         14|       33479|     81524.062|    67553.13|             72.52|\n",
       "|         15|       28476|     68810.288|    58093.14|             72.52|\n",
       "|         16|       38151|     93865.084|    77264.46|             72.47|\n",
       "|         17|       42901|    106396.719|    88105.96|             72.53|\n",
       "|         18|       47600|    116729.992|    96568.02|             72.42|\n",
       "|         19|       38308|     94056.492|    78668.55|             72.63|\n",
       "+-----------+------------+--------------+------------+------------------+\n",
       "only showing top 20 rows\n",
       "\n",
       "\n",
       "=== Monthly Network Trends ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------+-----------+---------------+--------------------+------------------+------------------+\n",
       "|  month|total_usage|monthly_data_gb|monthly_call_minutes|avg_signal_quality|active_subscribers|\n",
       "+-------+-----------+---------------+--------------------+------------------+------------------+\n",
       "|2024-01|      50986|     123095.808|           105952.17|             72.52|              9749|\n",
       "|2024-02|      47609|     116650.232|            97923.63|             72.48|              9702|\n",
       "|2024-03|      50870|     123873.978|            105374.2|             72.48|              9717|\n",
       "|2024-04|      48916|     120802.198|           100590.86|             72.46|              9726|\n",
       "|2024-05|      50756|     124252.466|            104269.3|             72.53|              9753|\n",
       "|2024-06|      49318|     120021.321|            100302.2|             72.62|              9728|\n",
       "|2024-07|      50493|     124146.114|           103128.33|             72.45|              9739|\n",
       "|2024-08|      50693|     125209.133|           103566.23|              72.5|              9765|\n",
       "|2024-09|      48711|     120633.907|            97907.64|             72.45|              9724|\n",
       "|2024-10|      50898|     123995.643|           104220.41|             72.45|              9764|\n",
       "|2024-11|      49428|      119835.05|           101962.25|              72.5|              9737|\n",
       "|2024-12|      50814|     123467.166|           104957.98|             72.54|              9749|\n",
       "+-------+-----------+---------------+--------------------+------------------+------------------+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Analyze clustering effectiveness and telecommunications insights\n",
    "\n",
    "\n",
    "# Subscriber usage analysis\n",
    "\n",
    "print(\"=== Subscriber Usage Analysis ===\")\n",
    "\n",
    "subscriber_usage = spark.sql(\"\"\"\n",
    "\n",
    "SELECT subscriber_id, COUNT(*) as total_sessions,\n",
    "\n",
    "       ROUND(SUM(data_volume), 3) as total_data_gb,\n",
    "\n",
    "       ROUND(SUM(call_duration), 2) as total_call_minutes,\n",
    "\n",
    "       ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n",
    "\n",
    "       COUNT(DISTINCT service_type) as services_used\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "GROUP BY subscriber_id\n",
    "\n",
    "ORDER BY total_data_gb DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "subscriber_usage.show()\n",
    "\n",
    "\n",
    "# Service type usage patterns\n",
    "\n",
    "print(\"\\n=== Service Type Usage Patterns ===\")\n",
    "\n",
    "service_patterns = spark.sql(\"\"\"\n",
    "\n",
    "SELECT service_type, COUNT(*) as total_usage,\n",
    "\n",
    "       ROUND(SUM(data_volume), 3) as total_data_gb,\n",
    "\n",
    "       ROUND(SUM(call_duration), 2) as total_call_minutes,\n",
    "\n",
    "       ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n",
    "\n",
    "       COUNT(DISTINCT subscriber_id) as unique_subscribers\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "GROUP BY service_type\n",
    "\n",
    "ORDER BY total_usage DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "service_patterns.show()\n",
    "\n",
    "\n",
    "# Cell tower performance\n",
    "\n",
    "print(\"\\n=== Cell Tower Performance ===\")\n",
    "\n",
    "tower_performance = spark.sql(\"\"\"\n",
    "\n",
    "SELECT cell_tower_id, COUNT(*) as total_connections,\n",
    "\n",
    "       COUNT(DISTINCT subscriber_id) as unique_subscribers,\n",
    "\n",
    "       ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n",
    "\n",
    "       ROUND(SUM(data_volume), 3) as total_data_gb,\n",
    "\n",
    "       ROUND(SUM(call_duration), 2) as total_call_minutes\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "GROUP BY cell_tower_id\n",
    "\n",
    "ORDER BY total_connections DESC\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "tower_performance.show()\n",
    "\n",
    "\n",
    "# Hourly usage patterns\n",
    "\n",
    "print(\"\\n=== Hourly Usage Patterns ===\")\n",
    "\n",
    "hourly_patterns = spark.sql(\"\"\"\n",
    "\n",
    "SELECT HOUR(usage_date) as hour_of_day, COUNT(*) as usage_events,\n",
    "\n",
    "       ROUND(SUM(data_volume), 3) as data_volume_gb,\n",
    "\n",
    "       ROUND(SUM(call_duration), 2) as call_minutes,\n",
    "\n",
    "       ROUND(AVG(signal_quality), 2) as avg_signal_quality\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "GROUP BY HOUR(usage_date)\n",
    "\n",
    "ORDER BY hour_of_day\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "hourly_patterns.show()\n",
    "\n",
    "\n",
    "# Monthly network trends\n",
    "\n",
    "print(\"\\n=== Monthly Network Trends ===\")\n",
    "\n",
    "monthly_trends = spark.sql(\"\"\"\n",
    "\n",
    "SELECT DATE_FORMAT(usage_date, 'yyyy-MM') as month,\n",
    "\n",
    "       COUNT(*) as total_usage,\n",
    "\n",
    "       ROUND(SUM(data_volume), 3) as monthly_data_gb,\n",
    "\n",
    "       ROUND(SUM(call_duration), 2) as monthly_call_minutes,\n",
    "\n",
    "       ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n",
    "\n",
    "       COUNT(DISTINCT subscriber_id) as active_subscribers\n",
    "\n",
    "FROM telecom.analytics.network_usage\n",
    "\n",
    "GROUP BY DATE_FORMAT(usage_date, 'yyyy-MM')\n",
    "\n",
    "ORDER BY month\n",
    "\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "\n",
    "monthly_trends.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Train Telecommunications Churn Prediction Model\n",
    "\n",
    "### Machine Learning for Telecommunications Business Improvement\n",
    "\n",
    "Now we'll train a machine learning model to predict customer churn. This model can help telecommunications companies:\n",
    "\n",
    "- **Reduce customer loss** by identifying at-risk subscribers early\n",
    "- **Improve customer retention** with targeted interventions\n",
    "- **Optimize marketing spend** by focusing on high-risk customers\n",
    "- **Enhance customer experience** by addressing pain points proactively\n",
    "\n",
    "### Model Approach\n",
    "\n",
    "We'll use a **Random Forest Classifier** to predict customer churn based on:\n",
    "\n",
    "- Usage patterns (data volume, call duration, service types)\n",
    "- Network quality metrics (signal strength)\n",
    "- Temporal patterns (usage frequency, time of day)\n",
    "- Service diversity and engagement levels\n",
    "\n",
    "### Business Impact\n",
    "\n",
    "- **Churn Prevention**: Early identification of subscribers likely to churn\n",
    "- **Cost Savings**: Reduced customer acquisition costs through retention\n",
    "- **Revenue Protection**: Preservation of recurring revenue streams\n",
    "- **Customer Insights**: Understanding factors driving customer dissatisfaction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Created subscriber features for 10000 subscribers\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+-----+\n",
       "|churn_risk|count|\n",
       "+----------+-----+\n",
       "|         1| 1204|\n",
       "|         0| 8796|\n",
       "+----------+-----+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Prepare data for machine learning - create churn labels and features\n",
    "\n",
    "from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler\n",
    "from pyspark.ml.classification import RandomForestClassifier\n",
    "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n",
    "from pyspark.ml import Pipeline\n",
    "import pyspark.sql.functions as F\n",
    "\n",
    "# Create subscriber-level features for churn prediction\n",
    "subscriber_features = spark.sql(\"\"\"\n",
    "SELECT \n",
    "    subscriber_id,\n",
    "    COUNT(*) as total_sessions,\n",
    "    ROUND(SUM(data_volume), 3) as total_data_gb,\n",
    "    ROUND(SUM(call_duration), 2) as total_call_minutes,\n",
    "    ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n",
    "    COUNT(DISTINCT service_type) as services_used,\n",
    "    COUNT(DISTINCT cell_tower_id) as towers_used,\n",
    "    COUNT(DISTINCT DATE(usage_date)) as active_days,\n",
    "    ROUND(AVG(HOUR(usage_date)), 2) as avg_usage_hour,\n",
    "    -- Simulate churn based on low usage and poor signal quality\n",
    "    CASE WHEN \n",
    "        COUNT(*) < 30 OR \n",
    "        AVG(signal_quality) < 65 OR \n",
    "        COUNT(DISTINCT service_type) < 3 \n",
    "    THEN 1 ELSE 0 END as churn_risk\n",
    "FROM telecom.analytics.network_usage\n",
    "GROUP BY subscriber_id\n",
    "\"\"\")\n",
    "\n",
    "print(f\"Created subscriber features for {subscriber_features.count()} subscribers\")\n",
    "subscriber_features.groupBy(\"churn_risk\").count().show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Training set: 8079 subscribers\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Test set: 1921 subscribers\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Feature engineering for churn prediction\n",
    "\n",
    "# Create indexers for categorical features (though we have few in this dataset)\n",
    "# Most features are already numeric\n",
    "\n",
    "# Assemble features for the model\n",
    "feature_cols = [\"total_sessions\", \"total_data_gb\", \"total_call_minutes\", \n",
    "                \"avg_signal_quality\", \"services_used\", \"towers_used\", \n",
    "                \"active_days\", \"avg_usage_hour\"]\n",
    "\n",
    "assembler = VectorAssembler(\n",
    "    inputCols=feature_cols,\n",
    "    outputCol=\"features\"\n",
    ")\n",
    "\n",
    "# Scale features\n",
    "scaler = StandardScaler(inputCol=\"features\", outputCol=\"scaled_features\")\n",
    "\n",
    "# Create and train the model\n",
    "rf = RandomForestClassifier(\n",
    "    labelCol=\"churn_risk\", \n",
    "    featuresCol=\"scaled_features\",\n",
    "    numTrees=100,\n",
    "    maxDepth=10\n",
    ")\n",
    "\n",
    "# Create pipeline\n",
    "pipeline = Pipeline(stages=[assembler, scaler, rf])\n",
    "\n",
    "# Split data\n",
    "train_data, test_data = subscriber_features.randomSplit([0.8, 0.2], seed=42)\n",
    "\n",
    "print(f\"Training set: {train_data.count()} subscribers\")\n",
    "print(f\"Test set: {test_data.count()} subscribers\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Training churn prediction model...\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Model AUC: 1.0000\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+-------------+--------------+------------------+----------+----------+-----------+\n",
       "|subscriber_id|total_sessions|avg_signal_quality|churn_risk|prediction|probability|\n",
       "+-------------+--------------+------------------+----------+----------+-----------+\n",
       "|  SUB00000003|            90|             71.76|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000007|            97|              73.7|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000009|            60|             72.57|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000014|            73|             71.38|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000020|            78|             71.35|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000024|            37|             71.76|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000030|            97|             71.25|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000036|            99|             72.28|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000046|            65|             71.58|         0|       0.0|  [1.0,0.0]|\n",
       "|  SUB00000047|            21|             76.19|         1|       1.0|  [0.0,1.0]|\n",
       "+-------------+--------------+------------------+----------+----------+-----------+\n",
       "only showing top 10 rows\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "+----------+----------+-----+\n",
       "|churn_risk|prediction|count|\n",
       "+----------+----------+-----+\n",
       "|         0|       0.0| 1683|\n",
       "|         1|       1.0|  238|\n",
       "+----------+----------+-----+\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Train the churn prediction model\n",
    "\n",
    "print(\"Training churn prediction model...\")\n",
    "model = pipeline.fit(train_data)\n",
    "\n",
    "# Make predictions\n",
    "predictions = model.transform(test_data)\n",
    "\n",
    "# Evaluate the model\n",
    "evaluator = BinaryClassificationEvaluator(labelCol=\"churn_risk\", metricName=\"areaUnderROC\")\n",
    "auc = evaluator.evaluate(predictions)\n",
    "\n",
    "print(f\"Model AUC: {auc:.4f}\")\n",
    "\n",
    "# Show prediction results\n",
    "predictions.select(\"subscriber_id\", \"total_sessions\", \"avg_signal_quality\", \"churn_risk\", \"prediction\", \"probability\").show(10)\n",
    "\n",
    "# Calculate confusion matrix\n",
    "confusion_matrix = predictions.groupBy(\"churn_risk\", \"prediction\").count()\n",
    "confusion_matrix.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "=== Feature Importance for Churn Prediction ===\n",
       "total_sessions: 0.5294\n",
       "total_data_gb: 0.0431\n",
       "total_call_minutes: 0.0502\n",
       "avg_signal_quality: 0.0011\n",
       "services_used: 0.0000\n",
       "towers_used: 0.0090\n",
       "active_days: 0.3662\n",
       "avg_usage_hour: 0.0010\n",
       "\n",
       "=== Business Impact Analysis ===\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Total test subscribers: 1921\n",
       "Subscribers predicted as high churn risk: 238\n",
       "Percentage flagged for intervention: 12.4%\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Estimated average ARPU: $1532.52\n",
       "Potential monthly revenue at risk: $364,739.93\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "Model Performance:\n",
       "Accuracy: 1.0000\n",
       "Precision: 1.0000\n",
       "Recall: 1.0000\n",
       "AUC: 1.0000\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Model interpretation and business insights\n",
    "\n",
    "# Feature importance (approximate)\n",
    "rf_model = model.stages[-1]\n",
    "feature_importance = rf_model.featureImportances\n",
    "feature_names = feature_cols\n",
    "\n",
    "print(\"=== Feature Importance for Churn Prediction ===\")\n",
    "for name, importance in zip(feature_names, feature_importance):\n",
    "    print(f\"{name}: {importance:.4f}\")\n",
    "\n",
    "# Business impact analysis\n",
    "print(\"\\n=== Business Impact Analysis ===\")\n",
    "\n",
    "# Calculate potential impact of churn prediction\n",
    "churn_predictions = predictions.filter(\"prediction = 1\")\n",
    "high_risk_subscribers = churn_predictions.count()\n",
    "total_test_subscribers = test_data.count()\n",
    "\n",
    "print(f\"Total test subscribers: {total_test_subscribers}\")\n",
    "print(f\"Subscribers predicted as high churn risk: {high_risk_subscribers}\")\n",
    "print(f\"Percentage flagged for intervention: {(high_risk_subscribers/total_test_subscribers)*100:.1f}%\")\n",
    "\n",
    "# Calculate average revenue per user (ARPU) estimate\n",
    "avg_data_gb = test_data.agg(F.avg(\"total_data_gb\")).collect()[0][0] or 0\n",
    "avg_call_minutes = test_data.agg(F.avg(\"total_call_minutes\")).collect()[0][0] or 0\n",
    "\n",
    "# Rough ARPU calculation (simplified)\n",
    "estimated_arpu = (avg_data_gb * 10) + (avg_call_minutes * 0.1) + 50  # Base plan\n",
    "potential_monthly_loss = high_risk_subscribers * estimated_arpu\n",
    "\n",
    "print(f\"\\nEstimated average ARPU: ${estimated_arpu:.2f}\")\n",
    "print(f\"Potential monthly revenue at risk: ${potential_monthly_loss:,.2f}\")\n",
    "\n",
    "# Accuracy metrics\n",
    "accuracy = predictions.filter(\"churn_risk = prediction\").count() / predictions.count()\n",
    "precision = predictions.filter(\"prediction = 1 AND churn_risk = 1\").count() / predictions.filter(\"prediction = 1\").count() if predictions.filter(\"prediction = 1\").count() > 0 else 0\n",
    "recall = predictions.filter(\"prediction = 1 AND churn_risk = 1\").count() / predictions.filter(\"churn_risk = 1\").count() if predictions.filter(\"churn_risk = 1\").count() > 0 else 0\n",
    "\n",
    "print(f\"\\nModel Performance:\")\n",
    "print(f\"Accuracy: {accuracy:.4f}\")\n",
    "print(f\"Precision: {precision:.4f}\")\n",
    "print(f\"Recall: {recall:.4f}\")\n",
    "print(f\"AUC: {auc:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Takeaways: Delta Liquid Clustering + ML in AIDP\n",
    "\n",
    "### What We Demonstrated\n",
    "\n",
    "1. **Automatic Optimization**: Created a table with `CLUSTER BY (subscriber_id, usage_date)` and let Delta automatically optimize data layout\n",
    "\n",
    "2. **Performance Benefits**: Queries on clustered columns (subscriber_id, usage_date) are significantly faster due to data locality\n",
    "\n",
    "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n",
    "\n",
    "4. **Machine Learning Integration**: Trained a churn prediction model using the optimized data\n",
    "\n",
    "5. **Real-World Use Case**: Telecommunications analytics where customer retention and network monitoring are critical\n",
    "\n",
    "### AIDP Advantages\n",
    "\n",
    "- **Unified Analytics**: Seamlessly integrates data optimization with ML\n",
    "- **Governance**: Catalog and schema isolation for telecommunications data\n",
    "- **Performance**: Optimized for both analytical queries and ML training\n",
    "- **Scalability**: Handles telecommunications-scale data volumes effortlessly\n",
    "\n",
    "### Business Benefits for Telecommunications\n",
    "\n",
    "1. **Churn Prevention**: Early identification of at-risk subscribers\n",
    "2. **Revenue Protection**: Preservation of recurring revenue streams\n",
    "3. **Cost Optimization**: Targeted retention campaigns\n",
    "4. **Customer Experience**: Proactive service quality improvements\n",
    "5. **Network Optimization**: Data-driven capacity planning\n",
    "\n",
    "### Best Practices for Telecommunications Analytics\n",
    "\n",
    "1. **Choose clustering columns** based on your most common query patterns\n",
    "2. **Start with 1-4 columns** - too many can reduce effectiveness\n",
    "3. **Consider cardinality** - high-cardinality columns work best\n",
    "4. **Monitor and adjust** as query patterns evolve\n",
    "5. **Combine with ML** for predictive analytics and automation\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "- Explore other AIDP ML features like AutoML\n",
    "- Try liquid clustering with different column combinations\n",
    "- Scale up to larger telecommunications datasets\n",
    "- Integrate with real network monitoring systems and CDR data\n",
    "- Deploy models for real-time churn prediction and intervention\n",
    "\n",
    "This notebook demonstrates how Oracle AI Data Platform makes advanced telecommunications analytics accessible while maintaining enterprise-grade performance and governance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}