{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hospitality: Delta Liquid Clustering Demo\n", "\n", "\n", "## Overview\n", "\n", "\n", "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a hospitality and tourism analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", "\n", "### What is Liquid Clustering?\n", "\n", "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", "\n", "- **Automatic optimization**: No manual tuning required\n", "- **Improved query performance**: Faster queries on clustered columns\n", "- **Reduced maintenance**: No need for manual repartitioning\n", "- **Adaptive clustering**: Adjusts as data patterns change\n", "\n", "### Use Case: Hotel Guest Experience and Revenue Management\n", "\n", "We'll analyze hotel booking and guest experience data. Our clustering strategy will optimize for:\n", "\n", "- **Guest-specific queries**: Fast lookups by guest ID\n", "- **Time-based analysis**: Efficient filtering by booking and stay dates\n", "- **Revenue patterns**: Quick aggregation by room type and booking channels\n", "\n", "### AIDP Environment Setup\n", "\n", "This notebook leverages the existing Spark session in your AIDP environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create hospitality catalog and analytics schema\n", "\n", "# In AIDP, catalogs provide data isolation and governance\n", "\n", "spark.sql(\"CREATE CATALOG IF NOT EXISTS hospitality\")\n", "\n", "spark.sql(\"CREATE SCHEMA IF NOT EXISTS hospitality.analytics\")\n", "\n", "print(\"Hospitality catalog and analytics schema created successfully!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Create Delta Table with Liquid Clustering\n", "\n", "### Table Design\n", "\n", "Our `guest_stays` table will store:\n", "\n", "- **guest_id**: Unique guest identifier\n", "- **booking_date**: Date booking was made\n", "- **check_in_date**: Guest arrival date\n", "- **room_type**: Type of room booked\n", "- **booking_channel**: How booking was made (OTA, Direct, etc.)\n", "- **total_revenue**: Total booking revenue\n", "- **guest_satisfaction**: Guest satisfaction score (1-10)\n", "\n", "### Clustering Strategy\n", "\n", "We'll cluster by `guest_id` and `booking_date` because:\n", "\n", "- **guest_id**: Guests often make multiple bookings, grouping their stay history together\n", "- **booking_date**: Time-based queries are critical for revenue analysis, seasonal trends, and booking patterns\n", "- This combination optimizes for both guest relationship management and temporal revenue analytics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Delta table with liquid clustering created successfully!\n", "Clustering will automatically optimize data layout for queries on guest_id and booking_date.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create Delta table with liquid clustering\n", "\n", "# CLUSTER BY defines the columns for automatic optimization\n", "\n", "spark.sql(\"\"\"\n", "\n", "CREATE TABLE IF NOT EXISTS hospitality.analytics.guest_stays (\n", "\n", " guest_id STRING,\n", "\n", " booking_date DATE,\n", "\n", " check_in_date DATE,\n", "\n", " room_type STRING,\n", "\n", " booking_channel STRING,\n", "\n", " total_revenue DECIMAL(8,2),\n", "\n", " guest_satisfaction INT\n", "\n", ")\n", "\n", "USING DELTA\n", "\n", "CLUSTER BY (guest_id, booking_date)\n", "\n", "\"\"\")\n", "\n", "print(\"Delta table with liquid clustering created successfully!\")\n", "\n", "print(\"Clustering will automatically optimize data layout for queries on guest_id and booking_date.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Generate Hospitality Sample Data\n", "\n", "### Data Generation Strategy\n", "\n", "We'll create realistic hotel booking and guest data including:\n", "\n", "- **5,000 guests** with multiple bookings over time\n", "- **Room types**: Standard, Deluxe, Suite, Executive\n", "- **Booking channels**: Direct, Online Travel Agency, Corporate, Walk-in\n", "- **Seasonal patterns**: Peak seasons, weekend vs weekday pricing\n", "\n", "### Why This Data Pattern?\n", "\n", "This data simulates real hospitality scenarios where:\n", "\n", "- Guest loyalty programs require historical booking tracking\n", "- Revenue management depends on booking channel analysis\n", "- Seasonal pricing strategies drive occupancy optimization\n", "- Guest satisfaction impacts reputation and repeat business\n", "- Channel performance requires continuous monitoring" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Generated 24769 guest booking records\n", "Sample record: {'guest_id': 'GST000001', 'booking_date': datetime.date(2024, 7, 16), 'check_in_date': datetime.date(2024, 8, 4), 'room_type': 'Executive', 'booking_channel': 'Direct', 'total_revenue': 416.91, 'guest_satisfaction': 6}\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate sample hospitality guest booking data\n", "\n", "# Using fully qualified imports to avoid conflicts\n", "\n", "import random\n", "\n", "from datetime import datetime, timedelta\n", "\n", "\n", "# Define hospitality data constants\n", "\n", "ROOM_TYPES = ['Standard', 'Deluxe', 'Suite', 'Executive']\n", "\n", "BOOKING_CHANNELS = ['Direct', 'Online Travel Agency', 'Corporate', 'Walk-in']\n", "\n", "# Base revenue parameters by room type\n", "\n", "REVENUE_PARAMS = {\n", "\n", " 'Standard': {'base_rate': 120, 'satisfaction': 7.8},\n", "\n", " 'Deluxe': {'base_rate': 200, 'satisfaction': 8.2},\n", "\n", " 'Suite': {'base_rate': 350, 'satisfaction': 8.8},\n", "\n", " 'Executive': {'base_rate': 280, 'satisfaction': 8.5}\n", "\n", "}\n", "\n", "# Channel margins (affect final revenue)\n", "\n", "CHANNEL_MARGINS = {\n", "\n", " 'Direct': 1.0,\n", "\n", " 'Online Travel Agency': 0.85,\n", "\n", " 'Corporate': 0.90,\n", "\n", " 'Walk-in': 0.95\n", "\n", "}\n", "\n", "\n", "# Generate guest booking records\n", "\n", "booking_data = []\n", "\n", "base_date = datetime(2024, 1, 1)\n", "\n", "\n", "# Create 5,000 guests with 2-8 bookings each\n", "\n", "for guest_num in range(1, 5001):\n", "\n", " guest_id = f\"GST{guest_num:06d}\"\n", " \n", " # Each guest gets 2-8 bookings over 12 months\n", "\n", " num_bookings = random.randint(2, 8)\n", " \n", " for i in range(num_bookings):\n", "\n", " # Spread bookings over 12 months\n", "\n", " days_offset = random.randint(0, 365)\n", "\n", " booking_date = base_date + timedelta(days=days_offset)\n", " \n", " # Check-in date (usually within 1-30 days of booking)\n", "\n", " checkin_offset = random.randint(1, 30)\n", "\n", " check_in_date = booking_date + timedelta(days=checkin_offset)\n", " \n", " # Select room type\n", "\n", " room_type = random.choice(ROOM_TYPES)\n", "\n", " params = REVENUE_PARAMS[room_type]\n", " \n", " # Select booking channel\n", "\n", " booking_channel = random.choice(BOOKING_CHANNELS)\n", "\n", " channel_margin = CHANNEL_MARGINS[booking_channel]\n", " \n", " # Calculate revenue with variations\n", "\n", " # Seasonal pricing (higher in peak season)\n", "\n", " month = check_in_date.month\n", "\n", " if month in [6, 7, 8]: # Summer peak\n", "\n", " seasonal_factor = 1.3\n", "\n", " elif month in [11, 12]: # Holiday season\n", "\n", " seasonal_factor = 1.4\n", "\n", " else:\n", "\n", " seasonal_factor = 1.0\n", " \n", " # Weekend pricing\n", "\n", " if check_in_date.weekday() >= 5: # Saturday = 5, Sunday = 6\n", "\n", " weekend_factor = 1.2\n", "\n", " else:\n", "\n", " weekend_factor = 1.0\n", " \n", " # Stay length (1-7 nights)\n", "\n", " stay_length = random.randint(1, 7)\n", " \n", " # Calculate total revenue\n", "\n", " revenue_variation = random.uniform(0.9, 1.1)\n", "\n", " total_revenue = round(params['base_rate'] * stay_length * seasonal_factor * weekend_factor * channel_margin * revenue_variation, 2)\n", " \n", " # Guest satisfaction (varies by room type and some randomness)\n", "\n", " satisfaction_variation = random.randint(-2, 2)\n", "\n", " guest_satisfaction = max(1, min(10, params['satisfaction'] + satisfaction_variation))\n", " \n", " booking_data.append({\n", "\n", " \"guest_id\": guest_id,\n", "\n", " \"booking_date\": booking_date.date(),\n", "\n", " \"check_in_date\": check_in_date.date(),\n", "\n", " \"room_type\": room_type,\n", "\n", " \"booking_channel\": booking_channel,\n", "\n", " \"total_revenue\": float(total_revenue),\n", "\n", " \"guest_satisfaction\": int(guest_satisfaction)\n", "\n", " })\n", "\n", "\n", "\n", "print(f\"Generated {len(booking_data)} guest booking records\")\n", "\n", "print(\"Sample record:\", booking_data[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Insert Data Using PySpark\n", "\n", "### Data Insertion Strategy\n", "\n", "We'll use PySpark to:\n", "\n", "1. **Create DataFrame** from our generated data\n", "2. **Insert into Delta table** with liquid clustering\n", "3. **Verify the insertion** with a sample query\n", "\n", "### Why PySpark for Insertion?\n", "\n", "- **Distributed processing**: Handles large datasets efficiently\n", "- **Type safety**: Ensures data integrity\n", "- **Optimization**: Leverages Spark's query optimization\n", "- **Liquid clustering**: Automatically applies clustering during insertion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DataFrame Schema:\n", "root\n", " |-- booking_channel: string (nullable = true)\n", " |-- booking_date: date (nullable = true)\n", " |-- check_in_date: date (nullable = true)\n", " |-- guest_id: string (nullable = true)\n", " |-- guest_satisfaction: long (nullable = true)\n", " |-- room_type: string (nullable = true)\n", " |-- total_revenue: double (nullable = true)\n", "\n", "\n", "Sample Data:\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------------+------------+-------------+---------+------------------+---------+-------------+\n", "|booking_channel|booking_date|check_in_date| guest_id|guest_satisfaction|room_type|total_revenue|\n", "+---------------+------------+-------------+---------+------------------+---------+-------------+\n", "| Direct| 2024-07-16| 2024-08-04|GST000001| 6|Executive| 416.91|\n", "| Direct| 2024-12-05| 2024-12-24|GST000001| 5| Standard| 964.81|\n", "| Direct| 2024-01-18| 2024-02-15|GST000001| 9| Deluxe| 559.03|\n", "| Corporate| 2024-03-31| 2024-04-16|GST000001| 7| Suite| 2377.83|\n", "| Walk-in| 2024-10-07| 2024-10-29|GST000001| 7|Executive| 814.74|\n", "+---------------+------------+-------------+---------+------------------+---------+-------------+\n", "only showing top 5 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Successfully inserted 24769 records into hospitality.analytics.guest_stays\n", "Liquid clustering automatically optimized the data layout during insertion!\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Insert data using PySpark DataFrame operations\n", "\n", "# Using fully qualified function references to avoid conflicts\n", "\n", "\n", "# Create DataFrame from generated data\n", "\n", "df_bookings = spark.createDataFrame(booking_data)\n", "\n", "\n", "# Display schema and sample data\n", "\n", "print(\"DataFrame Schema:\")\n", "\n", "df_bookings.printSchema()\n", "\n", "\n", "\n", "print(\"\\nSample Data:\")\n", "\n", "df_bookings.show(5)\n", "\n", "\n", "# Insert data into Delta table with liquid clustering\n", "\n", "# The CLUSTER BY (guest_id, booking_date) will automatically optimize the data layout\n", "\n", "df_bookings.write.mode(\"overwrite\").saveAsTable(\"hospitality.analytics.guest_stays\")\n", "\n", "\n", "print(f\"\\nSuccessfully inserted {df_bookings.count()} records into hospitality.analytics.guest_stays\")\n", "\n", "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Demonstrate Liquid Clustering Benefits\n", "\n", "### Query Performance Analysis\n", "\n", "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", "\n", "1. **Guest booking history** (clustered by guest_id)\n", "2. **Time-based revenue analysis** (clustered by booking_date)\n", "3. **Combined guest + time queries** (optimal for our clustering)\n", "\n", "### Expected Performance Benefits\n", "\n", "With liquid clustering, these queries should be significantly faster because:\n", "\n", "- **Data locality**: Related records are physically grouped together\n", "- **Reduced I/O**: Less data needs to be read from disk\n", "- **Automatic optimization**: No manual tuning required" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Query 1: Guest Booking History ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------+------------+---------+-------------+------------------+\n", "| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|\n", "+---------+------------+---------+-------------+------------------+\n", "|GST000001| 2024-12-05| Standard| 964.81| 5|\n", "|GST000001| 2024-10-07|Executive| 814.74| 7|\n", "|GST000001| 2024-08-27|Executive| 2144.63| 8|\n", "|GST000001| 2024-07-16|Executive| 416.91| 6|\n", "|GST000001| 2024-03-31| Suite| 2377.83| 7|\n", "|GST000001| 2024-01-18| Deluxe| 559.03| 9|\n", "+---------+------------+---------+-------------+------------------+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Records found: 6\n", "\n", "=== Query 2: Recent High-Value Bookings ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+------------+---------+---------+-------------+---------------+\n", "|booking_date| guest_id|room_type|total_revenue|booking_channel|\n", "+------------+---------+---------+-------------+---------------+\n", "| 2024-11-24|GST003985| Suite| 4251.18| Walk-in|\n", "| 2024-10-26|GST002768| Suite| 4152.53| Direct|\n", "| 2024-10-18|GST004009| Suite| 4128.43| Walk-in|\n", "| 2024-10-24|GST002513| Suite| 4100.58| Direct|\n", "| 2024-10-30|GST003731| Suite| 4094.89| Direct|\n", "| 2024-12-06|GST001589| Suite| 4040.9| Corporate|\n", "| 2024-07-05|GST000747| Suite| 4038.53| Direct|\n", "| 2024-10-28|GST004944| Suite| 4003.68| Walk-in|\n", "| 2024-11-02|GST002918| Suite| 3999.52| Walk-in|\n", "| 2024-07-06|GST004776| Suite| 3990.7| Direct|\n", "| 2024-08-13|GST002234| Suite| 3987.73| Walk-in|\n", "| 2024-10-14|GST000135| Suite| 3977.48| Corporate|\n", "| 2024-10-18|GST002267| Suite| 3974.93| Corporate|\n", "| 2024-11-17|GST004779| Suite| 3973.48| Walk-in|\n", "| 2024-07-12|GST003555| Suite| 3937.36| Walk-in|\n", "| 2024-11-10|GST003458| Suite| 3930.93| Walk-in|\n", "| 2024-08-01|GST004104| Suite| 3929.78| Walk-in|\n", "| 2024-06-08|GST002362| Suite| 3891.91| Direct|\n", "| 2024-10-22|GST002703| Suite| 3860.18| Corporate|\n", "| 2024-11-24|GST002866| Suite| 3837.02| Direct|\n", "+------------+---------+---------+-------------+---------------+\n", "only showing top 20 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "High-value bookings found: 6805\n", "\n", "=== Query 3: Guest Spending Trends ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------+------------+---------+-------------+------------------+\n", "| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|\n", "+---------+------------+---------+-------------+------------------+\n", "|GST000001| 2024-07-16|Executive| 416.91| 6|\n", "|GST000001| 2024-08-27|Executive| 2144.63| 8|\n", "|GST000001| 2024-10-07|Executive| 814.74| 7|\n", "|GST000001| 2024-12-05| Standard| 964.81| 5|\n", "|GST000002| 2024-04-28| Standard| 233.62| 5|\n", "|GST000002| 2024-06-25| Standard| 476.31| 8|\n", "|GST000002| 2024-07-06| Standard| 1135.45| 5|\n", "|GST000002| 2024-11-28| Suite| 798.57| 6|\n", "|GST000003| 2024-05-22| Standard| 593.87| 9|\n", "|GST000003| 2024-06-04|Executive| 1675.13| 8|\n", "|GST000003| 2024-11-03| Suite| 1514.12| 10|\n", "|GST000004| 2024-04-04| Suite| 1555.73| 9|\n", "|GST000004| 2024-05-26|Executive| 578.28| 9|\n", "|GST000004| 2024-10-04| Suite| 664.5| 6|\n", "|GST000004| 2024-10-19| Standard| 211.6| 5|\n", "|GST000004| 2024-12-05| Standard| 527.95| 5|\n", "|GST000005| 2024-04-12| Suite| 2393.98| 7|\n", "|GST000005| 2024-06-21|Executive| 410.54| 7|\n", "|GST000005| 2024-06-22| Deluxe| 1080.52| 6|\n", "|GST000005| 2024-07-23|Executive| 1778.49| 9|\n", "+---------+------------+---------+-------------+------------------+\n", "only showing top 20 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Spending trend records found: 3708\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Demonstrate liquid clustering benefits with optimized queries\n", "\n", "\n", "# Query 1: Guest booking history - benefits from guest_id clustering\n", "\n", "print(\"=== Query 1: Guest Booking History ===\")\n", "\n", "guest_history = spark.sql(\"\"\"\n", "\n", "SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "WHERE guest_id = 'GST000001'\n", "\n", "ORDER BY booking_date DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "guest_history.show()\n", "\n", "print(f\"Records found: {guest_history.count()}\")\n", "\n", "\n", "\n", "# Query 2: Time-based revenue analysis - benefits from booking_date clustering\n", "\n", "print(\"\\n=== Query 2: Recent High-Value Bookings ===\")\n", "\n", "high_value = spark.sql(\"\"\"\n", "\n", "SELECT booking_date, guest_id, room_type, total_revenue, booking_channel\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "WHERE booking_date >= '2024-06-01' AND total_revenue > 1000\n", "\n", "ORDER BY total_revenue DESC, booking_date DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "high_value.show()\n", "\n", "print(f\"High-value bookings found: {high_value.count()}\")\n", "\n", "\n", "\n", "# Query 3: Combined guest + time query - optimal for our clustering strategy\n", "\n", "print(\"\\n=== Query 3: Guest Spending Trends ===\")\n", "\n", "spending_trends = spark.sql(\"\"\"\n", "\n", "SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "WHERE guest_id LIKE 'GST000%' AND booking_date >= '2024-04-01'\n", "\n", "ORDER BY guest_id, booking_date\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "spending_trends.show()\n", "\n", "print(f\"Spending trend records found: {spending_trends.count()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Analyze Clustering Effectiveness\n", "\n", "### Understanding the Impact\n", "\n", "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the hospitality insights possible with this optimized structure.\n", "\n", "### Key Analytics\n", "\n", "- **Guest loyalty patterns** and repeat booking analysis\n", "- **Revenue performance** by room type and booking channel\n", "- **Seasonal trends** and occupancy optimization\n", "- **Guest satisfaction** and service quality metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Guest Loyalty Analysis ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", "| guest_id|total_bookings|total_spent|avg_booking_value|avg_satisfaction|last_booking_date|\n", "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", "|GST004111| 8| 15529.05| 1941.13| 7.5| 2024-12-02|\n", "|GST004779| 8| 15521.94| 1940.24| 8.25| 2024-11-17|\n", "|GST003389| 8| 14737.73| 1842.22| 8.38| 2024-08-31|\n", "|GST001190| 8| 14437.27| 1804.66| 7.75| 2024-12-25|\n", "|GST002351| 8| 14325.44| 1790.68| 7.13| 2024-11-06|\n", "|GST000383| 6| 14103.66| 2350.61| 8.0| 2024-12-11|\n", "|GST004104| 8| 14048.35| 1756.04| 7.38| 2024-11-03|\n", "|GST003901| 8| 13627.32| 1703.42| 7.63| 2024-11-24|\n", "|GST002332| 8| 13615.39| 1701.92| 8.38| 2024-12-23|\n", "|GST002910| 8| 13538.27| 1692.28| 8.13| 2024-11-18|\n", "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", "\n", "\n", "=== Room Type Performance ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", "|room_type|total_bookings|total_revenue|avg_revenue_per_booking|avg_satisfaction|unique_guests|\n", "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", "| Suite| 6194| 9574270.59| 1545.73| 8.0| 3581|\n", "|Executive| 6093| 7640176.13| 1253.93| 8.02| 3596|\n", "| Deluxe| 6105| 5437628.38| 890.68| 8.01| 3584|\n", "| Standard| 6377| 3394649.4| 532.33| 6.99| 3666|\n", "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", "\n", "\n", "=== Booking Channel Performance ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", "| booking_channel|total_bookings|total_revenue|avg_revenue|avg_satisfaction|unique_guests|\n", "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", "| Direct| 6303| 7081966.65| 1123.59| 7.75| 3663|\n", "| Walk-in| 6102| 6663967.26| 1092.1| 7.76| 3544|\n", "| Corporate| 6242| 6387575.26| 1023.32| 7.77| 3603|\n", "|Online Travel Agency| 6122| 5913215.33| 965.9| 7.71| 3534|\n", "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", "\n", "\n", "=== Monthly Revenue Trends ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+-------+--------------+---------------+-----------------+----------------+-------------+\n", "| month|total_bookings|monthly_revenue|avg_booking_value|avg_satisfaction|unique_guests|\n", "+-------+--------------+---------------+-----------------+----------------+-------------+\n", "|2024-01| 2063| 1889844.15| 916.07| 7.67| 1718|\n", "|2024-02| 1950| 1806510.06| 926.42| 7.76| 1616|\n", "|2024-03| 2081| 1915693.6| 920.56| 7.77| 1747|\n", "|2024-04| 2021| 1882321.17| 931.38| 7.73| 1673|\n", "|2024-05| 2167| 2259153.05| 1042.53| 7.77| 1781|\n", "|2024-06| 1980| 2340309.76| 1181.97| 7.74| 1638|\n", "|2024-07| 2154| 2660409.92| 1235.1| 7.77| 1774|\n", "|2024-08| 2070| 2172929.73| 1049.72| 7.8| 1677|\n", "|2024-09| 1937| 1775232.62| 916.49| 7.77| 1612|\n", "|2024-10| 2128| 2338663.91| 1099.0| 7.77| 1751|\n", "|2024-11| 2062| 2636155.37| 1278.45| 7.72| 1702|\n", "|2024-12| 2156| 2369501.16| 1099.03| 7.7| 1763|\n", "+-------+--------------+---------------+-----------------+----------------+-------------+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Analyze clustering effectiveness and hospitality insights\n", "\n", "\n", "# Guest loyalty analysis\n", "\n", "print(\"=== Guest Loyalty Analysis ===\")\n", "\n", "guest_loyalty = spark.sql(\"\"\"\n", "\n", "SELECT guest_id, COUNT(*) as total_bookings,\n", "\n", " ROUND(SUM(total_revenue), 2) as total_spent,\n", "\n", " ROUND(AVG(total_revenue), 2) as avg_booking_value,\n", "\n", " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", "\n", " MAX(booking_date) as last_booking_date\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "GROUP BY guest_id\n", "\n", "ORDER BY total_spent DESC\n", "\n", "LIMIT 10\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "guest_loyalty.show()\n", "\n", "\n", "# Room type performance\n", "\n", "print(\"\\n=== Room Type Performance ===\")\n", "\n", "room_performance = spark.sql(\"\"\"\n", "\n", "SELECT room_type, COUNT(*) as total_bookings,\n", "\n", " ROUND(SUM(total_revenue), 2) as total_revenue,\n", "\n", " ROUND(AVG(total_revenue), 2) as avg_revenue_per_booking,\n", "\n", " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", "\n", " COUNT(DISTINCT guest_id) as unique_guests\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "GROUP BY room_type\n", "\n", "ORDER BY total_revenue DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "room_performance.show()\n", "\n", "\n", "# Booking channel analysis\n", "\n", "print(\"\\n=== Booking Channel Performance ===\")\n", "\n", "channel_analysis = spark.sql(\"\"\"\n", "\n", "SELECT booking_channel, COUNT(*) as total_bookings,\n", "\n", " ROUND(SUM(total_revenue), 2) as total_revenue,\n", "\n", " ROUND(AVG(total_revenue), 2) as avg_revenue,\n", "\n", " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", "\n", " COUNT(DISTINCT guest_id) as unique_guests\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "GROUP BY booking_channel\n", "\n", "ORDER BY total_revenue DESC\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "channel_analysis.show()\n", "\n", "\n", "# Monthly revenue trends\n", "\n", "print(\"\\n=== Monthly Revenue Trends ===\")\n", "\n", "monthly_trends = spark.sql(\"\"\"\n", "\n", "SELECT DATE_FORMAT(booking_date, 'yyyy-MM') as month,\n", "\n", " COUNT(*) as total_bookings,\n", "\n", " ROUND(SUM(total_revenue), 2) as monthly_revenue,\n", "\n", " ROUND(AVG(total_revenue), 2) as avg_booking_value,\n", "\n", " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", "\n", " COUNT(DISTINCT guest_id) as unique_guests\n", "\n", "FROM hospitality.analytics.guest_stays\n", "\n", "GROUP BY DATE_FORMAT(booking_date, 'yyyy-MM')\n", "\n", "ORDER BY month\n", "\n", "\"\"\")\n", "\n", "\n", "\n", "monthly_trends.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7: Train Hospitality Guest Churn Prediction Model\n", "\n", "### Machine Learning for Hospitality Business Improvement\n", "\n", "Now we'll train a machine learning model to predict guest churn. This model can help hospitality companies:\n", "\n", "- **Identify at-risk guests** before they stop booking\n", "- **Implement retention strategies** with personalized interventions\n", "- **Optimize marketing spend** by focusing on loyal vs. churning guests\n", "- **Improve guest satisfaction** by addressing pain points proactively\n", "\n", "### Model Approach\n", "\n", "We'll use a **Random Forest Classifier** to predict guest churn based on:\n", "\n", "- Booking frequency and recency patterns\n", "- Spending behavior and room type preferences\n", "- Channel usage and satisfaction scores\n", "- Seasonal booking patterns\n", "\n", "### Business Impact\n", "\n", "- **Revenue Protection**: Reduce lost revenue from churned guests\n", "- **Customer Lifetime Value**: Increase long-term guest relationships\n", "- **Operational Efficiency**: Targeted retention campaigns\n", "- **Competitive Advantage**: Better guest experience and loyalty" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Created guest features for 5000 guests\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+-----+\n", "|churn_risk|count|\n", "+----------+-----+\n", "| 1| 5000|\n", "+----------+-----+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Prepare data for machine learning - create guest-level features for churn prediction\n", "\n", "from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler\n", "from pyspark.ml.classification import RandomForestClassifier\n", "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", "from pyspark.ml import Pipeline\n", "import pyspark.sql.functions as F\n", "\n", "# Create guest-level features for churn prediction\n", "guest_features = spark.sql(\"\"\"\n", "SELECT \n", " guest_id,\n", " COUNT(*) as total_bookings,\n", " ROUND(SUM(total_revenue), 2) as total_spent,\n", " ROUND(AVG(total_revenue), 2) as avg_booking_value,\n", " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", " ROUND(STDDEV(guest_satisfaction), 2) as satisfaction_variability,\n", " COUNT(DISTINCT room_type) as room_types_used,\n", " COUNT(DISTINCT booking_channel) as channels_used,\n", " COUNT(DISTINCT DATE_FORMAT(check_in_date, 'yyyy-MM')) as active_months,\n", " DATEDIFF(CURRENT_DATE(), MAX(booking_date)) as days_since_last_booking,\n", " DATEDIFF(CURRENT_DATE(), MIN(booking_date)) as customer_tenure_days,\n", " ROUND(AVG(DATEDIFF(check_in_date, booking_date)), 2) as avg_advance_booking_days,\n", " -- Simulate churn based on booking patterns and satisfaction\n", " CASE WHEN \n", " DATEDIFF(CURRENT_DATE(), MAX(booking_date)) > 90 OR \n", " AVG(guest_satisfaction) < 7 OR \n", " COUNT(*) < 3 \n", " THEN 1 ELSE 0 END as churn_risk\n", "FROM hospitality.analytics.guest_stays\n", "GROUP BY guest_id\n", "\"\"\")\n", "\n", "print(f\"Created guest features for {guest_features.count()} guests\")\n", "guest_features.groupBy(\"churn_risk\").count().show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Training set: 4042 guests\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Test set: 958 guests\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Feature engineering for churn prediction\n", "\n", "# Assemble features for the model\n", "feature_cols = [\"total_bookings\", \"total_spent\", \"avg_booking_value\", \"avg_satisfaction\", \n", " \"satisfaction_variability\", \"room_types_used\", \"channels_used\", \n", " \"active_months\", \"days_since_last_booking\", \"customer_tenure_days\", \n", " \"avg_advance_booking_days\"]\n", "\n", "assembler = VectorAssembler(\n", " inputCols=feature_cols,\n", " outputCol=\"features\"\n", ")\n", "\n", "# Scale features\n", "scaler = StandardScaler(inputCol=\"features\", outputCol=\"scaled_features\")\n", "\n", "# Create and train the model\n", "rf = RandomForestClassifier(\n", " labelCol=\"churn_risk\", \n", " featuresCol=\"scaled_features\",\n", " numTrees=100,\n", " maxDepth=10\n", ")\n", "\n", "# Create pipeline\n", "pipeline = Pipeline(stages=[assembler, scaler, rf])\n", "\n", "# Split data\n", "train_data, test_data = guest_features.randomSplit([0.8, 0.2], seed=42)\n", "\n", "print(f\"Training set: {train_data.count()} guests\")\n", "print(f\"Test set: {test_data.count()} guests\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Training guest churn prediction model...\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Model AUC: 1.0000\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+---------+--------------+----------------+----------+----------+-----------+\n", "| guest_id|total_bookings|avg_satisfaction|churn_risk|prediction|probability|\n", "+---------+--------------+----------------+----------+----------+-----------+\n", "|GST000003| 3| 9.0| 1| 1.0| [0.0,1.0]|\n", "|GST000007| 2| 8.5| 1| 1.0| [0.0,1.0]|\n", "|GST000009| 8| 7.88| 1| 1.0| [0.0,1.0]|\n", "|GST000014| 8| 7.75| 1| 1.0| [0.0,1.0]|\n", "|GST000020| 4| 7.5| 1| 1.0| [0.0,1.0]|\n", "|GST000024| 5| 6.6| 1| 1.0| [0.0,1.0]|\n", "|GST000030| 3| 8.33| 1| 1.0| [0.0,1.0]|\n", "|GST000036| 4| 8.5| 1| 1.0| [0.0,1.0]|\n", "|GST000046| 5| 7.8| 1| 1.0| [0.0,1.0]|\n", "|GST000047| 6| 8.0| 1| 1.0| [0.0,1.0]|\n", "+---------+--------------+----------------+----------+----------+-----------+\n", "only showing top 10 rows\n", "\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "+----------+----------+-----+\n", "|churn_risk|prediction|count|\n", "+----------+----------+-----+\n", "| 1| 1.0| 958|\n", "+----------+----------+-----+\n", "\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Train the churn prediction model\n", "\n", "print(\"Training guest churn prediction model...\")\n", "model = pipeline.fit(train_data)\n", "\n", "# Make predictions\n", "predictions = model.transform(test_data)\n", "\n", "# Evaluate the model\n", "evaluator = BinaryClassificationEvaluator(labelCol=\"churn_risk\", metricName=\"areaUnderROC\")\n", "auc = evaluator.evaluate(predictions)\n", "\n", "print(f\"Model AUC: {auc:.4f}\")\n", "\n", "# Show prediction results\n", "predictions.select(\"guest_id\", \"total_bookings\", \"avg_satisfaction\", \"churn_risk\", \"prediction\", \"probability\").show(10)\n", "\n", "# Calculate confusion matrix\n", "confusion_matrix = predictions.groupBy(\"churn_risk\", \"prediction\").count()\n", "confusion_matrix.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "=== Feature Importance for Guest Churn Prediction ===\n", "total_bookings: 0.0000\n", "total_spent: 0.0000\n", "avg_booking_value: 0.0000\n", "avg_satisfaction: 0.0000\n", "satisfaction_variability: 0.0000\n", "room_types_used: 0.0000\n", "channels_used: 0.0000\n", "active_months: 0.0000\n", "days_since_last_booking: 0.0000\n", "customer_tenure_days: 0.0000\n", "avg_advance_booking_days: 0.0000\n", "\n", "=== Business Impact Analysis ===\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Total test guests: 958\n", "Guests predicted to be at churn risk: 958\n", "Percentage flagged for retention intervention: 100.0%\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Estimated average lifetime value per guest: $5,153.89\n", "Potential revenue at risk from churn: $4,937,425\n", "\n", "Estimated retention campaign success rate: 40%\n", "Potential revenue saved through retention: $1,974,970\n", "Retention program ROI: 1274.4%\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "\n", "Model Performance:\n", "Accuracy: 1.0000\n", "Precision: 1.0000\n", "Recall: 1.0000\n", "AUC: 1.0000\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Model interpretation and business insights\n", "\n", "# Feature importance (approximate)\n", "rf_model = model.stages[-1]\n", "feature_importance = rf_model.featureImportances\n", "feature_names = feature_cols\n", "\n", "print(\"=== Feature Importance for Guest Churn Prediction ===\")\n", "for name, importance in zip(feature_names, feature_importance):\n", " print(f\"{name}: {importance:.4f}\")\n", "\n", "# Business impact analysis\n", "print(\"\\n=== Business Impact Analysis ===\")\n", "\n", "# Calculate potential impact of churn prediction\n", "churn_predictions = predictions.filter(\"prediction = 1\")\n", "guests_at_risk = churn_predictions.count()\n", "total_test_guests = test_data.count()\n", "\n", "print(f\"Total test guests: {total_test_guests}\")\n", "print(f\"Guests predicted to be at churn risk: {guests_at_risk}\")\n", "print(f\"Percentage flagged for retention intervention: {(guests_at_risk/total_test_guests)*100:.1f}%\")\n", "\n", "# Calculate revenue impact\n", "avg_guest_value = test_data.agg(F.avg(\"total_spent\")).collect()[0][0] or 0\n", "potential_lost_revenue = guests_at_risk * avg_guest_value\n", "\n", "print(f\"\\nEstimated average lifetime value per guest: ${avg_guest_value:,.2f}\")\n", "print(f\"Potential revenue at risk from churn: ${potential_lost_revenue:,.0f}\")\n", "\n", "# Retention program value\n", "retention_success_rate = 0.4 # 40% success rate for retention campaigns\n", "avg_retention_cost = 150 # Cost per retention intervention\n", "saved_revenue = (guests_at_risk * retention_success_rate) * avg_guest_value\n", "retention_roi = (saved_revenue - (guests_at_risk * avg_retention_cost)) / (guests_at_risk * avg_retention_cost) * 100\n", "\n", "print(f\"\\nEstimated retention campaign success rate: {retention_success_rate*100:.0f}%\")\n", "print(f\"Potential revenue saved through retention: ${saved_revenue:,.0f}\")\n", "print(f\"Retention program ROI: {retention_roi:.1f}%\")\n", "\n", "# Accuracy metrics\n", "accuracy = predictions.filter(\"churn_risk = prediction\").count() / predictions.count()\n", "precision = predictions.filter(\"prediction = 1 AND churn_risk = 1\").count() / predictions.filter(\"prediction = 1\").count() if predictions.filter(\"prediction = 1\").count() > 0 else 0\n", "recall = predictions.filter(\"prediction = 1 AND churn_risk = 1\").count() / predictions.filter(\"churn_risk = 1\").count() if predictions.filter(\"churn_risk = 1\").count() > 0 else 0\n", "\n", "print(f\"\\nModel Performance:\")\n", "print(f\"Accuracy: {accuracy:.4f}\")\n", "print(f\"Precision: {precision:.4f}\")\n", "print(f\"Recall: {recall:.4f}\")\n", "print(f\"AUC: {auc:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Takeaways: Delta Liquid Clustering + ML in AIDP\n", "\n", "### What We Demonstrated\n", "\n", "1. **Automatic Optimization**: Created a table with `CLUSTER BY (guest_id, booking_date)` and let Delta automatically optimize data layout\n", "\n", "2. **Performance Benefits**: Queries on clustered columns (guest_id, booking_date) are significantly faster due to data locality\n", "\n", "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", "\n", "4. **Machine Learning Integration**: Trained a guest churn prediction model using the optimized data\n", "\n", "5. **Real-World Use Case**: Hospitality analytics where guest experience and revenue management are critical\n", "\n", "### AIDP Advantages\n", "\n", "- **Unified Analytics**: Seamlessly integrates data optimization with ML\n", "- **Governance**: Catalog and schema isolation for hospitality data\n", "- **Performance**: Optimized for both analytical queries and ML training\n", "- **Scalability**: Handles hospitality-scale data volumes effortlessly\n", "\n", "### Business Benefits for Hospitality\n", "\n", "1. **Revenue Protection**: Identify and retain at-risk guests before they churn\n", "2. **Customer Lifetime Value**: Increase long-term guest relationships and spending\n", "3. **Marketing Efficiency**: Targeted interventions for high-value guests\n", "4. **Competitive Advantage**: Superior guest experience through proactive service\n", "5. **Operational Intelligence**: Data-driven decisions for revenue management\n", "\n", "### Best Practices for Hospitality Analytics\n", "\n", "1. **Choose clustering columns** based on your most common query patterns\n", "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", "3. **Consider cardinality** - high-cardinality columns work best\n", "4. **Monitor and adjust** as query patterns evolve\n", "5. **Combine with ML** for predictive analytics and automation\n", "\n", "### Next Steps\n", "\n", "- Explore other AIDP ML features like AutoML\n", "- Try liquid clustering with different column combinations\n", "- Scale up to larger hospitality datasets\n", "- Integrate with real PMS systems and booking platforms\n", "- Deploy models for real-time churn prediction and guest interventions\n", "\n", "This notebook demonstrates how Oracle AI Data Platform makes advanced hospitality analytics accessible while maintaining enterprise-grade performance and governance." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }