{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "65284526-9bd7-4352-a60c-46a012411946", "metadata": {}, "outputs": [], "source": [ "Oracle AI Data Platform v1.0\n", "\n", "Copyright © 2025, Oracle and/or its affiliates.\n", "\n", "Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/" ] }, { "cell_type": "markdown", "id": "c391de18-a5ea-4fc3-95ab-05593f9bbda0", "metadata": { "type": "python" }, "source": [ "# Access Object Storage Data\n", " **Accessing External Data in OCI from AI Data Platform**\n", " \n", "This notebook demonstrates how to efficiently read and write data from OCI Object Storage using AI Data Platform. You’ll learn three different approaches:\n", " \n", " 1. **Direct Read/Write to OCI Object Storage**\n", " 2. **Accessing OCI Object Storage via an External Volume**\n", " - Creating an external volume\n", " - Reading and writing data\n", " 3. **Using an External Table Referencing OCI Object Storage**\n", " - Creating an external table\n", " - Reading and writing data\n", "\n", " **Prerequisites**\n", "\n", "Before you begin, ensure you have:\n", " - An OCI Object Storage bucket created in your tenancy.\n", " - The necessary IAM policies for accessing AI Data Platform. Learn more about permissions.\n", " - A configured AI Data Platform environment with a compute cluster created.\n", "\n", "\n", " **Parameters**\n", " \n", "Change the values in the parameters section, before executing the parameter cell and remaining cells;\n", " - OCI Object Storage bucket name and namespace - must change these to your bucket name and namespace\n", " - Folder and file name\n", " - Catalog name\n", " - Schema name\n", " - External volume name\n", " - External table name\n", "\n", " **Next Steps**\n", "\n", "Follow the step-by-step instructions in this notebook to implement each approach. You can also refer to the following resources for additional guidance:\n", "\t•\tWorking with OCI Object Storage\n", "\t•\tMounting OCI Object Storage as an External Volume\n", "\t•\tUsing External Tables in AI Data Platform\n", "\n", "By the end of this notebook, you’ll be able to choose the best method for accessing Object Storage data in your OCI environment." ] }, { "cell_type": "code", "execution_count": null, "id": "d3bf8e1a-27c3-466d-a89c-fad03509c6d1", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:50:01.623Z" } }, "outputs": [], "source": [ "# Define parameters (can set parameters in a workflow job)\n", "\n", "oci_bucket=oidlUtils.parameters.getParameter(\"OCI_BUCKET\", \"oci://replace_bucket_name@replace_namespace\")\n", "folder_name=oidlUtils.parameters.getParameter(\"FOLDER_NAME\", \"new_data_folder\")\n", "file_name=oidlUtils.parameters.getParameter(\"FILE_NAME\", \"new_data\")\n", "catalog_name=oidlUtils.parameters.getParameter(\"CATALOG_NAME\", \"default\")\n", "schema_name=oidlUtils.parameters.getParameter(\"SCHEMA_NAME\", \"default\")\n", "external_volume_name=oidlUtils.parameters.getParameter(\"EXTERNAL_VOLUME_NAME\", \"ext_volume\")\n", "external_table_name=oidlUtils.parameters.getParameter(\"EXTERNAL_TABLE_NAME\", \"ext_table\")\n", "\n", "#End of parameters that need to be set, you can run the rest of the cells" ] }, { "cell_type": "code", "execution_count": 1, "id": "bc95f314-579e-44d7-9328-18777bc1b72d", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:50:16.507Z" } }, "outputs": [], "source": [ "# Define paths\n", "oci_file_path = f\"{oci_bucket}/{folder_name}/{file_name}\"\n", "external_volume_path = f\"/Volumes/{catalog_name}/{schema_name}/{external_volume_name}/{folder_name}/{file_name}\"\n", "external_table_path = f\"{oci_bucket}/{folder_name}/{file_name}\"" ] }, { "cell_type": "code", "execution_count": null, "id": "bcd4c2c6-be2c-4062-829c-947f8df4201e", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:50:25.962Z" } }, "outputs": [], "source": [ "# Display parameter values\n", "print(f\"Using OCI Bucket: {oci_bucket}\")\n", "print(f\"Folder Name: {folder_name}\")\n", "print(f\"File Name: {file_name}\")\n", "print(f\"External Volume Name: {external_volume_name}\")\n", "print(f\"External Volume Path: {external_volume_path}\")\n", "print(f\"External Table Name: {external_table_name}\")\n", "print(f\"External Table Path: {external_table_path}\")\n", "print(f\"Schema: {schema_name}\")\n" ] }, { "cell_type": "markdown", "id": "4bc98141-e686-4bb5-9de2-c6b834b862e6", "metadata": {}, "source": [ "# **1. Directly Read from and Write to OCI Object Storage**\n", "We will:\n", " - Create a sample DataFrame.\n", " - Write the DataFrame directly to an OCI Object Storage bucket as CSV format, you can expriment and change the format.\n", " - Read the data back from OCI Object Storage.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "d057c4e4-101b-4995-85bf-7ea75561cb2b", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:50:49.171Z" } }, "outputs": [ { "data": { "text/html": [ "
Data written directly to oci://your_bucket@your_namespace/data/demodata\n",
       "Data read directly from OCI:\n",
       "+-------+---+\n",
       "|   name|age|\n",
       "+-------+---+\n",
       "|    Bob| 35|\n",
       "|Charlie| 25|\n",
       "|  Alice| 30|\n",
       "+-------+---+\n",
       "\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a sample DataFrame\n", "data = [(\"Alice\", 30), (\"Bob\", 35), (\"Charlie\", 25)]\n", "columns = [\"name\", \"age\"]\n", "df = spark.createDataFrame(data, columns)\n", "\n", "# Write DataFrame to OCI Object Storage\n", "df.write.mode(\"overwrite\").option(\"header\", True).format(\"csv\").save(oci_file_path)\n", "print(f\"Data written directly to {oci_file_path}\")\n", "\n", "# Read from OCI Object Storage\n", "df_read = spark.read.option(\"header\", True).format(\"csv\").load(oci_file_path)\n", "print(\"Data read directly from OCI:\")\n", "df_read.show()" ] }, { "cell_type": "markdown", "id": "c053c618-608f-47d0-9843-a31a9790d8c0", "metadata": {}, "source": [ "# **2. Accessing OCI Object Storage via an External Volume**\n", "We will:\n", " - Create an **external volume** referencing OCI Object Storage.\n", " - Write data to the external volume.\n", " - Read data back from the external volume." ] }, { "cell_type": "markdown", "id": "2dac79a4-a900-4890-bd3a-3beddcd65d15", "metadata": {}, "source": [ "# 2-1. Create External Volume referencing Object Storage\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "eca85fb6-dcb4-4fec-a1a3-2a5b2c351f0e", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:50:59.593Z" } }, "outputs": [ { "data": { "text/html": [ "
External volume \"extvolume\" created.\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "create_volume_sql = f\"\"\"create external volume if not exists {catalog_name}.{schema_name}.{external_volume_name} location '{oci_bucket}/'\"\"\"\n", "spark.sql(create_volume_sql)\n", "print(f\"External volume '{external_volume_name}' created.\")" ] }, { "cell_type": "markdown", "id": "da7cde1e-1d30-4581-8062-00bd844f3208", "metadata": {}, "source": [ "# 2-2. Read/Write via External Volume" ] }, { "cell_type": "code", "execution_count": null, "id": "038b325b-6c99-43fb-b718-f7e178e00d80", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:51:14.938Z" } }, "outputs": [], "source": [ "# Write to External Volume\n", "df.write.mode(\"overwrite\").option(\"header\", True).format(\"csv\").save(external_volume_path)\n", "print(f\"Data written to external volume at {external_volume_path}\")\n", "\n", "# Read from External Volume\n", "df_volume = spark.read.option(\"header\", True).format(\"csv\").load(external_volume_path)\n", "print(\"Data read from external volume:\")\n", "df_volume.show()\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "9b469f3f-cf69-494f-a4d7-f1f178d38025", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:51:19.140Z" }, "type": "python" }, "outputs": [ { "data": { "text/html": [ "
External volume \"extvolume\" dropped.\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Cleanup\n", "drop_volume_sql = f\"\"\"drop volume {catalog_name}.{schema_name}.{external_volume_name}\"\"\"\n", "spark.sql(drop_volume_sql)\n", "print(f\"External volume '{external_volume_name}' dropped.\")" ] }, { "cell_type": "markdown", "id": "dc48fcc3-6bbb-4442-a20e-e9d4dc8abb87", "metadata": {}, "source": [ "# **3. Using an External Table Referencing OCI Object Storage**\n", "We will:\n", "- Create an **external table** referencing OCI Object Storage.\n", "- Write data to the external table location.\n", "- Read data from the external table using SQL.\n" ] }, { "cell_type": "markdown", "id": "a72f8ad3-ca40-4dd6-8632-237ed94739f6", "metadata": {}, "source": [ "# 3-1. Create External Table" ] }, { "cell_type": "code", "execution_count": null, "id": "5511abc5-66fa-46da-be84-502639153edc", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:51:33.684Z" } }, "outputs": [], "source": [ "create_table_sql = f\"\"\"CREATE TABLE IF NOT EXISTS {catalog_name}.{schema_name}.{external_table_name} (name STRING, age INT) USING CSV OPTIONS (path='{oci_file_path}',delimiter=',',header='true')\"\"\"\n", "print(create_table_sql)\n", "spark.sql(create_table_sql)\n", "print(f\"External table '{catalog_name}.{schema_name}.{external_table_name}' created.\")\n" ] }, { "cell_type": "markdown", "id": "3442b815-de63-4953-aedc-4d0200223190", "metadata": { "type": "markdown" }, "source": [ "# 3-2. Query External Table" ] }, { "cell_type": "code", "execution_count": 1, "id": "c4290ca5-4f84-4eb6-aecd-6dad026e90cb", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:51:49.169Z" }, "type": "python" }, "outputs": [ { "data": { "text/html": [ "
Data read from external table:\n",
       "+-------+---+\n",
       "|   name|age|\n",
       "+-------+---+\n",
       "|    Bob| 35|\n",
       "|Charlie| 25|\n",
       "|  Alice| 30|\n",
       "+-------+---+\n",
       "\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Read from External Table\n", "df_table = spark.sql(f\"SELECT * FROM {catalog_name}.{schema_name}.{external_table_name}\")\n", "print(\"Data read from external table:\")\n", "df_table.show()" ] }, { "cell_type": "markdown", "id": "1d06f772-20cf-4b56-bc7b-9f2ac2609588", "metadata": {}, "source": [ "# 3-2. Write Data to External Table" ] }, { "cell_type": "code", "execution_count": 1, "id": "489bb1e5-41d0-4e98-9a55-1fa72aa87235", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:51:59.880Z" } }, "outputs": [ { "data": { "text/html": [ "
Data written to external table location oci://your_bucket@your_namespace/data/demodata\n",
       "Data read from external table:\n",
       "+-------+---+\n",
       "|   name|age|\n",
       "+-------+---+\n",
       "|    Bob| 35|\n",
       "|Charlie| 25|\n",
       "|  Alice| 30|\n",
       "+-------+---+\n",
       "\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.write.mode(\"overwrite\").option(\"overwriteSchema\", \"true\").option(\"header\", True).format(\"csv\").save(external_table_path)\n", "print(f\"Data written to external table location {external_table_path}\")\n", "\n", "# Read from External Table\n", "df_table = spark.sql(f\"SELECT * FROM {catalog_name}.{schema_name}.{external_table_name}\")\n", "print(\"Data read from external table:\")\n", "df_table.show()" ] }, { "cell_type": "code", "execution_count": 1, "id": "d897cb3c-cd24-4fc8-9074-c31e20276755", "metadata": { "execution": { "iopub.status.busy": "2025-03-25T15:52:06.082Z" }, "type": "python" }, "outputs": [ { "data": { "text/html": [ "
External table \"exttable\" dropped.\n",
       "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Cleanup\n", "drop_table_sql = f\"\"\"drop table {catalog_name}.{schema_name}.{external_table_name}\"\"\"\n", "spark.sql(drop_table_sql)\n", "print(f\"External table '{external_table_name}' dropped.\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }