{ "cells": [ { "cell_type": "markdown", "id": "9d6c3e5b-0b30-40bc-af4e-1d1057aa8115", "metadata": {}, "source": [ "## Oracle AI Data Platform v1.0\n", "\n", "Copyright © 2025, Oracle and/or its affiliates.\n", "\n", "Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/" ] }, { "cell_type": "markdown", "id": "37c977db-3f6b-4625-9f71-ebde52bd0bf9", "metadata": { "execution": { "iopub.status.busy": "2025-04-04T21:34:34.078Z" }, "type": "python" }, "source": [ "# Ingest from Multi Cloud Storage\n", "\n", "This notebook illustrates how to ingest data from multiple cloud storage systems, including:\n", "- Azure ADLS - https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction\n", "- AWS S3 - https://aws.amazon.com/pm/serv-s3\n", "\n", "The integration pattern is the same for each platform: download the correct versions of the dependent libraries identified below and install them into your compute cluster. The cloud-specific details are in the notebook cells below for each platform and generally involve setting spark.conf values.\n" ] }, { "cell_type": "markdown", "id": "1b226341-9af7-439b-88cb-51c66d573cfe", "metadata": { "type": "markdown" }, "source": [ "## Ingest from Azure ADLS\n", "\n", "This example ingests data from ADLS and writes it into a Delta table in the catalog.\n", "\n", "### Prerequisites\n", "\n", "1. Install the Hadoop Azure JAR file from\n", " - https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/3.3.4\n", "\n", "2. Install the dependent libraries\n", " - https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind/2.12.7\n", " - https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core/2.12.7\n", " - https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-mapper-asl/1.9.13\n", "\n", "3. Restart the cluster.\n", "4. Run your notebooks and Python tasks." 
] }, { "cell_type": "code", "execution_count": null, "id": "d83f2376-12d0-436a-8d40-998952c3bc84", "metadata": { "type": "python" }, "outputs": [], "source": [ "# Change for your details\n", "storage_account_name=\"your_storage_account_name\"\n", "client_id=\"your_client_id\"\n", "secret=\"your_secret\"\n", "tenant=\"your_tenant\"\n", "container=\"your_container\"\n", "data_file=\"your_file_name\" # any file type works; make sure the spark.read format below matches it\n", "target_table_name=\"default.default.data_from_adls\"\n", "# end of changes\n", "\n", "# Authenticate to ADLS with OAuth client credentials (service principal)\n", "spark.conf.set(f\"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net\", \"OAuth\")\n", "spark.conf.set(f\"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net\", \"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\")\n", "spark.conf.set(f\"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net\", client_id)\n", "spark.conf.set(f\"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net\", secret)\n", "spark.conf.set(f\"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net\", f\"https://login.microsoftonline.com/{tenant}/oauth2/token\")\n", "\n", "df = spark \\\n", "    .read \\\n", "    .format(\"csv\") \\\n", "    .option(\"header\", True) \\\n", "    .load(f\"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{data_file}\")\n", "df.show()\n", "\n", "df.write.mode(\"overwrite\").format(\"delta\").saveAsTable(target_table_name)" ] }, { "cell_type": "markdown", "id": "18199aa5-6284-4db1-8c86-85b070d628f9", "metadata": { "execution": { "iopub.status.busy": "2025-04-04T21:45:12.416Z" }, "type": "markdown" }, "source": [ "## Ingest from AWS S3\n", "\n", "This example ingests data from S3 and writes it into a Delta table in the catalog.\n", "\n", "### Prerequisites\n", "\n", "1. 
Install the Hadoop AWS JAR file from\n", " - https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.4\n", "\n", "2. Install the AWS Java SDK bundle from\n", " - https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.12.262\n", "\n", "The bundle is about 280 MB, so for now you must **install it in the cluster by using an external volume**: upload the JAR to OCI Object Storage and then create an external volume.\n", "\n", "3. For a simple configuration, set the credentials in the Spark configuration on the cluster:\n", "```\n", "spark.hadoop.fs.s3a.secret.key = your_secret_key\n", "spark.hadoop.fs.s3a.access.key = your_access_key\n", "```\n", "\n", "4. Restart the cluster.\n", "\n", "5. Run your notebooks and Python tasks." ] }, { "cell_type": "code", "execution_count": null, "id": "e3341a49-ba4e-4e3c-827c-70c00ab3e75d", "metadata": { "type": "python" }, "outputs": [], "source": [ "# Change for your details\n", "bucket_name = 'your_bucket'\n", "file_name = 'your_file_name'\n", "target_table_name = 'default.default.data_from_s3'\n", "region=\"us-east-1\"\n", "# end of changes\n", "\n", "# Use the access key and secret key set in the cluster's Spark configuration\n", "spark.conf.set(\"fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\")\n", "\n", "df = spark.read.json(f\"s3a://{bucket_name}/{file_name}\")\n", "df.show()\n", "\n", "df.write.mode(\"overwrite\").format(\"delta\").saveAsTable(target_table_name)\n" ] }, { "cell_type": "markdown", "id": "5cacfd8a-200d-4ba8-b5e6-fdd5560ae1eb", "metadata": { "type": "markdown" }, "source": [ "## AWS S3 with boto3\n", "\n", "To install boto3, create a requirements.txt file that lists the boto3 package and install it as a library in your cluster." 
] }, { "cell_type": "code", "execution_count": null, "id": "c0c67993-36bd-4a67-a9c2-e1cb19a620b7", "metadata": { "type": "python" }, "outputs": [], "source": [ "import boto3\n", "\n", "# Change for your details\n", "secret=\"your_secret\"\n", "access=\"your_access_key\"\n", "region=\"us-east-1\"\n", "bucket_name=\"your_bucket\"\n", "prefix=\"\" # an empty prefix matches every key in the bucket\n", "# end of changes\n", "\n", "s3 = boto3.client('s3', aws_access_key_id=access, aws_secret_access_key=secret, region_name=region)\n", "\n", "# List the objects in the bucket under the prefix (a single call returns at most 1,000 keys)\n", "response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix)\n", "for content in response.get('Contents', []):\n", "    print(content['Key'])\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "title": "Ingest from other clouds including S3 and ADLS" }, "nbformat": 4, "nbformat_minor": 5 }