{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Kz-RRWaZwIE_" }, "source": [ "## Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XOLh456Zj-vi" }, "outputs": [], "source": [ "!mkdir -p ~/.aws && cp /content/drive/MyDrive/AWS/684947_admin ~/.aws/credentials\n", "!chmod 600 ~/.aws/credentials\n", "!pip install -qq awscli boto3\n", "!aws sts get-caller-identity" ] }, { "cell_type": "markdown", "metadata": { "id": "ejuuRv5pwO0B" }, "source": [ "## Streaming ETL with Glue" ] }, { "cell_type": "markdown", "metadata": { "id": "LBWnq0a6wVgA" }, "source": [ "In this lab you will learn how to ingest, process, and consume streaming data using AWS serverless services such as Kinesis Data Streams, Glue, S3, and Athena. To simulate the data streaming input, we will use Kinesis Data Generator (KDG)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "aqUemRwEwduC" }, "source": [ "![](https://user-images.githubusercontent.com/62965911/214810281-014f57ff-ed16-4bf5-89b7-2c473e583aaf.png)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lVbu4Z8zzgf3", "outputId": "314926f2-67d6-4e44-ad51-85a683481a75" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"StackId\": \"arn:aws:cloudformation:us-east-1:684199068947:stack/KinesisGlue/821f3c20-6c97-11ed-a13b-0e9b6ec0e0ff\"\n", "}\n" ] } ], "source": [ "!aws cloudformation create-stack \\\n", "--stack-name KinesisGlue \\\n", "--template-body file://kinesis_glue.yml \\\n", "--capabilities CAPABILITY_NAMED_IAM" ] }, { "cell_type": "markdown", "metadata": { "id": "HE7cWR7y0pRR" }, "source": [ "### Set up the kinesis stream" ] }, { "cell_type": "markdown", "metadata": { "id": "WHG9xHUg0xRj" }, "source": [ "1. Navigate to AWS Kinesis console \n", "1. Click “Create data stream” \n", "1. Enter the following details\n", " - Data stream name: TicketTransactionStreamingData\n", " - Capacity mode: Provisioned\n", " - Provisioned shards: 2\n", "1. Click Create data stream" ] }, { "cell_type": "markdown", "metadata": { "id": "-GkEG0AO0-O7" }, "source": [ "### Create Table for Kinesis Stream Source in Glue Data Catalog" ] }, { "cell_type": "markdown", "metadata": { "id": "TZH7yW5C073U" }, "source": [ "1. Navigate to AWS Glue console \n", "1. On the AWS Glue menu, select Tables, then select Add table manually from the drop down list. \n", "1. Enter TicketTransactionStreamData as the table name\n", "1. Click Add database and enter tickettransactiondatabase as the database name, and click create. \n", "1. Use the drop down to select the database we just created, and click Next. \n", "1. Select Kinesis as the source, select Stream in my account to select a Kinesis data stream, select the appropriate AWS region where you have created the stream, select the stream name as TicketTransactionStreamingData from the dropdown, and click Next. \n", "1. Choose JSON as the incoming data format, as we will be sending JSON payloads from the Kinesis Data Generator in the following steps. Click Next. \n", "1. Leave the schema as empty, as we will enable the schema detection feature when defining the Glue stream job. Click Next. \n", "1. Leave partition indices empty. Click Next. \n", "1. Review all the details and click Finish. 
" ] }, { "cell_type": "markdown", "metadata": { "id": "5C5Gk3qH1W_N" }, "source": [ "### Create and trigger the Glue Streaming job" ] }, { "cell_type": "markdown", "metadata": { "id": "8OXHdpIYzxfk" }, "source": [ "1. On the left-hand side of the Glue Console, click on Jobs. This will launch Glue Studio. \n", "1. Select the Visual with a blank canvas option. Click Create \n", "1. Select Amazon Kinesis from the Source drop down list. \n", "1. In the panel on the right under “Data source properties - Kinesis Stream”, configure as follows:\n", " - Amazon Kinesis Source: Data Catalog table\n", " - Database: tickettransactiondatabase\n", " - Table: tickettransactionstreamdata\n", " - Make sure that Detect schema is selected\n", " - Leave all other fields as default\n", "1. Select Amazon S3 from the Target drop down list. \n", "1. Select the Data target - S3 bucket node at the bottom of the graph, and configure as follows:\n", " - Format: Parquet\n", " - Compression Type: None\n", " - S3 Target Location: Select Browse S3 and select the “mod-xxx-dmslabs3bucket-xxx” bucket\n", "1. Append TicketTransactionStreamingData/ to the S3 URL. The path should look similar to s3://mod-xxx-dmslabs3bucket-xxx/TicketTransactionStreamingData/ - don’t forget the / at the end. The job will automatically create the folder. \n", "1. Finally, select the Job details tab at the top and configure as follows:\n", " - Name: TicketTransactionStreamingJob\n", " - IAM Role: Select the xxx-GlueLabRole-xxx from the drop down list\n", " - Type: Spark Streaming\n", "1. Press the Save button in the top right-hand corner to create the job.\n", "1. Once you see the Successfully created job message in the banner, click the Run button to start the job." ] }, { "cell_type": "markdown", "metadata": { "id": "XpZc5MZL3IKT" }, "source": [ "### Trigger streaming data from KDG" ] }, { "cell_type": "markdown", "metadata": { "id": "qg2gEUdl3G9V" }, "source": [ "1. Launch KDG using the url you bookmarked from the lab setup. Login using the user and password you provided when deploying the Cloudformation stack.\n", "1. Make sure you select the appropriate region, from the dropdown list, select the TicketTransactionStreamingData as the target Kinesis stream, leave Records per second as the default (100 records per second); for the record template, type in NormalTransaction as the payload name, and copy the template payload as follows:\n", " ```json\n", " {\n", " \"customerId\": \"{{random.number(50)}}\",\n", " \"transactionAmount\": {{random.number(\n", " {\n", " \"min\":10,\n", " \"max\":150\n", " }\n", " )}},\n", " \"sourceIp\" : \"{{internet.ip}}\",\n", " \"status\": \"{{random.weightedArrayElement({\n", " \"weights\" : [0.8,0.1,0.1],\n", " \"data\": [\"OK\",\"FAIL\",\"PENDING\"]\n", " } \n", " )}}\",\n", " \"transactionTime\": \"{{date.now}}\" \n", " }\n", " ```\n", "1. Click Send data to trigger the simulated transaction streaming data. " ] }, { "cell_type": "markdown", "metadata": { "id": "Mh30U7Eo_283" }, "source": [ "### Verify the Glue stream job" ] }, { "cell_type": "markdown", "metadata": { "id": "gk4gZvEPAEBp" }, "source": [ "1. Navigate to Amazon S3 console \n", "1. Navigate to the S3 bucket path we’ve set as Glue Stream Job target, note the folder structure of the processed data. \n", "1. Check the folder content using current date and time as the folder name. Verify that the streaming data has been transformed into parquet format and persisted into corresponding folders." 
] }, { "cell_type": "markdown", "metadata": { "id": "oV-x-yHWAFan" }, "source": [ "### Create Glue Crawler for the transformed data" ] }, { "cell_type": "markdown", "metadata": { "id": "Ty-87UYdAV2o" }, "source": [ "1. Navigate to AWS Glue console \n", "1. On the AWS Glue menu, select Crawlers and click Add crawler. \n", "1. Enter TicketTransactionParquetDataCrawler as the name of the crawler, click Next. \n", "1. Leave the default to specify Data stores as Crawler source type and Crawl all folders, click Next. \n", "1. Choose S3 as data store and choose Specified path in my account. Click the icon next to Include path input to select the S3 bucket. Make sure you select the folder TicketTransactionStreamingData. Click Select. \n", "1. Select No to indicate no other data store needed, then click Next. \n", "1. Choose an existing IAM role, using the dropdown list to select the role with GlueLabRole in the name, click Next. \n", "1. As the data is partitioned to hour, so we set the crawler to run every hour to make sure newly added partition is added. Click Next. \n", "1. Using the dropdown list to select tickettransactiondatabase as the output database, enter parquet_ as the prefix for the table, click Next. \n", "1. Review the crawler configuration and click Finish to create the crawler. \n", "1. Once the crawler is created, select the crawler and click Run crawler to trigger the first run. \n", "1. When crawler job stopped, go to Glue Data catalog, under Tables, verify that parquet_tickettransactionstreamingdata table is listed. \n", "1. Click the parquet_tickettransactionstreamingdata table, verify that Glue has correctly identified the streaming data format while transforming source data from Json format to Parquet. " ] }, { "cell_type": "markdown", "metadata": { "id": "TRkbhAE7AXUd" }, "source": [ "### Trigger abnormal transaction data from KDG" ] }, { "cell_type": "markdown", "metadata": { "id": "ce3tcvlUAk-_" }, "source": [ "1. Keep the KDG streaming data running, open another browser and launch KDG using the url you bookmarked from the lab setup, login using the user and password you provided when deploying the stack.\n", "1. Make sure you select the appropriate region, from the dropdown list, select the TicketTransactionStreamingData as the target Kinesis stream, put Records per second as 1; click Template 2, and prepare to copy abnormal transaction data, \n", "1. For the record template, type in AbnormalTransaction as the payload name, and copy the template payload as below,\n", " ```json\n", " {\n", " \"customerId\": \"{{random.number(50)}}\",\n", " \"transactionAmount\": {{random.number(\n", " {\n", " \"min\":10,\n", " \"max\":150\n", " }\n", " )}},\n", " \"sourceIp\" : \"221.233.116.256\",\n", " \"status\": \"{{random.weightedArrayElement({\n", " \"weights\" : [0.8,0.1,0.1],\n", " \"data\": [\"OK\",\"FAIL\",\"PENDING\"]\n", " } \n", " )}}\",\n", " \"transactionTime\": \"{{date.now}}\" \n", " }\n", " ```\n", "1. Click Send data to simulate abnormal transactions (1 transaction per second all from the same source IP address)." ] }, { "cell_type": "markdown", "metadata": { "id": "17cYlNllAmb2" }, "source": [ "### Detect Abnormal Transactions using Ad-Hoc query from Athena" ] }, { "cell_type": "markdown", "metadata": { "id": "hZ6x_2_13nH3" }, "source": [ "1. Navigate to AWS Athena console \n", "1. 
{ "cell_type": "markdown", "metadata": { "id": "17cYlNllAmb2" }, "source": [ "### Detect Abnormal Transactions using Ad-Hoc query from Athena" ] },
{ "cell_type": "markdown", "metadata": { "id": "hZ6x_2_13nH3" }, "source": [ "1. Navigate to the AWS Athena console. \n", "1. Make sure you select AwsDataCatalog as the Data source and tickettransactiondatabase as the database, then refresh to confirm that parquet_tickettransactionstreamingdata appears in the table list. \n", "1. Copy the query below; it counts the number of transactions per source IP over the last hour. You should see a large number of transactions coming from the same source IP.\n", " ```sql\n", " SELECT count(*) as numberOfTransactions, sourceip\n", " FROM \"tickettransactiondatabase\".\"parquet_tickettransactionstreamingdata\"\n", " WHERE cast(ingest_year as bigint)=year(now())\n", " AND cast(ingest_month as bigint)=month(now())\n", " AND cast(ingest_day as bigint)=day_of_month(now())\n", " AND cast(ingest_hour as bigint)=hour(now())\n", " GROUP BY sourceip\n", " ORDER BY numberOfTransactions DESC;\n", " ```\n", "1. Copy the query below; it inspects the transaction details for that source IP. It confirms that the requests come from the same IP but with different customer IDs, which identifies them as abnormal transactions.\n", " ```sql\n", " SELECT *\n", " FROM \"tickettransactiondatabase\".\"parquet_tickettransactionstreamingdata\"\n", " WHERE cast(ingest_year as bigint)=year(now())\n", " AND cast(ingest_month as bigint)=month(now())\n", " AND cast(ingest_day as bigint)=day_of_month(now())\n", " AND cast(ingest_hour as bigint)=hour(now())\n", " AND sourceip='221.233.116.256'\n", " LIMIT 100;\n", " ```" ] },
{ "cell_type": "markdown", "metadata": { "id": "KNf6s9L-A3Fo" }, "source": [ "## Clean Up" ] },
{ "cell_type": "markdown", "metadata": { "id": "9oKGBx8tBJ5_" }, "source": [ "1. In your AWS account, navigate to the CloudFormation console.\n", "1. On the CloudFormation console, select the stack you created.\n", "1. Click the Actions drop-down and select Delete stack." ] },
{ "cell_type": "markdown", "metadata": { "id": "GhN-M1bNBGNz" }, "source": [ "Because you created the Kinesis data stream and the Glue streaming job manually, delete them separately as well: select each resource, click the Actions drop-down, and choose Delete." ] },
{ "cell_type": "markdown", "metadata": { "id": "UUy_255EBfMF" }, "source": [ "Also delete the Glue crawler, tables, and database manually." ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3.9.7 ('env-spacy')", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.9.7 (default, Sep 16 2021, 08:50:36) \n[Clang 10.0.0 ]" }, "vscode": { "interpreter": { "hash": "343191058819caea96d5cde1bd3b1a75b4807623ce2cda0e1c8499e39ac847e3" } } }, "nbformat": 4, "nbformat_minor": 0 }