{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline for the topology classifier with Apache Spark\n", "\n", "**2. Event Filtering and Feature Engineering** In this stage we prepare the input datasets for the three classifier models: starting from the output of the previous stage (data ingestion), we produce the training and test datasets in Apache Parquet format.\n", "\n", "To run this notebook we used the following configuration:\n", "* *Software stack*: Spark 3.3.2\n", "* *Platform*: CentOS 7, Python 3.9\n", "* *Spark cluster*: Analytix" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "23/03/03 20:48:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.\n", "23/03/03 20:48:22 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!\n" ] } ], "source": [ "# No need to run this when using the CERN SWAN service:\n", "# just add the Spark configuration parameters via the \"star\" button integration.\n", "\n", "# pip install pyspark, or use your favorite way to set SPARK_HOME; here we use findspark\n", "import findspark\n", "findspark.init('/home/luca/Spark/spark-3.3.2-bin-hadoop3')  # set path to SPARK_HOME\n", "\n", "# Create the Spark session and configure it according to your environment\n", "from pyspark.sql import SparkSession\n", "\n", "spark = ( SparkSession.builder\n", "         .appName(\"2-Feature Preparation\")\n", "         .master(\"yarn\")\n", "         .config(\"spark.driver.memory\", \"2g\")\n", "         .config(\"spark.executor.memory\", \"64g\")\n", "         .config(\"spark.executor.cores\", \"8\")\n", "         .config(\"spark.dynamicAllocation.enabled\", \"true\")\n", "         .config(\"spark.ui.showConsoleProgress\", \"false\")\n", "         
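# Example (optional): with dynamic allocation enabled, you may want to cap the\n", "         # number of executors on a shared cluster. The value below is illustrative only;\n", "         # tune it for your environment.\n", "         .config(\"spark.dynamicAllocation.maxExecutors\", \"20\")\n", "         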
.getOrCreate()\n", " )\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory\n", "SparkContext\n", "Spark UI\n", "Version: v3.3.2\n", "Master: yarn\n", "AppName: 2-Feature Preparation