{ "cells": [ { "cell_type": "markdown", "id": "d6244506", "metadata": {}, "source": [ "# PySpark miscellanea\n", "\n", "

\n", "

\n", " \n", "

Table of Contents

\n", "
\n", "\n" ] }, { "cell_type": "markdown", "id": "9dfe7c0a", "metadata": {}, "source": [ "## How to get your application's ID in pyspark\n", "\n", "See also: [How to extract application ID from the PySpark context](https://stackoverflow.com/questions/30983226/how-to-extract-application-id-from-the-pyspark-context)\n", "\n" ] }, { "cell_type": "markdown", "id": "02dbf961", "metadata": {}, "source": [ "### Starting from a Spark session\n", "\n", "What is a [Spark session](https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession)?" ] }, { "cell_type": "code", "execution_count": 1, "id": "0fd4ce7b", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"App ID\") \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "d02813d3", "metadata": {}, "source": [ "Get the session's context ([what is a Spark context?](https://spark.apache.org/docs/latest/rdd-programming-guide.html#initializing-spark) and [detailed documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html))." ] }, { "cell_type": "code", "execution_count": 2, "id": "7d8b99e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
<p><b>SparkContext</b> :: Spark UI :: Version: v3.1.2 | Master: local[*] | AppName: App ID</p>
\n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sc = spark.sparkContext\n", "sc" ] }, { "cell_type": "markdown", "id": "57df230b", "metadata": {}, "source": [ "Get `applicationId` from the context." ] }, { "cell_type": "code", "execution_count": 3, "id": "bed1685a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'local-1667946735598'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sc.applicationId" ] }, { "cell_type": "markdown", "id": "de33f1ce", "metadata": {}, "source": [ "All in one step:" ] }, { "cell_type": "code", "execution_count": 4, "id": "f0023074", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'local-1667946735598'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext.applicationId" ] }, { "cell_type": "markdown", "id": "dc31a770", "metadata": {}, "source": [ "**Note:** if you're using the _pyspark shell_ (see [using the shell](https://spark.apache.org/docs/latest/rdd-programming-guide.html#initializing-spark)), a `SparkContext` is created automatically and can be accessed via the variable `sc`." ] }, { "cell_type": "markdown", "id": "44ece7a4", "metadata": {}, "source": [ "## How to get/set default parallelism in pyspark" ] }, { "cell_type": "markdown", "id": "880e6ada", "metadata": {}, "source": [ "Create a [Spark session](https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession)." ] }, { "cell_type": "code", "execution_count": 5, "id": "451653da", "metadata": {}, "outputs": [], "source": [ "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"defaultParallelism\") \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "d2fe39a2", "metadata": {}, "source": [ "Check the value of `defaultParallelism`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "f221e3d1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext.defaultParallelism" ] }, { "cell_type": "markdown", "id": "69b3994c", "metadata": {}, "source": [ "To change a property like this, it is necessary to stop the current session and start a new context/session." ] }, { "cell_type": "code", "execution_count": 7, "id": "4cf47f42", "metadata": {}, "outputs": [], "source": [ "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"Set parallelism\") \\\n", " .config(\"spark.default.parallelism\", 4) \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "c6725687", "metadata": {}, "source": [ "Default parallelism hasn't changed! `getOrCreate()` returned the existing session, and the new value cannot be applied to its already-running `SparkContext`." ] }, { "cell_type": "code", "execution_count": 8, "id": "6b592d7e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext.defaultParallelism" ] }, { "cell_type": "markdown", "id": "49e168d0", "metadata": {}, "source": [ "Stop the session and start a new one."
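, "\n", "\n", "As a reusable pattern, here is a minimal sketch (the helper name `restart_session` is hypothetical, not part of the PySpark API): stop the active session, if any, then rebuild it with the desired configuration.\n", "\n", "```python\n", "from pyspark.sql import SparkSession\n", "\n", "def restart_session(app_name, conf):\n", "    # Stop the currently active session, if one exists\n", "    active = SparkSession.getActiveSession()\n", "    if active is not None:\n", "        active.stop()\n", "    # Build a fresh session so the new settings take effect\n", "    builder = SparkSession.builder.appName(app_name)\n", "    for key, value in conf.items():\n", "        builder = builder.config(key, value)\n", "    return builder.getOrCreate()\n", "\n", "spark = restart_session(\"Set parallelism\", {\"spark.default.parallelism\": 4})\n", "```\n", "\n", "The next cells perform the same steps inline."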
] }, { "cell_type": "code", "execution_count": 9, "id": "da0512cd", "metadata": {}, "outputs": [], "source": [ "spark.stop()\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"Set parallelism\") \\\n", " .config(\"spark.default.parallelism\", 4) \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 10, "id": "bba24e5e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext.defaultParallelism" ] }, { "cell_type": "markdown", "id": "1ef9f91f", "metadata": {}, "source": [ "Great! Now the context has been changed (and the application's name has been updated too)." ] }, { "cell_type": "code", "execution_count": 11, "id": "0e2fbc63", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
<p><b>SparkContext</b> :: Spark UI :: Version: v3.1.2 | Master: local[*] | AppName: Set parallelism</p>
\n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext" ] }, { "cell_type": "markdown", "id": "238722de", "metadata": {}, "source": [ "### What is `spark.default.parallelism`?\n", "\n", "This property determines the default number of partitions (chunks) into which an RDD ([Resilient Distributed Dataset](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)) is split.\n", "\n", "Unless specified by the user, its value is set based on the _cluster manager_:\n", " - in local mode: the number of (virtual) cores on the local machine\n", " - in Mesos fine-grained mode: 8\n", " - otherwise (standalone, YARN, Kubernetes): the total number of cores on all executor nodes or 2, whichever is larger\n", " \n", " (see [Spark configuration/Execution behavior](https://spark.apache.org/docs/latest/configuration.html#execution-behavior))" ] }, { "cell_type": "markdown", "id": "8cb775c1", "metadata": {}, "source": [ "## About `spark-defaults.conf`" ] }, { "cell_type": "markdown", "id": "4cb5650c", "metadata": {}, "source": [ "The file `spark-defaults.conf` contains the default Spark configuration properties and is located by default in Spark's configuration directory `$SPARK_HOME/conf` (see [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)). \n", "\n", "Each line of `spark-defaults.conf` consists of a property name and a value, separated by whitespace, for instance:\n", "```\n", "spark.master spark://5.6.7.8:7077\n", "spark.executor.memory 4g\n", "spark.eventLog.enabled true\n", "spark.serializer org.apache.spark.serializer.KryoSerializer\n", "```" ] }, { "cell_type": "markdown", "id": "81a5dae5", "metadata": {}, "source": [ "### Where is my `spark-defaults.conf`?" ] }, { "cell_type": "markdown", "id": "7ac1d292", "metadata": {}, "source": [ "If there is no `spark-defaults.conf` file in Spark's configuration directory, you should find a _template_ configuration file `spark-defaults.conf.template`. You can rename this to `spark-defaults.conf` and use it as the default configuration file.\n", "\n", "Let's look for all files called `spark-defaults*` in Spark's configuration directory:" ] }, { "cell_type": "code", "execution_count": 12, "id": "6fede028", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import glob, os\n", "glob.glob(os.path.join(os.environ[\"SPARK_HOME\"], \"conf\", \"spark-defaults*\"))" ] }, { "cell_type": "markdown", "id": "077ff7c0", "metadata": {}, "source": [ "We only found the template file. \n", "\n", "Save the output of the last cell (`_`) in the variable `conf_file`." ] }, { "cell_type": "code", "execution_count": 13, "id": "fa368738", "metadata": {}, "outputs": [], "source": [ "conf_file = _" ] }, { "cell_type": "markdown", "id": "04ddcf43", "metadata": {}, "source": [ "Look at the contents of `spark-defaults.conf.template`:" ] }, { "cell_type": "code", "execution_count": 14, "id": "4ab0f938", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#\n", "# Licensed to the Apache Software Foundation (ASF) under one or more\n", "# contributor license agreements. See the NOTICE file distributed with\n",
"# this work for additional information regarding copyright ownership.\n", "# The ASF licenses this file to You under the Apache License, Version 2.0\n", "# (the \"License\"); you may not use this file except in compliance with\n", "# the License. You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "#\n", "\n", "# Default system properties included when running spark-submit.\n", "# This is useful for setting default environmental settings.\n", "\n", "# Example:\n", "# spark.master spark://master:7077\n", "# spark.eventLog.enabled true\n", "# spark.eventLog.dir hdfs://namenode:8021/directory\n", "# spark.serializer org.apache.spark.serializer.KryoSerializer\n", "# spark.driver.memory 5g\n", "# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers=\"one two three\"\n", "\n" ] } ], "source": [ "with open(conf_file[0], 'r') as f:\n", " print(f.read())" ] }, { "cell_type": "markdown", "id": "21692085", "metadata": {}, "source": [ "As you can see, everything in the template file is commented out. To set some of the properties as defaults, rename `spark-defaults.conf.template` to `spark-defaults.conf`, then uncomment and edit the properties you want." ] }, { "cell_type": "markdown", "id": "2bc0e3e7", "metadata": {}, "source": [ "### How to override the default configuration directory\n", "\n", "If you want to keep your Spark configuration files in a directory other than `$SPARK_HOME/conf`, you can set the environment variable `SPARK_CONF_DIR`. \n", "\n", "Spark will then look in `$SPARK_CONF_DIR` for all of its configuration files: `spark-defaults.conf`, `spark-env.sh`, `log4j.properties`, etc. (see [overriding configuration directory](https://spark.apache.org/docs/latest/configuration.html#overriding-configuration-directory)).\n", "\n", "Here's the list of files in the default Spark configuration directory:" ] }, { "cell_type": "code", "execution_count": 15, "id": "ff9642c7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['spark-env.sh.template',\n", " 'fairscheduler.xml.template',\n", " 'metrics.properties.template',\n", " 'workers.template',\n", " 'log4j.properties.template',\n", " 'spark-defaults.conf.template',\n", " 'log4j.properties']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.listdir(os.path.join(os.environ[\"SPARK_HOME\"],\"conf\"))" ] }, { "cell_type": "markdown", "id": "91a61ed7", "metadata": {}, "source": [ "Now assume that you have no `spark-defaults.conf` and have not configured Spark anywhere else. Spark still has built-in _default values_ for several properties.\n", "\n", "Where are those properties defined, and how can you get their default values?"
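, "\n", "\n", "Before digging into the documentation listed in the next section, you can also probe effective values at runtime through the session's `RuntimeConfig` (a minimal sketch, assuming an active session `spark`; the quoted default is the one documented for `spark.sql.shuffle.partitions`):\n", "\n", "```python\n", "# Effective value: whatever is configured, else the built-in default\n", "spark.conf.get(\"spark.sql.shuffle.partitions\")  # '200' by default\n", "\n", "# Properties without a built-in default raise an error when unset,\n", "# unless a fallback is passed as the second argument\n", "spark.conf.get(\"spark.executor.memory\", \"1g\")\n", "```"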
] }, { "cell_type": "markdown", "id": "5fd5be6f", "metadata": {}, "source": [ "#### Spark configuration properties\n", "\n", "Spark's documentation provides the list of all [available properties](https://spark.apache.org/docs/latest/configuration.html#available-properties) grouped into several categories:\n", "\n", " - [application properties](https://spark.apache.org/docs/latest/configuration.html#application-properties)\n", " - [runtime environment](https://spark.apache.org/docs/latest/configuration.html#runtime-environment)\n", " - [shuffle behavior](https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior)\n", " - [Spark UI](https://spark.apache.org/docs/latest/configuration.html#spark-ui)\n", " - [compression and serialization](https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization)\n", " - [memory management](https://spark.apache.org/docs/latest/configuration.html#memory-management)\n", " - [execution behavior](https://spark.apache.org/docs/latest/configuration.html#execution-behavior)\n", " - [executor metrics](https://spark.apache.org/docs/latest/configuration.html#executor-metrics)\n", " - [networking](https://spark.apache.org/docs/latest/configuration.html#networking)\n", " - [scheduling](https://spark.apache.org/docs/latest/configuration.html#scheduling)\n", " - [barrier execution mode](https://spark.apache.org/docs/latest/configuration.html#barrier-execution-mode)\n", " - [dynamic allocation](https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation)\n", " - [thread configurations](https://spark.apache.org/docs/latest/configuration.html#thread-configurations)\n", " - [security](https://spark.apache.org/docs/latest/security.html)\n", " - [Spark SQL](https://spark.apache.org/docs/latest/configuration.html#spark-sql)\n", " - [Spark streaming](https://spark.apache.org/docs/latest/configuration.html#spark-streaming)\n", " - [SparkR](https://spark.apache.org/docs/latest/configuration.html#sparkr)\n", " - [GraphX](https://spark.apache.org/docs/latest/configuration.html#graphx)\n", " - [deploy](https://spark.apache.org/docs/latest/configuration.html#deploy)" ] }, { "cell_type": "markdown", "id": "98e1480f", "metadata": {}, "source": [ "All properties have a default value that should accommodate most situations.\n", "\n", "As a beginner, you might want to give your application a name by configuring `spark.app.name`, and perhaps change the default values of the following properties:\n", " - `spark.master` and `spark.submit.deployMode` to define where the application should be deployed\n", " - `spark.driver.memory` and `spark.driver.maxResultSize` to control the memory usage of the driver\n", " - `spark.executor.memory` and `spark.executor.cores` to control executor resources\n", "\n", "For instance, let's create a new session:" ] }, { "cell_type": "code", "execution_count": 16, "id": "844047f9", "metadata": {}, "outputs": [], "source": [ "spark.stop()\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"my_app\") \\\n", " .config(\"spark.driver.memory\", \"2g\") \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "f3c250e0", "metadata": {}, "source": [ "Show the properties included in the Spark context:" ] }, { "cell_type": "code", "execution_count": 17, "id": "9f25abd1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('spark.driver.port', '50313'),\n", " ('spark.app.id', 'local-1667946737593'),\n", " ('spark.sql.warehouse.dir',\n", " 'file:/Users/x/Documents/notebooks/spark-warehouse'),\n", "
('spark.executor.id', 'driver'),\n", " ('spark.app.startTime', '1667946737538'),\n", " ('spark.app.name', 'my_app'),\n", " ('spark.rdd.compress', 'True'),\n", " ('spark.driver.memory', '2g'),\n", " ('spark.serializer.objectStreamReset', '100'),\n", " ('spark.master', 'local[*]'),\n", " ('spark.submit.pyFiles', ''),\n", " ('spark.submit.deployMode', 'client'),\n", " ('spark.default.parallelism', '4'),\n", " ('spark.ui.showConsoleProgress', 'true'),\n", " ('spark.driver.host', '192.168.0.199')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.sparkContext.getConf().getAll()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 5 }