{ "metadata": { "name": "", "signature": "sha256:9fbcab805da32a163e6ef5abfe42f4a73fcb15f087c82b0d11874fd6dfec0b64" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Spark is a distributed computing framework, which is built on top of some ideas such as ClusterManager (ResourceManager), applications, tasks and etc.\n", "- Distributed computing is usually across a cluster by (1) requesting resources and (2) scheduling tasks, which can be done via a uniform interface called **ClusterManager**. Examples are [yarn](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) (hadoop, mapr, ec2) , [Mesos](http://mesos.apache.org/), [standalone](http://spark.apache.org/docs/latest/spark-standalone.html) and [local(multi-core mode on a single machine)](http://localhost:8888/notebooks/setup%20up%20spark%20with%20ipython.ipynb).\n", "- The interface between Spark framework and a ClusterManager is the `SparkContext` object in your driver program.\n", "- Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark\u2019s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.\n", "- Once connected, Spark acquires **executors** on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.\n", "- Each spark application has its own SparkContext (on scheduling side) and its own couple of executors on different nodes in its own JVM (on executor side). Data sharing among different applications can only be done via external storage system.\n", "- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).\n", "- References:\n", " - [Spark 1.1.1 Cluster Mode Overview](http://spark.apache.org/docs/latest/cluster-overview.html)\n", " - [Spark application submission manual](http://spark.apache.org/docs/latest/submitting-applications.html)\n", " - [Spark standalone cluster setup](http://spark.apache.org/docs/latest/spark-standalone.html): one master and a couple of nodes" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }