{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)\n", "# **Simple example with Spark**\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook illustrates the use of [Spark](https://spark.apache.org) in [SWAN](http://swan.web.cern.ch).\n", "\n", "The current setup allows to execute [PySpark](http://spark.apache.org/docs/latest/api/python/) operations on a local standalone Spark instance. This can be used for testing with small datasets.\n", "\n", "In the future, SWAN users will be able to attach external Spark clusters to their notebooks, so they can target bigger datasets. Moreover, a Scala Jupyter kernel will be added to use Spark from Scala as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the necessary modules" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pyspark` module is available to perform the necessary imports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pyspark import SparkContext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a `SparkContext`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `SparkContext` needs to be created before running any Spark operation. This context is linked to the local Spark instance." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sc = SparkContext()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Spark actions and transformations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use our `SparkContext` to parallelize a list." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rdd = sc.parallelize([1, 2, 4, 8])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can count the number of elements in the list." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdd.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now `map` a function over our RDD to increment each of its elements. Since `map` is a lazy transformation, we call `collect` to trigger the computation and return the results as a list." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[2, 3, 5, 9]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdd.map(lambda x: x + 1).collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also calculate the sum of all the elements with the `reduce` action." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "15" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rdd.reduce(lambda x, y: x + y)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }