# [Scalable Data Science](http://www.math.canterbury.ac.nz/~r.sainudiin/courses/ScalableDataScience/)

### prepared by [Raazesh Sainudiin](https://nz.linkedin.com/in/raazesh-sainudiin-45955845) and [Sivanand Sivaram](https://www.linkedin.com/in/sivanand)

*supported by* [Databricks](https://databricks.com/) and [AWS Educate](https://www.awseducate.com/microsite/CommunitiesEngageHome)

The [html source url](https://raw.githubusercontent.com/raazesh-sainudiin/scalable-data-science/master/db/week2/02_SparkEssentials/004_RDDsTransformationsActions.html) of this databricks notebook and its recorded Uji for **course mechanics, logistics, expectations and course-project suggestions**:

* [Workspace -> scalable-data-science -> work -> potentialProjectIdeas (relative to 'Workspace' link!)](/#workspace/scalable-data-science/work/potentialProjectIdeas)

**NOTE:** The links to other notebooks may not work in a different shard, depending on where you uploaded the 'scalable-data-science' archive! This can easily be fixed by correcting the directory containing the 'scalable-data-science' folder. Here it is assumed to be in the 'Workspace' folder.

[Watch on YouTube (0-797s)](https://www.youtube.com/v/zgkvusQdNLY?rel=0&autoplay=1&modestbranding=1&start=0&end=797)

and its remaining recorded Uji:

[Watch on YouTube (797-4537s)](https://www.youtube.com/v/zgkvusQdNLY?rel=0&autoplay=1&modestbranding=1&start=797&end=4537)

# **Introduction to Spark**
## Spark Essentials: RDDs, Transformations and Actions

* This introductory notebook describes how to get started running Spark (Scala) code in Notebooks.
* Working with Spark's Resilient Distributed Datasets (RDDs)
  * creating RDDs
  * performing basic transformations on RDDs
  * performing basic actions on RDDs

# Spark Cluster Overview:
## Driver Program, Cluster Manager and Worker Nodes

The *driver* does the following:
1. connects to a *cluster manager* to allocate resources across applications
2. acquires *executors* on cluster nodes
  * executor processes run compute tasks and cache data in memory or on disk on a *worker node*
3. sends the *application* (user program built on Spark) to the executors
4. sends *tasks* for the executors to run
  * a task is a unit of work that will be sent to one executor

See [http://spark.apache.org/docs/latest/cluster-overview.html](http://spark.apache.org/docs/latest/cluster-overview.html) for an overview of the Spark cluster. It is embedded in-place below for convenience. Scroll to the bottom to see a glossary of the terms used above and their meanings.
You can right-click inside the embedded html ``<frame>...</frame>`` and use the left and right arrows to navigate within it!

```scala
// This allows easy embedding of publicly available information into any other notebook.
// When viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
// Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA", 250))
def frameIt( u:String, h:Int ) : String = {
  """<iframe 
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
}
displayHTML(frameIt("http://spark.apache.org/docs/latest/cluster-overview.html",700))
```

## The Abstraction of Resilient Distributed Dataset (RDD)

#### RDD is a fault-tolerant collection of elements that can be operated on in parallel

#### Two types of Operations are possible on an RDD
* Transformations
* Actions

**(watch now 2:26)**:

[RDDs on YouTube](https://www.youtube.com/v/3nreQ1N7Jvk?rel=0&autoplay=1&modestbranding=1&start=1&end=146)

***

## Transformations
**(watch now 1:18)**:

[Transformations on YouTube](https://www.youtube.com/v/360UHWy052k?rel=0&autoplay=1&modestbranding=1)

***

## Actions
**(watch now 0:48)**:

[Actions on YouTube](https://www.youtube.com/v/F2G4Wbc5ZWQ?rel=0&autoplay=1&modestbranding=1&start=1&end=48)

***

**Key Points**
* Resilient distributed datasets (RDDs) are the primary abstraction in Spark.
* RDDs are immutable once created:
  * you can transform them,
  * you can perform actions on them,
  * but you cannot change an RDD once you construct it.
* Spark tracks each RDD's lineage information, or recipe, to enable its efficient recomputation if a machine fails.
* RDDs enable operations on collections of elements in parallel.
* We can construct RDDs:
  * by parallelizing Scala collections such as lists or arrays,
  * by transforming an existing RDD,
  * from files in distributed file systems such as HDFS, S3, etc.
* We can specify the number of partitions for an RDD.
* The more partitions in an RDD, the more opportunities for parallelism.
* There are **two types of operations** you can perform on an RDD:
  * **transformations** (lazily evaluated; see the sketch below):
    * map
    * flatMap
    * filter
    * distinct
    * ...
  * **actions** (actual evaluation happens):
    * count
    * reduce
    * take
    * collect
    * takeOrdered
    * ...
* Spark transformations enable us to create new RDDs from an existing RDD.
* RDD transformations are lazily evaluated (results are not computed right away).
* Spark remembers the set of transformations that are applied to a base data set (this is the lineage graph of the RDD).
* This allows Spark to automatically recover RDDs from failures and slow workers.
* The lineage graph is a recipe for creating a result, and it can be optimized before execution.
* A transformed RDD is executed only when an action runs on it.
* You can also persist, or cache, RDDs in memory or on disk (this speeds up iterative ML algorithms that transform the initial RDD iteratively).
* Here is a great reference URL for working with Spark:
  * [The latest Spark programming guide](http://spark.apache.org/docs/latest/programming-guide.html), embedded below in-place for your convenience.

```scala
displayHTML(frameIt("http://spark.apache.org/docs/latest/programming-guide.html",800))
```
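Here is a minimal sketch of the lazy-evaluation point above, assuming only the pre-initialized Spark Context `sc` of this notebook (the names `nums` and `evens` are made up for illustration):

```scala
// Transformations only record lineage; nothing is computed yet.
val nums  = sc.parallelize(1 to 1000000)         // a recipe, not a computation
val evens = nums.map(_ * 2).filter(_ % 4 == 0)   // still lazy: lineage grows
evens.cache()                                    // mark for in-memory persistence (also lazy)
println(evens.count())                           // first action: the whole pipeline runs, result is cached
println(evens.count())                           // second action: served from the cache
```

Nothing in this sketch runs until `count()` is called; this is exactly the lineage-then-action pattern the videos describe.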
## Let us get our hands dirty in Spark implementing these ideas!

#### Let us look at the legend and overview of the visual RDD Api in the following notebook:
* [in Workspace -> scalable-data-science -> xtraResources -> visualRDDApi -> guide (relative to 'Workspace' link!)](/#workspace/scalable-data-science/xtraResources/visualRDDApi/guide)

**NOTE:** The links to other notebooks may not work in a different shard, depending on where you uploaded the 'scalable-data-science' archive! This can easily be fixed by correcting the directory containing the 'scalable-data-science' folder. Here it is assumed to be in the 'Workspace' folder. This NOTE won't keep reappearing :)

### Running **Spark**
The variable **sc** allows you to access a Spark Context to run your Spark programs.
Recall that ``SparkContext`` is in the Driver Program.

**NOTE: Do not create the *sc* variable - it is already initialized for you.**

### We will do the following next:
1. Create an RDD using `sc.parallelize`
2. Perform the `collect` action on the RDD and find the number of partitions it is made of using the `getNumPartitions` action
3. Perform the ``take`` action on the RDD
4. Transform the RDD by ``map`` to make another RDD
5. Transform the RDD by ``filter`` to make another RDD
6. Perform the ``reduce`` action on the RDD
7. Transform the RDD by ``flatMap`` to make another RDD
8. Create a Pair RDD
9. Perform some transformations on a Pair RDD
10. Where in the cluster is your computation running?
11. Shipping Closures, Broadcast Variables and Accumulator Variables
12. Spark Essentials: Summary
13. HOMEWORK
### 1. Create an RDD using `sc.parallelize`

First, let us create an RDD of three elements (of integer type ``Int``) from a Scala ``Seq`` (or ``List`` or ``Array``) with two partitions by using the ``parallelize`` method of the available Spark Context ``sc`` as follows:

```scala
val x = sc.parallelize(Array(1, 2, 3), 2)    // <Ctrl+Enter> to evaluate this cell (using 2 partitions)
```

    x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[556] at parallelize at <console>:34

```scala
x.    // place the cursor after 'x.' and hit Tab to see the methods available for the RDD x we created
```
### 2. Perform the `collect` action on the RDD and find the number of partitions it is made of using the `getNumPartitions` action

No action has been taken by ``sc.parallelize`` above. To see what is "cooked" by the recipe for RDD ``x`` we need to take an action.

The simplest is the ``collect`` action, which returns all of the elements of the RDD as an ``Array`` to the driver program and displays it.

*So you have to make sure that all of that data will fit in the driver program if you call the ``collect`` action!*

#### Let us look at the [collect action in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/actions/collect) and return here to try out the example codes.

Let us perform a `collect` action on RDD `x` as follows:

```scala
x.collect()    // <Ctrl+Enter> to collect (action) elements of rdd; should be (1, 2, 3)
```

    res0: Array[Int] = Array(1, 2, 3)

*CAUTION:* ``collect`` can crash the driver when called on an RDD with a massive number of elements.
So it is better to use other displaying actions like ``take`` or ``takeOrdered``, as follows:
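For instance, here is a sketch of driver-friendly actions that return only a few elements (the RDD `big` is a made-up name; `take`, `takeOrdered` and `top` are standard RDD actions):

```scala
val big = sc.parallelize(1 to 100000)
big.take(3)          // an Array of the first 3 elements: Array(1, 2, 3)
big.takeOrdered(3)   // the 3 smallest elements in ascending order: Array(1, 2, 3)
big.top(3)           // the 3 largest elements in descending order: Array(100000, 99999, 99998)
```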
#### Let us look at the [getNumPartitions action in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/actions/getNumPartitions) and return here to try out the example codes.

```scala
// <Ctrl+Enter> to evaluate this cell and find the number of partitions in RDD x
x.getNumPartitions
```

    res1: Int = 2

We can see which elements of the RDD are in which partition by calling `glom()` before `collect()`.

`glom()` flattens elements of the same partition into an `Array`.

```scala
x.glom().collect()    // glom() flattens elements on the same partition
```

    res2: Array[Array[Int]] = Array(Array(1), Array(2, 3))
Thus, from the output above, `Array[Array[Int]] = Array(Array(1), Array(2, 3))`, we know that `1` is in one partition while `2` and `3` are in another partition.

##### You Try!
Create an RDD `x` with three elements, 1, 2, 3, and this time do not specify the number of partitions. Then the default number of partitions will be used.
Find out what this is for the cluster you are attached to.

The default number of partitions for an RDD depends, among other things, on the cluster this notebook is attached to - see the [programming-guide](http://spark.apache.org/docs/latest/programming-guide.html).

```scala
val x = sc.parallelize(Seq(1, 2, 3))    // <Shift+Enter> to evaluate this cell (using default number of partitions)
```

```scala
x.getNumPartitions    // <Shift+Enter> to evaluate this cell
```

```scala
x.glom().collect()    // <Ctrl+Enter> to evaluate this cell
```
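If you want to check your answer programmatically, one way (a sketch, not part of the original exercise) is to ask `sc` directly: `sc.defaultParallelism` is the default that `parallelize` typically uses when no partition count is given.

```scala
sc.defaultParallelism                            // the cluster's default parallelism
sc.parallelize(Seq(1, 2, 3)).getNumPartitions    // typically matches the value above
```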
### 3. Perform the `take` action on the RDD

The ``take(n)`` action returns an array with the first ``n`` elements of the RDD.

```scala
x.take(2)    // Ctrl+Enter to take two elements from the RDD x
```

    res5: Array[Int] = Array(1, 2)

##### You Try!
Fill in the parentheses `( )` below in order to `take` just one element from RDD `x`.

```scala
x.take( )    // fill in the parentheses to take just one element from RDD x and Ctrl+Enter
```

***

### 4. Transform the RDD by ``map`` to make another RDD

The ``map`` transformation returns a new RDD that's formed by passing each element of the source RDD through a function (closure). The closure is automatically passed on to the workers for evaluation (when an action is called later).

#### Let us look at the [map transformation in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/transformations/map) and return here to try out the example codes.

```scala
// Shift+Enter to make RDD x and RDD y that is mapped from x
val x = sc.parallelize(Array("b", "a", "c"))    // make RDD x: [b, a, c]
val y = x.map(z => (z, 1))                      // map x into RDD y: [(b, 1), (a, 1), (c, 1)]
```
```scala
// Ctrl+Enter to collect and print the two RDDs
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
```

    b, a, c
    (b,1), (a,1), (c,1)
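Since the closure is shipped to the workers, a named function defined on the driver works just as well as an inline lambda. A small sketch (the helper `tag` is hypothetical, not from the original notebook):

```scala
def tag(s: String): (String, Int) = (s, 1)    // defined on the driver
val y2 = x.map(tag)                           // the closure is serialized and shipped to executors
y2.collect()                                  // Array((b,1), (a,1), (c,1)), same as y above
```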
Transform the RDD by ``filter`` to make another RDD\n\nThe ``filter`` transformation returns a new RDD that's formed by selecting those elements of the source RDD on which the function returns ``true``.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"startTime":0.0,"submitTime":1.456369867386E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"3d6c791c-4495-4ff7-9cf0-ef138f4aa8e7"},{"version":"CommandV1","origId":8774,"guid":"fabfe881-3ade-4370-9bc0-a6504ed0548a","subtype":"command","commandType":"auto","position":2.748291015625,"command":"%md\n#### Let us look at the [filter transformation in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/transformations/filter) and return here to try out the example codes.\n\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"startTime":0.0,"submitTime":1.456369867426E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"6536d958-75ee-4826-a881-86c613abbcb0"},{"version":"CommandV1","origId":8776,"guid":"02031b4e-1374-4d6d-b1a0-dd405fdaa837","subtype":"command","commandType":"auto","position":2.74835205078125,"command":"//Shift+Enter to make RDD x and filter it by (n => n%2 == 1) to make RDD y\nval x = sc.parallelize(Array(1,2,3))\n// the closure (n => n%2 == 1) in the filter will \n// return True if element n in RDD x has remainder 1 when divided by 2 (i.e., if n is odd)\nval y = x.filter(n => n%2 == 1) ","commandVersion":0,"state":"finished","results":{"type":"html","data":"<div class=\"ansiout\">x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[768] at parallelize at <console>:37\ny: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[769] at filter at <console>:40\n</div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":null,"startTime":1.456978967413E12,"submitTime":1.456978943688E12,"finishTime":1.456978967555E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"5e9e0566-ca72-4bc5-8768-0055c2f887f0"},{"version":"CommandV1","origId":8779,"guid":"5899ac25-4f02-4f58-afea-b66bad297a1c","subtype":"command","commandType":"auto","position":2.748382568359375,"command":"// Cntrl+Enter to 
```scala
// Cntrl+Enter to collect and print the two RDDs
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
```
```
1, 2, 3
1, 3
```

***

### 6. Perform the `reduce` action on the RDD

`reduce` aggregates the elements of a dataset using a function (closure).
This function takes two arguments and returns one, and can often be seen as a binary operator.
This operator has to be commutative and associative so that it can be computed correctly in parallel (where we have little control over the order of the operations!).

### Let us look at the [reduce action in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/actions/reduce) and return here to try out the example codes.

```scala
// Shift+Enter to make RDD x of integers 1,2,3,4 and reduce it to their sum
val x = sc.parallelize(Array(1, 2, 3, 4))
val y = x.reduce((a, b) => a + b)
```
class=\"ansiout\">x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[808] at parallelize at <console>:37\ny: Int = 10\n</div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":null,"startTime":1.456979104661E12,"submitTime":1.45697908093E12,"finishTime":1.456979104824E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"bb7df362-19b8-45c8-9d54-744623ef944f"},{"version":"CommandV1","origId":8804,"guid":"361888cb-82a6-442f-b2b4-3ef063db021e","subtype":"command","commandType":"auto","position":2.77734375,"command":"//Cntrl+Enter to collect and print RDD x and the Int y, sum of x\nprintln(x.collect.mkString(\", \"))\nprintln(y)","commandVersion":0,"state":"finished","results":{"type":"html","data":"<div class=\"ansiout\">1, 2, 3, 4\n10\n</div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":null,"startTime":1.456979114231E12,"submitTime":1.456979090459E12,"finishTime":1.456979114327E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"a2a1b18d-d0e2-42d3-b0c7-a56739768505"},{"version":"CommandV1","origId":5370,"guid":"6e44fdb9-7398-4799-a97a-acd5e2e47e4c","subtype":"command","commandType":"auto","position":2.994140625,"command":"%md\n### 7. Transform an RDD by ``flatMap`` to make another RDD\n\n``flatMap`` is similar to ``map`` but each element from input RDD can be mapped to zero or more output elements. 
### 7. Transform an RDD by `flatMap` to make another RDD

`flatMap` is similar to `map`, but each element from the input RDD can be mapped to zero or more output elements.
Therefore your function should return a sequential collection such as an `Array` rather than a single element, as shown below.

### Let us look at the [flatMap transformation in detail](/#workspace/scalable-data-science/xtraResources/visualRDDApi/recall/transformations/flatMap) and return here to try out the example codes.

```scala
// Shift+Enter to make RDD x and flatMap it into RDD by closure (n => Array(n, n*100, 42))
val x = sc.parallelize(Array(1, 2, 3))
val y = x.flatMap(n => Array(n, n*100, 42))
```
```
x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[844] at parallelize at <console>:37
y: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[845] at flatMap at <console>:38
```

```scala
// Cntrl+Enter to collect and print RDDs x and y
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
```
class=\"ansiout\">1, 2, 3\n1, 100, 42, 2, 200, 42, 3, 300, 42\n</div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":null,"startTime":1.456979240991E12,"submitTime":1.456979216787E12,"finishTime":1.456979241165E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"892c5ffd-0f14-4813-9aea-876933c555ae"},{"version":"CommandV1","origId":5378,"guid":"8aa7681c-59bf-4415-8e49-3dba3b608bb6","subtype":"command","commandType":"auto","position":2.999755859375,"command":"%md\n### 8. Create a Pair RDD\n\nLet's next work with RDD of ``(key,value)`` pairs called a *Pair RDD* or *Key-Value RDD*.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"startTime":0.0,"submitTime":1.456369867782E12,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"c5e96697-360a-49b6-bae7-a2c4df8cd3fd"},{"version":"CommandV1","origId":5379,"guid":"d4d08eea-b1fb-4164-8ac9-eda24543c3a3","subtype":"command","commandType":"auto","position":3.0,"command":"// Cntrl+Enter to make RDD words and display it by collect\nval words = sc.parallelize(Array(\"a\", \"b\", \"a\", \"a\", \"b\", \"b\", \"a\", \"a\", \"a\", \"b\", \"b\"))\nwords.collect()","commandVersion":0,"state":"finished","results":{"type":"html","data":"<div class=\"ansiout\">words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[859] at parallelize at <console>:35\nres13: Array[String] = Array(a, b, a, a, b, b, a, a, a, b, b)\n</div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":"<div class=\"ansiout\"><console>:12: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, Int)]\n val wordcounts = words.map(s => (s, 1)).reduceByKey(_ + _).collect()\n ^\n</div>","startTime":1.456979296284E12,"submitTime":1.456979272551E12,"finishTime":1.45697929641E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":null,"commentThread":[],"commentsVisible":false,"parentHierarchy":null,"diffInserts":null,"diffDeletes":null,"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":null,"showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"72fa4658-f513-4b98-bc36-45377c409e6d"},{"version":"CommandV1","origId":34036,"guid":"f02c9c50-52be-484d-b41c-7a814de3326a","subtype":"command","commandType":"auto","position":3.46875,"command":"%md\nLet's make a Pair RDD called `wordCountPairRDD` that is made of (key,value) pairs with key=word and value=1 in order to encode each 
### 8. Create a Pair RDD

Let's next work with an RDD of `(key,value)` pairs called a *Pair RDD* or *Key-Value RDD*.

```scala
// Cntrl+Enter to make RDD words and display it by collect
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
words.collect()
```
```
words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[859] at parallelize at <console>:35
res13: Array[String] = Array(a, b, a, a, b, b, a, a, a, b, b)
```

Let's make a Pair RDD called `wordCountPairRDD` that is made of (key,value) pairs with key=word and value=1, in order to encode each occurrence of each word in the RDD `words`, as follows:

```scala
// Cntrl+Enter to make and collect Pair RDD wordCountPairRDD
val wordCountPairRDD = words.map(s => (s, 1))
wordCountPairRDD.collect()
```
```
wordCountPairRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[867] at map at <console>:37
res14: Array[(String, Int)] = Array((a,1), (b,1), (a,1), (a,1), (b,1), (b,1), (a,1), (a,1), (a,1), (b,1), (b,1))
```
### 9. Perform some transformations on a Pair RDD

Some of the key-value transformations that we could perform on a Pair RDD include the following:
* **`reduceByKey` transformation**
  * takes an RDD of key-value pairs and returns a new RDD of key-value pairs, such that:
  * the values for each key are aggregated using the given reduce function,
  * and the reduce function has to be of the type that takes two values and returns one value.
* **`sortByKey` transformation**
  * this returns a new RDD of key-value pairs that's sorted by keys in ascending order.
* **`groupByKey` transformation**
  * this returns a new RDD consisting of key and iterable-valued pairs.

Let's see some concrete examples next.

```scala
// Cntrl+Enter to reduceByKey and collect wordcounts RDD
//val wordcounts = wordCountPairRDD.reduceByKey( _ + _ )
val wordcounts = wordCountPairRDD.reduceByKey( (v1,v2) => v1+v2 )
wordcounts.collect()
```
```
wordcounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[908] at reduceByKey at <console>:42
res16: Array[(String, Int)] = Array((a,6), (b,5))
```

Now, let us do just the crucial steps and avoid collecting intermediate RDDs (something we should avoid for large datasets anyway, as they may not fit in the driver program).

```scala
// Cntrl+Enter to make words RDD and do the word count in two lines
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
val wordcounts = words.map(s => (s, 1)).reduceByKey(_ + _).collect()
```
```
words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[914] at parallelize at <console>:37
wordcounts: Array[(String, Int)] = Array((a,6), (b,5))
```

##### You Try!
Try evaluating `sortByKey()`, which will make a new RDD that consists of the elements of the original pair RDD sorted by keys.
```scala
// Shift+Enter and comprehend code
val words = sc.parallelize(Array("a", "b", "a", "a", "b", "b", "a", "a", "a", "b", "b"))
val wordCountPairRDD = words.map(s => (s, 1))
val wordCountPairRDDSortedByKey = wordCountPairRDD.sortByKey()
```

```scala
wordCountPairRDD.collect() // Shift+Enter and comprehend code
```

```scala
wordCountPairRDDSortedByKey.collect() // Cntrl+Enter and comprehend code
```
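As a small variation (a hypothetical cell, not part of the original notebook), `sortByKey` also accepts a boolean `ascending` argument, so you can sort the keys in descending order instead:

```scala
// Pass ascending = false to sort the keys in descending order (b's before a's here).
val wordCountPairRDDSortedByKeyDesc = wordCountPairRDD.sortByKey(false)
wordCountPairRDDSortedByKeyDesc.collect()
```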
class=\"ansiout\"></div>","arguments":{},"addedWidgets":{},"removedWidgets":[]},"errorSummary":null,"error":null,"startTime":1.457043501967E12,"submitTime":1.457043455854E12,"finishTime":1.457043502009E12,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"r.sainudiin@math.canterbury.ac.nz","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"3724d8e8-6cc4-438b-b01c-f8ab19afb0ac"},{"version":"CommandV1","origId":35481,"guid":"f54a9616-fc8c-4ee0-a1eb-2777492c6d9b","subtype":"command","commandType":"auto","position":10.08984375,"command":"%md\n\nThe next key value transformation we will see is `groupByKey`\n\nWhen we apply the `groupByKey` transformation to `wordCountPairRDD` we end up with a new RDD that contains two elements.\nThe first element is the tuple `b` and an iterable `CompactBuffer(1,1,1,1,1)` obtained by grouping the value `1` for each of the five key value pairs `(b,1)`.\nSimilarly the second element is the key `a` and an iterable `CompactBuffer(1,1,1,1,1,1)` obtained by grouping the value `1` for each of the six key value pairs `(a,1)`.\n\n*CAUTION*: `groupByKey` can cause a large amount of data movement across the network.\nIt also can create very large iterables at a worker.\nImagine you have an RDD where you have 1 billion pairs that have the key `a`.\nAll of the values will have to fit in a single worker if you use group by key.\nSo instead of a group by key, consider using reduced by key.","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"c81438da-16ef-47dd-b32e-5a49f84f6415"},{"version":"CommandV1","origId":35480,"guid":"9dd26e14-bb3c-4d66-a267-63ea871d282c","subtype":"command","commandType":"auto","position":10.1484375,"command":"%md\n","commandVersion":0,"state":"finished","results":null,"errorSummary":null,"error":null,"startTime":0.0,"submitTime":0.0,"finishTime":0.0,"collapsed":false,"bindings":{},"inputWidgets":{},"displayType":"table","width":"auto","height":"auto","xColumns":null,"yColumns":null,"pivotColumns":null,"pivotAggregation":null,"customPlotOptions":{},"commentThread":[],"commentsVisible":false,"parentHierarchy":[],"diffInserts":[],"diffDeletes":[],"globalVars":{},"latestUser":"","commandTitle":"","showCommandTitle":false,"hideCommandCode":false,"hideCommandResult":false,"iPythonMetadata":null,"nuid":"89367975-7cc7-46c8-be7c-2cef5577ead5"},{"version":"CommandV1","origId":35478,"guid":"57cb0ba0-bdce-429b-afae-3fd9642e0250","subtype":"command","commandType":"auto","position":10.265625,"command":"val wordCountPairRDDGroupByKey = wordCountPairRDD.groupByKey() // <Shift+Enter> CAUTION: this transformation can be very 
```scala
val wordCountPairRDDGroupByKey = wordCountPairRDD.groupByKey() // <Shift+Enter> CAUTION: this transformation can be very wide!
```
```
wordCountPairRDDGroupByKey: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[984] at groupByKey at <console>:38
```

```scala
wordCountPairRDDGroupByKey.collect() // Cntrl+Enter
```
```
res19: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 1, 1, 1, 1, 1)), (b,CompactBuffer(1, 1, 1, 1, 1)))
```
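To make the caution above concrete, here is a minimal sketch (a hypothetical cell, not part of the original notebook) of the same word count done both ways: `reduceByKey` pre-aggregates values within each partition before the shuffle, while `groupByKey` ships every single `1` across the network first.

```scala
// Both yield the same counts, but with very different shuffle behavior.
// groupByKey: every (word, 1) pair crosses the network, then is summed per key.
val countsViaGroupByKey = wordCountPairRDD.groupByKey().mapValues(ones => ones.sum)
// reduceByKey: pairs are combined map-side first, so far less data is shuffled.
val countsViaReduceByKey = wordCountPairRDD.reduceByKey(_ + _)
println(countsViaGroupByKey.collect().mkString(", "))  // (a,6), (b,5)
println(countsViaReduceByKey.collect().mkString(", ")) // (a,6), (b,5)
```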
### 10. Where in the cluster is your computation running?

```scala
val list = 1 to 10
var sum = 0
list.foreach(x => sum = sum + x)
print(sum)
```
```
55list: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sum: Int = 55
```

```scala
val rdd = sc.parallelize(1 to 10)
var sum = 0
rdd.foreach(x => sum = sum + x)
rdd.collect
print(sum)
```
```
0rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1034] at parallelize at <console>:40
sum: Int = 0
```

Note that `sum` is still `0` after the second cell: the `foreach` ran on the workers, each of which mutated its own copy of `sum` inside the shipped closure, so the driver's `sum` was never updated.
### 11. Shipping Closures, Broadcast Variables and Accumulator Variables

#### Closures, Broadcast and Accumulator Variables
**(watch now 2:06)**:

[watch the video](https://www.youtube.com/v/I9Zcr4R35Ao?rel=0&autoplay=1&modestbranding=1)

We will use these variables in the sequel.

#### SUMMARY
Spark automatically creates closures:
* for functions that run on RDDs at workers,
* and for any global variables that are used by those workers;
* one closure per worker is sent with every task;
* there's no communication between workers;
* closures are one-way: from the driver to the worker;
* any changes that you make to the global variables at the workers
  * are not sent to the driver, and
  * are not sent to other workers.

The problem is that these closures are automatically created and sent or re-sent with every job:
* with a large global variable it gets inefficient to send/resend lots of data to each worker,
* and we cannot communicate anything from the workers back to the driver.

To address this, Spark provides two types of shared variables:
* **broadcast variables**
  * let us efficiently send large read-only values to all of the workers;
  * these are saved at the workers for use in one or more Spark operations.
* **accumulator variables**
  * allow us to aggregate values from workers back to the driver;
  * only the driver can access the value of the accumulator;
  * for the tasks, the accumulators are basically write-only.

A minimal sketch of both follows.
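Here is that sketch, using the Spark 1.x API this notebook targets; the names `lookup`, `mapped` and `acc` are hypothetical, not from the original notebook:

```scala
// Broadcast: ship a read-only lookup table to every worker once,
// instead of re-sending it inside each task's closure.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Array("a", "b", "a")).map(s => lookup.value(s))
println(mapped.collect().mkString(", ")) // 1, 2, 1

// Accumulator: workers add to it; only the driver can read its value.
val acc = sc.accumulator(0)
sc.parallelize(1 to 10).foreach(x => acc += x)
println(acc.value) // 55, unlike the plain var sum in section 10, which stayed 0
```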
***

### 12. Spark Essentials: Summary
**(watch now: 0:29)**

[watch the video](https://www.youtube.com/v/F50Vty9Ia8Y?rel=0&autoplay=1&modestbranding=1)

*NOTE:* In the databricks cluster, we (the course coordinator/administrators) set the number of workers for you.

### 13. HOMEWORK
See the notebook in this folder named `005_RDDsTransformationsActionsHOMEWORK`.
This notebook will give you more examples of the operations above, as well as others we will be using later, including:
* Perform the `takeOrdered` action on the RDD
* Transform the RDD by `distinct` to make another RDD, and
* Do a bunch of transformations to our RDD and perform an action in a single cell.
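If you want a quick preview, here is a minimal sketch of the first two homework operations (a hypothetical cell, not from the homework notebook):

```scala
// takeOrdered(n) is an action returning the n smallest elements by natural ordering.
val nums = sc.parallelize(Array(1, 2, 2, 3, 3, 3, 5, 4))
println(nums.takeOrdered(3).mkString(", "))              // 1, 2, 2
// distinct is a transformation that removes duplicates (note: it shuffles).
println(nums.distinct().collect().sorted.mkString(", ")) // 1, 2, 3, 4, 5
```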
***
***
### **Importing Standard Scala and Java libraries**
* For other libraries that are not available by default, you can upload other libraries to the Workspace.
* Refer to the **[Libraries](/#workspace/databricks_guide/02 Product Overview/07 Libraries)** guide for more details.

```scala
import scala.math._
val x = min(1, 10)
```
```
import scala.math._
x: Int = 1
```

```scala
import java.util.HashMap
val map = new HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
map.put("d", 4)
map.put("e", 5)
```
```
import java.util.HashMap
map: java.util.HashMap[String,Int] = {a=1, b=2, c=3, d=4, e=5}
res9: Int = 0
```

***

# [Scalable Data Science](http://www.math.canterbury.ac.nz/~r.sainudiin/courses/ScalableDataScience/)

### prepared by [Raazesh Sainudiin](https://nz.linkedin.com/in/raazesh-sainudiin-45955845) and [Sivanand Sivaram](https://www.linkedin.com/in/sivanand)

*supported by* [Databricks](https://databricks.com/) and [AWS Educate](https://www.awseducate.com/microsite/CommunitiesEngageHome)