{"paragraphs":[{"title":"1 - Markdown","text":"%md \n# Apache Zeppelin + Apache Spark + Apache Cassandra\n\nThis is a demonstration showing how to use the [Apache Zeppelin notebook](https://zeppelin.incubator.apache.org/) to interact with the [Apache Cassandra](http://cassandra.apache.org/) NoSQL database through [Apache Spark](http://spark.apache.org/) or directly through the Cassandra Query Language (CQL).\n\n*Please note this is an unofficial demo and tutorial.*\n\n### Apache Spark\n\nApache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.\nMore details can be found at [http://spark.apache.org](http://spark.apache.org/).\nIt is used here for processing Cassandra data (ETL, transformations, analytics, ...).\n\n### DataStax Spark Cassandra Connector\n\n[DataStax](http://www.datatstax.com) has developed a Spark Cassandra Connector to read and write Cassandra data through the Spark API. \nThe Spark Cassandra Connector lets you expose Cassandra tables as Spark RDDs (or DataFrames), write Spark RDDs (or DataFrames) to Cassandra tables, and execute arbitrary SQL queries from your Spark applications.\n\nUseful links:\n* [The Spark Cassandra Connector GitHub repository](https://github.com/datastax/spark-cassandra-connector)\n* [Getting started with Apache Spark and Cassandra](https://academy.datastax.com/fr/demos/getting-started-apache-spark-and-cassandra)\n* [Free training on DataStax Enterprise Analytics with Apache Spark](https://academy.datastax.com/fr/courses/getting-started-apache-spark)\n\n### CQL Language\n\nThe Cassandra Query Language (CQL) is the primary language for communicating with the Cassandra database.\nDocumentation on CQL usage:\n* [Introduction to CQL](http://docs.datastax.com/en/cql/3.3/cql/cqlIntro.html)\n* [Using CQL](https://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html)\n\nThe Cassandra CQL Interpreter for Apache Zeppelin was written by my colleague Duy 
Hai Doan [@doanduyhai](https://twitter.com/doanduyhai)\n[CQL Interpreter documentation for Apache Zeppelin 0.5.5](https://zeppelin.incubator.apache.org/docs/0.5.5-incubating/interpreter/cassandra.html)\n\n\n","dateUpdated":"Jan 25, 2016 3:18:27 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","tableHide":false,"editorHide":true,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705252373_-553593021","id":"20160113-181412_1813689583","dateCreated":"Jan 13, 2016 6:14:12 PM","dateStarted":"Jan 25, 2016 1:00:39 PM","dateFinished":"Jan 25, 2016 1:00:41 PM","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:21","errorMessage":"","focus":true},{"title":"2 - Shell command example","text":"%sh pwd\nls -l","dateUpdated":"Jan 25, 2016 12:50:46 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/sh","title":true,"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705252373_-553593021","id":"20160113-181412_1623555306","dateCreated":"Jan 13, 2016 6:14:12 PM","dateStarted":"Jan 23, 2016 7:05:17 PM","dateFinished":"Jan 23, 2016 7:05:17 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:22"},{"title":"3 - Add Spark library to read CSV files","text":"%dep\n\nz.reset()\n\n// Add spark-csv package\n// Versions and documentation on https://github.com/databricks/spark-csv\nz.load(\"com.databricks:spark-csv_2.11:1.3.0\")","dateUpdated":"Jan 25, 2016 12:50:47 
PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705252374_-552438775","id":"20160113-181412_342563478","dateCreated":"Jan 13, 2016 6:14:12 PM","dateStarted":"Jan 23, 2016 7:09:25 PM","dateFinished":"Jan 23, 2016 7:09:30 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:23"},{"title":"4 - Download demo CSV file","text":"%sh\n\n#wget https://raw.githubusercontent.com/victorcouste/zeppelin-spark-cassandra-demo/master/albums.csv\n\n# Or download the demo CSV file directly from https://raw.githubusercontent.com/victorcouste/zeppelin-spark-cassandra-demo/master/albums.csv","dateUpdated":"Jan 25, 2016 12:50:49 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/sh","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453564554096_633894927","id":"20160123-165554_1618215392","dateCreated":"Jan 23, 2016 4:55:54 PM","dateStarted":"Jan 23, 2016 5:38:26 PM","dateFinished":"Jan 23, 2016 5:38:26 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:24"},{"title":"4 - Spark : load CSV file in DataFrame","text":"%spark\n\n//Display Spark version used\nprintln(\"Spark version:\"+sc.version)\n\nval df_albums = sqlContext.read\n.format(\"com.databricks.spark.csv\")\n.option(\"header\", \"true\")\n.load(\"albums.csv\")\n.cache\n\n//If you want to store albums.csv in a specific folder run\n//val df_albums = sqlContext.read\n//.format(\"com.databricks.spark.csv\")\n//.option(\"header\", \"true\")\n//.load(\"/your_path/albums.csv\")\n//.cache","dateUpdated":"Jan 25, 2016 12:50:50 
PM","config":{"colWidth":12,"editorMode":"ace/mode/scala","graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705252374_-552438775","id":"20160113-181412_829518708","dateCreated":"Jan 13, 2016 6:14:12 PM","dateStarted":"Jan 23, 2016 7:09:46 PM","dateFinished":"Jan 23, 2016 7:09:59 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:25"},{"title":"5 - Spark : Print Schema, Show, Filter, GroupBy dataFrame","text":"df_albums.printSchema()\ndf_albums.show()\ndf_albums.filter(\"year > 2000\").show()\ndf_albums.groupBy(\"year\").count().show()","dateUpdated":"Jan 25, 2016 12:50:51 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452791373699_2004306939","id":"20160114-180933_599744065","dateCreated":"Jan 14, 2016 6:09:33 PM","dateStarted":"Jan 23, 2016 7:10:19 PM","dateFinished":"Jan 23, 2016 7:10:22 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:26"},{"title":"6 - SparkSQL on DataFrame 1","text":"df_albums.registerTempTable(\"spark_albums_table\")\n\nsqlContext.sql(\"SELECT * FROM spark_albums_table\").show","dateUpdated":"Jan 25, 2016 12:50:52 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452727401154_-925697979","id":"20160114-002321_1001583714","dateCreated":"Jan 14, 2016 12:23:21 AM","dateStarted":"Jan 23, 2016 7:10:30 PM","dateFinished":"Jan 23, 2016 7:10:31 
PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:27"},{"title":"7 - SparkSQL On DataFrame 2","text":"%sql\nSELECT country,count(*) as nb FROM spark_albums_table group by country","dateUpdated":"Jan 25, 2016 12:50:53 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[{"name":"nb","index":1,"aggr":"sum"}],"groups":[],"scatter":{"yAxis":{"name":"nb","index":1,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/sql","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452727254054_1813162519","id":"20160114-002054_842505952","dateCreated":"Jan 14, 2016 12:20:54 AM","dateStarted":"Jan 23, 2016 7:10:35 PM","dateFinished":"Jan 23, 2016 7:10:36 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:28"},{"title":"8 - Cassandra CQL : Create keyspace, help and describe keyspaces","text":"%cassandra\n\nCREATE KEYSPACE IF NOT EXISTS ks_music \nWITH replication = {\n\t'class' : 'SimpleStrategy',\n\t'replication_factor' : 1\n};\n\n//help;\n\n//describe keyspaces;\n\n//describe keyspace ks_music;","dateUpdated":"Jan 25, 2016 12:50:55 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453223929648_-1331254783","id":"20160119-181849_56311459","dateCreated":"Jan 19, 2016 6:18:49 PM","dateStarted":"Jan 23, 2016 7:10:45 PM","dateFinished":"Jan 23, 2016 7:10:48 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:29"},{"title":"9 - Cassandra CQL : Tables creation","text":"%cassandra\n\nuse ks_music;\n\nDROP TABLE IF EXISTS albums;\n\nCREATE TABLE IF NOT EXISTS albums ( \n\tartist text,\n\talbum text,\n\tyear text,\n\tcountry text,\n\tquality text,\n\tstatus text,\n\tPRIMARY 
KEY (album) \n);\n\nDROP TABLE IF EXISTS nbalbums_by_year;\n\nCREATE TABLE IF NOT EXISTS nbalbums_by_year ( \n\tyear text,\n nbalbums int,\n\tPRIMARY KEY (year)\n);\n\ndescribe table albums;","dateUpdated":"Jan 25, 2016 12:50:56 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705252374_-552438775","id":"20160113-181412_1134300065","dateCreated":"Jan 13, 2016 6:14:12 PM","dateStarted":"Jan 23, 2016 7:10:58 PM","dateFinished":"Jan 23, 2016 7:11:04 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:30"},{"title":"10 - Cassandra CQL","text":"%cassandra\nselect * from ks_music.albums limit 10;","dateUpdated":"Jan 25, 2016 12:50:57 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453566459682_214434306","id":"20160123-172739_1656103366","dateCreated":"Jan 23, 2016 5:27:39 PM","dateStarted":"Jan 23, 2016 7:11:22 PM","dateFinished":"Jan 23, 2016 7:11:22 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:31"},{"title":"11 - Write Spark DataFrame in Cassandra 1","text":"df_albums.write\n.format(\"org.apache.spark.sql.cassandra\")\n.option(\"header\",\"false\")\n.mode(\"append\")\n.options(Map( \"table\" -> \"albums\", \"keyspace\" -> \"ks_music\"))\n.save()","dateUpdated":"Jan 25, 2016 12:50:58 
PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452726691942_184908463","id":"20160114-001131_987960956","dateCreated":"Jan 14, 2016 12:11:31 AM","dateStarted":"Jan 23, 2016 7:11:27 PM","dateFinished":"Jan 23, 2016 7:11:33 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:32"},{"title":"12 - CQL Cassandra query","text":"%cassandra\nselect * from ks_music.albums limit 10;","dateUpdated":"Jan 25, 2016 12:51:00 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"album","index":0,"aggr":"sum"}],"values":[{"name":"artist","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"album","index":0,"aggr":"sum"},"yAxis":{"name":"artist","index":1,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1453566108533_-1432582146","id":"20160123-172148_256698728","dateCreated":"Jan 23, 2016 5:21:48 PM","dateStarted":"Jan 23, 2016 7:11:37 PM","dateFinished":"Jan 23, 2016 7:11:37 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:33"},{"title":"13 - Write Spark DataFrame In Cassandra 2","text":"\nval df_nbalbyms_by_year = sqlContext.sql(\"SELECT year,count(*) as nbalbums FROM spark_albums_table group by year\")\n\ndf_nbalbyms_by_year.show\n\ndf_nbalbyms_by_year.write\n.format(\"org.apache.spark.sql.cassandra\")\n.option(\"header\",\"false\")\n.mode(\"overwrite\")\n.options(Map( \"table\" -> \"nbalbums_by_year\", \"keyspace\" -> \"ks_music\"))\n.save()","dateUpdated":"Jan 25, 2016 12:51:02 
PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452726411317_-251398958","id":"20160114-000651_1120689629","dateCreated":"Jan 14, 2016 12:06:51 AM","dateStarted":"Jan 23, 2016 7:11:47 PM","dateFinished":"Jan 23, 2016 7:11:48 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:34"},{"title":"14 - Cassandra CQL query","text":"%cassandra\nselect * from ks_music.nbalbums_by_year;","dateUpdated":"Jan 25, 2016 12:51:04 PM","config":{"colWidth":12,"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"year","index":0,"aggr":"sum"}],"values":[{"name":"nbalbums","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"year","index":0,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452726719797_-158641656","id":"20160114-001159_1878746586","dateCreated":"Jan 14, 2016 12:11:59 AM","dateStarted":"Jan 23, 2016 7:15:01 PM","dateFinished":"Jan 23, 2016 7:15:01 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:35"},{"title":"15 - Spark DataFrame from Cassandra Table","text":"val df_albums_cassandra = sqlContext\n .read\n .format(\"org.apache.spark.sql.cassandra\")\n .options(Map( \"table\" -> \"albums\", \"keyspace\" -> \"ks_music\" ))\n .load()\n \ndf_albums_cassandra.show\n","dateUpdated":"Jan 25, 2016 12:51:04 
PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452705935779_1078727104","id":"20160113-182535_1195558506","dateCreated":"Jan 13, 2016 6:25:35 PM","dateStarted":"Jan 23, 2016 7:12:15 PM","dateFinished":"Jan 23, 2016 7:12:15 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:36"},{"title":"16 - SparkSQL table from Cassandra table","text":"%sql\nCREATE TEMPORARY TABLE table_albums\nUSING org.apache.spark.sql.cassandra\nOPTIONS ( cluster \"Test Cluster\", keyspace \"ks_music\", table \"albums\", pushdown \"true\")","dateUpdated":"Jan 25, 2016 12:51:06 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"null","index":0,"aggr":"sum"}],"values":[],"groups":[],"scatter":{"xAxis":{"name":"null","index":0,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/sql","title":true,"editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452727507469_-1200238483","id":"20160114-002507_1756327366","dateCreated":"Jan 14, 2016 12:25:07 AM","dateStarted":"Jan 23, 2016 7:12:20 PM","dateFinished":"Jan 23, 2016 7:12:21 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:37"},{"title":"17 - SparkSQL query on Cassandra Data","text":"%sql\nselect country, count(*) as count from table_albums group by country having count>${albumCountThreshold=1000} order by count","dateUpdated":"Jan 25, 2016 12:51:07 
PM","config":{"colWidth":12,"graph":{"mode":"multiBarChart","height":196,"optionOpen":false,"keys":[{"name":"country","index":0,"aggr":"sum"}],"values":[{"name":"nb","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"country","index":0,"aggr":"sum"},"yAxis":{"name":"nb","index":1,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/sql","title":true,"editorHide":true},"settings":{"params":{"albumCountThreshold":"1500"},"forms":{"albumCountThreshold":{"name":"albumCountThreshold","defaultValue":"1000","hidden":false}}},"jobName":"paragraph_1452791611370_-1853276955","id":"20160114-181331_1246292720","dateCreated":"Jan 14, 2016 6:13:31 PM","dateStarted":"Jan 23, 2016 7:19:15 PM","dateFinished":"Jan 23, 2016 7:19:15 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:38"},{"dateUpdated":"Jan 25, 2016 12:51:09 PM","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1452760495107_-1993909005","id":"20160114-093455_823181330","dateCreated":"Jan 14, 2016 9:34:55 AM","dateStarted":"Jan 23, 2016 5:42:16 PM","dateFinished":"Jan 23, 2016 5:42:17 PM","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:39"}],"name":"Demo_Zeppelin_Spark_Cassandra","id":"2BAMHCRKT","angularObjects":{"2BAR5Q1CK":[],"2B8HBPA4J":[],"2B8SYS13E":[],"2B9RS14RM":[],"2B9K44KEM":[],"2BBTVXAEZ":[],"2BB93S3V8":[],"2BBP6Q212":[],"2B9VGMWN8":[],"2B8V4FGJX":[],"2B7Z92PY7":[],"2B98HH4BJ":[],"2B9MPUCPQ":[],"2BB2GBKK9":[]},"config":{"looknfeel":"default"},"info":{}}