{"paragraphs":[{"text":"%md\n\n## Spark HBase - A DataFrame Based Connector\n\nCreated by @RobHryniewicz\nver 0.1 - last updated on Jun 26, 2016\n\n\n\n## Introduction\n\n[Spark-HBase connector](https://github.com/hortonworks/shc) was developed by Hortonworks along with Bloomberg. The connector leverages Spark SQL Data Sources API introduced in Spark-1.2.0. It bridges the gap between the simple HBase Key Value store and complex relational SQL queries and enables users to perform complex data analytics on top of HBase using Spark. An HBase DataFrame is a standard Spark DataFrame, and is able to interact with any other data sources such as Hive, ORC, Parquet, JSON, etc.\n\n## Prerequisites\n\n* [HDP 2.5 TP](http://hortonworks.com/tech-preview-hdp-2-5)\n\n## Background\n\nThere are several open source Spark HBase connectors available either as Spark packages, as independent projects or in HBase trunk. Spark has moved to the Dataset/DataFrame APIs, which provides built-in query plan optimization. Now, end users prefer to use DataFrames/Datasets based interface. The HBase connector in the HBase trunk has a rich support at the RDD level, e.g. BulkPut, etc, but its DataFrame support is not as rich. HBase trunk connector relies on the standard HadoopRDD with HBase built-in TableInputFormat has some performance limitations. In addition, BulkGet performed in the the driver may be a single point of failure. There are some other alternative implementations. Take [**Spark-SQL-on-HBase**](https://github.com/Huawei-Spark/Spark-SQL-on-HBase) as an example. It applies very advanced custom optimization techniques by embedding its own query optimization plan inside the standard Spark Catalyst engine, ships the RDD to HBase and performs complicated tasks, such as partial aggregation, inside the HBase coprocessor. This approach is able to achieve high performance, but it difficult to maintain due to its complexity and the rapid evolution of Spark. Also allowing arbitrary code to run inside a coprocessor may pose security risks. The Spark-on-HBase Connector (SHC) has been developed to overcome these potential bottlenecks and weaknesses. It implements the standard Spark Datasource API, and leverages the Spark Catalyst engine for query optimization. In parallel, the RDD is constructed from scratch instead of using TableInputFormat in order to achieve high performance. With this customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes the maintenance very easy, while achieving a good tradeoff between performance and simplicity.\n\n## Architecture\n\nWe assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located with region servers, as illustrated in the figure below.\n\n![age](http://hortonworks.com/wp-content/uploads/2016/06/age.png)\n\nFigure 1\\. Spark-on-HBase Connector Architecture\n\nAt a high-level, the connector treats both Scan and Get in a similar way, and both actions are performed in the executors. The driver processes the query, aggregates scans/gets based on the region’s metadata, and generates tasks per region. The tasks are sent to the preferred executors co-­located with the region server, and are performed in parallel in the executors to achieve better data locality and concurrency. If a region does not hold the data required, that region server is not assigned any task. 
A task may consist of multiple Scans and BulkGets, and the data requested by a task is retrieved from only one region server, and this region server will also be the locality preference for the task. Note that the driver is not involved in the real job execution except for scheduling tasks. This avoids the driver being the bottleneck.\n\n## Table Catalog\n\nTo bring an HBase table into Spark as a relational table, we define a mapping between HBase and Spark tables, called the Table Catalog. There are two critical parts of this catalog. One is the rowkey definition and the other is the mapping between a table column in Spark and the column family and column qualifier in HBase. Please refer to the Usage section for details.\n\n\n## Native Avro support\n\nThe connector supports the Avro format natively, as it is a very common practice to persist structured data into HBase as a byte array. Users can persist Avro records into HBase directly. Internally, the Avro schema is converted to a native Spark Catalyst data type automatically. Note that both the key and value parts of an HBase table can be defined in Avro format. Please refer to the examples/test cases in the repo for exact usage.\n\n## Predicate Pushdown\n\nThe connector only retrieves required columns from the region servers to reduce network overhead and avoid redundant processing in the Spark Catalyst engine. Existing standard HBase filters are used to perform predicate push-down without leveraging the coprocessor capability. Because HBase is not aware of any data type other than byte arrays, and because of the ordering inconsistency between Java primitive types and byte arrays, we have to preprocess the filter condition before setting the filter in the Scan operation to avoid any data loss. Inside the region server, records not matching the query condition are filtered out.\n\n## Partition Pruning\n\nBy extracting the row key from the predicates, we split the Scan/BulkGet into multiple non-overlapping ranges; only the region servers that hold the requested data will perform the Scan/BulkGet. Currently, the partition pruning is performed on the first dimension of the row keys. For example, if a row key is “key1:key2:key3”, the partition pruning will be based on “key1” only. Note that the WHERE conditions need to be defined carefully. Otherwise, the partition pruning may not take effect. For example, `WHERE rowkey1 > \"abc\" OR column = \"xyz\"` (where rowkey1 is the first dimension of the rowkey, and column is a regular HBase column) will result in a full scan, as we have to cover all the ranges because of the **OR** logic.\n\n## Data Locality\n\nWhen Spark executors are co-located with HBase region servers, data locality is achieved by identifying the region server location and making a best effort to co-locate each task with its region server. Each executor performs the Scan/BulkGet on the part of the data co-located on the same host.\n\n## Scan and BulkGet\n\nThese two operators are exposed to users by specifying a WHERE clause, e.g., `WHERE column > x and column < y` for a scan and `WHERE column = x` for a get. The operations are performed in the executors, and the driver only constructs these operations. Internally they are converted to Scan and/or Get operations, and an Iterator[Row] is returned to the Catalyst engine for upper-layer processing.\n\n## Usage\n\nThe following illustrates the basic procedure for using the connector.
For more details and advanced use cases, such as Avro and composite key support, please refer to the [examples](https://github.com/hortonworks/shc/tree/master/src/main/scala/org/apache/spark/sql/execution/datasources/hbase/examples) in the repository.","user":"admin","dateUpdated":"2016-06-25T01:16:09+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466803614523_1195275638","id":"20160624-212654_941142543","result":{"code":"SUCCESS","type":"HTML","msg":"

Spark HBase - A DataFrame Based Connector

\n

Created by @RobHryniewicz\n
ver 0.1 - last updated on Jun 26, 2016

\n

Introduction

\n

The Spark-HBase connector was developed by Hortonworks along with Bloomberg. The connector leverages the Spark SQL Data Sources API introduced in Spark 1.2.0. It bridges the gap between the simple HBase key-value store and complex relational SQL queries and enables users to perform complex data analytics on top of HBase using Spark. An HBase DataFrame is a standard Spark DataFrame, and is able to interact with any other data sources such as Hive, ORC, Parquet, JSON, etc.

\n

Prerequisites

\n\n

Background

\n

There are several open source Spark HBase connectors available either as Spark packages, as independent projects or in HBase trunk. Spark has moved to the Dataset/DataFrame APIs, which provide built-in query plan optimization, and end users now prefer the DataFrame/Dataset based interface. The HBase connector in the HBase trunk has rich support at the RDD level, e.g. BulkPut, but its DataFrame support is not as rich. The HBase trunk connector relies on the standard HadoopRDD with the HBase built-in TableInputFormat, which has some performance limitations. In addition, a BulkGet performed in the driver may be a single point of failure. There are some other alternative implementations. Take Spark-SQL-on-HBase as an example. It applies very advanced custom optimization techniques by embedding its own query optimization plan inside the standard Spark Catalyst engine, ships the RDD to HBase and performs complicated tasks, such as partial aggregation, inside the HBase coprocessor. This approach is able to achieve high performance, but it is difficult to maintain due to its complexity and the rapid evolution of Spark. Also, allowing arbitrary code to run inside a coprocessor may pose security risks. The Spark-on-HBase Connector (SHC) has been developed to overcome these potential bottlenecks and weaknesses. It implements the standard Spark Datasource API, and leverages the Spark Catalyst engine for query optimization. At the same time, the RDD is constructed from scratch instead of using TableInputFormat in order to achieve high performance. With this customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes maintenance very easy, while achieving a good tradeoff between performance and simplicity.

\n

Architecture

\n

We assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located with region servers, as illustrated in the figure below.

\n

\"age\"

\n

Figure 1. Spark-on-HBase Connector Architecture

\n

At a high level, the connector treats both Scan and Get in a similar way, and both actions are performed in the executors. The driver processes the query, aggregates scans/gets based on the region’s metadata, and generates tasks per region. The tasks are sent to the preferred executors co-located with the region server, and are performed in parallel in the executors to achieve better data locality and concurrency. If a region does not hold the data required, that region server is not assigned any task. A task may consist of multiple Scans and BulkGets, and the data requested by a task is retrieved from only one region server, and this region server will also be the locality preference for the task. Note that the driver is not involved in the real job execution except for scheduling tasks. This avoids the driver being the bottleneck.

\n

Table Catalog

\n

To bring an HBase table into Spark as a relational table, we define a mapping between HBase and Spark tables, called the Table Catalog. There are two critical parts of this catalog. One is the rowkey definition and the other is the mapping between a table column in Spark and the column family and column qualifier in HBase. Please refer to the Usage section for details.
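As a minimal sketch of such a catalog (reusing the `table1`/`cf*`/`col*` naming from the notebook paragraphs further below; the value name `catalogSketch` is made up), the mapping is just a JSON document that names the HBase table, declares which Spark column carries the row key, and maps every other Spark column to a column family and qualifier:

```scala
// Sketch of a Table Catalog. The reserved column family name "rowkey" marks
// the Spark column holding the HBase row key; every other entry maps a Spark
// column to an HBase column family ("cf") and qualifier ("col").
def catalogSketch = s"""{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
  |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
  |"col4":{"cf":"cf4", "col":"col4", "type":"int"}
  |}
|}""".stripMargin
```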

\n

Native Avro support

\n

The connector supports the Avro format natively, as it is a very common practice to persist structured data into HBase as a byte array. Users can persist Avro records into HBase directly. Internally, the Avro schema is converted to a native Spark Catalyst data type automatically. Note that both the key and value parts of an HBase table can be defined in Avro format. Please refer to the examples/test cases in the repo for exact usage.

\n

Predicate Pushdown

\n

The connector only retrieves required columns from the region servers to reduce network overhead and avoid redundant processing in the Spark Catalyst engine. Existing standard HBase filters are used to perform predicate push-down without leveraging the coprocessor capability. Because HBase is not aware of any data type other than byte arrays, and because of the ordering inconsistency between Java primitive types and byte arrays, we have to preprocess the filter condition before setting the filter in the Scan operation to avoid any data loss. Inside the region server, records not matching the query condition are filtered out.
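For illustration, here is a sketch (assuming the `catalog` mapping and the column names defined in the notebook paragraphs below) of a query whose predicates are pushed down as HBase filters, with `select` additionally limiting which qualifiers are fetched:

```scala
import org.apache.spark.sql.execution.datasources.hbase._
import sqlContext.implicits._  // usually already in scope in the Zeppelin Spark interpreter

// Build the HBase-backed DataFrame through the catalog mapping
// (same pattern as the withCatalog helper defined later in this notebook).
val hbaseDF = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// The comparisons below are translated into HBase filters, so non-matching
// cells are dropped inside the region servers; only col0/col2/col4 are requested.
hbaseDF.filter($"col4" === 42 || $"col2" < 3.0)
  .select("col0", "col2", "col4")
  .show()
```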

\n

Partition Pruning

\n

By extracting the row key from the predicates, we split the Scan/BulkGet into multiple non-overlapping ranges; only the region servers that hold the requested data will perform the Scan/BulkGet. Currently, the partition pruning is performed on the first dimension of the row keys. For example, if a row key is “key1:key2:key3”, the partition pruning will be based on “key1” only. Note that the WHERE conditions need to be defined carefully. Otherwise, the partition pruning may not take effect. For example, WHERE rowkey1 > \"abc\" OR column = \"xyz\" (where rowkey1 is the first dimension of the rowkey, and column is a regular HBase column) will result in a full scan, as we have to cover all the ranges because of the OR logic.
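Sketched against the single-dimension row key used later in this notebook (`col0` maps to the row key and `col7` is a regular column; `hbaseDF` is the HBase-backed DataFrame from the sketch above), the first filter below can be pruned to the regions covering the key range, while the second cannot:

```scala
// Prunable: the predicate bounds the row-key column, so only the regions whose
// key ranges overlap [row050, row100) need to be scanned.
val pruned = hbaseDF.filter($"col0" >= "row050" && $"col0" < "row100")

// Not prunable: the OR with a regular column means every key range could hold a
// match, so this falls back to a full table scan.
val fullScan = hbaseDF.filter($"col0" > "row050" || $"col7" === "xyz")

pruned.count()
fullScan.count()
```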

\n

Data Locality

\n

When Spark executors are co-located with HBase region servers, data locality is achieved by identifying the region server location and making a best effort to co-locate each task with its region server. Each executor performs the Scan/BulkGet on the part of the data co-located on the same host.

\n

Scan and BulkGet

\n

These two operators are exposed to users by specifying a WHERE clause, e.g., WHERE column > x and column < y for a scan and WHERE column = x for a get. The operations are performed in the executors, and the driver only constructs these operations. Internally they are converted to Scan and/or Get operations, and an Iterator[Row] is returned to the Catalyst engine for upper-layer processing.
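A sketch using the temp-table approach shown at the end of this notebook (the view name `hbase_view` is made up; the notebook itself uses `table`): a row-key range in the WHERE clause maps to a Scan over the matching regions, while an equality on the row key maps to a Get/BulkGet.

```scala
// Register the HBase-backed DataFrame so it can be queried with SQL.
hbaseDF.registerTempTable("hbase_view")

// Range on the row-key column -> Scan restricted to the matching regions.
sqlContext.sql(
  "SELECT col0, col4 FROM hbase_view WHERE col0 > 'row010' AND col0 < 'row020'").show()

// Equality on the row-key column -> Get/BulkGet performed in the executors.
sqlContext.sql(
  "SELECT col0, col4 FROM hbase_view WHERE col0 = 'row005'").show()
```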

\n

Usage

\n

The following illustrates the basic procedure for using the connector. For more details and advanced use cases, such as Avro and composite key support, please refer to the examples in the repository.
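As a compact preview of the steps that follow (the notebook paragraphs below run each of these for real; `data` and `catalog` refer to the values defined there), the write and read paths boil down to:

```scala
import org.apache.spark.sql.execution.datasources.hbase._
import sqlContext.implicits._  // usually already in scope in the Zeppelin Spark interpreter

// Write path: create the HBase table (pre-split into 5 regions) and persist the
// DataFrame through the catalog mapping.
sc.parallelize(data).toDF.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

// Read path: rebuild a DataFrame from the same catalog and inspect it.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show(5)
```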

\n"},"dateCreated":"2016-06-24T21:26:54+0000","dateStarted":"2016-06-25T01:16:05+0000","dateFinished":"2016-06-25T01:16:05+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:353","focus":true},{"text":"%md\n\n## Setup & Pre-Check\n\nBefore executing the following paragraphs make sure that\n1. `hbase-site.xml` is copied from `hbase-client/conf` to `spark-client/conf`.
\nAs the **root** user, execute the following*:\n a. `cd /usr/hdp/current/spark-client/conf`\n b. `cp /usr/hdp/current/hbase-client/conf/hbase-site.xml .`

\n2. Next, via Ambari, verify that HBase Master and Region Server have been started, i.e.\n - Active HBase Master / HBase\n - RegionServer / HBase\n\n\\* You can access your HDP terminal in one of two ways:\n- in your browser address bar type `<host>:4200` (e.g. `127.0.0.1:4200`) \n- or SSH from your terminal: `$ ssh root@<host> -p 2222` (e.g. `ssh root@127.0.0.1 -p 2222`)","user":"admin","dateUpdated":"2016-06-25T01:22:47+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorHide":true,"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466803621331_283636928","id":"20160624-212701_1606637155","result":{"code":"SUCCESS","type":"HTML","msg":"

Setup & Pre-Check

\n

Before executing the following paragraphs make sure that

\n
    \n
  1. hbase-site.xml is copied from hbase-client/conf to spark-client/conf.
    \n
    As the root user, execute the following*:\n
    a. cd /usr/hdp/current/spark-client/conf\n
    b. cp /usr/hdp/current/hbase-client/conf/hbase-site.xml .

  2. Next, via Ambari verify that HBase Master and Region Server have been started, i.e.
      \n
    • Active HBase Master / HBase
    • RegionServer / HBase
    \n
\n

* You can access your HDP terminal in one of two ways:
    • in your browser address bar type <host>:4200 (e.g. 127.0.0.1:4200)
    • or SSH from your terminal: $ ssh root@<host> -p 2222 (e.g. ssh root@127.0.0.1 -p 2222)

\n\n"},"dateCreated":"2016-06-24T21:27:01+0000","dateStarted":"2016-06-25T01:22:46+0000","dateFinished":"2016-06-25T01:22:46+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:354","focus":true},{"title":"Set Dependencies","text":"%dep\n\nz.reset()\nz.load(\"zhzhan:shc:0.0.11-1.6.1-s_2.10\")","user":"admin","dateUpdated":"2016-06-24T22:58:39+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466803622874_1265900870","id":"20160624-212702_1707463999","result":{"code":"SUCCESS","type":"TEXT","msg":"DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.\nDepInterpreter(%dep) deprecated. Load dependency through GUI interpreter menu instead.\nres0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@52503f53\n"},"dateCreated":"2016-06-24T21:27:02+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:49+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:355"},{"text":"%md\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466803647547_67438694","id":"20160624-212727_1517244510","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:27:27+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:356"},{"title":"Define HBase & Spark table schema mapping","text":"%spark\nimport org.apache.spark.sql.execution.datasources.hbase._\ndef catalog = s\"\"\"{\n |\"table\":{\"namespace\":\"default\", \"name\":\"table1\"},\n |\"rowkey\":\"key\",\n |\"columns\":{\n |\"col0\":{\"cf\":\"rowkey\", \"col\":\"key\", \"type\":\"string\"},\n |\"col1\":{\"cf\":\"cf1\", \"col\":\"col1\", \"type\":\"boolean\"},\n |\"col2\":{\"cf\":\"cf2\", \"col\":\"col2\", \"type\":\"double\"},\n |\"col3\":{\"cf\":\"cf3\", \"col\":\"col3\", \"type\":\"float\"},\n |\"col4\":{\"cf\":\"cf4\", \"col\":\"col4\", \"type\":\"int\"},\n |\"col5\":{\"cf\":\"cf5\", \"col\":\"col5\", \"type\":\"bigint\"},\n |\"col6\":{\"cf\":\"cf6\", \"col\":\"col6\", \"type\":\"smallint\"},\n |\"col7\":{\"cf\":\"cf7\", \"col\":\"col7\", \"type\":\"string\"},\n |\"col8\":{\"cf\":\"cf8\", \"col\":\"col8\", \"type\":\"tinyint\"}\n |}\n|}\"\"\".stripMargin","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"tableHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466803663120_-1433836190","id":"20160624-212743_1922516687","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.sql.execution.datasources.hbase._\ncatalog: 
String\n"},"dateCreated":"2016-06-24T21:27:43+0000","dateStarted":"2016-06-24T22:58:42+0000","dateFinished":"2016-06-24T22:59:33+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:357"},{"text":"%md\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804631511_273854573","id":"20160624-214351_322256261","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:43:51+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:358"},{"title":"Define Spark row format ","text":"%spark \n\ncase class HBaseRecord(\n col0: String,\n col1: Boolean,\n col2: Double,\n col3: Float,\n col4: Int,\n col5: Long,\n col6: Short,\n col7: String,\n col8: Byte)","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804710100_1512299985","id":"20160624-214510_47211312","result":{"code":"SUCCESS","type":"TEXT","msg":"defined class HBaseRecord\n"},"dateCreated":"2016-06-24T21:45:10+0000","dateStarted":"2016-06-24T22:58:49+0000","dateFinished":"2016-06-24T22:59:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:359"},{"text":"%md\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804726734_-406427455","id":"20160624-214526_1976542390","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:45:26+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:360"},{"title":"Define a row generator","text":"%spark\n\nobject HBaseRecord2 {def apply(i: Int, t: String): HBaseRecord = { val s = s\"\"\"row${\"%03d\".format(i)}\"\"\"\n HBaseRecord(\n s,\n i % 2 == 0,\n i.toDouble,\n i.toFloat,\n i,\n i.toLong,\n i.toShort,\n s\"String$i: $t\",\n i.toByte)\n}}","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804778727_2054457682","id":"20160624-214618_1159565273","result":{"code":"SUCCESS","type":"TEXT","msg":"defined module 
HBaseRecord2\n"},"dateCreated":"2016-06-24T21:46:18+0000","dateStarted":"2016-06-24T22:59:34+0000","dateFinished":"2016-06-24T22:59:34+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:361"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804805167_839267414","id":"20160624-214645_442784269","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:46:45+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:362"},{"title":"Generate 256 rows of data","text":"%spark\n\nval data = (0 to 255).map { i => HBaseRecord2(i, \"extra\") }","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804814848_975114424","id":"20160624-214654_1172075436","result":{"code":"SUCCESS","type":"TEXT","msg":"data: scala.collection.immutable.IndexedSeq[HBaseRecord] = Vector(HBaseRecord(row000,true,0.0,0.0,0,0,0,String0: extra,0), HBaseRecord(row001,false,1.0,1.0,1,1,1,String1: extra,1), HBaseRecord(row002,true,2.0,2.0,2,2,2,String2: extra,2), HBaseRecord(row003,false,3.0,3.0,3,3,3,String3: extra,3), HBaseRecord(row004,true,4.0,4.0,4,4,4,String4: extra,4), HBaseRecord(row005,false,5.0,5.0,5,5,5,String5: extra,5), HBaseRecord(row006,true,6.0,6.0,6,6,6,String6: extra,6), HBaseRecord(row007,false,7.0,7.0,7,7,7,String7: extra,7), HBaseRecord(row008,true,8.0,8.0,8,8,8,String8: extra,8), HBaseRecord(row009,false,9.0,9.0,9,9,9,String9: extra,9), HBaseRecord(row010,true,10.0,10.0,10,10,10,String10: extra,10), HBaseRecord(row011,false,11.0,11.0,11,11,11,String11: extra,11), HBaseRecord(row012,true,12...."},"dateCreated":"2016-06-24T21:46:54+0000","dateStarted":"2016-06-24T22:59:34+0000","dateFinished":"2016-06-24T22:59:35+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:363"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804825535_-1251427461","id":"20160624-214705_702186998","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:47:05+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:364"},{"title":"Write data to HBase table","text":"%spark\n\nsc.parallelize(data).toDF.write.options(\n Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> 
\"5\")).format(\"org.apache.spark.sql.execution.datasources.hbase\").save()","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804830158_109429398","id":"20160624-214710_1102634259","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:47:10+0000","dateStarted":"2016-06-24T22:59:35+0000","dateFinished":"2016-06-24T22:59:42+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:365"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804849895_2051687819","id":"20160624-214729_339519123","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:47:29+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:40+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:366"},{"title":"Define a function to read data from HBase","text":"%spark\n\nimport org.apache.spark.sql._\n\ndef withCatalog(cat: String): DataFrame = {\n sqlContext\n .read\n .options(Map(HBaseTableCatalog.tableCatalog->cat))\n .format(\"org.apache.spark.sql.execution.datasources.hbase\")\n .load()\n}","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804864295_-1780964541","id":"20160624-214744_935116517","result":{"code":"SUCCESS","type":"TEXT","msg":"import org.apache.spark.sql._\nwithCatalog: (cat: String)org.apache.spark.sql.DataFrame\n"},"dateCreated":"2016-06-24T21:47:44+0000","dateStarted":"2016-06-24T22:59:35+0000","dateFinished":"2016-06-24T22:59:43+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:367"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804879102_1458636522","id":"20160624-214759_846299105","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:47:59+0000","dateStarted":"2016-06-24T22:58:40+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:368"},{"title":"Create a DataFrame from the HBase Catalog","text":"%spark \n\nval df = withCatalog(catalog)","user":"admin","dateUpdated":"2016-06-24T22:58:40+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorHide":false,"title":true,"editorMode":"ace/mode/scala"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804883627_-1870595711","id":"20160624-214803_2046855961","result":{"code":"SUCCESS","type":"TEXT","msg":"df: 
org.apache.spark.sql.DataFrame = [col4: int, col7: string, col1: boolean, col3: float, col6: smallint, col0: string, col8: tinyint, col2: double, col5: bigint]\n"},"dateCreated":"2016-06-24T21:48:03+0000","dateStarted":"2016-06-24T22:59:43+0000","dateFinished":"2016-06-24T22:59:44+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:369"},{"text":"%md\n","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804941966_343542264","id":"20160624-214901_1245630067","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:49:01+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:370"},{"title":"Typical DataFrame operation (filter)","text":"%spark\n\nval s = df.filter(((\n $\"col0\" <= \"row050\" && $\"col0\" > \"row040\") ||\n $\"col0\" === \"row005\" || $\"col0\" === \"row020\" ||\n $\"col0\" === \"r20\" || $\"col0\" <= \"row005\") &&\n ($\"col4\" === 1 || $\"col4\" === 42))\n .select(\"col0\", \"col1\", \"col4\")","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804954732_769874694","id":"20160624-214914_811694998","result":{"code":"SUCCESS","type":"TEXT","msg":"s: org.apache.spark.sql.DataFrame = [col0: string, col1: boolean, col4: int]\n"},"dateCreated":"2016-06-24T21:49:14+0000","dateStarted":"2016-06-24T22:59:43+0000","dateFinished":"2016-06-24T22:59:45+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:371"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804971692_213927857","id":"20160624-214931_2048565216","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:49:31+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:372"},{"title":"Show data in columns","text":"%spark\n\ns.show","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804977387_1176184857","id":"20160624-214937_554428354","result":{"code":"SUCCESS","type":"TEXT","msg":"+------+-----+----+\n| col0| col1|col4|\n+------+-----+----+\n|row001|false| 1|\n|row042| true| 
42|\n+------+-----+----+\n\n"},"dateCreated":"2016-06-24T21:49:37+0000","dateStarted":"2016-06-24T22:59:44+0000","dateFinished":"2016-06-24T22:59:47+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:373"},{"text":"%md\n","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466804988924_-1911794491","id":"20160624-214948_2007471071","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:49:48+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:374"},{"title":"Register temporary table","text":"%spark\n\ndf.registerTempTable(\"table\")","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466805001140_-130624911","id":"20160624-215001_1131970480","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:50:01+0000","dateStarted":"2016-06-24T22:59:45+0000","dateFinished":"2016-06-24T22:59:47+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:375"},{"text":"%md\n\n","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466805024303_1960146584","id":"20160624-215024_1017957917","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T21:50:24+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:376"},{"title":"Use SQL syntax to query temporary table","text":"%spark\n\nsqlContext.sql(\"select count(col1) from table\").show","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","editorHide":false,"title":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466805032031_-1000880951","id":"20160624-215032_78931382","result":{"code":"SUCCESS","type":"TEXT","msg":"+---+\n|_c0|\n+---+\n|256|\n+---+\n\n"},"dateCreated":"2016-06-24T21:50:32+0000","dateStarted":"2016-06-24T22:59:47+0000","dateFinished":"2016-06-24T22:59:50+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:377"},{"text":"%md\n\n## **Putting It All Together**\n\nWe've just given a quick overview of how HBase supports Spark at the DataFrame level. With the DataFrame API Spark applications can work with data stored in HBase table as easily as any data stored in other data sources. With this new feature, data in HBase tables can be easily consumed by Spark applications and other interactive tools, e.g. 
users can run a complex SQL query on top of an HBase table inside Spark, perform a table join against other DataFrames, or integrate with Spark Streaming to implement a more complicated system.\n\n## **What’s Next?**\n\nCurrently, the connector is hosted in the [Hortonworks repo](https://github.com/hortonworks/shc), and published as a [Spark package](http://spark-packages.org/package/zhzhan/shc). It is in the process of being migrated to Apache HBase trunk.\n\nFuture work will include:\n\n* optimization of underlying computing architecture for Scan and BulkGet\n* JSON user interface for ease of use\n* DataFrame writing path\n* Avro support\n* Java primitive types (short, int, long, float, double etc.)\n* composite row key\n* timestamp semantics (optional)","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466805036871_-1535281849","id":"20160624-215036_582949583","result":{"code":"SUCCESS","type":"HTML","msg":"

Putting It All Together

\n

We've just given a quick overview of how HBase supports Spark at the DataFrame level. With the DataFrame API, Spark applications can work with data stored in an HBase table as easily as any data stored in other data sources. With this new feature, data in HBase tables can be easily consumed by Spark applications and other interactive tools, e.g. users can run a complex SQL query on top of an HBase table inside Spark, perform a table join against other DataFrames, or integrate with Spark Streaming to implement a more complicated system.
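For instance, a hypothetical sketch of the "join against other DataFrames" case (the `labels` DataFrame and its column names are made up; `df` is the HBase-backed DataFrame created earlier in this notebook):

```scala
import sqlContext.implicits._  // usually already in scope in the Zeppelin Spark interpreter

// A small in-memory DataFrame to join against the HBase-backed one.
val labels = sc.parallelize(Seq((1, "one"), (42, "answer"))).toDF("col4", "label")

// The HBase DataFrame joins like any other data source.
df.join(labels, Seq("col4"))
  .select("col0", "col4", "label")
  .show()
```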

\n

What’s Next?

\n

Currently, the connector is hosted in the Hortonworks repo, and published as a Spark package. It is in the process of being migrated to Apache HBase trunk.

\n

Future work will include:

\n\n"},"dateCreated":"2016-06-24T21:50:36+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:378"},{"text":"%md ","user":"admin","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorHide":true,"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466805686370_-1585932431","id":"20160624-220126_2020741756","result":{"code":"SUCCESS","type":"TEXT","msg":""},"dateCreated":"2016-06-24T22:01:26+0000","dateStarted":"2016-06-24T22:58:41+0000","dateFinished":"2016-06-24T22:58:41+0000","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:379"},{"text":"%md ","dateUpdated":"2016-06-24T22:58:41+0000","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1466809121857_1038542676","id":"20160624-225841_1410118809","dateCreated":"2016-06-24T22:58:41+0000","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:380"}],"name":"Spark HBase - A DataFrame Based Connector","id":"2BRZCAM4E","lastReplName":{"value":"md"},"angularObjects":{"2BR44EQTD:shared_process":[],"2BQXB2Q56:shared_process":[],"2BNVUS6WF:shared_process":[],"2BQ8ZV7GK:shared_process":[],"2BP5X5WH3:shared_process":[],"2BRKE4HX7:shared_process":[]},"config":{"looknfeel":"default"},"info":{}}