{"paragraphs":[{"text":"### First steps with Hive\n\nCreate your database on hive. We will name it with your EPFL gaspar name.\n\n```shell\n%jdbc(hive)\ncreate database if not exists your_gaspar_name\n location '/user/your_gaspar_name/hive'; \n```\n\nCreate an external Hive table. Hive will create a reference to the files but it will not manage the files itself. If you drop the table, only the definition in Hive is deleted. This exercise will work only if you have completed the HDFS exercises.\n\n```shell\n%jdbc(hive)\ncreate external table if not exists your_gaspar_name.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)\n row format delimited fields terminated by ','\n stored as textfile\n location '/user/your_gaspar_name/work1/';\n```\n\nNow verify that your table was properly created.\n```shell\n%jdbc(hive)\nselect * from your_gaspar_name.traffic_count limit 10;\n```\n\nReplace your_gaspar_name with your username and try the above commands in the next cells, create as many cells as needed.\n\nNote the first row. what did we do wrong?\n","user":"ebouille","dateUpdated":"2018-03-28T13:28:37+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

First steps with Hive

\n

Create your database on Hive. We will name it after your EPFL gaspar name.

\n
%jdbc(hive)\ncreate database if not exists your_gaspar_name\n  location '/user/your_gaspar_name/hive'; \n
\n

Create an external Hive table. Hive will keep a reference to the files, but it will not manage the files itself: if you drop the table, only its definition in Hive is deleted. This exercise will work only if you have completed the HDFS exercises.

\n
%jdbc(hive)\ncreate external table if not exists your_gaspar_name.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)\n    row format delimited fields terminated by ','\n    stored as textfile\n    location '/user/your_gaspar_name/work1/';\n
\n
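Because the table is external, dropping it removes only the table definition in Hive; the files under /user/your_gaspar_name/work1/ stay in HDFS. A quick sanity check (a sketch; safe to run, since the data itself is never deleted):

```shell
%jdbc(hive)
-- removes only the metadata; the CSV files in HDFS are untouched
drop table your_gaspar_name.traffic_count;
-- re-running the create external table statement above restores the
-- table, because it merely points at the still-existing files
```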

Now verify that your table was properly created.

\n
%jdbc(hive)\nselect * from your_gaspar_name.traffic_count limit 10;\n
\n

Replace your_gaspar_name with your username and try the above commands in the next cells, creating as many cells as needed.

\n

Note the first row. What did we do wrong?

\n"}]},"apps":[],"jobName":"paragraph_1521555185538_1739158077","id":"20180320-102737_103167059","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T13:28:37+0200","dateFinished":"2018-03-28T13:28:37+0200","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:16668"},{"text":"#### Answers\n\nPrerequesite, you must first have completed the HDFS exercises and have at least imported the traffic files under /user/{your_gaspar_name}/work1 directory.\n\nCopy the provided code in Zeppelin cells, replace all occurrences of {your_gaspar_name} with your username and run the cells.\n\nNote that the first rows is the header of the imported traffic files. Hive does not remove it automatically. We should have removed the first row manually before copying it into HDFS.\n","user":"ebouille","dateUpdated":"2018-03-28T10:35:10+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answers

\n

Prerequisite: you must first have completed the HDFS exercises and have at least imported the traffic files under the /user/{your_gaspar_name}/work1 directory.

\n

Copy the provided code into Zeppelin cells, replace all occurrences of {your_gaspar_name} with your username, and run the cells.

\n

Note that the first row is the header of the imported traffic files. Hive does not remove it automatically; we should have removed the header manually before copying the files into HDFS.
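Alternatively, Hive can skip header lines at the table level with the skip.header.line.count table property; a sketch, assuming the cluster's Hive version supports it:

```shell
%jdbc(hive)
-- dropping the external table only removes its definition
drop table if exists your_gaspar_name.traffic_count;
create external table if not exists your_gaspar_name.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)
 row format delimited fields terminated by ','
 stored as textfile
 location '/user/your_gaspar_name/work1/'
 tblproperties ('skip.header.line.count'='1');
```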

\n"}]},"apps":[],"jobName":"paragraph_1522223039116_1686003616","id":"20180328-094359_313340091","dateCreated":"2018-03-28T09:43:59+0200","dateStarted":"2018-03-28T10:35:10+0200","dateFinished":"2018-03-28T10:35:10+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16669"},{"text":"%jdbc(hive)\ncreate database if not exists {your_gaspar_name}\n location '/user/{your_gaspar_name}/hive'\n ","user":"ebouille","dateUpdated":"2018-03-28T10:49:26+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521597526899_-1482790124","id":"20180321-025846_71567959","dateCreated":"2018-03-21T02:58:46+0100","dateStarted":"2018-03-28T10:49:17+0200","dateFinished":"2018-03-28T10:49:18+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16670"},{"text":"%jdbc(hive)\ncreate external table if not exists {your_gaspar_name}.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)\n row format delimited fields terminated by ','\n stored as textfile\n location '/user/{your_gaspar_name}/work1/';","user":"ebouille","dateUpdated":"2018-03-28T10:59:38+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. 
Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1522224950957_1589747395","id":"20180328-101550_1299033602","dateCreated":"2018-03-28T10:15:50+0200","dateStarted":"2018-03-28T10:58:58+0200","dateFinished":"2018-03-28T10:58:58+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16671"},{"text":"%jdbc(hive)\nselect * from {your_gaspar_name}.traffic_count limit 10;","user":"ebouille","dateUpdated":"2018-03-28T11:00:00+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"traffic_count.sdate\ttraffic_count.cosit\ttraffic_count.study\ttraffic_count.period\ttraffic_count.lanenumber\ttraffic_count.lanedescription\ttraffic_count.lanedirection\ttraffic_count.directiondescription\ttraffic_count.volume\ttraffic_count.flags\ttraffic_count.flagtext\ttraffic_count.setup\ttraffic_count.numbins\ttraffic_count.bins\nSdate\tnull\tnull\tnull\tnull\tLaneDescription\tnull\tDirectionDescription\tnull\tnull\tFlag Text\tnull\tnull\tnull\n26/07/2016 12:00\t70301\t1\t60\t1\tNorthbound\t1\tNorthEast\t-1\t10240\t\"Bad data\tnull\t1\t0\n26/07/2016 12:00\t70301\t1\t60\t2\tSouthbound\t2\tSouthWest\t-1\t10240\t\"Bad data\tnull\t1\t0\n26/07/2016 13:00\t70301\t1\t60\t1\tNorthbound\t1\tNorthEast\t70\t8192\tChecked\t1\t0\tnull\n26/07/2016 13:00\t70301\t1\t60\t2\tSouthbound\t2\tSouthWest\t79\t8192\tChecked\t1\t0\tnull\n26/07/2016 14:00\t70301\t1\t60\t1\tNorthbound\t1\tNorthEast\t96\t8192\tChecked\t1\t0\tnull\n26/07/2016 14:00\t70301\t1\t60\t2\tSouthbound\t2\tSouthWest\t102\t8192\tChecked\t1\t0\tnull\n26/07/2016 15:00\t70301\t1\t60\t1\tNorthbound\t1\tNorthEast\t95\t8192\tChecked\t1\t0\tnull\n26/07/2016 15:00\t70301\t1\t60\t2\tSouthbound\t2\tSouthWest\t96\t8192\tChecked\t1\t0\tnull\n26/07/2016 16:00\t70301\t1\t60\t1\tNorthbound\t1\tNorthEast\t106\t8192\tChecked\t1\t0\tnull\n"}]},"apps":[],"jobName":"paragraph_1522224959910_-80447580","id":"20180328-101559_975725070","dateCreated":"2018-03-28T10:15:59+0200","dateStarted":"2018-03-28T10:59:12+0200","dateFinished":"2018-03-28T10:59:12+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16672"},{"text":"%md\n### Import one day of Twitter data into Hive\n\nCreate an external table from HDFS dir __/datasets/twitter_one_day__ and call it __twitter_one_day__. The table should have a single column named __json__ of type __string__. Do not forget to use your own database (gaspar name)!\n\nA few hints:\n(1) You can explore the __/datasets/twitter_one_day__ directory from your terminal with the __hdfs dfs -ls__ command. You will notice that the files are in bzip2 format. Do not worry about that, Hive knows how to handle compressed text files automatically.\n(2) The files have only one field per line\n(3) If you do not specify the row format, the default format __fields terminated by '\\n'__ will be used.\n\nAfter the table __twitter_one_day__ is created, verify the content of its first row with a select command (limit 1). Use the output of the select query to identify the json fields where the the language and the timestamp information of the tweet are stored. 
You can use __http://jsonprettyprint.com/__ to pretty print the json string.\n\n","user":"ebouille","dateUpdated":"2018-03-28T10:35:11+0200","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Import one day of Twitter data into Hive

\n

Create an external table from the HDFS directory /datasets/twitter_one_day and call it twitter_one_day. The table should have a single column named json of type string. Do not forget to use your own database (gaspar name)!

\n

A few hints:\n
(1) You can explore the /datasets/twitter_one_day directory from your terminal with the hdfs dfs -ls command (see the example after this list). You will notice that the files are in bzip2 format. Do not worry about that: Hive knows how to handle compressed text files automatically.\n
(2) The files have only one field per line.\n
(3) If you do not specify the row format, Hive falls back to its defaults (lines terminated by '\\n'); since the files have a single field per line, each line becomes one json value.
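For example, from a terminal on the cluster (the exact file names will differ):

```shell
# list the input files; expect bzip2-compressed (.bz2) text files
hdfs dfs -ls /datasets/twitter_one_day
```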

\n

After the table twitter_one_day is created, verify the content of its first row with a select command (limit 1). Use the output of the select query to identify the json fields where the language and the timestamp information of the tweet are stored. You can use http://jsonprettyprint.com/ to pretty print the json string.

\n"}]},"apps":[],"jobName":"paragraph_1521555185539_1738773328","id":"20180320-133148_586015451","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T10:35:12+0200","dateFinished":"2018-03-28T10:35:12+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16673"},{"text":"%md\n#### Answers\n\nWe use the same create table command as before when we created a table from the traffic data. We only need to change the schema of the table to (json string), and the location to /datasets/twitter_one_day.\n\nDo not forget to replace {your_gaspar_name} with your username. And not mentioning a database name is possible, but it defaults to the __default__ database, which is shared by everyone using the cluster. This was the most common error in this class: people accessing, modifying and deleting the same tables simultaneously.\n\nThe json fields of interest are __lang__ and __timestamp_m__.\n","user":"ebouille","dateUpdated":"2018-03-28T11:40:02+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answers

\n

We use the same create table command as before when we created a table from the traffic data. We only need to change the schema of the table to (json string), and the location to /datasets/twitter_one_day.
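For reference, the filled-in statement (the same one used in the answer cell below):

```shell
%jdbc(hive)
drop table if exists {your_gaspar_name}.twitter_one_day;
create external table if not exists {your_gaspar_name}.twitter_one_day(json string)
 stored as textfile
 location '/datasets/twitter_one_day/';
```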

\n

Do not forget to replace {your_gaspar_name} with your username. Omitting the database name is possible, but then Hive falls back to the default database, which is shared by everyone on the cluster. This was the most common error in this class: people accessing, modifying, and deleting the same tables simultaneously.

\n

The json fields of interest are lang and timestamp_ms.

\n"}]},"apps":[],"jobName":"paragraph_1522225492386_631500994","id":"20180328-102452_1304380360","dateCreated":"2018-03-28T10:24:52+0200","dateStarted":"2018-03-28T11:40:02+0200","dateFinished":"2018-03-28T11:40:02+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16674"},{"text":"%jdbc(hive)\ndrop table {your_gaspar_name}.twitter_one_day;\ncreate external table if not exists {your_gaspar_name}.twitter_one_day(json string)\n stored as textfile\n location '/datasets/twitter_one_day/';\n","user":"ebouille","dateUpdated":"2018-03-28T10:46:01+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"},{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521597769923_2111501808","id":"20180321-030249_154474664","dateCreated":"2018-03-21T03:02:49+0100","dateStarted":"2018-03-28T10:45:24+0200","dateFinished":"2018-03-28T10:45:24+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16675"},{"text":"%md\n### Twitter language frequency\n\nCompute the language frequencies, sorted in decreasing order of popularity. You should only use standard SQL group and count commands.\n\nTake advantage of the Zeppelin graph toolbar that appears with the results.\n\n```shell\n%jdbc(hive)\nwith q as (\n select\n get_json_object(json, '$.lang') as lang\n from CHANGEME.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;\n```\n\nYou can try different visualizations of the results with the embedded Zeppelin graph interface.\n\nFor further reading see the distinction between __group by__ and __sort by__ in __https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy__ . You can try this at home.\n","user":"ebouille","dateUpdated":"2018-03-28T10:35:13+0200","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{"0":{"graph":{"mode":"pieChart","height":300,"optionOpen":true,"setting":{"multiBarChart":{"stacked":false}},"commonSetting":{},"keys":[{"name":"lang","index":0,"aggr":"sum"}],"groups":[],"values":[{"name":"count","index":1,"aggr":"sum"}]},"helium":{}}},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Twitter language frequency

\n

Compute the language frequencies, sorted in decreasing order of popularity. You should only use standard SQL group by and count constructs.

\n

Take advantage of the Zeppelin graph toolbar that appears with the results.

\n
%jdbc(hive)\nwith q as (\n    select\n        get_json_object(json, '$.lang') as lang\n    from CHANGEME.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;\n
\n

You can try different visualizations of the results with the embedded Zeppelin graph interface.

\n

For further reading, see the distinction between order by and sort by in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy . You can try this at home.
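As a sketch of the difference: order by enforces one global ordering (a single final reducer), while sort by only sorts within each reducer, so the overall output is just partially ordered. Replace CHANGEME as before and compare the output with the query above:

```shell
%jdbc(hive)
-- same aggregation, but each reducer sorts only its own share of
-- the output (partial order, cheaper on large data)
with q as (
 select get_json_object(json, '$.lang') as lang
 from CHANGEME.twitter_one_day
)
select lang, count(*) as count
from q
group by lang
sort by count desc;
```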

\n"}]},"apps":[],"jobName":"paragraph_1521555185543_1737234333","id":"20180320-135324_2117720219","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T10:35:13+0200","dateFinished":"2018-03-28T10:35:13+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16676"},{"text":"%md\n#### Answer\n\nUse the json field __lang__ and your gaspar name.\n","user":"ebouille","dateUpdated":"2018-03-28T12:29:15+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answer

\n

Use the json field lang and your gaspar name.

\n"}]},"apps":[],"jobName":"paragraph_1522232491962_117716872","id":"20180328-122131_1042605176","dateCreated":"2018-03-28T12:21:31+0200","dateStarted":"2018-03-28T12:29:15+0200","dateFinished":"2018-03-28T12:29:15+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16677"},{"text":"%jdbc(hive)\nwith q as (\n select\n get_json_object(json, '$.lang') as lang\n from {your_gaspar_name}.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;","user":"ebouille","dateUpdated":"2018-03-28T10:47:25+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"lang\tcount\nen\t1000596\nnull\t904724\nja\t681507\nar\t263575\nes\t238278\nund\t214158\npt\t190038\nko\t174870\nth\t111280\nfr\t63765\nin\t59658\ntr\t51658\ntl\t42213\nzh\t23703\nru\t21566\nit\t15748\nhi\t14164\nde\t12345\npl\t9509\nnl\t7706\nfa\t4195\nht\t4148\nur\t3893\net\t3838\nel\t2905\nsv\t2690\nta\t1890\nfi\t1592\ncs\t1482\nda\t1466\nno\t1339\neu\t1324\nro\t1281\ncy\t1218\nuk\t1005\nvi\t978\nne\t888\nhu\t823\nlt\t725\nlv\t625\niw\t498\nsl\t496\nis\t446\nsr\t399\nbn\t383\ngu\t339\nbg\t288\nmr\t243\nml\t231\nmy\t154\nte\t125\nkn\t123\nsi\t58\nckb\t53\npa\t44\nor\t27\nhy\t25\nps\t24\nkm\t23\nsd\t18\nka\t17\nam\t14\ndv\t6\nlo\t4\nug\t2\nbo\t1\n"}]},"apps":[],"jobName":"paragraph_1521602618863_766709691","id":"20180321-042338_1447379256","dateCreated":"2018-03-21T04:23:38+0100","dateStarted":"2018-03-28T10:46:17+0200","dateFinished":"2018-03-28T10:46:51+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16678"},{"text":"%md\n### Extract the timeseries of languages used in twitter.\n\nIn this exercise, you will only keep the language and timestamp information (in ms since 01.01.1971 00:00:00 +00:00) of each tweet from the __twitter_one_day__ table. The result will be a table with a column __lang__ of type string and a column __time__ of type timestamp. You will store this result into a new table called __twitter_lang__.\n\nWe provide part of the query. You must fix all CHANGEME as needed in order to perform the above operation. Use the drop table command if you do not get the table right the first time.\n\n```shell\n%jdbc(hive)\n\ncreate table CHANGEME\nstored as parquet\nas\nselect\n get_json_object(json, CHANGEME) as CHANGEME,\n from_utc_timestamp(cast(get_json_object(json, \"$.timestamp_ms\") as bigint), 'UTC') as time\nfrom CHANGEME;\n```\n\n","user":"ebouille","dateUpdated":"2018-03-28T12:25:21+0200","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Extract the time series of languages used on Twitter.

\n

In this exercise, you will keep only the language and timestamp information (in ms since 01.01.1970 00:00:00 +00:00, the Unix epoch) of each tweet from the twitter_one_day table. The result will be a table with a column lang of type string and a column time of type timestamp. You will store this result in a new table called twitter_lang.

\n

We provide part of the query. You must fix every CHANGEME as needed in order to perform the above operation. Use the drop table command if you do not get the table right on the first try.

\n
%jdbc(hive)\n\ncreate table CHANGEME\nstored as parquet\nas\nselect\n    get_json_object(json, CHANGEME) as CHANGEME,\n    from_utc_timestamp(cast(get_json_object(json, \"$.timestamp_ms\") as bigint), 'UTC') as time\nfrom CHANGEME;\n
\n"}]},"apps":[],"jobName":"paragraph_1521555185542_1737619082","id":"20180320-134103_2004356345","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T11:38:22+0200","dateFinished":"2018-03-28T11:38:22+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16679"},{"text":"%md\n\n#### Answers\n\nExtract the json fields __lang__ and __timestamp_ms__ from the rows, and convert the json __timestamp_ms__ with the from_utc_timestamp function.\n","user":"ebouille","dateUpdated":"2018-03-28T12:29:47+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answers

\n

Extract the json fields lang and timestamp_ms from the rows, and convert the json timestamp_ms with the from_utc_timestamp function.
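The completed statement, as used in the answer cell below:

```shell
%jdbc(hive)
drop table if exists {your_gaspar_name}.twitter_lang;
create table {your_gaspar_name}.twitter_lang
stored as parquet
as
select
 get_json_object(json, '$.lang') as lang,
 from_utc_timestamp(cast(get_json_object(json, '$.timestamp_ms') as bigint), 'UTC') as time
from {your_gaspar_name}.twitter_one_day;
```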

\n"}]},"apps":[],"jobName":"paragraph_1522232821140_771696413","id":"20180328-122701_1776992808","dateCreated":"2018-03-28T12:27:01+0200","dateStarted":"2018-03-28T12:29:47+0200","dateFinished":"2018-03-28T12:29:47+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16680"},{"text":"%jdbc(hive)\n\ndrop table if exists {your_gaspar_name}.twitter_lang;\ncreate table {your_gaspar_name}.twitter_lang\nstored as parquet\nas\nselect\n get_json_object(json, '$.lang') as lang,\n from_utc_timestamp(cast(get_json_object(json, '$.timestamp_ms') as bigint), 'UTC') as time\nfrom {your_gaspar_name}.twitter_one_day;","user":"ebouille","dateUpdated":"2018-03-28T12:30:48+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql","editorHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"},{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521602666923_-1654252982","id":"20180321-042426_1858484630","dateCreated":"2018-03-21T04:24:26+0100","dateStarted":"2018-03-28T12:26:33+0200","dateFinished":"2018-03-28T12:27:16+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16681"},{"text":"%md\n### The mysterious Hive query\n\nCan you guess the meaning of this Hive command?\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME.twitter_lang_csv\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nas\nwith q as (\n select \n lang,\n cast(date_format(time, 'HH') as int) as hour\n from CHANGEME\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;\n```\n\nReplace all CHANGME occurences as appropriate and try it.","user":"ebouille","dateUpdated":"2018-03-28T10:35:14+0200","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

The mysterious Hive query

\n

Can you guess the meaning of this Hive command?

\n
%jdbc(hive)\ncreate table CHANGEME.twitter_lang_csv\nrow format delimited\n    fields terminated by ','\n    escaped by '\"'\n    lines terminated by '\\n'\nstored as textfile\nas\nwith q as (\n    select \n        lang,\n        cast(date_format(time, 'HH') as int) as hour\n    from CHANGEME\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;\n
\n

Replace all CHANGEME occurrences as appropriate and try it.

\n"}]},"apps":[],"jobName":"paragraph_1521555185543_1737234333","id":"20180320-135452_2072033476","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T10:35:15+0200","dateFinished":"2018-03-28T10:35:15+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16682"},{"text":"%md\n##### Answers\n\nYou just need to replace all CHANGEME with your gaspar name in order to run this command.\n\nThis command create a new table called __twitter_lang_csv__ in your database. The table is stored in text form in csv (__row format__). The new table aggregates the timestamps per hour and count how often each language (__lang__ field) appears per hour.\n\nNote that we must decompose this query using nested select commands. The inner select creates table __q__ with a column lang (from __lang__ json field) and a column hour (computed from __timestamp_ms__ field) for each tweet. The outer select performs a __group by__ per language and per hour and counts the number of rows in each group.\n\n\n","user":"ebouille","dateUpdated":"2018-03-28T12:38:46+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"
Answers
\n

You just need to replace all CHANGEME with your gaspar name in order to run this command.

\n

This command creates a new table called twitter_lang_csv in your database. The table is stored as text in CSV form (the row format clause). The new table aggregates the timestamps per hour and counts how often each language (lang field) appears per hour.

\n

Note that we must decompose this query using nested select commands. The inner select builds the intermediate result q, with a column lang (from the lang json field) and a column hour (derived from the time column) for each tweet. The outer select performs a group by per language and per hour and counts the number of rows in each group.
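The completed command, with a drop table guard so it can be re-run (this matches the answer cell below):

```shell
%jdbc(hive)
drop table if exists {your_gaspar_name}.twitter_lang_csv;
create table {your_gaspar_name}.twitter_lang_csv
row format delimited
 fields terminated by ','
 escaped by '"'
 lines terminated by '\n'
stored as textfile
as
with q as (
 select
  lang,
  cast(date_format(time, 'HH') as int) as hour
 from {your_gaspar_name}.twitter_lang
)
select lang, hour, count(*) as count
from q
group by lang, hour;
```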

\n"}]},"apps":[],"jobName":"paragraph_1522232992771_-35093027","id":"20180328-122952_1728388484","dateCreated":"2018-03-28T12:29:52+0200","dateStarted":"2018-03-28T12:38:46+0200","dateFinished":"2018-03-28T12:38:46+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16683"},{"text":"%jdbc(hive)\ndrop table if exists {your_gaspar_name}.twitter_lang_csv;\ncreate table {your_gaspar_name}.twitter_lang_csv\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nas\nwith q as (\n select \n lang,\n cast(date_format(time, 'HH') as int) as hour\n from {your_gaspar_name}.twitter_lang\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;","user":"ebouille","dateUpdated":"2018-03-28T12:45:33+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":false,"language":"sql"},"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"},{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521621384868_2075729577","id":"20180321-093624_789131470","dateCreated":"2018-03-21T09:36:24+0100","dateStarted":"2018-03-28T12:44:52+0200","dateFinished":"2018-03-28T12:45:04+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16684"},{"text":"%md\n### Display language frequencies at different times of the day\n\nIn the next query you will display the frequecny tweets in Portuguese (pt) and Korean (ko) for each hour of the day.\n\nHints: (1) Only keep the tweets where the hour is not null and the lang is in ('pt', 'ko'), (2) use the SQL __where__ clause.\n\nSelect the bar chart view in Zeppelin graph toolbar that appears with the result. Then the open the settings and arrange the output fields (drag and drop) in the keys, groups and values properties until you get an insightful plot.\n\n\n","user":"ebouille","dateUpdated":"2018-03-28T10:35:15+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Display language frequencies at different times of the day

\n

In the next query you will display the frequency of tweets in Portuguese (pt) and Korean (ko) for each hour of the day.

\n

Hints: (1) Only keep the tweets where the hour is not null and the lang is in ('pt', 'ko'), (2) use the SQL where clause.

\n

Select the bar chart view in the Zeppelin graph toolbar that appears with the result. Then open the settings and arrange the output fields (drag and drop) into the keys, groups, and values properties until you get an insightful plot.

\n"}]},"apps":[],"jobName":"paragraph_1521621779930_-1897437686","id":"20180321-094259_347812219","dateCreated":"2018-03-21T09:42:59+0100","dateStarted":"2018-03-28T10:35:16+0200","dateFinished":"2018-03-28T10:35:16+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16685"},{"text":"%md\n#### Answer\n\nSelect rows from the twitter_lang_csv table with the appropriate filter as explained, and uses the bar-plot visualization in Zepplin. In the settings, use the hours for the keys, group by lang and use count SUM for the values.\n\nNotice the variations of the language during the day, whih correspond to the timezones in Brazil ('pt') and Korea ('ko').\n\n","user":"ebouille","dateUpdated":"2018-03-28T12:52:28+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answer

\n

Select rows from the twitter_lang_csv table with the appropriate filter as explained, and use the bar chart visualization in Zeppelin. In the settings, use hour for the keys, group by lang, and use the SUM of count for the values.
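The query itself is short (it matches the answer cell below):

```shell
%jdbc(hive)
select lang, hour, count from {your_gaspar_name}.twitter_lang_csv
where hour is not null and lang in ('pt','ko');
```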

\n

Notice the variations of each language during the day, which correspond to the time zones of Brazil ('pt') and Korea ('ko').

\n"}]},"apps":[],"jobName":"paragraph_1522234074039_290933787","id":"20180328-124754_434581777","dateCreated":"2018-03-28T12:47:54+0200","dateStarted":"2018-03-28T12:52:28+0200","dateFinished":"2018-03-28T12:52:28+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16686"},{"text":"%jdbc(hive)\nselect lang, hour, count from {your_gaspar_name}.twitter_lang_csv\nwhere hour is not null and lang in ('pt','ko');\n\n","user":"ebouille","dateUpdated":"2018-03-28T12:52:57+0200","config":{"tableHide":false,"editorSetting":{"editOnDblClick":false,"language":"sql"},"colWidth":12,"editorMode":"ace/mode/sql","editorHide":false,"results":{"0":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":true,"setting":{"multiBarChart":{"stacked":false},"pieChart":{},"stackedAreaChart":{"style":"stream"}},"commonSetting":{},"keys":[{"name":"hour","index":1,"aggr":"sum"}],"groups":[{"name":"lang","index":0,"aggr":"sum"}],"values":[{"name":"count","index":2,"aggr":"sum"}]},"helium":{}}},"enabled":true},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TABLE","data":"lang\thour\tcount\nko\t0\t2769\nko\t1\t4039\nko\t2\t6015\nko\t3\t7192\nko\t4\t9439\nko\t5\t9422\nko\t6\t8968\nko\t7\t8610\nko\t8\t4065\nko\t9\t9007\nko\t10\t8931\nko\t11\t9809\nko\t12\t10990\nko\t13\t11959\nko\t14\t12709\nko\t15\t13098\nko\t16\t12379\nko\t17\t9185\nko\t18\t6018\nko\t19\t3613\nko\t20\t2168\nko\t21\t1517\nko\t22\t1305\nko\t23\t1663\npt\t0\t11820\npt\t1\t11732\npt\t2\t12263\npt\t3\t11004\npt\t4\t8263\npt\t5\t5215\npt\t6\t3076\npt\t7\t1846\npt\t8\t689\npt\t9\t1616\npt\t10\t2592\npt\t11\t3884\npt\t12\t5616\npt\t13\t7552\npt\t14\t9285\npt\t15\t10257\npt\t16\t10652\npt\t17\t10515\npt\t18\t10023\npt\t19\t9816\npt\t20\t9845\npt\t21\t10381\npt\t22\t10728\npt\t23\t11368\n"}]},"apps":[],"jobName":"paragraph_1521555185544_1735310588","id":"20180320-135453_712665411","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T12:47:12+0200","dateFinished":"2018-03-28T12:47:12+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16687"},{"text":"%md\n\n### Find the most retweeted users\n\nIn the next queries you will compute the list of the 50 most retweeted users.\n\nYou will first create the table __twitter_users__ and store it in __parquet__ format.\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME\nstored as parquet\nas\nselect CHANGEME(json, '$.retweeted_status.user.screen_name') as name\nfrom CHANGEME\nwhere CHANGEME(json, '$.retweeted_status') is not null;\n```\n\nNext you will create the table of the 50 most popular users in decreasing order of retweets for the day.\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME.twitter_users_count\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nselect name, count(*) as count\nfrom CHANGEME.twitter_users\ngroup by name\norder by count desc\nlimit 50;\n```\n\nNote that we have created the table in textfile format, and we have specified a row format as a CSV file. To wrap this exercise up, try to find this file in HDFS and use the __hdfs dfs -cat__ command to visualize it. 
You can then use __hdfs dfs -get__ and __scp__ or rsync to copy it on your laptop.\n","user":"ebouille","dateUpdated":"2018-03-28T10:35:16+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Find the most retweeted users

\n

In the next queries you will compute the list of the 50 most retweeted users.

\n

You will first create the table twitter_users and store it in parquet format.

\n
%jdbc(hive)\ncreate table CHANGEME\nstored as parquet\nas\nselect CHANGEME(json, '$.retweeted_status.user.screen_name') as name\nfrom CHANGEME\nwhere CHANGEME(json, '$.retweeted_status') is not null;\n
\n

Next you will create the table of the 50 most popular users in decreasing order of retweets for the day.

\n
%jdbc(hive)\ncreate table CHANGEME.twitter_users_count\nrow format delimited\n    fields terminated by ','\n    escaped by '\"'\n    lines terminated by '\\n'\nstored as textfile\nas\nselect name, count(*) as count\nfrom CHANGEME.twitter_users\ngroup by name\norder by count desc\nlimit 50;\n
\n

Note that we have created the table in textfile format, and we have specified the row format as a CSV file. To wrap up this exercise, try to find this file in HDFS and use the hdfs dfs -cat command to view it, as sketched below. You can then use hdfs dfs -get and scp or rsync to copy it to your laptop.
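A sketch of the wrap-up; the exact path is an assumption, based on the location given when the database was created ('/user/your_gaspar_name/hive'), and the part file names depend on the run:

```shell
# locate the CSV file(s) backing the table
hdfs dfs -ls /user/your_gaspar_name/hive/twitter_users_count
# print the content
hdfs dfs -cat /user/your_gaspar_name/hive/twitter_users_count/*
# download to the cluster's local filesystem ...
hdfs dfs -get /user/your_gaspar_name/hive/twitter_users_count twitter_users_count
# ... then from your laptop, copy it over (hypothetical host name)
scp -r your_gaspar_name@cluster:twitter_users_count .
```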

\n"}]},"apps":[],"jobName":"paragraph_1521624060771_875679668","id":"20180321-102100_1524832575","dateCreated":"2018-03-21T10:21:00+0100","dateStarted":"2018-03-28T10:35:16+0200","dateFinished":"2018-03-28T10:35:16+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16688"},{"text":"%md\n#### Answers","user":"ebouille","dateUpdated":"2018-03-28T12:55:11+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Answers
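The two filled-in statements from the answer cells:

```shell
%jdbc(hive)
create table {your_gaspar_name}.twitter_users
stored as parquet
as
select get_json_object(json, '$.retweeted_status.user.screen_name') as name
from {your_gaspar_name}.twitter_one_day
where get_json_object(json, '$.retweeted_status') is not null;
```

```shell
%jdbc(hive)
create table {your_gaspar_name}.twitter_users_count
row format delimited
 fields terminated by ','
 escaped by '"'
 lines terminated by '\n'
stored as textfile
as
select name, count(*) as count
from {your_gaspar_name}.twitter_users
group by name
order by count desc
limit 50;
```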

\n"}]},"apps":[],"jobName":"paragraph_1522234502408_1114342406","id":"20180328-125502_1198545539","dateCreated":"2018-03-28T12:55:02+0200","dateStarted":"2018-03-28T12:55:11+0200","dateFinished":"2018-03-28T12:55:11+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16689"},{"text":"%jdbc(hive)\ncreate table {your_gaspar_name}.twitter_users\nstored as parquet\nas\nselect get_json_object(json, '$.retweeted_status.user.screen_name') as name\nfrom {your_gaspar_name}.twitter_one_day\nwhere get_json_object(json, '$.retweeted_status') is not null;","user":"ebouille","dateUpdated":"2018-03-28T12:56:12+0200","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql"}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521555185548_1733771593","id":"20180320-111048_1278675940","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T12:54:41+0200","dateFinished":"2018-03-28T12:55:17+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16690"},{"text":"%jdbc(hive)\ncreate table {your_gaspar_name}.twitter_users_count\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nas\nselect name, count(*) as count\nfrom {your_gaspar_name}.twitter_users\ngroup by name\norder by count desc\nlimit 50;","user":"ebouille","dateUpdated":"2018-03-28T12:56:34+0200","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql"}},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"TEXT","data":"Query executed successfully. Affected rows : -1"}]},"apps":[],"jobName":"paragraph_1521555185550_1734541090","id":"20180320-111112_142668305","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-28T12:55:54+0200","dateFinished":"2018-03-28T12:56:07+0200","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16691"},{"user":"ebouille","dateUpdated":"2018-03-28T10:35:17+0200","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1521649524358_1553553338","id":"20180321-172524_600795842","dateCreated":"2018-03-21T17:25:24+0100","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:16692"},{"text":"%md\n### That's all, folks","user":"ebouille","dateUpdated":"2018-03-21T11:47:58+0100","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

That's all, folks

\n"}]},"apps":[],"jobName":"paragraph_1521555185551_1734156341","id":"20180320-143329_5934928","dateCreated":"2018-03-20T15:13:05+0100","dateStarted":"2018-03-21T11:47:58+0100","dateFinished":"2018-03-21T11:47:58+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:16693"},{"text":"%md\n","user":"ebouille","dateUpdated":"2018-03-21T11:47:58+0100","config":{"colWidth":12,"enabled":true,"results":{},"editorSetting":{"editOnDblClick":true,"language":"markdown"},"editorMode":"ace/mode/markdown"},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1521629278211_241926698","id":"20180321-114758_916463376","dateCreated":"2018-03-21T11:47:58+0100","status":"READY","progressUpdateIntervalMs":500,"$$hashKey":"object:16694"}],"name":"DSLAB-Week5-1","id":"2DBVM6KTX","angularObjects":{"2CHS8UYQQ:shared_process":[],"2CK8A9MEG:shared_process":[],"2CKAY1A8Y:shared_process":[],"2C4U48MY3_spark2:shared_process":[],"2CKEKWY8Z:shared_process":[]},"config":{"looknfeel":"default","personalizedMode":"false"},"info":{}}