{"paragraphs":[{"text":"%md\n### First steps with HDFS\n\nLogin to the IC cluster with your __gaspar__ login name and password.\n\n```shell\nssh your-gaspar-login@iccluster042.iccluster.epfl.ch\n```\n\nOnce logged on the cluster, you can get started with the HDFS command line interface with\n\n```shell\nhdfs dfs\n```\n\nThe output is the list of HDFS file system actions available via the hdfs command line. Notice how most of the commands behave like the familiar Linux file system commands.\n\nAs a first exercise, you will explore the content of the cluster's HDFS file system using the __hdfs dfs__ command.\n\n1. We have created a directory on HDFS for each of you, can you find yours?\n2. Create a folder __work1__ in your HDFS directory and change the access rights of your directory so that only you and the hadoop group can read and write into it.\n3. Copy the 2017 _Traffic Count_ data published by the [Calderdale Metropolitan Borough Council (UK)](https://data.gov.uk/dataset/0c64970c-756a-46b2-9282-4a62016c7c64/traffic-count) to your __work1__ directory. A copy of the data is available from the [dslab 2019 github repository](https://github.com/dslab2019/dslab2019.github.io/blob/master/data/week5/01012017_to_31072017.csv.bz2?raw=true)\n\nHints: (1) use the __scp__ or the __wget__ commands to copy the data locally in your home directory, and one of the __hdfs dfs__ commands to copy the local file to your HDFS directory, (2) HDFS does not like spaces in filenames.\n","user":"ebouille","dateUpdated":"2019-03-20T04:06:46+0100","config":{"colWidth":12,"fontSize":9,"enabled":true,"results":{},"editorSetting":{"language":"markdown","editOnDblClick":true,"completionKey":"TAB","completionSupport":false},"editorMode":"ace/mode/markdown","editorHide":true,"tableHide":false},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

First steps with HDFS

\n

Log in to the IC cluster with your gaspar login name and password.

\n
ssh your-gaspar-login@iccluster042.iccluster.epfl.ch\n
\n

Once logged on the cluster, you can get started with the HDFS command line interface with

\n
hdfs dfs\n
\n

The output is the list of HDFS file system actions available via the hdfs command line. Notice how most of the commands behave like the familiar Linux file system commands.
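For instance, a few of the commands map directly onto their Linux namesakes (a quick sketch; the paths are only examples, explore the cluster to find real ones):

```shell
hdfs dfs -ls /               # like ls: list the HDFS root directory
hdfs dfs -ls -R /datasets    # recursive listing (this path is used later in the exercises)
hdfs dfs -du -h /datasets    # like du: space used, with human-readable sizes
hdfs dfs -cat /path/to/file  # like cat: print a file to stdout
```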

\n

As a first exercise, you will explore the content of the cluster's HDFS file system using the hdfs dfs command.

\n
    \n
1. We have created a directory on HDFS for each of you; can you find yours?
2. Create a folder work1 in your HDFS directory and change the access rights of your directory so that only you and the hadoop group can read and write into it.
3. Copy the 2017 Traffic Count data published by the Calderdale Metropolitan Borough Council (UK) to your work1 directory. A copy of the data is available from the dslab 2019 github repository.
\n

Hints: (1) use the scp or the wget command to copy the data to your local home directory on the cluster, and one of the hdfs dfs commands to copy the local file to your HDFS directory; (2) HDFS does not like spaces in filenames.
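Putting the hints together, the whole sequence could look roughly like this (an illustrative sketch: adjust the paths to your own directory, and note that the chmod assumes your HDFS home directory belongs to the hadoop group):

```shell
# on the cluster: download the file into your local (Linux) home directory
wget -O 01012017_to_31072017.csv.bz2 \
  "https://github.com/dslab2019/dslab2019.github.io/blob/master/data/week5/01012017_to_31072017.csv.bz2?raw=true"

# create the work1 folder in your HDFS directory
hdfs dfs -mkdir /homes/your_gaspar_name/work1

# read/write/execute for you and the group, nothing for others
hdfs dfs -chmod 770 /homes/your_gaspar_name

# copy the local file into HDFS
hdfs dfs -put 01012017_to_31072017.csv.bz2 /homes/your_gaspar_name/work1/
```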

\n"}]},"apps":[],"jobName":"paragraph_1553043527523_1416749460","id":"20190320-015847_1939308302","dateCreated":"2019-03-20T01:58:47+0100","dateStarted":"2019-03-20T04:06:46+0100","dateFinished":"2019-03-20T04:06:46+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:168"},{"text":"%md\n### First steps with Hive\n\nCreate your database on hive. We will name it with your EPFL gaspar name.\n\n```shell\n%jdbc(hive)\ncreate database if not exists your_gaspar_name\n location '/homes/your_gaspar_name/hive'; \n```\n\nCreate an external Hive table. Hive will create a reference to the files but it will not manage the files itself. If you drop the table, only the definition in Hive is deleted. This exercise will work only if you have completed the HDFS exercises.\n\n```shell\n%jdbc(hive)\ncreate external table if not exists your_gaspar_name.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)\n row format delimited fields terminated by ','\n stored as textfile\n location '/homes/your_gaspar_name/work1/';\n```\n\nNow verify that your table was properly created.\n```shell\n%jdbc(hive)\nselect * from your_gaspar_name.traffic_count limit 10;\n```\n\nReplace your_gaspar_name with your username and try the above commands in the next cells, create as many cells as needed.\n\nNote the first row. what did we do wrong?\n","user":"ebouille","dateUpdated":"2019-03-20T02:48:20+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown","completionSupport":false},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

First steps with Hive

\n

Create your database in Hive. We will name it after your EPFL gaspar name.

\n
%jdbc(hive)\ncreate database if not exists your_gaspar_name\n  location '/homes/your_gaspar_name/hive'; \n
\n

Create an external Hive table. Hive will create a reference to the files but it will not manage the files itself. If you drop the table, only the definition in Hive is deleted. This exercise will work only if you have completed the HDFS exercises.

\n
%jdbc(hive)\ncreate external table if not exists your_gaspar_name.traffic_count(Sdate string,Cosit int,Study int,Period int,LaneNumber int,LaneDescription string,LaneDirection int,DirectionDescription string,Volume int,Flags int,FlagText string,Setup int,NumBins int,Bins int)\n    row format delimited fields terminated by ','\n    stored as textfile\n    location '/homes/your_gaspar_name/work1/';\n
\n

Now verify that your table was properly created.

\n
%jdbc(hive)\nselect * from your_gaspar_name.traffic_count limit 10;\n
\n

Replace your_gaspar_name with your username and try the above commands in the next cells; create as many cells as needed.

\n

Note the first row. What did we do wrong?
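If what you see in the first row turns out to be the CSV header rather than data (one likely explanation, not the only one), one way to deal with it is to tell Hive to skip that line:

```shell
%jdbc(hive)
alter table your_gaspar_name.traffic_count
    set tblproperties ('skip.header.line.count'='1');
```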

\n"}]},"apps":[],"jobName":"paragraph_1552982661594_1435584615","id":"20180320-102737_103167059","dateCreated":"2019-03-19T09:04:21+0100","dateStarted":"2019-03-20T02:48:20+0100","dateFinished":"2019-03-20T02:48:20+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:169"},{"text":"%jdbc(hive)\n","user":"ebouille","dateUpdated":"2019-03-20T04:07:28+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{"2":{"graph":{"mode":"table","height":300,"optionOpen":false,"setting":{"table":{"tableGridState":{},"tableColumnTypeState":{"names":{"traffic_count.sdate":"string","traffic_count.cosit":"string","traffic_count.study":"string","traffic_count.period":"string","traffic_count.lanenumber":"string","traffic_count.lanedescription":"string","traffic_count.lanedirection":"string","traffic_count.directiondescription":"string","traffic_count.volume":"string","traffic_count.flags":"string","traffic_count.flagtext":"string","traffic_count.setup":"string","traffic_count.numbins":"string","traffic_count.bins":"string"},"updated":false},"tableOptionSpecHash":"[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]","tableOptionValue":{"useFilter":false,"showPagination":false,"showAggregationFooter":false},"updated":false,"initialized":false}},"commonSetting":{}}}},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[]},"apps":[],"jobName":"paragraph_1552982661595_-501542124","id":"20180321-025846_71567959","dateCreated":"2019-03-19T09:04:21+0100","dateStarted":"2019-03-20T04:07:28+0100","dateFinished":"2019-03-20T04:07:28+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:170"},{"text":"%md\n### Import one day of Twitter data into Hive\n\nCreate an external table from HDFS dir __/datasets/twitter_one_day__ and call it __your_gaspar_name.twitter_one_day__. The table should have a single column named __json__ of type __string__. Do not forget to use your own database (gaspar name)!\n\nA few hints:\n(1) You can explore the __/datasets/twitter_one_day__ directory from your terminal with the __hdfs dfs -ls__ command. You will notice that the files are in bzip2 format. Do not worry about that, Hive knows how to handle compressed text files automatically.\n(2) The files have only one field per line\n(3) If you do not specify the row format, the default format __fields terminated by '\\n'__ will be used.\n\nAfter the table __twitter_one_day__ is created, verify the content of its first row with a select command (limit 1). Use the output of the select query to identify the json fields where the the language and the timestamp information of the tweet are stored. 
You can use [__http://jsonprettyprint.com/__](http://jsonprettyprint.com/) to pretty print the json string.\n\n","user":"ebouille","dateUpdated":"2019-03-20T04:07:39+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown","completionSupport":false},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Import one day of Twitter data into Hive

\n

Create an external table from HDFS dir /datasets/twitter_one_day and call it your_gaspar_name.twitter_one_day. The table should have a single column named json of type string. Do not forget to use your own database (gaspar name)!

\n

A few hints:\n
(1) You can explore the /datasets/twitter_one_day directory from your terminal with the hdfs dfs -ls command. You will notice that the files are in bzip2 format. Do not worry about that, Hive knows how to handle compressed text files automatically.\n
(2) The files have only one field per line.\n
(3) If you do not specify the row format, the default row format (with lines terminated by '\n') will be used, which is what you want here.
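Putting the hints together, the statement could look roughly like this (an illustrative sketch, with your own database name in place of your_gaspar_name):

```shell
%jdbc(hive)
create external table if not exists your_gaspar_name.twitter_one_day(json string)
    stored as textfile
    location '/datasets/twitter_one_day';
```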

\n

After the table twitter_one_day is created, verify the content of its first row with a select command (limit 1). Use the output of the select query to identify the json fields where the language and the timestamp information of the tweet are stored. You can use http://jsonprettyprint.com/ to pretty print the json string.
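For example (the '$.lang' probe below is only an illustration of get_json_object; use the full row to find the fields yourself):

```shell
%jdbc(hive)
select * from your_gaspar_name.twitter_one_day limit 1;

-- once you have spotted a field in the json, you can extract it directly
select get_json_object(json, '$.lang') from your_gaspar_name.twitter_one_day limit 1;
```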

\n"}]},"apps":[],"jobName":"paragraph_1552982661595_-1044606358","id":"20180320-133148_586015451","dateCreated":"2019-03-19T09:04:21+0100","dateStarted":"2019-03-20T04:07:31+0100","dateFinished":"2019-03-20T04:07:31+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:171"},{"text":"%jdbc(hive)\n","user":"ebouille","dateUpdated":"2019-03-20T03:39:03+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{"1":{"graph":{"mode":"table","height":300,"optionOpen":false,"setting":{"table":{"tableGridState":{},"tableColumnTypeState":{"names":{"lang":"string","count":"string"},"updated":false},"tableOptionSpecHash":"[{\"name\":\"useFilter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable filter for columns\"},{\"name\":\"showPagination\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable pagination for better navigation\"},{\"name\":\"showAggregationFooter\",\"valueType\":\"boolean\",\"defaultValue\":false,\"widget\":\"checkbox\",\"description\":\"Enable a footer for displaying aggregated values\"}]","tableOptionValue":{"useFilter":false,"showPagination":false,"showAggregationFooter":false},"updated":false,"initialized":false}},"commonSetting":{}}}},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9,"tableHide":false},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661596_1860730122","id":"20180321-030249_154474664","dateCreated":"2019-03-19T09:04:21+0100","dateStarted":"2019-03-20T03:36:31+0100","dateFinished":"2019-03-20T03:37:18+0100","status":"FINISHED","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:172"},{"text":"%md\n### Twitter language frequency\n\nCompute the language frequencies, sorted in decreasing order of popularity. You should only use standard SQL group and count commands.\n\nTake advantage of the Zeppelin graph toolbar that appears with the results.\n\n```shell\n%jdbc(hive)\nwith q as (\n select\n get_json_object(json, '$.lang') as lang\n from CHANGEME.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;\n```\n\nYou can try different visualizations of the results with the embedded Zeppelin graph interface.\n\nFor further reading see the distinction between __group by__ and __sort by__ in __https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy__ . You can try this at home.\n","user":"ebouille","dateUpdated":"2019-03-20T01:06:15+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown","completionSupport":false},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{"0":{"graph":{"mode":"pieChart","height":300,"optionOpen":true,"setting":{"multiBarChart":{"stacked":false}},"commonSetting":{},"keys":[{"name":"lang","index":0,"aggr":"sum"}],"groups":[],"values":[{"name":"count","index":1,"aggr":"sum"}]},"helium":{}}},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Twitter language frequency

\n

Compute the language frequencies, sorted in decreasing order of popularity. You should only use standard SQL group and count commands. Take advantage of the embedded Zeppelin graph interface in order to plot the results.

\n
%jdbc(hive)\nwith q as (\n    select\n        get_json_object(json, '$.lang') as lang\n    from CHANGEME.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;\n
\n

You can try different visualizations of the results with the embedded Zeppelin graph interface.

\n

For further reading, see the distinction between order by and sort by in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy. You can try this at home.
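To see the difference in practice, you could rerun the aggregation with sort by instead of order by and compare the outputs (a sketch based on the query above; with several reducers, sort by only sorts within each reducer, so the result is not globally ordered):

```shell
%jdbc(hive)
with q as (
    select get_json_object(json, '$.lang') as lang
    from CHANGEME.twitter_one_day
)
select lang, count(*) as count
from q
group by lang
sort by count desc;
```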

\n"}]},"apps":[],"jobName":"paragraph_1552982661597_-1636106340","id":"20180320-135324_2117720219","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:173"},{"text":"%jdbc(hive)\nwith q as (\n select\n get_json_object(json, '$.lang') as lang\n from CHANGEME.twitter_one_day\n)\n\nselect lang, count(*) as count\nfrom q\ngroup by lang\norder by count desc;","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661597_-1103952253","id":"20180321-042338_1447379256","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:174"},{"text":"%md\n### Extract the timeseries of languages used in twitter.\n\nIn this exercise, you will only keep the language and timestamp information (in ms since 01.01.1971 00:00:00 +00:00) of each tweet from the __twitter_one_day__ table. The result will be a table with a column __lang__ of type string and a column __time__ of type timestamp. You will store this result into a new table called __twitter_lang__.\n\nWe provide part of the query. You must fix all CHANGEME as needed in order to perform the above operation. Use the drop table command if you do not get the table right the first time.\n\n```shell\n%jdbc(hive)\n\ncreate table CHANGEME\nstored as parquet\nas\nselect\n get_json_object(json, CHANGEME) as CHANGEME,\n from_utc_timestamp(cast(CHANGEME as bigint), 'UTC') as CHANGEME\nfrom CHANGEME;\n```\n\n","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Extract the timeseries of languages used in twitter.

\n

In this exercise, you will only keep the language and timestamp information (in milliseconds since 1970-01-01 00:00:00 UTC) of each tweet from the twitter_one_day table. The result will be a table with a column lang of type string and a column time of type timestamp. You will store this result in a new table called twitter_lang.

\n

We provide part of the query. You must replace every CHANGEME as needed to perform the operation described above. Use the drop table command if you do not get the table right the first time.

\n
%jdbc(hive)\n\ncreate table CHANGEME\nstored as parquet\nas\nselect\n    get_json_object(json, '$.CHANGEME') as CHANGEME,\n    from_utc_timestamp(cast(CHANGEME as bigint), 'UTC') as CHANGEME\nfrom CHANGEME;\n
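For orientation, a completed version might look like the sketch below. The json paths '$.lang' and '$.timestamp_ms' are assumptions you should verify against the row you inspected in the previous exercise; the table and column names simply follow the exercise statement.

```shell
%jdbc(hive)

-- illustrative sketch: verify the json field names before running it
create table your_gaspar_name.twitter_lang
stored as parquet
as
select
    get_json_object(json, '$.lang') as lang,
    from_utc_timestamp(cast(get_json_object(json, '$.timestamp_ms') as bigint), 'UTC') as time
from your_gaspar_name.twitter_one_day;
```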
\n"}]},"apps":[],"jobName":"paragraph_1552982661598_-43048750","id":"20180320-134103_2004356345","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:175"},{"text":"%jdbc(hive)\n\ncreate table CHANGEME\nstored as parquet\nas\nselect\n get_json_object(json, '$.CHANGEME') as CHANGEME,\n from_utc_timestamp(cast(CHANGEME as bigint), 'UTC') as CHANGEME\nfrom CHANGEME;","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661598_-2031734481","id":"20180321-042426_1858484630","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:176"},{"text":"%md\n### The mysterious Hive query\n\nCan you guess the meaning of this Hive command?\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME.twitter_lang_csv\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nas\nwith q as (\n select \n lang,\n cast(date_format(time, 'HH') as int) as hour\n from CHANGEME\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;\n```\n\nReplace all CHANGME occurences as appropriate and try it.","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

The mysterious Hive query

\n

Can you guess the meaning of this Hive command?

\n
%jdbc(hive)\ncreate table CHANGEME.twitter_lang_csv\nrow format delimited\n    fields terminated by ','\n    escaped by '\"'\n    lines terminated by '\\n'\nstored as textfile\nas\nwith q as (\n    select \n        lang,\n        cast(date_format(time, 'HH') as int) as hour\n    from CHANGEME\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;\n
\n

Replace all CHANGEME occurrences as appropriate and try it.
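If you want to check what the date_format call produces before guessing, a throwaway query on a literal timestamp can help (the literal is arbitrary):

```shell
%jdbc(hive)
-- 'HH' extracts the hour of the day; here the result is 2
select cast(date_format('2017-07-14 02:40:00', 'HH') as int);
```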

\n"}]},"apps":[],"jobName":"paragraph_1552982661599_-1854597728","id":"20180320-135452_2072033476","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:177"},{"text":"%jdbc(hive)\ncreate table CHANGEME.twitter_lang_time\nstored as parquet\nas\nwith q as (\n select \n lang,\n cast(date_format(time, 'HH') as int) as hour\n from CHANGEME\n)\nselect lang, hour, count(*) as count\nfrom q\ngroup by lang, hour;","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661599_860397655","id":"20180321-093624_789131470","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:178"},{"text":"%md\n### Display language frequencies at different times of the day\n\nIn the next query you will display the frequecny tweets in Portuguese (pt) and Korean (ko) for each hour of the day.\n\nHints: (1) Only keep the tweets where the hour is not null and the lang is in ('pt', 'ko'), (2) use the SQL __where__ clause.\n\nSelect the bar chart view in Zeppelin graph toolbar that appears with the result. Then the open the settings and arrange the output fields (drag and drop) in the keys, groups and values properties until you get an insightful plot.\n\n\n","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Display language frequencies at different times of the day

\n

In the next query you will display the frequency of tweets in Portuguese (pt) and Korean (ko) for each hour of the day.

\n

Hints: (1) Only keep the tweets where the hour is not null and the lang is in ('pt', 'ko'), (2) use the SQL where clause.

\n

Select the bar chart view in the Zeppelin graph toolbar that appears with the result. Then open the settings and arrange the output fields (drag and drop) in the keys, groups and values properties until you get an insightful plot.
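One possible shape of the query, assuming you materialised the per-hour counts in the twitter_lang_csv table from the previous step (adapt the database and table names to your own):

```shell
%jdbc(hive)
select *
from your_gaspar_name.twitter_lang_csv
where hour is not null and lang in ('pt', 'ko')
order by lang, hour;
```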

\n"}]},"apps":[],"jobName":"paragraph_1552982661600_-440340880","id":"20180321-094259_347812219","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:179"},{"text":"%jdbc(hive)\n","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"colWidth":12,"editorMode":"ace/mode/sql","editorHide":false,"results":{"0":{"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"setting":{"multiBarChart":{"stacked":false},"pieChart":{},"stackedAreaChart":{"style":"stream"}},"commonSetting":{},"keys":[{"name":"twitter_lang_csv.hour","index":1,"aggr":"sum"}],"groups":[{"name":"twitter_lang_csv.lang","index":0,"aggr":"sum"}],"values":[{"name":"twitter_lang_csv.count","index":2,"aggr":"sum"}]},"helium":{}}},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661600_-1212752140","id":"20180320-135453_712665411","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:180"},{"text":"%md\n\n### Find the most retweeted users\n\nIn the next queries you will compute the list of the 50 most retweeted users.\n\nYou will first create the table __twitter_users__ and store it in __parquet__ format.\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME\nstored as parquet\nas\nselect CHANGEME(json, '$.retweeted_status.user.screen_name') as name\nfrom CHANGEME\nwhere CHANGEME(json, '$.retweeted_status') is not null;\n```\n\nNext you will create the table of the 50 most popular users in decreasing order of retweets for the day.\n\n```shell\n%jdbc(hive)\ncreate table CHANGEME.twitter_users_count\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nselect name, count(*) as count\nfrom CHANGEME.twitter_users\ngroup by name\norder by count desc\nlimit 50;\n```\n\nNote that we have created the table in textfile format, and we have specified a row format as a CSV file. To wrap this exercise up, try to find this file in HDFS and use the __hdfs dfs -cat__ command to visualize it. You can then use __hdfs dfs -get__ and __scp__ or rsync to copy it on your laptop.\n","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

Find the most retweeted users

\n

In the next queries you will compute the list of the 50 most retweeted users.

\n

You will first create the table twitter_users and store it in parquet format.

\n
%jdbc(hive)\ncreate table CHANGEME\nstored as parquet\nas\nselect CHANGEME(json, '$.retweeted_status.user.screen_name') as name\nfrom CHANGEME\nwhere CHANGEME(json, '$.retweeted_status') is not null;\n
\n

Next you will create the table of the 50 most popular users in decreasing order of retweets for the day.

\n
%jdbc(hive)\ncreate table CHANGEME.twitter_users_count\nrow format delimited\n    fields terminated by ','\n    escaped by '\"'\n    lines terminated by '\\n'\nstored as textfile\nas\nselect name, count(*) as count\nfrom CHANGEME.twitter_users\ngroup by name\norder by count desc\nlimit 50;\n
\n

Note that we have created the table in textfile format, and we have specified a row format that writes it out as a CSV file. To wrap this exercise up, try to find this file in HDFS and use the hdfs dfs -cat command to visualize it. You can then use hdfs dfs -get and scp or rsync to copy it to your laptop.
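A possible way to wrap up, assuming the database location used in the first Hive exercise, so that the table directory sits under /homes/your_gaspar_name/hive (verify the actual path with -ls):

```shell
# on the cluster: locate and inspect the CSV produced by Hive
hdfs dfs -ls /homes/your_gaspar_name/hive/twitter_users_count
hdfs dfs -cat '/homes/your_gaspar_name/hive/twitter_users_count/*' | head

# merge the table's files into one local CSV on the cluster ...
hdfs dfs -getmerge /homes/your_gaspar_name/hive/twitter_users_count twitter_users_count.csv

# ... then, from your laptop, pull it over with scp (or rsync)
scp your-gaspar-login@iccluster042.iccluster.epfl.ch:twitter_users_count.csv .
```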

\n"}]},"apps":[],"jobName":"paragraph_1552982661601_936438812","id":"20180321-102100_1524832575","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:181"},{"text":"%jdbc(hive)\ncreate table CHANGEME\nstored as parquet\nas\nselect CHANGEME(json, '$.retweeted_status.user.screen_name') as name\nfrom CHANGEME\nwhere CHANGEME(json, '$.retweeted_status') is not null;","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661601_1404385277","id":"20180320-111048_1278675940","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:182"},{"text":"%jdbc(hive)\ncreate table CHANGEME.twitter_users_count\nrow format delimited\n fields terminated by ','\n escaped by '\"'\n lines terminated by '\\n'\nstored as textfile\nselect name, count(*) as count\nfrom CHANGEME.twitter_users\ngroup by name\norder by count desc\nlimit 50;","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/sql","results":{},"enabled":true,"editorSetting":{"editOnDblClick":false,"language":"sql","completionSupport":true},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661602_-410788069","id":"20180320-111112_142668305","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:183"},{"text":"%md\n### That's all, folks","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"tableHide":false,"editorSetting":{"editOnDblClick":true,"language":"markdown"},"colWidth":12,"editorMode":"ace/mode/markdown","editorHide":true,"results":{},"enabled":true,"fontSize":9},"settings":{"params":{},"forms":{}},"results":{"code":"SUCCESS","msg":[{"type":"HTML","data":"

That's all, folks

\n"}]},"apps":[],"jobName":"paragraph_1552982661602_-1209851467","id":"20180320-143329_5934928","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:184"},{"text":"%md\n","user":"ebouille","dateUpdated":"2019-03-19T09:04:21+0100","config":{"colWidth":12,"editorMode":"ace/mode/markdown","results":{},"enabled":true,"editorSetting":{"editOnDblClick":true,"language":"markdown","completionSupport":false},"fontSize":9},"settings":{"params":{},"forms":{}},"apps":[],"jobName":"paragraph_1552982661603_1884847888","id":"20180321-114758_916463376","dateCreated":"2019-03-19T09:04:21+0100","status":"READY","errorMessage":"","progressUpdateIntervalMs":500,"$$hashKey":"object:185"}],"name":"/ebouille/week5","id":"2E6GW5MNZ","noteParams":{},"noteForms":{},"angularObjects":{"md:shared_process":[],"jdbc:shared_process":[],"spark2:shared_process":[]},"config":{"isZeppelinNotebookCronEnable":false,"looknfeel":"default","personalizedMode":"false"},"info":{}}