{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Exploration and Reporting with PyHive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we show how to explore data in Hive and build reports.\n", "\n", "The example uses a dataset of real estate sales in the Sacramento area, which you can download from [here](http://snurran.sics.se/hops/hive/Sacramentorealestatetransactions.csv).\n", "Create a dataset with a name of your choosing, for example `RawData` and upload the CSV file. Make sure the dataset is empty, you will need to delete the auto-generated README.md\n", "\n", "First we need to setup the IPython magic and the connection information by running the following cell:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext sql\n", "from hops import hive\n", "hive.setup_hive_connection()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data into an External table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first create a Hive external and let's point the table to the just created dataset. In the query below you should replace `[Projectname]` with your project name.\n", "\n", "The `%%sql` magic allows you to write HiveQL and execute the query against the HiveServer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "CREATE EXTERNAL TABLE sacramento_properties_ext(\n", "street string,\n", "city string,\n", "zip int,\n", "state string,\n", "beds int,\n", "baths int,\n", "sq__ft float,\n", "sales_type string,\n", "sale_date string,\n", "price float,\n", "latitude float,\n", "longitude float)\n", "ROW FORMAT DELIMITED\n", "FIELDS TERMINATED BY ','\n", "LOCATION '/Projects/[Projectname]/RawData'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make sure the table was created successfully, we can list all the tables in the project's database and run a simple query against the external table." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
tab_name
sacramento_properties_ext
" ], "text/plain": [ "[('sacramento_properties_ext',)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "show tables" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
streetcityzipstatebedsbathssq__ftsales_typesale_datepricelatitudelongitude
3526 HIGH STSACRAMENTO95838CA21836.0ResidentialWed May 21 00:00:00 EDT 200859222.038.631912-121.434875
51 OMAHA CTSACRAMENTO95823CA311167.0ResidentialWed May 21 00:00:00 EDT 200868212.038.4789-121.43103
2796 BRANCH STSACRAMENTO95815CA21796.0ResidentialWed May 21 00:00:00 EDT 200868880.038.618305-121.44384
2805 JANETTE WAYSACRAMENTO95815CA21852.0ResidentialWed May 21 00:00:00 EDT 200869307.038.616837-121.43915
6001 MCMAHON DRSACRAMENTO95824CA21797.0ResidentialWed May 21 00:00:00 EDT 200881900.038.51947-121.43577
5828 PEPPERMILL CTSACRAMENTO95841CA311122.0CondoWed May 21 00:00:00 EDT 200889921.038.662594-121.32781
6048 OGDEN NASH WAYSACRAMENTO95842CA321104.0ResidentialWed May 21 00:00:00 EDT 200890895.038.68166-121.35171
2561 19TH AVESACRAMENTO95820CA311177.0ResidentialWed May 21 00:00:00 EDT 200891002.038.53509-121.48137
11150 TRINITY RIVER DR Unit 114RANCHO CORDOVA95670CA22941.0CondoWed May 21 00:00:00 EDT 200894905.038.62119-121.27055
7325 10TH STRIO LINDA95673CA321146.0ResidentialWed May 21 00:00:00 EDT 200898937.038.70091-121.44298
" ], "text/plain": [ "[('3526 HIGH ST', 'SACRAMENTO', 95838, 'CA', 2, 1, 836.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 59222.0, 38.631912, -121.434875),\n", " ('51 OMAHA CT', 'SACRAMENTO', 95823, 'CA', 3, 1, 1167.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 68212.0, 38.4789, -121.43103),\n", " ('2796 BRANCH ST', 'SACRAMENTO', 95815, 'CA', 2, 1, 796.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 68880.0, 38.618305, -121.44384),\n", " ('2805 JANETTE WAY', 'SACRAMENTO', 95815, 'CA', 2, 1, 852.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 69307.0, 38.616837, -121.43915),\n", " ('6001 MCMAHON DR', 'SACRAMENTO', 95824, 'CA', 2, 1, 797.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 81900.0, 38.51947, -121.43577),\n", " ('5828 PEPPERMILL CT', 'SACRAMENTO', 95841, 'CA', 3, 1, 1122.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 89921.0, 38.662594, -121.32781),\n", " ('6048 OGDEN NASH WAY', 'SACRAMENTO', 95842, 'CA', 3, 2, 1104.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 90895.0, 38.68166, -121.35171),\n", " ('2561 19TH AVE', 'SACRAMENTO', 95820, 'CA', 3, 1, 1177.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 91002.0, 38.53509, -121.48137),\n", " ('11150 TRINITY RIVER DR Unit 114', 'RANCHO CORDOVA', 95670, 'CA', 2, 2, 941.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 94905.0, 38.62119, -121.27055),\n", " ('7325 10TH ST', 'RIO LINDA', 95673, 'CA', 3, 2, 1146.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 98937.0, 38.70091, -121.44298)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * from sacramento_properties_ext limit 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data into a Managed Table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we are going to load the data in a managed table. The table is going to be partitioned by ZIP and stored in ORC (a columnar storage format)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "CREATE TABLE sacramento_properties(\n", "street string,\n", "city string,\n", "state string,\n", "beds int,\n", "baths int,\n", "sq__ft float,\n", "sales_type string,\n", "sale_date string,\n", "price float,\n", "latitude float,\n", "longitude float)\n", "PARTITIONED by (zip int)\n", "STORED AS ORC" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tab_name
sacramento_properties
sacramento_properties_ext
" ], "text/plain": [ "[('sacramento_properties',), ('sacramento_properties_ext',)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "show tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default we don't allow dynamic inserts. This means that we should specify, for each row, which partition it belongs to. This is meant to avoid having a wrong insert query create a huge number of directories. \n", "\n", "In this case, however, the dataset is small and we know what we are doing. In the next cell we configure the hive session to allow nonstrict dymanic insert." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "set hive.exec.dynamic.partition=true;\n", "set hive.exec.dynamic.partition.mode=nonstrict;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we load the data into the managed table. Partition column(s) should be listed last in the `SELECT` statement." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sql\n", "INSERT OVERWRITE TABLE sacramento_properties PARTITION (zip)\n", "SELECT street, city, state, beds, baths, sq__ft, sales_type, sale_date, price, latitude, longitude, zip \n", "FROM sacramento_properties_ext" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
streetcitystatebedsbathssq__ftsales_typesale_datepricelatitudelongitudezip
5332 SANDSTONE STCARMICHAELCA311152.0ResidentialWed May 21 00:00:00 EDT 2008181872.038.662106-121.3139495608
5907 ELLERSLEE DRCARMICHAELCA31936.0ResidentialWed May 21 00:00:00 EDT 2008200000.038.664467-121.3268395608
4010 ALEX LNCARMICHAELCA221326.0CondoWed May 21 00:00:00 EDT 2008250134.038.637028-121.31296595608
5925 MALEVILLE AVECARMICHAELCA421120.0ResidentialTue May 20 00:00:00 EDT 2008189000.038.666565-121.32571495608
2109 HAMLET PLCARMICHAELCA221598.0ResidentialTue May 20 00:00:00 EDT 2008484000.038.602753-121.3293295608
5709 RIVER OAK WAYCARMICHAELCA422222.0ResidentialTue May 20 00:00:00 EDT 2008582000.038.602463-121.3309895608
7032 FAIR OAKS BLVDCARMICHAELCA321245.0CondoMon May 19 00:00:00 EDT 2008139500.038.628563-121.328395608
7110 STELLA LN Unit 15CARMICHAELCA221000.0CondoMon May 19 00:00:00 EDT 2008182000.038.637398-121.3000695608
5847 DEL CAMPO LNCARMICHAELCA311713.0ResidentialMon May 19 00:00:00 EDT 2008266000.038.671993-121.3243495608
4622 MEYER WAYCARMICHAELCA421559.0ResidentialMon May 19 00:00:00 EDT 2008285000.038.64913-121.3106795608
" ], "text/plain": [ "[('5332 SANDSTONE ST', 'CARMICHAEL', 'CA', 3, 1, 1152.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 181872.0, 38.662106, -121.31394, 95608),\n", " ('5907 ELLERSLEE DR', 'CARMICHAEL', 'CA', 3, 1, 936.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 200000.0, 38.664467, -121.32683, 95608),\n", " ('4010 ALEX LN', 'CARMICHAEL', 'CA', 2, 2, 1326.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 250134.0, 38.637028, -121.312965, 95608),\n", " ('5925 MALEVILLE AVE', 'CARMICHAEL', 'CA', 4, 2, 1120.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 189000.0, 38.666565, -121.325714, 95608),\n", " ('2109 HAMLET PL', 'CARMICHAEL', 'CA', 2, 2, 1598.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 484000.0, 38.602753, -121.32932, 95608),\n", " ('5709 RIVER OAK WAY', 'CARMICHAEL', 'CA', 4, 2, 2222.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 582000.0, 38.602463, -121.33098, 95608),\n", " ('7032 FAIR OAKS BLVD', 'CARMICHAEL', 'CA', 3, 2, 1245.0, 'Condo', 'Mon May 19 00:00:00 EDT 2008', 139500.0, 38.628563, -121.3283, 95608),\n", " ('7110 STELLA LN Unit 15', 'CARMICHAEL', 'CA', 2, 2, 1000.0, 'Condo', 'Mon May 19 00:00:00 EDT 2008', 182000.0, 38.637398, -121.30006, 95608),\n", " ('5847 DEL CAMPO LN', 'CARMICHAEL', 'CA', 3, 1, 1713.0, 'Residential', 'Mon May 19 00:00:00 EDT 2008', 266000.0, 38.671993, -121.32434, 95608),\n", " ('4622 MEYER WAY', 'CARMICHAEL', 'CA', 4, 2, 1559.0, 'Residential', 'Mon May 19 00:00:00 EDT 2008', 285000.0, 38.64913, -121.31067, 95608)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select * \n", "FROM sacramento_properties \n", "WHERE zip=95608\n", "LIMIT 10" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sales_typeavg_price
Condo190544.66666666666
Residential314238.8823529412
" ], "text/plain": [ "[('Condo', 190544.66666666666), ('Residential', 314238.8823529412)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "select sales_type, avg(price) as avg_price\n", "FROM sacramento_properties \n", "WHERE zip=95608\n", "GROUP BY sales_type\n", "LIMIT 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query result can also be stored in a Python dictionary. This enables us to choose from a plethora of Python libraries available online, to visualize data and build BI reports\n", "\n", "In the example below we use the folium library to mark where the Condos in our dataset are located in Sacramento. For the next section to work, if you haven't already, please install the folium package in your project environment and restart the kernel. (Also, please re-run the first cell)\n", "\n", "(Internet access is required to visualize the map)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] } ], "source": [ "condos = %sql select * from sacramento_properties where `sales_type` = 'Condo'" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import folium\n", "m = folium.Map(\n", " location=[38.5815700, -121.4944000],\n", " zoom_start=10\n", ")\n", "\n", "for condo in condos:\n", " folium.Marker([float(condo[9]), float(condo[10])], popup=str(condo[8])).add_to(m)\n", "\n", "m" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "python-pyhive_notebook1__meb10000", "language": "python", "name": "python-pyhive_notebook1__meb10000" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }