{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Exploration and Reporting with PyHive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example we show how to explore data in Hive and build reports.\n",
"\n",
"The example uses a dataset of real estate sales in the Sacramento area, which you can download from [here](http://snurran.sics.se/hops/hive/Sacramentorealestatetransactions.csv).\n",
"Create a dataset with a name of your choosing, for example `RawData` and upload the CSV file. Make sure the dataset is empty, you will need to delete the auto-generated README.md\n",
"\n",
"First we need to setup the IPython magic and the connection information by running the following cell:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext sql\n",
"from hops import hive\n",
"hive.setup_hive_connection()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data into an External table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first create a Hive external and let's point the table to the just created dataset. In the query below you should replace `[Projectname]` with your project name.\n",
"\n",
"The `%%sql` magic allows you to write HiveQL and execute the query against the HiveServer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"CREATE EXTERNAL TABLE sacramento_properties_ext(\n",
"street string,\n",
"city string,\n",
"zip int,\n",
"state string,\n",
"beds int,\n",
"baths int,\n",
"sq__ft float,\n",
"sales_type string,\n",
"sale_date string,\n",
"price float,\n",
"latitude float,\n",
"longitude float)\n",
"ROW FORMAT DELIMITED\n",
"FIELDS TERMINATED BY ','\n",
"LOCATION '/Projects/[Projectname]/RawData'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make sure the table was created successfully, we can list all the tables in the project's database and run a simple query against the external table."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" tab_name | \n",
"
\n",
" \n",
" sacramento_properties_ext | \n",
"
\n",
"
"
],
"text/plain": [
"[('sacramento_properties_ext',)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"show tables"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" street | \n",
" city | \n",
" zip | \n",
" state | \n",
" beds | \n",
" baths | \n",
" sq__ft | \n",
" sales_type | \n",
" sale_date | \n",
" price | \n",
" latitude | \n",
" longitude | \n",
"
\n",
" \n",
" 3526 HIGH ST | \n",
" SACRAMENTO | \n",
" 95838 | \n",
" CA | \n",
" 2 | \n",
" 1 | \n",
" 836.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 59222.0 | \n",
" 38.631912 | \n",
" -121.434875 | \n",
"
\n",
" \n",
" 51 OMAHA CT | \n",
" SACRAMENTO | \n",
" 95823 | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 1167.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 68212.0 | \n",
" 38.4789 | \n",
" -121.43103 | \n",
"
\n",
" \n",
" 2796 BRANCH ST | \n",
" SACRAMENTO | \n",
" 95815 | \n",
" CA | \n",
" 2 | \n",
" 1 | \n",
" 796.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 68880.0 | \n",
" 38.618305 | \n",
" -121.44384 | \n",
"
\n",
" \n",
" 2805 JANETTE WAY | \n",
" SACRAMENTO | \n",
" 95815 | \n",
" CA | \n",
" 2 | \n",
" 1 | \n",
" 852.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 69307.0 | \n",
" 38.616837 | \n",
" -121.43915 | \n",
"
\n",
" \n",
" 6001 MCMAHON DR | \n",
" SACRAMENTO | \n",
" 95824 | \n",
" CA | \n",
" 2 | \n",
" 1 | \n",
" 797.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 81900.0 | \n",
" 38.51947 | \n",
" -121.43577 | \n",
"
\n",
" \n",
" 5828 PEPPERMILL CT | \n",
" SACRAMENTO | \n",
" 95841 | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 1122.0 | \n",
" Condo | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 89921.0 | \n",
" 38.662594 | \n",
" -121.32781 | \n",
"
\n",
" \n",
" 6048 OGDEN NASH WAY | \n",
" SACRAMENTO | \n",
" 95842 | \n",
" CA | \n",
" 3 | \n",
" 2 | \n",
" 1104.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 90895.0 | \n",
" 38.68166 | \n",
" -121.35171 | \n",
"
\n",
" \n",
" 2561 19TH AVE | \n",
" SACRAMENTO | \n",
" 95820 | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 1177.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 91002.0 | \n",
" 38.53509 | \n",
" -121.48137 | \n",
"
\n",
" \n",
" 11150 TRINITY RIVER DR Unit 114 | \n",
" RANCHO CORDOVA | \n",
" 95670 | \n",
" CA | \n",
" 2 | \n",
" 2 | \n",
" 941.0 | \n",
" Condo | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 94905.0 | \n",
" 38.62119 | \n",
" -121.27055 | \n",
"
\n",
" \n",
" 7325 10TH ST | \n",
" RIO LINDA | \n",
" 95673 | \n",
" CA | \n",
" 3 | \n",
" 2 | \n",
" 1146.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 98937.0 | \n",
" 38.70091 | \n",
" -121.44298 | \n",
"
\n",
"
"
],
"text/plain": [
"[('3526 HIGH ST', 'SACRAMENTO', 95838, 'CA', 2, 1, 836.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 59222.0, 38.631912, -121.434875),\n",
" ('51 OMAHA CT', 'SACRAMENTO', 95823, 'CA', 3, 1, 1167.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 68212.0, 38.4789, -121.43103),\n",
" ('2796 BRANCH ST', 'SACRAMENTO', 95815, 'CA', 2, 1, 796.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 68880.0, 38.618305, -121.44384),\n",
" ('2805 JANETTE WAY', 'SACRAMENTO', 95815, 'CA', 2, 1, 852.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 69307.0, 38.616837, -121.43915),\n",
" ('6001 MCMAHON DR', 'SACRAMENTO', 95824, 'CA', 2, 1, 797.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 81900.0, 38.51947, -121.43577),\n",
" ('5828 PEPPERMILL CT', 'SACRAMENTO', 95841, 'CA', 3, 1, 1122.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 89921.0, 38.662594, -121.32781),\n",
" ('6048 OGDEN NASH WAY', 'SACRAMENTO', 95842, 'CA', 3, 2, 1104.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 90895.0, 38.68166, -121.35171),\n",
" ('2561 19TH AVE', 'SACRAMENTO', 95820, 'CA', 3, 1, 1177.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 91002.0, 38.53509, -121.48137),\n",
" ('11150 TRINITY RIVER DR Unit 114', 'RANCHO CORDOVA', 95670, 'CA', 2, 2, 941.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 94905.0, 38.62119, -121.27055),\n",
" ('7325 10TH ST', 'RIO LINDA', 95673, 'CA', 3, 2, 1146.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 98937.0, 38.70091, -121.44298)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"select * from sacramento_properties_ext limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data into a Managed Table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we are going to load the data in a managed table. The table is going to be partitioned by ZIP and stored in ORC (a columnar storage format)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"CREATE TABLE sacramento_properties(\n",
"street string,\n",
"city string,\n",
"state string,\n",
"beds int,\n",
"baths int,\n",
"sq__ft float,\n",
"sales_type string,\n",
"sale_date string,\n",
"price float,\n",
"latitude float,\n",
"longitude float)\n",
"PARTITIONED by (zip int)\n",
"STORED AS ORC"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" tab_name | \n",
"
\n",
" \n",
" sacramento_properties | \n",
"
\n",
" \n",
" sacramento_properties_ext | \n",
"
\n",
"
"
],
"text/plain": [
"[('sacramento_properties',), ('sacramento_properties_ext',)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"show tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default we don't allow dynamic inserts. This means that we should specify, for each row, which partition it belongs to. This is meant to avoid having a wrong insert query create a huge number of directories. \n",
"\n",
"In this case, however, the dataset is small and we know what we are doing. In the next cell we configure the hive session to allow nonstrict dymanic insert."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"set hive.exec.dynamic.partition=true;\n",
"set hive.exec.dynamic.partition.mode=nonstrict;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we load the data into the managed table. Partition column(s) should be listed last in the `SELECT` statement."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sql\n",
"INSERT OVERWRITE TABLE sacramento_properties PARTITION (zip)\n",
"SELECT street, city, state, beds, baths, sq__ft, sales_type, sale_date, price, latitude, longitude, zip \n",
"FROM sacramento_properties_ext"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" street | \n",
" city | \n",
" state | \n",
" beds | \n",
" baths | \n",
" sq__ft | \n",
" sales_type | \n",
" sale_date | \n",
" price | \n",
" latitude | \n",
" longitude | \n",
" zip | \n",
"
\n",
" \n",
" 5332 SANDSTONE ST | \n",
" CARMICHAEL | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 1152.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 181872.0 | \n",
" 38.662106 | \n",
" -121.31394 | \n",
" 95608 | \n",
"
\n",
" \n",
" 5907 ELLERSLEE DR | \n",
" CARMICHAEL | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 936.0 | \n",
" Residential | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 200000.0 | \n",
" 38.664467 | \n",
" -121.32683 | \n",
" 95608 | \n",
"
\n",
" \n",
" 4010 ALEX LN | \n",
" CARMICHAEL | \n",
" CA | \n",
" 2 | \n",
" 2 | \n",
" 1326.0 | \n",
" Condo | \n",
" Wed May 21 00:00:00 EDT 2008 | \n",
" 250134.0 | \n",
" 38.637028 | \n",
" -121.312965 | \n",
" 95608 | \n",
"
\n",
" \n",
" 5925 MALEVILLE AVE | \n",
" CARMICHAEL | \n",
" CA | \n",
" 4 | \n",
" 2 | \n",
" 1120.0 | \n",
" Residential | \n",
" Tue May 20 00:00:00 EDT 2008 | \n",
" 189000.0 | \n",
" 38.666565 | \n",
" -121.325714 | \n",
" 95608 | \n",
"
\n",
" \n",
" 2109 HAMLET PL | \n",
" CARMICHAEL | \n",
" CA | \n",
" 2 | \n",
" 2 | \n",
" 1598.0 | \n",
" Residential | \n",
" Tue May 20 00:00:00 EDT 2008 | \n",
" 484000.0 | \n",
" 38.602753 | \n",
" -121.32932 | \n",
" 95608 | \n",
"
\n",
" \n",
" 5709 RIVER OAK WAY | \n",
" CARMICHAEL | \n",
" CA | \n",
" 4 | \n",
" 2 | \n",
" 2222.0 | \n",
" Residential | \n",
" Tue May 20 00:00:00 EDT 2008 | \n",
" 582000.0 | \n",
" 38.602463 | \n",
" -121.33098 | \n",
" 95608 | \n",
"
\n",
" \n",
" 7032 FAIR OAKS BLVD | \n",
" CARMICHAEL | \n",
" CA | \n",
" 3 | \n",
" 2 | \n",
" 1245.0 | \n",
" Condo | \n",
" Mon May 19 00:00:00 EDT 2008 | \n",
" 139500.0 | \n",
" 38.628563 | \n",
" -121.3283 | \n",
" 95608 | \n",
"
\n",
" \n",
" 7110 STELLA LN Unit 15 | \n",
" CARMICHAEL | \n",
" CA | \n",
" 2 | \n",
" 2 | \n",
" 1000.0 | \n",
" Condo | \n",
" Mon May 19 00:00:00 EDT 2008 | \n",
" 182000.0 | \n",
" 38.637398 | \n",
" -121.30006 | \n",
" 95608 | \n",
"
\n",
" \n",
" 5847 DEL CAMPO LN | \n",
" CARMICHAEL | \n",
" CA | \n",
" 3 | \n",
" 1 | \n",
" 1713.0 | \n",
" Residential | \n",
" Mon May 19 00:00:00 EDT 2008 | \n",
" 266000.0 | \n",
" 38.671993 | \n",
" -121.32434 | \n",
" 95608 | \n",
"
\n",
" \n",
" 4622 MEYER WAY | \n",
" CARMICHAEL | \n",
" CA | \n",
" 4 | \n",
" 2 | \n",
" 1559.0 | \n",
" Residential | \n",
" Mon May 19 00:00:00 EDT 2008 | \n",
" 285000.0 | \n",
" 38.64913 | \n",
" -121.31067 | \n",
" 95608 | \n",
"
\n",
"
"
],
"text/plain": [
"[('5332 SANDSTONE ST', 'CARMICHAEL', 'CA', 3, 1, 1152.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 181872.0, 38.662106, -121.31394, 95608),\n",
" ('5907 ELLERSLEE DR', 'CARMICHAEL', 'CA', 3, 1, 936.0, 'Residential', 'Wed May 21 00:00:00 EDT 2008', 200000.0, 38.664467, -121.32683, 95608),\n",
" ('4010 ALEX LN', 'CARMICHAEL', 'CA', 2, 2, 1326.0, 'Condo', 'Wed May 21 00:00:00 EDT 2008', 250134.0, 38.637028, -121.312965, 95608),\n",
" ('5925 MALEVILLE AVE', 'CARMICHAEL', 'CA', 4, 2, 1120.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 189000.0, 38.666565, -121.325714, 95608),\n",
" ('2109 HAMLET PL', 'CARMICHAEL', 'CA', 2, 2, 1598.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 484000.0, 38.602753, -121.32932, 95608),\n",
" ('5709 RIVER OAK WAY', 'CARMICHAEL', 'CA', 4, 2, 2222.0, 'Residential', 'Tue May 20 00:00:00 EDT 2008', 582000.0, 38.602463, -121.33098, 95608),\n",
" ('7032 FAIR OAKS BLVD', 'CARMICHAEL', 'CA', 3, 2, 1245.0, 'Condo', 'Mon May 19 00:00:00 EDT 2008', 139500.0, 38.628563, -121.3283, 95608),\n",
" ('7110 STELLA LN Unit 15', 'CARMICHAEL', 'CA', 2, 2, 1000.0, 'Condo', 'Mon May 19 00:00:00 EDT 2008', 182000.0, 38.637398, -121.30006, 95608),\n",
" ('5847 DEL CAMPO LN', 'CARMICHAEL', 'CA', 3, 1, 1713.0, 'Residential', 'Mon May 19 00:00:00 EDT 2008', 266000.0, 38.671993, -121.32434, 95608),\n",
" ('4622 MEYER WAY', 'CARMICHAEL', 'CA', 4, 2, 1559.0, 'Residential', 'Mon May 19 00:00:00 EDT 2008', 285000.0, 38.64913, -121.31067, 95608)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"select * \n",
"FROM sacramento_properties \n",
"WHERE zip=95608\n",
"LIMIT 10"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" sales_type | \n",
" avg_price | \n",
"
\n",
" \n",
" Condo | \n",
" 190544.66666666666 | \n",
"
\n",
" \n",
" Residential | \n",
" 314238.8823529412 | \n",
"
\n",
"
"
],
"text/plain": [
"[('Condo', 190544.66666666666), ('Residential', 314238.8823529412)]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"select sales_type, avg(price) as avg_price\n",
"FROM sacramento_properties \n",
"WHERE zip=95608\n",
"GROUP BY sales_type\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query result can also be stored in a Python dictionary. This enables us to choose from a plethora of Python libraries available online, to visualize data and build BI reports\n",
"\n",
"In the example below we use the folium library to mark where the Condos in our dataset are located in Sacramento. For the next section to work, if you haven't already, please install the folium package in your project environment and restart the kernel. (Also, please re-run the first cell)\n",
"\n",
"(Internet access is required to visualize the map)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
}
],
"source": [
"condos = %sql select * from sacramento_properties where `sales_type` = 'Condo'"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import folium\n",
"m = folium.Map(\n",
" location=[38.5815700, -121.4944000],\n",
" zoom_start=10\n",
")\n",
"\n",
"for condo in condos:\n",
" folium.Marker([float(condo[9]), float(condo[10])], popup=str(condo[8])).add_to(m)\n",
"\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python-pyhive_notebook1__meb10000",
"language": "python",
"name": "python-pyhive_notebook1__meb10000"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}