{ "cells": [ { "cell_type": "markdown", "id": "cb60ce5c", "metadata": {}, "source": [ "# SQL Translation to Python" ] }, { "cell_type": "markdown", "id": "83cbd1ad", "metadata": {}, "source": [ "data available via https://ourairports.com/data/" ] }, { "cell_type": "code", "execution_count": null, "id": "1a4912a1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# airports = pd.read_csv('./data/airports.csv')\n", "# airport_freq = pd.read_csv('./data/airport-frequencies.csv')\n", "# runways = pd.read_csv('./data/runways.csv')\n", "\n", "airports = pd.read_csv('https://davidmegginson.github.io/ourairports-data/airports.csv')\n", "airport_freq = pd.read_csv('https://davidmegginson.github.io/ourairports-data/airport-frequencies.csv')\n", "runways = pd.read_csv('https://davidmegginson.github.io/ourairports-data/runways.csv')" ] }, { "cell_type": "markdown", "id": "db518130", "metadata": {}, "source": [ "## Order of SQL executions\n", "\n", "| Order\t| Clause\t| Function\t\t\t\t\t\t\t\t\t| \n", "| ----- | --------- | ----------------------------------------- | \n", "| 1\t | FROM\t | Tables are joined to get the base data.\t| \n", "| 2\t | WHERE\t | The base data is filtered.\t\t\t\t| \n", "| 3\t | GROUP BY\t| The filtered base data is grouped.\t\t| \n", "| 4\t | HAVING\t| The grouped base data is filtered.\t\t| \n", "| 5\t | SELECT\t| The final data is returned.\t\t\t\t| \n", "| 6\t | ORDER BY\t| The final data is sorted.\t\t\t\t\t| \n", "| 7\t | LIMIT\t | The returned data is limited to row count\t| " ] }, { "cell_type": "markdown", "id": "8a61234e", "metadata": {}, "source": [ "## SELECT, WHERE, DISTINCT, LIMIT" ] }, { "cell_type": "markdown", "id": "44fa3b1b", "metadata": {}, "source": [ "select * from airports" ] }, { "cell_type": "code", "execution_count": null, "id": "085500b5", "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [], "source": [ "airports" ] }, { "cell_type": "markdown", "id": "fc7de832", "metadata": {}, "source": [ "select * from airports limit 3" ] }, { "cell_type": "code", "execution_count": null, "id": "32eb7c4f", "metadata": {}, "outputs": [], "source": [ "airports.head(3)" ] }, { "cell_type": "markdown", "id": "ea9e24e9", "metadata": {}, "source": [ "select id from airports where ident = 'KLAX'" ] }, { "cell_type": "code", "execution_count": null, "id": "47bce847", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident == 'KLAX'].id" ] }, { "cell_type": "code", "execution_count": null, "id": "ac7b120a", "metadata": {}, "outputs": [], "source": [ "# or\n", "airports[airports['ident']=='KLAX'].id" ] }, { "cell_type": "markdown", "id": "c32f2350", "metadata": {}, "source": [ "select distinct type from airport\t" ] }, { "cell_type": "code", "execution_count": null, "id": "6a296a92", "metadata": {}, "outputs": [], "source": [ "airports.type.unique()" ] }, { "cell_type": "code", "execution_count": null, "id": "167be524", "metadata": {}, "outputs": [], "source": [ "airports['type'].unique()" ] }, { "cell_type": "markdown", "id": "5db7132a", "metadata": {}, "source": [ "# SELECT with multiple conditions" ] }, { "cell_type": "markdown", "id": "057027a3", "metadata": {}, "source": [ "select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'" ] }, { "cell_type": "code", "execution_count": null, "id": "6ebff121", "metadata": {}, "outputs": [], "source": [ "airports[(airports['iso_region']=='US-CA') & (airports['type']=='seaplane_base')] # mask" ] }, { "cell_type": "markdown", "id": "da68bea9", "metadata": {}, "source": [ "select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'" ] }, { "cell_type": "code", "execution_count": null, "id": "85675a4a", "metadata": {}, "outputs": [], "source": [ "airports[(airports['iso_region']=='US-CA') & (airports['type']=='large_airport')][['ident', 'name', 'municipality']]\n", "# passing a list to __getitem__; \n", "# https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe" ] }, { "cell_type": "code", "execution_count": null, "id": "391b73f6", "metadata": {}, "outputs": [], "source": [ "a = airports[(airports['iso_region']=='US-CA') & (airports['type']=='large_airport')][['ident']]\n", "b = airports[(airports['iso_region']=='US-CA') & (airports['type']=='large_airport')]['ident']" ] }, { "cell_type": "code", "execution_count": null, "id": "7424a8fd", "metadata": {}, "outputs": [], "source": [ "type(a)" ] }, { "cell_type": "code", "execution_count": null, "id": "b80b5dbf", "metadata": {}, "outputs": [], "source": [ "type(b)" ] }, { "cell_type": "markdown", "id": "add2a9f5", "metadata": {}, "source": [ "## ORDER BY" ] }, { "cell_type": "markdown", "id": "ef0dee43", "metadata": {}, "source": [ "select * from airport_freq where airport_ident = 'KLAX' order by type" ] }, { "cell_type": "code", "execution_count": null, "id": "fe0e99e8", "metadata": { "scrolled": false }, "outputs": [], "source": [ "airport_freq[airport_freq['airport_ident']=='KLAX'].sort_values('type')" ] }, { "cell_type": "code", "execution_count": null, "id": "69cd17ee", "metadata": {}, "outputs": [], "source": [ "airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')" ] }, { "cell_type": "markdown", "id": "b43350c6", "metadata": {}, "source": [ "select * from airport_freq where airport_ident = 'KLAX' order by type desc" ] }, { "cell_type": "code", "execution_count": null, "id": "589bcb2c", "metadata": {}, "outputs": [], "source": [ "airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)" ] }, { "cell_type": "markdown", "id": "8e464f38", "metadata": {}, "source": [ "## IN… NOT IN" ] }, { "cell_type": "markdown", "id": "cd5702c3", "metadata": {}, "source": [ "select * from airports where type in ('heliport', 'balloonport')" ] }, { "cell_type": "code", "execution_count": null, "id": "795ef133", "metadata": { "scrolled": true }, "outputs": [], "source": [ "airports[airports.type.isin(['heliport', 'balloonport'])] # still, you need to pass a list" ] }, { "cell_type": "code", "execution_count": null, "id": "4f5713a6", "metadata": {}, "outputs": [], "source": [ "# airports[airports.type.isin('heliport', 'balloonport')] # TypeError" ] }, { "cell_type": "markdown", "id": "05f8408d", "metadata": {}, "source": [ "select * from airports where type not in ('heliport', 'balloonport')" ] }, { "cell_type": "code", "execution_count": null, "id": "1deb3549", "metadata": { "scrolled": true }, "outputs": [], "source": [ "airports[~airports.type.isin(['heliport', 'balloonport'])]" ] }, { "cell_type": "markdown", "id": "aa096653", "metadata": {}, "source": [ "## GROUP BY, COUNT, ORDER BY" ] }, { "cell_type": "markdown", "id": "e01daa40", "metadata": {}, "source": [ "select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type" ] }, { "cell_type": "code", "execution_count": null, "id": "2871a000", "metadata": {}, "outputs": [], "source": [ "airports.groupby(['iso_country', 'type']).size() # as usual, pass a list to df.groupby()" ] }, { "cell_type": "markdown", "id": "a7eb6cd3", "metadata": {}, "source": [ "select iso_country, type, count(\\*) from airports group by iso_country, type order by iso_country, count(\\*) desc" ] }, { "cell_type": "code", "execution_count": null, "id": "995b140e", "metadata": { "scrolled": true }, "outputs": [], "source": [ "type(airports.groupby(['iso_country', 'type']).size()) # pandas.core.series.Series" ] }, { "cell_type": "code", "execution_count": null, "id": "04df365f", "metadata": {}, "outputs": [], "source": [ "type(airports.groupby(['iso_country', 'type']).size().to_frame('size')) # pandas.core.frame.DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "b3ebc8c8", "metadata": {}, "outputs": [], "source": [ "airports.groupby(['iso_country', 'type']).size().to_frame('size').sort_values(['iso_country', 'size'])" ] }, { "cell_type": "code", "execution_count": null, "id": "732b97f7", "metadata": {}, "outputs": [], "source": [ "airports.groupby(['iso_country', 'type']).size().to_frame('size').sort_values(['iso_country', 'size'], ascending=[True, False])" ] }, { "cell_type": "code", "execution_count": null, "id": "406917e2", "metadata": {}, "outputs": [], "source": [ "# reset_index()\n", "airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])" ] }, { "cell_type": "code", "execution_count": null, "id": "463ac0cd", "metadata": {}, "outputs": [], "source": [ "type(airports.groupby(['iso_country', 'type'])) # pandas.core.groupby.generic.DataFrameGroupBy" ] }, { "cell_type": "markdown", "id": "445c0132", "metadata": {}, "source": [ "## Top N records" ] }, { "cell_type": "markdown", "id": "36cc8872", "metadata": {}, "source": [ "first, let's create a dataframe of unique airports in each country" ] }, { "cell_type": "code", "execution_count": null, "id": "057fe4b6", "metadata": {}, "outputs": [], "source": [ "# filter out closed airports\n", "by_country = airports[~airports.type.isin(['closed'])].groupby('iso_country')['name'].nunique().to_frame('size').reset_index()" ] }, { "cell_type": "code", "execution_count": null, "id": "206efe12", "metadata": {}, "outputs": [], "source": [ "by_country.head(3)" ] }, { "cell_type": "markdown", "id": "ac0b0d7b", "metadata": {}, "source": [ "select iso_country from by_country order by size desc limit 10" ] }, { "cell_type": "code", "execution_count": null, "id": "1d6836af", "metadata": {}, "outputs": [], "source": [ "by_country.nlargest(10, columns='size')" ] }, { "cell_type": "code", "execution_count": null, "id": "c56af243", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# alternative\n", "by_country.sort_values('size', ascending=False).head(10)\n", "# by_country.sort_values('size', ascending=False).iso_country.head(10)" ] }, { "cell_type": "markdown", "id": "0cd7406c", "metadata": {}, "source": [ "select iso_country from by_country order by size desc limit 10 offset 10" ] }, { "cell_type": "code", "execution_count": null, "id": "47632617", "metadata": {}, "outputs": [], "source": [ "by_country.nlargest(20, columns='size').tail(10)" ] }, { "cell_type": "markdown", "id": "248a1309", "metadata": {}, "source": [ "**Followup: try to add rank per group**\n", "\n", "wait till proper data is ready" ] }, { "cell_type": "code", "execution_count": null, "id": "51097e58", "metadata": {}, "outputs": [], "source": [ "# https://stackoverflow.com/questions/26720916/faster-way-to-rank-rows-in-subgroups-in-pandas-dataframe\n", "\n", "# df[\"rank\"] = df.groupby(\"group_ID\")[\"value\"].rank(\"dense\", ascending=False)" ] }, { "cell_type": "markdown", "id": "f1ef417f", "metadata": {}, "source": [ "## Aggregate functions (MIN, MAX, MEAN)" ] }, { "cell_type": "code", "execution_count": null, "id": "947c4803", "metadata": {}, "outputs": [], "source": [ "runways.head(3)" ] }, { "cell_type": "markdown", "id": "37d9f12c", "metadata": {}, "source": [ "**Calculate min, max, mean, and median length of a runway:**\n", "\n", "select max(length_ft), min(length_ft), avg(length_ft), median(length_ft) from runways" ] }, { "cell_type": "code", "execution_count": null, "id": "6fe328c3", "metadata": {}, "outputs": [], "source": [ "runways.agg({'length_ft':['max', 'min', 'mean', 'median']})" ] }, { "cell_type": "code", "execution_count": null, "id": "7c136836", "metadata": { "scrolled": true }, "outputs": [], "source": [ "runways.agg({'length_ft':['max', 'min', 'mean', 'median']}).T # or df.transpose()" ] }, { "cell_type": "code", "execution_count": null, "id": "d5d9fbdb", "metadata": {}, "outputs": [], "source": [ "# runways.agg({'length_ft':['max', 'min', 'mean', 'median']}).transpose()" ] }, { "cell_type": "markdown", "id": "2126da21", "metadata": {}, "source": [ "## JOIN" ] }, { "cell_type": "markdown", "id": "bcd4cf73", "metadata": {}, "source": [ "**inner join**" ] }, { "cell_type": "markdown", "id": "fb447007", "metadata": {}, "source": [ "select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident = 'KLAX'" ] }, { "cell_type": "code", "execution_count": null, "id": "271b3572", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident == 'KLAX'][['id']]" ] }, { "cell_type": "code", "execution_count": null, "id": "a62a69dc", "metadata": {}, "outputs": [], "source": [ "# airport_freq.merge(airports[airports.ident=='KLAX'], \n", "# left_on='airport_ref', \n", "# right_on='id')" ] }, { "cell_type": "code", "execution_count": null, "id": "7f943abc", "metadata": {}, "outputs": [], "source": [ "airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], \n", " left_on='airport_ref', \n", " right_on='id', \n", " how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]" ] }, { "cell_type": "markdown", "id": "223cb940", "metadata": {}, "source": [ "## UNION ALL and UNION" ] }, { "cell_type": "markdown", "id": "e3e80a3a", "metadata": {}, "source": [ "select name, municipality from airports where ident = 'KLAX' union all select name, municipality from airports where ident = 'KLGB'" ] }, { "cell_type": "markdown", "id": "547614ee", "metadata": {}, "source": [ "**Use pd.concat() to UNION ALL two dataframes:**" ] }, { "cell_type": "code", "execution_count": null, "id": "36253cf0", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident=='KLAX'][['name', 'municipality']]" ] }, { "cell_type": "code", "execution_count": null, "id": "59238f37", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident=='KLGB'][['name', 'municipality']]" ] }, { "cell_type": "code", "execution_count": null, "id": "0246f18f", "metadata": {}, "outputs": [], "source": [ "pd.concat([\n", " airports[airports.ident=='KLAX'][['name', 'municipality']], \n", " airports[airports.ident=='KLGB'][['name', 'municipality']]\n", "]) # axis=1 will be \"join\"" ] }, { "cell_type": "markdown", "id": "525b9e81", "metadata": {}, "source": [ "**union all**" ] }, { "cell_type": "code", "execution_count": null, "id": "c4b3b3ca", "metadata": {}, "outputs": [], "source": [ "pd.concat([\n", " airports[airports.ident=='KLAX'][['name', 'municipality']], \n", " airports[airports.ident=='KLAX'][['name', 'municipality']]\n", "]) # union all" ] }, { "cell_type": "markdown", "id": "1b6f0e5a", "metadata": {}, "source": [ "**union**" ] }, { "cell_type": "code", "execution_count": null, "id": "29248905", "metadata": {}, "outputs": [], "source": [ "pd.concat([\n", " airports[airports.ident=='KLAX'][['name', 'municipality']], \n", " airports[airports.ident=='KLAX'][['name', 'municipality']]\n", "]).drop_duplicates() # union" ] }, { "cell_type": "markdown", "id": "1d7b0e80", "metadata": {}, "source": [ "## INSERT" ] }, { "cell_type": "markdown", "id": "de70bc0d", "metadata": {}, "source": [ "create table heroes (id integer, name text);\n", "\n", "insert into heroes values (1, 'Harry Potter');\n", "\n", "insert into heroes values (2, 'Ron Weasley');\n", "\n", "insert into heroes values (3, 'Hermione Granger');" ] }, { "cell_type": "code", "execution_count": null, "id": "25196da3", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame([[1, 'Harry Potter'], [2, 'Ron Weasley']], columns = ['id', 'name'])\n", "# df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})" ] }, { "cell_type": "code", "execution_count": null, "id": "c3320513", "metadata": {}, "outputs": [], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": null, "id": "b7abf798", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame({'id':[3], 'name':['Hermione Granger']})" ] }, { "cell_type": "code", "execution_count": null, "id": "7461c3ba", "metadata": {}, "outputs": [], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": null, "id": "d48a5bf9", "metadata": {}, "outputs": [], "source": [ "pd.concat([df1, df2]).reset_index(drop=True) # try drop=False" ] }, { "cell_type": "markdown", "id": "35f171de", "metadata": {}, "source": [ "## UPDATE" ] }, { "cell_type": "markdown", "id": "906f4c67", "metadata": {}, "source": [ "update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident == 'KLAX';" ] }, { "cell_type": "code", "execution_count": null, "id": "25b2fbab", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident=='KLAX'][['home_link']]" ] }, { "cell_type": "code", "execution_count": null, "id": "25b08032", "metadata": {}, "outputs": [], "source": [ "type(airports[airports.ident=='KLAX'][['home_link']])" ] }, { "cell_type": "code", "execution_count": null, "id": "2b3a0cac", "metadata": {}, "outputs": [], "source": [ "type(airports[airports.ident=='KLAX']['home_link'])" ] }, { "cell_type": "code", "execution_count": null, "id": "47f362ba", "metadata": {}, "outputs": [], "source": [ "# airports[airports.ident=='KLAX'][home_link='http://www.lawa.org/welcomelax.aspx']\n", "# airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'" ] }, { "cell_type": "markdown", "id": "28d70d48", "metadata": {}, "source": [ "`.loc` - selects subsets of rows and columns by label only\n", "\n", "`.iloc` - selects subsets of rows and columns by integer location only\n", "\n", "`.at` selects a single scalar value in the DataFrame by label only\n", "\n", "`.iat` selects a single scalar value in the DataFrame by integer location only" ] }, { "cell_type": "code", "execution_count": null, "id": "1c701c75", "metadata": {}, "outputs": [], "source": [ "airports.loc[airports['ident'] == 'KLAX', 'home_link']" ] }, { "cell_type": "code", "execution_count": null, "id": "168b75dc", "metadata": {}, "outputs": [], "source": [ "# airports.at[airports['ident'] == 'KLAX', 'home_link']\n", "# must use index in df.at[]" ] }, { "cell_type": "code", "execution_count": null, "id": "003db08d", "metadata": {}, "outputs": [], "source": [ "airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'" ] }, { "cell_type": "code", "execution_count": null, "id": "7524438f", "metadata": {}, "outputs": [], "source": [ "airports[airports.ident=='KLAX'][['home_link']]" ] }, { "cell_type": "markdown", "id": "1b22d2ec", "metadata": {}, "source": [ "## DELETE" ] }, { "cell_type": "code", "execution_count": null, "id": "b14b6866", "metadata": {}, "outputs": [], "source": [ "airports2 = airports.copy()" ] }, { "cell_type": "code", "execution_count": null, "id": "9b6d1b49", "metadata": {}, "outputs": [], "source": [ "lax_freq = airport_freq[airport_freq.airport_ident=='KLAX'].drop_duplicates()" ] }, { "cell_type": "code", "execution_count": null, "id": "03c2cac3", "metadata": {}, "outputs": [], "source": [ "lax_freq" ] }, { "cell_type": "code", "execution_count": null, "id": "ce83f284", "metadata": {}, "outputs": [], "source": [ "lax_freq2 = lax_freq.copy() # make a copy" ] }, { "cell_type": "markdown", "id": "0aa65630", "metadata": {}, "source": [ "delete from lax_freq where type = 'MISC'" ] }, { "cell_type": "markdown", "id": "900af7e9", "metadata": {}, "source": [ "**option 1: assign the DataFrame to a filtered version of itself:**" ] }, { "cell_type": "code", "execution_count": null, "id": "6c71bf56", "metadata": {}, "outputs": [], "source": [ "lax_freq = lax_freq[lax_freq.type != 'MISC']" ] }, { "cell_type": "code", "execution_count": null, "id": "74334b1f", "metadata": {}, "outputs": [], "source": [ "lax_freq" ] }, { "cell_type": "code", "execution_count": null, "id": "bad43c37", "metadata": {}, "outputs": [], "source": [ "lax_freq2[lax_freq2.type == 'MISC'].index" ] }, { "cell_type": "code", "execution_count": null, "id": "8776a7bf", "metadata": {}, "outputs": [], "source": [ "# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html\n", "lax_freq2.drop(lax_freq2[lax_freq2.type == 'MISC'].index)" ] }, { "cell_type": "markdown", "id": "d54f2805", "metadata": {}, "source": [ "## Immutability" ] }, { "cell_type": "markdown", "id": "83c5ba4c", "metadata": {}, "source": [ "By default, most operators applied to a Pandas dataframe return a new object. Some operators accept a parameter inplace=True, so you can work with the original dataframe instead. \n", "\n", "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html\n", "\n", "- inplace: bool, default False\n", " - Modify the DataFrame in place (do not create a new object)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }