{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "
\n", " \n", "
Run in Google Colab
\n", "
\n", "
\n", " \n", "
\n", " \n", "
Download Notebook
\n", "
\n", "
\n", " \n", "
\n", " \n", "
View on GitHub
\n", "
\n", "
\n", "
\n", "\n", "# Population in Zürich\n", "\n", "Zürich Statistical Office collects data on the city and its residents. This data is published as [Linked Data](https://en.wikipedia.org/wiki/Linked_data). \n", "\n", "In this tutorial, we will show **how to work with Linked Data.** Mainly, we will see how to work with population dataset. \n", "We will look into how to query, process, and visualize it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[1. Population in city districts](#Population-in-city-districts) \n", "[2. Population origin](#Population-origin) \n", "[3. Population distribution: age and time](#Population-distribution:-age-and-time) \n", "[4. Population distribution: age and sex](#Population-distribution:-age-and-sex) \n", "[5. Population distribution: age and origin](#Population-distribution:-age-and-origin) \n", "[6. Population and real estate prices](#Population-and-real-estate-prices) \n", "[7. Causes of death](#Causes-of-death) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SPARQL endpoint\n", "\n", "Population data is published as Linked Data thatcan be accessed with [SPARQL queries](https://www.w3.org/TR/rdf-sparql-query/). \n", "You can send queries using HTTP requests. The API endpoint is **[https://ld.stadt-zuerich.ch/query](https://ld.stadt-zuerich.ch/query).** \n", " \n", " \n", "Let's use `SparqlClient` from [graphly](https://github.com/zazuko/graphly) to communicate with the database. \n", "Graphly will allow us to:\n", "* send SPARQL queries\n", "* automatically add prefixes to all queries\n", "* format response to `pandas` or `geopandas`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment to install dependencies in Colab environment\n", "#!pip install git+https://github.com/zazuko/graphly.git" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "from plotly.subplots import make_subplots\n", "\n", "from graphly.api_client import SparqlClient" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def natural_keys(txt: str) -> list[int]:\n", " \"\"\"Extracts the digits from string\n", " Args:\n", " txt: string with digits\n", "\n", " Returns:\n", " list[int] digits in string\n", " \"\"\"\n", " \n", " return [int(s) for s in txt.split() if s.isdigit()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sparql = SparqlClient(\"https://ld.stadt-zuerich.ch/query\")\n", "sparql.add_prefixes({\n", " \"schema\": \"\",\n", " \"cube\": \"\",\n", " \"property\": \"\",\n", " \"measure\": \"\",\n", " \"collection\": \"\",\n", " \"skos\": \"\",\n", " \"ssz\": \"\"\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SPARQL queries can become very long. To improve the readibility, we will work wih [prefixes](https://en.wikibooks.org/wiki/SPARQL/Prefixes).\n", " \n", "Using the `add_prefixes` method, we define persistent prefixes. \n", "Every time you send a query, `graphly` will automatically add the prefixes for you.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population in city districts\n", "\n", "Let's find the number of inhabitants in different parts of the city. The population data is available in the `BEW` data cube. \n", "\n", "The query for the number of inhabitants in different city districts, over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?time ?place ?count\n", "FROM \n", "WHERE {\n", " ssz:BEW a cube:Cube;\n", " cube:observationSet/cube:observation ?observation. \n", " \n", " ?observation property:RAUM ?place_uri ;\n", " property:TIME ?time ;\n", " measure:BEW ?count .\n", " ?place_uri skos:inScheme ;\n", " schema:name ?place .\n", " FILTER regex(str(?place),\"ab|Stadtgebiet vor\")\n", "}\n", "ORDER BY ?time\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the number of inhabitants per district. To do this, we will aggregate the numbers per `place`. \n", "The cleaned dataframe becomes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df.place = df.place.apply(lambda x: re.findall('Kreis \\d+', x)[0])\n", "\n", "df = pd.pivot_table(df, index=\"time\", columns=\"place\", values=\"count\")\n", "df.dropna(inplace=True)\n", "\n", "df = df[df.columns[np.argsort(-df.iloc[0,])]]\n", "df = df.reset_index().rename_axis(None, axis=1)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can graph it using a line plot or a histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sorted_df = df.reindex(sorted(df.columns, key=natural_keys), axis=1)\n", "fig = px.line(sorted_df, x=\"time\", y = sorted_df.columns)\n", "fig.update_layout(\n", " title='Population in Zürich Districts', \n", " title_x=0.5,\n", " yaxis_title=\"inhabitants\",\n", " xaxis_title=\"Years\",\n", " legend_title=\"District\"\n", ")\n", "fig.show(\"notebook\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "fig = px.histogram(df, x=\"time\", y=df.columns, barnorm=\"percent\")\n", "fig.update_layout(\n", " title='Population in Zürich Districts', \n", " title_x=0.5,\n", " yaxis_title=\"% of inhabitants\",\n", " xaxis_title=\"Years\",\n", " legend_title=\"District\"\n", ")\n", "fig['layout']['yaxis']['range']= [0,100]\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population origin\n", "\n", "Let's find the number of foreign and swiss inhabitants. The share of swiss/non-swiss population is available in the `ANT-GGH-HEL` data cube. The population count is available in `BEW` data cube.\n", "\n", "The query for number of inhabitants and foreigners share over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?time (SUM(?pop_count) AS ?pop) (SUM(?foreigners_count)/SUM(?pop_count) AS ?foreigners) \n", "FROM \n", "WHERE {\n", " ssz:BEW a cube:Cube;\n", " cube:observationSet/cube:observation ?obs_bew. \n", " ?obs_bew property:TIME ?time ;\n", " property:RAUM ?place_uri;\n", " measure:BEW ?pop_count .\n", " \n", " ssz:ANT-GGH-HEL a cube:Cube;\n", " cube:observationSet/cube:observation ?obs_ant. \n", " \n", " ?obs_ant property:TIME ?time ;\n", " property:RAUM ?place_uri;\n", " measure:ANT ?ratio .\n", " \n", " ?place_uri skos:inScheme ;\n", " schema:name ?place .\n", " \n", " BIND((?pop_count * ?ratio/100) AS ?foreigners_count)\n", "}\n", "GROUP BY ?time\n", "ORDER BY ?time\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now lets visualize the data using absolute numbers as well as percentages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = make_subplots(rows=2, cols=1)\n", "\n", "fig.append_trace(go.Scatter(x=df[\"time\"],y=df[\"pop\"],\n", " name=\"Total population\", \n", " marker_color=px.colors.qualitative.Vivid[7],\n", " showlegend=False,\n", " ), row=1, col=1)\n", "\n", "fig.append_trace(go.Bar(x=df[\"time\"],y=(1-df[\"foreigners\"])*100,\n", " name=\"swiss\", \n", " marker_color=px.colors.qualitative.Vivid[3]\n", " ), row=2, col=1)\n", "\n", "\n", "fig.append_trace(go.Bar(x=df[\"time\"],y=df[\"foreigners\"]*100,\n", " name=\"foreign\",\n", " marker_color=px.colors.qualitative.Vivid[9]\n", " ), row=2, col=1)\n", "\n", "fig['layout']['yaxis']['title']='inhabitants'\n", "fig['layout']['yaxis2']['title']='Population share in %'\n", "fig['layout']['yaxis2']['range']= [0,100]\n", "fig.update_layout(height=800, title={\"text\": \"Population in Zürich\", \"x\": 0.5}, barmode = \"stack\",\n", " legend = {\"x\": 1, \"y\": 0.37})\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population distribution: age and time\n", "\n", "Let's find the number of inhabitants in different age groups. The population count per age group is available in the `BEW-ALT-HEL-SEX` data cube. \n", "\n", "The query for the number of inhabitants in various age buckets over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", " SELECT ?time ?age (SUM(?measure) AS ?count) \n", " FROM \n", " WHERE {\n", " ssz:BEW-ALT-HEL-SEX a cube:Cube;\n", " cube:observationSet/cube:observation ?observation. \n", " ?observation property:TIME ?time ;\n", " property:ALT ?age_uri ;\n", " measure:BEW ?measure .\n", " collection:1-Jahres-Altersklasse skos:member ?age_uri.\n", " ?age_uri schema:name ?age .\n", " }\n", " GROUP BY ?time ?age\n", " ORDER BY asc(?time)\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate the population share for each age group. The dataframe becomes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df[\"year\"] = df.time.apply(getattr, args=(\"year\", ))\n", "df[\"count\"] = df.groupby([\"year\"]).transform(lambda x: (x/x.sum())*100)\n", "df['age'] = df['age'].apply(lambda x: int(str(x.split(\" \")[0])))\n", "\n", "df = df.sort_values(by=[\"year\", \"age\"]).reset_index(drop=True)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And lets visualize it using an interactive plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(df, x=\"age\", y=\"count\", animation_frame=\"year\", range_y=[0, 3], range_x=[0, df.age.max()])\n", "fig.update_layout(\n", " title='Population Distribution', \n", " title_x=0.5,\n", " yaxis_title=\"Population share in %\",\n", " xaxis_title=\"Age\",\n", " legend_title=\"District\"\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population distribution: age and origin\n", "\n", "Let's take a look at age distribution among swiss and foreign inhabitants. We can find this data in the `BEW-ALT-HEL-SEX` data cube. \n", "\n", "The query for number of inhabitants in various age buckets, with their origin, over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", " SELECT ?age ?origin (SUM(?measure) AS ?count) \n", " FROM \n", " WHERE {\n", " ssz:BEW-ALT-HEL-SEX a cube:Cube;\n", " cube:observationSet/cube:observation ?observation. \n", " ?observation property:TIME ?time ;\n", " property:ALT/schema:name ?age;\n", " measure:BEW ?measure ;\n", " property:HEL/schema:name ?origin .\n", "\n", " collection:1-Jahres-Altersklasse skos:member ?age_uri.\n", " ?age_uri schema:name ?age .\n", " \n", " FILTER (?time = \"2017-12-31\"^^xsd:date)\n", " }\n", " GROUP BY ?age ?origin\n", " ORDER BY asc(?age)\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate the population share for each origin and age group. The dataframe becomes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"age\"] = df[\"age\"].apply(lambda x: int(str(x.split(\" \")[0])))\n", "df[\"count\"] = df[[\"origin\", \"count\"]].groupby([\"origin\"]).transform(lambda x: x/x.sum()*100)\n", "df = df.sort_values(by=[\"age\"]).reset_index(drop=True)\n", "df.loc[df[\"origin\"]==\"Ausland\", \"origin\"] = \"foreign\"\n", "df.loc[df[\"origin\"]==\"Schweiz\", \"origin\"] = \"swiss\"\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.bar(df, x=\"age\", y=\"count\", \n", " barmode=\"overlay\", range_y = [0,4], color=\"origin\")\n", "\n", "fig.update_layout(\n", " title='Population Distribution', \n", " title_x=0.5,\n", " yaxis_title=\"Population share in %\",\n", " xaxis_title=\"Age\",\n", " legend_title=\"Origin\"\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population distribution: age and sex\n", "\n", "Let's take a look at the age distribution for female and male inhabitants. We can find this data in the `BEW-ALT-HEL-SEX` data cube. \n", "\n", "The query for number of inhabitants in various age buckets, with their sex, over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?time ?sex ?age (SUM(?measure) AS ?count) \n", "FROM \n", "WHERE {\n", " ssz:BEW-ALT-HEL-SEX a cube:Cube;\n", " cube:observationSet/cube:observation ?observation. \n", " ?observation property:TIME ?time ;\n", " measure:BEW ?measure ;\n", " property:SEX/schema:name ?sex ;\n", " property:ALT ?age_uri .\n", " \n", " collection:1-Jahres-Altersklasse skos:member ?age_uri.\n", " ?age_uri schema:name ?age .\n", " \n", "}\n", "GROUP BY ?time ?sex ?age\n", "ORDER BY asc(?time)\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a dataframe where one row represents one observation. It will allow us to use violin plots for our dataframe. \n", "The dataframe becomes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[df[\"sex\"]==\"weiblich\", \"sex\"] = \"female\"\n", "df.loc[df[\"sex\"]==\"männlich\", \"sex\"] = \"male\"\n", "\n", "df['age'] = df['age'].apply(lambda x: str(x.split(\" \")[0])).astype(int)\n", "df[\"year\"] = df.time.apply(getattr, args=(\"year\", )).astype(str)\n", "df = df.sort_values(by=[\"year\", \"age\"]).reset_index(drop=True)\n", "\n", "df = df[(df.year == df.year.max()) | ((df.year == df.year.min()))]\n", "df = df[[\"sex\", \"age\", \"year\"]].loc[df.index.repeat(df[\"count\"])].reset_index(drop=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.violin(df, y=\"age\", x=\"year\", color=\"sex\", violinmode=\"overlay\")\n", "fig.data[0].update(span = [0, 105], spanmode='manual')\n", "fig.data[1].update(span = [0, 105], spanmode='manual')\n", "fig.update_layout(title={\"text\": \"Population distrubution\", \"x\": 0.5})\n", "fig.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "fig = go.Figure()\n", "\n", "fig.add_trace(go.Violin(x=df['sex'][df['year'] == \"2002\"],\n", " y=df['age'][df['year'] == \"2002\"],\n", " legendgroup='2002', scalegroup='2002', name='2002',\n", " side='negative',\n", " line_color='blue', \n", " span = [0, 105], \n", " spanmode='manual'))\n", "\n", "fig.add_trace(go.Violin(x=df['sex'][df['year'] == \"2017\"],\n", " y=df['age'][df['year'] == \"2017\"],\n", " legendgroup='2017', scalegroup='2017', name='2017',\n", " side='positive',\n", " line_color='orange',\n", " span = [0, 105], \n", " spanmode='manual'))\n", "\n", "fig.update_traces(meanline_visible=True)\n", "fig.update_layout(title={\"text\": \"Population distrubution\", \"x\": 0.5}, yaxis_title=\"age\")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Population and real estate prices\n", "\n", "Let's compare real estate prices and number of inhabitants over time. We will need to work with population and real estate data sets. The population data is available in the `BEW` data cube. The real estate prices are in the `QMP-EIG-HAA-OBJ-ZIM` data cube.\n", "\n", "The query for the number of inhabitants and the housing prices over time looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query=\"\"\"\n", "SELECT * \n", "FROM \n", "WHERE{ \n", " {\n", " SELECT ?time (SUM(?pop_count) AS ?pop)\n", " WHERE {\n", " ssz:BEW a cube:Cube;\n", " cube:observationSet/cube:observation ?obs_bew. \n", " ?obs_bew property:TIME ?time ;\n", " property:RAUM ?place_uri_pop;\n", " measure:BEW ?pop_count .\n", "\n", " ?place_uri_pop skos:inScheme ;\n", " schema:name ?place_pop .\n", "\n", " FILTER regex(str(?place_pop),\"ab|Stadtgebiet vor\")\n", " }\n", " GROUP BY ?time\n", " }\n", " {\n", " SELECT ?time (AVG(?quote) AS ?price)\n", " WHERE {\n", " ssz:QMP-EIG-HAA-OBJ-ZIM a cube:Cube;\n", " cube:observationSet/cube:observation ?obs_apt. \n", " ?obs_apt property:TIME ?time ;\n", " property:RAUM ?place_uri_apt;\n", " measure:QMP ?quote . \n", "\n", " ?place_uri_apt skos:inScheme ;\n", " schema:name ?place_apt .\n", "\n", " FILTER (?quote > 0)\n", " FILTER regex(str(?place_apt),\"ab|Stadtgebiet vor\")\n", " }\n", " GROUP BY ?time\n", " ORDER BY ?time\n", " }\n", "}\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create figure with secondary y-axis\n", "fig = make_subplots(specs=[[{\"secondary_y\": True}]])\n", "\n", "# Add traces\n", "fig.add_trace(\n", " go.Scatter(x=df[\"time\"], y=df[\"pop\"], name=\"inhabitants\"),\n", " secondary_y=False,\n", ")\n", "\n", "fig.add_trace(\n", " go.Scatter(x=df[\"time\"], y=df[\"price\"], name=\"price per m2\"),\n", " secondary_y=True,\n", ")\n", "\n", "# Layout\n", "fig.update_layout(title={\"text\": \"Population and real estate prices\", \"x\": 0.5})\n", "fig.update_yaxes(title_text=\"population\", secondary_y=False)\n", "fig.update_yaxes(title_text=\"price per m2\", secondary_y=True)\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Causes of death\n", "\n", "The Statistical Office reports the number of deaths and the cause. Let's try to understand what are the main causes of death in Zurich.\n", "This data is available in the `GES-SEX-TOU` data cube.\n", "\n", "The query for death cause and its broader category for the year 2015 looks as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?tou ?tou_broader (SUM(?ges) AS ?deaths)\n", "FROM \n", "WHERE {\n", " ssz:GES-SEX-TOU a cube:Cube;\n", " cube:observationSet/cube:observation ?obs. \n", " ?obs property:TIME ?time ; \n", " property:TOU ?tou_uri;\n", " measure:GES ?ges .\n", " \n", " ?tou_uri schema:name ?tou ;\n", " skos:broader/schema:name ?tou_broader .\n", "\n", " MINUS {?three_level_tou skos:broader ?tou_uri .}\n", " FILTER (?time = \"2015-12-31\"^^xsd:date)\n", "}\n", "GROUP BY ?tou ?tou_broader\n", "HAVING (?deaths > 0)\n", "ORDER BY ?tou_broader\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's aggregate those results under more meaningful group names. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[(df.tou == \"andere infektiöse Krankheiten\"), \"tou_broader\"] = \"\"\n", "df.loc[(df.tou == \"andere infektiöse Krankheiten\"), \"tou\"] = \"Infektiöse Krankheiten\"\n", "df.loc[(df.tou == \"Alkoholische Leberzirrhose\"), \"tou_broader\"] = \"\"\n", "df.loc[(df.tou == \"Unbekannt\"), \"deaths\"] = df.loc[(df.tou_broader == \"Übrige\"), \"deaths\"].sum()\n", "df.loc[(df.tou == \"Unbekannt\"), \"tou_broader\"] = \"\"\n", "df.loc[(df.tou_broader == \"Krebskrankheiten/Bösartige Neubildungen\"), \"tou_broader\"] = \"Krebskrankheiten\"\n", "\n", "df = df.drop(df[df.tou == \"Übrige (ohne unbekannte Todesursachen)\"].index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = px.treemap(df, path=['tou_broader', \"tou\"], values='deaths')\n", "fig.update_layout(title={\"text\": \"Causes of Death in 2015\", \"x\": 0.5})\n", "fig.show()" ] } ], "metadata": { "interpreter": { "hash": "7df9fa510e94119d9b6414b7654a807785c4522000477a84b4534377ae180111" }, "kernelspec": { "display_name": "zazuko", "language": "python", "name": "zazuko" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" }, "title": "Population" }, "nbformat": 4, "nbformat_minor": 4 }