{ "cells": [ { "cell_type": "markdown", "id": "cd2f0c7a", "metadata": {}, "source": [ "
\n", " \n", "
\n", " \n", "
Run in Google Colab
\n", "
\n", "
\n", " \n", "
\n", " \n", "
Download Notebook
\n", "
\n", "
\n", " \n", "
\n", " \n", "
View on GitHub
\n", "
\n", "
\n", "
\n", "\n", "# Swiss commerce register\n", "\n", "The Federal Office of Justice maintains ZEFIX, the swiss commerce register for all legally operating businesses.\n", "\n", "\n", "The register provides us with company name, type, description, and address. \n", "ZEFIX is also available as [Linked Data](https://en.wikipedia.org/wiki/Linked_data). \n" ] }, { "cell_type": "markdown", "id": "eb7bef9c", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "id": "cc8bbfa9", "metadata": {}, "source": [ "### SPARQL endpoints\n", "\n", "#### For companies data\n", "Swiss commerce register data can be accessed with [SPARQL queries](https://www.w3.org/TR/rdf-sparql-query/). \n", "You can send queries using HTTP requests. The API endpoint is **[https://lindas.admin.ch/query/](https://int.lindas.admin.ch/query).** \n", "\n", "#### For geodata\n", "To understand companies' location, we will work with Swiss geodata. It can be accessed with GeoSPARQL under **[https://geo.ld.admin.ch/query](https://geo.ld.admin.ch/query).** " ] }, { "cell_type": "markdown", "id": "da9db59c", "metadata": {}, "source": [ "### SPARQL client\n", "\n", "Let's use `SparqlClient` from [graphly](https://github.com/zazuko/graphly) to communicate with both databases. \n", "Graphly will allow us to:\n", "* send SPARQL queries\n", "* automatically add prefixes to all queries\n", "* format response to `pandas` or `geopandas`" ] }, { "cell_type": "code", "execution_count": null, "id": "d2f8138f", "metadata": {}, "outputs": [], "source": [ "import json\n", "import os.path\n", "\n", "import folium\n", "import mapclassify\n", "import matplotlib as mpl\n", "import pandas as pd\n", "import geopandas as gpd\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "\n", "from graphly.api_client import SparqlClient\n", "from geopy.extra.rate_limiter import RateLimiter\n", "from geopy.geocoders import Nominatim\n", "from shapely.geometry import Point\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "f80432ff", "metadata": {}, "outputs": [], "source": [ "# Uncomment to install dependencies in Colab environment\n", "#!pip install mapclassify\n", "#!pip install git+https://github.com/zazuko/graphly.git" ] }, { "cell_type": "code", "execution_count": null, "id": "55a2bac1", "metadata": {}, "outputs": [], "source": [ "sparql = SparqlClient(\"https://lindas.admin.ch/query\")\n", "geosparql = SparqlClient(\"https://geo.ld.admin.ch/query\")\n", "\n", "sparql.add_prefixes({\n", " \"schema\": \"\",\n", " \"admin\": \"\"\n", "})\n", "\n", "geosparql.add_prefixes({\n", " \"dct\": \"\",\n", " \"geonames\": \"\",\n", " \"schema\": \"\",\n", " \"geosparql\": \"\",\n", "})" ] }, { "cell_type": "markdown", "id": "55416e9c", "metadata": {}, "source": [ "SPARQL queries can become very long. To improve the readibility, we will work wih [prefixes](https://en.wikibooks.org/wiki/SPARQL/Prefixes).\n", " \n", "Using the `add_prefixes` method, we can define persistent prefixes. \n", "Every time you send a query, `graphly` will now automatically add the prefixes for you." ] }, { "cell_type": "markdown", "id": "b8165286", "metadata": {}, "source": [ "# Companies by company type\n", "\n", "Swiss law offers many opportnities to business owners. They can choose from several legal entites. What entities can we find in ZEFIX? Which ones are the most popular among enterpreneurs?\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9f22454e", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?type (COUNT(DISTINCT ?company_iri) AS ?ccount)\n", "WHERE {\n", " ?company_iri a admin:ZefixOrganisation.\n", " ?company_iri schema:additionalType/schema:name ?type .\n", " \n", " FILTER(LANG(?type) = \"de\")\n", "}\n", "GROUP BY ?type\n", "ORDER BY DESC(?ccount)\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)" ] }, { "cell_type": "code", "execution_count": null, "id": "50a72233", "metadata": {}, "outputs": [], "source": [ "# Let's rename variables to english\n", "\n", "de2en = {'Kommanditgesellschaft': \"Limited Partnership\",\n", " 'Ausländische Niederlassung im Handelsregister eingetragen': \"Foreign Branch\",\n", " 'Genossenschaft': \"Cooperative\",\n", " 'Verein': \"Association\",\n", " 'Kollektivgesellschaft': \"General Partnership\",\n", " 'Schweizerische Zweigniederlassung im Handelsregister eingetragen': \"Swiss Branch\",\n", " 'Stiftung': \"Foundation\",\n", " 'Einzelunternehmen': \"Sole proprietorship\",\n", " 'Aktiengesellschaft': \"Joint-stock Company\",\n", " 'Gesellschaft mit beschränkter Haftung GMBH / SARL': \"Limited Liability Company\"}\n", "\n", "df[\"type\"] = df[\"type\"].apply(lambda x: de2en[x] if x in de2en else x)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3157342a", "metadata": {}, "outputs": [], "source": [ "fig = px.bar(df[df.ccount > 500], y=\"type\", x=\"ccount\", orientation = \"h\", labels={\"type\": \"\", \"ccount\": \"Company Count\"})\n", "fig.update_layout(\n", " title='Which company type is most popular?', \n", " title_x=0.5,\n", ")\n", "fig.update_layout(bargap=0.40)\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "d6ad0090", "metadata": {}, "source": [ "# Companies by municipality\n", "\n", "The economic activity varies significantly across regions. What are the most popular places to register a company? \n" ] }, { "cell_type": "code", "execution_count": null, "id": "2f6e6a56", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?municipality ?muni_id (COUNT(?sub) AS ?companies)\n", "FROM \n", "FROM \n", "WHERE {\n", "\t?sub a admin:ZefixOrganisation ;\n", " admin:municipality ?muni_id.\n", " ?muni_id schema:name ?municipality;\n", "} \n", "GROUP BY ?municipality ?muni_id\n", "ORDER BY DESC(?companies)\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)" ] }, { "cell_type": "code", "execution_count": null, "id": "87a183d7", "metadata": {}, "outputs": [], "source": [ "# Communes\n", "query = \"\"\" \n", "SELECT ?muni_id ?population ?boundary \n", "\n", "WHERE {\n", " ?muni_iri dct:hasVersion ?version ;\n", " geonames:featureCode geonames:A.ADM3 .\n", " \n", " ?version schema:validUntil \"2020-12-31\"^^;\n", " geonames:population ?population;\n", " geosparql:hasGeometry/geosparql:asWKT ?boundary.\n", " \n", " BIND(IRI(REPLACE(STR(?muni_iri), \"https://geo.ld.admin.ch/boundaries/\", \"https://ld.admin.ch/\")) AS ?muni_id)\n", "}\n", "\"\"\"\n", "communes = geosparql.send_query(query)\n", "communes = communes.set_crs(epsg=4326)\n", "communes[\"center\"] = communes.to_crs(epsg=3035).centroid.to_crs(epsg=4326)" ] }, { "cell_type": "code", "execution_count": null, "id": "a1e025b6", "metadata": {}, "outputs": [], "source": [ "join = pd.merge(communes, df, how=\"inner\", on=\"muni_id\")\n", "join.sort_values(by=\"companies\", ascending=False, inplace=True)\n", "join = join.reset_index(drop=True)\n", "join[[\"municipality\", \"companies\"]]" ] }, { "cell_type": "markdown", "id": "21b0e6e3", "metadata": {}, "source": [ "The most popular places are the biggest cities. \n", "\n", "This should come at no surprise. The more people live in a certain area, the more business founders it will have. The company density is expected to be directly correlated with population density.\n", "\n", "Now, let's take into account the population density. Which regions have most companies per 10.000 inhabitants?" ] }, { "cell_type": "code", "execution_count": null, "id": "97a4a03d", "metadata": {}, "outputs": [], "source": [ "join[\"companies_per_inhabitants\"] = round(join.companies/join.population*10000)\n", "join[\"companies_per_inhabitants\"] = join[\"companies_per_inhabitants\"].astype(int)\n", "join.sort_values(by=\"companies_per_inhabitants\", ascending=False, inplace=True)\n", "join = join.reset_index(drop=True)\n", "join[[\"municipality\", \"companies_per_inhabitants\"]].head()" ] }, { "cell_type": "markdown", "id": "aff4c52e", "metadata": {}, "source": [ "Interesting! Zug is the only bigger city that appears on the top of the list.\n", "\n", "Where are the most popular registration destinations?" ] }, { "cell_type": "code", "execution_count": null, "id": "b02a3960", "metadata": {}, "outputs": [], "source": [ "# Companies per 10.000 inhabitants\n", "\n", "style_function = lambda x: {'fillColor': '#ffffff', \n", " 'color':'#000000', \n", " 'fillOpacity': 0.1, \n", " 'weight': 0.1}\n", "\n", "highlight_function = lambda x: {'fillColor': '#989898', \n", " 'color':'#000000', \n", " 'fillOpacity': 0.8}\n", "\n", "def plot_heatmap(df, viz_column, hover_cols, col2label, title):\n", " \n", " classifier = mapclassify.NaturalBreaks(y=df[viz_column], k=5)\n", " bins = [df[viz_column].min()] + list(classifier.bins)\n", " cols = [\"municipality\", viz_column, *hover_cols]\n", "\n", " m = folium.Map(location=[46.83, 8.13], zoom_start=8, tiles=\"cartodbpositron\")\n", "\n", " folium.Choropleth(\n", " geo_data=json.loads(df.to_json()),\n", " data=df,\n", " columns=cols,\n", " key_on=\"feature.properties.municipality\",\n", " fill_color=\"YlOrRd\",\n", " fill_opacity=1,\n", " line_weight=0,\n", " smooth_factor=0,\n", " bins=bins,\n", " reset=True,\n", " legend_name=col2label[viz_column],\n", " ).add_to(m)\n", "\n", " hover = folium.features.GeoJson(\n", " df,\n", " style_function=style_function, \n", " control=False,\n", " highlight_function=highlight_function,\n", " tooltip=folium.features.GeoJsonTooltip(\n", " fields=cols,\n", " aliases=[col2label[i] for i in cols],\n", " style=(\"background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;\") \n", " )\n", " )\n", "\n", " folium.LayerControl().add_to(m)\n", " m.add_child(hover)\n", " m.keep_in_front(hover)\n", " \n", " title_html = '''

{}

'''.format(title) \n", " m.get_root().html.add_child(folium.Element(title_html))\n", "\n", " return m" ] }, { "cell_type": "code", "execution_count": null, "id": "7ab6d47c", "metadata": {}, "outputs": [], "source": [ "join = join.drop(columns=\"center\")\n", "col2label = {\"municipality\": \"Municipality\", \"companies_per_inhabitants\": \"Companies per 10.000 inhabitants\", \"companies\": \"Companies \", \"population\": \"Population\"}\n", "plot_heatmap(join, \"companies_per_inhabitants\", [\"companies\", \"population\"], col2label, \"Where are most companies registered?\")" ] }, { "cell_type": "markdown", "id": "1fe20eeb", "metadata": {}, "source": [ "# Companies by address" ] }, { "cell_type": "markdown", "id": "9b20958b", "metadata": {}, "source": [ "We know which municipalities have the most registered companies. Now, let's dive one level deeper.\n", "\n", "Every company has to register its business address. So what swiss addresses are most beloved among companies?" ] }, { "cell_type": "code", "execution_count": null, "id": "3d0f9196", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?street ?zip ?town (COUNT(?company) AS ?ccount)\n", "FROM \n", "WHERE {\n", " ?company a ;\n", " ?address.\n", " \n", " ?address ?street;\n", " ?zip;\n", " ?town.\n", "}\n", "GROUP BY ?town ?zip ?street\n", "ORDER BY DESC (?ccount)\n", "LIMIT 15\n", "\"\"\"\n", "\n", "df = sparql.send_query(query)\n", "df" ] }, { "cell_type": "markdown", "id": "462df2e4", "metadata": {}, "source": [ "Wow! Over 400 companies are registered under the address of Telan AG in Luzern. \n", "\n", "Some of those addresses represent big offices. For example, Technoparkstrasse 1 in Zurich is one of the biggest office buildings in the city. \n", "\n", "Other addresses are simply registrations used by many, many people. Take a look, and google some of the addresses in Zug. How many companies can you find operating under this address?\n", "\n", "-----------------------------------------\n", "\n", "Thus far we had a look at individual buildings. Now, let's aggregate this result on the street level. What are the most crowded streets with company registrations? Can we find some business centers?\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c79f79eb", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT ?thoroughfare ?zip ?town (COUNT(?company) AS ?ccount)\n", "FROM \n", "WHERE {\n", " ?company a ;\n", " ?address.\n", " \n", " ?address ?thoroughfare;\n", " ?zip;\n", " ?town.\n", "}\n", "GROUP BY ?town ?zip ?thoroughfare\n", "HAVING (?ccount >= 100)\n", "ORDER BY DESC (?ccount)\n", "\"\"\"\n", "df = sparql.send_query(query, timeout=60)\n", "df.head(15)\n" ] }, { "cell_type": "markdown", "id": "2c56aa70", "metadata": {}, "source": [ "Slightly different! We see clearly that Zug is the favourite destination for company registrations.\n", "\n", "What are other popular locations?" ] }, { "cell_type": "code", "execution_count": null, "id": "1747db67", "metadata": {}, "outputs": [], "source": [ "geolocator = Nominatim(user_agent=\"zefix\")\n", "find_location = RateLimiter(geolocator.geocode, min_delay_seconds=1)\n", "\n", "def geocode(df: pd.DataFrame, cache_file: str=\"addresses.json\"):\n", "\n", " if os.path.isfile(cache_file): \n", " cached_locations = gpd.read_file(cache_file)\n", " cached_addresses = cached_locations.address\n", " else:\n", " cached_locations = gpd.GeoDataFrame(columns=[\"address\", \"geometry\"], crs=\"EPSG:4326\")\n", " cached_addresses = []\n", " \n", " is_new_address = ~df.address.isin(cached_addresses)\n", "\n", " if any(is_new_address):\n", " \n", " coordinates = []\n", " addresses = []\n", " for address in df.address[is_new_address]:\n", " location = find_location(address)\n", " if coordinates:\n", " coordinates.append(Point(location))\n", " addresses.append(address)\n", "\n", " new_locations = gpd.GeoDataFrame.from_dict({\"address\": addresses, \"geometry\": coordinates})\n", " new_locations = new_locations.set_crs(epsg=4326)\n", " new_locations = new_locations[~new_locations.geometry.is_empty]\n", " all_locations = gpd.GeoDataFrame(pd.concat([new_locations, cached_locations], ignore_index=True), crs=new_locations.crs)\n", " all_locations.to_file(cache_file, driver=\"GeoJSON\")\n", " else:\n", " all_locations = cached_locations\n", "\n", " return pd.merge(all_locations, df, on=\"address\", how=\"right\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7ed2d56d", "metadata": {}, "outputs": [], "source": [ "df.loc[: ,\"address\"] = df[\"thoroughfare\"] + \", \" + df[\"zip\"].astype(str) + \" \" + df[\"town\"] + \", Switzerland\"\n", "plot_df = geocode(df)\n", "\n", "bins = [250, 500, 1000, df.ccount.max()]\n", "plot_df.loc[:,\"text\"] = df[\"thoroughfare\"] + \", \" + df[\"town\"] + \"
Companies: \" + plot_df.ccount.astype(int).astype(str)\n", "classifier = mapclassify.UserDefined(y=df.ccount, bins=bins)\n", "plot_df.loc[:, \"buckets\"] = plot_df[[\"ccount\"]].apply(classifier) \n", "\n", "norm = mpl.colors.Normalize(vmin=0, vmax=len(bins))\n", "colormap = mpl.cm.ScalarMappable(norm=norm, cmap=mpl.cm.inferno)\n", "labels = mapclassify.classifiers._get_mpl_labels(classifier, fmt=\"{:.0f}\")\n", "\n", "fig = go.Figure()\n", "\n", "for bucket in range(len(bins)):\n", "\n", " subset = plot_df[plot_df.buckets == bucket]\n", "\n", " #sizes = (subset.ccount*0.05)**0.9\n", " #sizes[sizes<7] = 7\n", " #sizes[sizes>50] = 50\n", "\n", " fig.add_trace(go.Scattermapbox(\n", " mode=\"markers\",\n", " lat=subset.geometry.y,\n", " lon=subset.geometry.x,\n", " hovertext = subset.text,\n", " hoverinfo = \"text\",\n", " name=labels[bucket],\n", " marker={'size': 7.5, \"color\": \"rgba{}\".format(colormap.to_rgba(bucket+1))}, #((subset.ccount)*0.05)**0.9\n", " ))\n", "\n", "fig.update_layout(\n", " margin={'l': 0, 't': 50, 'b': 0, 'r': 0},\n", " mapbox={\n", " 'center': {\"lat\": 46.80515, \"lon\": 8.1336},\n", " 'style': \"carto-darkmatter\",\n", " 'zoom': 6.9},\n", " showlegend=True,\n", " legend_title=\"Registered companies\",\n", " title_text='Where are most companies registered?', \n", " title_x=0.5,\n", " width=980,\n", " height=600\n", ")\n", "\n", "fig.show(\"notebook\")" ] }, { "cell_type": "markdown", "id": "e524662f", "metadata": {}, "source": [ "Thank you for following along and hopefully this notebook was helpful. If you want to [get in touch with us](https://zazuko.com/about/), please reach out to us [via email](info@zazuko.com)." ] } ], "metadata": { "interpreter": { "hash": "52612798f5eced5fbe3f58e9481faf8401bf1ada0bff399e6095e3d5ee493763" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "title": "Central Business Name Index - Zefix" }, "nbformat": 4, "nbformat_minor": 5 }