{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 3.1 - Historical Provenance - Application 2: CollabMap Data Quality\n", "Assessing the quality of crowdsourced data in CollabMap from its provenance.\n", "\n", "In this notebook, we explore the performance of classification using the provenance of a data entity instead of its dependencies (as shown [here](Application%202%20-%20CollabMap%20Data%20Quality.ipynb) and in the paper). In order to distinguish between the two, we call the former _historical_ provenance and the latter _forward_ provenance. Apart from using the historical provenance, all other steps are the same as [the original experiments](Application%202%20-%20CollabMap%20Data%20Quality.ipynb).\n", "\n", "* **Goal**: To determine if the provenance network analytics method can identify trustworthy data (i.e. buildings, routes, and route sets) contributed by crowd workers in [CollabMap](https://collabmap.org/).\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ \\textit{trusted}, \\textit{uncertain} \\right\\} $.\n", "* **Training data**:\n", " - Buildings: 5175\n", " - Routes: 4997\n", " - Route sets: 4710\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data [Changed]\n", "The CollabMap dataset based on historical provenance is provided in the [`collabmap/ancestor-graphs.csv`](collabmap/ancestor-graphs.csv) file. Each row corresponds to a building, route, or route set created in the application:\n", "* `id`: the identifier of the data entity (i.e. building/route/route set).\n", "* `trust_value`: the beta trust value calculated from the votes for the data entity.\n", "* The remaining columns provide the provenance network metrics calculated from the *historical provenance* graph of the entity." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " trust_value entities agents activities nodes edges \\\n", "id \n", "Route41053.0 0.833333 2 2 2 6 6 \n", "RouteSet9042.1 0.600000 3 2 2 7 9 \n", "Building19305.0 0.428571 1 1 1 3 2 \n", "Building1136.0 0.428571 1 1 1 3 2 \n", "Building24156.0 0.833333 1 1 1 3 2 \n", "\n", " diameter assortativity acc acc_e ... \\\n", "id ... \n", "Route41053.0 4 0.500000 0.555556 0.666667 ... \n", "RouteSet9042.1 4 0.363774 0.750000 0.833333 ... \n", "Building19305.0 2 -1.000000 0.000000 0.000000 ... \n", "Building1136.0 2 -1.000000 0.000000 0.000000 ... \n", "Building24156.0 2 -1.000000 0.000000 0.000000 ... \n", "\n", " mfd_e_a mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e \\\n", "id \n", "Route41053.0 2 3 1 2 3 0 \n", "RouteSet9042.1 2 3 1 2 3 0 \n", "Building19305.0 1 2 0 0 1 0 \n", "Building1136.0 1 2 0 0 1 0 \n", "Building24156.0 1 2 0 0 1 0 \n", "\n", " mfd_ag_a mfd_ag_ag mfd_der powerlaw_alpha \n", "id \n", "Route41053.0 0 0 1 -1.0 \n", "RouteSet9042.1 0 0 1 -1.0 \n", "Building19305.0 0 0 -1 -1.0 \n", "Building1136.0 0 0 -1 -1.0 \n", "Building24156.0 0 0 -1 -1.0 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"collabmap/ancestor-graphs.csv\", index_col='id')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " trust_value entities agents activities nodes \\\n", "count 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 \n", "mean 0.766706 2.786050 1.808090 2.109461 6.703602 \n", "std 0.115301 2.300939 1.220964 1.387607 4.825255 \n", "min 0.153846 1.000000 1.000000 1.000000 3.000000 \n", "25% 0.750000 1.000000 1.000000 1.000000 3.000000 \n", "50% 0.800000 2.000000 1.000000 2.000000 6.000000 \n", "75% 0.833333 3.000000 2.000000 2.000000 7.000000 \n", "max 0.965517 28.000000 19.000000 24.000000 71.000000 \n", "\n", " edges diameter assortativity acc acc_e \\\n", "count 14882.000000 14882.000000 14882.000000 14882.000000 14882.000000 \n", "mean 9.253326 3.175581 -0.146027 0.405963 0.461089 \n", "std 11.594616 1.341400 0.631886 0.305110 0.345399 \n", "min 2.000000 2.000000 -1.000000 0.000000 0.000000 \n", "25% 2.000000 2.000000 -1.000000 0.000000 0.000000 \n", "50% 6.000000 2.000000 0.185369 0.555556 0.661111 \n", "75% 9.000000 4.000000 0.363774 0.622222 0.725000 \n", "max 254.000000 8.000000 0.515307 0.750000 0.833333 \n", "\n", " ... mfd_e_a mfd_e_ag mfd_a_e mfd_a_a \\\n", "count ... 14882.000000 14882.000000 14882.000000 14882.000000 \n", "mean ... 2.007996 2.745935 1.000739 1.646486 \n", "std ... 1.017158 0.997692 1.014742 1.381358 \n", "min ... 1.000000 2.000000 0.000000 0.000000 \n", "25% ... 1.000000 2.000000 0.000000 0.000000 \n", "50% ... 2.000000 2.000000 1.000000 2.000000 \n", "75% ... 2.000000 3.000000 1.000000 2.000000 \n", "max ... 
12.000000 11.000000 10.000000 11.000000 \n", "\n", " mfd_a_ag mfd_ag_e mfd_ag_a mfd_ag_ag mfd_der \\\n", "count 14882.000000 14882.0 14882.0 14882.0 14882.000000 \n", "mean 2.204542 0.0 0.0 0.0 0.670945 \n", "std 1.403419 0.0 0.0 0.0 1.426788 \n", "min 1.000000 0.0 0.0 0.0 -1.000000 \n", "25% 1.000000 0.0 0.0 0.0 -1.000000 \n", "50% 1.000000 0.0 0.0 0.0 1.000000 \n", "75% 3.000000 0.0 0.0 0.0 1.000000 \n", "max 10.000000 0.0 0.0 0.0 11.000000 \n", "\n", " powerlaw_alpha \n", "count 14882.000000 \n", "mean -0.991271 \n", "std 0.230387 \n", "min -1.000000 \n", "25% -1.000000 \n", "50% -1.000000 \n", "75% -1.000000 \n", "max 7.124572 \n", "\n", "[8 rows x 23 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Labelling data\n", "Based on its trust value, we categorise the data entity into two sets: _trusted_ and _uncertain_. Here, the threshold for the trust value, whose range is [0, 1], is chosen to be 0.75." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " trust_value entities agents activities nodes edges \\\n", "id \n", "Route41053.0 0.833333 2 2 2 6 6 \n", "RouteSet9042.1 0.600000 3 2 2 7 9 \n", "Building19305.0 0.428571 1 1 1 3 2 \n", "Building1136.0 0.428571 1 1 1 3 2 \n", "Building24156.0 0.833333 1 1 1 3 2 \n", "\n", " diameter assortativity acc acc_e ... \\\n", "id ... \n", "Route41053.0 4 0.500000 0.555556 0.666667 ... \n", "RouteSet9042.1 4 0.363774 0.750000 0.833333 ... \n", "Building19305.0 2 -1.000000 0.000000 0.000000 ... \n", "Building1136.0 2 -1.000000 0.000000 0.000000 ... \n", "Building24156.0 2 -1.000000 0.000000 0.000000 ... \n", "\n", " mfd_e_ag mfd_a_e mfd_a_a mfd_a_ag mfd_ag_e mfd_ag_a \\\n", "id \n", "Route41053.0 3 1 2 3 0 0 \n", "RouteSet9042.1 3 1 2 3 0 0 \n", "Building19305.0 2 0 0 1 0 0 \n", "Building1136.0 2 0 0 1 0 0 \n", "Building24156.0 2 0 0 1 0 0 \n", "\n", " mfd_ag_ag mfd_der powerlaw_alpha label \n", "id \n", "Route41053.0 0 1 -1.0 Trusted \n", "RouteSet9042.1 0 1 -1.0 Uncertain \n", "Building19305.0 0 -1 -1.0 Uncertain \n", "Building1136.0 0 -1 -1.0 Uncertain \n", "Building24156.0 0 -1 -1.0 Trusted \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trust_threshold = 0.75\n", "df['label'] = df.apply(lambda row: 'Trusted' if row.trust_value >= trust_threshold else 'Uncertain', axis=1)\n", "df.head() # The new label column is the last column below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having used the trust valuue to label all the data entities, we remove the `trust_value` column from the data frame." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14882, 23)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We will not use trust value from now on\n", "df.drop('trust_value', axis=1, inplace=True)\n", "df.shape # the dataframe now has 23 columns (22 metrics + label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering data\n", "We split the dataset into three: buildings, routes, and route sets." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((5175, 23), (4997, 23), (4710, 23))" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_buildings = df.filter(like=\"Building\", axis=0)\n", "df_routes = df.filter(regex=r\"^Route\\d\", axis=0)\n", "df_routesets = df.filter(like=\"RouteSet\", axis=0)\n", "df_buildings.shape, df_routes.shape, df_routesets.shape # The number of data points in each dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Balancing Data\n", "This section explores the balance of each of the three datasets and balances them using the [SMOTE Oversampling Method](https://www.jair.org/media/953/live-953-2037-jair.pdf)." 
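, "\n",
"\n",
"The internals of `balance_smote` are not shown in this notebook. As a minimal sketch only, assuming the `imbalanced-learn` package is installed, a comparable helper might look like the following. The name `balance_smote_sketch` and its parameters are hypothetical and this is not the code in the `analytics` module. With default settings, SMOTE oversamples the minority class until both classes have the same size, which is consistent with the balanced sizes reported in this notebook (e.g. 8982 = 2 x 4491 for buildings).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from imblearn.over_sampling import SMOTE  # assumption: imbalanced-learn is installed\n",
"\n",
"def balance_smote_sketch(df, label_col='label', seed=0):\n",
"    # Hypothetical stand-in for analytics.balance_smote, not its actual code\n",
"    X = df.drop(columns=[label_col])  # the 22 network metrics\n",
"    y = df[label_col]                 # 'Trusted' / 'Uncertain'\n",
"    # Synthesise new minority-class rows by interpolating between neighbours\n",
"    X_res, y_res = SMOTE(random_state=seed).fit_resample(X, y)\n",
"    # Rebuild a data frame with the original column names\n",
"    balanced = pd.DataFrame(np.asarray(X_res), columns=X.columns)\n",
"    balanced[label_col] = np.asarray(y_res)\n",
"    return balanced\n",
"```\n",
"\n",
"Note that the synthetic rows are interpolated, so integer-valued metrics such as `nodes` or `edges` may take fractional values in the balanced data."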
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import balance_smote" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Buildings" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Trusted 4491\n", "Uncertain 684\n", "Name: label, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_buildings.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Balancing the building dataset:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shapes: (5175, 22) (5175,)\n", "Balanced data shapes: (8982, 22) (8982,)\n" ] } ], "source": [ "df_buildings = balance_smote(df_buildings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Routes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Trusted 3908\n", "Uncertain 1089\n", "Name: label, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_routes.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Balancing the route dataset:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shapes: (4997, 22) (4997,)\n", "Balanced data shapes: (7816, 22) (7816,)\n" ] } ], "source": [ "df_routes = balance_smote(df_routes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Route Sets" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Trusted 3019\n", "Uncertain 1691\n", "Name: label, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "df_routesets.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Balancing the route set dataset:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original data shapes: (4710, 22) (4710,)\n", "Balanced data shapes: (6038, 22) (6038,)\n" ] } ], "source": [ "df_routesets = balance_smote(df_routesets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross Validation\n", "\n", "We now run the cross validation tests on the three balanced datasets (`df_buildings`, `df_routes`, and `df_routesets`) using all the features (`combined`), only the generic network metrics (`generic`), and only the provenance-specific network metrics (`provenance`). Please refer to [Cross Validation Code.ipynb](Cross%20Validation%20Code.ipynb) for the detailed description of the cross validation code." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from analytics import test_classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building Classification\n", "\n", "We test the classification of buildings, collecting the individual accuracy scores in `results` and the importance of every feature in each test in `importances` (both are pandas DataFrames). These two tables will also be used to collect data from testing the classification of routes and route sets later." 
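, "\n",
"\n",
"`test_classification` comes from the `analytics` module and its code is not reproduced here. As a rough sketch only, assuming a scikit-learn decision tree and 10-fold cross validation repeated 100 times, one feature set could be scored as follows. Both the classifier and the fold counts are assumptions (the 3,000 result rows per data type in this notebook are consistent with 1,000 scores for each of the three feature sets), and the name `cv_accuracy_sketch` is hypothetical.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"def cv_accuracy_sketch(df, features, label_col='label', n_splits=10, n_repeats=100, seed=0):\n",
"    # Hypothetical stand-in for analytics.test_classification, not its actual code\n",
"    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)\n",
"    clf = DecisionTreeClassifier(random_state=seed)\n",
"    # One accuracy score per fold per repeat (n_splits * n_repeats scores in total)\n",
"    scores = cross_val_score(clf, df[features], df[label_col], cv=cv, scoring='accuracy')\n",
"    return pd.DataFrame({'Accuracy': scores})\n",
"```\n",
"\n",
"Calling such a helper once per feature set and tagging each result with a `Metrics` value would give a table shaped like `results`."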
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1907: RuntimeWarning: invalid value encountered in multiply\n", " lower_bound = self.a * scale + loc\n", "/Users/tdh/.virtualenvs/datasets-provanalytics-dmkd/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1908: RuntimeWarning: invalid value encountered in multiply\n", " upper_bound = self.b * scale + loc\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 50.00% ±nan <-- combined\n", "Accuracy: 50.00% ±nan <-- generic\n", "Accuracy: 50.00% ±nan <-- provenance\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Accuracy Metrics Data Type\n", "2995 0.5 provenance Building\n", "2996 0.5 provenance Building\n", "2997 0.5 provenance Building\n", "2998 0.5 provenance Building\n", "2999 0.5 provenance Building" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cross validation test on building classification\n", "res, imps = test_classification(df_buildings)\n", "\n", "# adding the Data Type column\n", "res['Data Type'] = 'Building'\n", "imps['Data Type'] = 'Building'\n", "\n", "# storing the results and importance of features\n", "results = res\n", "importances = imps\n", "\n", "# showing a few newest rows\n", "results.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Route Classification" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 61.20% ±0.0983 <-- combined\n", "Accuracy: 61.19% ±0.1001 <-- generic\n", "Accuracy: 61.23% ±0.0972 <-- provenance\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Accuracy Metrics Data Type\n", "5995 0.593350 provenance Route\n", "5996 0.606138 provenance Route\n", "5997 0.631714 provenance Route\n", "5998 0.607692 provenance Route\n", "5999 0.612821 provenance Route" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cross validation test on route classification\n", "res, imps = test_classification(df_routes)\n", "\n", "# adding the Data Type column\n", "res['Data Type'] = 'Route'\n", "imps['Data Type'] = 'Route'\n", "\n", "# storing the results and importance of features\n", "# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)\n", "results = pd.concat([results, res], ignore_index=True)\n", "importances = pd.concat([importances, imps], ignore_index=True)\n", "\n", "# showing a few newest rows\n", "results.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Route Set Classification" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 63.38% ±0.0974 <-- combined\n", "Accuracy: 63.31% ±0.0972 <-- generic\n", "Accuracy: 63.42% ±0.0970 <-- provenance\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Accuracy Metrics Data Type\n", "8995 0.678808 provenance Route Set\n", "8996 0.637417 provenance Route Set\n", "8997 0.645695 provenance Route Set\n", "8998 0.639073 provenance Route Set\n", "8999 0.661130 provenance Route Set" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cross validation test on route set classification\n", "res, imps = test_classification(df_routesets)\n", "\n", "# adding the Data Type column\n", "res['Data Type'] = 'Route Set'\n", "imps['Data Type'] = 'Route Set'\n", "\n", "# storing the results and importance of features\n", "# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)\n", "results = pd.concat([results, res], ignore_index=True)\n", "importances = pd.concat([importances, imps], ignore_index=True)\n", "\n", "# showing a few newest rows\n", "results.tail()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Charting the accuracy scores" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import seaborn as sns\n", "sns.set_style(\"whitegrid\")\n", "sns.set_context(\"paper\", font_scale=1.4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting the accuracy score from [0, 1] to a percentage, i.e. [0, 100]:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Accuracy Metrics Data Type\n", "0 50.0 combined Building\n", "1 50.0 combined Building\n", "2 50.0 combined Building\n", "3 50.0 combined Building\n", "4 50.0 combined Building" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.Accuracy = results.Accuracy * 100\n", "results.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from matplotlib.font_manager import FontProperties\n", "fontP = FontProperties()\n", "fontP.set_size(12)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAETCAYAAAAs4pGmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XlYVGX/BvCbHSZU0EhBeYEfipnEIpsLiICKsiiplBtR\n4oYolgoqLrjvaGpqYlbuW5ICmpLlDm5oYrhkgr4gCkKo7Ov5/eHlvJEHnUEYUO7PdXXlnDnnOd+Z\nZ5h7zvYcJUEQBBAREf2Lcn0XQEREDRMDgoiIRDEgiIhIFAOCiIhEMSCIiEgUA4KIiEQpNCASEhLg\n4+MDa2trfPLJJ7h69SoA4MmTJwgKCoKNjQ169OiBffv2KbIsIiISobCASE9PR2BgIIYOHYqLFy8i\nMDAQo0ePxqNHjzBr1ixIJBLEx8djzZo1WLFiBX7//XdFlUZERCIUFhCnTp2CmZkZPv74Y6iqqqJH\njx6wsLDAkSNHcOzYMQQHB0NDQwMWFhbw8vLCgQMHFFUaERGJUFhAVFZWQlNTs+rKlZVx9uxZqKqq\nwtDQUDrdxMQEKSkpiiqNiIhEqCpqRY6OjlixYgWOHDkCNzc3JCQkICEhAVZWVi8Eh6amJoqLi2Vu\nOzExsbbLJSJ669nY2Lz0eYUFhLGxMb766iusXLkS4eHhcHR0RJ8+fZCdnY2SkpIq8xYXF0MikcjV\nfseOHWuzXCKit1pycvIr51FYQOTn50NfXx/R0dHSaR9//DGGDh2Kc+fOISMjAwYGBgCA1NRUtG3b\nVq72/70VQkREr0dhxyAeP36MwYMHIzk5GaWlpdixYwcePHiA3r17w83NDRERESgqKkJSUhJiY2Ph\n7e2tqNKIiEiEwrYg2rRpgzlz5mDChAl4/PgxOnbsiO+++w4SiQTz589HeHg4nJ2dIZFIEBISAktL\nS0WVRkREIpTehvtBJCYmvvJgCxER/Y8s35scaoOIiEQxIIiISBQDgoiIRDEgiIhIFAOCiIhEMSCI\niEiUwq6DUJRuc4YoZD1n5+xSyHpkMW3aNOjq6mLq1KkvPBcdHY09e/Zgx44dtbrO4OBgtGvXDhMm\nTKjVdl9m3bGdCllPUM+hClnP2+jv4jyFrKe5ZhOFrKex4xbEW65fv361Hg7UsIwcORJ79uyp7zLo\nLcSAUJALFy5g4MCBsLa2hqenJ86cOYOCggLM
nTsX3bp1Q7du3TBjxgzk5T37BbZ27VqEhYVhzJgx\nsLa2ho+PD65evYqRI0fC2toavr6+ePDggbT9+/fvY/jw4bC2tsaIESOQkZEBAIiKisKAAQOkbU6Z\nMkXapoeHB86cOSNt4+LFixg4cCBsbW3h6+uLpKQk6XPXr1/HoEGDYGVlhTFjxuDJkyeKeNtIBt9+\n+y0++eST+i6D3kIMCAXIycnB2LFjMXToUFy6dAmTJ0/GhAkT8MUXXyAlJQUxMTE4fPgwsrOzMXv2\nbOly0dHRGDVqFC5cuIAmTZrA398f48aNQ0JCAjQ1NbF161bpvKdPn8YXX3yB8+fPo3Xr1pg0aZJo\nLUeOHMFnn32G8+fPw9nZGfPnzwcAZGRkYMyYMQgMDMS5c+cwYsQIjBo1Co8fP0ZpaSkCAwPh7u6O\nixcvwtfXFxcuXKjbN+0NEhcXB3d3dzg4OCAsLAyDBw9GVFQUHj9+jJCQEHTp0gWurq6IjIzE84EL\npk2bhgULFmDo0KGwtrbGgAEDqoyuGRcXBy8vL9ja2sLf3x+pqakAnt2Z0cbGBtOmTYOtrS0OHjwI\nPz8/bN++HQDw4MEDjB07Fp06dYKTkxO+//57xb8hDUx6ejqsra2xbt062NnZwdHREVu2bAEAuLq6\nYtasWXBwcEB4eDjKy8vx1VdfoXv37nBwcEBwcDAyMzNRWVkJZ2dnnDhxQtruuXPn4OjoiIqKihr3\ndWVlJb766iv06dMH1tbWcHZ2xu7du6V129raIjIyEt26dUOXLl2waNEi6fpf1tfVfX7kxYBQgBMn\nTuA///kPBg4cCBUVFbi6umLjxo2Ij49HSEgImjdvjmbNmmHq1Kn4+eefpffCsLa2hq2tLdTU1GBj\nYwMrKyt06tQJmpqasLW1lW4lAIC3tzdsbW2hrq6OKVOm4MqVK1W2MJ6zsrJCly5doK6uDm9vb9y7\ndw8AEBsbCwcHB/Ts2ROqqqro27cvzMzMcPToUSQmJqKkpAQBAQFQU1NDz5490blzZ8W8eQ1camoq\nQkJCEBYWhjNnzuA///kPrly5AgAIDQ2FkpISfv31V2zduhXR0dGIioqSLnvw4EHMnj0bCQkJMDIy\nwsqVKwEASUlJCAsLw9y5c5GQkAAXFxeMGTMGZWVlAJ6NjNy6dWvEx8ejd+/eVeqZOHEi9PT0cPbs\nWWzfvh3ffvttla3ExqqwsBC3bt3CyZMn8c033+Drr7/GqVOnADz7cXTy5EmEhIRgzZo1+PXXX7Fz\n506cOHECTZs2xcSJE6GkpARvb28cOnRI2mZMTAy8vb2hoqJS476Ojo5GXFwctm3bhsuXL2Py5MlY\ntGgRCgoKAAB5eXlIT0/H8ePHsWHDBuzcuVP6+aqur1/1+ZEHA0IBcnJy0KpVqyrTjI2NUV5eLh3i\nHABat24NQRCQmZkJANDR0ZE+p6KigqZNm0ofKysr45/DaP2znWbNmkEikeDRo0cv1NK8eXPpv1VV\nVaVtZGRk4PTp07C1tZX+d+3aNTx48ADZ2dnQ09ODsvL/Pi6tW7eW+314Gx06dAjdunWDs7Mz1NTU\nMGbMGLz33nvIzs7GqVOnMH36dEgkErRp0wYBAQHYt2+fdFlXV1e8//770NTUhIeHB+7evQsA+PHH\nH+Hj4wMbGxuoqanhs88+Q3l5Oc6fPy9d1tvbG+rq6tDS0pJOS0tLw9WrVxEaGgotLS0YGRlhy5Yt\n+OCDDxT2fjRkM2bMgEQigbm5OXx8fKRf9u7u7tDU1IS2tjYOHjyI8ePHo02bNtDS0kJYWBiSkpKQ\nkpICHx8f/PrrrygpKUFpaSni4uLQv39/PHr0qMZ93bNnT2zZsgXvvvsuMjMzoaGhgZKSkiq7cEeN\nGgV1dXVYWVnh//7v/3Dv3r2X9rUsnx9ZvXVnMTVE7733nvRL/7n9+/dDSUkJ9+/fl35pp6enQ1lZ\nWfpYSUlJ
5nVkZ2dL/52bm4vCwkIYGBjgr7/+kml5PT09eHh4YNmyZdJpaWlp0NXVRXJyMjIzM1Fe\nXg5V1WcfmczMTLRs2VLm+t5WWVlZ0NfXlz5WUlKCvr4+lJSUIAgCevXqJX2usrKySuhXF9YPHjzA\n+fPnq9yXvaysDA8ePICxsTEA4N13332hlpycHEgkEjRp8r8zfOS9r8rbSkNDo8rntVWrVtLbGv/z\nvczJyanyY0sikUBXVxeZmZno2rUrjI2NceLECaioqEBfXx/vv/8+kpKSatzXZWVlWLBgARISEqCv\nr48OHTpIl69u2crKypf29cs+P/JiQCiAs7MzFi1ahIMHD8LLywsnT57E999/j48++ggrVqzAqlWr\noKKigmXLlsHZ2blKp8sqOjoa3t7eaN++PZYuXQpnZ2fRL5HqeHp6wtfXFwkJCejcuTMuX76MkSNH\nYv369bC1tUXTpk2xdu1aBAUFISEhAWfPnoWFhYXcdb5t9PX1qxzMf74FWFpaClVVVcTHx0NdXR0A\n8OTJE+mug5fR09NDQEAAJk6cKJ129+5dtGzZEjk5OQDEfzy0bNkShYWFyMvLk36GYmNj0bRpU3Tv\n3v21Xueb7vmv8mbNmgF4tsXcqlUrpKamVnkvDQwMcP/+fXz44YcAgIKCAuTm5qJFixYAgP79++PI\nkSNQVlZG//79ATzrr5r29cqVKyEIAk6fPg0NDQ1kZGTgp59+euVyL+vrl31+5MVdTAqgq6uLjRs3\nYseOHbC3t8fq1auxbt06zJgxA8bGxujXrx969uwJXV3dKr/g5eHq6orZs2fD0dERhYWFWLJkiVzL\nP78l7PLly2FjY4OpU6di+vTp6NKlC9TU1LBx40ZcuHAB9vb2iIyMRI8ePWpU59vGy8sL8fHxOH36\nNMrLy7FlyxY8fPgQ+vr6sLGxwfLly1FcXIzHjx8jODgYq1atemWbPj4+2LdvH5KTkyEIAn755Rd4\neXm98hegvr4+bG1tERERgZKSEty9exdLliyRbvU1dhERESgtLUVSUhIOHjwIHx+fF+bx8fHBunXr\ncP/+fRQVFWHx4sVo27YtzMzMADzbtfe8v728vADgtfo6Pz8f6urqUFFRQW5uLpYuXQoAKC8vf+ly\nL+vrmn5+RAlvgUuXLtV3CdSIHTp0SHBxcRHs7OyEmTNnCj169BCio6OFR48eCV9++aXQpUsXwd7e\nXpgyZYqQl5cnCIIgTJ06VViyZIm0jd9++01wcXGRPo6JiRE8PDwEKysrwdPTU/jll18EQRCEtLQ0\nwczMTMjPz5fOO3z4cGHbtm2CIAjCw4cPhXHjxgn29vaCs7OzsHPnTkW8BQ3a8/dsyZIlQpcuXQQX\nFxchKipKEARBcHFxEX777TfpvKWlpcLKlSuF7t27C506dRICAwOFBw8eVGlvzJgxwueff15lWk37\nOiUlRfD19RWsrKyE7t27CxEREULPnj2FY8eOifb1Rx99JOzfv18QhJf3dXWfn3+S5XuTNwwieg0Z\nGRkoLCyssq+/a9euWLZsGRwdHeuxMnouPT0dbm5uuHz5Mt555536LqfB4A2DiOpYVlYWPv30U6Sl\npaGyshK7du1CaWkprKys6rs0otfGnZNEr8HKygqjR4+Gn58fnjx5AlNTU3zzzTfQ1tau79KIXht3\nMRERNULcxURERDXGgCAiIlEMCCIiEsWAICIiUQwIIiISxYAgIiJRDAgiIhLFgCAiIlEMCCIiEsWA\nICIiUQwIIiISxYAgIiJRDAgiIhLFgCAiIlEKDYjLly9jwIAB6NSpE9zd3RETEwPg2Q2+g4KCYGNj\ngx49emDfvn2KLIuIiEQo7IZBFRUVCAoKQnh4OPr06YNLly7B398f1tbWWLZsGSQSCeLj43Hr1i2M\nGjUK7dq14125iIjqkcK2IJ4+fYq///4bFRUVEAQBSkpKUFNTg4qKCo4dO4
bg4GBoaGjAwsICXl5e\nOHDggKJKIyIiEQrbgtDV1cXQoUMxadIkhISEoLKyEgsXLkRubi5UVVVhaGgondfExARxcXFytV9c\nXFzbJRMRNWoKC4jKykpoampi9erVcHV1RXx8PCZPnowNGzZAU1Ozyryamppyf+EnJyfXZrlERI2e\nwgIiLi4OSUlJmDp1KgCgR48e6NGjB9auXYuSkpIq8xYXF0MikcjVfseOHWutViKit50sP6oVFhAP\nHjxAaWlp1ZWrqqJjx45ITExERkYGDAwMAACpqalo27atXO3/eyuEiIhej8IOUnft2hU3btzA/v37\nIQgCLly4gF9++QWenp5wc3NDREQEioqKkJSUhNjYWHh7eyuqNCIiEqEkCIKgqJX99ttvWL16NdLS\n0mBgYICJEyeiV69eePz4McLDw5GQkACJRILx48dj0KBBMrebmJgIGxubOqyciOjtIsv3pkIDoq4w\nIIiI5CPL9yaH2iAiIlEMCCIiEsWAICIiUQwIIiISxYAgIiJRDAgiIhLFgCAiIlEMCCIiEsWAICIi\nUQwIIiISxYAgIiJRDAgiIhLFgCAiIlEMCCIiEsWAICIiUQwIIiISxYAgIiJRDAgiIhKlKs/Mqamp\nyMnJgbKyMvT09GBoaFhXdRERUT17ZUAkJiZi+/btOHv2LJ4+fSqdrqSkhGbNmsHJyQlDhgxBp06d\n6rRQIiJSrGoD4t69e5g9ezYePnwINzc3rF69GqamptDR0UFlZSVyc3Nx8+ZNXLp0CZMnT0abNm0w\nb948mJiYKLJ+IiKqI0qCIAhiTwwePBhBQUFwcnJ6ZSOCIOC3335DZGQk9uzZU+tFvkpiYiJsbGwU\nvl4iojeVLN+b1QbEm4QBQUQkH1m+N2t0FlN+fj7y8/NrVBQREb0Z5AqIGzduwNvbG7a2trCzs4On\npyeSkpLqqjYiIqpHcgXEjBkzEBwcjKtXr+LixYvw9fVFaGhoXdVGRET1qNqACA4Oxl9//VVlWkFB\nAQwNDaGhoQFtbW0YGBhwVxMR0Vuq2tNc+/fvj9DQUBgbG2PChAkwMTFBaGgo/P39oaamhsrKSpSX\nl2Pu3LmKrJeIiBSk2oBwc3ODm5sbjhw5gokTJ+L999/H+PHjcerUKdy5cwfKysowMjKClpaWIusl\nIiIFeeWV1H369IG7uztiYmIQGBgIS0tLjB8/HgYGBoqoj4iI6slLA+LkyZP466+/0Lp1a3h5ecHL\nyws//fQTRowYAXt7e4wbNw6tWrVSVK1ERKRA1R6kXrBgAebOnYvr16/j66+/RmBgIJSVlTFw4EDE\nxsaiQ4cO+PTTTzF//nxF1ktERApS7RbEwYMHsXv3bpiamqKoqAi2trbIzc2Frq4uVFVVMWTIEAwc\nOLBehtYgIqK6V21AGBoaYseOHejZsyeuXbuGJk2aoGnTplXmUVdXh5+fn0wrio6ORnh4eJVpRUVF\n8PX1xZQpUxAWFoZz586hSZMmCAoKgq+vbw1eDhER1ZZqA2LVqlWIiIjAggULYGBggI0bN0JFRaXG\nK+rXrx/69esnfRwfH4/Q0FAEBQVh1qxZkEgkiI+Px61btzBq1Ci0a9cOVlZWNV4fERG9nmoDwsjI\nCGvWrKmTlRYUFGDatGmYM2cOmjRpgmPHjuHo0aPQ0NCAhYUFvLy8cODAAQYEEVE9qjYgIiMj8dln\nn0FdXV2mhoqKirBlyxaMHTv2lfN+++23MDMzQ8+ePXH9+nWoqqpWuTudiYkJ4uLiZFrvc8XFxXLN\nT0REL/fS01w9PT3h5eWFnj17omPHjqLz3LhxA9HR0Thy5AgGDx78yhUWFBRg+/bt2LRpEwCgsLAQ\nmpqaVebR1NSU+ws/OTlZrvmJiOjlqg2I0aNHw8PDA5s3b8bw4cOhoaGBtm3bQldXV3pHudu3b6O8\nvBwfffQRtm3bhjZt2rxyhceOHYOBgY
F095GWlhZKSkqqzFNcXAyJRCLXC6kuwIiI6EWy/Kh+6RZE\nmzZtEB4ejilTpuDChQtITk5GTk4OlJWVYW5ujrFjx6Jz584y74YCgOPHj6Nv377Sx0ZGRigrK0NG\nRob06uzU1FS0bdtW5jYBvLAVQkREr+eVQ20AwDvvvAMXFxe4uLi89gqvXr1aZVeUtrY23NzcpGdM\n3b59G7GxsYiMjHztdRERUc3V6I5yNVVRUYEHDx5AT0+vyvT58+ejvLwczs7OCA4ORkhICCwtLRVZ\nGhER/YtMWxC1RUVFBTdv3nxhuo6ODlavXq3IUoiI6BUUugVBRERvDpkC4saNG3VdBxERNTAyBYSv\nry88PT2xYcMGpKWl1XVNRETUAMgUEGfPnoW/vz/OnTuHPn36YPDgwdixYwf+/vvvuq6PiIjqiZIg\nCII8C2RlZSEuLg7Hjx/H5cuXYWtri/79+6N3795yXQ9RmxITE2FjY1Mv6yYiehPJ8r0p90Hq4uJi\nFBQUID8/H2VlZaisrERkZCRcXFxw8uTJGhdLREQNi0ynuWZkZODnn3/GoUOHcOPGDemIq+vXr0eL\nFi0AACtXrsT06dMRHx9fpwUTEZFiyBQQrq6uMDIygre3N1atWgUjI6MX5rGzsxO9xoGIiN5MMgXE\n3r17YWFhUWVafn4+tLW1pY+dnJzg5ORUu9UREVG9kekYRJs2bTB27NgqNxDq06cPgoKC8OTJkzor\njoiI6o9MATFnzhzk5+fD09NTOm3z5s14+vQpFi5cWGfFERFR/ZFpF1N8fDz27NkDU1NT6bT27dtj\n5syZ+PTTT+usOCIiqj8ybUFoaGiIXhRXUFBQ6wUREVHDIFNAeHh4YObMmTh9+jRyc3ORm5uL+Ph4\nhIeHo0+fPnVdIxER1QOZdjGFhITg6dOnCAwMREVFBQBAWVkZgwYNwrRp0+q0QCIiqh9yDbWRn5+P\n1NRUqKmpwdDQEO+8805d1iYzDrVBRCQfWb43Zb5hUGZmJlJSUqRbEI8ePUJpaSmSk5MRHBz8epUS\nEVGDI1NA7NixA4sWLUJFRQWUlJTwfKNDSUkJlpaWDAgioreQTAepN2/ejMDAQFy7dg0tWrTAiRMn\nEBsbi/fffx+9evWq6xqJiKgeyBQQWVlZ6N+/P9TU1NChQwf8/vvvaNu2LaZPn459+/bVdY1ERFQP\nZAoIHR0d5OXlAQBMTExw69YtAEDr1q3x8OHDuquOiIjqjUwB4eLigtmzZ+PmzZvo3LkzDh48iMuX\nL2Pbtm3Q19ev6xqJiKgeyHSQetq0aVi8eDFu3rwJHx8fHD16FEOHDoW2tjYiIiLqukYiokYhICAA\nd+/eBQAYGxtj8+bN9VqPTAFx+vRphISEoFmzZgCApUuXYvr06dDW1oaqqsxnyhJRI9ZtzpAaLadx\nJgtKAiAoASWO78m9/FBH7xqt99im/Xia/RhN39VBz1ED5V5+SA3WWyFUVvn338V5ci3fXLOJ3Ot8\nGZm+3WfPno1du3ZJAwJ4dlyCiKiu1SQU3lQr16959UwKJFNAmJub49SpU1VGcyUiepvVZKvhbSNT\nQKirq2Pp0qVYt24d2rRpA01NzSrP7969u06KIyKi+iPzFoS5uXld10JERA2ITAExfvz4uq6DiIga\nGJkCYvr06S99fvHixbVSDBERNRwyBURJSUmVx+Xl5UhPT8edO3cwePDgOimMiIjql0wBsXLlStHp\n69evR0ZGRq0WREREDcNrXeXWr18/9OvXDwsWLKiteoiqVdMLrdQTc6BcWIFKiQpKbVrIvfzZObtq\ntN7nV8XW9IpYeS+Sem7SuGCk3fsvDI3+U6Pz6mv7Yit6c71WQPz8888N5q5yRHVl3bGdNVru7/zH\n0v/XpI2aXIlLVJtkCghHR8cXphUWFqKoqOiVB7D/6eHDhwgPD8fFixehra2NkSNH4tNPP8WTJ08Q\nFh
aGc+fOoUmTJggKCoKvr6/sr4LoJWqy1VAb6utCq4Z2NS69uWQKiMmTJ1d5rKSkBDU1NZibm8PI\nyEimFQmCgHHjxsHBwQFff/017t69i2HDhsHc3Bw//PADJBIJ4uPjcevWLYwaNQrt2rWDlZWV/K+I\niIhqhUwB8dFHH+Hhw4fIz89H27ZtAQA//fQT1NXVZV7R1atXkZWVhSlTpkBFRQXt2rXD7t27oaGh\ngWPHjuHo0aPQ0NCAhYUFvLy8cODAAQYEEVE9kikgzpw5g/Hjx2PEiBHS+0//+OOPWLBgATZu3Ahb\nW9tXtpGcnIx27dph+fLliImJgba2NsaOHYv27dtDVVUVhoaG0nlNTEwQFxcn1wspLi6Wa34iEse/\npTdXbfedTAGxYsUKjBs3DqNHj5ZO27FjBzZu3IjFixdj//79r2zjyZMnOH/+PDp37ozjx4/jjz/+\nwMiRIxEZGfnC2E6amppyv9Dk5GS55icicfxbenPVdt/JFBCpqano27fvC9M9PDywfv16mVakrq6O\nZs2aYcyYMQCATp06wd3dHWvWrHnhQrzi4mJIJBKZ2n2uY8eOcs1Pb6CY+i6gcaizvyX2X52Tp+9k\nCROZAsLIyAgnTpyAn59flenx8fFo1aqVTMWYmJigoqICFRUVUFFRAQBUVFTggw8+wKVLl5CRkQED\nAwMAzwLp+bEOWf17K4SIaoZ/S2+u2u47mQJi3LhxmDx5Mi5fvowPP/wQAHD9+nUcPXpU5nGYunXr\nBk1NTXz99dcICgpCUlISfvnlF3z//fe4f/8+IiIisGDBAty+fRuxsbGIjIys+asiIqLXJlNA9OnT\nBzo6Oti1axeioqKgpqYGY2NjbNu2TeYzjTQ1NbFt2zbMmzcPXbt2hba2NmbOnAkrKyvMnz8f4eHh\ncHZ2hkQiQUhICCwtLV/rhRER0euR+Upqa2trmJiYoGXLlgCAhIQEmJmZybUyIyMj0SEHdHR0sHr1\narnaIiKiuqUsy0zXrl2Di4sLfvjhB+m02bNno2/fvvjzzz/rqjYiIqpHMgXEwoUL4eHhgUmTJkmn\nxcXFoWfPnpg/f36dFUdERPVHpoC4efMm/P39oaamJp2mpKQEf39//PHHH3VWHBER1R+ZAqJly5a4\ncuXKC9OTk5Oho6NT60UREVH9k+kgtb+/P8LDw3H79m2Ym5sDeHaa644dO3i/aiKit5RMATF06FBo\naGhg165d2L59O9TU1GBiYoKZM2dCQ0OjrmskIqJ6IPNprgMHDsTAgc/Gt09OTkZUVBQWL16Mp0+f\nig7DQUREbzaZAyI3NxfR0dGIiorCn3/+CTU1NfTu3RvDhg2ry/qIiKievDQgKisrcfLkSURFReHE\niRMoKyuDubk5lJSUsH37dlhYWCiqTiIiUrBqA2LZsmWIjo7G48ePYWVlhcmTJ6N3794wMDBAx44d\n5R5tlYiI3izVBsR3330HIyMjhIaGwtXVFdra2oqsi4iI6lm110Fs3LgRFhYWCA8PR+fOnREQEIC9\ne/ciJydHkfUREVE9qTYgnJ2dsXz5csTHx2Px4sVQVVXFvHnz0L17d1RWVuL48eMoKipSZK1ERKRA\nr7ySWktLC97e3ti4cSNOnTqFsLAwWFpaIiIiAo6OjggPD1dEnUREpGAyn+YKAM2bN8ewYcMwbNgw\npKWlISYmBocOHaqr2oiIqB7JNBaTGENDQ4wbN44BQUT0lqpxQBAR0duNAUFERKIYEEREJIoBQURE\nohgQREQkigFBRESiGBBERCSKAUFERKIYEEREJIoBQUREohgQREQkigFBRESiGBBERCSKAUFERKIY\nEEREJIoBQUREohgQREQkSqEBsXnzZpibm8Pa2lr636VLl/DkyRMEBQXBxsYGPXr0wL59+xRZFhER\niZDrntSv6/r16/jyyy8REBBQZXpwcDAkEgni4+Nx69YtjBo1Cu3a
tYOVlZUiyyMion9Q6BbEjRs3\n0KFDhyrTCgoKcOzYMQQHB0NDQwMWFhbw8vLCgQMHFFkaERH9i8K2IIqKipCamoqtW7ciJCQETZs2\nRUBAAD7xNvhLAAARtUlEQVT44AOoqqrC0NBQOq+JiQni4uLkar+4uLi2SyZqlPi39Oaq7b5TWEBk\nZ2fDxsYGQ4YMwZo1a5CUlISxY8fi888/h6amZpV5NTU15X6hycnJMs0XHLNCrnaf0ziTBSUBEJSA\nEsf35F5+qKO33MscWPY9KisqAQDKKsrwCf1cruWH1GCdADDExxflZWVQVVPDrgPyHw9KTf6zRuul\nhkHWvyVqeGq77xQWEIaGhti+fbv0sa2tLfr3749Lly6hpKSkyrzFxcWQSCRytd+xY0fZZoyRq1mp\nmoTC65I3EGpLTULhn2TuC3nVsO9IPuy/N5c8fSdLmCgsIJKTk3H27FmMHj1aOq2kpAT6+vooKytD\nRkYGDAwMAACpqalo27atXO3/eyuE6g/74s3G/ntz1XbfKewgtUQiwddff40jR46gsrISCQkJOHTo\nEIYNGwY3NzdERESgqKgISUlJiI2Nhbd3zXaPEBFR7VDYFoSJiQm++uorrFq1CtOmTUPLli2xePFi\ndOzYEfPnz0d4eDicnZ0hkUgQEhICS0tLRZVGREQiFHodhKurK1xdXV+YrqOjg9WrVyuyFCIiegUO\ntUFERKIYEEREJIoBQUREohgQREQkigFBRESiGBBERCSKAUFERKIYEEREJIoBQUREohgQREQkigFB\nRESiGBBERCSKAUFERKIYEEREJIoBQUREohgQREQkigFBRESiGBBERCSKAUFERKIYEEREJIoBQURE\nohgQREQkigFBRESiGBBERCSKAUFERKIYEEREJIoBQUREohgQREQkigFBRESiGBBERCSKAUFERKIY\nEEREJIoBQUREohgQREQkSuEBkZ2djS5duuD48eMAgPT0dPj7+8Pa2hru7u7S6UREVL8UHhAzZszA\n48ePpY8nTpwICwsLXLhwAWFhYZg8eTIyMjIUXRYREf2LQgNi165d0NLSgr6+PgDgzp07+PPPPxEU\nFAQ1NTU4OzvD3t4ehw4dUmRZREQkQlVRK0pNTcX333+PvXv3YsCAAQCAlJQUtG7dGpqamtL5TExM\nkJKSInf7xcXFtVYrvR72xZuN/ffmqu2+U0hAlJeXIzQ0FDNmzICOjo50emFhIbS0tKrMq6mpWaMX\nmZycLNN8a7ynyN32myo1+c/6LqFWNaa+A9h/b7q3of8UEhDr169Hhw4d4OzsXGW6lpbWC2FQXFwM\niUQiV/s2NjavXSMREVWlkIA4fPgwHj16hMOHDwMA8vPzMWnSJIwdOxb3799HaWkp1NXVATzbFeXg\n4KCIsoiI6CWUBEEQFL1SV1dXzJo1Cy4uLhgwYAA6d+6ML774AgkJCfjiiy9w+PBh6YFsIiKqHwo7\nSF2dtWvXYvbs2ejSpQveffddrFy5kuFARNQA1MsWBBERNXwcaoOIiEQxIIiISBQDgqgepaWl1XcJ\n9A/11R/l5eV4+PBhvaz7ZRgQtaR9+/awtLSEtbU1rKys0KNHD3zzzTcyLZuRkQFra2sUFhbi/Pnz\nLz3N18HBAefPnwcAeHp64tSpU7VSP4n7Z78+79vevXtj3759r9329evXMWTIkFqosvFoqP2Rn5+P\nOXPmwNHREVZWVnB1dcXy5ctRWloq0/KTJk3CsWPHarTuulTvZzG9Tfbt2wczMzMAwN27dzFkyBCY\nmpqiV69eL13OwMAAV65ckXt9HLNKMf7ZrxUVFTh06BCmTp2KTp06wdTUtMbt5uXloaysrLbKbDQa\nYn/Mnz8feXl5OHjwIFq0aIF79+5h0qRJKC4uxqxZs165fG5ubo3WW9e4BVFHjI2NYWdnh+vXrwMA\noqKipGNQAUBBQQHat2+P9PR0
pKeno3379igoKHihnZiYGLi5uaFTp05Yvnx5ledcXV2lw6O3b98e\nW7duhYuLC+zt7TFlyhTpr5fMzEwEBASgU6dOGDhwIJYuXQo/P7+6eulvNRUVFfTr1w/NmjXD7du3\nAQD37t3DmDFjYGdnBzc3N2zatAnPTw708/PD9u3bpctv374dfn5+yMnJwahRo/D48WNYW1sjNzcX\nxcXFWLBgAZycnODo6IilS5fK/Au0sWoo/XHt2jW4urqiRYsWAAAjIyOEhYWhadOm0nkuXryIgQMH\nwtbWFr6+vkhKSgIALFy4EJcuXcKSJUuwZMmSOnmfaooBUUdu3LiBq1evonv37jVu4+bNm5g5cyYW\nLVqEc+fOQUlJqcpQ6f+WkJCAmJgY7NmzB2fOnEFcXByAZ5uvrVq1QkJCAubOnYuoqKga19TYlZaW\nYuvWrSgpKYGVlRVKS0vx+eefw9TUFGfPnkVkZCT27NmD3bt3v7SdFi1aYNOmTdDR0cGVK1egq6uL\npUuXIiUlBdHR0YiOjsYff/wh827Kxqqh9IeHhwcWL16M+fPn49ixY8jJyYGNjQ0mTpwI4Nlu5DFj\nxiAwMBDnzp3DiBEjpIE0Y8YM2NraYtq0aZg2bVqtv0evgwFRiwYPHgxbW1tYWlrCx8cH7dq1Q/v2\n7Wvc3tGjR+Hk5AQHBweoq6sjODj4peNU+fv7Q1tbGyYmJrC2tsbdu3eRkZGBS5cuITQ0FBoaGjA3\nN8fHH39c45oao+f9amFhARsbG5w7dw4//PADWrVqhcTEROTl5WHSpElQV1eHqakpRo4ciZ9++kmu\ndQiCgKioKEyZMgW6urpo3rw5JkyYgL1799bRq3pzNcT+GD9+PBYvXoyMjAxMmzYNXbt2xZAhQ3Dj\nxg0AQGxsLBwcHNCzZ0+oqqqib9++MDMzw9GjR1/7/ahLPAZRi3bv3i3dN/ro0SOEhYVh0qRJNf4V\nmJ2djZYtW0ofq6urQ09Pr9r5mzdvLv23mpoaBEFAVlYWJBIJmjVrJn3OwMAAv//+e41qaoye92ta\nWhrGjx8PXV1dWFpaAgBycnLQsmVLqKr+70/JwMBA7jNS/v77bxQXF8PPzw9KSkoAnn1JlZWVoaSk\nBBoaGrX3gt5wDbU/evfujd69e6OyshK3bt3Cpk2bEBAQgOPHjyMjIwOnT5+Gra2tdP7y8vIGP9Ao\nA6KO6OnpYejQofjiiy8AAMrKylUOgL1sV9Fz7733XpVhzMvLy5GTkyNXHfr6+igsLMSTJ0+kIdEQ\nT6d7ExgaGmL9+vXw8fFBmzZtEBgYCH19fWRmZqK8vFz6pZSeno53330XgOz9rqOjAzU1NRw4cACG\nhoYAng2Hn52dzXCoRkPpj8zMTPTq1QsxMTEwMjKCsrIyOnTogPnz56NTp07IysqCnp4ePDw8sGzZ\nMulyaWlp0NXVrdX3pLZxF1Mdefr0Kfbv3w9ra2sAz26EdPfuXdy5cwclJSWIjIyU/jKpjoeHB+Lj\n43H8+HGUlZVh3bp1yM/Pl6uOli1bomvXrli+fDlKSkrw559/4scff6zx62rsWrdujenTp2PdunW4\nefMmLCwspGOIlZaW4s6dO9i8eTO8vb0BPDtZ4fTp0ygpKUFaWhqio6Olbamrq6O0tBSlpaVQUVGB\nt7c3VqxYgadPn6KwsBCzZ89ucPukG5qG0B8tW7aElZUVZs+ejTt37gB4tgWybt06tG/fHq1bt4an\npyeOHz+OhIQECIKAxMRE9OvXD9euXZOuW96/bUVgQNQiX19f6fnZvXr1goqKivQXg6WlJYYPHw5/\nf3+4ubnB2Ni4ym4fMaampli5ciWWLFkCe3t7ZGVlwcjISO66Fi5ciLS0NHTu3BlhYWHo3Lkz1NTU\navQaCRgwYADs7e0RFhYGZWVlfPPNN7h9+za6deuGzz77DIMGDYK/vz8AYPTo0SgvL0fXrl0RHB
wM\nHx8faTvt27dH27Zt4eDggHv37mHGjBnQ1dWFp6cnnJ2dkZ+fj1WrVtXXy3xjNIT+WLduHczMzDBq\n1ChYWVmhb9++yM7OxqZNm6CsrAxjY2N89dVXWL58OWxsbDB16lRMnz4dXbp0AQB4eXlh48aNMp0S\nq0gcrK8RSEhIgJ2dnXSTe/ny5Xj48CEiIiLquTIiasi4BdEIzJ07F3v37oUgCLh79y5iYmLg5ORU\n32URUQPHLYhGIDk5GfPmzcPt27ehra2NTz75BOPGjXvlMRAiatwYEEREJIq7mIiISBQDgoiIRDEg\niIhIFK+kpreaq6sr7t+/L32spaUFU1NTBAQEwMPDQ+Z20tLScPv2bbi6uspdg5+fHy5cuFDt84sX\nL64y0i9RQ8GAoLfelClT4OPjA0EQkJeXh7i4OEyZMgVlZWXo37+/TG2EhYXB0tKyRgGxdu1a6fAO\nhw8fxjfffFPlCt4mTZrI3SaRIjAg6K2nra0tHeTwvffeQ2BgIAoLC7F8+XL07dsX6urqdbp+HR0d\n6b+bNGkCZWXllw66SNRQ8BgENUpDhgzBo0ePkJiYCADIysrCl19+CQcHB5ibm8Pd3R0///wzAGDa\ntGm4cOECNm3aJL3R0tWrV+Hn5wcrKytYWFhgyJAhuHXrVo1qKS0thZ2d3QtDUg8YMADff/+99Da0\n+/btQ9euXWFnZ4cFCxZUGXTuypUr+OSTT2BhYQF3d3ds2bIFPIOdXhcDgholAwMDSCQS/PXXXwCA\n0NBQ5OXlYdu2bYiJiYGdnR1mzZqF4uJizJgxA9bW1hg+fDjWrl2L/Px86Zg7MTEx2LlzJyorK7F0\n6dIa1aKurg53d3ccPnxYOu3u3bu4ceOG9DhJfn4+tm7dig0bNmD16tU4evSodKiU7OxsjBw5Ujqi\naGhoKDZt2oSdO3e+5rtEjR0DghqtJk2aSEfQdHV1xZw5c2BmZgYTExOMGjUKeXl5ePjwIZo0aQI1\nNTVoaWlBR0cHRUVFGD16NL788ksYGhrC3NwcgwYNkt7ysia8vb2RkJAgvTdxbGws7O3tpfcDKS8v\nx7x582BpaYmuXbti4sSJ+PHHH1FRUYEdO3bAxsYGI0eOhJGREdzc3DBhwgT88MMPr/0eUePGYxDU\naBUUFEBbWxvAs11OR44cwebNm5Gamiq9l3hFRcULy+np6WHQoEHYtm0bbt68idTUVCQnJ1e5/7C8\n7O3t0aJFC/zyyy/4+OOPcfjwYYwYMUL6vJqamvSmOADw4YcfIi8vD1lZWbhz5w7Onj0rHVr+ed1l\nZWUoLS2t82Ms9PZiQFCjlJ6ejvz8fLRr1w6VlZUICAhAdnY2PDw80K1bN+jp6VV7a9bMzEwMHDgQ\nZmZmcHJyQr9+/ZCSkoL169fXuB4lJSV4eXnh559/hqWlJdLS0uDu7i59XllZGcrK/9vgr6yslE4v\nLy9H3759MWHChBfa/eed1YjkxU8PNUr79u2Dnp4ebG1tcf36dZw/fx4nTpyAvr4+AODkyZPVLnvo\n0CFoamriu+++k047ffr0ax8U9vb2xpYtWxATE4MePXpUOf21pKQEd+7cgampKQAgKSkJLVq0gJ6e\nHkxNTXH27Nkq9wo5dOgQEhISsGDBgteqiRo3HoOgt15+fj4ePXqErKws/PXXX1izZg02b96MqVOn\nQlVVFXp6elBRUcHhw4dx//59nDx5EnPnzgXw7AwjAHjnnXfw3//+V3rP40ePHuHUqVNIT0/Hrl27\nsH37dum8NfX+++/DxMQEW7duhZeX1wvPz5w5Ezdv3sTp06exdu1aDB8+HMrKyhg2bBhSUlKwaNEi\npKSk4OTJk5g3b16Dv50lvQEEoreYi4uLYGZmJv3PwcFB8PPzE44fP15lvr179wrOzs6ChYWF4OXl\nJfz444+Ck5OTsHfvXkEQBOG3334T7OzshH79+gkVFRXCvH
nzBHt7e8HGxkYYOnSo8NNPPwlmZmbC\nvXv3XlrP/v37ha5du1b7/IYNGwQbGxuhuLhYOu3cuXOCmZmZ8N133wn29vZCly5dhDVr1ggVFRXS\neS5evCh8/PHHgrm5ueDk5CSsXLlSKC8vr8E7RvQ/HO6bqAFZsGABioqKsHDhQum08+fP49NPP0VS\nUhI0NDTqsTpqbHgMgqgBSEpKws2bN7F//35s2bKlvsshAsCAIGoQEhISsGHDBowYMQIWFhb1XQ4R\nAN5RjoiIqsGzmIiISBQDgoiIRDEgiIhIFAOCiIhEMSCIiEgUA4KIiET9P2Mp/3Lw6D8zAAAAAElF\nTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pal = sns.light_palette(\"seagreen\", n_colors=3, reverse=True)\n", "plot = sns.barplot(x=\"Data Type\", y=\"Accuracy\", hue='Metrics', palette=pal, errwidth=1, capsize=0.02, data=results)\n", "plot.set_ylim(40, 90)\n", "plot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.0), ncol=3)\n", "plot.set_ylabel('Accuracy (%)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Results**: We cannot use the historical provenance of buildings to assess their quality because their historical provenance graphs all share the same topology. There are small correlations between the provenance of routes/route sets and their quality. However, as shown above, the decision tree classifier's accuracy in predicting their quality is low: 61% and 63%, compared with 97% and 96% when [using the dependency graphs](Application%202%20-%20CollabMap%20Data%20Quality.ipynb) (the _forward provenance_). Note that the baseline accuracy for random selection in this application is 50%." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }