{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra 2.1 - Unbalanced Data - Application 1: CollabMap Data Quality\n", "\n", "Assessing the quality of crowdsourced data in CollabMap from their provenance\n", "\n", "In this notebook, we compare the classification accuracy on the **unbalanced** (original) CollabMap datasets with that on the **balanced** CollabMap datasets.\n", "\n", "* **Goal**: To determine if the provenance network analytics method can identify trustworthy data (i.e. buildings, routes, and route sets) contributed by crowd workers in [CollabMap](https://collabmap.org/).\n", "* **Classification labels**: $\\mathcal{L} = \\left\\{ \\textit{trusted}, \\textit{uncertain} \\right\\} $.\n", "* **Training data**:\n", " - Buildings: 5175\n", " - Routes: 4710\n", " - Route sets: 4997\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading data\n", "The CollabMap dataset is provided in the [`collabmap/depgraphs.csv`](collabmap/depgraphs.csv) file. Each row corresponds to a building, route, or route set created in the application:\n", "* `id`: the identifier of the data entity (i.e. building/route/route set).\n", "* `trust_value`: the beta trust value calculated from the votes for the data entity.\n", "* The remaining columns provide the provenance network metrics calculated from the dependency provenance graph of the entity." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| id | trust_value | entities | agents | activities | nodes | edges | diameter | assortativity | acc | acc_e | ... | mfd_e_a | mfd_e_ag | mfd_a_e | mfd_a_a | mfd_a_ag | mfd_ag_e | mfd_ag_a | mfd_ag_ag | mfd_der | powerlaw_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Route41053.0 | 0.833333 | 9 | 0 | 6 | 15 | 26 | 3 | -0.272207 | 0.891091 | 0.809409 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | -1.00000 |
| RouteSet9042.1 | 0.600000 | 6 | 0 | 3 | 9 | 15 | 2 | -0.412974 | 0.879630 | 0.847222 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | -1.00000 |
| Building19305.0 | 0.428571 | 6 | 0 | 4 | 10 | 13 | 2 | -0.527046 | 0.901235 | 0.822222 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3.19876 |
| Building1136.0 | 0.428571 | 6 | 0 | 4 | 10 | 13 | 2 | -0.527046 | 0.901235 | 0.822222 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3.19876 |
| Building24156.0 | 0.833333 | 9 | 0 | 5 | 14 | 24 | 3 | -0.363937 | 0.838034 | 0.757639 | ... | 2 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | -1.00000 |

5 rows × 23 columns

| | trust_value | entities | agents | activities | nodes | edges | diameter | assortativity | acc | acc_e | ... | mfd_e_a | mfd_e_ag | mfd_a_e | mfd_a_a | mfd_a_ag | mfd_ag_e | mfd_ag_a | mfd_ag_ag | mfd_der | powerlaw_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14882.000000 | 14882.000000 | 14882.0 | 14882.000000 | 14882.000000 | 14882.000000 | 14882.000000 | 14882.000000 | 14882.000000 | 14882.000000 | ... | 14882.000000 | 14882.0 | 14882.000000 | 14882.000000 | 14882.0 | 14882.0 | 14882.0 | 14882.0 | 14882.000000 | 14882.000000 |
| mean | 0.766706 | 13.384693 | 0.0 | 6.793375 | 20.178067 | 39.118868 | 2.771267 | -0.363791 | 0.806123 | 0.762426 | ... | 1.545424 | 0.0 | 1.742575 | 0.987166 | 0.0 | 0.0 | 0.0 | 0.0 | 1.802782 | -0.226061 |
| std | 0.115301 | 17.165677 | 0.0 | 7.247706 | 24.147888 | 59.648535 | 0.917298 | 0.238658 | 0.203627 | 0.200090 | ... | 1.044079 | 0.0 | 1.012615 | 1.391763 | 0.0 | 0.0 | 0.0 | 0.0 | 0.938974 | 1.590865 |
| min | 0.153846 | 2.000000 | 0.0 | 0.000000 | 2.000000 | 1.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | -1.000000 |
| 25% | 0.750000 | 5.000000 | 0.0 | 2.000000 | 7.000000 | 10.000000 | 2.000000 | -0.500000 | 0.820309 | 0.757639 | ... | 1.000000 | 0.0 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | -1.000000 |
| 50% | 0.800000 | 9.000000 | 0.0 | 5.000000 | 14.000000 | 24.000000 | 3.000000 | -0.330835 | 0.849790 | 0.809409 | ... | 1.000000 | 0.0 | 2.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 2.000000 | -1.000000 |
| 75% | 0.833333 | 14.000000 | 0.0 | 9.000000 | 22.000000 | 40.000000 | 3.000000 | -0.251256 | 0.880083 | 0.854159 | ... | 2.000000 | 0.0 | 2.000000 | 2.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 2.000000 | -1.000000 |
| max | 0.965517 | 178.000000 | 0.0 | 70.000000 | 248.000000 | 706.000000 | 13.000000 | 0.494008 | 1.000000 | 1.000000 | ... | 13.000000 | 0.0 | 12.000000 | 13.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 12.000000 | 4.674298 |

8 rows × 23 columns

| id | trust_value | entities | agents | activities | nodes | edges | diameter | assortativity | acc | acc_e | ... | mfd_e_ag | mfd_a_e | mfd_a_a | mfd_a_ag | mfd_ag_e | mfd_ag_a | mfd_ag_ag | mfd_der | powerlaw_alpha | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Route41053.0 | 0.833333 | 9 | 0 | 6 | 15 | 26 | 3 | -0.272207 | 0.891091 | 0.809409 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | -1.00000 | Trusted |
| RouteSet9042.1 | 0.600000 | 6 | 0 | 3 | 9 | 15 | 2 | -0.412974 | 0.879630 | 0.847222 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | -1.00000 | Uncertain |
| Building19305.0 | 0.428571 | 6 | 0 | 4 | 10 | 13 | 2 | -0.527046 | 0.901235 | 0.822222 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3.19876 | Uncertain |
| Building1136.0 | 0.428571 | 6 | 0 | 4 | 10 | 13 | 2 | -0.527046 | 0.901235 | 0.822222 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3.19876 | Uncertain |
| Building24156.0 | 0.833333 | 9 | 0 | 5 | 14 | 24 | 3 | -0.363937 | 0.838034 | 0.757639 | ... | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 2 | -1.00000 | Trusted |

5 rows × 24 columns