{ "cells": [ { "cell_type": "markdown", "id": "da5ac69f-5063-40aa-805d-09c65c88d9b5", "metadata": {}, "source": [ "# Geospatial analysis of clinical trial sites in the USA: \n", "## Exploring Demographics and Vulnerabilities\n", "\n", "#### Author - Priyadarshini Satish\n", "#### Date – 03/10/2023\n", "#### Course - GGIS 407 - Cyber GIS and Geospatial Data Science\n", "#### Professor - Dr. Anand Padmanabhan\n" ] }, { "cell_type": "markdown", "id": "dc1f9522-bac4-4706-b655-9ab1118f0bc9", "metadata": {}, "source": [ "### Introduction\n", "In the medical and pharmaceutical industry, selection of sites for clinical trials is a critical task that involves analyzing proprietary information about demographics, geography, economy, and market dynamics. A comprehensive list of clinical trial sites from published by the US National Library of Medicine is available online. This data set includes information about clinical trials registered on the website since 2000 across the world. While it is not exhaustive, it contains a list of all studies that require registration with the FDAA that involve studies on behaviors, effect of a drug, or procedure on human volunteers. \n", "\n", "In the present study, this data set is filtered for sites in the United States. The city each site is in is plotted on the map. This geospatial data is visualized as a heat map and compared with county level socio-economic indicators from the US census bureau. The data retrieved from the census includes, median age, health insurance coverage, median household income, poverty rate. The expected result is insights on where pharmaceutical companies tend to locate sites for such studies. Further, we can explore if certain vulnerabilities such as lower income are exploited when locating such sites to attract more participants.\n" ] }, { "cell_type": "markdown", "id": "70e02c6a-d29d-43eb-bf70-b3532c713933", "metadata": {}, "source": [ "## Data Visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "b1d2bf46-1312-4945-81ba-cb66a8573dcd", "metadata": {}, "outputs": [], "source": [ "#import necessary libraries\n", "import pandas as pd\n", "import geopandas as gpd\n", "import math\n", "import folium\n", "import zipfile\n", "from folium import Choropleth, Circle, Marker\n", "from folium.plugins import HeatMap, MarkerCluster" ] }, { "cell_type": "code", "execution_count": null, "id": "8a9e003c-5061-43aa-9416-920ff154c907", "metadata": {}, "outputs": [], "source": [ "with zipfile.ZipFile('Geodata.zip','r') as zip_ref:\n", " zip_ref.extractall()" ] }, { "cell_type": "code", "execution_count": null, "id": "d453a781-2843-4303-b047-53f7cf881f81", "metadata": {}, "outputs": [], "source": [ "censusdata = pd.read_excel(\"censusdata.xlsx\")\n", "#excel data from previous section was converted to a geojson using arcgis Pro. That file is loaded into the notebook here. \n", "geodata = gpd.read_file('CT_geo.geojson')" ] }, { "cell_type": "code", "execution_count": null, "id": "4e998042-df0b-4402-86e2-3be85ba1d304", "metadata": {}, "outputs": [], "source": [ "#Creating groups of unique cities to get a count of facilities by city \n", "bubbledata = geodata.groupby(['city','state','country']).agg(\n", " count_of_facilities = ('facility_name','count'),\n", " lat = ('lat','first'),\n", " lon = ('lng','first'),\n", " geometry = ('geometry','first')).reset_index()" ] }, { "cell_type": "markdown", "id": "f53ba7e3-df8f-4452-89a8-b287c405e30b", "metadata": {}, "source": [ "Creating a table grouping by cities to get a count of how many facilities are in each city " ] }, { "cell_type": "code", "execution_count": null, "id": "eddff921-07ac-404d-9016-097566356817", "metadata": {}, "outputs": [], "source": [ "# Create a map indicating the location of all the cities and the count of facilities in each. \n", "m_4 = folium.Map(location=[48,-102], tiles='cartodbpositron', zoom_start=3)\n", "\n", "# Add points to the map\n", "mc = MarkerCluster()\n", "for idx, row in bubbledata.iterrows():\n", " if not math.isnan(row['lon']) and not math.isnan(row['lat']):\n", " mc.add_child(Marker([row['lat'], row['lon']]))\n", "m_4.add_child(mc)\n", "# add marker one by one on the map\n", "for i in range(0,len(bubbledata)):\n", " folium.Marker(\n", " location=[bubbledata.iloc[i]['lat'], bubbledata.iloc[i]['lon']],\n", " popup=bubbledata.iloc[i][['city','count_of_facilities']],\n", " # icon=folium.DivIcon(html=f\"\"\"
{bubbledata.iloc[i]['city']}
\"\"\")\n", " ).add_to(mc)\n", "# Display the map\n", "m_4" ] }, { "cell_type": "markdown", "id": "7be2e15a-b8e7-4f75-9ccd-c31229dcaef1", "metadata": {}, "source": [ "The map above gives an overview of where the sites are located. Zoom through and click on markers to see city names, as well as the number of facilites in each of them" ] }, { "cell_type": "code", "execution_count": null, "id": "8aa91a0d-5a6b-40e2-81f5-0f9de179559c", "metadata": {}, "outputs": [], "source": [ "m_4.save(\"city_count.html\") #save to a file" ] }, { "cell_type": "code", "execution_count": null, "id": "29d23601-de0b-412d-84f6-6120bb3d475d", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Create a base map\n", "m_5 = folium.Map(location=[48,-102], tiles='cartodbpositron', zoom_start=4)\n", "\n", "#Add a heatmap to the base map\n", "HeatMap(data=geodata[['lat', 'lng']], radius=10).add_to(m_5)\n", "\n", "# Display the map\n", "m_5" ] }, { "cell_type": "markdown", "id": "7092d522-dd90-4ecd-a1d7-aa3c752f2eb0", "metadata": {}, "source": [ "The map above is a heat map showing greatest concentration of site clusters. Green colour shows greater density of sites. The data used contains over 150,000 clinical trial data and compounds the information from multiple sites in a single city to show greater density of sites in that region." ] }, { "cell_type": "code", "execution_count": null, "id": "f4c9358d-2464-464a-a023-8d40d56e59e7", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Import Libraries\n", "import geopandas as gpd\n", "import pandas as pd\n", "import numpy as np\n", "import folium\n", "from folium.features import GeoJsonTooltip\n", "from pygris import counties \n", "from pygris.utils import shift_geometry\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "id": "9f3c4dfd-79f8-4cb4-a55a-e7b025e3c1a4", "metadata": { "tags": [] }, "outputs": [], "source": [ "us_counties = counties(cb = True, resolution = \"20m\", cache = True, year = 2019)\n", "us_counties['GEOID'] = us_counties['GEOID'].astype(int)\n", "merged_data = us_counties.merge(censusdata, on = \"GEOID\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4078fee8-d205-4723-95c7-c59c7fba5bfc", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Create two FeatureGroup layers\n", "us_map = folium.Map(location=[40, -102], zoom_start=4,tiles='cartodbpositron',overlay=False)\n", "fg1 = folium.FeatureGroup(name='Percent below Poverty',overlay=False).add_to(us_map)\n", "fg2 = folium.FeatureGroup(name='Percent covered by Health Insurance',overlay=False).add_to(us_map)\n", "fg3 = folium.FeatureGroup(name='Median Household Income',overlay=False).add_to(us_map)\n", "fg4 = folium.FeatureGroup(name='Median Age',overlay=False).add_to(us_map)" ] }, { "cell_type": "code", "execution_count": null, "id": "d4126620-c060-45d3-aeea-b7938d754a53", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Add the first choropleth map layer to fg1\n", "custom_scale1 = (merged_data['poverty_perc'].quantile((0,0.2,0.4,0.6,0.8,1))).tolist()\n", "Poverty=folium.Choropleth(\n", " geo_data=us_counties,\n", " data=merged_data,\n", " columns=['GEOID', 'poverty_perc'], \n", " key_on='feature.properties.GEOID', \n", " threshold_scale=custom_scale1, #use the custom scale we created for legend\n", " fill_color='YlOrRd',\n", " nan_fill_color=\"White\", #Use white color if there is no data available for the county\n", " fill_opacity=0.7,\n", " line_opacity=0.2,\n", " legend_name='Percent of Population Below Poverty Level',\n", " highlight=True,\n", " overlay=False,\n", " line_color='black').geojson.add_to(fg1)\n", "\n", "#Add customized tooltips to the map\n", "folium.features.GeoJson(\n", " data=merged_data,\n", " name='Percent of Population Below Poverty Level',\n", " smooth_factor=2,\n", " style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.5},\n", " tooltip=folium.features.GeoJsonTooltip(\n", " fields=['NAME',\n", " 'poverty_perc',],\n", " aliases=[\"County Name:\",\n", " \"Percent below poverty\"],\n", " localize=True,\n", " sticky=False,\n", " labels=True,\n", " style=\"\"\"\n", " background-color: #F0EFEF;\n", " border: 2px solid black;\n", " border-radius: 3px;\n", " box-shadow: 3px;\n", " \"\"\",\n", " max_width=800,),\n", " highlight_function=lambda x: {'weight':3,'fillColor':'grey'},\n", " ).add_to(Poverty) " ] }, { "cell_type": "code", "execution_count": null, "id": "dd464d7e-9f9b-4c53-ab36-e46f901cce29", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Add the second choropleth map layer to fg2\n", "custom_scale2 = (merged_data['covered_perc'].quantile((0,0.2,0.4,0.6,0.8,1))).tolist()\n", "Insurance=folium.Choropleth(\n", " geo_data=us_counties,\n", " data=merged_data,\n", " columns=['GEOID', 'covered_perc'], #Here we tell folium to get the county fips and plot the 'pct_positive_7days' metric for each county\n", " key_on='feature.properties.GEOID', #Here we grab the geometries/county boundaries from the geojson file using the key 'coty_code' which is the same as fips_code\n", " threshold_scale=custom_scale2, #use the custom scale we created for legend\n", " fill_color='YlOrRd',\n", " nan_fill_color=\"White\", #Use white color if there is no data available for the county\n", " fill_opacity=0.7,\n", " line_opacity=0.2,\n", " legend_name='Percent of Population covered by Health Insurance',\n", " highlight=True,\n", " overlay=False,\n", " line_color='black').geojson.add_to(fg2)\n", "\n", "#Add customized tooltips to the map\n", "folium.features.GeoJson(\n", " data=merged_data,\n", " name='Percent of Population covered by Health Insurance',\n", " smooth_factor=2,\n", " style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.5},\n", " tooltip=folium.features.GeoJsonTooltip(\n", " fields=['NAME',\n", " 'covered_perc'],\n", " aliases=[\"County Name:\",\n", " \"Percent Covered by Insurance:\"],\n", " localize=True,\n", " sticky=False,\n", " labels=True,\n", " style=\"\"\"\n", " background-color: #F0EFEF;\n", " border: 2px solid black;\n", " border-radius: 3px;\n", " box-shadow: 3px;\n", " \"\"\",\n", " max_width=800,),\n", " highlight_function=lambda x: {'weight':3,'fillColor':'grey'},\n", " ).add_to(Insurance) " ] }, { "cell_type": "code", "execution_count": null, "id": "649ce204-b9ee-4f30-b44c-7ce62941a5e2", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Add the third choropleth map layer to fg3\n", "custom_scale3 = (merged_data['median_age'].quantile((0,0.2,0.4,0.6,0.8,1))).tolist()\n", "Age=folium.Choropleth(\n", " geo_data=us_counties,\n", " data=merged_data,\n", " columns=['GEOID', 'median_age'], #Here we tell folium to get the county fips and plot the 'pct_positive_7days' metric for each county\n", " key_on='feature.properties.GEOID', #Here we grab the geometries/county boundaries from the geojson file using the key 'coty_code' which is the same as fips_code\n", " threshold_scale=custom_scale3, #use the custom scale we created for legend\n", " fill_color='YlOrRd',\n", " nan_fill_color=\"White\", #Use white color if there is no data available for the county\n", " fill_opacity=0.7,\n", " line_opacity=0.2,\n", " legend_name='Median Age of Population',\n", " highlight=True,\n", " overlay=False,\n", " line_color='black').geojson.add_to(fg3)\n", "\n", "#Add customized tooltips to the map\n", "folium.features.GeoJson(\n", " data=merged_data,\n", " name='Median Age of Population',\n", " smooth_factor=2,\n", " style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.5},\n", " tooltip=folium.features.GeoJsonTooltip(\n", " fields=['NAME',\n", " 'median_age'],\n", " aliases=[\"County Name:\",\n", " \"Median Age:\"],\n", " localize=True,\n", " sticky=False,\n", " labels=True,\n", " style=\"\"\"\n", " background-color: #F0EFEF;\n", " border: 2px solid black;\n", " border-radius: 3px;\n", " box-shadow: 3px;\n", " \"\"\",\n", " max_width=800,),\n", " highlight_function=lambda x: {'weight':3,'fillColor':'grey'},\n", " ).add_to(Age) " ] }, { "cell_type": "code", "execution_count": null, "id": "17fc693f-6a07-4861-b362-9907714f5273", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Add the fourth choropleth map layer to fg4\n", "custom_scale4 = (merged_data['median_HHI'].quantile((0,0.2,0.4,0.6,0.8,1))).tolist()\n", "Income=folium.Choropleth(\n", " geo_data=us_counties,\n", " data=merged_data,\n", " columns=['GEOID', 'median_HHI'], #Here we tell folium to get the county fips and plot the 'pct_positive_7days' metric for each county\n", " key_on='feature.properties.GEOID', #Here we grab the geometries/county boundaries from the geojson file using the key 'coty_code' which is the same as fips_code\n", " threshold_scale=custom_scale4, #use the custom scale we created for legend\n", " fill_color='YlOrRd',\n", " nan_fill_color=\"White\", #Use white color if there is no data available for the county\n", " fill_opacity=0.7,\n", " line_opacity=0.2,\n", " legend_name='Median HHI of Population',\n", " highlight=True,\n", " overlay=False,\n", " line_color='black').geojson.add_to(fg4)\n", "\n", "#Add customized tooltips to the map\n", "folium.features.GeoJson(\n", " data=merged_data,\n", " name='Median HHI of Population',\n", " smooth_factor=2,\n", " style_function=lambda x: {'color':'black','fillColor':'transparent','weight':0.5},\n", " tooltip=folium.features.GeoJsonTooltip(\n", " fields=['NAME',\n", " 'median_HHI'],\n", " aliases=[\"County Name:\",\n", " \"Median HHI:\"],\n", " localize=True,\n", " sticky=False,\n", " labels=True,\n", " style=\"\"\"\n", " background-color: #F0EFEF;\n", " border: 2px solid black;\n", " border-radius: 3px;\n", " box-shadow: 3px;\n", " \"\"\",\n", " max_width=800,),\n", " highlight_function=lambda x: {'weight':3,'fillColor':'grey'},\n", " ).add_to(Income) " ] }, { "cell_type": "code", "execution_count": null, "id": "60dc1758-0864-4f18-838a-9f3a6bb27fb4", "metadata": { "tags": [] }, "outputs": [], "source": [ "#Add layer control to the map\n", "folium.TileLayer('cartodbdark_matter',overlay=True,name=\"View in Dark Mode\").add_to(us_map)\n", "folium.TileLayer('cartodbpositron',overlay=True,name=\"Viw in Light Mode\").add_to(us_map)\n", "folium.LayerControl(collapsed=False).add_to(us_map)\n", "us_map\n", "us_map.save(\"index.html\") #save to a file" ] }, { "cell_type": "code", "execution_count": null, "id": "ddb02a98-f7b8-4305-82c3-1dbe79c2b525", "metadata": {}, "outputs": [], "source": [ "us_map" ] }, { "cell_type": "markdown", "id": "608bbdd1-58d5-4c83-b1cb-203426ee5810", "metadata": {}, "source": [ "The choropleth map above shows the socio-economic indicators of every county in the USA. The darker regions are indicative of greater concentrations of poverty, higher household income, higher median age or high percentage of health insurance coverage according to the map selected. You can interact with the map and see the indicators for each county as you click through" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }