## GGIS 527
### Lab 7 Analyzing Geo-Text Data with Natural Language Processing (NLP) Tools
#### Developed by Zhaonan Wang in Fall 2023
In this lab, you will go through the data wrangling process for two types of geo-text data: text with explicit geotags and text with implicit location mentions.
- [**Explicit geo-text dataset**](#explicit): 2145 businesses located on the Illinois side of St. Louis, derived from the [Yelp Academic Dataset](https://www.yelp.com/dataset). Each data record also comes with a user review in plain text. See the [data format documentation](https://www.yelp.com/dataset/documentation/main) for the business and review fields used here. Your task is to perform sentiment analysis on each review and plot the polarity score on a map.
- [**Implicit geo-text dataset**](#implicit): news reports usually mention various locations, such as countries, states, and even local toponyms (place names). In this notebook, you will play with a toy corpus containing three chunks of online news about dam failure events. Your task is to extract the location mentions buried in the unstructured text.

<a id='explicit'></a>
### Explicit Geo-Text Data Analysis

In [1]:
import pandas as pd

# read prepared yelp data
yelp_data = pd.read_csv('./data/yelp_STL_IL.csv')

# check data
print(yelp_data.shape)
yelp_data.head()

# we are mainly interested in the 'text' column and the geotag columns ['latitude', 'longitude']

(2145, 25)


Unnamed: 0.1,Unnamed: 0,Unnamed: 0_x,business_id,name,address,city,state,postal_code,latitude,longitude,...,hours,Unnamed: 0_y,review_id,user_id,stars_y,useful,funny,cool,text,date
0,0,38,LcAozWCMLGjwRbokaJAKMg,Edwardsville Children's Museum,722 Holyoake Rd,Edwardsville,IL,62025,38.804395,-89.949733,...,"{'Monday': '10:0-15:0', 'Tuesday': '9:30-14:0'...",313,LfsU2lVUr1-pC802v0o32A,mRgAqvxz9jHYpm8ccIjZUQ,5.0,0,0,0,Place rocks excellent children's activities an...,2016-07-04 20:56:17
1,13,41,ljxNT9p0y7YMPx0fcNBGig,Tony's Restaurant & 3rd Street Cafe,312 Piasa St,Alton,IL,62002,38.896563,-90.186203,...,"{'Monday': '0:0-0:0', 'Tuesday': '16:0-21:30',...",20,uiqzlDEsUN_y1awEw_HHDA,qmQPWMV_YYmwV2DyvmIDYQ,5.0,0,0,0,"We had been driving around for some time, on a...",2018-07-17 01:07:49
2,118,48,bCBPXIVfVzBZBEpFu29dcg,All In Shipping,5343 Belleville Crossing St,Belleville,IL,62226,38.517586,-90.021929,...,,1378,oZqb2LRrJFaEjTz9ETzpPA,BHrWZS0J0FuJuLqeNk6J7w,5.0,0,0,0,I love this little local business. They have e...,2017-01-20 14:13:47
3,123,86,sE6jSnvMts_MAn-b4OkMAw,K-9 Groom Room,820 Industrial Dr,Troy,IL,62294,38.716244,-89.88583,...,"{'Monday': '8:0-16:0', 'Tuesday': '8:0-16:0', ...",194,UjBwlySBW4iPpFWGOw5Xkw,SE85OT0FKxeL28izk-5POg,4.0,3,0,0,This is another great local business. Our two...,2011-03-25 17:36:39
4,128,102,EuRGgOwJ0g1vTj2R04j37Q,Crafty Crab,51 Ludwig Dr,Fairview Heights,IL,62208,38.601298,-89.989683,...,"{'Monday': '12:0-22:0', 'Tuesday': '12:0-22:0'...",3261,DrWMCBMRweRydBEk-OLKYg,h3o-SqWjDeMI2fCJI63-jg,1.0,0,0,0,Waiter was absolutely terrible ordered our foo...,2021-11-06 02:07:15


#### Introduction to Spacy and Sentiment Analysis
We will use [spaCy](https://spacy.io/), a free, open-source, and easy-to-use Python library for fundamental NLP tasks such as pre-processing, information extraction, and natural language understanding. Specifically, we will add [spacytextblob](https://spacy.io/universe/project/spacy-textblob), a pipeline component that brings TextBlob sentiment analysis into spaCy. Depending on whether the user likes the reviewed business or not, the model returns a sentiment polarity score on a scale from -1 to 1. Negative values denote dislike and positive values denote like, and the magnitude indicates how strong the sentiment is.
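To make the polarity scale concrete, here is a minimal sketch that scores a review by averaging per-word polarities from a tiny hand-made lexicon. It only illustrates how a lexicon-based score ends up between -1 and 1. The lexicon `TOY_LEXICON` and the function `toy_polarity` are hypothetical names with made-up values, and this is not the scoring procedure used by spacytextblob, which relies on a far larger lexicon and extra rules.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values made up for illustration)
TOY_LEXICON = {
    "excellent": 1.0,
    "great": 0.8,
    "love": 0.5,
    "terrible": -1.0,
    "rude": -0.6,
}

def toy_polarity(text):
    """Average the polarity of the known words; return 0.0 if none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

for review in ["Great food, excellent service",
               "Waiter was absolutely terrible",
               "We ordered at noon"]:
    print(f"{toy_polarity(review):+.2f}  {review}")
```

Because the score is an average of values that each lie in [-1, 1], it can never leave that range, and a review with no sentiment-bearing words falls back to a neutral 0.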

In [2]:
# install required libraries
# spacy
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

# spacytextblob
!pip install spacytextblob
!python -m textblob.download_corpora

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Defaulting to user installation because normal site-packages is not writeable
[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading

In [3]:
# import required libraries
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

# load pipelines
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f73caf88d60>

In [4]:
# define a function to be applied on each row of pandas dataframe
def sentiment_score(text):
    doc = nlp(text)
    return doc._.blob.polarity

In [5]:
%%time

# apply sentiment analysis to each row
yelp_data['sentiment'] = yelp_data['text'].apply(sentiment_score)
# It will take ~1 min to run through

CPU times: user 1min 2s, sys: 0 ns, total: 1min 2s
Wall time: 1min 2s


In [6]:
# check the derived column
print(yelp_data['sentiment'].min(), yelp_data['sentiment'].max())
yelp_data.head()

-1.0 1.0


Unnamed: 0.1,Unnamed: 0,Unnamed: 0_x,business_id,name,address,city,state,postal_code,latitude,longitude,...,Unnamed: 0_y,review_id,user_id,stars_y,useful,funny,cool,text,date,sentiment
0,0,38,LcAozWCMLGjwRbokaJAKMg,Edwardsville Children's Museum,722 Holyoake Rd,Edwardsville,IL,62025,38.804395,-89.949733,...,313,LfsU2lVUr1-pC802v0o32A,mRgAqvxz9jHYpm8ccIjZUQ,5.0,0,0,0,Place rocks excellent children's activities an...,2016-07-04 20:56:17,0.436623
1,13,41,ljxNT9p0y7YMPx0fcNBGig,Tony's Restaurant & 3rd Street Cafe,312 Piasa St,Alton,IL,62002,38.896563,-90.186203,...,20,uiqzlDEsUN_y1awEw_HHDA,qmQPWMV_YYmwV2DyvmIDYQ,5.0,0,0,0,"We had been driving around for some time, on a...",2018-07-17 01:07:49,0.20025
2,118,48,bCBPXIVfVzBZBEpFu29dcg,All In Shipping,5343 Belleville Crossing St,Belleville,IL,62226,38.517586,-90.021929,...,1378,oZqb2LRrJFaEjTz9ETzpPA,BHrWZS0J0FuJuLqeNk6J7w,5.0,0,0,0,I love this little local business. They have e...,2017-01-20 14:13:47,0.266146
3,123,86,sE6jSnvMts_MAn-b4OkMAw,K-9 Groom Room,820 Industrial Dr,Troy,IL,62294,38.716244,-89.88583,...,194,UjBwlySBW4iPpFWGOw5Xkw,SE85OT0FKxeL28izk-5POg,4.0,3,0,0,This is another great local business. Our two...,2011-03-25 17:36:39,0.48125
4,128,102,EuRGgOwJ0g1vTj2R04j37Q,Crafty Crab,51 Ludwig Dr,Fairview Heights,IL,62208,38.601298,-89.989683,...,3261,DrWMCBMRweRydBEk-OLKYg,h3o-SqWjDeMI2fCJI63-jg,1.0,0,0,0,Waiter was absolutely terrible ordered our foo...,2021-11-06 02:07:15,-0.124603


#### Visualization of Explicit Geo-Text Data
We will use [folium](https://python-visualization.github.io/folium/latest/), a Python library for building interactive maps with Leaflet.js.

In [7]:
# install folium
!pip install folium
# alternative conda install
# conda install -c conda-forge folium

Defaulting to user installation because normal site-packages is not writeable


In [8]:
# import libraries
import folium
import branca.colormap as cm
from branca.element import Figure

In [9]:
# first, filter the dataset to a selected neighborhood
select_neighbor = yelp_data[yelp_data['city']=='Edwardsville']

print(select_neighbor.shape)
# there are 274 businesses after filtering

(274, 26)


In [10]:
# build a color map to visualize sentiment polarity
rainbow = cm.StepColormap(['purple', 'lightblue', 'lightgreen', 'yellow', 'orange', 'red'], vmin=-1, vmax=1)
rainbow
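To see which color a given polarity score would receive, the following sketch mimics a step color map in plain Python. It assumes the six colors split the range [-1, 1] into equal-width bins, which is the behavior we expect when no explicit bin edges are supplied. The function `step_color` is a hypothetical helper for illustration, not part of branca.

```python
COLORS = ['purple', 'lightblue', 'lightgreen', 'yellow', 'orange', 'red']

def step_color(value, colors=COLORS, vmin=-1.0, vmax=1.0):
    """Return the color of the equal-width bin that contains value."""
    # clamp out-of-range values to the ends of the scale
    value = min(max(value, vmin), vmax)
    width = (vmax - vmin) / len(colors)
    # the top edge (value == vmax) belongs to the last bin
    idx = min(int((value - vmin) / width), len(colors) - 1)
    return colors[idx]

# sentiment scores taken from the first rows shown above, plus the two extremes
for s in [-1.0, -0.124603, 0.20025, 0.48125, 1.0]:
    print(s, step_color(s))
```

Under this assumption, strongly negative reviews map to purple and strongly positive ones to red, with each bin covering one third of a polarity unit.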

In [11]:
# Create a map instance with a frame
fig = Figure(width=800, height=500)
m = folium.Map(location=[38.8039, -89.9583], zoom_start=11)
fig.add_child(m)

# iterate over each business and add a marker to the basemap
for index, row in select_neighbor.iterrows():
    iframe = folium.IFrame(row['text'])
    folium.Marker([row['latitude'], row['longitude']],
                  popup=folium.Popup(iframe, min_width=300, max_width=300),
                  icon=folium.Icon(color='lightgray', icon_color=rainbow(row['sentiment']))).add_to(m)

m
# Any observation about the spatial distribution pattern?

You can experiment by replacing the visualized attribute with another column (e.g., stars) or by filtering to a different neighborhood. You are also welcome to explore other regions for your course project or out of personal interest. Please feel free to reach out to me (znwang@illinois.edu) about data or your cool project.
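If you swap in the star rating, note that it runs from 1 to 5 while the color map above expects values between -1 and 1. One option is to rescale the rating linearly before coloring, as in the sketch below. The helper `rescale` is a hypothetical name, and in this dataset the review rating appears to be stored in the `stars_y` column. Building a second color map with `vmin=1, vmax=5` would be an equally reasonable alternative.

```python
def rescale(value, old_min, old_max, new_min=-1.0, new_max=1.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)

# star ratings 1..5 mapped onto the [-1, 1] range of the color map
for stars in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print(stars, rescale(stars, 1.0, 5.0))
```

With this mapping a 3-star review lands at 0, the neutral midpoint of the polarity scale.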

<a id='implicit'></a>
### Implicit Geo-Text Data Analysis
According to [Twitter](https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data#:~:text=As%20mentioned%20in%20the%20review,contain%20some%20profile%20location%20information.), while only 1-2% of Tweets are geotagged, 30-40% of Tweets contain some location information. Similarly, [this GIScience'21 paper](https://arxiv.org/pdf/2009.12914.pdf) reports that over 10% of Tweets contain location references in their content. Thus, it is important to use text mining to extract this implicit geographic information from unstructured text data.

Recently, researchers have been using advanced NLP techniques for this task, which can be considered a sub-task of Named Entity Recognition (NER). Rather than all named entities, such as person names and time expressions, we focus mainly on geospatial named entities, such as geopolitical entities and local organizations. We will use [spaCy](https://spacy.io/) again as a general-purpose NER tool to recognize geo-entities in text data.

In [12]:
# import required libraries
import json
import spacy
from spacy import displacy    # visualizer
from collections import defaultdict
from tqdm import tqdm

In [13]:
# read text corpus and save into a list
data_list = []
with open('./data/news_samples.txt', encoding='utf-8') as f:
    readin = f.readlines()
    for line in tqdm(readin):
        data_list.append(line.strip())

print(f'Length of data_list: {len(data_list)}')
for text in data_list:
    print(text)

100%|██████████| 3/3 [00:00<00:00, 5236.33it/s]

Length of data_list: 3
Dozens, if not more than a hundred, Midland-area residents gathered to seek refuge within the walls of Midland High School Tuesday night after the Edenville Dam failed to hold back a deluge of water. Midland officials warned residents living near the Tittabawassee River to evacuate. They are concerned the Sanford Dam, located a few miles northwest of the city and downstream of the Edenville Dam, will also fail. Some drove to the school at 1301 Eastlawn Drive to seek shelter. Others were brought in by bus.
Soaking rains from the remnants of Hurricane Ida prompted the evacuations of thousands of people Wednesday after water reached dangerous levels at a dam near Johnstown, PA. The storm moved east in the evening, with the National Weather Service confirming at least one tornado and social media posts showing homes blown to rubble and roofs torn from buildings in a southern New Jersey county just outside Philadelphia. Pennsylvania was blanketed with rain after high 




In [14]:
# load spacy pipeline
nlp = spacy.load('en_core_web_sm')

# iterate through the news
for i, text in enumerate(tqdm(data_list)):
    doc = nlp(text)
    
    entity_dict = defaultdict(int)
    for entity in doc.ents:
        if entity.label_ in ['LOC', 'GPE']:    # LOCation, GeoPolitical Entity (i.e. countries, cities, states)
            entity_dict[entity.label_ + '_' + entity.text] += 1
    
    # visualize NER results
    displacy.render(doc, style='ent', options={"ents": ['LOC', 'GPE']}, jupyter=True)
    
    # save recognized entities into json
    with open(f'./data/NER_{i}.txt', 'w') as fout:
        fout.write(json.dumps(entity_dict) + '\n')

  0%|          | 0/3 [00:00<?, ?it/s]

 67%|██████▋   | 2/3 [00:00<00:00, 10.84it/s]

100%|██████████| 3/3 [00:00<00:00,  9.67it/s]
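Each saved file holds one JSON object whose keys follow the pattern `LABEL_text` and whose values are mention counts. The sketch below shows one way to regroup such a dictionary by entity label. The sample counts are made up for illustration and are not actual model output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up counts in the same "LABEL_text" key format (not actual model output)
sample_counts = {
    "GPE_Midland": 2,
    "GPE_Pennsylvania": 1,
    "LOC_the Tittabawassee River": 1,
    "GPE_New_Jersey": 1,
}

def group_by_label(counts):
    """Return {label: {place_name: count}} from {"LABEL_place": count}."""
    grouped = defaultdict(dict)
    for key, count in counts.items():
        # maxsplit=1 keeps any later underscores inside the place name
        label, place_name = key.split("_", 1)
        grouped[label][place_name] = count
    return dict(grouped)

print(group_by_label(sample_counts))
```

Splitting only at the first underscore matters because a plain `split('_')` would return three parts for a key like `GPE_New_Jersey` and fail to unpack into a label and a name.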


#### Optional: Visualization of Implicit Geo-Text Data (Geocoding Service Required)

In [None]:
# import libraries
import requests
import folium
from branca.element import Figure    # map frame, needed if this section is run on its own
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# load target news' NER results
target_news = 1    # index of the news item, matching the NER_{i}.txt files written above
with open(f'./data/NER_{target_news}.txt') as f:
    # each file holds a single JSON object on its first line
    ner_js = json.loads(f.readline())
ner_js

In [None]:
# prepare one empty dict per entity class ('LOC', 'GPE') found in the results
ner_class = {}
for key in ner_js.keys():
    class_ = key.split('_', 1)[0]
    ner_class.setdefault(class_, {})

In [None]:
# need Google Maps API key
my_Google_Maps_API_key = 'your_Google_Maps_API_key'
geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json'
for key in ner_js.keys():
    # split on the first underscore only, so place names containing '_' stay intact
    class_, place_name = key.split('_', 1)
    if place_name not in ner_class[class_].keys():
        # pass the query via `params` so that the place name is URL-encoded
        response = requests.get(geocode_url, params={'address': place_name, 'key': my_Google_Maps_API_key})
        results = response.json()['results']
        if results:
            ner_class[class_][place_name] = results[0]['geometry']['location']
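The geocoding step can be rehearsed without an API key or network access. The sketch below builds the request URL with the query string encoded by the standard library and pulls the first coordinate pair out of a response-shaped dictionary. The helpers `build_geocode_url` and `first_location` are hypothetical, and `fake_payload` is hand-written with placeholder coordinates that only mimic the fields the notebook reads.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(place_name, api_key):
    """Return a geocoding request URL with the query string safely encoded."""
    return GEOCODE_URL + "?" + urlencode({"address": place_name, "key": api_key})

def first_location(payload):
    """Return the first result's {'lat', 'lng'} dict, or None if there are no results."""
    results = payload.get("results", [])
    return results[0]["geometry"]["location"] if results else None

# Hand-written payload with placeholder coordinates (not a real API response)
fake_payload = {"results": [{"geometry": {"location": {"lat": 40.0, "lng": -75.0}}}]}

print(build_geocode_url("Johnstown, PA", "your_Google_Maps_API_key"))
print(first_location(fake_payload))
print(first_location({"results": []}))
```

Encoding matters because place names often contain spaces, commas, or ampersands, and returning `None` for an empty result list lets the caller skip names the service cannot resolve.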

In [None]:
# Create a map instance with a frame
fig = Figure(width=800, height=500)
m = folium.Map(location=[38, -97], tiles="cartodbpositron", zoom_start=6)
fig.add_child(m)

# red markers for LOC entities, blue markers for GPE entities
for class_, color in [('LOC', 'red'), ('GPE', 'blue')]:
    # .get avoids a KeyError when a class has no recognized entities
    for key, loc in ner_class.get(class_, {}).items():
        folium.Marker([loc['lat'], loc['lng']], popup=key, icon=folium.Icon(color=color)).add_to(m)

m
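The map above is centered on a fixed point, which may leave the markers off screen for news about other regions. As a possible refinement, the sketch below computes a bounding box from geocoded locations stored in the same `{'lat': ..., 'lng': ...}` form. The helper `bounding_box` is hypothetical and the sample coordinates are placeholders rather than real geocoding output.

```python
def bounding_box(locations):
    """Return [[min_lat, min_lng], [max_lat, max_lng]] for a list of {'lat', 'lng'} dicts."""
    lats = [loc["lat"] for loc in locations]
    lngs = [loc["lng"] for loc in locations]
    return [[min(lats), min(lngs)], [max(lats), max(lngs)]]

# Placeholder coordinates, not real geocoding output
sample = [
    {"lat": 43.6, "lng": -84.2},
    {"lat": 40.3, "lng": -78.9},
    {"lat": 39.9, "lng": -75.2},
]

print(bounding_box(sample))
```

The result lists the south-west corner first and the north-east corner second, which is the layout folium's `Map.fit_bounds` expects, so `m.fit_bounds(bounding_box(...))` could stand in for hand-tuning `location` and `zoom_start`.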