{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate Time Series News Reporting for Earthquake Events via GDELT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This repo will show you how to generate time series news reporting for earthquake events via GDELT. The GDELT Project is a real-time network diagram and database of global human society for open research. GDELT monitors the world's news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eathquake event data is collected from The International Disaster Database (EM-DAT). To reduce the data size, this notebook only focus on sample data containing 3 earthquake events." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Data structure: \n", "# . \n", "# ├── bigquery.json // your own Google BigQuery API key \n", "# ├── data \n", "# │ ├── countryinfo2.csv // country information \n", "# │ ├── earthquake_sample.csv // sample data containing 3 earthquake events \n", "# │ ├── earthquake_gdelt.csv // sample gdelt bigquery data containing 3 earthquake events \n", "# │ └── sourcesbycountry2018.csv // news sources by country \n", "# ├── EQNews_TS_generation.ipynb // Jupyter notebook \n", "# └── README.md " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import timedelta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 0: Generate FIPS for Earthquake Events" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Event_wiki_page</th>\n", " <th>UTC</th>\n", " <th>Country</th>\n", " <th>Location_iso3</th>\n", " <th>Latitude</th>\n", " <th>Longitude</th>\n", " <th>Magnitude</th>\n", " <th>Depth (km)</th>\n", " <th>MMI</th>\n", " <th>Death_wiki</th>\n", " <th>Location_fips</th>\n", " <th>Event_id</th>\n", " <th>UTC_round</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2024 Noto earthquake</td>\n", " <td>2024-01-01 07:10:09</td>\n", " <td>Japan</td>\n", " <td>JPN</td>\n", " <td>37.495</td>\n", " <td>137.265</td>\n", " <td>7.5</td>\n", " <td>10.0</td>\n", " <td>XI (Extreme)</td>\n", " <td>339</td>\n", " <td>JA</td>\n", " <td>240101_Japan</td>\n", " <td>2024-01-01 07:15:00</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2024 Hualien earthquake</td>\n", " <td>2024-04-02 23:58:11</td>\n", " <td>Taiwan</td>\n", " <td>TWN</td>\n", " <td>23.819</td>\n", " <td>121.562</td>\n", " <td>7.4</td>\n", " <td>40.0</td>\n", " <td>VIII (Severe)</td>\n", " <td>18</td>\n", " <td>TW</td>\n", " <td>240402_Taiwan</td>\n", " <td>2024-04-03 00:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2023 Al Haouz earthquake</td>\n", " <td>2023-09-08 22:11:01</td>\n", " <td>Morocco</td>\n", " <td>MAR</td>\n", " <td>31.058</td>\n", " <td>-8.385</td>\n", " <td>6.8</td>\n", " <td>19.0</td>\n", " <td>IX (Violent)</td>\n", " 
<td>2960</td>\n", " <td>MO</td>\n", " <td>230908_Morocco</td>\n", " <td>2023-09-08 22:15:00</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Event_wiki_page UTC Country Location_iso3 \\\n", "0 2024 Noto earthquake 2024-01-01 07:10:09 Japan JPN \n", "1 2024 Hualien earthquake 2024-04-02 23:58:11 Taiwan TWN \n", "2 2023 Al Haouz earthquake 2023-09-08 22:11:01 Morocco MAR \n", "\n", " Latitude Longitude Magnitude Depth (km) MMI Death_wiki \\\n", "0 37.495 137.265 7.5 10.0 XI (Extreme) 339 \n", "1 23.819 121.562 7.4 40.0 VIII (Severe) 18 \n", "2 31.058 -8.385 6.8 19.0 IX (Violent) 2960 \n", "\n", " Location_fips Event_id UTC_round \n", "0 JA 240101_Japan 2024-01-01 07:15:00 \n", "1 TW 240402_Taiwan 2024-04-03 00:00:00 \n", "2 MO 230908_Morocco 2023-09-08 22:15:00 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read earthquake event sample data\n", "df_event = pd.read_csv('data/earthquake_sample.csv')\n", "df_event" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# read country information\n", "country_info = pd.read_csv('data/countryinfo2.csv')\n", "# construct an iso3-to-FIPS mapping\n", "iso3_to_fips = pd.Series(country_info.fips.values, index=country_info.iso3).to_dict()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['JA', 'TW', 'MO']\n" ] } ], "source": [ "# Earthquake-affected countries (Location_fips may contain several codes joined by ',')\n", "print(df_event.Location_fips.str.split(',').explode().unique().tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Query GDELT Data via BigQuery" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed google-auth-oauthlib-1.2.1 google-cloud-bigquery-3.30.0 oauthlib-3.2.2 pandas-gbq-0.28.0 pydata-google-auth-1.9.1 requests-oauthlib-2.0.0\n", "Successfully installed google-cloud-core-2.4.3 google-cloud-storage-3.1.0\n" ] } ], "source": [ "!pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'\n", "!pip install --upgrade google-cloud-storage" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from google.cloud import bigquery\n", "from google.cloud import storage\n", "\n", "# Set Google application credentials (path to your service account key file)\n", "os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"bigquery.json\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Initialize BigQuery client\n", "bq_client = bigquery.Client()\n", "\n", "# Define the GKG query (the FIPS codes come from Step 0)\n", "query = \"\"\"\n", "SELECT\n", " gkg.GKGRECORDID,\n", " gkg.DATE,\n", " gkg.SourceCommonName,\n", " ARRAY_TO_STRING(\n", " ARRAY(\n", " SELECT\n", " DISTINCT UPPER(TRIM(SPLIT(location, '#')[OFFSET(2)])) -- Extract and uppercase the FIPS code\n", " FROM\n", " UNNEST(SPLIT(gkg.V2Locations, ';')) AS location\n", " WHERE\n", " UPPER(TRIM(SPLIT(location, '#')[OFFSET(2)])) IN ['JA', 'TW', 'MO']\n", " ), ','\n", " ) AS V2Locations_FIPS\n", "FROM\n", " `gdelt-bq.gdeltv2.gkg_partitioned` gkg\n", "WHERE\n", " _PARTITIONTIME BETWEEN TIMESTAMP(\"2023-09-07\") AND TIMESTAMP(\"2024-08-17\")\n", " AND gkg.DATE >= 20230907000000\n", " AND REGEXP_CONTAINS(gkg.V2Themes, r'(?i)EARTHQUAKE.*EARTHQUAKE')\n", " AND ARRAY_LENGTH(\n", " ARRAY(\n", " SELECT\n", " DISTINCT UPPER(TRIM(SPLIT(location, '#')[OFFSET(2)])) -- Ensure FIPS codes match case-insensitively\n", " FROM\n", " UNNEST(SPLIT(gkg.V2Locations, ';')) AS location\n", " WHERE\n", " UPPER(TRIM(SPLIT(location, '#')[OFFSET(2)])) IN ['JA', 'TW', 'MO']\n", " )\n", " ) > 0;\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Run the query once and save the result locally\n", "# Uncomment to run\n", "# query_job = bq_client.query(query)\n", "# results = query_job.to_dataframe()\n", "# results.to_csv('data/earthquake_gdelt.csv', index=False, sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Match Source Countries Using sourcesbycountry2018" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we match each news source to a country using the sourcesbycountry2018 table. This table lists all known news sources in the GDELT Global Knowledge Graph along with their country of focus, and it is built on the assumption that a source's country of focus is the country in which it is headquartered."
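] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this matching step (using a small made-up lookup table instead of the real sourcesbycountry2018 file, but with the same Domain and FIPS columns), a source domain that is missing from the table maps to NaN and is dropped in the cells below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only: toy versions of the source-country lookup and the GDELT records\n", "toy_sources = pd.DataFrame({'Domain': ['nhk.or.jp', 'bbc.co.uk'], 'FIPS': ['JA', 'UK']})\n", "toy_news = pd.DataFrame({'SourceCommonName': ['nhk.or.jp', 'unknown-site.example']})\n", "\n", "# Map each record's source domain to its country FIPS; unknown domains become NaN\n", "toy_news['sources_FIPS'] = toy_news['SourceCommonName'].map(toy_sources.set_index('Domain')['FIPS'])\n", "print(toy_news)"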
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_2018 = pd.read_csv('data/sourcesbycountry2018.csv', sep='\\t')\n", "results = pd.read_csv('data/earthquake_gdelt.csv')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# add a sources_FIPS column to results by joining df_2018 on Domain == SourceCommonName\n", "results['sources_FIPS'] = results['SourceCommonName'].map(df_2018.set_index('Domain')['FIPS'])\n", "total_count = len(results)\n", "\n", "# filter out rows where sources_FIPS is null\n", "results = results[results['sources_FIPS'].notnull()]\n", "\n", "# drop SourceCommonName\n", "results.drop(columns=['SourceCommonName'], inplace=True)\n", "\n", "# print step 2 statistics (the total is taken before filtering)\n", "print(f'News records dropped with invalid or null FIPS: {total_count - len(results)} ({(total_count - len(results)) / total_count * 100:.2f}% of total)')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "results.to_csv('data/earthquake_gdelt_addsourcefips.csv', index=False, sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Generate Time Series from GDELT Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we generate a time series of news reporting for each earthquake event from the GDELT data, and we also measure how quickly each event is first reported. The base temporal resolution of the data is 15 minutes, so time series can be generated at any resolution that is an integer multiple of 15 minutes, such as 30 minutes, 1 hour, or 1 day." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df_gdelts = pd.read_csv('data/earthquake_gdelt_addsourcefips.csv')\n", "df_event = pd.read_csv('data/earthquake_sample.csv')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "df_event['UTC'] = pd.to_datetime(df_event['UTC'])\n", "df_event['UTC_round'] = pd.to_datetime(df_event['UTC_round'])\n", "df_gdelts['DATE'] = pd.to_datetime(df_gdelts['DATE'], format='%Y%m%d%H%M%S')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>GKGRECORDID</th>\n", " <th>DATE</th>\n", " <th>V2Locations_FIPS</th>\n", " <th>sources_FIPS</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>20240609034500-T479</td>\n", " <td>2024-06-09 03:45:00</td>\n", " <td>TU,TW</td>\n", " <td>CH</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>20231010010000-1055</td>\n", " <td>2023-10-10 01:00:00</td>\n", " <td>SY,TU,JA</td>\n", " <td>PK</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>20231010030000-148</td>\n", " <td>2023-10-10 03:00:00</td>\n", " <td>MO,SY,TU</td>\n", " <td>US</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>20240204233000-T1042</td>\n", " <td>2024-02-04 23:30:00</td>\n", " <td>TW,TU</td>\n", " <td>CH</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", "
<td>20240204184500-T1719</td>\n", " <td>2024-02-04 18:45:00</td>\n", " <td>JA,SY,TU</td>\n", " <td>EG</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " GKGRECORDID DATE V2Locations_FIPS sources_FIPS\n", "0 20240609034500-T479 2024-06-09 03:45:00 TU,TW CH\n", "1 20231010010000-1055 2023-10-10 01:00:00 SY,TU,JA PK\n", "2 20231010030000-148 2023-10-10 03:00:00 MO,SY,TU US\n", "3 20240204233000-T1042 2024-02-04 23:30:00 TW,TU CH\n", "4 20240204184500-T1719 2024-02-04 18:45:00 JA,SY,TU EG" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_gdelts.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check if V2Locations_FIPS is NaN\n", "df_gdelts['V2Locations_FIPS'].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Event_wiki_page</th>\n", " <th>UTC</th>\n", " <th>Country</th>\n", " <th>Location_iso3</th>\n", " <th>Latitude</th>\n", " <th>Longitude</th>\n", " <th>Magnitude</th>\n", " <th>Depth (km)</th>\n", " <th>MMI</th>\n", " <th>Death_wiki</th>\n", " <th>Location_fips</th>\n", " <th>Event_id</th>\n", " <th>UTC_round</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2024 Noto earthquake</td>\n", " <td>2024-01-01 07:10:09</td>\n", " <td>Japan</td>\n", " <td>JPN</td>\n", " <td>37.495</td>\n", " <td>137.265</td>\n", " <td>7.5</td>\n", " <td>10.0</td>\n", " <td>XI (Extreme)</td>\n", " <td>339</td>\n", " <td>JA</td>\n", " <td>240101_Japan</td>\n", " <td>2024-01-01 07:15:00</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2024 Hualien earthquake</td>\n", " <td>2024-04-02 23:58:11</td>\n", " <td>Taiwan</td>\n", " <td>TWN</td>\n", " <td>23.819</td>\n", " <td>121.562</td>\n", " <td>7.4</td>\n", " <td>40.0</td>\n", " <td>VIII (Severe)</td>\n", " <td>18</td>\n", " <td>TW</td>\n", " <td>240402_Taiwan</td>\n", " <td>2024-04-03 00:00:00</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2023 Al Haouz earthquake</td>\n", " <td>2023-09-08 22:11:01</td>\n", " <td>Morocco</td>\n", " <td>MAR</td>\n", " <td>31.058</td>\n", " <td>-8.385</td>\n", " <td>6.8</td>\n", " <td>19.0</td>\n", " <td>IX (Violent)</td>\n", " <td>2960</td>\n", " <td>MO</td>\n", " <td>230908_Morocco</td>\n", " <td>2023-09-08 22:15:00</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Event_wiki_page UTC Country Location_iso3 \\\n", "0 2024 Noto earthquake 2024-01-01 07:10:09 Japan JPN \n", "1 2024 Hualien earthquake 2024-04-02 23:58:11 Taiwan TWN \n", "2 2023 Al Haouz earthquake 2023-09-08 22:11:01 Morocco MAR \n", "\n", " Latitude Longitude Magnitude Depth (km) MMI Death_wiki \\\n", "0 37.495 137.265 7.5 10.0 XI (Extreme) 339 \n", "1 23.819 121.562 7.4 40.0 VIII (Severe) 18 \n", "2 31.058 -8.385 6.8 19.0 IX (Violent) 2960 \n", "\n", " Location_fips Event_id UTC_round \n", "0 JA 240101_Japan 2024-01-01 07:15:00 \n", "1 TW 240402_Taiwan 2024-04-03 00:00:00 \n", "2 MO 230908_Morocco 
2023-09-08 22:15:00 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_event.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_ts_df(df_event, df_gdelts, start_shift='1D', end_shift='21D', ts_interval='1D', quick_report=['1h', '3h']):\n", " # Initialize an empty list to store results\n", " results = []\n", "\n", " # Create timedelta objects for shifts and intervals\n", " ts_interval_timedelta = pd.to_timedelta(ts_interval)\n", " start_shift_timedelta = pd.to_timedelta(start_shift)\n", " end_shift_timedelta = pd.to_timedelta(end_shift)\n", "\n", " # Iterate over each event in df_event\n", " for _, event in df_event.iterrows():\n", " event_id = event['Event_id']\n", " event_fips = set(event['Location_fips'].split(','))\n", "\n", " # Filter df_gdelts for the full time series generation (including first report time)\n", " ts_gdelts = df_gdelts[\n", " (df_gdelts['DATE'] >= event['UTC_round'] - start_shift_timedelta) & \n", " (df_gdelts['DATE'] <= event['UTC_round'] + end_shift_timedelta)\n", " ]\n", "\n", " # Further filter based on relevant FIPS codes and ensure FIPS codes are valid\n", " ts_gdelts = ts_gdelts[\n", " ts_gdelts['V2Locations_FIPS'].apply(lambda x: bool(event_fips.intersection(set(x.split(',')))))\n", " ].copy()\n", "\n", " # Skip to the next event if there are no reports in the current window\n", " if ts_gdelts.empty:\n", " continue\n", "\n", " # Calculate the first report time for each reporting country\n", " # the first report time is after event['UTC_round'] \n", " first_report_times = ts_gdelts[ts_gdelts['DATE'] > event['UTC_round']].groupby('sources_FIPS')['DATE'].min().reset_index()\n", " first_report_times.columns = ['report_country_FIP', 'first_report_time']\n", "\n", " # Initialize a dictionary to store quick report counts for different intervals\n", " quick_report_counts_dict = {}\n", "\n", " # Process each quick_report interval\n", " for quick_interval in quick_report:\n", " quick_report_timedelta = pd.to_timedelta(quick_interval)\n", "\n", " # Filter df_gdelts based on the quick report date range\n", " quick_report_gdelts = ts_gdelts[\n", " (ts_gdelts['DATE'] >= event['UTC_round']) & \n", " (ts_gdelts['DATE'] < event['UTC_round'] + quick_report_timedelta)\n", " ]\n", "\n", " # Calculate the quick report count for this event and interval\n", " quick_report_counts = quick_report_gdelts.groupby('sources_FIPS')['DATE'].count().reset_index()\n", " quick_report_counts.columns = ['report_country_FIP', f'report_{quick_interval}']\n", "\n", " # Store the quick report counts for the current interval\n", " quick_report_counts_dict[quick_interval] = quick_report_counts\n", "\n", " # Group by the reporting country for time series\n", " grouped_gdelts = ts_gdelts.groupby('sources_FIPS')\n", "\n", " for report_country, group in grouped_gdelts:\n", " # Find the first report time for this report_country\n", " first_report_time = first_report_times[first_report_times['report_country_FIP'] == report_country]\n", " if not first_report_time.empty:\n", " first_report_time = first_report_time['first_report_time'].values[0]\n", " else:\n", " continue # Skip this country if there's no valid first report time\n", "\n", " ts_start = event['UTC_round'] - start_shift_timedelta\n", " ts_end = event['UTC_round'] + end_shift_timedelta\n", " \n", " # Create a time series range\n", " time_range = pd.date_range(start=ts_start, end=ts_end, freq=ts_interval_timedelta)\n", "\n", " # 
Efficiently calculate reports within each time interval\n", " # [start, end)\n", " ts_array = group['DATE'].groupby(pd.cut(group['DATE'], time_range, right=False, include_lowest=True), observed=False).size().tolist()\n", "\n", " # Initialize quick report counts for this report_country\n", " quick_report_results = {}\n", " for quick_interval in quick_report:\n", " quick_report_count = quick_report_counts_dict[quick_interval]\n", " quick_report_count_for_country = quick_report_count[quick_report_count['report_country_FIP'] == report_country]\n", " \n", " if not quick_report_count_for_country.empty:\n", " quick_report_results[f'report_{quick_interval}'] = quick_report_count_for_country[f'report_{quick_interval}'].values[0]\n", " else:\n", " quick_report_results[f'report_{quick_interval}'] = 0\n", "\n", " # Add results for this country and event to the final result list\n", " result = {\n", " 'Event_id': event_id,\n", " 'Location_fips': event['Location_fips'],\n", " 'UTC': event['UTC'],\n", " 'UTC_round': event['UTC_round'],\n", " 'report_country_FIP': report_country,\n", " 'first_report_time': first_report_time,\n", " 'TS_start': ts_start,\n", " 'TS_end': ts_end,\n", " 'TS_interval': ts_interval,\n", " 'TS_array': ts_array\n", " }\n", "\n", " # Add quick report results to the result dictionary\n", " result.update(quick_report_results)\n", "\n", " results.append(result)\n", "\n", " print(f\"Event {event_id} processed.\")\n", " \n", " # Convert results list to DataFrame\n", " final_df = pd.DataFrame(results)\n", " return final_df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Event 240101_Japan processed.\n", "Event 240402_Taiwan processed.\n", "Event 230908_Morocco processed.\n" ] } ], "source": [ "result_df = generate_ts_df(df_event, df_gdelts)\n", "result_df.to_csv('data/earthquake_ts_1D_21D_1D.csv', index=False, sep=',')" ] } ], "metadata": { "kernelspec": { "display_name": "metagdelt", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }