{ "cells": [ { "cell_type": "markdown", "id": "24e562c3", "metadata": {}, "source": [ "# Notebook for processing London Fire Brigade Data\n", "\n", "Author: [Huanfa Chen](https://github.com/huanfachen)\n", "\n" ] }, { "cell_type": "code", "execution_count": 34, "id": "ca908c7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last modified: 2026-01-08\n" ] } ], "source": [ "from datetime import date\n", "import pandas as pd\n", "# suppress warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "print(\"Last modified:\", date.today())" ] }, { "cell_type": "code", "execution_count": 14, "id": "56af03f7", "metadata": {}, "outputs": [], "source": [ "# read Excel file data from this link: https://data.london.gov.uk/download/em8xy/f5066d66-c7a3-415f-9629-026fbda61822/LFB%20Incident%20data%20from%202018%20-%202023.xlsx\n", "# data from 2024:\n", "# https://data.london.gov.uk/download/em8xy/58m/LFB%20Incident%20data%20from%202024%20onwards.xlsx\n", "import pandas as pd\n", "df = pd.read_excel('https://data.london.gov.uk/download/em8xy/f5066d66-c7a3-415f-9629-026fbda61822/LFB%20Incident%20data%20from%202018%20-%202023.xlsx')" ] }, { "cell_type": "code", "execution_count": 15, "id": "b6024ebc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['IncidentNumber', 'DateOfCall', 'CalYear', 'TimeOfCall', 'HourOfCall', 'IncidentGroup', 'StopCodeDescription', 'SpecialServiceType', 'PropertyCategory', 'PropertyType', 'AddressQualifier', 'Postcode_full', 'Postcode_district', 'UPRN', 'USRN', 'IncGeo_BoroughCode', 'IncGeo_BoroughName', 'ProperCase', 'IncGeo_WardCode', 'IncGeo_WardName', 'IncGeo_WardNameNew', 'Easting_m', 'Northing_m', 'Easting_rounded', 'Northing_rounded', 'Latitude', 'Longitude', 'FRS', 'IncidentStationGround', 'FirstPumpArriving_AttendanceTime', 'FirstPumpArriving_DeployedFromStation', 'SecondPumpArriving_AttendanceTime', 'SecondPumpArriving_DeployedFromStation', 
'NumStationsWithPumpsAttending', 'NumPumpsAttending', 'PumpCount', 'PumpMinutesRounded', 'Notional Cost (£)', 'NumCalls']\n" ] } ], "source": [ "print(df.columns.tolist())" ] }, { "cell_type": "code", "execution_count": 16, "id": "fde5f559", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IncidentNumberDateOfCallCalYearTimeOfCallHourOfCallIncidentGroupStopCodeDescriptionSpecialServiceTypePropertyCategoryPropertyType...FirstPumpArriving_AttendanceTimeFirstPumpArriving_DeployedFromStationSecondPumpArriving_AttendanceTimeSecondPumpArriving_DeployedFromStationNumStationsWithPumpsAttendingNumPumpsAttendingPumpCountPumpMinutesRoundedNotional Cost (£)NumCalls
0000008-010120182018-01-01201800:04:250False AlarmAFANaNNon ResidentialMosque...348.0FinchleyNaNNaN1.01.01603281.0
1000009-010120182018-01-01201800:04:300False AlarmAFANaNNon ResidentialPub/wine bar/bar...144.0BeckenhamNaNNaN1.01.01603281.0
2000010-010120182018-01-01201800:04:340FireSecondary FireNaNOutdoor StructureCommon external bin storage area...232.0SouthgateNaNNaN1.01.01603281.0
3000011-010120182018-01-01201800:04:580Special ServiceSpecial ServiceRTCRoad VehicleMultiple Vehicles...22.0EnfieldNaNNaN1.01.01603281.0
4000014-010120182018-01-01201800:07:470FirePrimary FireNaNRoad VehicleCar...241.0StratfordNaNNaN1.01.01603286.0
\n", "

5 rows × 39 columns

\n", "
" ], "text/plain": [ " IncidentNumber DateOfCall CalYear TimeOfCall HourOfCall \\\n", "0 000008-01012018 2018-01-01 2018 00:04:25 0 \n", "1 000009-01012018 2018-01-01 2018 00:04:30 0 \n", "2 000010-01012018 2018-01-01 2018 00:04:34 0 \n", "3 000011-01012018 2018-01-01 2018 00:04:58 0 \n", "4 000014-01012018 2018-01-01 2018 00:07:47 0 \n", "\n", " IncidentGroup StopCodeDescription SpecialServiceType PropertyCategory \\\n", "0 False Alarm AFA NaN Non Residential \n", "1 False Alarm AFA NaN Non Residential \n", "2 Fire Secondary Fire NaN Outdoor Structure \n", "3 Special Service Special Service RTC Road Vehicle \n", "4 Fire Primary Fire NaN Road Vehicle \n", "\n", " PropertyType ... FirstPumpArriving_AttendanceTime \\\n", "0 Mosque ... 348.0 \n", "1 Pub/wine bar/bar ... 144.0 \n", "2 Common external bin storage area ... 232.0 \n", "3 Multiple Vehicles ... 22.0 \n", "4 Car ... 241.0 \n", "\n", " FirstPumpArriving_DeployedFromStation SecondPumpArriving_AttendanceTime \\\n", "0 Finchley NaN \n", "1 Beckenham NaN \n", "2 Southgate NaN \n", "3 Enfield NaN \n", "4 Stratford NaN \n", "\n", " SecondPumpArriving_DeployedFromStation NumStationsWithPumpsAttending \\\n", "0 NaN 1.0 \n", "1 NaN 1.0 \n", "2 NaN 1.0 \n", "3 NaN 1.0 \n", "4 NaN 1.0 \n", "\n", " NumPumpsAttending PumpCount PumpMinutesRounded Notional Cost (£) NumCalls \n", "0 1.0 1 60 328 1.0 \n", "1 1.0 1 60 328 1.0 \n", "2 1.0 1 60 328 1.0 \n", "3 1.0 1 60 328 1.0 \n", "4 1.0 1 60 328 6.0 \n", "\n", "[5 rows x 39 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "5b0031be", "metadata": {}, "outputs": [], "source": [ "# extract data of year 2023\n", "df_lfb = df[df['CalYear'] == 2023]" ] }, { "cell_type": "code", "execution_count": 18, "id": "230a2c98", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(126114, 39)" ] }, "execution_count": 18, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "df_lfb.shape" ] }, { "cell_type": "code", "execution_count": 19, "id": "dd1f5cd4", "metadata": {}, "outputs": [], "source": [ "# save df_lfb as LFB_2023_data.csv\n", "df_lfb.to_csv('LFB_2023_data.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 4, "id": "73616a77", "metadata": {}, "outputs": [], "source": [ "# read CSV file data\n", "df_lfb = pd.read_csv('LFB_2023_data.csv')" ] }, { "cell_type": "code", "execution_count": 5, "id": "855892c8", "metadata": {}, "outputs": [], "source": [ "# aggregate data to daily level\n", "df_lfb_daily = df_lfb.groupby('DateOfCall').size().reset_index(name='IncidentCount')" ] }, { "cell_type": "code", "execution_count": 6, "id": "580cddec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateOfCallIncidentCount
02023-01-01402
12023-01-02293
22023-01-03267
32023-01-04285
42023-01-05305
\n", "
" ], "text/plain": [ " DateOfCall IncidentCount\n", "0 2023-01-01 402\n", "1 2023-01-02 293\n", "2 2023-01-03 267\n", "3 2023-01-04 285\n", "4 2023-01-05 305" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_daily.head()" ] }, { "cell_type": "markdown", "id": "f0a4f988", "metadata": {}, "source": [ "## Linking with daily weather data" ] }, { "cell_type": "code", "execution_count": 7, "id": "0bef82eb", "metadata": {}, "outputs": [], "source": [ "# read in weather data\n", "df_weather = pd.read_csv('./London_weather_data_1979_to_2023.csv')" ] }, { "cell_type": "code", "execution_count": 8, "id": "5059a539", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DATETXQ_TXTNQ_TNTGQ_TGSSQ_SSSD...RRQ_RRQQQ_QQPPQ_PPHUQ_HUCCQ_CC
01979010123.00-75.00-41.0070.009.0...4.0052.0010190.0083.002.00
11979010216.00-75.00-26.0017.008.0...0.0027.0010253.0076.006.00
21979010313.00-72.00-28.000.004.0...0.0013.0010205.0086.005.00
319790104-3.00-65.00-26.000.002.0...0.0013.0010084.0068.008.00
41979010556.00-14.00-8.0020.001.0...0.0029.0010225.0073.006.00
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " DATE TX Q_TX TN Q_TN TG Q_TG SS Q_SS SD ... RR \\\n", "0 19790101 23.0 0 -75.0 0 -41.0 0 70.0 0 9.0 ... 4.0 \n", "1 19790102 16.0 0 -75.0 0 -26.0 0 17.0 0 8.0 ... 0.0 \n", "2 19790103 13.0 0 -72.0 0 -28.0 0 0.0 0 4.0 ... 0.0 \n", "3 19790104 -3.0 0 -65.0 0 -26.0 0 0.0 0 2.0 ... 0.0 \n", "4 19790105 56.0 0 -14.0 0 -8.0 0 20.0 0 1.0 ... 0.0 \n", "\n", " Q_RR QQ Q_QQ PP Q_PP HU Q_HU CC Q_CC \n", "0 0 52.0 0 10190.0 0 83.0 0 2.0 0 \n", "1 0 27.0 0 10253.0 0 76.0 0 6.0 0 \n", "2 0 13.0 0 10205.0 0 86.0 0 5.0 0 \n", "3 0 13.0 0 10084.0 0 68.0 0 8.0 0 \n", "4 0 29.0 0 10225.0 0 73.0 0 6.0 0 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_weather.head()" ] }, { "cell_type": "code", "execution_count": 9, "id": "645258e9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DATE int64\n", "TX float64\n", "Q_TX int64\n", "TN float64\n", "Q_TN int64\n", "TG float64\n", "Q_TG int64\n", "SS float64\n", "Q_SS int64\n", "SD float64\n", "Q_SD int64\n", "RR float64\n", "Q_RR int64\n", "QQ float64\n", "Q_QQ int64\n", "PP float64\n", "Q_PP int64\n", "HU float64\n", "Q_HU int64\n", "CC float64\n", "Q_CC int64\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_weather.dtypes" ] }, { "cell_type": "code", "execution_count": 10, "id": "640d4080", "metadata": {}, "outputs": [], "source": [ "# extract weather data of year 2023\n", "df_2023 = df_weather[df_weather['DATE'].astype(str).str.startswith('2023')]" ] }, { "cell_type": "code", "execution_count": 11, "id": "aff107fe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(365, 21)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_2023.shape" ] }, { "cell_type": "code", "execution_count": 12, "id": "9075b1a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(364, 2)" ] }, "execution_count": 12, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_daily.shape" ] }, { "cell_type": "code", "execution_count": 13, "id": "5fe288ea", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_10264/228454178.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df_2023['DATE'] = pd.to_datetime(df_2023['DATE'], format='%Y%m%d').dt.date\n" ] }, { "data": { "text/plain": [ "(364, 23)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# merge fire incident data with weather data. Note that df_lfb_daily has a column of DateOfCall (2023-01-01, str type) while df_2023 has a column of DATE (20230101, int type)\n", "df_2023['DATE'] = pd.to_datetime(df_2023['DATE'], format='%Y%m%d').dt.date\n", "df_lfb_daily['DateOfCall'] = pd.to_datetime(df_lfb_daily['DateOfCall']).dt.date\n", "df_lfb_merged = pd.merge(df_lfb_daily, df_2023, left_on='DateOfCall', right_on='DATE', how='inner')\n", "df_lfb_merged.shape\n", "# 364 rows" ] }, { "cell_type": "code", "execution_count": 14, "id": "76a93fa8", "metadata": {}, "outputs": [], "source": [ "# add new columns to df_lfb_merged: 'IsWeekend' to indicate if the day is a weekend; 'weekday' to indicate the day of the week (string rather than integer)\n", "df_lfb_merged['IsWeekend'] = df_lfb_merged['DateOfCall'].apply(lambda x: 1 if pd.to_datetime(x).weekday() >= 5 else 0)\n", "df_lfb_merged['weekday'] = df_lfb_merged['DateOfCall'].apply(lambda x: pd.to_datetime(x).strftime('%A'))\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "b6f1aa52", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateOfCallIncidentCountDATETXQ_TXTNQ_TNTGQ_TGSS...QQQ_QQPPQ_PPHUQ_HUCCQ_CCIsWeekendweekday
02023-01-014022023-01-0193.01100.01112.010.0...21.0010080.0081.007.001Sunday
12023-01-022932023-01-02134.0052.0072.0052.0...44.0010169.0086.003.000Monday
22023-01-032672023-01-03139.0021.0078.000.0...9.0010186.0093.007.000Tuesday
32023-01-042852023-01-04135.0092.00116.004.0...17.0010146.0082.006.000Wednesday
42023-01-053052023-01-05127.0089.00112.004.0...28.0010186.0082.008.000Thursday
\n", "

5 rows × 25 columns

\n", "
" ], "text/plain": [ " DateOfCall IncidentCount DATE TX Q_TX TN Q_TN TG \\\n", "0 2023-01-01 402 2023-01-01 93.0 1 100.0 1 112.0 \n", "1 2023-01-02 293 2023-01-02 134.0 0 52.0 0 72.0 \n", "2 2023-01-03 267 2023-01-03 139.0 0 21.0 0 78.0 \n", "3 2023-01-04 285 2023-01-04 135.0 0 92.0 0 116.0 \n", "4 2023-01-05 305 2023-01-05 127.0 0 89.0 0 112.0 \n", "\n", " Q_TG SS ... QQ Q_QQ PP Q_PP HU Q_HU CC Q_CC \\\n", "0 1 0.0 ... 21.0 0 10080.0 0 81.0 0 7.0 0 \n", "1 0 52.0 ... 44.0 0 10169.0 0 86.0 0 3.0 0 \n", "2 0 0.0 ... 9.0 0 10186.0 0 93.0 0 7.0 0 \n", "3 0 4.0 ... 17.0 0 10146.0 0 82.0 0 6.0 0 \n", "4 0 4.0 ... 28.0 0 10186.0 0 82.0 0 8.0 0 \n", "\n", " IsWeekend weekday \n", "0 1 Sunday \n", "1 0 Monday \n", "2 0 Tuesday \n", "3 0 Wednesday \n", "4 0 Thursday \n", "\n", "[5 rows x 25 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_merged.head()" ] }, { "cell_type": "code", "execution_count": 16, "id": "7a2e3249", "metadata": {}, "outputs": [], "source": [ "# read in ukbankholidays-jul19.xls\n", "df_uk_holidays = pd.read_excel('./ukbankholidays-jul19.xls')\n", "# rename the first column to 'Date'\n", "df_uk_holidays.rename(columns={df_uk_holidays.columns[0]: 'Date'}, inplace=True)\n", "# ignore rows not in date format\n", "df_uk_holidays = df_uk_holidays[pd.to_datetime(df_uk_holidays['Date'], errors='coerce').notnull()]\n", "df_uk_holidays['Date'] = pd.to_datetime(df_uk_holidays['Date']).dt.date" ] }, { "cell_type": "code", "execution_count": 17, "id": "7e2e6de3", "metadata": {}, "outputs": [], "source": [ "# keep only holidays in 2023\n", "df_uk_holidays = df_uk_holidays[df_uk_holidays['Date'].astype(str).str.startswith('2023')]" ] }, { "cell_type": "code", "execution_count": 18, "id": "d0955ca0", "metadata": {}, "outputs": [], "source": [ "# merge holiday info into df_lfb_merged. 
A new column 'IsBankHoliday' is added to indicate if the day is a holiday.\n", "df_lfb_merged = pd.merge(df_lfb_merged, df_uk_holidays[['Date']], left_on='DateOfCall', right_on='Date', how='left', indicator=True)\n", "df_lfb_merged['IsBankHoliday'] = df_lfb_merged['_merge'].apply(lambda x: 1 if x == 'both' else 0)\n", "df_lfb_merged.drop(columns=['Date', '_merge'], inplace=True)" ] }, { "cell_type": "code", "execution_count": 19, "id": "9e9a3517", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "IsBankHoliday\n", "0 356\n", "1 8\n", "Name: count, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_merged.IsBankHoliday.value_counts()" ] }, { "cell_type": "code", "execution_count": 20, "id": "b325be8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['DateOfCall', 'IncidentCount', 'DATE', 'TX', 'Q_TX', 'TN', 'Q_TN', 'TG',\n", " 'Q_TG', 'SS', 'Q_SS', 'SD', 'Q_SD', 'RR', 'Q_RR', 'QQ', 'Q_QQ', 'PP',\n", " 'Q_PP', 'HU', 'Q_HU', 'CC', 'Q_CC', 'IsWeekend', 'weekday',\n", " 'IsBankHoliday'],\n", " dtype='object')" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_merged.columns" ] }, { "cell_type": "code", "execution_count": 21, "id": "48305a75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DateOfCall object\n", "IncidentCount int64\n", "DATE object\n", "TX float64\n", "Q_TX int64\n", "TN float64\n", "Q_TN int64\n", "TG float64\n", "Q_TG int64\n", "SS float64\n", "Q_SS int64\n", "SD float64\n", "Q_SD int64\n", "RR float64\n", "Q_RR int64\n", "QQ float64\n", "Q_QQ int64\n", "PP float64\n", "Q_PP int64\n", "HU float64\n", "Q_HU int64\n", "CC float64\n", "Q_CC int64\n", "IsWeekend int64\n", "weekday object\n", "IsBankHoliday int64\n", "dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_merged.dtypes" ] }, { "cell_type": "code", "execution_count": 28, "id": 
"7c064be0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows in df_lfb_merged_backup: 364\n", "Number of rows in df_lfb_merged after removing rows with Q_* = 9: 364\n" ] } ], "source": [ "# save a backup of df_lfb_merged as df_lfb_merged_backup\n", "# in df_lfb_merged, remove rows where any Q_* column has value of 9.\n", "# print the number of rows of df_lfb_merged_backup and df_lfb_merged\n", "df_lfb_merged_backup = df_lfb_merged.copy()\n", "q_columns = [col for col in df_lfb_merged.columns if col.startswith('Q_')]\n", "df_lfb_merged = df_lfb_merged[~df_lfb_merged[q_columns].isin([9]).any(axis=1)]\n", "print(f'Number of rows in df_lfb_merged_backup: {df_lfb_merged_backup.shape[0]}')\n", "print(f'Number of rows in df_lfb_merged after removing rows with Q_* = 9: {df_lfb_merged.shape[0]}')" ] }, { "cell_type": "code", "execution_count": 32, "id": "37274adc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE: 37.29923634910724\n", "R2 on training set: 0.9999933838844299\n", "R2 on testing set: 0.09958332777023315\n" ] } ], "source": [ "# using XGBoost to predict IncidentCount using weather, weekday, weekend, and bank holiday info\n", "import xgboost as xgb\n", "from sklearn.model_selection import train_test_split\n", "# from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.metrics import mean_squared_error\n", "# prepare data for modeling\n", "feature_cols = ['TX', 'TN', 'TG', 'SS', 'SD','RR','QQ', 'PP','HU','CC', 'IsWeekend', 'IsBankHoliday', 'weekday']\n", "X = df_lfb_merged[feature_cols]\n", "y = df_lfb_merged['IncidentCount']\n", "# one-hot encode the 'weekday' column\n", "X = pd.get_dummies(X, columns=['weekday'], drop_first=True)\n", "# split data into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "# train XGBoost model\n", "# model = xgb.XGBRegressor(objective='reg:squarederror', 
n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)\n", "model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)\n", "model.fit(X_train, y_train)\n", "# make predictions\n", "y_pred = model.predict(X_test)\n", "# evaluate model performance\n", "mse = mean_squared_error(y_test, y_pred)\n", "rmse = mse ** 0.5\n", "print(f'RMSE: {rmse}')\n", "\n", "# R2 of model on training and testing sets\n", "r2_train = model.score(X_train, y_train)\n", "r2_test = model.score(X_test, y_test)\n", "print(f'R2 on training set: {r2_train}')\n", "print(f'R2 on testing set: {r2_test}')" ] }, { "cell_type": "code", "execution_count": null, "id": "0b92a2bc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random Forest R2 on training set: 0.9136535356526865\n", "Random Forest R2 on testing set: 0.16263798826889153\n" ] } ], "source": [ "# use random forest regressor from sklearn to predict IncidentCount. Don't split data again\n", "# first, import RandomForestRegressor\n", "from sklearn.ensemble import RandomForestRegressor\n", "# train Random Forest model\n", "rf_model = RandomForestRegressor(random_state=10)\n", "rf_model.fit(X_train, y_train)\n", "# make predictions\n", "y_pred_rf = rf_model.predict(X_test)\n", "# evaluate model performance: training and testing R2\n", "r2_train_rf = rf_model.score(X_train, y_train)\n", "r2_test_rf = rf_model.score(X_test, y_test)\n", "print(f'Random Forest R2 on training set: {r2_train_rf}')\n", "print(f'Random Forest R2 on testing set: {r2_test_rf}')" ] }, { "cell_type": "code", "execution_count": 33, "id": "7c5b846b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Feature Importance\n", "2 TG 0.181695\n", "7 PP 0.157313\n", "0 TX 0.154714\n", "8 HU 0.103176\n", "6 QQ 0.093228\n", "5 RR 0.084433\n", "1 TN 0.062492\n", "9 CC 0.056843\n", "3 SS 0.055895\n", "14 weekday_Sunday 0.009379\n", "17 weekday_Wednesday 0.007540\n", "10 IsWeekend 
0.007209\n", "13 weekday_Saturday 0.006130\n", "15 weekday_Thursday 0.005779\n", "16 weekday_Tuesday 0.005241\n", "11 IsBankHoliday 0.004779\n", "12 weekday_Monday 0.004155\n", "4 SD 0.000000\n" ] } ], "source": [ "# feature importance in rf_model\n", "importances = rf_model.feature_importances_\n", "feature_names = X.columns\n", "feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})\n", "feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)\n", "print(feature_importance_df)" ] }, { "cell_type": "code", "execution_count": 36, "id": "d990e991", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['DateOfCall', 'IncidentCount', 'DATE', 'TX', 'Q_TX', 'TN', 'Q_TN', 'TG',\n", " 'Q_TG', 'SS', 'Q_SS', 'SD', 'Q_SD', 'RR', 'Q_RR', 'QQ', 'Q_QQ', 'PP',\n", " 'Q_PP', 'HU', 'Q_HU', 'CC', 'Q_CC', 'IsWeekend', 'weekday',\n", " 'IsBankHoliday'],\n", " dtype='object')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_lfb_merged.columns" ] }, { "cell_type": "code", "execution_count": 38, "id": "bf02fa24", "metadata": {}, "outputs": [], "source": [ "# save to a csv file LFB_2023_daily_data.csv\n", "df_lfb_merged.to_csv('LFB_2023_daily_data.csv', index=False)" ] }, { "cell_type": "markdown", "id": "cbb2595b", "metadata": {}, "source": [ "## Notes on location precision" ] }, { "cell_type": "markdown", "id": "c5919f60", "metadata": {}, "source": [ "Note that the easting/northing coordinates are not precise for residential buildings, as they point to the centroid of the postcode area for privacy reasons. The column of 'Postcode_district' contains the postcode district (e.g., 'E1 6') for each incident, which can be used for spatial aggregation.\n", "\n", "The dataset can be aggregated to the ward level using the 'WardCode' column, which contains the unique identifier for each ward in London. 
\n", "\n", "As ward is not a census geography, we need to use lookup table to map postcodes (``Postcode_district``) to MSOAs before linking with demographic data." ] }, { "cell_type": "code", "execution_count": 24, "id": "5bba140c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['LSOA21CD', 'LSOA21NM', 'LSOA21NMW', 'WD25CD', 'WD25NM', 'WD25NMW', 'LAD25CD', 'LAD25NM', 'LAD25NMW', 'ObjectId']\n" ] } ], "source": [ "# read in LSOA to ward lookup table\n", "df_lookup = pd.read_csv('LSOA_(2021)_to_Electoral_Ward_(2025)_to_LAD_(2025)_Best_Fit_Lookup_in_EW_v2.csv')\n", "print(df_lookup.columns.tolist())" ] } ], "metadata": { "kernelspec": { "display_name": "sds2021", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }