{ "cells": [ { "cell_type": "raw", "metadata": {}, "source": [ "---\n", "metadata: true\n", "section: \"Modification to raw AIS data from U.S. Marine Cadastre\"\n", "goal: \"Know the modifications that have been made on the raw data before using them in this course.\"\n", "time: \"x min\"\n", "prerequisites: \"Basics about machine learning\"\n", "level: \"Beginner and advanced\"\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modification to raw AIS data from U.S. Marine Cadastre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Description of the modifications done to the raw data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original AIS data were downloaded through the open access website of the U.S. Marine Cadastre (https://marinecadastre.gov/ais/).\n", "\n", "The __raw dataset__ looks like this:\n", "\n", "![text](A-1-1-ais1.JPG)\n", "\n", "The data is missing information to be able to work with separated trips: it contains only the AIS messages.\n", "\n", "The first step is to create the attribute ``TripID``. For that, we group the AIS messages according to their ``MMSI``, to make sure that they belong to the same ship. As one ship might travel several times in the area on several days, we split the trips that are recorded on a different day.\n", "\n", "Once we have the attribute ``TripID``, we can sort the values of each trip according to their timestamp, and collect the departure and arrival information (time, latitude and longitude). This information is saved in the attributes ``DepTime``, ``ArrTime``, ``DepLat``, ``DepLon``, ``ArrLat``, ``ArrLon``.\n", "\n", "Finally, we use the latitude and longitude information to retrieve the country and city of departure and arrival. This creates the attributes ``DepCountry``, ``DepCity``, ``ArrCountry``, ``ArrCity``.\n", "\n", "With all these modification, the __dynamic dataset__ is created and now looks like this:\n", "\n", "![text](A-1-1-ais2.JPG)\n", "\n", "From this dynamic dataset, we create the static dataset: we retrieve the static information of each trip and create a new row for each trip in the static dataset. We simply reuse the information for the attributes ``TripID``, ``MMSI``, ``VesselName``, ``IMO``, ``CallSign``, ``VesselType``, ``Length``, ``Width``, ``Draft``, ``Cargo``, ``DepTime``, ``ArrTime``, ``DepLat``, ``DepLon``, ``ArrLat``, ``ArrLon``, ``DepCountry``, ``DepCity``, ``ArrCountry``, ``ArrCity``.\n", "\n", "For the attributes ``MeanSOG`` and ``Duration``, we calculate them afterwards. ``MeanSOG`` is created by taking the mean of the ``SOG`` attribute for all the AIS messages of the trip. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code for creation of dynamic dataset from raw data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"```python\n",
"# Add the filename of the raw AIS data to the variable file_in\n",
"file_in = ''\n",
"# The path of the new file\n",
"file_out = ''\n",
"\n",
"\n",
"import pandas as pd\n",
"import reverse_geocoder as rg\n",
"\n",
"def reverseGeocode(lat, lon):\n",
"    '''\n",
"    This function returns the city and country names from latitude and longitude coordinates.\n",
"    '''\n",
"    coordinates = (lat, lon)\n",
"    result = rg.search(coordinates)\n",
"\n",
"    # result[0] is an OrderedDict containing 'lat', 'lon', 'name', 'admin1', 'admin2', 'cc'\n",
"    return result[0]\n",
"\n",
"\n",
"# Load raw data (only the first 100000 rows here)\n",
"data = pd.read_csv(file_in, nrows = 100000)\n",
"\n",
"# Transform timestamp into usable datetime type\n",
"data['BaseDateTime'] = pd.to_datetime(data['BaseDateTime'])\n",
"\n",
"# Create a list of the different MMSI\n",
"MMSI_list = data['MMSI'].unique()\n",
"\n",
"# Initialize the TripID attribute to 0 for all rows\n",
"data['TripID'] = 0\n",
"\n",
"'''\n",
"The following loop iterates over all the rows of the dataset to add the TripID information:\n",
"a new TripID is created for every different MMSI, and within one MMSI, a trip is split up if the ship\n",
"travels on two different days.\n",
"'''\n",
"tripid = 0\n",
"for MMSI in MMSI_list:\n",
"    date = pd.to_datetime('01.01.2000', format = '%d.%m.%Y') # placeholder date so that the first row enters the if\n",
"    for index, row in data.loc[data['MMSI'] == MMSI].iterrows(): # iterate over the messages of one MMSI\n",
"        if (row['BaseDateTime'].day != date.day\n",
"                or row['BaseDateTime'].month != date.month\n",
"                or row['BaseDateTime'].year != date.year): # different day: different trip\n",
"            date = row['BaseDateTime'] # keep the date to compare against later\n",
"            tripid = tripid + 1\n",
"        data.loc[index, 'TripID'] = tripid # add the TripID number to the row\n",
"\n",
"\n",
"TripID_list = data['TripID'].unique()\n",
"\n",
"# Initialize the following attributes to 0\n",
"data['DepTime'] = 0\n",
"data['ArrTime'] = 0\n",
"data['DepLat'] = 0\n",
"data['DepLon'] = 0\n",
"data['ArrLat'] = 0\n",
"data['ArrLon'] = 0\n",
"\n",
"'''\n",
"The following loop iterates over each trip (each different value of TripID) to add the departure and\n",
"arrival information. The function sort_values() makes it easy to access the first and last timestamps\n",
"of the trip.\n",
"'''\n",
"for TripID in TripID_list: # iterate over each trip\n",
"    this_trip = data.loc[data['TripID'] == TripID].sort_values('BaseDateTime')\n",
"\n",
"    departure_time = this_trip.iloc[0]['BaseDateTime']\n",
"    departure_index = this_trip.index[0]\n",
"    arrival_time = this_trip.iloc[-1]['BaseDateTime']\n",
"    arrival_index = this_trip.index[-1]\n",
"\n",
"    data.loc[data['TripID'] == TripID, 'DepTime'] = departure_time\n",
"    data.loc[data['TripID'] == TripID, 'ArrTime'] = arrival_time\n",
"    data.loc[data['TripID'] == TripID, 'DepLat'] = data.loc[departure_index, 'LAT']\n",
"    data.loc[data['TripID'] == TripID, 'DepLon'] = data.loc[departure_index, 'LON']\n",
"    data.loc[data['TripID'] == TripID, 'ArrLat'] = data.loc[arrival_index, 'LAT']\n",
"    data.loc[data['TripID'] == TripID, 'ArrLon'] = data.loc[arrival_index, 'LON']\n",
"\n",
"# Initialize the following attributes\n",
"data['DepCountry'] = '?'\n",
"data['DepCity'] = '?'\n",
"data['ArrCountry'] = '?'\n",
"data['ArrCity'] = '?'\n",
"\n",
"'''\n",
"The following loop iterates over each trip (each different value of TripID), gets the departure and\n",
"arrival latitude and longitude values and looks up the corresponding city and country. This information\n",
"is added in the attributes 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity'.\n",
"'''\n",
"for TripID in TripID_list: # iterate over each trip\n",
"    this_trip = data.loc[data['TripID'] == TripID] # select the rows of this trip\n",
"\n",
"    dep_lat = this_trip.iloc[0]['DepLat']\n",
"    dep_lon = this_trip.iloc[0]['DepLon']\n",
"    departure = reverseGeocode(dep_lat, dep_lon)\n",
"    data.loc[data['TripID'] == TripID, 'DepCity'] = departure['name']\n",
"    data.loc[data['TripID'] == TripID, 'DepCountry'] = departure['cc']\n",
"\n",
"    arr_lat = this_trip.iloc[0]['ArrLat']\n",
"    arr_lon = this_trip.iloc[0]['ArrLon']\n",
"    arrival = reverseGeocode(arr_lat, arr_lon)\n",
"    data.loc[data['TripID'] == TripID, 'ArrCity'] = arrival['name']\n",
"    data.loc[data['TripID'] == TripID, 'ArrCountry'] = arrival['cc']\n",
"\n",
"\n",
"# Save new dataset\n",
"data.to_csv(file_out)\n",
"```" ] }
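, { "cell_type": "markdown", "metadata": {}, "source": [
"The row-by-row loop above is easy to follow, but it can be slow on large AIS files. As an optional aside, here is a minimal vectorized sketch of the same idea using pandas ``groupby`` (a sketch, not the code used in this course): every unique combination of ``MMSI`` and calendar day becomes one trip. The trip numbers may come out in a different order than with the loop, but the grouping is the same.\n",
"\n",
"```python\n",
"# Minimal sketch: one TripID per (MMSI, calendar day) combination.\n",
"# Assumes 'data' has been loaded and 'BaseDateTime' converted to datetime as above.\n",
"day = data['BaseDateTime'].dt.date\n",
"data['TripID'] = data.groupby(['MMSI', day], sort=False).ngroup() + 1\n",
"```" ] }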
, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code for creation of static dataset from dynamic dataset" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"```python\n",
"file_in = '' # the input file is the dynamic dataset\n",
"file_out = ''\n",
"\n",
"\n",
"import pandas as pd\n",
"\n",
"# Load the dynamic dataset\n",
"data = pd.read_csv(file_in)\n",
"\n",
"columns = ['TripID', 'MMSI', 'MeanSOG', 'VesselName', 'IMO', 'CallSign', 'VesselType',\n",
"           'Length', 'Width', 'Draft', 'Cargo', 'DepTime', 'ArrTime', 'DepLat', 'DepLon',\n",
"           'ArrLat', 'ArrLon', 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity', 'Duration']\n",
"\n",
"# Create new DataFrame with the wanted columns\n",
"static_data = pd.DataFrame(columns = columns)\n",
"\n",
"# Remove MeanSOG and Duration from columns because these attributes are created separately below\n",
"columns.remove('MeanSOG')\n",
"columns.remove('Duration')\n",
"\n",
"# Change DepTime and ArrTime type to be able to calculate the Duration later\n",
"data['DepTime'] = pd.to_datetime(data['DepTime'])\n",
"data['ArrTime'] = pd.to_datetime(data['ArrTime'])\n",
"\n",
"i = 0\n",
"for tripid in data['TripID'].unique(): # iterate over trips and create one row for each trip\n",
"\n",
"    first_row = data.loc[data['TripID'] == tripid].iloc[0]\n",
"\n",
"    for attribute in columns:\n",
"        # Fill the new dataset with the value of the attribute for the first row\n",
"        # (the static attributes don't change within one trip)\n",
"        static_data.loc[i, attribute] = first_row[attribute]\n",
"\n",
"    # For MeanSOG: take the mean of all the rows of the same trip\n",
"    df_tripid = data.loc[data['TripID'] == tripid]\n",
"    static_data.loc[i, 'MeanSOG'] = df_tripid['SOG'].mean()\n",
"\n",
"    i = i + 1\n",
"\n",
"static_data['Duration'] = static_data['ArrTime'] - static_data['DepTime']\n",
"\n",
"\n",
"# Save new dataset\n",
"static_data.to_csv(file_out)\n",
"```" ] }
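, { "cell_type": "markdown", "metadata": {}, "source": [
"As with the TripID loop, the per-trip loop above can be slow on large files. As an optional aside, here is a minimal ``groupby`` sketch of the same aggregation (a sketch that assumes the column names of the dynamic dataset above, not the code used in this course): the static attributes are taken from the first message of each trip, ``MeanSOG`` is the mean of ``SOG``, and ``Duration`` is again the difference between arrival and departure time.\n",
"\n",
"```python\n",
"# Minimal sketch, assuming 'data' is the dynamic dataset with 'DepTime' and\n",
"# 'ArrTime' already converted to datetime as above\n",
"static_cols = ['MMSI', 'VesselName', 'IMO', 'CallSign', 'VesselType',\n",
"               'Length', 'Width', 'Draft', 'Cargo', 'DepTime', 'ArrTime',\n",
"               'DepLat', 'DepLon', 'ArrLat', 'ArrLon',\n",
"               'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity']\n",
"\n",
"grouped = data.groupby('TripID', sort=False)\n",
"static_data = grouped[static_cols].first() # static attributes: first message of each trip\n",
"static_data['MeanSOG'] = grouped['SOG'].mean() # mean speed over ground per trip\n",
"static_data['Duration'] = static_data['ArrTime'] - static_data['DepTime']\n",
"static_data = static_data.reset_index() # make TripID a regular column again\n",
"```" ] }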
 ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }