{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parsing Hail Data Files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of the following code is to parse the Hail data from the The Alberta Hail Project Meteorological and Barge-Humphries Radar Archive and output it to a `.CSV` file.\n", "\n", "The data consists of hail measurements and observations submitted by Alberta farmers between the period of 1957 and 1985. \n", "\n", "The data exists as a directory of `.DAT` files which were manually converted to `.TXT` files (by changing the extension). Then, using regular expressions (also known as regex), the data is split apart and recombined into a nested list format allowing for export to a `.CSV` file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code imports the modules and defines the custom functions that are needed to parse the hail data. The comments coloured in light blue and red describe the individual code." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "import csv\n", "from os import listdir\n", "from os.path import splitext\n", "from os.path import basename\n", "\n", "# This function works on the contents of the files\n", "\n", "def read_file(filename):\n", " \"Read the contents of FILENAME and return as a string.\"\n", " infile = open(filename)\n", " contents = infile.read()\n", " infile.close()\n", " return contents\n", "\n", "# These functions remove the the path and file extension from a filename\n", "\n", "\n", "def list_textfiles(directory):\n", " \"Return a list of filenames ending in '.txt'\"\n", " textfiles = []\n", " for filename in listdir(directory):\n", " if filename.endswith(\".txt\"):\n", " textfiles.append(directory + \"/\" + filename)\n", " return textfiles\n", "\n", "\n", "def remove_ext(filename):\n", " \"Removes the file extension, such as .txt\"\n", " name, extension = splitext(filename)\n", " return name\n", "\n", "\n", "def remove_dir(filepath):\n", " \"Removes the path from the file name\"\n", " name = basename(filepath)\n", " return name\n", "\n", "\n", "def get_filename(filepath):\n", " \"Removes the path and file extension from the file name\"\n", " filename = remove_ext(filepath)\n", " name = remove_dir(filename)\n", " return name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a list of the files in the Hail directory that need to be parsed. This list is helpful because we will need to manually change the filename in the next section of code in order to parse each file separately. \n", "\n", "The function `listdir` is called to display the files contained in the specified directory, shown below in red text." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['57HAIL.txt', '58HAIL.txt', '59HAIL.txt', '60HAIL.txt', '61HAIL.txt', '62HAIL.txt', '63HAIL.txt', '64HAIL.txt', '65HAIL.txt', '66HAIL.txt', '67HAIL.txt', '68HAIL.txt', '69HAIL.txt', '74HAIL.txt', '75HAIL.txt', '76HAIL.txt', '77HAIL.txt', '78HAIL.txt', '79HAIL.txt', '80HAIL.txt', '81HAIL.txt', '82HAIL.txt', '83HAIL.txt', '84HAIL.txt', '85HAIL.txt']\n" ] } ], "source": [ "print(listdir('hail/'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is where the raw data is imported and assigned to a variable. To read a different file from the directory, the filename needs to be changed (based on the above list). It's important to note that the original data is never altered, only the data that is loaded and assigned to the `file` variable." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "file = 'Hail/76HAIL.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code calls on a custom function `get_filename` to strip the path and file extension from the filename, storing it in the variable `name`. This variable will be used later to name the `.CSV` file so it matches the name of the `.TXT` file from which the data came. \n", "\n", "Then, the data from the variable `file` is read using the custom function `read_file` and passed to a regex that matches the format of the file (seen in red text)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "name = get_filename(file)\n", "text = read_file(file)\n", "\n", "data = re.findall(r'^(\\s)(\\d{1})(\\d{2})(\\d{1})(\\d{2})(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{1})(\\d{4})(\\d{3})(\\d{4})(\\d{3})(\\d{1})(\\d{3})(\\d{1})(\\d{1})(\\d{1})(\\d{4})(\\d{1})(\\d{4})(\\d{1})(\\d{2})(\\d{1})(\\d{1})(\\d{1})(\\d{1})(\\d{2})(\\d{1})(\\d{1})(\\d{1})', text, re.MULTILINE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a look at the original data format. Each string of numbers is a separate entry consisting of groups of numbers that correspond to a specific data point, ranging from dates, times, geographical information, hail size and duration, rain size and duration, and damage to crops. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hail/76HAIL.txt\n", " 17652510002938244101001510001800254229999920000133990999219\n", " 17652500013532055111001511140111127299999920000199410900419\n", " 17652700100345234999901099991200076229999910025099412900999\n", " 17652910002049064164501216570109999229999920000099411900919\n", " 17652900101038154120000511550259999329999920000125411900999\n", " 17653101002046045163700916250559999299999910152099410900019\n", " 17653100013237075172003017100459999329999930020140431999399\n" ] } ], "source": [ "print(file)\n", "print(text[:426])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the data formatted by the regex. The first digit indicates the origin of the report (via mail-in card, etc.), the second is the year, and the third and fourth are the month and day. The data was parsed according to the information provided by the accompanying codebook." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(' ', '1', '76', '5', '25', '1000', '29', '38', '24', '4', '1010', '015', '1000', '180', '0', '254', '2', '2', '9', '9999', '2', '0000', '1', '33', '9', '9', '0', '9', '99', '2', '1', '9')]\n" ] } ], "source": [ "print(data[:1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the data can be converted to `.CSV`. This code uses the `name` variable that was created earlier to name the file and writes each group of data to one line in the file. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open(name + '.csv', 'w') as f:\n", " w = csv.writer(f)\n", " w.writerows(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This code reads all of the files and prints each line to one master `.CSV` file. The only extra step required is in the fourth chunk of code, where the master list needs to be flattened so that each string of data appears in its own row in the file." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "filenames = []\n", "for files in list_textfiles('Hail/'):\n", " files = get_filename(files)\n", " filenames.append(files)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "docs = []\n", "for filename in list_textfiles('Hail/'):\n", " docs.append(read_file(filename))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data_2 = []\n", "for doc in docs:\n", " data_2.append(re.findall(r'^(\\s)(\\d{1})(\\d{2})(\\d{1})(\\d{2})(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{1})(\\d{4})(\\d{3})(\\d{4})(\\d{3})(\\d{1})(\\d{3})(\\d{1})(\\d{1})(\\d{1})(\\d{4})(\\d{1})(\\d{4})(\\d{1})(\\d{2})(\\d{1})(\\d{1})(\\d{1})(\\d{1})(\\d{2})(\\d{1})(\\d{1})(\\d{1})', doc, re.MULTILINE))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "alldata = [line for sublist in data_2 for line in sublist]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open('allHail.csv', 'w') as f:\n", " w = csv.writer(f)\n", " w.writerows(alldata)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }