{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Parsing Text Data Patterns Line-by-Line\n", "\n", "When a text uses natural language, i.e. normal human speech or prose, its data is embedded in\n", "the particular syntax and lexicon of that language. Natural Language Processing (NLP) employs sophisticated\n", "models, trained on millions of documents, to parse natural language for meaningful information.\n", "\n", "The text of the Charlotte directory, however, resembles a table with rows and columns more than it resembles\n", "prose, so there are no sentence structures for NLP to use as hints. Instead, NLP falls back on the English\n", "lexicon alone to establish parts of speech and relationships. The result is little usable data, mostly in the form of\n", "recognized named entities.\n", "\n", "In fact, the lines in our text file follow non-English rules that we can identify and use to extract data directly.\n", "For instance, each line starts with a description of the household \"race\" category. Within households there\n", "are often members of another race, and these are also labeled. After each mention of race there is usually a surname\n", "and then a given name, but not always. Our job is to describe and encode these rules in formal patterns."
] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 files\n", "173 lines of text\n" ] } ], "source": [ "import re\n", "import os\n", "cwd = os.getcwd()\n", "data_loc = cwd + \"/data\"\n", "output_loc = cwd + \"/output\"\n", "allfiles = []\n", "\n", "# find all the text files\n", "for root, dirs, files in os.walk(data_loc):\n", "    for file in files:\n", "        if file.endswith(\".txt\"):\n", "            allfiles.append(os.path.join(root, file))\n", "\n", "print('%d files' % len(allfiles))\n", "\n", "# collect every line of every file into one list\n", "alltext = []\n", "for file in allfiles:\n", "    with open(file, \"r\") as a_file:\n", "        for line in a_file:\n", "            alltext.append(line)\n", "\n", "print('%d lines of text' % len(alltext))" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "def parse(pattern):\n", "    # Apply the pattern to every line; print the lines that fail to match\n", "    data = []\n", "    for line in alltext:\n", "        match = pattern.search(line)\n", "        if match is None:\n", "            data.append(None)\n", "            print(line)\n", "        else:\n", "            data.append(match.groupdict())\n", "    return data" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{}, {}, {}]\n" ] } ], "source": [ "race_pattern = re.compile(r\"^(Black|White).*$\") # Match any line starting with Black or White\n", "data = parse(race_pattern)\n", "print(data[0:3])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'race': 'White', 'surname': 'Adams'}, {'race': 'White', 'surname': 'Adams'}, {'race': 'Black', 'surname': 'Adams'}]\n" ] } ], "source": [ "race_pattern = re.compile(r\"^(?P<race>Black|White)\\t(?P<surname>\\w*).*$\") # Match surname after a tab character\n", "data = parse(race_pattern)\n", "print(data[0:3])" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "White\tAdams\t\t\tJohn\tW\tCora\t\tconductor\tS A L Railway\tHouse\t21st nr Caldwell \n", "\n", "Black\tAdams\t\t\tJoseph\t\tViolet\t\tcooper\t\tHouse\t1011 s Church\n", "\n", "Black\tAdams\t\t\tKate\t\t\t\tlaundress\t\tHouse\tGroveton \n", "\n", "White\tAdams\t\t\tLafayette\tN\t\t\tclerk\tSouthern Railway\tHouse\t327 n Tryon \n", "\n", "White\tAdams\t\t\tLawrence\tA\t\t\tsalesman\tB S Moore & Co\tRooms\t405 s Tryon \n", "\n", "White\tAdams\t\t\tLaurie\tA\tMargaret\tN\tmill head\t\tHouse\tElizabetMills \n", "\n", "Black\tAdams\t\t\tLeland\t\t\t\tporter\tJ P Stowe & Co\tHouse\t403 s Myers \n", "\n", "Black\tAdams\t\t\tLizzie\t\t\t\t\t\tHouse\t309 1/2 w Morehead \n", "\n", "White\tAdams\t\twid (Geo O)\tGrace\tM\t\t\t\t\tHouse\t601 n College \n", "\n", "White\tAdams\t\t\tLuther\tM\tMamie\t\t\t\tHouse\tGroveton \n", "\n", "Black\tAdams\t\t\tMajor\t\t\t\twaiter\tBuford Hotel \t\t\n", "\n", "Black\tAdams\t\t\tMattie\t\t\t\t\t\tHouse\t719 n Graham ext \n", "\n", "Black\tAdams\t\t\tReuben\t\tBelle\t\tlaborer\tY & B Co\tHouse\t419 w 2d \n", "\n", "Black\tAdams\t\t\tRosa\t\t\t\t\t\tHouse\t714 s Caldwell \n", "\n", "Black\tAdams\t\t\tRufus\t\t\t\tlaborer\t\tHouse\tGreenville \n", "\n", "Black\tAdams\t\t\tRufus\t\t\t\tdriver\tStand I & F Co\tHouse\tRoss Town \n", "\n", "Black\tAdams\t\t\tViolet\t\t\t\tservant\t\t\t508 w Trade \n", "\n", "White\tAdams\t\t\tWheeler\tF\tMamie\t\tmoulder\t\tHouse\t303 s Cedar \n", "\n", "White\tAdams\t\t\tWilliam\tE\t\t\t\tThe Chronicle\tRooms\t300 1/2 s Church\n", "\n", "White\tAdcock\t\t\tJohn\tF\t\t\tmill 
worker\t\tHouse\t916 Calvine av \n", "\n", "White\tAdcock\t\twid (Jas M)\tMillie\tM\t\t\t\t\tHouse\t916 Calvine av \n", "\n", "White\tAdelsheimer\t\t\tHenry\tS\tLizzie\t\tmill worker\t\tHouse\t1216 Louise av \n", "\n", "Black\tAdkins\t\t\tKing\t\t\t\tlaborer\t\t\t600 s Myers \n", "\n", "White\tAdkins Walter D (Leona E)\t\t\tWalter\tD\tLeona\tE\tlineman\t\tHouse\t(r) 305 e 13th\n", "\n", "Black\tAgers\t\t\tNancy\t\t\t\tcook\t\tHouse\t206 Wilson \n", "\n", "Black\tAgers\t\t\tSallie\t\t\t\tlaundress\t\tHouse\t420 Jackson \n", "\n", "White\tAhaus\t\t\tHerman\t\tFrances\tE\ttailor\t203 w 4th\tHouse\t204 s Church\n", "\n", "White\tAikel\t\t\tJoseph\t\t\t\tconfectioner\t317 e Trade\tRooms\t225 w Trade \n", "\n", "White\tAiken\t\t\tGeorge\tW M\tBarbara\t\tsuperintendent\tQueen City M & G Wks\tHouse\t1120 s Caldwell \n", "\n", "White\tAiken\t\t\tHenry\t\t\t\t\t\tRooms\t9 e 3d \n", "\n", "Black\tAiken\t\t\tWalter\t\tElla\t\tlaborer\t\tHouse\t600 e 2d \n", "\n", "Black\tAiken\t\t\tWilliam\t\tElla\t\tdriver\tW I Henderson Gro Co\tHouse\t528 e 8th\n", "\n", "White\tAkers\t\t\tJoseph\tJ\t\t\tchief clerk superintendent\tSouthern Railway\tHouse\t908 s Tryon \n", "\n", "White\tAlbea\t\t\tJohn\tC\t\t\tpressman\tNews Ptg Co\tHouse\t310 s McDowell \n", "\n", "White\tAlbea\t\twid (J F)\tEmma\tS\t\t\t\t\tHouse\t310 s McDowell \n", "\n", "White\tAlbrecht\t\twid (Mathias)\tGesha\t\t\t\t\t\tHouse\t402 Elizabeth av \n", "\n", "White\tAlbright\t\t\tFay\tL\t\t\tclerk\t\tHouse\t15 s Caldwell \n", "\n", "White\tAlbright\t\t\tHal\tC\t\t\tclerk\tPost Office\tHouse\t15 s Caldwell \n", "\n", "White\tAlbright\t\t\tJudson\tD\tBelle\tC\tofficer\tUS Internal Revenue\tHouse\t15 s Caldwell \n", "\n", "White\tAldred\t\t\tCurtis\tM\tVictoria\t\tmachinist\t\tHouse\tSeversville \n", "\n", "White\tAldred\t\t\tJames\tA\tMattie\t\tclerk\t\tHouse\tSeversville \n", "\n", "White\tAldridge\t\twid (Wm A)\tMillie\tS\t\t\t\t\tHouse\tHarrill st Belmont Park \n", "\n", 
"White\tAlexain\t\t\tNicholas\t\t\t\temployment\tMet Café\tRooms\t26 n Tryon \n", "\n", "Black\tAlexander\t\t\tAdelaide\t\t\t\t\t\tHouse\t910 (909) e 1st \n", "\n", "Black\tAlexander\t\t\tAdeline\t\t\t\t\t\tHouse\tLutheran st \n", "\n", "White\tAlexander\t\t\tAlbert\tW\tAlice\t\tforeman\tCharlotte Cordage Co\tHouse\t621 e 5t\n", "\n", "Black\tAlexander\t\t\tAlfred\t\tWillie\t\tdriver\t\tHouse\t1016 e 3d \n", "\n", "Black\tAlexander\t\t\tAllen\t\tLizzie\t\tlaborer\t\tHouse\t6 10th st al \n", "\n", "Black\tAlexander\t\t\tAlvey\t\t\t\tlaborer\t\tHouse\tLutheran st \n", "\n", "Black\tAlexander\t\t\tAndrew\t\tBessie\t\tgrocer\t15 1/2 e Boundary\tHouse\t514 e Stonewall \n", "\n", "Black\tAlexander\t\t\tAndrew\tW\t\t\t\t\tRooms\t415 n Myers \n", "\n", "White\tAlexander\t\t\tArnold\tE\tAlice\t\tlaundryman\t\tHouse\t1424 e 5th\n", "\n", "Black\tAlexander\t\t\tBernard\t\t\t\tporter\tYoung's Steam Baking Co \t\t\n", "\n", "Black\tAlexander\t\t\tBessie\t\t\t\t\t\tHouse\t1006 e 1st \n", "\n", "White\tAlexander\t\t\tBlanche\t\t\t\tmadam\t\tHouse\t17 Spring \n", "\n", "Black\tAlexander\t\t\tBurdette\t\t\t\tporter\tColonial Club \t\t\n", "\n", "White\tAlexander\t\t\tC\tP\t\t\t\t\tHouse\tProvidence rd \n", "\n", "Black\tAlexander\t\t\tCalvin\t\tEmeline\t\tlaborer\t\tHouse\t213 1/2 w 1st \n", "\n", "White\tAlexander\t\t\tCarl\tL\t\t\tclerk\t\tHouse\t200 e Oak \n", "\n", "Black\tAlexander\t\t\tCarrie\t\t\t\t\t\tHouse\tGaither's al \n", "\n", "Black\tAlexander\t\t\tCarson\t\t\t\tchauffeur\tJ E S Davidson\tHouse\t209 s Middle \n", "\n", "Black\tAlexander\t\t\tCharles\t\t\t\tlaborer\t\tHouse\t410 n A \n", "\n", "White\tAlexander\t\t\tCharles\tF\tMary\t\tclerk\tPost Office\tHouse\t308 e Worthington av \n", "\n", "White\tAlexander\t\t\tCharles\tL\tSue\t\tdentist\t913-918 Realty Bldg\tHouse\t900 s Tryon \n", "\n", "White\tAlexander\t\t\tCharles\tY\t\t\tteacher\tKing's Bus College\tRooms\t305 s Church\n", "\n", "White\tAlexander\t\t\tClarence\tW\t\t\temployment\tJ M Porter\tHouse\t1116 s 
Tryon \n", "\n", "White\tAlexander\t\t\tClyde\tC\t\t\tpainter\t\tHouse\t810 n Brevard \n", "\n", "Black\tAlexander\t\t\tDaniel\t\tEtta\t\tlaborer\t\tHouse\t527 e 8th\n", "\n", "Black\tAlexander\t\t\tDecatur\tC\tAlta\t\tporter\tColonial Club\tHouse\t515 e 1st \n", "\n", "Black\tAlexander\t\t\tEdward\t\tLouvenia\t\tlaborer\t\tHouse\t1012 e Stonewall \n", "\n", "Black\tAlexander\t\t\tEli\t\tJennie\t\tgrocer\tOil Town\tHouse\t900 s Tryon \n", "\n", "Black\tAlexander\t\t\tElizabeth\t\t\t\tlaundress\t\tHouse\t414 s College \n", "\n", "Black\tAlexander\t\t\tEllen\t\t\t\t\t\tHouse\t606 s McDowell \n", "\n", "White\tAlexander\t\twid (Wm S)\tEmma\tV\t\t\t\t\tHouse\t610 n Graham ext \n", "\n", "Black\tAlexander\t\t\tEphraim\t\tAlice\t\tdriver\tColes & Smith\tHouse\tStonewall bet A & B \n", "\n", "Black\tAaron\t\t\tAmelia\t\t\t\tdomestic\t\t\t506 s Tryon \n", "\n", "White\tAbbey\t\t\tSimeon\tA\tMary\tA\tsuperintendent construction\tGeneral Fire Ext Co\tHouse\t104 Central av \n", "\n", "White\tAbee\t\t\tJunius\tA\t\t\ttelegraph operator\tSouthern Railway\tBoards\t507 n Graham \n", "\n", "White\tAbel\t\t\tAbram\t\tFannie\t\ttraveling salesman\t\tHouse\t506 w 10t\n", "\n", "Black\tAbel\t\t\tBelle\t\t\t\t\t\tHouse\t14 Boundary al \n", "\n", "Black\tAbel\t\t\tGeorge\t\t\t\tlaborer\t\tHouse\t12 Boundary al \n", "\n", "White\tAbernathy\t\t\tC\tL\t\t\t\t\tHouse\tLawyer's rd \n", "\n", "White\tAbernathy\t\t\tClement\tD\tCora\t\temployment\tCharlotte Lea Belt Co\tHouse\t1117 s Tryon \n", "\n", "White\tAbernathy\t\t\tDavid\tM\tEnola\t\t\t\tHouse\t1310 e 4text \n", "\n", "Black\tAbernathy\t\t\tHannah\t\t\t\t\t\tHouse\t4 Bellinger \n", "\n", "White\tAbernathy\t\t\tJohn\tW\tNannie\t\tcarpenter\t\tHouse\tSunnyside \n", "\n", "Black\tAbernathy\t\t\tJoseph\t\t\t\tlaborer\t\tHouse\t422 w Hill \n", "\n", "Black\tAbernathy\t\t\tLewis\t\tAnnie\t\tcarpenter\t\tHouse\t422 w Hill \n", "\n", "White\tAbernethy\t\t\tGlenn\tE\t\t\tsalesman\tBurwell & Dunn Co\tHouse\t409 w 11th\n", "\n", 
"White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "White\tAbernethy\t\t\tJames\tF\tAlice\t\tblacksmith\tTrade ext\tHouse\tSeversville \n", "\n", "White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "White\tAbernethy\t\t\tThomas\tJ\tLucy\t\temployment\tCity\tHouse\t414 Templeton av \n", "\n", "White\tAbernethy\t\t\tLeslie\tW\t\t\tmachinist\tW S Abernethy\tHouse\t409 w 11th\n", "\n", "White\tAbernethy\t\t\tWilliam\tS\tMamie\t\tautomobile representative\t29 w 4th\tHouse\t409 w 11th\n", "\n", "White\tAbraham\t\t\tWilliam\t\tLula\t\t\t\tHouse\t516 s College \n", "\n", "Black\tAdanis\t\t\tAnna\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "Black\tAdams\t\t\tBelle\t\t\t\tmaid\tRealty Bldg\tHouse\t419 w 2d \n", "\n", "Black\tAdams\t\t\tBerry\tA\tEdith\t\tporter\tW S Cramer\tHouse\tSeversville \n", "\n", "Black\tAdams\t\t\tBessie\t\t\t\tcook\t\t\t247 e Trade \n", "\n", "White\tAdams\t\t\tCompton\t\tCora\t\tmill head\t\tHouse\tElizabeth Mills \n", "\n", "Black\tAdams\t\t\tEdward\t\tMittie\t\tdriver\tRhyne Bros\tHouse\t305 w Palmer \n", "\n", "White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "Black\tAdams\t\t\tHenry\t\tDicie\t\tlaborer\t\tHouse\t308 Middle \n", "\n", "Black\tAaron\t\t\tAmelia\t\t\t\tdomestic\t\t\t506 s Tryon \n", "\n", "White\tAbbey\t\t\tSimeon\tA\tMary\tA\tsuperintendent construction\tGeneral Fire Ext Co\tHouse\t104 Central av \n", "\n", "White\tAbee\t\t\tJunius\tA\t\t\ttelegraph operator\tSouthern Railway\tBoards\t507 n Graham \n", "\n", "White\tAbel\t\t\tAbram\t\tFannie\t\ttraveling salesman\t\tHouse\t506 w 10t\n", "\n", "Black\tAbel\t\t\tBelle\t\t\t\t\t\tHouse\t14 Boundary al \n", "\n", "Black\tAbel\t\t\tGeorge\t\t\t\tlaborer\t\tHouse\t12 Boundary al \n", "\n", "White\tAbernathy\t\t\tC\tL\t\t\t\t\tHouse\tLawyer's rd \n", "\n", "White\tAbernathy\t\t\tClement\tD\tCora\t\temployment\tCharlotte Lea Belt Co\tHouse\t1117 s Tryon \n", "\n", 
"White\tAbernathy\t\t\tDavid\tM\tEnola\t\t\t\tHouse\t1310 e 4text \n", "\n", "Black\tAbernathy\t\t\tHannah\t\t\t\t\t\tHouse\t4 Bellinger \n", "\n", "White\tAbernathy\t\t\tJohn\tW\tNannie\t\tcarpenter\t\tHouse\tSunnyside \n", "\n", "Black\tAbernathy\t\t\tJoseph\t\t\t\tlaborer\t\tHouse\t422 w Hill \n", "\n", "Black\tAbernathy\t\t\tLewis\t\tAnnie\t\tcarpenter\t\tHouse\t422 w Hill \n", "\n", "White\tAbernethy\t\t\tGlenn\tE\t\t\tsalesman\tBurwell & Dunn Co\tHouse\t409 w 11th\n", "\n", "White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "White\tAbernethy\t\t\tJames\tF\tAlice\t\tblacksmith\tTrade ext\tHouse\tSeversville \n", "\n", "White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "White\tAbernethy\t\t\tThomas\tJ\tLucy\t\temployment\tCity\tHouse\t414 Templeton av \n", "\n", "White\tAbernethy\t\t\tLeslie\tW\t\t\tmachinist\tW S Abernethy\tHouse\t409 w 11th\n", "\n", "White\tAbernethy\t\t\tWilliam\tS\tMamie\t\tautomobile representative\t29 w 4th\tHouse\t409 w 11th\n", "\n", "White\tAbraham\t\t\tWilliam\t\tLula\t\t\t\tHouse\t516 s College \n", "\n", "Black\tAdanis\t\t\tAnna\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "Black\tAdams\t\t\tBelle\t\t\t\tmaid\tRealty Bldg\tHouse\t419 w 2d \n", "\n", "Black\tAdams\t\t\tBerry\tA\tEdith\t\tporter\tW S Cramer\tHouse\tSeversville \n", "\n", "Black\tAdams\t\t\tBessie\t\t\t\tcook\t\t\t247 e Trade \n", "\n", "White\tAdams\t\t\tCompton\t\tCora\t\tmill head\t\tHouse\tElizabeth Mills \n", "\n", "Black\tAdams\t\t\tEdward\t\tMittie\t\tdriver\tRhyne Bros\tHouse\t305 w Palmer \n", "\n", "White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "Black\tAdams\t\t\tHenry\t\tDicie\t\tlaborer\t\tHouse\t308 Middle \n", "\n", "[{'race': 'White', 'surname': 'Adams', 'title': 'Rev'}, {'race': 'White', 'surname': 'Adams', 'title': 'Miss'}, {'race': 'White', 'surname': 'Adams', 'title': 'Miss'}]\n" ] } ], "source": [ "pattern = 
re.compile(r\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\t+(?P<title>Miss|Dr|Rev|Mrs).*$\") # Match specific titles\n", "data = parse(pattern)\n", "print(data[0:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have captured all of the lines that have titles, we need to deal with the fact that\n", "many lines do not include a title at all. So we need to make the title pattern group optional, by placing a question mark right after the group." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdkins Walter D (Leona E)\t\t\tWalter\tD\tLeona\tE\tlineman\t\tHouse\t(r) 305 e 13th\n", "\n", "[{'race': 'White', 'surname': 'Adams', 'title': None}, {'race': 'White', 'surname': 'Adams', 'title': None}, {'race': 'Black', 'surname': 'Adams', 'title': None}]\n" ] } ], "source": [ "pattern = re.compile(r\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\t+(?P<title>Miss|Dr|Rev|Mrs)?.*$\") # Title group is now optional\n", "data = parse(pattern)\n", "print(data[0:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the output above we learn that the new pattern, with its optional title group, matches every line except one.\n", "The reason is the '\\t' (tab) character that the pattern expects right after the surname. In the one\n", "exceptional line, printed above by our parse() function, the surname is followed by a space instead of a tab.\n", "Rather than requiring a tab, let's accept any run of non-word characters, '\\W+', which covers tabs and spaces alike\n", "(the whitespace-only class would be '\\s'; lowercase '\\w' matches word characters, not whitespace)."
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'race': 'White', 'surname': 'Adams', 'title': None}, {'race': 'White', 'surname': 'Adams', 'title': None}, {'race': 'Black', 'surname': 'Adams', 'title': None}]\n" ] } ], "source": [ "pattern = re.compile(r\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?.*$\") # Allow any non-word separator after the surname\n", "data = parse(pattern)\n", "print(data[0:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now all the text lines are matching again and we have the titles, when they are present.\n", "\n", "The next group is the head of household's given name." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry'}, {'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James'}, {'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane'}]\n" ] } ], "source": [ "pattern = re.compile(r\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?(?P<hohgiven>\\w+).*$\") # Capture the head of household's given name\n", "data = parse(pattern)\n", "print(data[0:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes there is a middle initial for the head of the household. It appears that\n", "the initial is only ever one whitespace character away from the head of household's given name.\n", "We are also going to change our display output so that we can look more closely at the results."
] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L'}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James', 'hohmi': None}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane', 'hohmi': None}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': None}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'J'}\n", "White\tAdams\t\t\tJohn\tW\tCora\t\tconductor\tS A L Railway\tHouse\t21st nr Caldwell \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'W'}\n", "Black\tAdams\t\t\tJoseph\t\tViolet\t\tcooper\t\tHouse\t1011 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Joseph', 'hohmi': None}\n", "White\tAdams\tRev\t\tJoseph\tQ\tLeslie\t\t\t\tHouse\t1509 s Boulevard \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Rev', 'hohmi': None}\n", "White\tAdams\t\tMiss\tJulia\tM\t\t\t\t\tHouse\t707 n Church\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'J'}\n", "Black\tAdams\t\t\tKate\t\t\t\tlaundress\t\tHouse\tGroveton \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Kate', 'hohmi': None}\n", "White\tAdams\t\t\tLafayette\tN\t\t\tclerk\tSouthern Railway\tHouse\t327 n Tryon \n", 
"\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Lafayette', 'hohmi': 'N'}\n", "White\tAdams\t\t\tLawrence\tA\t\t\tsalesman\tB S Moore & Co\tRooms\t405 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Lawrence', 'hohmi': 'A'}\n", "White\tAdams\t\t\tLaurie\tA\tMargaret\tN\tmill head\t\tHouse\tElizabetMills \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Laurie', 'hohmi': 'A'}\n", "Black\tAdams\t\t\tLeland\t\t\t\tporter\tJ P Stowe & Co\tHouse\t403 s Myers \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Leland', 'hohmi': None}\n", "Black\tAdams\t\t\tLizzie\t\t\t\t\t\tHouse\t309 1/2 w Morehead \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Lizzie', 'hohmi': None}\n", "White\tAdams\t\tMiss\tLula\t\t\t\tclerk\tBelk Bros\tRooms\tY W C A \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'L'}\n", "White\tAdams\t\twid (Geo O)\tGrace\tM\t\t\t\t\tHouse\t601 n College \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAdams\t\t\tLuther\tM\tMamie\t\t\t\tHouse\tGroveton \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Luther', 'hohmi': 'M'}\n", "Black\tAdams\t\t\tMajor\t\t\t\twaiter\tBuford Hotel \t\t\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Major', 'hohmi': None}\n", "Black\tAdams\t\t\tMattie\t\t\t\t\t\tHouse\t719 n Graham ext \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Mattie', 'hohmi': None}\n", "White\tAdams\t\tMiss\tPattie\tV\t\t\tstenographer\t\tBoards\t708 n Caldwell \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'P'}\n", "Black\tAdams\t\t\tReuben\t\tBelle\t\tlaborer\tY & B Co\tHouse\t419 w 2d \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Reuben', 
'hohmi': None}\n", "Black\tAdams\t\t\tRosa\t\t\t\t\t\tHouse\t714 s Caldwell \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Rosa', 'hohmi': None}\n", "Black\tAdams\t\t\tRufus\t\t\t\tlaborer\t\tHouse\tGreenville \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Rufus', 'hohmi': None}\n", "Black\tAdams\t\t\tRufus\t\t\t\tdriver\tStand I & F Co\tHouse\tRoss Town \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Rufus', 'hohmi': None}\n", "White\tAdams\t\tMiss\tSalie\tH\t\t\tassistant\tCarnegie Library\tHouse\t707 n Church\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'S'}\n", "Black\tAdams\t\t\tViolet\t\t\t\tservant\t\t\t508 w Trade \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Violet', 'hohmi': None}\n", "White\tAdams\t\t\tWheeler\tF\tMamie\t\tmoulder\t\tHouse\t303 s Cedar \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Wheeler', 'hohmi': 'F'}\n", "White\tAdams\t\t\tWilliam\tE\t\t\t\tThe Chronicle\tRooms\t300 1/2 s Church\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'William', 'hohmi': 'E'}\n", "White\tAdcock\t\t\tJohn\tF\t\t\tmill worker\t\tHouse\t916 Calvine av \n", "\n", "{'race': 'White', 'surname': 'Adcock', 'title': None, 'hohgiven': 'John', 'hohmi': 'F'}\n", "White\tAdcock\t\twid (Jas M)\tMillie\tM\t\t\t\t\tHouse\t916 Calvine av \n", "\n", "{'race': 'White', 'surname': 'Adcock', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAdelsheimer\t\t\tHenry\tS\tLizzie\t\tmill worker\t\tHouse\t1216 Louise av \n", "\n", "{'race': 'White', 'surname': 'Adelsheimer', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'S'}\n", "Black\tAdkins\t\t\tKing\t\t\t\tlaborer\t\t\t600 s Myers \n", "\n", "{'race': 'Black', 'surname': 'Adkins', 'title': None, 'hohgiven': 'King', 'hohmi': None}\n", "White\tAdkins Walter D (Leona 
E)\t\t\tWalter\tD\tLeona\tE\tlineman\t\tHouse\t(r) 305 e 13th\n", "\n", "{'race': 'White', 'surname': 'Adkins', 'title': None, 'hohgiven': 'Walter', 'hohmi': 'D'}\n", "Black\tAgers\t\t\tNancy\t\t\t\tcook\t\tHouse\t206 Wilson \n", "\n", "{'race': 'Black', 'surname': 'Agers', 'title': None, 'hohgiven': 'Nancy', 'hohmi': None}\n", "Black\tAgers\t\t\tSallie\t\t\t\tlaundress\t\tHouse\t420 Jackson \n", "\n", "{'race': 'Black', 'surname': 'Agers', 'title': None, 'hohgiven': 'Sallie', 'hohmi': None}\n", "White\tAhaus\t\t\tHerman\t\tFrances\tE\ttailor\t203 w 4th\tHouse\t204 s Church\n", "\n", "{'race': 'White', 'surname': 'Ahaus', 'title': None, 'hohgiven': 'Herman', 'hohmi': None}\n", "White\tAikel\t\t\tJoseph\t\t\t\tconfectioner\t317 e Trade\tRooms\t225 w Trade \n", "\n", "{'race': 'White', 'surname': 'Aikel', 'title': None, 'hohgiven': 'Joseph', 'hohmi': None}\n", "White\tAiken\t\t\tGeorge\tW M\tBarbara\t\tsuperintendent\tQueen City M & G Wks\tHouse\t1120 s Caldwell \n", "\n", "{'race': 'White', 'surname': 'Aiken', 'title': None, 'hohgiven': 'George', 'hohmi': 'W'}\n", "White\tAiken\t\t\tHenry\t\t\t\t\t\tRooms\t9 e 3d \n", "\n", "{'race': 'White', 'surname': 'Aiken', 'title': None, 'hohgiven': 'Henry', 'hohmi': None}\n", "Black\tAiken\t\t\tWalter\t\tElla\t\tlaborer\t\tHouse\t600 e 2d \n", "\n", "{'race': 'Black', 'surname': 'Aiken', 'title': None, 'hohgiven': 'Walter', 'hohmi': None}\n", "Black\tAiken\t\t\tWilliam\t\tElla\t\tdriver\tW I Henderson Gro Co\tHouse\t528 e 8th\n", "\n", "{'race': 'Black', 'surname': 'Aiken', 'title': None, 'hohgiven': 'William', 'hohmi': None}\n", "White\tAkers\t\t\tJoseph\tJ\t\t\tchief clerk superintendent\tSouthern Railway\tHouse\t908 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Akers', 'title': None, 'hohgiven': 'Joseph', 'hohmi': 'J'}\n", "White\tAlbea\t\t\tJohn\tC\t\t\tpressman\tNews Ptg Co\tHouse\t310 s McDowell \n", "\n", "{'race': 'White', 'surname': 'Albea', 'title': None, 'hohgiven': 'John', 'hohmi': 'C'}\n", 
"White\tAlbea\t\tMiss\tLottie\t\t\t\tclerk\tFloyd L Liles Co\tHouse\t10 w 11th\n", "\n", "{'race': 'White', 'surname': 'Albea', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'L'}\n", "White\tAlbea\t\twid (J F)\tEmma\tS\t\t\t\t\tHouse\t310 s McDowell \n", "\n", "{'race': 'White', 'surname': 'Albea', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAlbrecht\t\twid (Mathias)\tGesha\t\t\t\t\t\tHouse\t402 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Albrecht', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAlbright\t\t\tFay\tL\t\t\tclerk\t\tHouse\t15 s Caldwell \n", "\n", "{'race': 'White', 'surname': 'Albright', 'title': None, 'hohgiven': 'Fay', 'hohmi': 'L'}\n", "White\tAlbright\t\t\tHal\tC\t\t\tclerk\tPost Office\tHouse\t15 s Caldwell \n", "\n", "{'race': 'White', 'surname': 'Albright', 'title': None, 'hohgiven': 'Hal', 'hohmi': 'C'}\n", "White\tAlbright\t\t\tJudson\tD\tBelle\tC\tofficer\tUS Internal Revenue\tHouse\t15 s Caldwell \n", "\n", "{'race': 'White', 'surname': 'Albright', 'title': None, 'hohgiven': 'Judson', 'hohmi': 'D'}\n", "White\tAlderman\t\tMiss\tElizabeth\t\t\t\tnurse\tCharlotte Sanatorium\tRooms\t12 w 7th\n", "\n", "{'race': 'White', 'surname': 'Alderman', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'E'}\n", "White\tAldred\t\t\tCurtis\tM\tVictoria\t\tmachinist\t\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Aldred', 'title': None, 'hohgiven': 'Curtis', 'hohmi': 'M'}\n", "White\tAldred\t\t\tJames\tA\tMattie\t\tclerk\t\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Aldred', 'title': None, 'hohgiven': 'James', 'hohmi': 'A'}\n", "White\tAldred\tRev\t\tWilliam\tL\tMarian\t\tevangelist\t\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Aldred', 'title': None, 'hohgiven': 'Rev', 'hohmi': None}\n", "White\tAldridge\t\twid (Wm A)\tMillie\tS\t\t\t\t\tHouse\tHarrill st Belmont Park \n", "\n", "{'race': 'White', 'surname': 'Aldridge', 'title': None, 'hohgiven': 'wid', 'hohmi': 
None}\n", "White\tAlexain\t\t\tNicholas\t\t\t\temployment\tMet Café\tRooms\t26 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Alexain', 'title': None, 'hohgiven': 'Nicholas', 'hohmi': None}\n", "Black\tAlexander\t\t\tAdelaide\t\t\t\t\t\tHouse\t910 (909) e 1st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Adelaide', 'hohmi': None}\n", "Black\tAlexander\t\t\tAdeline\t\t\t\t\t\tHouse\tLutheran st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Adeline', 'hohmi': None}\n", "White\tAlexander\t\t\tAlbert\tW\tAlice\t\tforeman\tCharlotte Cordage Co\tHouse\t621 e 5t\n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Albert', 'hohmi': 'W'}\n", "Black\tAlexander\t\t\tAlfred\t\tWillie\t\tdriver\t\tHouse\t1016 e 3d \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Alfred', 'hohmi': None}\n", "Black\tAlexander\t\t\tAllen\t\tLizzie\t\tlaborer\t\tHouse\t6 10th st al \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Allen', 'hohmi': None}\n", "Black\tAlexander\t\t\tAlvey\t\t\t\tlaborer\t\tHouse\tLutheran st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Alvey', 'hohmi': None}\n", "Black\tAlexander\t\t\tAndrew\t\tBessie\t\tgrocer\t15 1/2 e Boundary\tHouse\t514 e Stonewall \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Andrew', 'hohmi': None}\n", "Black\tAlexander\t\t\tAndrew\tW\t\t\t\t\tRooms\t415 n Myers \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Andrew', 'hohmi': 'W'}\n", "White\tAlexander\t\tMiss\tAnnie\t\t\t\tstenographer\tC M Stiefif\tHouse\t610 n Graham ext \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'A'}\n", "White\tAlexander\t\tMiss\tAnnie\tL\t\t\tphysician\t410 n Tryon\tHouse\t610 n Graham ext \n", "\n", "{'race': 'White', 'surname': 'Alexander', 
'title': None, 'hohgiven': 'Miss', 'hohmi': 'A'}\n", "White\tAlexander\t\t\tArnold\tE\tAlice\t\tlaundryman\t\tHouse\t1424 e 5th\n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Arnold', 'hohmi': 'E'}\n", "Black\tAlexander\t\t\tBernard\t\t\t\tporter\tYoung's Steam Baking Co \t\t\n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Bernard', 'hohmi': None}\n", "Black\tAlexander\t\t\tBessie\t\t\t\t\t\tHouse\t1006 e 1st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Bessie', 'hohmi': None}\n", "White\tAlexander\t\t\tBlanche\t\t\t\tmadam\t\tHouse\t17 Spring \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Blanche', 'hohmi': None}\n", "White\tAlexander\t\tMiss\tBlandina\t\t\t\tmillliner\tJ B Ivey Co\tHouse\t610 n Graham ext \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'B'}\n", "Black\tAlexander\t\t\tBurdette\t\t\t\tporter\tColonial Club \t\t\n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Burdette', 'hohmi': None}\n", "White\tAlexander\t\t\tC\tP\t\t\t\t\tHouse\tProvidence rd \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'C', 'hohmi': 'P'}\n", "Black\tAlexander\t\t\tCalvin\t\tEmeline\t\tlaborer\t\tHouse\t213 1/2 w 1st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Calvin', 'hohmi': None}\n", "White\tAlexander\t\t\tCarl\tL\t\t\tclerk\t\tHouse\t200 e Oak \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Carl', 'hohmi': 'L'}\n", "Black\tAlexander\t\t\tCarrie\t\t\t\t\t\tHouse\tGaither's al \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Carrie', 'hohmi': None}\n", "Black\tAlexander\t\t\tCarson\t\t\t\tchauffeur\tJ E S Davidson\tHouse\t209 s Middle \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 
'Carson', 'hohmi': None}\n", "Black\tAlexander\t\t\tCharles\t\t\t\tlaborer\t\tHouse\t410 n A \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Charles', 'hohmi': None}\n", "White\tAlexander\t\t\tCharles\tF\tMary\t\tclerk\tPost Office\tHouse\t308 e Worthington av \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Charles', 'hohmi': 'F'}\n", "White\tAlexander\t\t\tCharles\tL\tSue\t\tdentist\t913-918 Realty Bldg\tHouse\t900 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Charles', 'hohmi': 'L'}\n", "White\tAlexander\t\t\tCharles\tY\t\t\tteacher\tKing's Bus College\tRooms\t305 s Church\n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Charles', 'hohmi': 'Y'}\n", "White\tAlexander\t\t\tClarence\tW\t\t\temployment\tJ M Porter\tHouse\t1116 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Clarence', 'hohmi': 'W'}\n", "White\tAlexander\t\t\tClyde\tC\t\t\tpainter\t\tHouse\t810 n Brevard \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Clyde', 'hohmi': 'C'}\n", "Black\tAlexander\t\t\tDaniel\t\tEtta\t\tlaborer\t\tHouse\t527 e 8th\n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Daniel', 'hohmi': None}\n", "Black\tAlexander\t\t\tDecatur\tC\tAlta\t\tporter\tColonial Club\tHouse\t515 e 1st \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Decatur', 'hohmi': 'C'}\n", "White\tAlexander\t\tMiss\tEdna\t\t\t\tnurse\tSt Peters Hospital\tRooms\t515 e 1st \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'E'}\n", "Black\tAlexander\t\t\tEdward\t\tLouvenia\t\tlaborer\t\tHouse\t1012 e Stonewall \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Edward', 'hohmi': None}\n", "White\tAlexander\t\tMiss\tEleanor\t\t\t\t\t\tHouse\t900 s Tryon \n", 
"\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'E'}\n", "Black\tAlexander\t\t\tEli\t\tJennie\t\tgrocer\tOil Town\tHouse\t900 s Tryon \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Eli', 'hohmi': None}\n", "Black\tAlexander\t\t\tElizabeth\t\t\t\tlaundress\t\tHouse\t414 s College \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Elizabeth', 'hohmi': None}\n", "Black\tAlexander\t\t\tEllen\t\t\t\t\t\tHouse\t606 s McDowell \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Ellen', 'hohmi': None}\n", "White\tAlexander\t\twid (Wm S)\tEmma\tV\t\t\t\t\tHouse\t610 n Graham ext \n", "\n", "{'race': 'White', 'surname': 'Alexander', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "Black\tAlexander\t\t\tEphraim\t\tAlice\t\tdriver\tColes & Smith\tHouse\tStonewall bet A & B \n", "\n", "{'race': 'Black', 'surname': 'Alexander', 'title': None, 'hohgiven': 'Ephraim', 'hohmi': None}\n", "Black\tAaron\t\t\tAmelia\t\t\t\tdomestic\t\t\t506 s Tryon \n", "\n", "{'race': 'Black', 'surname': 'Aaron', 'title': None, 'hohgiven': 'Amelia', 'hohmi': None}\n", "White\tAbbey\t\t\tSimeon\tA\tMary\tA\tsuperintendent construction\tGeneral Fire Ext Co\tHouse\t104 Central av \n", "\n", "{'race': 'White', 'surname': 'Abbey', 'title': None, 'hohgiven': 'Simeon', 'hohmi': 'A'}\n", "White\tAbbott\t\tMiss\tMargaret\t\t\t\t\t\tHouse\t1804 s Boulevard \n", "\n", "{'race': 'White', 'surname': 'Abbott', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'M'}\n", "White\tAbee\t\t\tJunius\tA\t\t\ttelegraph operator\tSouthern Railway\tBoards\t507 n Graham \n", "\n", "{'race': 'White', 'surname': 'Abee', 'title': None, 'hohgiven': 'Junius', 'hohmi': 'A'}\n", "White\tAbel\t\t\tAbram\t\tFannie\t\ttraveling salesman\t\tHouse\t506 w 10t\n", "\n", "{'race': 'White', 'surname': 'Abel', 'title': None, 'hohgiven': 'Abram', 'hohmi': None}\n", 
"Black\tAbel\t\t\tBelle\t\t\t\t\t\tHouse\t14 Boundary al \n", "\n", "{'race': 'Black', 'surname': 'Abel', 'title': None, 'hohgiven': 'Belle', 'hohmi': None}\n", "Black\tAbel\t\t\tGeorge\t\t\t\tlaborer\t\tHouse\t12 Boundary al \n", "\n", "{'race': 'Black', 'surname': 'Abel', 'title': None, 'hohgiven': 'George', 'hohmi': None}\n", "White\tAbernathy\t\t\tC\tL\t\t\t\t\tHouse\tLawyer's rd \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'C', 'hohmi': 'L'}\n", "White\tAbernathy\t\t\tClement\tD\tCora\t\temployment\tCharlotte Lea Belt Co\tHouse\t1117 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Clement', 'hohmi': 'D'}\n", "White\tAbernathy\t\tMrs\tCora\t\t\t\tboarding\t\tHouse\t1117 s Tryon\n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Mrs', 'hohmi': 'C'}\n", "White\tAbernathy\t\t\tDavid\tM\tEnola\t\t\t\tHouse\t1310 e 4text \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'David', 'hohmi': 'M'}\n", "White\tAbernathy\t\tMiss\tElizabeth\t\t\t\t\t\tHouse\t430 Mint \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'E'}\n", "Black\tAbernathy\t\t\tHannah\t\t\t\t\t\tHouse\t4 Bellinger \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Hannah', 'hohmi': None}\n", "White\tAbernathy\t\t\tJohn\tW\tNannie\t\tcarpenter\t\tHouse\tSunnyside \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'John', 'hohmi': 'W'}\n", "Black\tAbernathy\t\t\tJoseph\t\t\t\tlaborer\t\tHouse\t422 w Hill \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Joseph', 'hohmi': None}\n", "Black\tAbernathy\t\t\tLewis\t\tAnnie\t\tcarpenter\t\tHouse\t422 w Hill \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Lewis', 'hohmi': None}\n", "White\tAbernethy\t\tMiss\tConnie\t\t\t\tstenographer\tBurwell & Dunn 
Co\tHouse\t414 Templeton av \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'C'}\n", "White\tAbernethy\t\t\tGlenn\tE\t\t\tsalesman\tBurwell & Dunn Co\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Glenn', 'hohmi': 'E'}\n", "White\tAbernethy\t\tMiss\tGertrude\t\t\t\tfitter\tLittle-Long Co\tHouse\t306 w 7th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'G'}\n", "White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Lee', 'hohmi': 'J'}\n", "White\tAbernethy\tDr\t\tJ\tS\t\t\t\t\tHouse\tBeatty's Ford rd \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Dr', 'hohmi': None}\n", "White\tAbernethy\t\t\tJames\tF\tAlice\t\tblacksmith\tTrade ext\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'James', 'hohmi': 'F'}\n", "White\tAbernethy\t\tMiss\tLillian\t\t\t\tsaleslady\tEfird's Dept Store\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'L'}\n", "White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAbernethy\t\tMiss\tMildred\t\t\t\t\t\tRooms\t603 n Davidson \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'M'}\n", "White\tAbernethy\t\tMiss\tNettie\tI\t\t\tbookkeeper\tPound & Moore Co\tHouse\t311 n College \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'N'}\n", "White\tAbernethy\t\t\tThomas\tJ\tLucy\t\temployment\tCity\tHouse\t414 Templeton av \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Thomas', 
'hohmi': 'J'}\n", "White\tAbernethy\t\t\tLeslie\tW\t\t\tmachinist\tW S Abernethy\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Leslie', 'hohmi': 'W'}\n", "White\tAbernethy\t\t\tWilliam\tS\tMamie\t\tautomobile representative\t29 w 4th\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'William', 'hohmi': 'S'}\n", "White\tAbraham\t\t\tWilliam\t\tLula\t\t\t\tHouse\t516 s College \n", "\n", "{'race': 'White', 'surname': 'Abraham', 'title': None, 'hohgiven': 'William', 'hohmi': None}\n", "Black\tAdanis\t\t\tAnna\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adanis', 'title': None, 'hohgiven': 'Anna', 'hohmi': None}\n", "Black\tAdams\t\t\tBelle\t\t\t\tmaid\tRealty Bldg\tHouse\t419 w 2d \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Belle', 'hohmi': None}\n", "Black\tAdams\t\t\tBerry\tA\tEdith\t\tporter\tW S Cramer\tHouse\tSeversville \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Berry', 'hohmi': 'A'}\n", "Black\tAdams\t\t\tBessie\t\t\t\tcook\t\t\t247 e Trade \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Bessie', 'hohmi': None}\n", "White\tAdams\t\tMiss\tBeulah\t\t\t\t\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'B'}\n", "White\tAdams\t\t\tCompton\t\tCora\t\tmill head\t\tHouse\tElizabeth Mills \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Compton', 'hohmi': None}\n", "White\tAdams\t\tMiss\tDorothy\t\t\t\tstenographer\tOconee Mills Co\tRooms\tYWCA \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'D'}\n", "Black\tAdams\t\t\tEdward\t\tMittie\t\tdriver\tRhyne Bros\tHouse\t305 w Palmer \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Edward', 'hohmi': None}\n", 
"White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'George', 'hohmi': None}\n", "Black\tAdams\t\t\tHenry\t\tDicie\t\tlaborer\t\tHouse\t308 Middle \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': None}\n", "Black\tAaron\t\t\tAmelia\t\t\t\tdomestic\t\t\t506 s Tryon \n", "\n", "{'race': 'Black', 'surname': 'Aaron', 'title': None, 'hohgiven': 'Amelia', 'hohmi': None}\n", "White\tAbbey\t\t\tSimeon\tA\tMary\tA\tsuperintendent construction\tGeneral Fire Ext Co\tHouse\t104 Central av \n", "\n", "{'race': 'White', 'surname': 'Abbey', 'title': None, 'hohgiven': 'Simeon', 'hohmi': 'A'}\n", "White\tAbbott\t\tMiss\tMargaret\t\t\t\t\t\tHouse\t1804 s Boulevard \n", "\n", "{'race': 'White', 'surname': 'Abbott', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'M'}\n", "White\tAbee\t\t\tJunius\tA\t\t\ttelegraph operator\tSouthern Railway\tBoards\t507 n Graham \n", "\n", "{'race': 'White', 'surname': 'Abee', 'title': None, 'hohgiven': 'Junius', 'hohmi': 'A'}\n", "White\tAbel\t\t\tAbram\t\tFannie\t\ttraveling salesman\t\tHouse\t506 w 10t\n", "\n", "{'race': 'White', 'surname': 'Abel', 'title': None, 'hohgiven': 'Abram', 'hohmi': None}\n", "Black\tAbel\t\t\tBelle\t\t\t\t\t\tHouse\t14 Boundary al \n", "\n", "{'race': 'Black', 'surname': 'Abel', 'title': None, 'hohgiven': 'Belle', 'hohmi': None}\n", "Black\tAbel\t\t\tGeorge\t\t\t\tlaborer\t\tHouse\t12 Boundary al \n", "\n", "{'race': 'Black', 'surname': 'Abel', 'title': None, 'hohgiven': 'George', 'hohmi': None}\n", "White\tAbernathy\t\t\tC\tL\t\t\t\t\tHouse\tLawyer's rd \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'C', 'hohmi': 'L'}\n", "White\tAbernathy\t\t\tClement\tD\tCora\t\temployment\tCharlotte Lea Belt Co\tHouse\t1117 s Tryon \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Clement', 'hohmi': 'D'}\n", 
"White\tAbernathy\t\tMrs\tCora\t\t\t\tboarding\t\tHouse\t1117 s Tryon\n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Mrs', 'hohmi': 'C'}\n", "White\tAbernathy\t\t\tDavid\tM\tEnola\t\t\t\tHouse\t1310 e 4text \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'David', 'hohmi': 'M'}\n", "White\tAbernathy\t\tMiss\tElizabeth\t\t\t\t\t\tHouse\t430 Mint \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'E'}\n", "Black\tAbernathy\t\t\tHannah\t\t\t\t\t\tHouse\t4 Bellinger \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Hannah', 'hohmi': None}\n", "White\tAbernathy\t\t\tJohn\tW\tNannie\t\tcarpenter\t\tHouse\tSunnyside \n", "\n", "{'race': 'White', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'John', 'hohmi': 'W'}\n", "Black\tAbernathy\t\t\tJoseph\t\t\t\tlaborer\t\tHouse\t422 w Hill \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Joseph', 'hohmi': None}\n", "Black\tAbernathy\t\t\tLewis\t\tAnnie\t\tcarpenter\t\tHouse\t422 w Hill \n", "\n", "{'race': 'Black', 'surname': 'Abernathy', 'title': None, 'hohgiven': 'Lewis', 'hohmi': None}\n", "White\tAbernethy\t\tMiss\tConnie\t\t\t\tstenographer\tBurwell & Dunn Co\tHouse\t414 Templeton av \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'C'}\n", "White\tAbernethy\t\t\tGlenn\tE\t\t\tsalesman\tBurwell & Dunn Co\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Glenn', 'hohmi': 'E'}\n", "White\tAbernethy\t\tMiss\tGertrude\t\t\t\tfitter\tLittle-Long Co\tHouse\t306 w 7th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'G'}\n", "White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Lee', 'hohmi': 
'J'}\n", "White\tAbernethy\tDr\t\tJ\tS\t\t\t\t\tHouse\tBeatty's Ford rd \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Dr', 'hohmi': None}\n", "White\tAbernethy\t\t\tJames\tF\tAlice\t\tblacksmith\tTrade ext\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'James', 'hohmi': 'F'}\n", "White\tAbernethy\t\tMiss\tLillian\t\t\t\tsaleslady\tEfird's Dept Store\tHouse\tSeversville \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'L'}\n", "White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'wid', 'hohmi': None}\n", "White\tAbernethy\t\tMiss\tMildred\t\t\t\t\t\tRooms\t603 n Davidson \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'M'}\n", "White\tAbernethy\t\tMiss\tNettie\tI\t\t\tbookkeeper\tPound & Moore Co\tHouse\t311 n College \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'N'}\n", "White\tAbernethy\t\t\tThomas\tJ\tLucy\t\temployment\tCity\tHouse\t414 Templeton av \n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Thomas', 'hohmi': 'J'}\n", "White\tAbernethy\t\t\tLeslie\tW\t\t\tmachinist\tW S Abernethy\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'Leslie', 'hohmi': 'W'}\n", "White\tAbernethy\t\t\tWilliam\tS\tMamie\t\tautomobile representative\t29 w 4th\tHouse\t409 w 11th\n", "\n", "{'race': 'White', 'surname': 'Abernethy', 'title': None, 'hohgiven': 'William', 'hohmi': 'S'}\n", "White\tAbraham\t\t\tWilliam\t\tLula\t\t\t\tHouse\t516 s College \n", "\n", "{'race': 'White', 'surname': 'Abraham', 'title': None, 'hohgiven': 'William', 'hohmi': None}\n", "Black\tAdanis\t\t\tAnna\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adanis', 
'title': None, 'hohgiven': 'Anna', 'hohmi': None}\n", "Black\tAdams\t\t\tBelle\t\t\t\tmaid\tRealty Bldg\tHouse\t419 w 2d \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Belle', 'hohmi': None}\n", "Black\tAdams\t\t\tBerry\tA\tEdith\t\tporter\tW S Cramer\tHouse\tSeversville \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Berry', 'hohmi': 'A'}\n", "Black\tAdams\t\t\tBessie\t\t\t\tcook\t\t\t247 e Trade \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Bessie', 'hohmi': None}\n", "White\tAdams\t\tMiss\tBeulah\t\t\t\t\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'B'}\n", "White\tAdams\t\t\tCompton\t\tCora\t\tmill head\t\tHouse\tElizabeth Mills \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Compton', 'hohmi': None}\n", "White\tAdams\t\tMiss\tDorothy\t\t\t\tstenographer\tOconee Mills Co\tRooms\tYWCA \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Miss', 'hohmi': 'D'}\n", "Black\tAdams\t\t\tEdward\t\tMittie\t\tdriver\tRhyne Bros\tHouse\t305 w Palmer \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Edward', 'hohmi': None}\n", "White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'George', 'hohmi': None}\n", "Black\tAdams\t\t\tHenry\t\tDicie\t\tlaborer\t\tHouse\t308 Middle \n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': None}\n" ] } ], "source": [ "pattern = re.compile(\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?(?P<hohgiven>\\w+)(\\W(?P<hohmi>\\w))?.*$\") # Match specific titles\n", "data = parse(pattern)\n", "for i in range(0, len(alltext)):\n", " print(alltext[i])\n", " print(data[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Next is often the name of a second person in the household, but not always. Sometimes the next \n", "word is the start of the head of household occupation. These are distinct in that names start \n", "with an upper case character. So our second given name pattern will ask for an upper case first character." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L', 'given2': 'Fannie'}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James', 'hohmi': None, 'given2': 'Gertrude'}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane', 'hohmi': None, 'given2': None}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': None, 'given2': None}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'J', 'given2': None}\n" ] } ], "source": [ "pattern = re.compile(\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?(?P<hohgiven>\\w+)(\\W(?P<hohmi>\\w))?\\W+(?P<given2>[A-Z]{1}\\w*)?.*$\") # Match specific titles\n", "data = parse(pattern)\n", "for i in range(0, 5):\n", " print(alltext[i])\n", " print(data[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second person may also optionally have a middle initial." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L', 'given2': 'Fannie', 'mi2': 'S'}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James', 'hohmi': None, 'given2': 'Gertrude', 'mi2': None}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane', 'hohmi': None, 'given2': None, 'mi2': None}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': None, 'given2': None, 'mi2': None}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'J', 'given2': None, 'mi2': None}\n" ] } ], "source": [ "pattern = re.compile(\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?(?P<hohgiven>\\w+)(\\W(?P<hohmi>\\w))?\\W+(?P<given2>[A-Z]{1}\\w*)?(\\W(?P<mi2>\\w))?.*$\") # Add an optional middle initial for the second person\n", "data = parse(pattern)\n", "for i in range(0, 5):\n", "    print(alltext[i])\n", "    print(data[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we come to the occupation, which is a series of lower case words, separated by single spaces." 
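, "\n", "\n", "A minimal sketch of that rule in isolation, assuming each field has already been separated out (the name `occupation_like` is hypothetical, and the pattern is stricter than a plain `[a-z ]+` class because it insists on single spaces between words):\n", "\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical pattern: lower case words separated by single spaces\n",
"occupation_like = re.compile(r\"[a-z]+(?: [a-z]+)*\")\n",
"\n",
"fields = [\"route agent\", \"Southern Railway Co\", \"manager\", \"House\"]\n",
"# Keep only the fields that consist entirely of lower case words\n",
"occupations = [f for f in fields if occupation_like.fullmatch(f)]\n",
"print(occupations)\n",
"```\n",
"\n", "Capitalized fields such as a workplace or \"House\" fail the test, which is what lets a pattern tell an occupation apart from a name."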
] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L', 'given2': 'Fannie', 'mi2': 'S', 'occupation': 'route agent'}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James', 'hohmi': None, 'given2': 'Gertrude', 'mi2': None, 'occupation': 'manager'}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'teacher'}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'laborer'}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'J', 'given2': None, 'mi2': None, 'occupation': 'president'}\n" ] } ], "source": [ "pattern = re.compile(\"^(?P<race>Black|White)\\t(?P<surname>\\w*)\\W+(?P<title>Miss|Dr|Rev|Mrs)?(?P<hohgiven>\\w+)(\\W(?P<hohmi>\\w))?\\W+(?P<given2>[A-Z]{1}\\w*)?(\\W(?P<mi2>\\w))?\\W+(?P<occupation>[a-z ]+)?.*$\") # Add an optional occupation made of lower case words\n", "data = parse(pattern)\n", "for i in range(0, 5):\n", "    print(alltext[i])\n", "    print(data[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next field seems to be either \"House\" or \"Boards\", or the capitalized, often multi-word, name of a business.\n", "Let's require \"House\" or \"Boards\" later in the pattern for a match. 
Then the workplace pattern will\n", "be before that and optional." ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\twid (Geo O)\tGrace\tM\t\t\t\t\tHouse\t601 n College \n", "\n", "Black\tAdams\t\t\tMajor\t\t\t\twaiter\tBuford Hotel \t\t\n", "\n", "White\tAdcock\t\twid (Jas M)\tMillie\tM\t\t\t\t\tHouse\t916 Calvine av \n", "\n", "White\tAdkins Walter D (Leona E)\t\t\tWalter\tD\tLeona\tE\tlineman\t\tHouse\t(r) 305 e 13th\n", "\n", "White\tAiken\t\t\tGeorge\tW M\tBarbara\t\tsuperintendent\tQueen City M & G Wks\tHouse\t1120 s Caldwell \n", "\n", "White\tAlbea\t\twid (J F)\tEmma\tS\t\t\t\t\tHouse\t310 s McDowell \n", "\n", "White\tAldridge\t\twid (Wm A)\tMillie\tS\t\t\t\t\tHouse\tHarrill st Belmont Park \n", "\n", "White\tAlexain\t\t\tNicholas\t\t\t\temployment\tMet Café\tRooms\t26 n Tryon \n", "\n", "Black\tAlexander\t\t\tAdelaide\t\t\t\t\t\tHouse\t910 (909) e 1st \n", "\n", "Black\tAlexander\t\t\tBernard\t\t\t\tporter\tYoung's Steam Baking Co \t\t\n", "\n", "Black\tAlexander\t\t\tBurdette\t\t\t\tporter\tColonial Club \t\t\n", "\n", "White\tAlexander\t\t\tCharles\tY\t\t\tteacher\tKing's Bus College\tRooms\t305 s Church\n", "\n", "White\tAlexander\t\twid (Wm S)\tEmma\tV\t\t\t\t\tHouse\t610 n Graham ext \n", "\n", "Black\tAlexander\t\t\tEphraim\t\tAlice\t\tdriver\tColes & Smith\tHouse\tStonewall bet A & B \n", "\n", "White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "White\tAbernethy\t\tMiss\tLillian\t\t\t\tsaleslady\tEfird's Dept Store\tHouse\tSeversville \n", "\n", "White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "White\tAbernethy\t\t\tLee\tJ\tIda\t\tpainter\th 810 n Brevard \tHouse\tSeversville \n", "\n", "White\tAbernethy\t\tMiss\tLillian\t\t\t\tsaleslady\tEfird's Dept Store\tHouse\tSeversville \n", "\n", 
"White\tAbernethy\t\twid (Jas C)\tMargaret\tK\t\t\t\t\tHouse\t3 e 1st \n", "\n", "White\tAdams\t\t\tGeorge\t\t\t\ttailor\tHenry Miller Jr\tBoards\t15 & Church\n", "\n", "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L', 'given2': 'Fannie', 'mi2': 'S', 'occupation': 'route agent', 'workplace': 'Southern Railway Co', 'la': 'House', 'address': '327 n Tryon '}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'James', 'hohmi': None, 'given2': 'Gertrude', 'mi2': None, 'occupation': 'manager', 'workplace': 'House', 'la': None, 'address': '419 Elizabeth av '}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'Jane', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'teacher', 'workplace': 'House', 'la': None, 'address': '1021 s Church'}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'laborer', 'workplace': 'House', 'la': None, 'address': '1031 s Church'}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'title': None, 'hohgiven': 'John', 'hohmi': 'J', 'given2': None, 'mi2': None, 'occupation': 'president', 'workplace': 'Adams G & P Co and Char Pepsi-Cola Co', 'la': 'House', 'address': '309 e 6th'}\n" ] } ], "source": [ "pattern = re.compile(\"^(?P<race>Black|White)\\t(?P<surname>\\w+)(\\W+(?P<title>Miss|Dr|Rev|Mrs))?(\\W+(?P<hohgiven>\\w+))(\\W+(?P<hohmi>[A-Z]{1}))?(\\W+(?P<given2>[A-Z]{1}\\w*))?(\\W+(?P<mi2>[A-Z]{1}))?(\\W+(?P<occupation>[a-z 
]+))?(\\W+(?P<workplace>[A-Z0-9]{1}[A-Za-z0-9 /&-]+))?(\\W+(?P<la>House|Boards|Rooms))?(\\W+(?P<address>[A-Za-z0-9 /]+))?$\")\n", "data = parse(pattern)\n", "for i in range(0, 5):\n", "    print(alltext[i])\n", "    print(data[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the unmatched lines, we see a number of remaining problems, each a special case\n", "that we need to add to the overall pattern.\n", "\n", "* Widow pattern with deceased husband in parentheses.\n", "```White\tAdams\t\twid (Geo O)```\n", "\n", "* Widower pattern with deceased wife in parentheses.\n", "```White\tAdkins Walter D (Leona E)```\n", "\n", "The parentheses above are unexpected in our current pattern." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "White\tAdams\t\t\tHenry\tL\tFannie\tS\troute agent\tSouthern Railway Co\tHouse\t327 n Tryon \n", "\n", "{'race': 'White', 'surname': 'Adams', 'deadhusband': None, 'title': None, 'hohgiven': 'Henry', 'hohmi': 'L', 'given2': 'Fannie', 'mi2': 'S', 'occupation': 'route agent', 'workplace': 'Southern Railway Co', 'la': 'House', 'address': '327 n Tryon '}\n", "White\tAdams\t\t\tJames\t\tGertrude\t\tmanager\t\tHouse\t419 Elizabeth av \n", "\n", "{'race': 'White', 'surname': 'Adams', 'deadhusband': None, 'title': None, 'hohgiven': 'James', 'hohmi': None, 'given2': 'Gertrude', 'mi2': None, 'occupation': 'manager', 'workplace': None, 'la': 'House', 'address': '419 Elizabeth av '}\n", "Black\tAdams\t\t\tJane\t\t\t\tteacher\t\tHouse\t1021 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'deadhusband': None, 'title': None, 'hohgiven': 'Jane', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'teacher', 'workplace': None, 'la': 'House', 'address': '1021 s Church'}\n", "Black\tAdams\t\t\tJohn\t\t\t\tlaborer\t\tHouse\t1031 s Church\n", "\n", "{'race': 'Black', 'surname': 'Adams', 'deadhusband': None, 'title': None, 'hohgiven': 
'John', 'hohmi': None, 'given2': None, 'mi2': None, 'occupation': 'laborer', 'workplace': None, 'la': 'House', 'address': '1031 s Church'}\n", "White\tAdams\t\t\tJohn\tJ\t\t\tpresident\tAdams G & P Co and Char Pepsi-Cola Co\tHouse\t309 e 6th\n", "\n", "{'race': 'White', 'surname': 'Adams', 'deadhusband': None, 'title': None, 'hohgiven': 'John', 'hohmi': 'J', 'given2': None, 'mi2': None, 'occupation': 'president', 'workplace': 'Adams G & P Co and Char Pepsi-Cola Co', 'la': 'House', 'address': '309 e 6th'}\n" ] } ], "source": [ "pattern = re.compile(r'^(?P<race>Black|White)\\t(?P<surname>\\w+)(\\W+wid \\((?P<deadhusband>[\\w ]+)\\))?(\\W+(?P<title>Miss|Dr|Rev|Mrs))?(\\W+(?P<hohgiven>\\w+))(\\W+(?P<hohmi>[A-Z]{1})\\W)?(\\W*(?P<given2>(?!House|Boards|Rooms)[A-Z]{1}\\w*))?(\\W+(?P<mi2>[A-Z]{1})\\W)?(\\W*(?P<occupation>[a-z ]+))?(\\W+(?P<workplace>(?!House|Boards|Rooms)[A-Z0-9]{1}[A-Za-z0-9- /&\\']+))?(\\W+(?P<la>House|Boards|Rooms))?(\\W+(?P<address>[A-Za-z0-9 ]+))?.*$')\n", "data = parse(pattern)\n", "\n", "for i in range(0, 5):\n", "    print(alltext[i])\n", "    print(data[i])" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "import json\n", "with open(os.path.join(output_loc,'data.json'), 'w') as outfile:\n", "    json.dump(data, outfile)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }