# `regex` workflow

In [6]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import re

### Jeremy Howard is the guest lecturer for Lesson 9! <br>

#### In the video, he gives a three-part lesson plan:
* regex workflow
* svd
* transfer learning
    
Jeremy mentions that he uses `regex` every day in his work, and that it is essential for machine learning practitioners to develop a working knowledge of `regex`. Since we've already done deep dives into `svd` and into `transfer learning`, we'll focus on the `regex` part of this video, `from 1:50 to 21:29`.

### A simple `regex` exercise
#### To illustrate the power of `regex` and familiarize us with the way he works, Jeremy poses the following problem: <br>Let's extract all the phone numbers from the Austin Public Health Locations database and create a list of the phone numbers in the standard format `(ddd) ddd dddd`. He shows how to use `vim` to accomplish this task.
Let's listen to Jeremy for the next 20 minutes or so:

In [53]:
from IPython.display import HTML

# Play youtube video
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/5gCQvuznKn0?start=110" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

#### Some of the takeaways from the video, paraphrased:
1. A necessary but not sufficient condition for success<br>
What is the greatest difference between people who succeed and people who do not? It's entirely about tenacity. If you are willing to focus on the task and keep trying you have a good chance of succeeding. 

2. Workflow<br>
Work in an interactive environment, such as `vim` or `jupyter notebook`, so you can try things, get immediate feedback, revise, and progress toward a solution.

3. Debugging<br>
When your code fails, remember that the computer is doing exactly what you asked. A good general approach is to break the code up into smaller parts, then run it again, and find out which part doesn't work.

4. Humility<br>
It's never "I think the problem in the code is X". A better approach is to start with the working assumption "I am an idiot, and I don't understand why things aren't working". Be willing to start from scratch and check every little step.

#### OK, let's get to work on our task. We'll use `jupyter notebook` as our interactive environment.

## 1. Get the Austin Public Health Locations database
#### https://data.austintexas.gov/Health-and-Community-Services/Austin-Public-Health-Locations/6v78-dj3u/data

In [13]:
path = 'C:/Users/cross-entropy/.fastai/data/Austin_Public_Health_Locations'

#### Read the data into a pandas dataframe. 
From the `Phone Number` column, we see that the phone numbers are in the format `ddd-ddd-dddd`.

In [56]:
df = pd.read_csv(path+'/Austin_Public_Health_Locations.csv')
display(df)

Unnamed: 0,Facility Name,Street Address,Zip Code,Hours,Website,Phone Number,Other Phone,Building ID,Ownership Status,Owner,Occupying Division,Occupancy Type,Sq. Ft.,Year Built
0,Bastrop WIC Clinic,"443 Texas Highway 71\nBastrop, Texas 78602\n(3...",78602,"Monday 7:30am to 7pm, closed 12 noon to 1pm; T...",,512-972-4942,,BAS,Lease,The Marketplace at Bastrop,Community Services,Clinic,1400.0,
1,Vital Records,"7201 Levander Loop, Building C\nAustin, Texas\...",78702,"Monday to Friday, 8am to 4:30pm",https://austintexas.gov/birthcertificates,512-972-4784,,,,,,,,
2,Manor WIC Clinic,"600 West Carrie\nManor, Texas 78653\n(30.34016...",78653,"Thursday 9am to 2:30pm, closed 12 noon to 12:30pm",http://www.austintexas.gov/department/manor-wi...,512-972-4942,,MAN,,Travis County,Community Services,Clinic,1700.0,
3,St. Johns WIC Clinic,"7500 Blessing Avenue\nAustin, Texas 78752\n(30...",78752,"Monday and Tuesday 7:30 a.m. to 7 p.m., closed...",http://www.austintexas.gov/department/st-johns...,512-972-4942,,SJC,Lease,Austin Independent School Center,Community Services,"Clinic, Neighborhood Center",9559.0,2001.0
4,Betty Dunkerley Health Campus Building B,"7201 Levander Loop\nAustin, Texas 78702\n(30.2...",78702,Sunday - Saturday 7:00am to 7:00pm,,512-972-5010,,BDCB,Own,City of Austin,"Epidemiology and Public Health Preparedness, O...",Offices,2190.0,
5,Northwest WIC Clinic,"8701 Research Blvd, Suite A\nAustin, Texas 787...",78758,"Monday and Tuesday 7:30am to 7:00pm, closed 12...",http://www.austintexas.gov/department/northwes...,512-972-4942,,NWW,Lease,"Van Family Real Estate Partnership, Ltd",Community Services,Clinic,4200.0,1993.0
6,Rutherford Campus,"1520 Rutherford Lane, Bldg 1\nAustin, Texas 78...",78754,Monday and Wednesday 7:45am to 11:30am; Tuesda...,https://austintexas.gov/department/environment...,512-978-0300,,RLC,Own,City of Austin,Environmental Health Services,Offices,2500.0,
7,Montopolis Recreation Community Center,"1200 Montopolis Dr.\nAustin, Texas 78741\n(30....",78741,"Monday - Thursday: 11 AM - 9 PM, Friday 11 AM ...",,12-978-2300,,MRCC,Own,City of Austin,Community Services,Offices,,
8,Blackland Neighborhood Center,"2005 Salina St\nAustin, Texas 78722\n(30.28075...",78722,Monday to Thursday 8am to 6pm; Friday 8am to 1...,,512-972-5790,,BNC,Own,City of Austin,Community Services,"Neighborhood Center, Offices",347.0,1984.0
9,Dove Springs WIC Center,"6801 South IH-35, Suite I & J\nAustin, Texas 7...",78744,"Monday and Tuesday 7:30am to 7:00pm, closed 12...",http://www.austintexas.gov/department/dove-spr...,512-972-4942,,DOV,Lease,"LX-Northbluff Center, L.P.",Community Services,Clinic,2100.0,
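One row in the table above (index 7) shows `12-978-2300`, which does not fit the `ddd-ddd-dddd` pattern. A quick check along the following lines can flag such entries before extraction. This is only a sketch: `sample_numbers` is copied by hand from three rows of the table, and `expected_format` and `malformed` are names introduced here for illustration.

```python
import re

# Hand-copied from rows 0, 6 and 7 of the table above (illustrative sample only)
sample_numbers = ["512-972-4942", "512-978-0300", "12-978-2300"]

# Hypothetical helper pattern: exactly ddd-ddd-dddd and nothing else
expected_format = re.compile(r"\d{3}-\d{3}-\d{4}")

# Collect the entries that do not match the whole pattern
malformed = [n for n in sample_numbers if not expected_format.fullmatch(n)]
print(malformed)
```

An entry flagged this way would also be skipped by any extraction pattern that requires a three-digit area code.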


#### Read the database into a raw text string. 
This will be our starting point.

In [54]:
with open(path+'/Austin_Public_Health_Locations.csv', 'r') as file:
    data = file.read().replace('\n', '')
print(data)

Facility Name,Street Address,Zip Code,Hours,Website,Phone Number,Other Phone,Building ID,Ownership Status,Owner,Occupying Division,Occupancy Type,Sq. Ft. ,Year Built Bastrop WIC Clinic,"443 Texas Highway 71Bastrop, Texas 78602(30.10646853400044, -97.33211573399967)",78602,"Monday 7:30am to 7pm, closed 12 noon to 1pm; Tuesday and Friday closed; Wednesday and Thursday 7:30 am to 4:30 pm, closed 12 noon to 12:30 pm (closed second Wednesday of each month); Second Saturday of each month 8am to 12 noon",,512-972-4942,,BAS,Lease,The Marketplace at Bastrop,Community Services ,Clinic,1400,N/AVital Records,"7201 Levander Loop, Building CAustin, Texas(30.252329, -97.690404)",78702,"Monday to Friday, 8am to 4:30pm",https://austintexas.gov/birthcertificates,512-972-4784,,,,,,,,Manor WIC Clinic,"600 West CarrieManor, Texas 78653(30.340164, -97.563744)",78653,"Thursday 9am to 2:30pm, closed 12 noon to 12:30pm",http://www.austintexas.gov/department/manor-wic-clinic,512-972-4942,,MAN,N/A,Travis County 
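Stripping the newlines is a choice rather than a requirement, since Python's `re` functions search across line breaks. It also carries a small risk: if one line ends in digits and the next begins with digits, removing the newline fuses them. The sketch below shows the effect on a made-up two-line string `raw_text`. It is not taken from the real file, and nothing in the output shown here suggests the real data is affected.

```python
import re

# Made-up stand-in for file contents; not the real data
raw_text = "512-972-4942\n512-972-4784"

pattern = re.compile(r"(\d\d\d)-(\d+)-(\d+)")

# Searching the text as-is: the newline keeps the two numbers apart
with_newlines = pattern.findall(raw_text)

# Searching after removing newlines: the greedy \d+ swallows the next area code
without_newlines = pattern.findall(raw_text.replace("\n", ""))

print(with_newlines)
print(without_newlines)
```

Replacing each newline with a space instead of an empty string is one way to keep the single-line form without fusing neighbouring tokens.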

## 2. Extract the phone numbers

#### We first construct a regular expression that matches the phone numbers and captures their three parts as groups, so that `findall` returns one tuple per number. This involved a bit of trial and error.

In [57]:
re_extract_phone_number = re.compile(r"(\d\d\d)-(\d+)-(\d+)")

In [59]:
phone_number_list = re_extract_phone_number.findall(data)
display(phone_number_list)

[('512', '972', '4942'),
 ('512', '972', '4784'),
 ('512', '972', '4942'),
 ('512', '972', '4942'),
 ('512', '972', '5010'),
 ('512', '972', '4942'),
 ('512', '978', '0300'),
 ('512', '972', '5790'),
 ('512', '972', '4942'),
 ('512', '972', '4100'),
 ('512', '972', '6840'),
 ('512', '972', '4942'),
 ('512', '972', '4942'),
 ('512', '972', '5400'),
 ('512', '972', '4942'),
 ('512', '972', '5000'),
 ('512', '962', '6650'),
 ('512', '972', '4942'),
 ('512', '972', '4942'),
 ('512', '972', '6740'),
 ('512', '972', '4942'),
 ('512', '972', '5000'),
 ('512', '719', '3010'),
 ('800', '514', '6667'),
 ('512', '978', '9740'),
 ('512', '972', '5139'),
 ('512', '972', '4942'),
 ('512', '972', '4942'),
 ('512', '972', '6650'),
 ('512', '972', '4942')]
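The pattern above uses `\d+` for the last two groups, so it would also accept digit runs of any length. A tighter variant, sketched below, pins each group to a fixed width and adds word boundaries. The name `re_strict_phone` and the test string are made up for illustration; whether the extra strictness is wanted depends on the data.

```python
import re

# Original pattern from the notebook
re_extract_phone_number = re.compile(r"(\d\d\d)-(\d+)-(\d+)")

# Hypothetical stricter variant: exactly 3-3-4 digits, bounded on both sides
re_strict_phone = re.compile(r"\b(\d{3})-(\d{3})-(\d{4})\b")

# Made-up text mixing a phone number with a phone-like code
text = "Call 512-972-4942, ref 123-4-5"

loose = re_extract_phone_number.findall(text)
strict = re_strict_phone.findall(text)
print(loose)
print(strict)
```

The loose pattern returns both the phone number and the reference code, while the strict one keeps only the phone number.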

## 3. Put the phone numbers in the desired format

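#### Next we join the parts of each tuple, separated by spaces. Note that this gives `ddd ddd dddd`; the parentheses around the area code in the target format `(ddd) ddd dddd` are not added in this step.

In [52]:
# `parts` avoids shadowing the built-in `tuple`
[' '.join(parts) for parts in phone_number_list]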

['512 972 4942',
 '512 972 4784',
 '512 972 4942',
 '512 972 4942',
 '512 972 5010',
 '512 972 4942',
 '512 978 0300',
 '512 972 5790',
 '512 972 4942',
 '512 972 4100',
 '512 972 6840',
 '512 972 4942',
 '512 972 4942',
 '512 972 5400',
 '512 972 4942',
 '512 972 5000',
 '512 962 6650',
 '512 972 4942',
 '512 972 4942',
 '512 972 6740',
 '512 972 4942',
 '512 972 5000',
 '512 719 3010',
 '800 514 6667',
 '512 978 9740',
 '512 972 5139',
 '512 972 4942',
 '512 972 4942',
 '512 972 6650',
 '512 972 4942']
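The target format stated at the start was `(ddd) ddd dddd`, with parentheses around the area code, whereas the list above has none. One way to close that gap is sketched below, using two tuples copied from the extraction output. The names `formatted` and `rewritten` are introduced here, and the `re.sub` variant is an alternative in the spirit of a search-and-replace, not necessarily what the video does.

```python
import re

# Two tuples taken from the extraction output above
phone_number_list = [("512", "972", "4942"), ("800", "514", "6667")]

# Unpack each tuple into the target layout (ddd) ddd dddd
formatted = ["({}) {} {}".format(*parts) for parts in phone_number_list]
print(formatted)

# Alternative: rewrite the text directly, using backreferences to the groups
text = "512-972-4942"
rewritten = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2 \3", text)
print(rewritten)
```

The first form works on the tuples already extracted; the second skips the intermediate list and edits the raw text in one pass.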

#### Voila! Finis.