# Python Tidbits for NLP
## Anoop Sarkar

This is an extremely concise introduction to Python for programmers already proficient in at least one other programming language. 

A slower and more thorough tutorial is the [Python Tutorial](https://docs.python.org/2/tutorial/) by Guido van Rossum. Read it at least up to Chapter 10.

These code fragments are for Python version 3.x.

## Be agnostic of the operating system

Python code, especially for file system interaction, can be written so that it runs on many different operating systems. This makes your code more portable and easier to maintain as well.

In [2]:
import os
import sys
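if sys.platform == 'win32':
    # splitdrive gives e.g. 'C:'; add the separator so the joined path is absolute
    ROOT = os.path.splitdrive(os.path.abspath('.'))[0] + os.sep
elif sys.platform.startswith('linux') or sys.platform == 'darwin':
    # Python 3 reports 'linux' (Python 2 reported 'linux2')
    ROOT = os.sep
else:
    raise ValueError("unknown operating system")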
dictfile = os.path.join(ROOT, 'usr','share','dict','words')
print(dictfile)

/usr/share/dict/words
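The same path can also be built with the standard `pathlib` module. This is a minimal sketch, assuming only the path needs to be constructed; the names `root` and `dict_path` are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Path.anchor is the drive plus root: '/' on POSIX, something like 'C:\\' on Windows
root = Path(Path.cwd().anchor)

# the / operator joins path components with the correct separator
dict_path = root / 'usr' / 'share' / 'dict' / 'words'
print(dict_path)
```

With `pathlib` there is no need to branch on `sys.platform`, because the anchor of the current directory already carries the platform-specific root.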


## For loops

Use the built-in `range` function to loop over a sequence of integers. `range(start, stop, step)` excludes `stop`.

In [3]:
for i in range(1,10,2):
 print(i)

1
3
5
7
9


## Opening and closing file handles

Always open a file using the `with` statement because it closes the file at the end of the statement (even if there is an exception during interaction with the file system). A for loop can be used to iterate through lines using the file handle.

In [4]:
with open(dictfile, 'r') as fhandle:
    for line in fhandle:
        line = line.strip()
        if len(line) > 23:
            print(line)

antidisestablishmentarianism
formaldehydesulphoxylate
pathologicopsychological
scientificophilosophical
tetraiodophenolphthalein
thyroparathyroidectomize


## List comprehensions

List comprehensions are a concise replacement for many for-loops. The example below lists the unique elements of `x` in one line using the built-in set data structure (a set is unordered, so the order of the output can vary).

In [5]:
x = ['a', 'b', 'c', 'd', 'a', 'b', 'c']
print([ i for i in set(x) ])

['a', 'd', 'b', 'c']


Also, you can use an 'if' statement in a list comprehension.

In [6]:
print([ i for i in set(x) if i != 'a'])

['d', 'b', 'c']
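To make the correspondence with a for-loop explicit, here is a small sketch that computes the same result both ways. It uses `sorted(set(x))` so that the output order is deterministic; the names `loop_result` and `comp_result` are illustrative.

```python
x = ['a', 'b', 'c', 'd', 'a', 'b', 'c']

# explicit for-loop with an if statement
loop_result = []
for i in sorted(set(x)):
    if i != 'a':
        loop_result.append(i)

# equivalent list comprehension
comp_result = [i for i in sorted(set(x)) if i != 'a']

print(loop_result)
print(comp_result)
```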


Using list comprehensions, the following Python code prints out the lowercased tokens of length greater than 15 from Sense and Sensibility (note that one of them occurs twice).

In [7]:
import nltk
longwords = [ word.lower() for word in nltk.corpus.gutenberg.words('austen-sense.txt') if len(word) > 15]
print(longwords)

['incomprehensible', 'incomprehensible', 'disinterestedness', 'companionableness', 'disqualifications']


## Enumerate

`enumerate` is useful when you need a counter alongside each element of a list.

In [8]:
x = ['a', 'c', 'b', 'd']
for (index,element) in enumerate(x): print(index, element)

0 a
1 c
2 b
3 d
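The counter does not have to start at zero. As a small sketch, the optional `start` argument of `enumerate` gives 1-based numbering, which is handy for line numbers; the variable names are illustrative.

```python
x = ['a', 'c', 'b', 'd']

# start=1 makes the counter begin at 1 instead of 0
for (line_number, element) in enumerate(x, start=1):
    print(line_number, element)

numbered = list(enumerate(x, start=1))
```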


## Dictionary comprehensions

Dictionary comprehensions are just like list comprehensions except they let you build a dictionary instead of a list. Say we want to build a dictionary where the dictionary keys are lowercase ASCII characters and the values are the probabilities for each character. In the following we just assign a random probability to each lowercase character.

In [9]:
import string
import numpy
# set up a random probability distribution over lowercase ASCII characters
counts = [ numpy.random.random() for c in string.ascii_lowercase ]
total = sum(counts)
# the following is a dictionary comprehension
prob = { c: (counts[i] / total) for (i,c) in enumerate(string.ascii_lowercase) }
print(prob['e'])
print(prob['z'])

0.040804314032715325
0.04517300968758016
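In practice the probabilities usually come from counts rather than random numbers. The following sketch, under the assumption that relative frequency is the estimate we want, builds a character distribution from a short made-up string; `text`, `char_counts`, `char_total` and `char_prob` are illustrative names.

```python
from collections import Counter

text = "the cat sat on the mat"   # illustrative sample text

# count only alphabetic characters
char_counts = Counter(c for c in text if c.isalpha())
char_total = sum(char_counts.values())

# dictionary comprehension: relative frequency of each character
char_prob = {c: n / char_total for (c, n) in char_counts.items()}
print(char_prob['t'])
```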


## argmax

Often we wish to compute the argmax using a probability distribution. The argmax function returns the element that has the highest probability. $$\hat{x} = \arg\max_x P(x)$$

In [10]:
def P(c):
 return prob[c]
# the character with the highest probability is given by argmax_c P(c)
argmax_char = max(string.ascii_lowercase, key=P)
print(argmax_char, P(argmax_char))

w 0.07334002043038697
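When the distribution is stored in a dictionary, the wrapper function is optional: the dictionary's own `get` method can serve as the key. A minimal sketch with a hypothetical toy distribution `toy_prob`:

```python
toy_prob = {'a': 0.2, 'b': 0.5, 'c': 0.3}   # hypothetical distribution

# iterating over a dict yields its keys; key=toy_prob.get ranks them by value
argmax_key = max(toy_prob, key=toy_prob.get)
argmin_key = min(toy_prob, key=toy_prob.get)
print(argmax_key, toy_prob[argmax_key])
```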


## Formatted Strings

Formatted strings insert values into a string: `%s` stands for a string, `%d` for a decimal integer, and `%f` for a floating point number. The same can be done with named placeholders and with `str.format`.

In [11]:
print("%s = %d and %s = %f" % ("x", 10, "y", 0.0003))

x = 10 and y = 0.000300


In [12]:
print("The %(foo)s is %(bar)i." % {'foo': 'answer', 'bar':42})

The answer is 42.


In [13]:
print("The {foo} is {bar}".format(foo='answer', bar=42))

The answer is 42
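Python 3.6 and later also offer f-strings, which evaluate expressions directly inside the string literal. This sketch mirrors the examples above; the variable names are illustrative.

```python
foo = 'answer'
bar = 42
y = 0.0003

s1 = f"The {foo} is {bar}."
# a format specification follows the colon, as in str.format
s2 = f"y = {y:f} or {y:.2e}"
print(s1)
print(s2)
```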


## Tuples

The builtin function 'tuple' can be used to create n-grams from a list of words.

In [14]:
words = ['a', 'good', 'book', 'is', 'all', 'you', 'need', '.']
print("print unigrams aka 1-grams: ", end='')
print(words)

print("print bigrams aka 2-grams: ", end='')
print([ tuple(words[i:i+2]) for i in range(len(words)-1) ])

print("print trigrams aka 3-grams: ", end='')
print([ tuple(words[i:i+3]) for i in range(len(words)-2) ])

print unigrams aka 1-grams: ['a', 'good', 'book', 'is', 'all', 'you', 'need', '.']
print bigrams aka 2-grams: [('a', 'good'), ('good', 'book'), ('book', 'is'), ('is', 'all'), ('all', 'you'), ('you', 'need'), ('need', '.')]
print trigrams aka 3-grams: [('a', 'good', 'book'), ('good', 'book', 'is'), ('book', 'is', 'all'), ('is', 'all', 'you'), ('all', 'you', 'need'), ('you', 'need', '.')]
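The slicing pattern generalizes to any n. One possible sketch uses `zip` over shifted copies of the list; the helper name `ngrams` is hypothetical and not from the original notebook.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    # zip stops at the shortest shifted copy, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

words = ['a', 'good', 'book', 'is', 'all', 'you', 'need', '.']
print(ngrams(words, 2))
print(ngrams(words, 3))
```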


## Sorting

The function `itemgetter` from the `operator` module provides a concise way to sort a list of tuples on a chosen tuple element. Here `itemgetter(1)` returns the second component of each tuple, which is used as the sort key.

In [15]:
word_freq = [ ('the', 1223), ('a', 2413), ('Mr.', 450), ('Elton', 10) ]
print(word_freq)
from operator import itemgetter
word_freq.sort(key=itemgetter(1), reverse=True)
print(word_freq)

[('the', 1223), ('a', 2413), ('Mr.', 450), ('Elton', 10)]
[('a', 2413), ('the', 1223), ('Mr.', 450), ('Elton', 10)]


You can also use the built-in 'map' function to get the sorted values.

In [16]:
print(list(map(itemgetter(1), word_freq)))

[2413, 1223, 450, 10]


In [17]:
print(list(map(itemgetter(0), word_freq)))

['a', 'the', 'Mr.', 'Elton']
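`list.sort` changes the list in place. If the original order should be kept, the built-in `sorted` returns a new list instead. A short sketch, where `by_word` and `by_word_nocase` are illustrative names:

```python
from operator import itemgetter

word_freq = [('the', 1223), ('a', 2413), ('Mr.', 450), ('Elton', 10)]

# sort on the word; uppercase letters order before lowercase ones
by_word = sorted(word_freq, key=itemgetter(0))

# a lambda can compute a derived key, here a case-insensitive one
by_word_nocase = sorted(word_freq, key=lambda pair: pair[0].lower())

print(by_word)
print(by_word_nocase)
print(word_freq)   # unchanged
```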


## Classes

A class works much as you would expect from languages such as C++ or Java. The methods of a class are the functions indented under the `class` statement. Each instance method must take at least one argument: the first argument is a reference to the instance on which the method is called. By convention it is called `self`; it is analogous to, but not exactly the same as, the C++ `this` pointer.

In [18]:
class C:
    def foo(self):
        return self.a
    def bar(self, a):
        self.a = a
x = C()
x.bar('a')
print(x.foo())

a


## Constructor and Destructor methods in a class

The magic method `__init__` is the constructor for the class. `__del__` is the destructor: it is called when the object is about to be destroyed, for example when its reference count drops to zero or the garbage collector reclaims it, so its exact timing is not guaranteed (like Java, Python does not require explicit memory management).

In [19]:
from itertools import islice

class FileObject:
    '''Wrapper for file objects to make sure the file gets closed on deletion.'''

    def __init__(self, filename):
        self.file = open(filename, 'r')

    def __del__(self):
        self.file.close()
        del self.file

f = FileObject(dictfile) # dictfile is defined in an earlier cell
for line in islice(f.file, 5):
    print(line,end='')
del f # get rid of f -- this is typically not explicitly done in Python. trust the garbage collector to do it for you.

A
a
aa
aal
aalii
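Because the timing of `__del__` depends on the interpreter, a class that owns a resource is often written as a context manager instead, so that it works with the `with` statement. The following is a sketch of that alternative, not the notebook's own approach; the class name `ManagedFile` and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

class ManagedFile:
    """Hypothetical wrapper that closes its file when the with block ends."""

    def __init__(self, filename):
        self.filename = filename
        self.file = None

    def __enter__(self):
        self.file = open(self.filename, 'r')
        return self.file

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()
        return False   # do not suppress exceptions raised inside the block

# write a small temporary file so the sketch does not depend on dictfile
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write("A\na\naa\n")
    tmp_name = tmp.name

with ManagedFile(tmp_name) as fh:
    lines = [line.strip() for line in fh]

print(lines)
print(fh.closed)   # the file was closed by __exit__
os.remove(tmp_name)
```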


## Iterators

A class is an iterator if it defines the methods `__iter__` and `__next__` (called `next` in Python 2), as shown in this example.

In [20]:
# circular queue
class cq:
    q = [] # needs to be initialized with a list
    def __init__(self,q): # the argument q is a list
        self.q = q
    def __iter__(self):
        return self
    def __next__(self):
        r = self.q[0]
        self.q = self.q[1:] + [r] # rotate the list
        return r

x = cq([1,2,3])
print(x.__next__())
print(x.__next__())
print(x.__next__())
print(x.__next__())
print(x.__next__())

1
2
3
1
2
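The circular queue never ends. A finite iterator signals that it is exhausted by raising `StopIteration`, which a for-loop or `list` catches automatically. A minimal sketch with a hypothetical `Countdown` class, also showing the built-in `next` function as the usual way to call `__next__`:

```python
class Countdown:
    """Hypothetical finite iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration   # tells the caller there are no more values
        value = self.current
        self.current -= 1
        return value

print(list(Countdown(3)))

it = Countdown(2)
first = next(it)    # next(it) calls it.__next__()
second = next(it)
print(first, second)
```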


## Magic!

Methods like `__iter__` in the above code for `cq` are called magic methods. Here is a [guide to all Python magic methods](http://anoopsarkar.github.io/nlp-class/cached/magicmethods.pdf).

## Iteration tools

The function islice allows you to take a slice of an iterator.

In [21]:
from itertools import islice
x = cq([1,2,3])
for i in islice(x, 5):
 print(i)

y = cq([1,2,3,4,5])
for i in islice(y,3): print(i)
z = [i for i in islice(y,10)]
print(z)

1
2
3
1
2
1
2
3
[4, 5, 1, 2, 3, 4, 5, 1, 2, 3]
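The standard library already provides a circular iterator: `itertools.cycle` repeats the elements of any iterable forever, so it can stand in for a hand-written circular queue when no other behavior is needed. A short sketch with illustrative variable names:

```python
from itertools import cycle, islice

# cycle repeats its input endlessly; islice takes a finite prefix of it
first_five = list(islice(cycle([1, 2, 3]), 5))
print(first_five)

# like any iterator, a cycle object remembers where it stopped
c = cycle([1, 2, 3, 4, 5])
head = list(islice(c, 3))
rest = list(islice(c, 4))
print(head, rest)
```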


## Convenient Dictionaries

The class defaultdict allows convenient insertion into a dictionary. You do not need to check if a key exists first before updating the value when using defaultdict.

In [22]:
from collections import defaultdict
foo = defaultdict(int)
bar = defaultdict(list)
foo['a'] += 1
foo['a'] += 1
bar['b'].append(1)
bar['b'].append(2)
print(foo, bar)

defaultdict(<class 'int'>, {'a': 2}) defaultdict(<class 'list'>, {'b': [1, 2]})
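A common NLP use is counting bigrams. The sketch below nests one `defaultdict` inside another so that neither level needs an existence check; the word list and the name `bigram_count` are made up for illustration.

```python
from collections import defaultdict

words = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

# the outer default is a new inner defaultdict(int) for each first word
bigram_count = defaultdict(lambda: defaultdict(int))
for (w1, w2) in zip(words, words[1:]):
    bigram_count[w1][w2] += 1

print(bigram_count['a']['rose'])
print(dict(bigram_count['rose']))
```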


## Generators

Prefer generators to lists when you only need to iterate over the values once. A generator produces its values lazily, one at a time, like a stream, while a list holds all of its elements in memory at once.

In [23]:
def sum_of_squares(n):
    v = 0
    for i in range(1,n+1):
        v += i*i
        yield v
for i in sum_of_squares(10): print(i)

1
5
14
30
55
91
140
204
285
385
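Because values are produced on demand, a generator can even describe an unbounded sequence, something a list cannot do. A minimal sketch with a hypothetical generator `squares`, truncated with `islice`:

```python
from itertools import islice

def squares():
    """Yield 1, 4, 9, ... without end; values are computed only when requested."""
    i = 1
    while True:
        yield i * i
        i += 1

first_squares = list(islice(squares(), 5))
print(first_squares)
```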


## Generator expressions

In [24]:
a = [1,2,3,4] # this is a list
b = [2*x for x in a] # this is a list comprehension
c = (2*x for x in a) # this is a generator, not a list. it creates an iterator object
print(b)
print(c)

[2, 4, 6, 8]
<generator object <genexpr> at 0x17b552730>


In [25]:
n = ((a,b) for a in range(0,2) for b in range(4,6))
for i in n:
 print(i)

(0, 4)
(0, 5)
(1, 4)
(1, 5)
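Two properties of generator expressions are worth seeing in code: they can be passed directly to a function such as `sum` without building an intermediate list, and they can be consumed only once. The variable names in this sketch are illustrative.

```python
a = [1, 2, 3, 4]

# when a generator expression is the only argument, the extra parentheses can be dropped
doubled_total = sum(2 * x for x in a)
print(doubled_total)

gen = (2 * x for x in a)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing is left the second time
print(first_pass, second_pass)
```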


## More on Generators

Read [Generator Tricks for Systems Programmers](http://anoopsarkar.github.io/nlp-class/cached/generators.pdf) by David Beazley.

## Use built-in functions

In [26]:
# from Part 2 of Peter Norvig's excellent essay on xkcd 1313 
# http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313-part2.ipynb
import re
searcher = re.compile('^a.o').search
data = frozenset('''all particularly just less indeed over soon course still yet before 
 certainly how actually better to finally pretty then around very early nearly now 
 always either where right often hard back home best out even away enough probably 
 ever recently never however here quite alone both about ok ahead of usually already 
 suddenly down simply long directly little fast there only least quickly much forward 
 today more on exactly else up sometimes eventually almost thus tonight as in close 
 clearly again no perhaps that when also instead really most why ago off 
 especially maybe later well together rather so far once'''.split())
%timeit { s for s in data if searcher(s) }
%timeit set(filter(searcher, data))

9.43 µs ± 32 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
8.34 µs ± 9.61 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


So the built-in function `filter` takes about 12% less time here (8.34 µs versus 9.43 µs) than the set comprehension with an `if` clause.
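`%timeit` is an IPython magic and only works inside IPython or Jupyter. In a plain Python script a similar measurement can be sketched with the standard `timeit` module. The smaller word set and the function names below are assumptions for illustration, and the timings will differ from machine to machine.

```python
import re
import timeit

searcher = re.compile('^a.o').search
data = frozenset('about above also ago among alone'.split())

def with_comprehension():
    return {s for s in data if searcher(s)}

def with_filter():
    return set(filter(searcher, data))

# timeit.timeit returns the total time in seconds for `number` calls
t_comp = timeit.timeit(with_comprehension, number=10000)
t_filter = timeit.timeit(with_filter, number=10000)

print(with_comprehension() == with_filter())
print("comprehension: %f s, filter: %f s" % (t_comp, t_filter))
```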

## Unpacking tuples and dictionaries

In [27]:
def concat(x, y): 
 return x + y 

foo = ('A', 'B')
bar = {'y': 'B', 'x': 'A'}

print(concat(*foo))
print(concat(**bar))

AB
AB


## Exceptions

In [28]:
def doit(x,y):
    if x < 0:
        raise ValueError("x should be >= 0")
    return y

print(doit(0,10))

10
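A caller can handle the exception with `try`/`except` instead of letting the program stop. The sketch below repeats the definition of `doit` so that it is self-contained; `result` and `message` are illustrative names.

```python
def doit(x, y):
    if x < 0:
        raise ValueError("x should be >= 0")
    return y

try:
    result = doit(-1, 10)
except ValueError as err:
    # the exception object carries the message passed to ValueError
    result = None
    message = str(err)

print(result, message)
```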


In [29]:
# This will raise an exception, if you uncomment the following line:
# print(doit(-1,10))

## Advanced Features

### Easter Eggs

In [30]:
import this, codecs
print(this.s)
print(codecs.encode(this.s, "rot-13")) # -> uryyb

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Gur Mra bs Clguba, ol Gvz Crgref

Ornhgvshy vf orggre guna htyl.
Rkcyvpvg vf orggre guna vzcyvpvg.
Fvzcyr vf orggre guna pbzcyrk.
Pbzcyrk vf or

Uncomment the following easter eggs to see what happens.

In [31]:
# from __future__ import braces

In [32]:
# import __phello__

In [33]:
import antigravity

### Scoping and Namespaces in Python

This section is strictly for programming language wonks. Scoping in Python can sometimes be tricky.

In [34]:
x = 'a'
class wat:
    x = 'b'
    def __init__(self):
        print("1:", x)
        print("2:", self.x)
f = wat()
print("3:", f.x)

1: a
2: b
3: b


In [35]:
x = 'a'
print("1:", list(x for x in (1,2)), x)
print("2:", [x for x in (1,2)], x)

1: [1, 2] a
2: [1, 2] a


For more visit http://programmingwats.tumblr.com/

### Function Decorators

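Python has syntactic support for wrapping one function in another: writing `@foo` above a function definition replaces that function with the result of calling `foo` on it.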

In [36]:
## wrapping bar with foo using a decorator: bar becomes foo(bar)

def foo(f):
    def decorator_func(*args, **keyword_args):
        f(*args, **keyword_args)
        print("Desecration is the smile on my face\n")
    return decorator_func

@foo
def bar(n):
    print(n)
bar("Never in the wrong time or wrong place")

## the same wrapping done directly by calling foo(bar_bar)(args)

def bar_bar(n):
    print(n)

# notice how I give a function as an argument to foo which returns a function
# and I then provide an argument to that function returned by foo
foo(bar_bar)("My face my face, hey")

Never in the wrong time or wrong place
Desecration is the smile on my face

My face my face, hey
Desecration is the smile on my face
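A wrapped function normally loses its own name and docstring, because the name now refers to the inner wrapper. The standard `functools.wraps` helper copies that metadata over. This is a small sketch with hypothetical names `shout` and `greet`; unlike the decorator above, the wrapper here also returns the wrapped function's result.

```python
import functools

def shout(f):
    @functools.wraps(f)   # copy f's name and docstring onto the wrapper
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    """Return a greeting."""
    return "hello " + name

print(greet("world"))
print(greet.__name__, "-", greet.__doc__)
```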


