Using Python to Emulate Unix Pipes
Unix utility emulation¶
Introduction¶
This post is an example of emulating Unix utilities in Python. I was prompted by another blog post about the difficulties of Unix shell scripting. The blogger was trying to accumulate the time taken to execute a given script multiple times (difficulties arose because Unix shell scripting doesn't do floating-point calculations easily). Ignoring the details, I wondered how I would do the core script in Python.
See: https://blog.plover.com/Unix/tools.html
awk '{print $11}' FILE_NAME_PATTERN | sort | uniq -c | sort -n | grep -v EXCLUDE_PATTERN
Basically, this says:

- For all files with names that match a specified pattern
- Read the file, extracting the 11-th field of each line
- Sort the fields
- Use uniq to output the unique field values, with the count of each unique value prepended
- Sort into ascending numerical order
- Exclude fields that match a given pattern.
The script is being used to process website logs.
import glob
import re
import collections
Load up lab_black to format our Python nicely.
%reload_ext lab_black
Find file names¶
Find all file names that match a pattern in target directory
SOURCE_DIR = '../data/'
fnames = glob.glob(SOURCE_DIR + 'test01 - Copy (*).txt')
Read Files¶
We create a Counter instance, read the contents of each file, split each line into fields, and update the count of the second field. Note that we use the with statement to avoid all the file-closing cleanup.

We also cater for the case where a log line has NO second field.

Note a gotcha: if you give a string to Counter, it treats it as a sequence of characters, and counts each character. You have to wrap a string in a list.

Finally, most_common() gives us the list in descending order of count; we reverse it (to match sort -n) and make it into a list again.
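The gotcha is easy to demonstrate on its own:

```python
import collections

# A bare string is iterated character by character...
print(collections.Counter('aab'))    # Counter({'a': 2, 'b': 1})

# ...so wrap it in a list to count it as a single value.
print(collections.Counter(['aab']))  # Counter({'aab': 1})
```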
# create a counter of values from field index 1 (the second field)
f1_count = collections.Counter()

# open each file name
for fname in fnames:
    with open(fname) as f:
        # read all lines in this file
        lines = f.readlines()

        # strip off leading and trailing whitespace, split on whitespace,
        # update count of second field
        for line in lines:
            try:
                # get field (if present)
                field = line.strip().split()[1]

                # update count; note passing in a bare string gets it
                # chopped into chars, so pass a list with the string
                # as its only item
                f1_count.update([field])
            except IndexError:
                # no second field; ignore this line (maybe blank?)
                pass
            # end try
        # end for
    # end with
# end for

# sort list by count, then reverse, then turn into a list again
counts = list(reversed(f1_count.most_common()))
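The sort step can be seen in isolation: most_common() returns (value, count) pairs in descending order of count, so reversing gives the ascending order that sort -n would produce.

```python
import collections

c = collections.Counter({'f': 6, 'b': 3, 'a': 1})
descending = c.most_common()            # [('f', 6), ('b', 3), ('a', 1)]
ascending = list(reversed(descending))  # [('a', 1), ('b', 3), ('f', 6)]
print(ascending)
```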
Excluding Don't Cares¶
Finally, we go through the list of (field, count) tuples, excluding those that match the specified pattern. I made this a little fancy, in that I catered for the case with no exclusion pattern.
In the spirit of Unix, the output is just the raw tuples.
# do the exclusion on RE pattern
exclude = '^c$'
final_counts = [
    (v, n)
    for v, n in counts
    if (exclude is None) or (re.search(exclude, v) is None)
]
# raw display on counts
__ = [print(v, n) for v, n in final_counts]
vvvvvv 1
a 1
d 1
b 3
f 6
Fancy Report¶
I decided to add a reporting function that has the exclusion function built in. Not quite in the spirit of Unix, but nicer to look at. (A manager of Arsenal FC once famously said "If you want entertainment, go to the circus"; Unix bros would probably say "If you want to look at something nice, go to an art gallery".)
def report_counts(
    counts: list, exclude: str = None
) -> None:
    '''
    report_counts: prints a formatted report showing values and counts,
    excluding values that match a RE pattern

    Parameters
    counts: list of form [(v1, n1), (v2, n2) ...], v_i strings, n_i counts
    exclude: string holding RE pattern to suppress a line if pattern
        matches in v_i string; default None
    '''
    title1 = 'Value'
    title2 = 'Count'
    underbar = '-'
    col1 = 15
    col2 = 5

    print(f'{title1:^{col1}}|{title2:^{col2}}')
    print(
        f'{underbar:{underbar}^{col1}}|{underbar:{underbar}^{col2}}'
    )

    # print line of report if no exclude pattern given,
    # or if exclude pattern (non-None) not seen
    __ = [
        print(f'{v:>{col1}}|{n:>{col2}}')
        for v, n in counts
        if (exclude is None)
        or (re.search(exclude, v) is None)
    ]
# end
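The report relies on Python's f-string fill-and-align mini-language: ^ centers, > right-aligns, and a character before ^ sets the fill. A quick illustration with the same column widths:

```python
col1, col2 = 15, 5
print(f'{"Value":^{col1}}|{"Count":^{col2}}')  # centered titles
print(f'{"-":-^{col1}}|{"-":-^{col2}}')        # dash-filled rule
print(f'{"vvvvvv":>{col1}}|{1:>{col2}}')       # right-aligned data row
```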
report_counts(counts, exclude='^c|b$')
     Value     |Count
---------------|-----
         vvvvvv|    1
              a|    1
              d|    1
              f|    6
report_counts(final_counts)
     Value     |Count
---------------|-----
         vvvvvv|    1
              a|    1
              d|    1
              b|    3
              f|    6
More Pythonic?¶
The nested for loops above are not very Pythonic-looking. The code below collapses them into a set of nested comprehensions.

Sadly, so far as I can see, there is no way to get the effect of a with context manager in a list comprehension. Also sadly, I can't see any way to catch and ignore exceptions in a list comprehension, which makes them very brittle in this case.
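One workaround (my own sketch, not part of the original pipeline) is to push both the with statement and the missing-field check into a small generator function, which a Counter can then consume lazily:

```python
import collections

def fields_from(fnames, index=1):
    # Yield the chosen field from each line of each file.
    # The with statement closes each file promptly, and the
    # length check skips lines with no such field.
    for fname in fnames:
        with open(fname) as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) > index:
                    yield parts[index]

# usage (with fnames as above):
# counts = collections.Counter(fields_from(fnames))
```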
zz = collections.Counter(
    [
        x
        for file_list in [
            [
                line.split()[1]
                for line in open(fname).readlines()
            ]
            for fname in fnames
        ]
        for x in file_list
    ]
)
zz
Counter({'b': 3, 'c': 5, 'f': 6, 'd': 1, 'a': 1, 'vvvvvv': 1})
This shows the input to the Counter object (a list of field tokens)
[
    x
    for file_list in [
        [
            line.split()[1]
            for line in open(fname).readlines()
        ]
        for fname in fnames
    ]
    for x in file_list
]
['b', 'b', 'c', 'f', 'f', 'f', 'f', 'f', 'f', 'b', 'd', 'c', 'c', 'c', 'c', 'a', 'vvvvvv']
This comprehension returns a list, each item of which is the list of field tokens in the corresponding file. The code above flattens this into a single list.
[
    [line.split()[1] for line in open(fname).readlines()]
    for fname in fnames
]
[['b'], ['b'], ['c', 'f', 'f', 'f', 'f', 'f', 'f'], ['b'], ['d', 'c', 'c', 'c', 'c', 'a', 'vvvvvv']]
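An alternative to the doubled for clauses for flattening a list of lists is itertools.chain.from_iterable (standard library; not used in the original code):

```python
import itertools

nested = [['b'], ['b'], ['c', 'f', 'f']]
flat = list(itertools.chain.from_iterable(nested))
print(flat)  # ['b', 'b', 'c', 'f', 'f']
```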
Scaling Up¶
The toy data sets used above are all very well, but then I thought: what if my files are megabytes or gigabytes big? So I recast my code to use generators (i.e. lazy evaluation, rather than eager evaluation).

This approach might fall over if there are huge numbers of log files (more than the allowed number of open files), because, as noted above, the files are never explicitly closed.
zz_gen = (
    [line.split()[1] for line in open(fname).readlines()]
    for fname in fnames
)

collections.Counter(
    (x for file_list in zz_gen for x in file_list)
)
Counter({'b': 3, 'c': 5, 'f': 6, 'd': 1, 'a': 1, 'vvvvvv': 1})
zz_gen
<generator object <genexpr> at 0x0000012F35ABE1B0>
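Note that a generator expression is single-use: having been consumed by the Counter above, zz_gen is now exhausted, and iterating it again would yield nothing. A minimal illustration:

```python
gen = (x * x for x in range(3))
print(list(gen))  # [0, 1, 4]
print(list(gen))  # [] - the generator is exhausted
```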