In the last post, I introduced the concordance and how it can be used to examine, understand, and identify problems in content. As a reminder, a concordance is a list of the words used in a body of work, together with their immediate contexts [http://en.wikipedia.org/wiki/Concordance]. For example, I can use a concordance to understand the different usages of the word “reactor” in a Wikipedia article on the Fukushima Nuclear Disaster:
> python concordance.py '(?i).{30}reactor.{30}' fukushima.txt
arch 2011 of the four damaged reactor buildings Date 11 March 2011
es six separate boiling water reactors originally designed by Gener
O). At the time of the quake, Reactor 4 had been de-fueled while 5
the earthquake, the remaining reactors 1-3 shut down automatically
olant water through a nuclear reactor for several days in order to
wn. As the pumps stopped, the reactors overheated due to the normal
first few days after nuclear reactor shutdown (smaller amounts of
, only prompt flooding of the reactors with seawater could have coo
ause it would ruin the costly reactors permanently. Flooding with s
the water boiled away in the reactors and the water levels in the
...
The Python code that generates this information is short and simple. It takes two command-line arguments: the expression used to search the text and the name of the text file. The program reads the input text, then uses the findall function from the re module to generate a list of matches. A simple for loop iterates over the results and prints them. Here is the core of the program:
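import re
import sys

# Read the whole input file named by the second argument
with open(sys.argv[2]) as fp:
    txt = fp.read()

# Print every match of the expression given as the first argument
for matchstr in re.findall(sys.argv[1], txt):
    print(matchstr)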
This script is short because most of the processing happens inside the re.findall function, which evaluates the regular expression against the text. Using this powerful capability, you can create a wide variety of input expressions to match the content in your text file. Here are just a few examples:
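To make the role of re.findall a little more concrete, here is a minimal sketch that wraps the same call in a function. The names `concordance` and `ignore_case` and the sample sentence are my own inventions for illustration, and the sketch assumes Python 3. It passes case-insensitivity as a flag rather than writing `(?i)` inside the expression.

```python
import re


def concordance(expression, text, ignore_case=False):
    """Return every match of `expression` in `text` as a list of strings.

    `concordance` and `ignore_case` are hypothetical names for this sketch.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.findall(expression, text, flags)


# Hypothetical sample text standing in for the contents of a file
sample = "The Reactor shut down. Each reactor was then inspected."

# Show up to 10 characters of context on either side of each hit
for match in concordance(r".{0,10}reactor.{0,10}", sample, ignore_case=True):
    print(match)
```

One reason to pass the flag separately: Python 3.11 and later reject an inline `(?i)` that is not at the very start of the expression, so keeping the flag out of the pattern sidesteps that restriction.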
| Expression | Meaning |
|---|---|
| day | Match any string containing “day” |
| [Dd]ay | Match any string containing “day” or “Day” |
| (?i)day | Case-insensitive match for any string containing “day” in any combination of upper- and lower-case characters |
| (?i)\bday\b | Case-insensitive match for the whole word “day” only, not “Sunday”, “days”, etc. |
| .{10}day | Match any string containing “day” and any 10 characters before it |
| .{10}day.{10} | Match any string containing “day” and any 10 characters before and after |
| .(?:the \|a )day | Match “the day” or “a day” (along with the one character before it), but not “someday” |
| on \w+ day | Match “on __ day” where “__” is any single word |
| on.{,30} reactor | Match “on” followed by “ reactor” within 30 characters |
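A quick way to get a feel for these expressions is to run a few of them against a short sentence and look at what comes back. The sketch below does that with a handful of rows from the table above. The sample sentence is invented for this illustration (it is not from the Fukushima article), and the code assumes Python 3.

```python
import re

# Hypothetical sample sentence, invented for this illustration
sample = "Sunday was quiet. The next day, on that day, the reactor cooled."

# A handful of expressions taken from the table above
expressions = [
    r"day",
    r"(?i)\bday\b",
    r".{10}day",
    r"on \w+ day",
    r"on.{,30} reactor",
]

# Collect the matches for each expression and print them side by side
results = {}
for expression in expressions:
    results[expression] = re.findall(expression, sample)
    print(expression, "=>", results[expression])
```

Comparing the first two results shows the effect of `\b`: the plain `day` expression also picks up the end of “Sunday”, while the word-boundary version skips it.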
At first glance, regular expressions can seem bewildering and incomprehensible, but a simple template with examples like these can help you start using them productively. Another issue is that running a Python program from the command line is not a common way of working on a Windows computer in 2013. In the next post, I’ll explore how to make this powerful script more user-friendly and accessible.


