About dan

Just another Python hacker

Concordance 2

In the last post, I introduced the concordance and how it can be used to examine, understand, and identify problems in content. Just as a reminder, a concordance is a list of words used in a body of work, with their immediate contexts [http://en.wikipedia.org/wiki/Concordance]. For example, I can use a concordance to understand the different usages of the word “reactor” in a Wikipedia article on the Fukushima Nuclear Disaster:

> python concordance.py '.{30}(?i)reactor.{30}' fukushima.txt
arch 2011 of the four damaged reactor buildings Date 11 March 2011
es six separate boiling water reactors originally designed by Gener
O). At the time of the quake, Reactor 4 had been de-fueled while 5
the earthquake, the remaining reactors 1-3 shut down automatically
olant water through a nuclear reactor for several days in order to
wn. As the pumps stopped, the reactors overheated due to the normal
 first few days after nuclear reactor shutdown (smaller amounts of
, only prompt flooding of the reactors with seawater could have coo
ause it would ruin the costly reactors permanently. Flooding with s
 the water boiled away in the reactors and the water levels in the
...

The Python code that generates this information is short and simple. There are two command line arguments: the expression used to search the text and the name of the text file. The Python program reads the input text, then uses the findall method from the re library to generate a list of matches. A simple for loop iterates over the results and prints them. Here is the core of the program:

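import re
import sys

# Usage: python concordance.py EXPRESSION FILE
expression = sys.argv[1]
filename = sys.argv[2]

# Read the whole input file into one string.
with open(filename) as fp:
    txt = fp.read()

# Print every non-overlapping match of the expression.
for matchstr in re.findall(expression, txt):
    print(matchstr)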

This script is short because most of the processing is happening in the re.findall function when evaluating the regular expression. Using this powerful capability, you can create a wide variety of input expressions to match the content in your text file. Here are just a few examples:

Expression        Meaning
day               Match any string containing “day”
[Dd]ay            Match any string containing “day” or “Day”
(?i)day           Case insensitive match for any string containing “day” with any combination of upper- and lower-case characters
(?i)\bday\b       Case insensitive match only the word “day”, not Sunday, days, etc
.{10}day          Match any string containing “day” and any 10 characters before it
.{10}day.{10}     Match any string containing “day” and any 10 characters before and after
.(?:the |a )day   Match “the day” or “a day”, but not “someday”
on \w+ day        Match “on __ day” where “__” is any single word
on.{,30} reactor  Match “on” followed by reactor within 30 characters
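
To get a feel for the table, a few of these expressions can be tried against a short string in the Python interpreter. This is only a sketch: the sample sentence and the show helper are made up for illustration and are not part of concordance.py.

```python
import re

# Hypothetical sample text, invented for this illustration.
SAMPLE = "On Sunday the pumps stopped. Later in the day, on that day, days passed."

def show(expression, text=SAMPLE):
    """Return every non-overlapping match of expression in text."""
    return re.findall(expression, text)

# Plain "day" also matches inside "Sunday" and "days".
print(show(r"day"))

# Word boundaries restrict the match to the standalone word.
print(show(r"(?i)\bday\b"))

# \w+ stands in for exactly one word between "on" and "day".
print(show(r"on \w+ day"))

# A non-capturing group lists the allowed lead-in words.
print(show(r"(?:the |a )day"))
```

Comparing the first two results shows how much noise the word boundaries remove.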

At first glance, regular expressions might seem bewildering and incomprehensible. But creating a simple template with examples can help you towards the productive use of regular expressions. Another issue is running a Python program on the command line, which is not a common use model on a Windows computer in 2013. In the next post I’ll explore how to make this powerful script more user-friendly and accessible.

Building a Concordance with Python

As a Python scripter who is now a Technical Writer, I’ve found an amazing number of uses of Python in my job. Python is wonderfully suited for all sorts of text processing and manipulation. I heartily recommend that any Technical Writer become familiar with a scripting language like Python to simplify their document production flow, improve the quality of their documentation and increase productivity.

One way that scripted text processing can help your documentation is to identify inconsistencies and style problems in your writing. In my first assignment as a Technical Writer, I was handed a user guide and told to “clean it up.” Documents that are passed from writer to writer tend to gather cruft over time and inherit different writing styles from different writers. How do you get a quantitative understanding of the update task, other than reading the document and noting the problems one-by-one? To answer this question, I wrote a concordance generator.

A concordance is a list of words used in a body of work, with their immediate contexts [http://en.wikipedia.org/wiki/Concordance_(publishing)]. Using a concordance, you can view word usage in context and quickly spot inconsistencies. I’ve found that this technique is much, much more effective than using the word processor’s search feature to look through the document (although the latest version of Microsoft Word does contain improvements in this area).

As an example, I’ll use the text from a random Wikipedia article. I chose the page describing the Fukushima nuclear disaster at http://en.wikipedia.org/wiki/Fukushima_Daiichi_nuclear_disaster and saved the text into a file called fukushima.txt. Using my concordance tool, I first examined how the word “reactor” was used in the text:

> python concordance.py '.{30}(?i)reactor.{30}' fukushima.txt
arch 2011 of the four damaged reactor buildings Date 11 March 2011
es six separate boiling water reactors originally designed by Gener
O). At the time of the quake, Reactor 4 had been de-fueled while 5
the earthquake, the remaining reactors 1-3 shut down automatically
olant water through a nuclear reactor for several days in order to
wn. As the pumps stopped, the reactors overheated due to the normal
 first few days after nuclear reactor shutdown (smaller amounts of
, only prompt flooding of the reactors with seawater could have coo
ause it would ruin the costly reactors permanently. Flooding with s
 the water boiled away in the reactors and the water levels in the
...

The concordance.py file contains the actual program script, which we’ll look at in a later post. The arguments to the program are the match string and a list of files to process. The string is a regular expression that is applied to the input text. If you’ve never seen a regular expression, it can be a bit intimidating. However, if you stick to a few basic patterns, you’d be amazed at what you can accomplish. The ".{30}" element says “match any character (.) thirty times.” This provides context by matching any thirty characters before and any thirty characters after the substring of interest. The "(?i)" element tells the matcher to ignore case. The program reads the fukushima.txt file and prints out every match for the expression you provide.
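
The way the context window and the case flag combine can be sketched as a small function. This is not the concordance.py source; the function name concordance and its width parameter are assumptions made for illustration. Two choices differ from the command above and are deliberate: the sketch uses ".{0,30}" instead of ".{30}" so that matches close to the start or end of the text are kept, and it passes re.IGNORECASE as a flag instead of writing "(?i)" in the middle of the pattern, because Python 3.11 and later accept inline global flags only at the very start of an expression.

```python
import re

def concordance(term, text, width=30):
    """Return each match of term with up to `width` characters of context per side."""
    # .{0,width} is greedy, so it takes the full width when that much
    # context is available and less when the match sits near an edge.
    # Note that "." does not match a newline, so context stops at line ends.
    pattern = r".{0,%d}%s.{0,%d}" % (width, term, width)
    return re.findall(pattern, text, flags=re.IGNORECASE)

# Sample sentence adapted from the output shown above.
sample = ("At the time of the quake, Reactor 4 had been de-fueled "
          "while 5 and 6 were in cold shutdown.")

for line in concordance("reactor", sample, width=10):
    print(line)
```

Keeping the width as a parameter makes it easy to widen the context when a match is hard to interpret.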

Here’s a more interesting example. I want to see how the terms “shutdown” and “shut down” are used in context. In a typical word processor, I would probably search for “shut” to make sure I catch both cases. Using concordance.py, I can modify my search expression to capture exactly what I want:

> python concordance.py '.{30}(?i)shut.?down.{30}' fukushima.txt
ed while 5 and 6 were in cold shutdown for planned maintenance.[8] I
e, the remaining reactors 1-3 shut down automatically and emergency g
from melting down after being shut down. As the pumps stopped, the re
ew days after nuclear reactor shutdown (smaller amounts of this heat
...

The "shut.?down" expression searches for “shut”, followed optionally by any character, followed by “down”. This term matches the strings “shut down”, “shut-down”, and “shutdown”. If you’re a skeptic, you’re probably thinking “Meh, I could just search for ‘shut’ and find all these cases anyway.”
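
The spellings that "shut.?down" accepts can be checked directly in Python. The candidate strings below are invented for this illustration; only the expression itself comes from the example above.

```python
import re

# Hypothetical candidate spellings, invented for this illustration.
candidates = ["shutdown", "shut down", "shut-down", "shut  down", "shutting down"]

# .? allows at most one character of any kind between "shut" and "down".
pattern = re.compile(r"shut.?down", re.IGNORECASE)

matched = [c for c in candidates if pattern.search(c)]
rejected = [c for c in candidates if not pattern.search(c)]

print("matched: ", matched)
print("rejected:", rejected)
```

The rejected list is the interesting part: a plain search for “shut” would also return “shutting down”, which this expression filters out.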

Here’s a trickier example. The Fukushima plant involved a number of different buildings or “units” which contained reactors. Your first thought might be to search for “unit”:

> python concordance.py '.{30}(?i)unit.{30}' fukushima.txt
...
tance of nuclear power in the United States was eroded sharply f
 Switzerland, Taiwan, and the United States. Much of the help an
 The multiple nuclear reactor units involved in the Fukushima Da
d exposed fuel pools at three units.[79] On 21 December 2011, th
bine and reactor buildings of units 1 and 3 of contaminated water by
...

Unfortunately, “unit” also matches “United”. You could add a space after unit, but you would miss “units”. You really just wanted to see “unit” or “units” followed by a number. This expression should do the trick:

> python concordance.py '.{30}(?i)units? \d.{30}' fukushima.txt
...
ark I containment, as used in Units 1 to 5. Key: DW, dry well enclo
ectric Power Company (TEPCO). Unit 1 is a 439 MWe type (BWR3) reac
2 Kern County earthquake.[31] Units 2 and 3 are both 784 MWe type B
ed operating in July 1974 and Unit 3 in March 1976. The earthquake
...

The "s?" term optionally matches the letter s. As a result, the expression matches both “unit” and “units”. The "\d" element means “match any digit 0 through 9.” The matcher can now find “unit 1”, “units 2 and 3”, and “Units 4, 5, and 6”, but does not match “United States”. This is the behavior we want.
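
The difference between the two searches can be reproduced on a handful of fragments taken from the output above. This is a minimal sketch; the variable names are illustrative, and the case flag is passed as re.IGNORECASE rather than written inline.

```python
import re

# Fragments copied from the concordance output shown earlier.
samples = [
    "Unit 1 is a 439 MWe type",
    "Units 2 and 3",
    "the United States",
    "three units.[79]",
]

loose = re.compile(r"unit", re.IGNORECASE)        # first attempt
strict = re.compile(r"units? \d", re.IGNORECASE)  # unit or units, a space, a digit

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(len(loose_hits), "loose matches")
print(strict_hits)
```

The stricter expression also drops “three units.[79]”, since no digit follows the word there; whether that is desirable depends on what you are auditing.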

This is a lot to swallow for one post. In later posts, I’ll dive into what makes concordance.py tick, and how we can simplify the application to be useful for non-experts.

— Dan

Writing Quality Technical Information

In the UC Santa Cruz Extension program for Technical Writing and Communication, we looked at several different resources for suggested writing styles and guidelines (the Stanford Writing in the Sciences course by Coursera was too brief to cover many different sources). The book I found most useful was Developing Quality Technical Information: A Handbook for Writers and Editors, by Gretchen Hargis, Michelle Carey, and others. The authors are members of the technical writing staff at IBM, and write about the methodologies they use to author technical documentation. Not everyone agrees with the IBM way for technical documentation, especially DITA, but the guidelines form a great starting point.

Hopefully you can take some time to read the book and pick up some tips on writing techniques and style. I especially like the generous use of examples to illustrate the different points made in the descriptions of best practices. The introduction provides a framework for the rest of the book and boils down to these key concepts:

Easy to use, easy to understand, and easy to find: it sounds simple. Actually, much of the discussion of documentation best practices follows common sense rules. Who could argue for unclear, inaccurate, hard-to-search documentation? On the other hand, I did find that the examples helped me to understand where problems can creep into my documentation.

Overall, I highly recommend this book for technical writers and others who write technical documentation.

– Dan

Coursera Writing in the Sciences

I guess I’m a glutton for punishment. I recently completed the Writing in the Sciences course, a free Stanford course offered through Coursera.org. Coursera is an organization co-founded by Professors Andrew Ng and Daphne Koller at Stanford, with a mission to offer online, university-level courses for no charge. Coursera courses comprise a set of video lectures, computer-gradable quizzes and homework assignments in an 8- to 10-week format. At the end of the course, you receive a certificate of completion, but no Stanford credit. Coursera is a fantastic resource for anyone wishing to broaden their skills in any number of areas.

For a technical writer focusing on software and computer-related content, the focus on health and science writing in this course might not appear compelling. From the website for the course, “This course trains scientists to become more effective, efficient, and confident writers … Kristin Sainani (née Cobb) is a clinical assistant professor at Stanford University and also a health and science writer.” Well, I’m not a scientist, and I’m not involved with health and science writing, so why do I need this course?

As it turns out, the first four weeks of the class are broadly applicable to anyone who is authoring technical content for an audience. Professor Sainani’s lectures on editing were particularly superb; she is a brutal editor and encourages her students to really dive into the material, search for the essential meaning, and cut the cruft from their writing.

For technical writers of software or hardware products, the concerns Kristin mentions in science writing strongly overlap with your care-abouts. Overuse of acronyms, wordiness, lack of clarity, etc. are exactly the same writing issues we struggle with. The class finished in November 2012; keep watching the Coursera site for information on a future session.

— Dan

Certified

When I first started Technical Writing, my boss asked me to take a grammar class at University of California Santa Cruz (UCSC) Extension. I enjoyed the class, and decided to pursue the full certificate. Ten classes and two years later, I completed the entire program and received my certificate.

You can read more about the certificate program at http://www.ucsc-extension.edu/programs/technical-writing. To receive a certificate, you must complete 7 required and 3 elective classes. I chose the following classes:

  • Information Architecture
  • Grammar and Style for Technical Communicators
  • Technical Communication: An Introduction to the Profession
  • Technical Writers’ Workshop
  • Writing Successful Instructions, Procedures and Policies
  • Developing Technical Information from Plan to Completion
  • Minimalist Design for Documentation
  • Graphic Design Fundamentals
  • Content Management
  • DITA Authoring, Introduction
  • Final Project

Was it worth it? Some classes were certainly more informative and interesting than others. There was some repetition of material, but the feedback from the instructors and other students really helped me to hone my writing. It was also helpful to gain exposure to newer topics in technical writing, such as DITA. Overall, I’m glad I invested the time; it was definitely worth the effort.

It helped that my company was paying the tuition of around $600 per class. Surprisingly, I’m one of the rare exceptions in our group at work to take advantage of this opportunity. If you’re considering the investment in a Technical Writing certificate, you might also want to check out a recent discussion on the LinkedIn Software User Assistance forum.

http://www.linkedin.com/groups/Technical-Writing-Certificates-Are-they-1276817.S.145103688

Thanks —- Dan