Basic Excel Driving with Python

Now it’s getting interesting. Reading and writing spreadsheets with XLRD and XLWT is sufficient for many tasks, and you don’t even need a copy of Excel to do it. But to really open up your data and fully wring all the information possible from it, you’ll need Excel and its powerful set of functions, pivot tables and charting.

For starters, let’s do some simple operations using Python to invoke Excel, add a spreadsheet, insert some data, then save the results to a spreadsheet file. You can play along at home by following my lead and entering the program text exactly as I’ve described below. My exercises and screen shots are done with Excel 2007, but all the commands presented here also work fine for Excel 2003. A prerequisite for this exercise is Python, the Win32 module and a copy of Microsoft Excel.

Here is the complete script we’ll be entering using IDLE, the Python interactive development tool.

#
# driving.py
#
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = True
wb = excel.Workbooks.Add()
ws = wb.Worksheets('Sheet1')
ws.Name = 'Built with Python'
ws.Cells(1,1).Value = 'Hello Excel'
print ws.Cells(1,1).Value
for i in range(1,5):
    ws.Cells(2,i).Value = i  # Don't do this
ws.Range(ws.Cells(3,1),ws.Cells(3,4)).Value = [5,6,7,8]
ws.Range("A4:D4").Value = [i for i in range(9,13)]
ws.Cells(5,4).Formula = '=SUM(A2:D4)'
ws.Cells(5,4).Font.Size = 16
ws.Cells(5,4).Font.Bold = True

What follows is a step-by-step guide to entering this script and monitoring the result.

  1. Open the Python IDLE interface from the Start menu

IDLE Startup

IDLE is the Python IDE built with the tkinter GUI toolkit, as quoted from the Python IDLE documentation, and gives you an interactive interface to enter, run and save Python programs. IDLE isn’t strictly necessary for this exercise, you could use any shell command window, or a tool such as IPython or the MS Windows command line interface.

  1. Import the win32 module

importwin32

If the import command was successful, you’ll see the “>>>” prompt returned. If there was a problem, such as not having the win32 module installed correctly, you’ll see Import Error: Nomodule named win32com.client. In that case, install the appropriate win32 module from the web site.

  1. Start Excel

The command win32.gencache.EnsureDispatch('Excel.Application') attaches to an Excel process that is already running, or starts Excel if it’s not. If you see the “>>>” prompt, Excel has been started or linked successfully. At this point you won’t see Excel, but if you check your task manager you can confirm that the process is running.

  1. Make Excel Visible

Setting the Visible flag with excel.Visible = True makes the Excel window appear. At this point, Excel does not contain any workbooks or worksheets, we’ll add those in the next step.

  1. Add a workbook, select the sheet “Sheet1” and rename it

idlewbws

Excel needs a workbook to serve as a container for the worksheets. A new workbook containing 3 sheets is added with command wb = excel.Workbooks.Add(). The command ws =wb.Worksheets('Sheet1') assigns ws to the sheet named Sheet1, and the command ws.Name ='Built with Python' changes the name of Sheet1 to “Built with Python”. Your screen should now look something like this:

excelblank

  1. Add some text into the first cell

idlehello

Now the setup is complete and you can add data to the spreadsheet. There are several options for addressing cells and blocks of data in Excel, I’ll cover a few of them here. You can address individual cells with the Cells(row,column).Value pattern, where row and column are integer values representing the row and column location for the cell. Note that row and column counts begin from one, not zero. Use .Value to add text, numbers and date information to the cell and use .Formula for entering an Excel formula into the cell location.

After typing these commands, you’ll see the “Hello Excel” text in your Excel worksheet, and see the text printed in the IDLE window as well. Of course, Python can set values in the spreadsheet as well as query data from the spreadsheet.

excelhello

  1. Populate the second row with data by using a for loop

idle_for_loop

In many cases you’ll have lists of data to insert into or extract from the worksheet. Wrapping the Cells(row,column).Value pattern with a loop seems like a natural approach, but in reality this maximizes the communication overhead between Python and Excel and results in very inefficient and slow code. It’s much better to transfer lists than individual elements whenever possible as shown in the next section. After this command, your Excel spreadsheet will look like this:

For loop results in excel

  1. Populate the third and fourth rows of data

Range insertion

A better approach to populating or extracting blocks of data is to use the Range().Value  pattern. With this construct you can efficiently transfer a one- or two-dimensional blocks of data. In the first example, cells (3,1) through (3,4) are assigned to the list [5,6,7,8]. The next line uses the Excel-style cell address “A4:D4” to assign the results of the operation [ifor i in range(9,13)]. In some cases, it may be more intuitive to use the Excel-style naming. The Excel sheet now looks like this:

Excel data insertion

  1. Assign a formula to sum the numbers just added

Formula insertion

You can insert Excel formulas into cells using the .Formula pattern. The formula is the same as if you were to enter it in Excel: =SUM(A2:D4). In this example, the sum of 12 numbers in rows 2,3 and 4 is generated. Your Excel sheet should now look like the screenshot below.

Excel Formula

  1. Change the formatting of the formula cell **

Formatting data

As a final exercise, the format of the formula cell is changed to point size 16 with a bold typeface. You can change any of dozens of attributes for the various cells in the worksheet through Python. Your spreadsheet should now look like this.

Formatting result in Excel

Hopefully you did this exercise interactively, typing the commands and monitoring the result in Excel. You can also cut to the chase and run this script to generate the result. When the script exits, you’ll be left with an open Excel spreadsheet just as shown in the last screenshot above.

Prerequisites

Source Files and Scripts

Source for the program and data text file are available at http://github.com/pythonexcels/examples/tree/master

References

Core Python Programming

Wesley Chun’s book has a chapter on Programming Microsoft Office with Win32 COM

http://groups.google.com/group/python-excel

Though this group mainly covers questions on the excellent XLRD, XLWT and XLUTILS modules, there is also some discussion on interfacing to Excel using Win32 COM

Stack Overflow

Stack Overflow is a great resource for getting questions answered on a variety of programming topics, including Python

Thanks everyone — Dan

Another XLWT Example

After completing the last example, I wanted to try something a little more interesting with XLWT this time around. This article describes a short script that uses Python and XLWT to download some raw data from the web, parse it, and write a spreadsheet with a new column derived from the data. The data for this example comes from research done by David Harrison and Daniel L. Rubinfeld in “Hedonic Housing Prices and the Demand for Clean Air”, published in the Journal of Environmental Economics and Management, Volume 5, (1978), and contains information on location, pricing, tax and other information from the Boston housing market. I’ll be illustrating three things in this script:

  • Downloading data directly from the web
  • Parsing the data, removing extraneous information at the top of the file and writing the real data fields to a spreadsheet
  • Adding a hyperlink for each record that links to a Google Map, based on the latitude and longitude data given

Note that it’s possible to do these steps using a web browser and an interactive Excel session. You can easily download the file, import it into Excel, remove the leading text, and make a formula to produce a hyperlink. But the beauty of this script is that everything is done automatically, which can be very handy if the source data is constantly updated.

Here’s the script that performs these operations:

#
# xlwt_bostonhousing.py
#
import sys
from urllib2 import urlopen
from xlwt import Workbook, easyxf, Formula

def doxl():
    '''Read the boston_corrected.txt file based on
       Harrison, David, and Daniel L. Rubinfeld, "Hedonic Housing Prices
       and the Demand for Clean Air," Journal of Environmental Economics
       and Management, Volume 5, (1978), write to an excel spreadsheet .  
       '''
    #URL = 'http://stat.cmu.edu/datasets/boston_corrected.txt'
    URL = 'https://raw.github.com/pythonexcels/xlwt/master/boston_corrected.txt'
    try:
        fp = urlopen(URL)
    except:
        print 'Failed to download %s' % URL
        sys.exit(1)
    lines = fp.readlines()

    wb = Workbook()
    ws = wb.add_sheet('Housing Data')
    ulstyle = easyxf('font: underline single')
    r = 0
    for line in lines:
        tokens = line.strip().split('\t')
        if len(tokens) != 21:
            continue
        for c,t in enumerate(tokens):
            for dtype in (int,float):
                try:
                    t = dtype(t)
                except:
                    pass
                else:
                    break 
            ws.write(r,c+1,t)
        if r == 0:
            hdr = tokens
            ws.write(r,0,'MAPLINK')
        else:
            d = dict(zip(hdr,tokens))
            link = 'HYPERLINK("http://maps.google.com/maps?q=%s'+\
                   ',+%s+(Observation+%s)&hl=en&ie=UTF8&z=14&'+\
                   'iwloc=A";"MAP")'
            link = link % (d['LAT'],d['LON'],d['OBS.'])
            ws.write(r,0,Formula(link),ulstyle)

        r += 1
    wb.save('bostonhousing.xls')
    print 'Wrote bostonhousing.xls'

if __name__ == "__main__":
    doxl()

As in the previous post, you must have xlwt installed (refer to the site http://www.python-excel.org for information on downloading and installing). Looking at the important bits in the script above, the following lines

#URL = 'http://stat.cmu.edu/datasets/boston_corrected.txt'
URL = 'https://raw.github.com/pythonexcels/xlwt/master/boston_corrected.txt'
try:
    fp = urlopen(URL)
except:
    print 'Failed to download %s' % URL
    sys.exit(1)
lines = fp.readlines()

open the URL for the boston_corrected.txt file, then reads the URL and returns a list of strings (update 9/15/12: the original link is now broken, I’ve updated the script to pull the data from a copy on github). The next section:

wb = Workbook()
ws = wb.add_sheet('Housing Data')
ulstyle = easyxf('font: underline single')
r = 0

creates a new Workbook object, then adds a sheet named “Housing Data” to the workbook. The easyfx function provides a convenient way to add formatting to the spreadsheet; in this example, the single underline format is used to denote a hyperlink. In the next line, the variable r acts as a row counter.

The for loop below examines each row of data:

for line in lines:
    tokens = line.strip().split('\\t')
    if len(tokens) != 21:
        continue
    for c,t in enumerate(tokens):
        for dtype in (int,float):
            try:
                t = dtype(t)
            except:
                pass
            else:
                break
        ws.write(r,c+1,t)

Each line is “stripped” (leading and trailing white space characters are removed), then split by tab characters. A data line contains 21 fields of information, otherwise it is rejected. To properly format the data for the spreadsheet, the datatype is set using try-except-else within the for* loop. The loop only considers string, integer and float data, which is sufficient for this input data. More complex input files may contain date information which would require additional handling. The cell data with the correct type setting is written to the spreadsheet using the ws.write statement.

The next section builds the hyperlink to a Google Map using the latitude and longitude information within the input data.

    if r == 0:
        hdr = tokens
        ws.write(r,0,'MAPLINK')
    else:
        d = dict(zip(hdr,tokens))
        link = 'HYPERLINK("http://maps.google.com/maps?q=%s'+\
               ',+%s+(Observation+%s)&hl=en&ie=UTF8&z=14&'+\
               'iwloc=A";"MAP")'
        link = link % (d['LAT'],d['LON'],d['OBS.'])
        ws.write(r,0,Formula(link),ulstyle)

    r += 1
wb.save('bostonhousing.xls')
print 'Wrote bostonhousing.xls'

If this is the first row of data (if r == 0), it is assumed to be header data and is saved in the hdr variable. Otherwise, the statement d = dict(zip(hdr,tokens)) builds a dictionary, using the header information as keys. This allows each field to be referenced by its column heading. The hyperlink is built by specifying a URL containing http://maps.google.com/maps?, with the corresponding latitude (d['LAT']) and longitude (d['LON']) information from the current line of data. (Forgive the link formatting, my original one line of code is split across four lines to reduce the line width.) Finally, the hyperlink data is written to the spreadsheet with ws.write. The last two lines write the spreadsheet and print a message.

The original raw data looks like this:

rawdata

The bostonhousing.xls spreadsheet containing the original data and the new map hyperlink looks like this:

bostonhousing

Clicking on the MAP link brings up a Google Map showing the location according to the latitude and longitude.

googlemap

For some reason, the location of latitude 42.255000 and longitude -70.955000 isn’t in the town of Nahant, but in the middle of Rock Island Cove near Quincy. It’s left as an exercise for the reader to determine why this is so.

Prerequisites

Python (refer to http://www.python.org)

xlwt (refer to http://www.python-excel.org)

Source Files and Scripts

Source for the program and data text file are available athttp://github.com/pythonexcels/xlwt/tree/master

References

http://www.python-excel.org

This site contains pointers to the best information available about working with Excel files in the Python programming language.

http://groups.google.com/group/python-excel

Google group for questions on xlrd, xlwt, xlutils and general questions on interfacing to Excel with Python

Thanks everyone — Dan

Using XLWT to Write Spreadsheets Without Excel

Long before I even started thinking about writing a blog covering the integration possibilities between Excel and Python, the boys at Simplistix (http://www.simplistix.co.uk) were crafting some of the handiest Python modules around: xlrd, xlwt and xlutils. These fantastic modules enable Python to directly read and write an Excel .xls file, without the need for Microsoft Excel. Not only can you insert numbers and text, but xlwt supports a multitude of formatting options. There’s even a great set of documentation, and a Google group monitored by the maintainers. You can find out more about these modules athttp://www.python-excel.org

I’ve written a fair number of Python scripts that interface with Excel through the COM interface, but for many tasks you don’t need to invoke Excel to operate on spreadsheet data. Although some advanced operations like charting and pivot tables are not currently supported, these utilities provide an excellent way of interfacing with Excel spreadsheets for many common jobs.

In my example below, I format a raw text file containing information on reference account categories for a hospital (based on data from data.gov):

09700RESEARCH
09800PHYSICIANS  PRIVATE OFFICES
09900NONPAID WORKERS
MANAGEMENT FEES
REFERENCE LABS

The raw data is smashed together, with the account number and name running into each other without separation. Some of the data doesn’t even have an account number at all. The script accomplishes the following:

  • Separate the account number and name, and write them into two separate columns in the spreadsheet
  • Assign an account number of 99999 for unnumbered accounts, and set the cell color to red
  • Convert the upper cased account name to a properly capitalized name, remove any extra spaces
  • Format the column width of each spreadsheet column based on the width of the data

Based on the input above, the goal is to produce Excel output like this:

Excel output

Since there are no charts, pivot tables or other fancy formatting needed, xlwt is perfect for the job. Here is the script that performs the conversion:

import sys
import re
from xlwt import Workbook, easyxf

def doxl():
    '''Read raw account number and name strings, separate the data and
       write to an excel spreadsheet.  Properly capitalize the account
       names and mark cells with no account number as 99999 with red fill
       '''
    try:
        fp = open("hospdata.txt")
    except:
        print 'Failed to open hospdata.txt'
        sys.exit(1)
    lines = fp.readlines()

    nameandnum = re.compile(r'(\d+)\s*(.*)\s*')
    wb = Workbook()
    wsraw = wb.add_sheet('Raw Data')
    ws = wb.add_sheet('Account List')
    ws.write(0,0,'Account Number')
    ws.write(0,1,'Account Name')
    ws.col(0).width = len('Account Number') * 256
    ws.col(1).width = max([len(l) for l in lines]) * 256
    r = 1

    for line in lines:
        wsraw.write(r,0,line.strip())
        m = nameandnum.match(line)
        if m:
            ws.write(r,0,int(m.group(1)))
            ws.write(r,1,' '.join([w.capitalize() for w in m.group(2).split()]))
        else:
            ws.write(r,0,99999,easyxf('pattern: pattern solid, fore_colour red;'))
            ws.write(r,1,' '.join([w.capitalize() for w in line.split()]))
        r += 1
    wb.save('accounts.xls')
    print 'Wrote accounts.xls'

if __name__ == "__main__":
    doxl()

Before running this example, you must have xlwt installed (please refer to the sitehttp://www.python-excel.org for information on downloading and installing). Let’s look at the important bits in the script above.

The following lines:

Lines 12-17:

try:
   fp = open("hospdata.txt")
except:
   print 'Failed to open hospdata.txt'
   sys.exit(1)
lines = fp.readlines()

attempt to open the file hospdata.txt. If successful, the file contents are read into “lines” as a list of strings. Next, line 19

nameandnum = re.compile(r'(\d+)\s*(.*)\s*')

establishes a compiled regular expression used to parse each line and separate the account number and account name. It’s assumed that the account numbers can only contain digits, and the account names cannot begin with a digit. The compiled regular expression is used later to attempt a match on each line in the input file. Lines 20-26:

wb = Workbook()
wsraw = wb.add_sheet('Raw Data')
ws = wb.add_sheet('Account List')
ws.write(0,0,'Account Number')
ws.write(0,1,'Account Name')
ws.col(0).width = len('Account Number') * 256
ws.col(1).width = max([len(l) for l in lines]) * 256

create a new workbook with two worksheets, “Raw Data” and “Account List”, which are used to store the unprocessed and processed data. The headings “Account Number” and “Account Name” are written to the first row of the Account List worksheet. To set the column width, the length of the longest string in each column is multiplied by 256 to calculate the setting. Referring to the documentation for xlwt, you’ll see that the column width setting is based on the width of the 0 character for the first font in the list, not the default font, so you may want to fine tune the calculation to make the column width setting slightly wider than the actual data.

The next section iterates over each stored line from the text file:

for line in lines:
    wsraw.write(r,0,line.strip())
    m = nameandnum.match(line)
    if m:
        ws.write(r,0,int(m.group(1)))
        ws.write(r,1,' '.join([w.capitalize() for w in m.group(2).split()]))
    else:
        ws.write(r,0,99999,easyxf('pattern: pattern solid, fore_colour red;'))
        ws.write(r,1,' '.join([w.capitalize() for w in line.split()]))

The raw data is written to the Raw Data sheet, then the line is matched against the regular expression. If the match is successful, the number part is written to the first column, and the text is written to the second column. The following statement in line 34:

' '.join([w.capitalize() for w in m.group(2).split()])

splits account name information into individual words, capitalizes each word, then reassembles the string. If you wanted to insert the account name string as-is, you would simply replace the statement above with the following:

m.group(2)

If no match occurs, then the string may not contain an account number and it is assumed that the whole string contains the account name only. The dummy account number 99999 is inserted, and the cell background is changed to a red color. This is done to flag the text for quick visual scanning.

After all input lines are processed, the accounts.xls file is written and the following message is printed by the following lines:

wb.save('accounts.xls')
print 'Wrote accounts.xls'

The completed accounts.xls spreadsheet looks like this, with account number and name split into two columns and with a reformatted account name string:

excelfile

Prerequisites

Python (refer to http://www.python.org)

xlwt (refer to http://www.python-excel.org)

Source Files and Scripts

Source for the program and data text file are available at http://github.com/pythonexcels/xlwt/tree/master

References

http://www.python-excel.org contains pointers to the best information available aboutworking with Excel files in the Python programming language.

http://groups.google.com/group/python-excel is the Google group for questions on xlrd, xlwt, xlutils and general questions on interfacing to Excel with Python

Thanks everyone — Dan