There are two common interfaces, or APIs, for working with XML. You might ask, why do we even need an API to interface with such a simple format, never mind two?
If you accessed your XML directly, you could build a parser that would recognize tags (“<tagname>blah</tagname>”) and process them sequentially. How would you do this yourself? Well, you’d probably parse each bit of XML and pull out what the tag is, what the attributes of the tag are, and what the contents (data) inside the tag are, if any. And then you’d do something with that information. Well, that’s pretty repetitive work, so an API was constructed to do it for you. This is the SAX XML API. Python provides the xml.sax module for working with the SAX interface.
This sequential approach is appropriate for really big files that you don’t want to read entirely into memory, or for files you might be receiving piece by piece, as you might over a network connection. But it doesn’t give you a good ‘big picture’ view of the document, and it makes it hard to search and manipulate whole files.
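To make the piece-by-piece idea concrete, here is a minimal sketch of feeding a document to the parser in chunks. It assumes make_parser() returns Python’s default expat-based parser, which supports the incremental feed() and close() methods; the TagCounter class, the count_tags() helper, and the sample chunks are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler that counts opening tags as they arrive."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_tags(chunks):
    """Feed an iterable of text chunks to the parser; return the tag count."""
    handler = TagCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)   # parse whatever has arrived so far
    parser.close()           # signal the end of the document
    return handler.count


# The chunks are split at arbitrary points, even in the middle of a tag,
# the way data might arrive over a network connection.
pieces = ["<files><fileobject><file", "name>a.txt</filename>", "</fileobject></files>"]
print(count_tags(pieces))
```

At no point does the whole document need to be in memory; the handler only ever sees one event at a time.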
So, what if you wanted to do it another way? Maybe you would want something that could parse the document by levels, starting with the parent level (the root element), returning children, searching for various elements or data, etc. But, this requires scanning the entire document, reading it into memory, and building a tree-like representation. That’s kind of a lot of work, and since it’s another typical approach to parsing XML files, it was also made into an API. This is what the DOM XML API does. Python provides the xml.dom module for working with the DOM interface.
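For contrast, here is a small sketch of the tree-style approach using xml.dom.minidom, the lightweight DOM implementation that ships with Python. The sample document and the filenames_in() helper are invented for illustration.

```python
from xml.dom.minidom import parseString

SAMPLE = (
    "<files>"
    "<fileobject><filename>a.txt</filename></fileobject>"
    "<fileobject><filename>b.txt</filename></fileobject>"
    "</files>"
)


def filenames_in(xml_text):
    """Build the whole tree in memory, then search it for <filename> elements."""
    doc = parseString(xml_text)
    root = doc.documentElement  # the parent level (root element)
    nodes = root.getElementsByTagName("filename")
    # Each <filename> element has a single text node as its first child here.
    return root.tagName, [node.firstChild.data for node in nodes]


print(filenames_in(SAMPLE))
```

The search is a one-liner because the entire document has already been read and turned into a tree, which is exactly the cost you may not want to pay for a very large file.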
We’ll return to the SAX interface here, because this is what you want for large files and data sets. As a quick preview, here’s roughly what we’re going to do:
from xml.sax import make_parser, ContentHandler

if __name__ == '__main__':
    p = make_parser()
    p.setContentHandler(myXMLHandler())  # your own ContentHandler subclass
    p.parse('/path/to/myfile.xml')
The magic, of course, is in the myXMLHandler class. This is a class you define, and it can override any of the methods defined in the parent class, ContentHandler. The main ones you are likely to work with are the following:
class ContentHandler:
    def startDocument(self): pass  # takes no tag argument
    def endDocument(self): pass    # takes no tag argument
    def startElement(self, tag, attributes): pass
    def endElement(self, tag): pass
    def characters(self, data): pass
These functions are basically callbacks. Each one is triggered once the parser has finished reading the corresponding piece of the document. So, as the parser goes through your document, when it sees the end of an opening tag (marked by the closing ‘>’), it triggers the startElement() function. When it sees the end of a closing tag, it calls the endElement() function. Text between tags is passed to characters(), and so on.
An example content handler class might trigger on each start element, store the current tag name, and process the tag contents based on which tag you’re currently inside. For example, say you are processing a large XML file with a number of <fileobject> elements, and each fileobject has a number of child elements. Here’s a wireframe for what you might do:
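To see the order in which the callbacks fire, a throwaway handler that simply records each event can help. This is only a sketch; the EventLogger class and the log_events() helper are hypothetical names, and the tiny document is invented.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records every callback it receives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)

    def characters(self, content):
        # Text may arrive in several calls, so merge consecutive pieces.
        if self.events and self.events[-1].startswith("text:"):
            self.events[-1] += content
        else:
            self.events.append("text:" + content)


def log_events(xml_bytes):
    """Parse a bytes document and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


for event in log_events(b"<a><b>hi</b></a>"):
    print(event)
```

The events come out in document order: the document start, then each opening tag, the text, each closing tag, and finally the document end.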
class myHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.currentTag = ""
        self.currentData = ""
        self.currentFile = ""
        self.numFiles = 0

    def startElement(self, tag, attributes):
        self.currentTag = tag
        if tag == 'fileobject':
            self.numFiles += 1

    def endElement(self, tag):
        # reset state so text between elements is not attributed to a tag
        self.currentTag = ""
        self.currentData = ""
        if tag == 'fileobject':
            self.currentFile = ""

    def characters(self, data):
        # the parser may deliver one run of text in several calls,
        # so append to the file name rather than overwrite it
        self.currentData = data
        if self.currentTag == 'filename':
            self.currentFile += data
        if self.currentTag == 'contents':
            doSomethingFancy(data)  # you write this

    def startDocument(self):
        pass

    def endDocument(self):
        print('the number of files in the document was: ' + str(self.numFiles))
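If you would rather collect results than act on each piece of text as it arrives, one common variation is to buffer the text in characters() and only use it in endElement(). Here is a rough sketch of that pattern; the FileCollector class, the collect_files() helper, and the sample document (including the assumption that each fileobject holds one filename and one contents element) are invented for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<files>
  <fileobject><filename>a.txt</filename><contents>hello</contents></fileobject>
  <fileobject><filename>b.txt</filename><contents>world</contents></fileobject>
</files>"""


class FileCollector(ContentHandler):
    """Hypothetical handler that gathers one dict per <fileobject>."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.buffer = []    # pieces of text seen since the last tag
        self.current = None  # dict for the fileobject being read
        self.files = []

    def startElement(self, name, attrs):
        self.buffer = []
        if name == "fileobject":
            self.current = {}

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        text = "".join(self.buffer)
        self.buffer = []
        if name in ("filename", "contents"):
            self.current[name] = text
        elif name == "fileobject":
            self.files.append(self.current)


def collect_files(xml_bytes):
    """Parse a bytes document and return the collected file records."""
    handler = FileCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.files


for record in collect_files(SAMPLE):
    print(record["filename"], record["contents"])
```

Joining the buffer at the closing tag means the handler gets the complete text of an element even if the parser delivered it in several chunks.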
You could also write multiple handlers, and register different ones with the parser at different times. For example, say you were parsing an HTML document and you wanted to use one handler for the headers and one for the body. From within the endElement() function, you could check whether the element being closed is the one that ends the header section, and if so, call p.setContentHandler(bodyContentHandler).
There’s lots more you could (and are likely to) do, but hopefully this gives a sense of how you would build up a more sophisticated handler from here.
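Here is a rough sketch of that handler swap. It assumes the page is well-formed (XHTML-style) so that a SAX parser can read it, that the header section is a head element, and that the default expat-based parser is in use, which picks up a handler registered in the middle of a parse. The HeadHandler and BodyHandler classes and the parse_page() helper are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BodyHandler(ContentHandler):
    """Hypothetical handler for everything after the header section."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


class HeadHandler(ContentHandler):
    """Hypothetical handler that hands over to another one after </head>."""

    def __init__(self, parser, next_handler):
        ContentHandler.__init__(self)
        self.parser = parser
        self.next_handler = next_handler
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

    def endElement(self, name):
        if name == "head":
            # From here on, events go to the body handler instead.
            self.parser.setContentHandler(self.next_handler)


def parse_page(data):
    """Return the tags seen by the head handler and by the body handler."""
    parser = xml.sax.make_parser()
    body = BodyHandler()
    head = HeadHandler(parser, body)
    parser.setContentHandler(head)
    parser.feed(data)
    parser.close()
    return head.tags, body.tags


PAGE = "<html><head><title>t</title></head><body><p>x</p></body></html>"
head_tags, body_tags = parse_page(PAGE)
print(head_tags)
print(body_tags)
```

Note that the second handler never receives startDocument(), since the document had already started when it was registered, so any setup it needs belongs in its constructor.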

One Comment
SAX and DOM are relatively old technologies; StAX and VTD-XML are more advanced options.