One of my favourite things about working with large data sets is the ‘getting to know you’ phase. This is the casual, exploratory part where you start to understand this new resource.
First I always look at it raw. What does it look like, what’s the data format (tab- or comma-delimited, json, xml, binary…), what are the field separators, how are strings handled, which of the many, many date formats available is used here, what strange formatting sticks out as a potential problem, do any fields look wrong, etc. Then when I build my initial analysis scripts, I can purposefully explore those areas, and understand how it breaks.
Then I tend to do some basic counting analysis. How many records are there? grep(1) comes in really handy here: for flat files, there’s a lot you can do with grep, sort, and uniq to get an overall feel for your dataset. Here’s a really simple example:
$ cat Sent\ Mail | grep 'From ' | wc
    1880   13330   61463
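in mbox format, the regex ‘^From ’ (with a space after the word From) is THE delimiter between messages. the grep above isn’t anchored to the start of the line, so it also counts any other line that happens to contain ‘From ’, which means it can only overcount. when i later ran my script, it only recognized 1844 messages, which was pretty close, and gave me confidence in the numbers i was seeing.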
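a minimal python 3 sketch of the same sanity check, in case it helps to see the three counts side by side. the function names are made up for illustration, and the parsed count uses the standard library’s mailbox module, which may or may not be what the script mentioned above did.

```python
import mailbox


def count_delimiter_lines(path):
    """Count lines that START with 'From ', like grep -c '^From '."""
    with open(path, 'rb') as f:
        return sum(1 for line in f if line.startswith(b'From '))


def count_lines_containing(path, needle=b'From '):
    """Count lines that contain the needle anywhere, like an unanchored grep."""
    with open(path, 'rb') as f:
        return sum(1 for line in f if needle in line)


def count_parsed_messages(path):
    """Count messages as the standard library's mbox parser sees them."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

calling all three on the same file (for example 'Sent Mail') and comparing the numbers shows how much of any gap comes from the missing anchor and how much from the parser.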
i also find it helps to look at the variance in the values of a given field. are they all really similar? if it’s a numeric or date field, are there upper or lower bounds on the values?
$ cat Sent\ Mail | grep 'Content-Type:' | sort | uniq -c
will find all the lines where the ‘Content-Type’ email header is defined, then sort them, and then print out each of the unique (well, distinct) values, and a count (-c) of how many times that value occurred. sweet.
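for anyone who would rather stay in python, here is a rough equivalent of that grep | sort | uniq -c pipeline. it is only a sketch: the function names are hypothetical, and like the grep it matches the header text anywhere in a line, not just at the start.

```python
from collections import Counter


def count_matching_lines(lines, needle='Content-Type:'):
    """Tally each distinct line containing `needle`, like grep | sort | uniq -c."""
    return Counter(line.rstrip('\r\n') for line in lines if needle in line)


def sorted_counts(counts):
    """Return (line, count) pairs in sorted line order, as uniq -c would list them."""
    return sorted(counts.items())
```

to try it on a mailbox, open the file with something like open('Sent Mail', errors='replace'), pass the file object to count_matching_lines, and loop over sorted_counts to print each count and value.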
now maybe it’s time to do some proper scripting. depending on what kind of analysis i want to do on the data set, i like to write some scripts that will test the values i am expecting to find in the various fields, and then print out information on those instances that violate my assumptions (and they will ALWAYS violate your assumptions). with my email data set, i told my script to print out the payloads of all the messages in the mailbox. but many of them are multipart MIME messages, and so the message ‘payload’ is actually a list of message parts, each with its own set of headers and payloads.
if msg.is_multipart():
    multipart += 1
    # for multipart messages, get_payload() will return a LIST of
    # the message parts
    if contains_plaintext(msg):
        plaintext += 1
great, but what i didn’t count on in my initial haste was that these can be nested, so i was getting a discrepancy between the total number of messages and those with plaintext components. but it was easy to see where my assumptions had gone wrong by telling it to print out the header fields of all messages that didn’t fit my assumptions.
    else:
        # then what DOES it contain??
        print(' -------------Offender! Multipart message with non plaintext. Printing content types-------------')
        print('From: ', msg['From'])
        print('To: ', msg['To'])
        print('Subject: ', msg['Subject'])
        # list the headers and content type of each top-level part
        for msg_part in msg.get_payload():
            print(msg_part.keys())
            print('Content-Type = ', msg_part['Content-Type'])
        print('\n')
this eventually whittled down to the point that the only things appearing in the ‘else’ section of the printout were html messages with, for whatever reason, no plaintext. wouldn’t have known that otherwise.
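the nesting problem described above is the kind of thing the standard library’s Message.walk() handles, since it visits every part however deeply it is buried. here is a small python 3 sketch. leaf_content_types and has_plaintext are hypothetical names, and this is just one way a check like contains_plaintext could be written, not the original helper.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def leaf_content_types(msg):
    """Content type of every non-container part, at any nesting depth."""
    return [part.get_content_type()
            for part in msg.walk()
            if not part.is_multipart()]


def has_plaintext(msg):
    """True if any part of the message, however nested, is text/plain."""
    return 'text/plain' in leaf_content_types(msg)


if __name__ == '__main__':
    # a multipart/mixed message wrapping a multipart/alternative one
    inner = MIMEMultipart('alternative')
    inner.attach(MIMEText('hi', 'plain'))
    inner.attach(MIMEText('<p>hi</p>', 'html'))
    outer = MIMEMultipart('mixed')
    outer.attach(inner)

    # looking only one level down misses the plaintext part
    print([p.get_content_type() for p in outer.get_payload()])
    # walking the whole tree finds it
    print(leaf_content_types(outer), has_plaintext(outer))
```

a check that only loops over msg.get_payload() sees the inner container and nothing inside it, which is one way to end up with the kind of discrepancy described above.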
anyway, i could go on, and i likely will over the next few months. but the point is that i think this kind of ‘getting to know you’ stage of working with a large data set is important for building a good working relationship with it: understanding what it will like and not like, what it will choke on, and how to handle the problems that eventually arise. it’s also how you find interesting new angles on the data that you never would have thought of otherwise.
oh yeah, and did i mention that it’s hella fun?

One Comment
In visual art we do thumbnails, small quick sketches of the major shapes and values (lights & darks). It’s the same thing—the getting to know you phase, where you realize that you can crop the image, realign the shapes, adjust the values, all quickly without having invested huge amounts of time and materials. That’s where you find out where potential problems are and what part of the whole project means the most to you. Phases of creative work are the same in all kinds of work, and that’s a good thing.
I like your clear description of the process; I always learn something from you.