Are Brunettes with Roots Mean?

By Jessy

We tend to look down on rule-based learning systems because we don’t expect them to perform very well in general situations. Just because all the people I’ve met in the past with brown hair were mean doesn’t mean all future ones will be mean. A rule-based system would naively conclude this, whereas a statistical system would say it’s true with a certain probability.
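
To make the contrast concrete, here is a minimal sketch in Python. The observations are made up for illustration, the function names are hypothetical, and the Laplace (add-one) smoothing is just one simple way a statistical system might hedge its estimate; it is an assumption, not something the argument depends on.

```python
from fractions import Fraction

# Made-up observations for illustration: (hair_colour, was_mean)
observations = [
    ("brown", True),
    ("brown", True),
    ("brown", True),
    ("blonde", False),
]

def rule_based_is_mean(hair, data):
    """Naive rule: if everyone seen so far with this hair colour was mean,
    conclude that everyone with this hair colour is mean."""
    seen = [was_mean for colour, was_mean in data if colour == hair]
    return bool(seen) and all(seen)

def estimated_probability_mean(hair, data, alpha=1):
    """Laplace-smoothed estimate of P(mean | hair colour).
    alpha is the pseudo-count added to each outcome (an assumed choice)."""
    seen = [was_mean for colour, was_mean in data if colour == hair]
    return Fraction(sum(seen) + alpha, len(seen) + 2 * alpha)
```

With these three brown-haired observations, `rule_based_is_mean("brown", observations)` returns `True` outright, while `estimated_probability_mean("brown", observations)` returns 4/5, which leaves room for the next brunette to be perfectly nice.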

But when we clean a data set for use with a machine learning algorithm, the cleaning process itself is often extremely rule-based. In the hair example, we might have to decide what to do with people whose natural hair colour is blonde but who subsequently dyed their hair brown (hey, it’s possible!), or (gasp) whose roots are now growing out. Why? Because data is always messy. In the end, we make a decision, a rule, that our algorithms inherit.

For example, in email corpora, we have to identify the best way to tokenize the email bodies based on the algorithms we intend to apply, and in my particular case we would also like to identify and remove, where possible, email footers, forwarded content, and the like. While we’d like to think this is done with some greater intuition about the data in mind, for the most part it happens iteratively by examining the way the algorithms treat the data.
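
A rough sketch of what such cleaning rules can look like in Python follows. Every pattern here is an assumption chosen for illustration (the marker strings, the treatment of `>` lines, the token definition); it is not the pipeline I actually use, and a real corpus would need its own rules.

```python
import re

# Hypothetical marker patterns; real corpora will need different ones.
FORWARD_MARKER = re.compile(
    r"^-{2,}\s*(Original Message|Forwarded message)\s*-{2,}\s*$",
    re.IGNORECASE,
)
SIGNATURE_MARKER = re.compile(r"^--\s*$")  # conventional "-- " signature line
TOKEN = re.compile(r"[a-z0-9']+")          # assumed token definition

def clean_body(body):
    """Apply hand-written rules to keep only the author's own text."""
    kept = []
    for line in body.splitlines():
        # Rule 1: everything after a forward or signature marker is dropped.
        if FORWARD_MARKER.match(line) or SIGNATURE_MARKER.match(line):
            break
        # Rule 2: quoted reply lines are dropped.
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())

sample = (
    "Hi Sam,\n"
    "Let's meet Friday.\n"
    "> old quoted line\n"
    "-- \n"
    "Jessy\n"
    "Sent from my phone"
)
tokens = tokenize(clean_body(sample))
```

Each `if` in `clean_body` is a decision the downstream algorithm never gets to question: whether a quoted line is signal or noise has already been settled before any learning happens.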

Perhaps this is why so few research papers applying machine learning algorithms are further developed. Perhaps the point is that this complexity is a sign the data is being over-processed.

Is there a sweet spot right in the middle?

It seems like the sweet spot would be if our rule-based approaches converged on some stable, consistent truths about the data sets being processed. But my feeling is that this won’t happen, because the cleaning process is a function of our implementations, and those are always changing.

The reason rule-based approaches don’t work in most machine learning algorithms is that the diversity of cases a machine needs to be able to handle dwarfs any finite set of rules we could derive and teach it.

So if this is the case for the data sets we give to those algorithms, then how exactly do we expect our machine intelligence to scale?

This seems to suggest that we should keep data cleaning simple, with as few meaningful decisions as possible. Why? Because if you’re making those decisions, your classifier is inheriting implicit rules that it would otherwise learn statistically. Essentially, anything sufficiently complex is not a pre-processing question at all, but rather a machine learning question.

And those brunettes with roots? We’ll just have to see what the classifier says about them.

One Comment

  1. Ben McGraw commented on February 26, 2009

    After you figure out the roots question, how about a study of the correlation between short natural redheads and loud/bossy traits?
