I have to admit I’ve been extremely disappointed with the academic papers I’ve been reading while doing research for my thesis. These are computer science papers that are, for the most part, showing new machine learning, forensics, or natural language processing techniques. I’m amazed and appalled at the amount of research that has been done that will sit forever on the dusty platters of hard drives and only, if ever, be read for academic purposes.
It’s not that these papers aren’t good quality; it’s that they sit in this weird kind of science purgatory where they’ve successfully demonstrated a significant result, but further development of that work would no longer qualify as innovative enough to warrant academic funding, and commercial applications would tend to take it out of the public domain. Good results in one paper do not imply there’s no need for further discussion, improvements, or implementations. But it’s difficult, not to mention unexciting, to publish a paper saying, “I re-did experiment X, and it still worked!”
I think there are actually a number of factors coming into play here. All the fields I mentioned above are relatively young, and there is still a huge amount of breadth to be covered. Breadth means pushing back boundaries, making discoveries, and getting your name out there. Consequently, there’s little incentive in these fields right now to tackle depth.
The result is that most of this science is, tragically, disposable. For example, there are hundreds of papers out there on “string similarity algorithms”, with perhaps two that I found which actually compare different measures (certainly this was a non-exhaustive search, but I think it is roughly representative). Even in these, there is no attempt to work the measures into frameworks with metrics, which others could subsequently build upon. Further, in the vast majority of cases, neither data nor code is published; most of the time, you’re lucky if details about the data pre-processing are included at all.
But these are the things that make science, well, science. We should remember that research is only one element of the scientific method. To truly qualify as science, we need to make sure our work is not just innovative, but is also verifiable, repeatable, and extensible. In many computer science papers, if you propose and demonstrate an innovative algorithm, then arguably, publishing the details of your data processing (in computer science, this essentially equates to your assumptions) and your code is just as important as, if not more important than, the actual reporting of your results. It’s the classic give a man a fish/teach a man to fish dichotomy.
The open source community, with projects like Apache Mahout, has been a real inspiration lately, as I begin to see more and more what a crucial and unfilled role it could play in taking the results of scientific research and making them operational. But I think we still need better frameworks for data set (interface) standardization, more and better README-like documentation of the processing applied to raw data sets, and overall adherence to the tenets of the scientific method.
So how can we channel the truly innovative research that is happening in all these fields into a more coherent, streamlined process for identifying, iterating on, and refining new ideas and algorithms? How can we take them out of this science purgatory, out of the academic garbage bin of financially or linguistically inaccessible journals? In true computer geek fashion, I have several ideas about how the infrastructure of the Internet itself could play a role in addressing these shortcomings. I’ll get back to those in part II of this post.

2 Comments
Thanks, I enjoyed that. I’ve never been able to articulate that problem, but it’s come to mind, especially during paper reviews.
Jessy, really good thoughts on open science! You make a very good point that “to truly qualify as science, we need to make sure our work is not just innovative, but is also verifiable, repeatable, and extensible.” I’d really like to find some time to chat with you about the open-science / research platform we’re trying to build at JSC. I met a few great resources (all from ARC of course) at the PMC, so I’ll follow up with them too (legal, dashlink, etc).