Where to Draw the Line on ‘Data Science’?

I completely agree with Tim O’Reilly. Mike Loukides’ post on what is data science is a “seminal, important post.” If it has managed to avoid your gaze over the past twenty-four hours, I highly recommend it; if nothing else, it is a 2,000-word massage of the data geek ego, and a nifty tool and who’s who reference to boot. As the latest in a recent series of blog posts and magazine and newspaper articles on the rise of the data scientist, Loukides’ piece draws broad strokes on this emerging discipline, covering everything from where the data comes from, to how to manage it, to who is doing great work (kudos for getting quotes from so many excellent members of the data community).

While I think it is important to write about and discuss this field, I think it is equally important that we, the data science community, do not fall into a perpetual cycle of self-admiration and navel gazing. That is, when asking the question, “what is data science,” we should also be asking, “what is not data science?” Or, perhaps more appropriately, “What is good data science, and how do I become a good data scientist?” These questions have not been the focus of the discussion thus far, and it is time to start asking them.

Up to this point the discussion of what is data science has been rather inclusive. As Loukides notes:

In the last few years, there has been an explosion in the amount of data that’s available. Whether we’re talking about web server logs, tweet streams, online transaction records, “citizen science,” data from sensors, government data, or some other source, the problem isn’t finding data, it’s figuring out what to do with it.

After reading the Loukides piece in the context of what has already been said, I was struck by what appears to be a gradual blurring between what is science and what is now being promoted as data science. As an example, consider the recent adjustment of the estimated amount of oil spilled into the Gulf. Using the live video feeds of the spill and satellite imagery, FSU oceanographer Ian R. MacDonald performed “rough calculations” to find that the actual amount of oil being spilled may have been four or five times what the government had estimated. Now, were the calculations performed by Dr. MacDonald data science, or just science? His data came from the streaming ethers of the Internet pointed out by Loukides and others, the external spring from which data science flows, but his primary tools were his own eyes and decades of experience.

Before you accuse me of pedantic folly, my purpose with the MacDonald example is to highlight the fact that good data science is exactly the same as good science. The most meaningful analyses will be born from a thorough understanding of the data’s context, and an acute sense of which questions are the most important to ask. The conversation up to this point, unfortunately, has been far too focused on the data resources themselves and the tools used to approach them. Good data science will never be measured by the terabytes in your Cassandra database, the number of EC2 nodes your jobs are using, or the volume of mappers you can send through a Hadoop instance. Having a lot of data does not license you to have a lot to say about it.

To that end, I have been disappointed by how rarely anyone mentions how critical the social sciences are to good data science. Loukides quotes LinkedIn’s Chief Scientist DJ Patil in reference to who makes the best data scientist:

…the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data.

While I have the utmost respect for physicists, like Patil, their discipline is unencumbered by such pesky matters as human free will and fallibility. I happen to know that DJ respects and understands the difference, because I have had the great pleasure of discussing these issues with him, but imagine how much more difficult the so-called hard sciences would be if atoms got to decide their own charge. As data science is fundamentally about gleaning information from the data trail of humans, those with perspective on causality in this context are invaluable. While those with large data stores may be eager to run a regression over some set of variables, a good data scientist would first wonder what underlying process generated those observations, what is missing, and how that affects the interpretation of the results.

My assessment of the current state of data science is best described as cautious optimism. The tools needed to capture the data deluge (as Chris Anderson puts it) have developed at a truly astonishing rate. And though I think those leading the data science charge are brilliant and preeminently capable of continuing its surge, I fear our intuitions about what the data mean have not kept pace, and it may be sooner rather than later that our analyses suffer for it.

Photo: Boston Globe


6 comments to Where to Draw the Line on ‘Data Science’?

  • I am an ecologist working in the Long-Term Ecological Research Network, a firehose of data on the natural world that is only just being brought together under one roof. I consider myself a self-taught data scientist (e.g. a dash of computer science here, a stack of stats textbooks and R coding there), but my grounding is as a natural scientist. It’s been fascinating to watch more and more data mining efforts get published by those interested in tools and foregone conclusions rather than in applying careful knowledge of natural history to a dataset of interest. I think you raise an excellent point about the need for disciplinary application in data science. The tools developed by data science are still just tools. They are not answers. Proper unbiased interpretation requires knowledge of the tricks, missteps, context, obvious fallacies, and wide range of correlated relationships in one’s system of interest. Heck, I’ll admit to losing sight of the system for the awesomeness of the data myself, until less data-science-enamored collaborators or even reviewers raise a hairy eyebrow. It’s a fascinating conundrum. I guess my greatest hope is that the next generation of scientists (not just physicists or statisticians) is growing up in a data-rich environment. To them, the need to confront and properly interrogate the mass of data out there will be far less intimidating. Second nature, even. I even see it in the current generation, and find it heartening.

    Drew Conway Reply:

    Agreed. Too often we have tool-driven research, which is atheoretical and, by design, biased in some way.

  • Bruno

    While I have the utmost respect for physicists, like Patil, their discipline is unencumbered by such pesky matters as human free will and fallibility. Imagine how much more difficult the so-called hard sciences would be if atoms got to decide their own charge?

    There’s actually been a lot of activity in this direction over the last 5 years or so within the Physics community. There’s an area within Statistical Physics called “Human Dynamics” that tries to quantitatively tackle the difficult questions posed by human behavior. Take a look at http://arxiv.org/find/all/1/ti:+EXACT+Human_Dynamics/0/1/0/all/0/1 and other papers by the same authors to get an idea of some of the things that have been done recently.

    Drew Conway Reply:

    Thank you for the link, I will check out some of the papers. I am familiar with some of this work, and unfortunately I find that in many cases the research is simply physicists attempting to reinvent and rebrand old social science findings as their own.

    Bruno Reply:

    Some level of duplication of previous results is inevitable whenever you have crossovers from one discipline to another. This isn’t necessarily malicious (as in one field purposefully trying to rebrand results from another as its own), but simply a direct consequence of the lack of communication between the two disciplines.

    We (the physics community) should definitely do a better job of keeping up with the body of knowledge accumulated in areas we are interested in working on, but it is not always easy to navigate a new body of literature with its own conventions, language and “well known results”.

    Although physicists are notorious for dismissing other disciplines (as the old joke goes, “It’s not Bio/Socio/Econo/etc-physics, it’s Biology/Sociology/Economics/etc done right”), social scientists are also not free of “blame” on this count, with sometimes not much attention being paid to whatever quantitative insights come from other disciplines.

    Better communication and collaboration between the two areas would benefit everyone and result in much quicker progress towards the common goal of understanding, modeling and quantifying human behavior. After all, one of Auguste Comte’s first names for “Sociology” was “Social-Physics” ;-) .

    Drew Conway Reply:

    Agreed, and I think both sides are negligent, though clearly my bias is in one direction.

    Perhaps the best evidence of the “borrowing” of social science into the hard sciences was this small paper in PLoS ONE last year: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0004803

    …complete with nifty network map!
