Wikileaks Afghanistan Data

By now, you have most certainly read about the publication of a massive number (72,000+) of classified documents related to coalition operations in Afghanistan by the whistleblower group Wikileaks. The data are available in several formats at the dedicated Wikileaks site.

Before proceeding, I want to point out that given the nature by which this information was obtained and subsequently disseminated I am unclear as to the legal protections provided to those in possession of the data (i.e., retaining copies on their hard drives), or performing analysis (i.e., citing data in research). As such, I am not recommending or condoning anyone download the data until these questions are explicitly addressed.

I, however, have downloaded the data and begun examining it at a high level. I believe such an examination is critical for two reasons. First, this is the first time in history that the public has been given such a granular view of the day-to-day operation of contemporary warfare. With the proper analytical tools, this data may reveal insights into the predicates of conflict in ways that previous aggregate-level data could not. Second, because the data may have gone through some degree of filtering or selection by Wikileaks, a careful analysis of the data may provide insight into the nature of that selection and the process by which it occurred.

After the jump is an initial overall descriptive visualization of the data as it was provided by Wikileaks, with some brief interpretations. Over the next several days and weeks, I hope to examine the data in more detail and periodically present the results.


Local R User Group Panel from useR! 2010 (Video)

As I mentioned last week, I will be hosting videos of several of the keynote speakers from this year’s useR! 2010 conference at the video Rchive. As it happens, the first video I was able to upload was the panel discussion we held on starting local R user groups. I have uploaded the video, which is also embedded below (after the jump).

I was joined on the panel by an illustrious assembly of R community members, which included:

Continue reading Local R User Group Panel from useR! 2010 (Video)

useR! 2010 Videos to be Hosted at Rchive

Today, I am packing up the car and heading south to my old home, Washington, DC, for the useR! 2010 conference, which is being held at the National Institute of Standards and Technology. Incidentally, that is where I was an intern in the Information Technology Lab during college.

If you are not able to make the trip to Gaithersburg, MD, fear not: through the hard work of Szilard Pafka (organizer of the LA R users group) and Katherine Mullen (coordinator of useR!), I will be hosting many of the conference’s keynotes, lectures and several of the panel discussions at the Video Rchive. It may be several days before all of the videos are uploaded, but be sure to check back at the Rchive next week for any updates.

If you are attending useR!, do try to make it to our panel discussion on starting a local R users group in your area (Thursday, 3:25pm in the Red Room). The panel includes several prominent members of the R community, and should be a very entertaining and informative discussion.

Hope to see you there!

Anatomy of a Life-Milestone Announcement on Facebook

As I have mentioned, I recently returned from a lovely trip to Europe. While on vacation my brilliant, beautiful, funny, and all around perfect girlfriend accepted my invitation to be my wife.

Pause for shared overwhelming feeling of joy…

While I am still basking in the glow of being the luckiest man on Earth, as a true data geek I could not let this opportunity to analyze a novel data set escape me.

One of the most fascinating aspects of social media is how it has changed the way life-milestones, like getting engaged, are announced. Facebook’s ‘Relationship Status’ feature allows users to inform all of their friends at once about these large life changes. Such announcements are often met with a sudden deluge of comments and wall postings, so I thought: wouldn’t it be interesting to collect this data and analyze the frequency of decay of these postings?

Though I am not on Facebook, my fiancée is, and with a little help from Facebook’s API and R’s ggplot2 library I was able to collect and analyze this data. Below I present (with permission) the data from my fiancée’s wall for the first 48 hours after she changed her Relationship Status from ‘In a Relationship’ to ‘Engaged’.

Interesting. A huge spike in the first hour, a drop and flattening over the next two hours, and finally another large drop with sporadic spikes. Women dominate the initial posts, while the gender difference vanishes as posting frequency decreases and more latecomers post.
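For anyone curious how a series like this might be tabulated, here is a minimal Python sketch of binning wall posts into hourly counts by gender. It is only an illustration under stated assumptions: the original analysis used Facebook's API and R's ggplot2, while the function name `hourly_counts`, the `(timestamp, gender)` input format, and the sample timestamps below are all hypothetical and are not the real data.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(announced_at, posts, hours=48):
    """Count posts per (hours since announcement, gender) bucket.

    `posts` is an iterable of (posted_at, gender) pairs. This input
    format is an assumption made for illustration only.
    """
    counts = Counter()
    for posted_at, gender in posts:
        elapsed = posted_at - announced_at
        hour = int(elapsed.total_seconds() // 3600)
        # Keep only posts that fall inside the observation window.
        if 0 <= hour < hours:
            counts[(hour, gender)] += 1
    return counts


# Made-up timestamps purely for illustration; not the real wall data.
start = datetime(2010, 7, 1, 12, 0)
sample = [
    (start + timedelta(minutes=5), "F"),
    (start + timedelta(minutes=40), "F"),
    (start + timedelta(minutes=50), "M"),
    (start + timedelta(hours=3, minutes=10), "M"),
    (start + timedelta(hours=50), "F"),  # outside the 48-hour window
]
counts = hourly_counts(start, sample)
```

Plotting the resulting counts against the hour index, with one series per gender, would give a figure of the same general shape as the one described above.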

Of course, all of this is secondary to the fact that, Kristen, I love you and I cannot wait to spend the rest of my life with you!

What Will ‘Data Science’ Teach Us?

If the level of online discourse is a good indicator of whether a topic has penetrated the collective nerd consciousness, then the notion of a burgeoning “data science” discipline has taken hold. A few weeks ago I discussed where to draw the line on this idea, but recently I again began thinking about the idea and term more critically. Yesterday, I had a wonderful discussion with a brilliant member of the data community here in New York, which focused on the delicate balance between keeping a human-friendly face on mass quantities of data (something the data scientists are meant to do) and having this new discipline make formidable contributions to our general understanding of human behavior.

That is, up to this point, many of the great evangelists of data science have focused on telling stories with data. Science, however, is not about story telling, but about discovery. Perhaps I am particularly cautious of the suffix “science” because of the awkward self-consciousness the word has imbued in my own discipline. At its roots, political science was a discipline that sought to construct narratives; equal parts history, philosophy and personal experience. The name “political science,” therefore, brought the ire of the “hard science” community, as they felt (perhaps with reason) that the word had been appended to the title erroneously, as there were no identifiably scientific aspects to the endeavor. While my discipline has come a long way in its application of the scientific method, and today can much more accurately be referred to as a science, there continues to be a delicate balance between discovery and story telling. What, then, can the data science community learn from this experience?

Broadly, all disciplines are measured by their contributions to our understanding of the universe. Data science—by design—is the product of measured human activity, and therefore should seek to provide new insight into human behavior. Unfortunately, the current focus of many of the community’s members has been a self-congratulatory appraisal of the tools that have been developed to allow for this large-scale measurement and recording. To be a successful discipline, however, the focus must move away from tools and toward questions.


Sunbelt XXX, and Other Loose Ends

I have been back in the United States for about a week, but only now have found some time to get back to blogging. As I stated before my departure, the primary reason for my trip to Europe was to participate in the 30th meeting of the International Network for Social Network Analysis.

First, Aric Hagberg and I gave a workshop on using NetworkX to hack social networks. Given that it was the first time we had ever given this workshop, I was pleased with how well it went and the positive reception we received from the audience. It was encouraging to see so many researchers from academia, private corporations and the government interested in learning the mechanics of generating network data and analyzing it. That said, Sunbelt did reinforce my previous observation that academic researchers have a lot of catching up to do in terms of tools. There were several talks that indicated an unfortunate lack of technical expertise, which could easily be overcome with a minimal level of effort. Thankfully, conferences like Sunbelt allow people with many different talents to mix together and exchange ideas, and this norm was on display in Riva del Garda.

Materials from NetworkX Workshop

On Tuesday, Aric Hagberg and I presented a half-day workshop on NetworkX titled “Hacking Social Networks with the Python Programming Language” at Sunbelt XXX. I tweeted this on Tuesday, but for those who were unable to make the plane, train and bus trip required to reach Riva del Garda, Italy, to attend Sunbelt, Aric and I have posted all of the workshop materials (slides, LaTeX, code, etc.) to GitHub.

Please feel free to download, play, and reuse liberally. Also, if you have any questions, please drop me a line.

Extended leave of absence

No sooner do I post my (controversial?) list of reasons why grad students blog than I must take an extended leave of absence. As others have rightly pointed out, coursework and academic research come first. In a few days I will be heading to Europe for, among other things, Sunbelt XXX, to hobnob with the world’s foremost network theorists and present the NetworkX workshop with Aric Hagberg. If you are going to be at Sunbelt, please drop me a line.

Blogging will return sometime mid-July, so farewell until then…

Gratuitous and Rather Useless World Cup Post

I am not a soccer fan; I prefer the American version of football. That said, I am admittedly actively following the 2010 World Cup. While watching the opening match between South Africa and Mexico, I thought it would be fun to ask the question, “Do free countries produce better football teams?”

So, I quickly combined the FIFA World Rankings and Press Freedom Index for all of the countries participating in the 2010 World Cup, and came up with this:

Note that the Press Freedom Index goes from 0.00 (most free) to 115.00 (least free), and the FIFA point totals increase as a team’s overall quality improves, so we might expect a negative relationship. Also, I removed North Korea from the data set because it was such an extreme outlier on the press freedom dimension. So, we find basically a null result. There is a slight negative relationship, but it is essentially random.
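As a rough illustration of the procedure just described (join the two tables by country, drop an outlier, and check the direction of the relationship), here is a minimal Python sketch. The function names `pearson` and `merge_and_correlate` are hypothetical, the country labels and numbers are placeholders rather than the actual FIFA or Press Freedom Index values, and the sketch computes only a plain Pearson correlation, not the smoothed fit curve from the figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def merge_and_correlate(fifa_points, press_freedom, exclude=()):
    """Join two {country: value} tables and correlate the shared rows."""
    countries = sorted(
        c for c in fifa_points if c in press_freedom and c not in exclude
    )
    freedom = [press_freedom[c] for c in countries]
    points = [fifa_points[c] for c in countries]
    return countries, pearson(freedom, points)


# Placeholder values for illustration only; NOT the real rankings.
fifa = {"A": 1600, "B": 1200, "C": 900, "D": 400}
press = {"A": 5.0, "B": 10.0, "C": 15.0, "D": 105.0, "E": 30.0}

# "D" stands in for an extreme outlier removed before the comparison.
countries, r = merge_and_correlate(fifa, press, exclude={"D"})
```

A negative `r` here means that lower (freer) index scores go with higher point totals; a value near zero would correspond to the null result reported above.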

The peak and valley of the smoothed fit curve are a bit interesting. For the worst teams, as freedom goes down the quality of the teams goes up, but around a freedom score of 20 that relationship inverts, and as the quality of the teams increases so does the level of freedom, until we reach the best teams, Brazil and Spain.

Learning About Network Theory

Over the past several weeks I have had the pleasure of co-authoring a lengthy introduction to network theory with Bradford Cross, co-founder and head of research for FlightCaster (one of my top five favorite iPhone apps). After many ebbs and flows of writing, it is finally up over at Brad’s excellent Measuring Measures blog. Here’s Brad’s motivation for the post:

I received a lot of great feedback to my first and second posts on learning about machine learning. Part of that feedback was that people wanted to see similar posts for other topics. The most asked about topic was Network Theory, no doubt due to a massive recent increase in interest in social networks and social network analysis (SNA).

In this post, Drew Conway (a PhD Candidate at New York University, studying networks) and I will walk you through a guide that we hope may be of use to others trying to find their way through network theory

So, go check out Learning About Network Theory and let Brad and me know what you think. Also, be sure to up-vote it on Hacker News and /r/statistics.
