The ever-vigilant observers of government publications at Secrecy News highlight a recently released analysis and thought piece on the data analysis challenges facing the defense and intelligence communities. The excellent report, entitled (cleverly) “Data Analysis Challenges,” was produced by the JASON defense advisory panel. It is a thorough treatment of the volume of data pouring into these communities, how other disciplines (e.g., astrophysics) have dealt with this scaling, and the current state of the art in data management tools and techniques. Most interesting, though, is that it presents a series of “grand data challenges” as a means to consider how to deal with data problems.
…we put forth a set of suggested challenge topics that would spur further development in automated analysis of large data. It should be emphasized that our proposals below are by no means exhaustive. Instead, they are simply meant to provide example applications of a methodology that could lead to identification of such grand challenge problems and thus to a rationale for significant investment in research in the area of machine assisted analysis of large data. None of the grand challenge problems described below focus on hardware, networking or storage…Rather, important advances in data fusion, registration, and ultimately in machine learning are called for.
Grand challenges are a great way to test the intellectual boundaries of a subject, and they form an important foundation for discovery, especially within the DoD/IC communities. All of the problems put forth by JASON here are fascinating, but of most interest to the social science and network analysis communities will be the “Conversational Analysis Grand Challenge.” As the authors state, the purpose of this challenge is to develop methods for inferring the membership of individuals in groups based on their communication. The authors suggest integrating machine learning to accomplish this task; however, more traditional network analysis techniques may be more useful.
The context, frequency and structure of communicative networks will likely tell the story of individual affiliations and strength of tie. Clique analysis, cluster analysis, and block modeling are all techniques currently used to examine how individuals form sub-groups within a network. The challenge is to build on these methods, which are equal parts art and science, and develop reliable algorithms that accurately perform these data reductions on a massive scale. An additional complication would be to integrate natural language toolkits to attempt to derive context from conversations as part of the data coding; however, these technologies are far from fully developed, and may lead to more error than discovery.
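As a rough illustration of the clique analysis mentioned above, here is a minimal Python sketch. It is not the JASON report's method. It assumes communication arrives as (sender, receiver) pairs, treats a tie as present once a pair has exchanged at least `min_count` messages (an arbitrary threshold chosen for illustration), and enumerates maximal cliques with the basic Bron-Kerbosch recursion. The names and the `MESSAGES` data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each tuple is one message (sender, receiver).
MESSAGES = [
    ("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann"),
    ("bob", "cat"), ("cat", "bob"), ("cat", "dan"), ("dan", "cat"),
    ("dan", "eve"), ("eve", "dan"), ("eve", "fay"), ("fay", "eve"),
    ("dan", "fay"), ("fay", "dan"), ("ann", "fay"),
]


def build_graph(messages, min_count=2):
    """Build an undirected graph, keeping ties with at least min_count messages."""
    counts = Counter(frozenset(m) for m in messages if m[0] != m[1])
    graph = defaultdict(set)
    for pair, n in counts.items():
        if n >= min_count:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-processed nodes
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


graph = build_graph(MESSAGES)
cliques = maximal_cliques(graph)
for group in sorted(sorted(c) for c in cliques):
    print(group)
```

The single message between "ann" and "fay" falls below the threshold and is dropped, which is exactly the kind of judgment call (the "art" part) that makes these methods hard to automate. Bron-Kerbosch is also exponential in the worst case, so a sketch like this says nothing about how to handle the massive scale the challenge envisions.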
What are your thoughts on using network analysis to determine individual group membership from electronic communication? What methods would you use?

This problem would seem to be substantially complicated by secrecy in two ways. First, how does strategic non-communication in dark networks bias estimates that traditional clique or cluster analytic methods would produce? The strategic behavior of actors of interest to the defense and intelligence communities would seem to differ substantially from the behaviors of actors more typically studied by network scholars. As a result, the challenge requires not just scaling existing methods but developing new techniques, or perhaps introducing academics to techniques that have previously been developed within the intelligence community. I’m not saying this is impossible, but it frames the challenge somewhat differently from the way it is portrayed above.
The second aspect of secrecy that increases this challenge is the secrecy of the intelligence community itself. In a very different context, the Netflix challenge has shown that developing effective algorithms for efficient processing of large data sets requires some disclosure of data to people who can bring innovative approaches to bear on the problem. Similarly, academics have a self-interest in publishing in peer-reviewed journals, and therefore in producing reproducible results, so limitations on data disclosure will do little to attract a wide array of the best and brightest from within academia. As a result, meeting the challenge will be a lot easier if the intelligence community can produce public use data without compromising ongoing operations or sources or methods. Two approaches to resolving this conundrum stand out to me. First, there may be older data that could be made public, the disclosure of which would not represent a security concern but is still sufficiently relevant to the large scale data analysis challenge. Second, it may be possible to produce simulated “communication” data with similar structural properties to the type of real data for which the DoD/IC needs better analytic techniques. However, without public use data of some type, real or simulated, progress on any such challenge seems likely to be substantially slowed.
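To make the simulated-data idea concrete, here is a minimal Python sketch of one way such data might be generated. It is an assumption-laden toy, not a description of any real DoD/IC data set: the only "structural property" it reproduces is group structure, via a hypothetical `p_within` parameter giving the probability that a message stays inside the sender's own group. The function name, parameters, and group labels are all invented for illustration.

```python
import random


def simulate_messages(groups, n_messages, p_within=0.9, seed=0):
    """Generate (time, sender, receiver) events with planted group structure.

    groups: list of lists of member names (assumed: at least two people overall).
    p_within: assumed probability that a message stays inside the sender's group.
    """
    rng = random.Random(seed)  # seeded so the simulated data is reproducible
    members = [(person, g) for g, people in enumerate(groups) for person in people]
    events = []
    for t in range(n_messages):
        sender, g = rng.choice(members)
        if rng.random() < p_within:
            pool = [p for p in groups[g] if p != sender]
        else:
            pool = [p for p, h in members if h != g]
        if not pool:
            # Fall back to anyone else, e.g. for a one-person group.
            pool = [p for p, _ in members if p != sender]
        events.append((t, sender, rng.choice(pool)))
    return events


GROUPS = [["a", "b", "c"], ["d", "e", "f"]]  # hypothetical planted groups
events = simulate_messages(GROUPS, 200, p_within=0.9, seed=1)
print(events[:5])
```

Because the true group memberships are known by construction, a generator like this lets outside researchers score a group-detection method against ground truth. Its weakness is that whoever writes the generator decides what structure the data contains.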
Kevin, you make two excellent points, thank you for sharing them.
First, I think you are absolutely right that those of most interest will be actively attempting to block and deceive collection of their communication; however, I do not think that the actual communication activity itself is entirely different from non-adversarial behavior. Speaking in codes, back-channeling emails, etc. are all methods of hiding disclosure, but the ties themselves remain relatively consistent. As such, the real challenge is creating methods for deciphering these codes so that the traditional methods can be effectively used.
On your second point, getting the IC to share data with academia should be a grand challenge in and of itself, and you are right to point that out. For the purposes of these problems, though, it is probably more useful to assume that the data is accessible and think about what to do with it from there. I think the best approach to take is large scale anonymization of data, and the Enron corpus is an excellent example of how to do this. I agree that simulation is another good approach, though it is hard to overcome the biases inserted by the developers. DARPA’s National Cyber Range, I think, attempts to get at some of this.
[...] Posted in ubiwar by Tim Stevens on 8 July 2009 Data Analysis Challenges for the Defense and Intelligence Communities – Drew Conway, Zero Intelligence [...]
Hi Drew,
Just a comment on your response: The Enron corpus was not anonymized. Email addresses are present. In fact, I’m a bit skeptical about anonymization in data arising from network communication (i.e., a sequence of relational events). Suppose you hope to release a sequence of triples (time, ID of sender, ID of receiver); it’s going to be tough to make things anonymous. For example, suppose you wanted to anonymize a year’s worth of email among all students at NYU. If a person at NYU downloads the data set and wants to find themselves in the network, they can just look at the sequence of timestamps in their outbox and, in turn, deduce all their neighbors. And I don’t see how any naive fixes (e.g., subsetting events, adding noise to the timestamps) will help without (possibly) changing the nature of the data in a significant way. If you have any pointers to papers you have found that solve this issue, please let me know!
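The timestamp argument above can be sketched in a few lines of Python. This is a toy under stated assumptions: the released data is a list of (time, sender ID, receiver ID) triples with pseudonymous IDs, timestamps are released exactly, and the person holds their own outbox as (time, real recipient) pairs. The IDs, addresses, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: (time, pseudonymous sender, pseudonymous receiver).
RELEASED = [
    (100, "u7", "u2"), (105, "u3", "u7"), (130, "u7", "u5"),
    (131, "u2", "u5"), (160, "u7", "u2"), (170, "u5", "u3"),
]

# What one participant already knows: their own outbox as (time, real recipient).
MY_OUTBOX = [
    (100, "bob@nyu.example"), (130, "carol@nyu.example"), (160, "bob@nyu.example"),
]


def find_self(released, my_outbox):
    """Return pseudonyms whose set of send times equals my own send times."""
    sent = defaultdict(set)
    for t, sender, _receiver in released:
        sent[sender].add(t)
    mine = {t for t, _recipient in my_outbox}
    return [pid for pid, times in sent.items() if times == mine]


def link_contacts(released, my_id, my_outbox):
    """Map receiver pseudonyms to real recipients via matching send times."""
    real_at = dict(my_outbox)
    return {r: real_at[t] for t, s, r in released if s == my_id and t in real_at}


candidates = find_self(RELEASED, MY_OUTBOX)
contacts = link_contacts(RELEASED, candidates[0], MY_OUTBOX)
print(candidates, contacts)
```

Once the pseudonyms of one's correspondents are known, every other event they appear in is exposed as well, which is why fixes that leave the timestamps largely intact do little against this kind of linkage.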
PS. It was good meeting you at the NY Meetup last night.
Chris,
The Enron corpus that I had worked with in the past was anonymized, though that is merely a matter of semantics. More to the point, I do not think the value of anonymization lies in preventing those within the data from finding their own location in it, but rather in preventing third-party analysts from identifying individuals.
The key concern in releasing national security data is revealing collection tactics, which was clearly not a concern for the Enron corpus.