A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March, 1923 to March, 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set I was left wondering, “What can I ask of the data?”
After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?
I developed a data visualization tool, which I’m calling the Shades of TIME, to explore the answer to that question.
The process for generating the Shades of TIME required the following steps:
Using OpenCV to detect and extract the faces appearing in the magazine covers
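For readers who want to try something similar, here is a minimal sketch of that detection step. It is not the code behind the Shades of TIME. It assumes the `opencv-python` package (which ships stock Haar cascade files under `cv2.data.haarcascades`), and the function names `mean_tone` and `extract_face_tones` are hypothetical. Averaging every pixel in the face box is also an assumption about how a single "tone" might be derived.

```python
def mean_tone(pixels):
    """Average a sequence of (r, g, b) values into a single RGB tone."""
    n = len(pixels)
    return tuple(round(sum(p[i] for p in pixels) / n) for i in range(3))


def extract_face_tones(image_path):
    """Detect faces on one cover image and return one average tone per face.

    Assumption: the stock frontal-face Haar cascade is good enough; expect
    false positives, as with any out-of-the-box detector.
    """
    import cv2  # imported lazily so mean_tone works without OpenCV installed

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    tones = []
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        face = image[y:y + h, x:x + w]
        # OpenCV stores pixels as BGR; reverse the channel order to get RGB.
        pixels = face.reshape(-1, 3)[:, ::-1].tolist()
        tones.append(mean_tone(pixels))
    return tones
```

Calling `extract_face_tones("cover.jpg")` on a cover (the file name is a placeholder) would return a list with one RGB tuple per detected region, whether or not that region is really a face.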
I have two primary observations from exploring the data. First, it does appear that the variance in skin tones has changed over time, and in fact the tones are getting darker. Most of the first quarter of the data is hard to interpret because TIME was still largely using black and white images, and when they did use color it was often artists’ renderings of portraits. The interpretation of skin tone in drawings is difficult. Around the mid-1970s, however, there appears to be an explosion of skin tone diversity. Of course, there can be many reasons for this, not the least of which may be improvement in photo and magazine printing technologies.
Second, and much more certain, is that TIME has steadily increased the number of faces that appear on its covers over time. As you scroll through the visualization you will quickly notice the number of faces per cover increase from one, to a few, to many in the 1990s through 2010s. Whether this is the result of a desire to show a more diverse set of faces, or to increase the magazine’s marketing appeal on newsstands, or both, is completely unknown.
But, as with most data projects of this nature, the resulting tool generates more observations than questions. Perhaps the most important is how brittle the out-of-the-box face detection algorithms were. As you click through the tone cells you will notice that many of them do not correspond to a face at all. As such, it may be difficult to interpret any of this as relevant to the motivating question. That said, in aggregate there are many more faces than there are false positives, so the exercise still seems useful.
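One rough way to check the first observation numerically is to group tones by decade and compare their mean and variance. The sketch below is only an illustration, not the analysis behind the visualization. The `(year, (r, g, b))` record layout and the function names are assumptions, and tone is reduced to one number with the standard Rec. 601 luma weights.

```python
from collections import defaultdict
from statistics import mean, pvariance


def luminance(rgb):
    """Perceived brightness of an RGB tone using Rec. 601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def tone_stats_by_decade(records):
    """Map each decade to (mean luminance, variance of luminance).

    `records` is an iterable of (year, (r, g, b)) pairs, a hypothetical layout.
    """
    by_decade = defaultdict(list)
    for year, rgb in records:
        by_decade[year // 10 * 10].append(luminance(rgb))
    return {
        decade: (mean(values), pvariance(values))
        for decade, values in sorted(by_decade.items())
    }
```

A falling mean would correspond to darker tones, and a rising variance to a wider spread of tones, though neither separates real change from printing changes or detection errors.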
With the release of the eBook version of Machine Learning for Hackers this week, many people have been asking for the code. With good reason—as it turns out—because O’Reilly still (at the time of this writing) has not updated the book page to include a link to the code.
For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:
Please feel free to clone, fork, and hack the repository as much as you like. As we mention in the README, some of the code will not appear exactly as it does in the text. This happens for two reasons: first, some minor formatting changes had to be made to fit the code into the book; and second, some of the code has been updated or edited to remove typos and minor errors.
We hope you find the code a useful supplement to the text!
If you missed the scuttlebutt on Twitter yesterday, I announced that my book with John Myles White, Machine Learning for Hackers, was sent to the printers! This means that hard copies will be available very soon, and presumably an eBook copy will be available even sooner.
We are thrilled by the community’s interest and enthusiasm for the book, and want to thank everyone who told us that they have already pre-ordered copies. Many people have been asking for a table of contents, which O’Reilly has not yet posted. To give people a preview I have posted the TOC below. Hopefully this will pique your interest even more!
The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bill be stopped. Given the attention the bill was getting, I was curious if there was any surge in discussion of the bill by members of Congress on Twitter.
Last night Mike Dewar presented a wonderful talk to the New York Open Statistical Programming Meetup titled, “First steps in data visualisation using d3.js.” Mike took the audience through an excellent review of d3.js fundamentals, and showed off some of the features of working with Chrome Web Developer Tools. This is one of the best talks we have ever had, and if you have had any interest in exploring d3.js, but were intimidated by the design concepts or syntax, this is exactly the talk for you.
Also, Mike’s slides were all designed using d3.js and are available for download on his Github account: https://github.com/mikedewar/d3talk.
Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.
Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.
To show the power of the database, I decided to use my newly acquired d3.js (sick of it yet?) skills to put together a tool that allows you to compare the monthly Twitter activity of all members of the U.S. Congress on Twitter for 2011.
Simply choose a politician from the drop-down menu (alphabetical by surname) and the graph will update with their activity data. If you want to reset the graph, just click the “Clear selections” button.
Feel free to add as many members as you like, but the dimensions of the visualization max out around 9. I have been playing around with it for a while, and it’s fun! Oh, and if you choose a member and nothing happens it is most likely because that person didn’t tweet anything in 2011. I could have built in error catching or some warning. Also, to clear things you need to re-load the page. I’ll leave real UX to the professional web designers.
Back to the data. Unfortunately, the database is sitting on a server that cannot process many requests (read, web-scale) at a time. In fact, this blog post may bring it down! As such, if you are interested in getting access to the database please contact me directly. But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages, including, but not restricted to, curl, map/reduce, JavaScript, and JSON. And that’s before you have even done any analysis.
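The numbers behind a chart like this are just per-member, per-month tweet counts. Here is a minimal sketch of that aggregation, assuming each tweet has already been pulled out as a `(member, created_at)` pair with a `YYYY-MM-DD HH:MM:SS` timestamp. That layout and the function names are hypothetical and say nothing about how the real database stores its documents.

```python
from collections import Counter
from datetime import datetime


def monthly_activity(tweets, year=2011):
    """Count tweets per (member, month) within one calendar year."""
    counts = Counter()
    for member, created_at in tweets:
        stamp = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
        if stamp.year == year:
            counts[(member, stamp.month)] += 1
    return counts


def series_for(counts, member):
    """Return 12 values (January through December), zero for silent months."""
    return [counts[(member, month)] for month in range(1, 13)]
```

A member who never tweeted in the chosen year simply gets twelve zeros, which is one way a front end could detect the "nothing happens" case and show a warning instead.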
Many people have asked me about working with Congressional Twitter data, so I hope this data can be useful. Please feel free to reach out if you have any questions.
This leads naturally to many questions; perhaps most importantly that of this post’s title: Who are the most central members of China’s leadership as we enter 2012?
Recently, I had the opportunity to work with Recorded Future, a startup out of Boston that specializes in longitudinal entity extraction from the massive amount of open-source data generated daily. For example, they have used their data to predict future patent issues for Apple based on issues raised by their competitors. This analysis includes many entities: Apple, HTC, Samsung, etc.; as well as the patents and lawsuits.
For our analysis we focused on China’s leadership, as defined by the CIA World Factbook, and extracted all of the named entities in their data for 2011 (over 4 billion events) for which any of the 33 official Chinese leaders appear. The result is a dataset with over 150,000 entities, including people, organizations, and places. To answer our questions, however, I used the co-occurrence of these entities in sentence fragments to build a large network of these entities.
Here I define an edge between two entities as the co-occurrence of two entities in a sentence fragment, which is provided by Recorded Future. Then, by extracting only the entities that are defined as people in the data, I generated a graph with 5,435 nodes and 34,413 edges. Big, but not unreasonable for analysis. Next, I computed some basic network statistics on that graph. As I have mentioned many times before, these measures are often most interesting if compared together. To highlight key actors, I generated a scatter plot of two metrics: Eigenvector centrality and betweenness centrality.
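To make the construction concrete, here is a small pure-Python sketch of the same idea: build an undirected graph from co-occurrences, then compute eigenvector and betweenness centrality. It is an illustration under assumptions, not the code used for this analysis. The input layout (one list of person names per sentence fragment) and the function names are hypothetical, and a graph library would be the practical choice at 5,435 nodes.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(fragments):
    """Link two people whenever they appear in the same sentence fragment."""
    graph = defaultdict(set)
    for people in fragments:
        for a, b in combinations(sorted(set(people)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def eigenvector_centrality(graph, iterations=200, tol=1e-9):
    """Power iteration, normalized to unit length.

    Iterates on A + I rather than A: same eigenvectors, but it avoids
    oscillation on bipartite graphs.
    """
    score = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in graph[n]) for n in graph}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in graph) < tol:
            return new
        score = new
    return score


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    between = {n: 0.0 for n in graph}
    for source in graph:
        stack = []
        preds = {n: [] for n in graph}
        paths = {n: 0 for n in graph}
        dist = {n: -1 for n in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        while stack:  # accumulate dependencies from the farthest nodes back
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                between[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: b / 2 for n, b in between.items()}
```

Plotting one score against the other for every node gives the kind of scatter plot described above: nodes high on both are well connected and also sit on many shortest paths, while nodes high on only one are worth a closer look.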
First, from looking at the date of my last substantive post I owe everyone an apology. I have essentially let Zero Intelligence Agents wither on the vine, and that is terrible. Not so much because I think people are desperate to read it, but because I am desperate to get feedback from people on my projects and ideas.
One such project I have been working on recently is looking at the newly released data on Federal Reserve borrowing of 407 banks and companies during the 2007-2009 financial crisis. I have been looking for data sets to tell stories with because one of the tools I am eager to learn in 2012 is Michael Bostock’s d3.js, a JavaScript library for data-driven design (d3, get it?). It is an incredibly powerful tool, albeit very verbose and cumbersome for a total JavaScript newbie such as myself.
I decided to teach myself some d3 through this Federal Reserve data, and came up with this visualization in the labs section of drewconway.com. The image below is just a snapshot of the visualization; please click through to see the full interactive chart.
One way to increase your productivity is to see how other people get their work done. The blog The Setup provides this by asking, “What do people use to get stuff done?” If you would like to compare setups with an exceedingly eclectic group of people, then this is a very interesting resource.
If, for some reason, you are curious what I use to get my work done, you now have the opportunity to compare your setup with mine.
This also includes my work with Jake Porway on Data Without Borders. Data Without Borders (DWB) is a new initiative to match non-profits and NGOs in need of data analysis with pro bono data scientists who can help them collect, organize, analyze, and visualize that data. The group was founded when it became clear that there was a huge amount of energy and excitement in the global big data community, but that much of that energy was being diverted to less socially conscious applications, like deal finders and advertisement placement optimization. DWB seeks to capitalize on that energy by arranging short and long term partnerships between socially conscious data scientists and NGOs.
Last week O’Reilly held the second installment of its Strata Conference, which brings together people from across the big data community. I had the opportunity to sit down with Mac Slocum last week to talk about Data Without Borders.
After the jump I have also embedded the keynote Jake Porway and I delivered at the Strata Conference on Data Without Borders. One of the things we were most excited about announcing at Strata is our upcoming Datadive in San Francisco. If you are in the Bay Area, please join us to help us do good with data!
Drew Conway is a PhD student in political science at New York University. Drew studies terrorism and armed conflict, using tools from mathematics and computer science to gain a deeper understanding of these phenomena.