The Shades of TIME project

A couple of days ago someone posted a link to a data set of all TIME Magazine covers, from March 1923 to March 2012. Of course, I downloaded it and began thumbing through the images. As is often the case when presented with a new data set, I was left wondering, “What can I ask of the data?”

After thinking it over, and with the help of Trey Causey, I came up with, “Have the faces of those on the cover become more diverse over time?” To address this question I chose to answer something more specific: Have the color values of skin tones in faces on the covers changed over time?

I developed a data visualization tool, which I’m calling the Shades of TIME, to explore the answer to that question.

The process for generating the Shades of TIME required the following steps:

  1. Using OpenCV to detect and extract the faces appearing in the magazine covers
  2. Using the Python Imaging Library to implement the Peer et al. (2003) skin tone classifier to find the dominant skin tone in each face
  3. Designing a data visualization and exploration tool using d3.js
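
As a rough illustration of step 2, here is a minimal sketch of an RGB skin rule of the kind commonly attributed to Peer et al. (2003) for uniform daylight illumination. The thresholds are that published rule as I understand it; the `dominant_skin_tone` helper, its color bucketing, and the sample pixels are assumptions made for this example and are not necessarily what the project's code does.

```python
from collections import Counter


def is_skin(r, g, b):
    """RGB skin rule (uniform daylight case) attributed to Peer et al. (2003)."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def dominant_skin_tone(pixels, bucket=16):
    """Most common skin-classified color, quantized into `bucket`-wide bins.

    Returns None when no pixel passes the skin rule. The bucketing is an
    assumption for this sketch, used so near-identical tones count together.
    """
    counts = Counter(
        (r // bucket * bucket, g // bucket * bucket, b // bucket * bucket)
        for r, g, b in pixels
        if is_skin(r, g, b)
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up pixels standing in for a cropped face region.
face_pixels = [(224, 172, 140), (226, 170, 138), (30, 30, 30), (200, 200, 255)]
print(dominant_skin_tone(face_pixels))  # (224, 160, 128)
```

With the Python Imaging Library, the pixels of a cropped face image could be passed in as `list(face.convert("RGB").getdata())`, where `face` is a hypothetical image object produced by the face detection step.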

The code and data are all available at my Github. Instructions for how to use the tool to explore the data are available at the tool page itself. It is worth checking out just as a fun way to explore the TIME Magazine covers.

I have two primary observations from exploring the data. First, it does appear that the variance in skin tones has changed over time, and in fact the tones are getting darker. Most of the first quarter of the data is hard to interpret because TIME was still largely using black-and-white images, and when it did use color it was often for artists’ renderings of portraits. Interpreting skin tone in drawings is difficult. Around the mid-1970s, however, there appears to be an explosion of skin tone diversity. Of course, there can be many reasons for this, not the least of which may be improvements in photo and magazine printing technologies.

Second, and much more certain, is that TIME has steadily increased the number of faces that appear on its covers. As you scroll through the visualization you will quickly notice the number of faces per cover increase from one, to a few, to many in the 1990s through 2010s. Whether this is the result of a desire to show a more diverse set of faces, an attempt to increase the magazine’s marketing appeal on newsstands, or both is unknown.

But, as with most data projects of this nature, the resulting tool generates more observations than questions. Perhaps the most important is how brittle the out-of-the-box face detection algorithms were. As you click through the tone cells you will notice that many of them do not correspond to a face at all. As such, it may be difficult to interpret any of this as relevant to the motivating question. That said, in aggregate there are many more faces than false positives, so the exercise still seems useful.

Code for Machine Learning for Hackers

With the release of the eBook version of Machine Learning for Hackers this week, many people have been asking for the code. With good reason—as it turns out—because O’Reilly still (at the time of this writing) has not updated the book page to include a link to the code.

For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:

https://github.com/johnmyleswhite/ML_for_Hackers

Please feel free to clone, fork, and hack the repository as much as you like. As we mention in the README, some of the code will not appear exactly as it does in the text. This happens for two reasons: first, some minor formatting changes had to be made to fit the code into the book; and second, some of the code has been updated or edited to remove typos and minor errors.

We hope you find the code a useful supplement to the text!

Machine Learning for Hackers table of contents

If you missed the scuttlebutt on Twitter yesterday, I announced that Machine Learning for Hackers, the book I wrote with John Myles White, was sent to the printers! This means that hard copies will be available very soon, and presumably an eBook copy will be available even sooner.

We are thrilled by the community’s interest and enthusiasm for the book, and want to thank everyone who told us that they have already pre-ordered copies. Many people have been asking for a table of contents, which O’Reilly has not yet posted. To give people a preview I have posted the TOC below. Hopefully this will pique your interest even more!

TOC ML4Hackers

Chart of Congressional activity on Twitter related to SOPA/PIPA

As many of you know, this week thousands of people mobilized to protest two bills being considered in Congress: the Stop Online Piracy Act (SOPA) and its Senate version, the PROTECT IP Act (PIPA). Several Internet mainstays, such as Wikipedia, Reddit, and O’Reilly, blacked out their sites to protest the bills. For some information on why this legislation is so dangerous, check out this excellent video by The Guardian.

The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bill be stopped. Given the attention the bill was getting, I was curious if there was any surge in discussion of the bill by members of Congress on Twitter.

So, I created a visualization that is a cumulative timeline of tweets by members of the U.S. Congress mentioning “SOPA” or “PIPA.” To see if there was any surge, check out the visualization for yourself.


First steps in data visualisation using d3.js, by Mike Dewar

Last night Mike Dewar presented a wonderful talk to the New York Open Statistical Programming Meetup titled, “First steps in data visualisation using d3.js.” Mike took the audience through an excellent review of d3.js fundamentals, and showed off some of the features of working with Chrome Web Developer Tools. This is one of the best talks we have ever had, and if you have had any interest in exploring d3.js, but were intimidated by the design concepts or syntax, this is exactly the talk for you.

Also, Mike’s slides were all designed using d3.js and are available for download on his Github account: https://github.com/mikedewar/d3talk.

Monthly Twitter activity for all members of the U.S. Congress

Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.

To show the power of the database, I decided to use my newly acquired d3.js (sick of it yet?) skills to put together a tool that allows you to compare the monthly Twitter activity of all members of the U.S. Congress for 2011.

Simply choose a politician from the drop-down menu (alphabetical by surname) and the graph will update with their activity data. If you want to reset the graph, just click the “Clear selections” button.

Feel free to add as many members as you like, but the dimensions of the visualization max out around 9. I have been playing around with it for a while, and it’s fun! Oh, and if you choose a member and nothing happens, it is most likely because that person didn’t tweet anything in 2011. I could have built in error catching or some warning. Also, to clear things you need to re-load the page. I’ll leave real UX to the professional web designers.

Back to the data. Unfortunately, the database is sitting on a server that cannot process many requests at a time (read: not web-scale). In fact, this blog post may bring it down! As such, if you are interested in getting access to the database please contact me directly. But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages, including but not restricted to curl, map/reduce, JavaScript, and JSON. And that’s before you have even done any analysis.
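
For readers unfamiliar with map/reduce, the kind of per-member monthly count behind the chart can be sketched in a few lines. CouchDB views are normally written in JavaScript, so this Python version only mimics the idea; the document fields (`member`, `created_at`), the timestamp format, and the sample records are all assumptions for illustration and not the real schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet documents; field names and values are made up.
tweets = [
    {"member": "member_a", "created_at": "2011-01-15T14:02:00"},
    {"member": "member_a", "created_at": "2011-01-20T09:30:00"},
    {"member": "member_a", "created_at": "2011-03-02T18:45:00"},
    {"member": "member_b", "created_at": "2011-01-05T11:10:00"},
]


def map_tweet(doc):
    """Map step: emit a ((member, 'YYYY-MM'), 1) pair for one tweet."""
    month = datetime.fromisoformat(doc["created_at"]).strftime("%Y-%m")
    return (doc["member"], month), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


monthly = reduce_counts(map_tweet(doc) for doc in tweets)
for (member, month), count in sorted(monthly.items()):
    print(member, month, count)
```

A member with no tweets in a given month simply has no key in the result, which is consistent with a member who never tweeted in 2011 producing nothing on the chart.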

Many people have asked me about working with Congressional Twitter data, so I hope this data can be useful. Please feel free to reach out if you have any questions.

Who are the most central members of China’s leadership as we enter 2012?

As the United States gears up for what appears to be a long and grueling 2012 presidential campaign, China will also undergo its decennial turnover in presidential power in 2012. Unlike the United States, however, this shift will not involve any campaigning or voting, at least not by the people of China. Instead, this shift is one that is formalized within the Chinese Communist Party, but that doesn’t mean that there won’t be interesting shifts and reallocations of power.

This leads naturally to many questions, perhaps most importantly that of this post’s title: Who are the most central members of China’s leadership as we enter 2012?

Recently, I had the opportunity to work with Recorded Future, a startup out of Boston that specializes in longitudinal entity extraction from the massive amount of open-source data generated daily. For example, they have used their data to predict future patent issues for Apple based on issues raised by their competitors. This analysis includes many entities: Apple, HTC, Samsung, etc., as well as the patents and lawsuits.

For our analysis we focused on China’s leadership, as defined by the CIA World Factbook, and extracted all of the named entities in Recorded Future’s data for 2011 (over 4 billion events) that appear alongside any of the 33 official Chinese leaders. The result is a dataset with over 150,000 entities, including people, organizations, and places. To answer our question, I used the co-occurrence of these entities in sentence fragments to build a large network.

Here I define an edge between two entities as their co-occurrence in a sentence fragment, as provided by Recorded Future. Then, by extracting only the entities that are defined as people in the data, I generated a graph with 5,435 nodes and 34,413 edges. Big, but not unreasonable for analysis. Next, I computed some basic network statistics on that graph. As I have mentioned many times before, these measures are often most interesting if compared together. To highlight key actors, I generated a scatter plot of two metrics: eigenvector centrality and betweenness centrality.
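
To make the two metrics concrete, here is a minimal pure-Python sketch that builds a co-occurrence graph from sentence fragments and computes both centralities. It is not the code used for the analysis, and the post does not say which tools were used: the fragments below are toy data with placeholder names, eigenvector centrality is approximated by power iteration, and betweenness follows the standard Brandes algorithm for unweighted graphs.

```python
from collections import defaultdict, deque
from itertools import combinations
from math import sqrt


def cooccurrence_graph(fragments):
    """Link two people whenever they appear in the same fragment."""
    adj = defaultdict(set)
    for people in fragments:
        for a, b in combinations(sorted(set(people)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I); the shift avoids oscillation."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score


def betweenness_centrality(adj):
    """Unnormalized betweenness (Brandes) for an undirected, unweighted graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while order:  # accumulate dependencies, farthest nodes first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


# Toy fragments: A, B, C always appear together; D bridges C and E.
fragments = [["A", "B", "C"], ["C", "D"], ["D", "E"]]
graph = cooccurrence_graph(fragments)
ev = eigenvector_centrality(graph)
bc = betweenness_centrality(graph)
for person in sorted(graph):
    print(person, round(ev[person], 3), bc[person])
```

In this toy graph, D scores higher than A on betweenness because it bridges E to everyone else, while A scores higher than D on eigenvector centrality because it sits inside the tightly connected group. Differences like that are why plotting the two measures against each other can highlight different kinds of key actors.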

Continue reading Who are the most central members of China’s leadership as we enter 2012?

Federal Reserve borrowing during the 2007-2009 financial crisis

First, from looking at the date of my last substantive post I owe everyone an apology. I have essentially let Zero Intelligence Agents wither on the vine, and that is terrible. Not so much because I think people are desperate to read it, but because I am desperate to get feedback from people on my projects and ideas.

One such project I have been working on recently is looking at the newly released data on borrowing from the Federal Reserve by 407 banks and companies during the 2007-2009 financial crisis. I have been looking for data sets to tell stories with because one of the tools I am eager to learn in 2012 is Michael Bostock’s d3.js, a JavaScript library for data-driven documents (d3, get it?). It is an incredibly powerful tool, albeit very verbose and cumbersome for a total JavaScript newbie such as myself.

I decided to teach myself some d3 through this Federal Reserve data, and came up with this visualization in the labs section of drewconway.com. The image below is just a snapshot of the visualization; please click through to see the full interactive chart.

Continue reading Federal Reserve borrowing during the 2007-2009 financial crisis

My setup

One way to increase your productivity is to see how other people get their work done. The blog The Setup provides this by asking, “What do people use to get stuff done?” If you would like to compare setups with an exceedingly eclectic group of people, then this is a very interesting resource.

If, for some reason, you are curious what I use to get my work done, you now have the opportunity to compare your setup with mine.

Interview on Data Without Borders

A few people have noticed that the blog has been very quiet lately. This is very true, and I am sorry for the lack of new material. Unfortunately, the only excuse I have is that I have been very busy with many other exciting projects. These include both my upcoming book on machine learning, and a separate research project on political speech on Twitter with John Myles White.

This also includes my work with Jake Porway on Data Without Borders. Data Without Borders (DWB) is a new initiative to match non-profits and NGOs in need of data analysis with pro bono data scientists who can help them collect, organize, analyze, and visualize that data. The group was founded when it became clear that there was a huge amount of energy and excitement in the global big data community, but that much of that energy was being diverted to less socially conscious applications, like deal finders and advertisement placement optimization. DWB seeks to capitalize on that energy by arranging short- and long-term partnerships between socially conscious data scientists and NGOs.

Last week O’Reilly held the second installment of its Strata Conference, which brings together people from across the big data community. That same week I had the opportunity to sit down with Mac Slocum to talk about Data Without Borders.

After the jump I have also embedded the keynote Jake Porway and I delivered at the Strata Conference on Data Without Borders. One of the things we were most excited about announcing at Strata is our upcoming Datadive in San Francisco. If you are in the Bay Area, please join us to help us do good with data!

Continue reading Interview on Data Without Borders
