Chart of Congressional activity on Twitter related to SOPA/PIPA

As many of you know, this week thousands of people mobilized to protest two bills being considered in Congress: the Stop Online Piracy Act (SOPA) and its Senate counterpart, the PROTECT IP Act (PIPA). Several Internet mainstays, such as Wikipedia, Reddit, and O’Reilly, blacked out their sites to protest the bills. For some information on why this legislation is so dangerous, check out this excellent video by The Guardian.

The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bill be stopped. Given the attention the bill was getting, I was curious if there was any surge in discussion of the bill by members of Congress on Twitter.

So, I created a visualization: a cumulative timeline of tweets by members of the U.S. Congress that mention “SOPA” or “PIPA.” To see if there was any surge, check out the visualization for yourself.
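For readers curious how a cumulative timeline is put together, here is a minimal Python sketch. It is not the code behind the visualization; the function name and the sample dates are hypothetical and only illustrate the running-total idea.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_timeline(tweet_dates):
    """Return (day, running_total) pairs, sorted by day."""
    per_day = Counter(tweet_dates)          # tweets counted per calendar day
    days = sorted(per_day)
    totals = accumulate(per_day[d] for d in days)
    return list(zip(days, totals))


# Hypothetical dates of tweets mentioning "SOPA" or "PIPA" (made-up data).
sample_dates = [
    date(2012, 1, 16),
    date(2012, 1, 18),
    date(2012, 1, 18),
    date(2012, 1, 19),
]

timeline = cumulative_timeline(sample_dates)
for day, total in timeline:
    print(day.isoformat(), total)
```

A surge shows up in such a timeline as a sudden steepening of the curve, which is easier to spot than a spike in noisy daily counts.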


First steps in data visualisation using d3.js, by Mike Dewar

Last night Mike Dewar presented a wonderful talk to the New York Open Statistical Programming Meetup titled, “First steps in data visualisation using d3.js.” Mike took the audience through an excellent review of d3.js fundamentals, and showed off some of the features of working with Chrome Web Developer Tools. This is one of the best talks we have ever had, and if you have had any interest in exploring d3.js, but were intimidated by the design concepts or syntax, this is exactly the talk for you.

Also, Mike’s slides were all designed using d3.js and are available for download on his Github account: https://github.com/mikedewar/d3talk.

Monthly Twitter activity for all members of the U.S. Congress

Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.

To show the power of the database, I decided to use my newly acquired d3.js (sick of it yet?) skills to put together a tool that allows you to compare the monthly Twitter activity of all members of the U.S. Congress on Twitter for 2011.

Simply choose a politician from the drop-down menu (alphabetical by surname) and the graph will update with their activity data. If you want to reset the graph, just click the “Clear selections” button.

Feel free to add as many members as you like, but the dimensions of the visualization max out around 9. I have been playing around with it for a while; it’s fun! Oh, and if you choose a member and nothing happens, it is most likely because that person didn’t tweet anything in 2011. I could have built in error catching or some warning. Also, to clear things you need to re-load the page. I’ll leave real UX to the professional web designers.

Back to the data. Unfortunately, the database is sitting on a server that cannot process many requests (read, web-scale) at a time. In fact, this blog post may bring it down! As such, if you are interested in getting access to the database, please contact me directly. But be forewarned: working with this system and CouchDB requires a mature understanding of several tools and languages, including but not restricted to curl, map/reduce, JavaScript, and JSON. And that’s before you have even done any analysis.
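To give a feel for the map/reduce style of query, here is a small Python sketch that counts tweets per member per month. It only emulates the idea: real CouchDB views are written in JavaScript and emit key/value pairs from a map function, and the document fields used here (`member`, `created_at`) and the sample records are assumptions, not the actual schema of the database.

```python
from collections import defaultdict


def map_tweet(doc):
    """Map step: emit one ((member, 'YYYY-MM'), 1) pair per tweet document."""
    month = doc["created_at"][:7]           # assumes ISO dates, e.g. "2011-03-14"
    yield (doc["member"], month), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical tweet documents (made-up members and dates).
docs = [
    {"member": "member_a", "created_at": "2011-03-14"},
    {"member": "member_a", "created_at": "2011-03-20"},
    {"member": "member_a", "created_at": "2011-04-02"},
    {"member": "member_b", "created_at": "2011-03-01"},
]

pairs = [pair for doc in docs for pair in map_tweet(doc)]
monthly = reduce_counts(pairs)
for (member, month), count in sorted(monthly.items()):
    print(member, month, count)
```

Keying on the (member, month) pair is what makes a monthly comparison across members a single grouped query rather than one query per politician.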

Many people have asked me about working with Congressional Twitter data, so I hope this data can be useful. Please feel free to reach out if you have any questions.

Who are the most central members of China’s leadership as we enter 2012?

As the United States gears up for what appears to be a long and grueling 2012 presidential campaign, China will also undergo its decennial turnover in presidential power in 2012. Unlike the United States, however, this shift will not involve any campaigning or voting, at least not with the people of China. Instead, this shift is one that is formalized within the Chinese Communist Party; but that doesn’t mean that there won’t be interesting shifts and reallocations of power.

This leads naturally to many questions; perhaps most importantly that of this post’s title: Who are the most central members of China’s leadership as we enter 2012?

Recently, I had the opportunity to work with Recorded Future, a startup out of Boston that specializes in longitudinal entity extraction from the massive amount of open-source data generated daily. For example, they have used their data to predict future patent issues for Apple based on issues raised by their competitors. This analysis includes many entities: Apple, HTC, Samsung, etc., as well as the patents and lawsuits.

For our analysis we focused on China’s leadership, as defined by the CIA World Factbook, and extracted all of the named entities in their data for 2011 (over 4 billion events) in which any of the 33 official Chinese leaders appear. The result is a dataset with over 150,000 entities, including people, organizations, and places. To answer our questions, however, I used the co-occurrence of these entities in sentence fragments to build a large network of these entities.

Here I define an edge between two entities as the co-occurrence of two entities in a sentence fragment, which is provided by Recorded Future. Then, by extracting only the entities that are defined as people in the data, I generated a graph with 5,435 nodes and 34,413 edges. Big, but not unreasonable for analysis. Next, I computed some basic network statistics on that graph. As I have mentioned many times before, these measures are often most interesting if compared together. To highlight key actors, I generated a scatter plot of two metrics: Eigenvector centrality and betweenness centrality.
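To make the two metrics concrete, here is a self-contained Python sketch of the general approach: build an undirected co-occurrence graph, then compute eigenvector centrality by power iteration and betweenness centrality with Brandes' algorithm. It is not the code used for this analysis; the function names are hypothetical and the fragments are made-up placeholders rather than Recorded Future data.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(fragments):
    """Link every pair of people that co-occur in the same fragment."""
    graph = defaultdict(set)
    for people in fragments:
        for a, b in combinations(sorted(set(people)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def eigenvector_centrality(graph, iterations=100):
    """Power iteration on (A + I); scores are normalised to unit length."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        updated = {n: scores[n] + sum(scores[m] for m in graph[n]) for n in graph}
        norm = sum(v * v for v in updated.values()) ** 0.5
        scores = {n: v / norm for n, v in updated.items()}
    return scores


def betweenness_centrality(graph):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    between = {node: 0.0 for node in graph}
    for source in graph:
        order = []                               # nodes in BFS visiting order
        preds = {n: [] for n in graph}           # predecessors on shortest paths
        paths = {n: 0 for n in graph}            # number of shortest paths
        dist = {n: -1 for n in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:                  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:       # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in graph}
        for w in reversed(order):                # accumulate dependencies
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                between[w] += delta[w]
    return {n: b / 2 for n, b in between.items()}  # each pair was counted twice


# Placeholder fragments: each list holds the people named in one fragment.
fragments = [["A", "B"], ["B", "C"], ["B", "D"], ["C", "D"]]
graph = build_graph(fragments)
eigen = eigenvector_centrality(graph)
between = betweenness_centrality(graph)
for person in sorted(graph):
    print(person, round(eigen[person], 3), between[person])
```

Plotting the two scores against each other, as in the scatter plot described above, separates actors who are well connected to other well-connected people from those who mainly act as bridges between groups.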

Federal Reserve borrowing during the 2007-2009 financial crisis

First, from looking at the date of my last substantive post, I owe everyone an apology. I have essentially let Zero Intelligence Agents wither on the vine, and that is terrible. Not so much because I think people are desperate to read it, but because I am desperate to get feedback from people on my projects and ideas.

One such project I have been working on recently is looking at the newly released data on Federal Reserve borrowing of 407 banks and companies during the 2007-2009 financial crisis. I have been looking for data sets to tell stories with because one of the tools I am eager to learn in 2012 is Michael Bostock’s d3.js, a JavaScript library for data-driven design (d3, get it?). It is an incredibly powerful tool, albeit very verbose and cumbersome for a total JavaScript newbie such as myself.

I decided to teach myself some d3 through this Federal Reserve data, and came up with this visualization in the labs section of drewconway.com. The image below is just a snapshot of the visualization; please click through to see the full interactive chart.

My setup

One way to increase your productivity is to see how other people get their work done. The blog The Setup provides this by asking, “What do people use to get stuff done?” If you would like to compare setups with an exceedingly eclectic group of people, then this is a very interesting resource.

If, for some reason, you are curious what I use to get my work done, you now have the opportunity to compare your setup with mine.

Interview on Data Without Borders

A few people have noticed that the blog has been very quiet lately. This is very true, and I am sorry for the lack of new material. Unfortunately, the only excuse I have is that I have been very busy with many other exciting projects. These include both my upcoming book on machine learning, and a separate research project on political speech on Twitter with John Myles White.

This also includes my work with Jake Porway on Data Without Borders. Data Without Borders (DWB) is a new initiative to match non-profits and NGOs in need of data analysis with pro bono data scientists who can help them collect, organize, analyze, and visualize that data. The group was founded when it became clear that there was a huge amount of energy and excitement in the global big data community, but that much of that energy was being diverted to less socially conscientious applications, like deal finders and advertisement placement optimization. DWB seeks to capitalize on that energy by arranging short- and long-term partnerships between socially conscious data scientists and NGOs.

Last week O’Reilly held the second installment of its Strata Conference, which brings together people from across the big data community. I had the opportunity to sit down with Mac Slocum last week to talk about Data Without Borders.

After the jump I have also embedded the keynote Jake Porway and I delivered at the Strata Conference on Data Without Borders. One of the things we were most excited about announcing at Strata is our upcoming Datadive in San Francisco. If you are in the Bay Area, please join us to help us do good with data!

Create an animated clock in R with ggplot2 (and ffmpeg)

Because it’s Friday—and I needed to create this for a separate visualization—here is how to create an animated clock in R using ggplot2.

In just about 20 lines of code! And here is the clock…

Data use policies and social media: an appeal

After my post a few weeks ago lamenting Twitter’s data use policies, many people reached out to me supporting my position and asking what they could do to help. One person was Mark Huberty, a fellow political scientist at UC Berkeley. Mark mentioned that there were many other social scientists who had similar experiences and were worried about its ramifications for research.

We decided the best way to proceed was to make an appeal to all researchers, not only social scientists, to gather examples of work, and stories, of how many disciplines are using this data to uncover new aspects of human behavior. This morning, Mark wrote just such an appeal to the POLMETH mailing list, and in an effort to make this appeal to a larger audience I have reproduced it below:

Members’ of Congress activity on Twitter

A few months ago I started a project with John Myles White examining the tweets of members of the U.S. Congress. For one reason or the other, the project slowly moved to the back-burner. Recently, we have started to reengage on the project, and I was looking through some of the work that we had done and found this interesting visualization I had made on the activity of various Congressional Twitter accounts over time and thought I would share it.
