Tonight the NYC R Meetup will be discussing data visualization in R using ggplot2. As part of tonight’s meeting I will be providing a very brief show and tell, which includes mostly code examples and external resources. This exercise has had me thinking quite a bit about data visualization. In addition, a few days ago the Security Crank (great new blog) pinged me on the apparent uselessness of network analysis visualizations in the defense and intelligence communities. As I say in my comment at SC, I agree; however, only in that the method is abused by those that view it as only a means to generate “pretty pictures.” All of this has touched off a very important point about data analysis; possibly the most important, which is how best to convey an analysis visually.
Consumers of data analytics are very rarely analysts themselves, so those in the business of generating plots, figures, charts, graphs, etc. must not only be expert in the analytical process, but also in choosing the best format and medium for relaying that knowledge to an audience. Admittedly, I am not Edward Tufte, Ben Fry, or David McCandless, but I have been around long enough to know what does and does not work, and as such here (in no particular order) are my five rules for data visualization.
- The viz must be able to stand alone
This I learned early, after being dressed down multiple times while giving briefings to senior intelligence officers. Since then it has been reinforced while sitting in on failed job talks and conference presentations. The important thing to keep in mind is that when an audience sees a visualization it should be providing answers, not generating more questions.
This, to me, is the most difficult aspect of creating high quality data visualizations. As the creators we are often intimately familiar with the data, and thus take its subtleties for granted. Some people recommend asking yourself “would my Grandmother understand this,” but why insult Grandma’s intelligence? Here’s the bottom line: you have to decide the most efficient means of plotting the data (we’ll get to this), then you have a chart title, legend, possibly some axis labels, and if you are bold a short (140 characters is a good limit) footnote to get your point across. The best visualizations only require a subset of these to be effective, but once you have added the appropriate data accoutrements the chart had better be self-explanatory. Very simple and imperfect example: restaurant tipping trends between men and women.


Why is the chart on the right better? First, it has more explanatory value. By splitting the data into two parts we are able to see the x-axis shift for men, i.e., in general they are tipping on higher bills. Also, we are able to use color in a more valuable way; rather than using it to distinguish between sex we can use it to highlight outliers and note general trends. Next, by reducing the amount of data in each plot the information is conveyed more efficiently. Finally, it achieves our ultimate goal, which is always to provide more answers than questions.
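A minimal ggplot2 sketch of the split-panel idea follows. It is not the code behind the figure above. It assumes the `tips` data frame that ships with the reshape2 package (columns `total_bill`, `tip`, and `sex`), and the derived `tip_pct` column and the label text are illustrative choices.

```r
library(ggplot2)

# Assumption: the tips data from the reshape2 package
data(tips, package = "reshape2")

# Tip as a share of the bill, so color can flag unusually large or small tips
tips$tip_pct <- tips$tip / tips$total_bill

p <- ggplot(tips, aes(x = total_bill, y = tip, colour = tip_pct)) +
  geom_point() +
  facet_wrap(~ sex) +   # one panel per sex instead of one color per sex
  labs(title = "Tip versus total bill, by sex",
       x = "Total bill", y = "Tip", colour = "Tip / bill")

print(p)
```

Because `facet_wrap()` shares the x-axis across panels by default, any shift toward higher bills in one panel is visible at a glance, and color is free to carry a second piece of information.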
- Have a diverse tool set
Learning the quirks and syntax of various data visualization tools is time consuming and often frustrating, but if you want to create impressive charts you have to do it. I am very sorry to report, but Microsoft Excel + PowerPoint do not generate the best data visualizations. In fact, they often generate visualizations in the 10-20th percentile of quality. The question, therefore, is: how do you find the best tools for your task?
Most of us will not have the resources to use professional data visualization suites, but even so these tools are often limited by the scope and vision of their creators. Explore the open-source and general purpose data visualization options out there, learn the three best that fit your needs, and always be open to learning the new stuff; it will pay off.
- People are terrible at distinguishing small differences
This could also be described as the “pie chart trap,” but clearly goes beyond that particular chart design. In fact, network visualizations are notorious for blurring subtle differences. For example, visualizations of massive amounts of social network data can be beautiful, but in nearly all cases they are much more art than science. If we are interested in telling a story with our data, and our data is large and complex, then we need to be creative about how to parse that complexity in order to enhance the clarity of our story. Example using networks: the structure of venture capital co-investments


The visualizations above examine the same data, and even use a similar technique to visualize it, but clearly the example on the right is conveying a more informative story. Admittedly, this visualization, which I generated, in many ways violates my first rule; however, it is still telling a story (e.g., there is a strong underlying structure among four notable communities of VC firms). The visualization on the left, taken from an initial attempt at analyzing this data, tells almost no story; save that the network is highly complex and there exist some disconnected firms.
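One common way to cut through that kind of complexity is to drop weak ties before plotting, so that only repeated relationships remain. The sketch below illustrates the idea in base R. It is not how the figure above was produced, and the edge list, firm names, and threshold are all hypothetical.

```r
# Hypothetical co-investment edge list: one row per pair of firms,
# with the number of deals they invested in together
edges <- data.frame(
  from  = c("A", "A", "B", "C", "D", "E"),
  to    = c("B", "C", "C", "D", "E", "F"),
  deals = c(5, 4, 6, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Keep only ties at or above a chosen strength,
# then list the firms that still have at least one tie
reduce_network <- function(edges, min_deals) {
  kept <- edges[edges$deals >= min_deals, , drop = FALSE]
  list(edges = kept, nodes = sort(unique(c(kept$from, kept$to))))
}

backbone <- reduce_network(edges, min_deals = 3)
backbone$edges
backbone$nodes
```

The reduced edge list can then be handed to whatever network package you use for layout and plotting. The right threshold depends on the data and the story, so it is worth trying several.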
- Color selection matters
This would seem to be a self-evident point, but it may be the most often violated rule of quality visualization. It seems the primary reason for this problem is laziness, as the default color schemes in many visualization packages were not designed to convey information (again, see the left panel of the figure above). I recently violated this rule while putting together the slides for tonight’s R meetup. Using a single line of R code I generated this chart:
data(whiskey, package="flexmix")
library(ggplot2)
ggplot(subset(whiskey_brands, Brand!="Other brands"), aes(x=Type, fill=Brand)) + geom_bar(position="fill")

In my defense, I was first excited that there was a built-in Scotch whiskey dataset in R, but I also wanted to show what could be done with a single line of code. Clearly, however, the color scheme I used is taking away from the story. The default color scheme in ggplot2 wants to use a gradient, which may be useful in some cases, but not here. To improve the above example I should override this default and construct a more informative color scheme, such as setting a base color for each Scotch type (e.g., blue for blends and green for single malts).
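A rough sketch of that fix follows, assuming `whiskey_brands` has the `Brand` and `Type` columns used in the one-liner above and that `Type` takes two values. The helper `type_palette()` and the hex colors are hypothetical and not part of ggplot2 or flexmix; `scale_fill_manual()` is the ggplot2 function for supplying your own fill colors.

```r
library(ggplot2)
data(whiskey, package = "flexmix")  # loads whiskey and whiskey_brands

brands <- subset(whiskey_brands, Brand != "Other brands")

# Hypothetical helper: give every brand a shade of its type's base color.
# `ramps` is a list of light-to-dark color pairs, matched to the
# alphabetically sorted types by position.
type_palette <- function(brand, type, ramps) {
  brand <- as.character(brand)
  type  <- as.character(type)
  types <- sort(unique(type))
  stopifnot(length(types) <= length(ramps))
  unlist(lapply(seq_along(types), function(i) {
    b <- sort(unique(brand[type == types[i]]))
    setNames(colorRampPalette(ramps[[i]])(length(b)), b)
  }))
}

# First type alphabetically gets blues, the second gets greens
brand_colors <- type_palette(brands$Brand, brands$Type,
                             ramps = list(c("#C6DBEF", "#08306B"),
                                          c("#C7E9C0", "#00441B")))

p <- ggplot(brands, aes(x = Type, fill = Brand)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = brand_colors)

print(p)
```

With this mapping the hue says which type a brand belongs to and the shade only separates brands within a type, so the legend reinforces the x-axis grouping instead of competing with it.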
- Reduce, reuse, recycle
When developing statistical models we are often striving to specify the most “parsimonious” model, that is, the model that has the highest explanatory value-to-required variables ratio. We do this to reduce waste in our models, enhance our degrees of freedom, and provide a model that is most relevant to the data. The exact same rules apply to visualizations. Not all observations are created equally; therefore, they may not all belong in a visualization. Those who are analyzing large datasets take data reduction (or “munging”) as given, but in any visualization if something is not adding any value take it out. Developing new and meaningful methods for reducing data is a serious challenge, but one that should be considered before any attempt at visualization is done.
On the other hand, if a reduction and/or visualization method has been successful in the past then it will likely be successful in the future, so do not be afraid to reuse and recycle. Many of the most successful data visualizers have distinguished themselves by creating a method for visualization and sticking with it (think Gapminder). Not only will it possibly make you famous, but putting in the effort to create a useful method for combining, reducing and visualizing data will mean your efforts are more streamlined in the long term.
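A reduction step that proves useful is worth wrapping in a function so it can be recycled across projects. As a small illustration, the helper below keeps the `n` most frequent categories and lumps everything else into a single catch-all level, much like the “Other brands” entry in the whiskey data. The function name and arguments are hypothetical, and the sketch uses only base R.

```r
# Hypothetical helper: keep the n most frequent values of x and
# replace all other (non-missing) values with a catch-all label
lump_top_n <- function(x, n, other = "Other") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  x[!is.na(x) & !(x %in% keep)] <- other
  # Order levels by frequency, with the catch-all level last if it is used
  factor(x, levels = unique(c(keep, if (any(x == other, na.rm = TRUE)) other)))
}

x <- c("a", "a", "a", "b", "b", "c", "d")
table(lump_top_n(x, n = 2))  # counts: a = 3, b = 2, Other = 2
```

Applied to a fill or color variable before plotting, a step like this keeps the legend short enough to read.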
So that’s it. Nothing too profound there, but I wanted to post this in order to start a conversation. In that vein, what did I miss and where do you disagree? As always, I welcome your comments.
The from the New York Times just posted an item which explains, according to them, why it took so long to figure out the Afghanistan Policy:
Had you posted this about 4 months ago, we’d have been 3.6 months into the new Obama Policy!
Drew Conway Reply:
December 4th, 2009 at 7:36 am
Goodness gracious that is ugly!
Units! Will someone think of the units! Are those people tipping in dollars, pounds sterling, or baht? Tens, hundreds, or thousands?
Enquiring minds want to know!
Drew Conway Reply:
December 4th, 2009 at 9:13 am
That is an excellent point, and thanks for raising it. I actually do not know, but my suspicion is they are in USD.
Excellent post, Drew – and thanks for pointing out Security Crank to me. A new blog to follow.
But, shame on you – Scotch “Whiskey”? If it’s Scotch, it’s “Whisky.”
Hi Drew, thanks for the post. You write really helpful stuff!
Here’s a question for you. I agree with point #4. Do you (or anyone else) know of a tool that helps choose distinct colors? For example, I have a stacked graph that contains 10 colored areas but as much as I try, I can’t seem to choose good colors and the default matplotlib colors repeat after about 8 or so.
Seems like there should be a palette website that can answer the question of: What would be n colors that are the most distinctive visually? For any given n.
Drew Conway Reply:
December 11th, 2009 at 8:42 pm
The best resource I have seen was suggested by @dwf and is at http://colorbrewer2.org/ He actually noted that the default color scheme for ggplot2 was very bad for the colorblind, and suggested I check out this site, which has been extremely useful.
I found this link a while back (Hackernews?) http://www.research.ibm.com/people/l/lloydt/color/color.HTM. It has some examples that describe how color can be used to accentuate (or muddle) your visualizations. @Dugolo, it seems like it’s not just about what colors you pick, but what you want the colors to do. Adobe’s Kuler can be a good place for inspiration, although your taste could conflict with your purpose. http://kuler.adobe.com
Don’t forget about red/green colorblindness. If you must use colors, do NOT include red and green in the same plot. And be consistent on what the colors represent. I use this to control colors in R:
my.palette <- colorRampPalette(c("white", "yellow", "orange", "red"), space="rgb")
Then, within whatever plotting mechanism I am using, I use:
col.regions=my.palette(20) # 20 is the number of colors interpolated along the ramp
This color scheme also works well if you have to print it in black and white. The shading makes sense.