Thanks to the R Bloggers aggregator I came across Yihui Xie’s post on a piece currently making the rounds about statistical analysis platforms. In The Next Big Thing, AnnMaria De Mars makes the argument that R—as a statistical computing platform—is not well suited for what she views as the next big things in data analytics: dealing with very large data sets, and creative visualization. She goes so far as to say that in this respect, R is an epic fail (emphasis below mine):
Contrary to what some people seem to think, R is definitely not the next big thing, either. I am always surprised when people ask me why I think that, because to my mind it is obvious…I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.
I have edited out a bit of R code that De Mars uses to illustrate her points: first, because the code itself has nothing to do with either big data or creative visualization; and second, because it contains errors and does not run. This, however, is rather beside the point. De Mars’s main point is that commercial statistical platforms, such as SAS, Stata and SPSS, are better suited to handle large data and visualization. Where to begin…
First, yes, R is a difficult language to learn. Even for those with an extensive programming background, its syntactical peculiarities and functional foundation can be difficult hurdles to clear. That said, R is a Turing complete language, which means once you learn the language, data analytics are bounded only by your imagination (and NP-completeness). To De Mars’s point then, it would seem a tautological fallacy that endeavors into the “next big thing” would be best nurtured in the fully limited environment of point-and-click commercial analysis platforms. Byron Ellis summarized this sentiment quite nicely on Twitter:
[R] is for making new things. Point and click is for redoing old things. I often need to make new things to analyze my data.
Well put. R allows users to build their own methods for analysis and feast on an ever-expanding catalog of libraries for any number of analytical needs; commercial products provide users only with the functionality their vendors deem fit. Next, with respect to the specific big things De Mars is concerned with (big data and visualization), R appears to be the hands-down winner.
Brendan O’Connor provided an excellent, though now somewhat dated, side-by-side comparison of several open-source and commercial data analytic platforms. One of his main complaints about all of these platforms is that they cannot handle data sets that do not fit on a single hard drive, i.e., really big ones. Since his writing, however, R now supports computing over a cluster with MPI or SNOW, and streaming to various map/reduce frameworks such as Hadoop. In addition, the sqldf library lets R users manipulate data with SQL, including data held in large relational databases.
Regarding R specifically, O’Connor notes that one of its main weaknesses is, “visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.” Now, however, there are several libraries that empower users to create extremely high-quality, publication-ready visualizations. Both lattice and ggplot2 provide unique visualization power, and are fully extensible to the needs of their users, again unlike the commercial platforms. In addition, R can be extended to work with Processing to generate extremely high-end interactive visualization. There are also several libraries that allow R to generate web-ready visualization with protovis. Some commercial platforms are able to perform high-performance computing, but to my knowledge, none matches R’s flexibility and quality of visualization.
I am wholesale perplexed by De Mars’s argument. While most software users are more comfortable with GUI platforms, it seems entirely unlikely that the next big data analysis “thing” would come from a world catering to the lowest common denominator. While clearly I am biased, that bias comes from experience on all of these platforms and dealing with problems of big data and visualization. For those looking for the next big thing, I highly recommend following the adoption of R and its relatives, and spending very little time concerned with the commercial platforms.
Finally, Matt Blackwell of the Social Science Statistics Blog has also weighed in, and I recommend his post.
UPDATE: Others have weighed in as well:
Tal Galili – “The next big thing”, R, and Statistics in the cloud
Joe Dunn – R Is (Not) the Next Big Thing
Colleagues,
I am a professor at the Monterey Institute in Monterey, CA and this year we did something rather radical. After more than 20 years, we shifted from SPSS to R for our (required) Data Analysis class for our Policy Studies Master’s students…this IS a “next” thing (I’m not sure how big it is…) this means that an entire generation of policy professionals that will be working for national governments, international organizations, large NGOs, and private corporations have developed their analytical skills entirely on an open-source platform…and they are not (and will not be) programmers…by now there’s a long enough list of more or less user-friendly interfaces that allow almost anyone to use it “extensively” without programming…
…just my two cents (may be just one and a half…)
~fernando
[Reply]
Drew Conway Reply:
April 15th, 2010 at 12:59 pm
That’s fantastic! What were your reasons for making the switch? Open-source? Extensibility? Power? Love for letters appearing late in the alphabet?
[Reply]
Fernando DePaolis Reply:
April 15th, 2010 at 6:33 pm
first and foremost, open-source…but beyond that it was a challenge for the faculty involved…as JFK said, “we are doing this and other things not because they are easy, but because they are hard…” We believe it’s mature enough for prime-time, with very reasonable GUIs, so we gave it a try, and so far, we’re not regretting it…
[Reply]
Very articulate post, Drew. I think that there is clearly a need for point and click tools. Tableau is a good example of a well executed data exploration GUI. I used Tableau for more than a year and it quickly became obvious that, while it’s a great platform for visual slice and dice (classic BI), it’s not a good platform for modeling as it supports no programming language. In addition, when I would do a bunch of slice and dice and discover some interesting relationship which I wanted to build into my reporting system, I would have to recreate a lot of data manipulation because Tableau does not support exporting the SQL used to extract the underlying data for its graphs.
In no way am I down on Tableau. I think it’s a fantastic tool. And I invoke its name here simply as an example of a great GUI based data tool. But the limitations of a “GUI only” tool become obvious very quickly. But this only becomes obvious to someone who is a technician working with data daily. In that regard, I think Dr. De Mars revealed a considerable amount more about her experience than she did about software.
I’d like a fridge that would fix me a sandwich and pour me a beer every night. And Dr. De Mars would like the next big thing to be something that made hard work easy. Me too, Dr. De Mars. Me too.
[Reply]
As long as you can 100% use the GUI, she may have a point, but if you have to actually type any SAS statements, you end up using a syntax from the 1970s and punch cards (including things like DATA statements and all kinds of syntactical inconsistencies). Hardly “the future”. And if you need to do any programming in SAS, you resort to a different macro programming language.
That is the Achilles heel of open source in general: not-so-great GUIs. One reason why Linux has taken so long to gain the non-techie-user traction that has been “around the corner” for years.
[Reply]
Tal Galili Reply:
April 16th, 2010 at 11:11 am
One platform that I think did a great job on it is WordPress.
They just hired someone to help with usability testing for redesigning its back-end. And she did a great job.
[Reply]
It’s interesting she cites ease of use as a strength of SPSS and SAS. While I have limited experience with SAS, Stata, and other packages, I’ve had the unfortunate opportunity to be cajoled into both teaching SPSS to undergraduates and using it to do data analysis for a state criminal justice agency; both were entirely painful.
I think Dr. De Mars associates point and click with ease of use; however, for anyone who has ever used SPSS, the point and click interface is anything but easy. SPSS has a menu structure that isn’t particularly easy to navigate, indecipherable options to change, and counter-intuitive defaults for many actions. In an attempt to remedy this, SPSS does provide a syntax language for people with programming experience to be more flexible. I have used many programming languages (C, C++, Lisp, Haskell, Python, Matlab/Octave, R, Java, Ruby, Visual Basic, Fortran), and none is as poorly designed and hard to use as SPSS Syntax. The most glaring difficulty is the complete lack of support for indexing of any kind. There was nothing worse than realizing I was spending more time teaching my students how to handle the quirks of SPSS than how to actually use the program for the kind of analysis they already knew. This was compounded when the whole program was transferred over to the Java framework. Nothing like taking a product which is not particularly fast to begin with and putting another layer of complexity between it and the CPU. This is most apparent when one tries to sort a moderately large data set (a million cases): checking the time and watching the status bar seems to indicate that the program is probably doing an improved bubble sort (though I have no way of telling). With SPSS, I tell it to sort, and go grab a cup of coffee and chat up my office mates. Sorting the same data set with R, I barely have time to check my email.
Also, has she seen the graphs SPSS produces? I refuse to call them visualizations, because frankly they are terrible. Perhaps I am just too unfamiliar with the capabilities of SPSS, but I generally take the data produced by SPSS, and put it in another program to create a graph, even OpenOffice Calc does a better job.
I like R, but I am a computer scientist, so maybe I’m biased. I do know, however, that nobody who has ever had to use SPSS for extended periods of time actually likes it.
[Reply]
gg … stands for Grammar of Graphics … from what I know, work started by Leland Wilkinson when he was with SPSS … Statistics renders GPL (an implementation of it)
I have met many guys interested in R in Academia that finally fall into SAS or SPSS due to lack of power to handle big data sets
Check R-evolution for a possible way of tackling things…
With IBM besides SPSS now who knows what the future may look like…but expect some push on the development side
[Reply]
sam Reply:
April 18th, 2010 at 1:38 am
Can SAS or SPSS hook in with Hadoop? NO! There are many ways of handling big data in R. If you try and load in a several GB csv, don’t blame R; blame the person who thought that was the smart way of doing it.
[Reply]
I disagree that someone necessarily has to be a proficient programmer to benefit from using R.
Consider that the general statistical procedures, that most users should be familiar with, can be carried out with very simple commands.
For example: t.test(), prop.test(), anova(), lm(), plot(), summary().
Those six commands cover a huge chunk of any undergraduate stats course.
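As a rough illustration of how little typing those commands involve, here is a minimal sketch. It assumes R's built-in mtcars data set, and the particular variables and tests are chosen only for demonstration; they are not drawn from any specific course.

```r
# Minimal sketch: the six commands above applied to R's built-in mtcars data.
# The data set and variable choices are assumptions made for illustration.
data(mtcars)

# Two-sample t-test: does fuel economy differ by transmission type (am)?
tt <- t.test(mpg ~ am, data = mtcars)

# Test whether the proportion of manual-transmission cars differs from 0.5.
n_manual <- sum(mtcars$am == 1)
pt <- prop.test(x = n_manual, n = nrow(mtcars), p = 0.5)

# Linear regression of fuel economy on weight, then its summary and ANOVA table.
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)
fit_anova <- anova(fit)

# Scatter plot of the raw data with the fitted regression line added.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

# Print the results of each analysis.
print(tt)
print(pt)
print(fit_summary)
print(fit_anova)
```

Each call takes a formula or a couple of arguments and returns an object that prints its own report, so none of this requires writing loops or functions.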
[Reply]
She is just an amateur point and click statistician. She provides weak and misleading evidence as her examples. Has she actually seen the graphics SAS outputs? Or the fact that the SAS language makes no sense. It is really sad that she had to open her mouth and provide nothing.
[Reply]
Saying Turing complete, NP etc. is just a cliche. R is good for small-to-medium datasets and exploratory programming. I use R for my data mining research, and I like it. I worked in a big research company before; there were 6 Matlab licenses, and a lot of researchers used R when Matlab was not available.
But I don’t think the current open source packages of R could handle large datasets effectively. Stable large-scale computing always needs money and a lot of human effort. I have some experience working in a cluster. A lot of system factors need to be considered: the network bandwidth, how to reduce the data passing between different machines. These are system problems, sometimes extremely hard for statisticians. Companies don’t use R for big things, not because they are rich, but because maintaining R costs more money and time than using a piece of commercial software.
This situation will get better and better as the number of R users and experts increases.
[Reply]
Drew Conway Reply:
April 19th, 2010 at 3:47 pm
I thought saying Turing complete and NP-complete were more pedantic than cliche, but then again, I suppose this comment is too.
I think the assertion that maintaining large databases in open-source platforms is more expensive is flatly false. This is supported by the fact that places generating huge amounts of data, e.g., Twitter, LinkedIn, Facebook, etc., are all on open-source DB platforms and use R for analytics.
For more depth check out Kevin Weil’s talk on data at Twitter from Chirp http://www.slideshare.net/kevinweil/big-data-at-twitter-chirp-2010
[Reply]
Jason Reply:
May 28th, 2010 at 12:48 am
Drew:
Google and Facebook never claim to use R for large scale work. In fact, at an R meetup last year in SF, representatives from both Google and Facebook explicitly stated that they DO NOT use R for production work but only for exploratory analysis, due to R’s problem with big datasets. If you want to hear what they actually said, check out the meeting video at the following link:
http://dataspora.com/blog/predictive-analytics-using-r/
With regard to R’s problem with larger-than-memory datasets, check out the paper by Ross Ihaka (who developed R side by side with Robert Gentleman), in which he and Lang proposed a new generation of stat language to succeed R and solve its inefficiency:
http://www.stat.auckland.ac.nz/~ihaka/downloads/Compstat-2008.pdf
[Reply]
Drew Conway Reply:
May 28th, 2010 at 8:34 am
Yes, I know they aren’t using it for production. I had a chance to meet those same Google and Facebook folks when I came out to give a talk at the SF R Meetup last Fall!
Your point is well taken: yes, R is not a good tool for very large data sets. I suppose I will just follow JD’s lead below from now on.
jd long Reply:
April 19th, 2010 at 6:34 pm
I’ve changed my tune completely. If everyone figures out how to scale R for large data then I lose a competitive advantage. So from now on I think I am going to simply respond “you are correct. R is a toy for small datasets. Please pay no attention to it.” Then I can count my cash with very little interruption as I underbid firms with huge analytic costs.
[Reply]
I would argue that “the next big thing” is a really powerful and useful wrapper for R that does not just include a good GUI, but also makes it easy to find help, describes the syntax in relatively simple terms, and can make basic high-quality graphics easily. While Joseph Dunn argues that R is in the “Linux, circa 1998” stage, I would argue that Linux is still not “the next big thing.” Instead, that honor (judging since 1998) went to Apple. Why? They made a very attractive, intuitive operating system wrapped around a Unix-based system that allowed flexibility for those who desired it and ease-of-use for those who don’t need it.
R is very difficult to learn, or possibly I am very slow. But, I have taught myself basic PHP and Python, learned how to program using Stata and still have a difficult time trying to learn R, so I don’t think that I have the slowest uptake in the world. On the other hand, I hope to learn it before I start teaching so that I can save students $400+ on the cost of software; but, I think that it is important to be aware that not everyone has the same programming chops that those who use R (or Linux for that matter) on a regular basis have and that it is important to be sensitive to that.
[Reply]
Mike – you might find the open source stats/reporting package SOFA Statistics useful. It might not meet all your students’ needs but it could be a useful tool to add to their kit. I note you have learned Python. SOFA Statistics is Python all the way down which is a big benefit should you ever want to sidestep the GUI and create some scripts. Disclosure – I am the lead developer of SOFA Statistics.
[Reply]
Thanks Drew. As usual, an interesting post.
I wonder if there’s much to be said for the fact that Eviews 7 allows users to programme in R or MATLAB? (Of course, you’d need to have R or MATLAB installed; I’m not sure if it will work with Octave.) My guess is that it adds a huge degree of functionality to Eviews while maintaining the easy point-and-click features most students or other low-end users actually use it for. For example, the copy of Eviews I use at work (6) can’t create a null-space matrix. Damn.
It’d be interesting to hear if there are any other proprietary systems that are incorporating R? I am pretty excited about having this new release put on my work computer; I work in a government department, and they’ve never been very keen to install any software that doesn’t cost thousands of dollars.
Glad to see our taxes spent well…
[Reply]
Part of the controversy is that people mean different things when they say ‘large data’. I’ve even seen people describe a data set with a few tens of thousands of rows and a few hundred columns as ‘huge’. Another part is that the suitability of R depends on the type of algorithm you need.
If you have only a few tens of gigabytes of data, as in genomic studies, there is no difficulty in working in R, and we do. If you have petabytes you are likely to be out of luck. Also, if you have petabytes of data, computation time is no longer that cheap relative to programmer time, so you should be building specialized tools.
If you are doing mostly matrix-based statistical computations, R does well. For a lot of graph or string algorithms you need different data structures (eg suffix trees for genetic sequence lookup). These will need to be written in something else, though they could then be given an R interface. Similarly, R is structurally unsuited to interactive graphics (the basic graphics design predates Space Invaders), though there are some tolerable workarounds.
So, there are plenty of real situations where R isn’t a good tool. These are usually also situations where SAS or SPSS or Stata would be less suitable.
The real advantage of SPSS and Stata is that they are easier to learn. This is important: if statistical computations are hard to do, they will tend to be done by people who know more about statistical computation and less about the data and the scientific questions, which is a Bad Thing.
This is not to deny that R sucks in many important ways. It just sucks less than most of the current alternatives, for a fairly wide range of problems. I would say R is the Current Big Thing for statistical analysis, and I hope something different is the Next Big Thing.
[Reply]
Surely the whole point of something being the next big thing relies on it being universally adopted. In this regard R is not easy to use and therefore will always be the domain of the techie, while SPSS is much more straightforward and therefore lends itself nicely to being adopted throughout the organisation. SAS can be easy to use through the tools, but the SAS community is filled with techies who still build everything in code and pooh-pooh the end users’ tool kit. Analytics is still in the backwater and always will be until this situation is resolved.
Having been in IT for over 30 years, I have seen this before on many occasions. Does anyone remember the best database in the world at the time, FoxPro? Sunk by its own community who wanted to maintain their positions as guardians of the black box, and of course their contracts.
[Reply]