Python is the greatest thing to happen to computer science since the Turing Machine! Well, no, but it has inspired me into a personal renaissance for software writing. Its flexibility, widespread community support, and leveraging of legacy C and Fortran code also make it an outstanding language for social science researchers.
If you are a new researcher looking to get started, or experienced and willing to walk away from your [:,:] lifestyle in Matlab (along with its licensing and training fees), then equip yourself with these packages and get to it!
- NumPy
- SciPy
- Matplotlib
- NetworkX
- PyMC
- SimPy
- SymPy
- html5lib
- Pycluster
- cjson
- Pyevolve
- MySQL for Python
- RPy2
NumPy, short for Numeric Python, is the cornerstone of Python’s mathematics and statistics operations. All scientific computing in Python starts and ends with NumPy!
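To give a feel for what that looks like, here is a minimal sketch using standard NumPy calls. The survey scores are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: five respondents (rows) answering three survey items (columns).
scores = np.array([
    [3, 4, 5],
    [2, 2, 4],
    [5, 5, 5],
    [1, 3, 2],
    [4, 4, 3],
])

item_means = scores.mean(axis=0)               # mean of each column (item)
respondent_totals = scores.sum(axis=1)         # sum of each row (respondent)
item_corr = np.corrcoef(scores, rowvar=False)  # correlation matrix between items

print(item_means)
print(respondent_totals)
print(item_corr.shape)
```

Note that everything is expressed as whole-array operations; there is no explicit loop over respondents.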
Download NumPy
SciPy, short for Scientific Python, is the little brother of NumPy, as it relies on NumPy data types for its operations. To distinguish itself, SciPy adds several of its own sophisticated data types, along with integration and optimization techniques. Many of the packages that follow rely on some combination of NumPy and SciPy.
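As a rough sketch of the integration and optimization side, the snippet below integrates the standard normal density numerically and minimizes a toy quadratic. Both functions are illustrative choices, not anything specific to a research problem.

```python
import numpy as np
from scipy import integrate, optimize

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Numerically integrate the density over [-1.96, 1.96].
area, abs_error = integrate.quad(normal_pdf, -1.96, 1.96)

# Minimize an illustrative quadratic whose minimum sits at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

print(round(area, 3))
print(round(result.x, 3))
```

The integral should come out close to the familiar 95% coverage of a two-sided normal interval.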
Download SciPy
The third tine on Python’s scientific trident, Matplotlib (pylab) is the standard for 2D plotting. It is highly extensible and will display your results just the way you like ‘em.
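A small sketch of a typical 2D plot follows. The output file name `curves.png` is a hypothetical choice, and the non-interactive `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Two curves on one set of axes")
ax.legend()

fig.savefig("curves.png", dpi=100)  # hypothetical output file name
```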
Download Matplotlib
NetworkX is the package that motivated me to learn Python. It is the best tool for analyzing network data, period. For novice social network analysts and graph theorists, the learning curve will be steep, but taking the time to learn NX will save you from wasting your time with other, inferior tools. Oh, and for those of you with accreditation concerns, its Subversion repository is maintained by Los Alamos National Laboratory.
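Here is a minimal sketch of a NetworkX session on a toy friendship network. The names and ties are invented for illustration.

```python
import networkx as nx

# Hypothetical friendship ties among five people.
G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"),
    ("Bob", "Cat"), ("Dan", "Eve"),
])

degree = nx.degree_centrality(G)            # share of other nodes each node is tied to
betweenness = nx.betweenness_centrality(G)  # how often a node lies on shortest paths
path = nx.shortest_path(G, "Bob", "Eve")

print(max(degree, key=degree.get))
print(path)
```

Ann has the most ties, and the only shortest route from Bob to Eve runs through Ann and Dan.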
Download NetworkX
This one is for all of you Bayesian/MCMC modelers out there. PyMC implements the Metropolis-Hastings algorithm as a Python class, providing flexibility when building your model. PyMC is also highly extensible, and well supported by the community.
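To show the idea behind Metropolis-Hastings, here is a bare-bones random-walk sampler written with only the standard library. This is a sketch of the algorithm itself, not PyMC's API; the function name `metropolis_hastings`, its parameters, and the normal target are all assumptions made for illustration.

```python
import math
import random

def metropolis_hastings(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

def log_normal(x):
    """Unnormalized log density of a normal with mean 2 and standard deviation 1."""
    return -0.5 * (x - 2.0) ** 2

draws = metropolis_hastings(log_normal, start=0.0, n_samples=20000)
kept = draws[2000:]  # discard the first draws as burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

The sample mean should land near 2; a real model would swap in its own log density and check convergence far more carefully.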
Download PyMC
Short for “Simulation in Python”, SimPy is an object-oriented, process-based discrete-event simulation language, making it a wholesale agent-based modeling environment written entirely in Python. While not as robust as REPAST or NetLogo, SimPy provides an excellent tool set for designing experiments, and because it is pure Python, the data can be fed to other analytical packages.
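For readers new to discrete-event simulation, the core mechanism is a time-ordered event queue. The sketch below builds one with the standard library's `heapq`; it does not use SimPy's API, and `simulate_queue` and its arguments are hypothetical names chosen for this example.

```python
import heapq

def simulate_queue(arrival_times, service_time):
    """Single-server, first-come-first-served queue.

    Returns each customer's waiting time and a time-ordered event log.
    """
    events = [(t, i, "arrive") for i, t in enumerate(arrival_times)]
    heapq.heapify(events)
    server_free_at = 0.0
    waits = {}
    log = []
    while events:
        now, customer, kind = heapq.heappop(events)  # always the earliest event
        log.append((now, kind, customer))
        if kind == "arrive":
            start = max(now, server_free_at)         # wait if the server is busy
            waits[customer] = start - now
            server_free_at = start + service_time
            heapq.heappush(events, (server_free_at, customer, "depart"))
    return waits, log

waits, log = simulate_queue([0.0, 1.0, 2.0, 10.0], service_time=3.0)
print(waits)
```

Customers arriving while the server is busy queue up, while the late arrival at time 10 is served immediately.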
Download SimPy
Not to be confused with the previous entry, SymPy is a full-featured Python library for symbolic mathematics. Oliver suggested I add Sage to the list, which is an excellent tool, but SymPy offers much of the same core functionality (algebraic evaluation, differentiation, expansion, complex numbers, etc.) in a pure Python distribution. This package is great for researchers who want symbolic mathematics support but have no access to mega-expensive computer algebra systems like Mathematica.
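A quick sketch of the kinds of operations listed above, using standard SymPy functions on a toy expression:

```python
import sympy as sp

x = sp.symbols("x")
expr = (x + 1) ** 3

expanded = sp.expand(expr)     # multiply the cube out into a polynomial
derivative = sp.diff(expr, x)  # differentiate with respect to x
roots = sp.solve(x**2 + 1, x)  # roots of x**2 + 1 are complex

print(expanded)
print(derivative)
print(roots)
```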
Download SymPy
UPDATE: How to use Python and SymPy to solve optimization problems.
After the fall of BeautifulSoup, I was desperate for a web data parser that equaled soup’s flexibility and ease of use. Enter html5lib. If you need to download and organize large amounts of data from the Internet in a quick and easy way, then html5lib is the only package you will need. This module also supports the BeautifulSoup tree type, as well as many others, making it incredibly useful across a wide range of tasks. To take advantage of its power, you will need a little background in HTML (or XML, if that happens to be what you are parsing), but there are many tutorials available online to get you up to speed quickly.
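To illustrate the general task of pulling structured data out of a page, here is a small link extractor. Note the swap: it uses the standard library's `html.parser` as a stand-in rather than html5lib, so it runs with no extra installs. The `LinkCollector` class and the sample page are invented for this sketch.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that carry no link
                self.links.append(href)

# Hypothetical page content.
page = """
<html><body>
  <p>Profiles: <a href="/users/1">Ann</a> and <a href="/users/2">Bob</a></p>
  <a name="no-link">anchor without href</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

A dedicated parser such as html5lib is the better choice for messy real-world markup; the workflow of parsing and then walking the tags is the same.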
Download html5lib
There are many clustering algorithms available for Python, but many of these packages are designed to cluster one-dimensional data. Data collected by social scientists, however, is often of a higher dimension. Enter Pycluster. This package contains efficient implementations of hierarchical and k-means clustering, with several options for measuring distance. I am still waiting for a clever binding to Matplotlib to draw the dendrogram, but in the meantime you can use their Java program TreeView to display the result.
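As a sketch of hierarchical clustering on multi-dimensional data, the example below uses SciPy's `scipy.cluster.hierarchy` module as a stand-in, not Pycluster's own API. The six data points are made up so that two groups are easy to see.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: six respondents measured on three variables,
# forming two clearly separated groups.
data = np.array([
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.2],
    [1.1, 0.8, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [4.9, 5.0, 5.3],
])

tree = linkage(data, method="average", metric="euclidean")  # build the hierarchy
labels = fcluster(tree, t=2, criterion="maxclust")          # cut it into two clusters
print(labels)
```

The `metric` argument is where a different distance measure would be plugged in.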
Download Pycluster
The cjson module implements a very fast JSON encoder/decoder for Python. JSON (JavaScript Object Notation) is useful for many things, but most notable for social scientists is how many social networking sites use JSON to encode public data about their users and their users’ relationships. JSON is also what is returned by Google’s SocialGraph API, so cjson allows researchers to feed this social network data directly into Python data types.
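The sketch below shows the JSON-to-Python round trip using the standard library's `json` module as a stand-in for cjson. The payload is a made-up example in the style a social networking API might return.

```python
import json

# Hypothetical payload describing one user and their ties.
raw = '{"user": "ann", "friends": ["bob", "cat"], "followers": 42}'

profile = json.loads(raw)  # JSON text -> Python dict
edges = [(profile["user"], friend) for friend in profile["friends"]]

print(edges)
print(json.dumps(profile, sort_keys=True))  # Python dict -> JSON text
```

Once decoded, the edge list is ordinary Python data that can be handed straight to a network analysis package.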
Download cjson
Pyevolve is a complete, pure Python genetic algorithm framework. I am wearing my computer science background on my sleeve with this one, but for people serious about designing pure Python agent-based models, Pyevolve provides the tools to create intricate experimental environments.
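For anyone who has not met genetic algorithms before, here is a toy one written with only the standard library. It is a sketch of the general technique (tournament selection, one-point crossover, bit-flip mutation, elitism), not Pyevolve's API; `evolve` and all of its parameters are assumptions chosen for illustration.

```python
import random

def evolve(n_bits=20, pop_size=30, generations=60, mutation_rate=0.02, seed=1):
    """Evolve bit strings toward all ones (the classic OneMax toy problem)."""
    rng = random.Random(seed)
    fitness = sum  # fitness of a bit string = number of ones it contains
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]

    def tournament(pop):
        """Pick two individuals at random and keep the fitter one."""
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_gen = [max(population, key=fitness)]  # elitism: carry over the best
        while len(next_gen) < pop_size:
            mom, dad = tournament(population), tournament(population)
            cut = rng.randrange(1, n_bits)         # one-point crossover
            child = mom[:cut] + dad[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]             # bit-flip mutation
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)

best = evolve()
print(sum(best), "ones out of", len(best))
```

A framework earns its keep once the genome, fitness function, and operators become more elaborate than this.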
Download Pyevolve
A pure Python binding for MySQL, allowing the user to integrate MySQL execution into any Python script. Very straightforward and simple to use, and since many social science data sets are stored on MySQL databases, a necessity.
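Python database drivers generally follow the same DB-API pattern of connect, cursor, execute, and fetch. The sketch below demonstrates that pattern with the standard library's `sqlite3` and an in-memory database as a stand-in, so it runs without a MySQL server. The table and its rows are invented, and a MySQL driver would take connection details and may use a different placeholder style than `?`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real database server
cur = conn.cursor()

# Hypothetical survey table.
cur.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, country TEXT, age INTEGER)"
)
cur.executemany(
    "INSERT INTO respondents (country, age) VALUES (?, ?)",
    [("US", 34), ("US", 51), ("DE", 28)],
)
conn.commit()

cur.execute(
    "SELECT country, AVG(age) FROM respondents GROUP BY country ORDER BY country"
)
rows = cur.fetchall()  # list of tuples, ready for further analysis in Python
print(rows)
conn.close()
```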
Download MySQL for Python
Updated 4/6/2009: I have been negligent; as pointed out in the comments, RPy has functionally been replaced by RPy2.
There are very few statistical calculations that the combination of NumPy and SciPy cannot handle, but there are NO statistical operations R cannot do. RPy2 is a simple Python interface for R, able to execute any R function from within a Python script.
Download RPy2
I should also note that most (maybe all by now) of these packages come standard with the Enthought distribution of Python. If you are interested in using Python as a platform for scientific research, I highly recommend installing this distribution, which is free for academics.
awesome!
Thanks for the PyEvolve link. I’ve been looking for something like this for quite a while. Agent-based simulations are very interesting indeed. Thanks.
I am glad the list was helpful. I wish there were a PyGeneticProgramming counterpart to PyEvolve so models could be built that blended in GP’s sophisticated agent evolution, but alas, no one has done the heavy lifting for me yet.
If there is enough interest, I will post another update in the near future with a few additional packages I have run across since the last update.
Thanks for the packages. I used to think NumPy was only for high-level scientific calculations. And thanks also for NetworkX; I initially worked with Graphviz, but this baby fits right into my programming environment.
Although visualizations are one of NetworkX’s weaknesses, it does have a very nice graphviz binding if you want to display your analysis.
I would also add Sage, which integrates most of those tools.
Oliver, I totally agree. In my next update I am going to add a great Python package for symbolic mathematics that incorporates much of the functionality of Sage.
I’d add IPython: http://ipython.scipy.org/moin/
It’s an amazing replacement for the vanilla shell, with auto-completion, pdb integration to drop you on the stack of a previous exception, pretty printing, session history, and much more.
Really, I can’t work without it.
Thanks for the addition, Bruno. I agree, IPython is a good tool, but in terms of IDEs I have a hard time getting away from IDLE. I have tried others, and people harp on how many better options there are, but I can’t seem to shake it.
If I had to switch though, and since I am a MacPython user, it would be to TextMate.
Hello Conway, I’m the author of PyEvolve; thank you for mentioning the framework. Good luck with your social studies, and if you need any help, just mail me.
Perone, thanks, and I just updated my PyEvolve to the new version–looks great!
One small update to your list. Rpy has practically been replaced by Rpy2, which uses slightly different syntax but feels more robust. It is the actively developed strain of Rpy. You can find the links to Rpy2 from the Rpy page.
Very helpful list, thanks for posting!
A question: I like both R and Python a lot but would love to “standardize” on one for my modeling/number-crunching needs. The hassle of remembering two sets of details is getting to me. Any words of wisdom on making this choice? Thanks.
RR,
I am very biased here, since Python pretty much brought me back from the brink of giving up software coding. That said, Python is an extremely versatile and easy-to-learn language, so I would recommend focusing and building your skills there.
If you find that there is something you can’t do in Python, and there is an easy solution in R, just run it through RPy.
Thanks for the quick response, Drew.
Dear Drew,
Thanks for this compilation of Python tools. Fantastic! I have a question: do you have any experience building/executing agent-based models in Python? I have always been afraid that the ease of programming in Python will not compensate for the slower execution speed (as usual with interpreters). Regards.
Thanks for this. I have been relying heavily on Python and R for years, in and out of school, in the field of quantitative/molecular genetics of livestock. Recently Python has become more and more of a core component of my workflow, with RPy for the plotting/graphing and modeling components.
Great article. I’ve recently embarked upon a bit of complex networks research with a former colleague of mine and I am currently surveying the tools available so this article is particularly timely for me. I was looking into using the Network Workbench and Gephi for most of my analysis, but I would love to have something like NetworkX that I could use from Python just for the ability to automate much of the drudgery. If you’ve got the time to reply, I’d love to read some of your thoughts on why you think NetworkX is superior to some of the tools you listed above with a specific emphasis on the Network Workbench since that is one of the tools that I am considering as well.
I do plenty of genomic data mining with Python. I need the statistical functions of R, but unfortunately R is a PAIN to learn and work with. I still don’t understand its data structures, and its scripting language is archaic. I will definitely look into RPy2.
I should mention that it is important to contribute to Python libraries as well. My personal goal would be to bring Python’s statistical capabilities up to the level of SPSS or Systat.