How to: Use Python to Collect Data from the Web – Part 2: Saving the Data

Yesterday, I introduced a data collection problem and explained how to use Python to retrieve and parse the data in question from the web. To review, I wanted to download specific statistics on all of the New York Giants’ 2009 draft picks, and then save that data to a spreadsheet. I created a CSV file with their names, some additional information, and empty columns for height, weight, D.O.B. and college. Using Python’s built-in web interface, along with the html5lib HTML parser, I described how to retrieve the player’s profile URLs, downloaded the data, and parsed it into a Python dict indexed by the player’s name.

Our final task is to save this data back to the spreadsheet for storage. Fortunately, most of the hard work for this project was done in part 1. With the data stored as a Python dict, we will need only the built-in csv module to add the new data. This final step is accomplished with two simple Python functions.

The script takes three pieces of data: the dict of parsed web data, the file path the the old spreadsheet, and the file path to the new completed spreadsheet. It may seem duplicative to create a new spreadsheet with all of the data, rather than overwrite the original file with the new entries. This, however, is done as a precaution against error, and should be used as a base practice. When coding, it is always better to have too many copies of something rather than not enough!

The writing of rows to the new spreadsheet is very straightforward. First, we write the column headers from the the old file, which is done in lines 7-11. Then, we fill in the cells by matching the old data with the new in the appropriate order in lines 16-24. The first three cells of each row are from the old file, and the remainder come from parsed web data.

There is, however, one problem with the parsed web data that must be addressed before the data can be saved properly. The height data is in a format that causes problems for the CSV format; specifically, the “feet-inches” formatting will be interpreted as a date, such that 6-1 will be converted to 6/1/2009 by a spreadsheet program, such as Excel. To avoid this, we create a small helper function that converts the height format to feet’inches”, which will not cause this error.

After running the script and getting confirmation that the file has been written successfully, our new and completed spreadsheet looks like this:

Success! We have completed the task of saving the parsed data to a spreadsheet and are now done. I hope this two part series has helped you better understand some of the basic Python techniques and best practices for collecting data from the web. The code from this tutorial has been updated in the ZIA Code Repository, and is available for you to download.

As always, I welcome any questions or comments you have on either part of this tutorial.

Update: Michael Bommarito of Computational Legal Studies proposes a very elegant alternative to this example.

Photo: The Football Expert


Automatically Generated Related posts:

  1. How To: Use Python to Collect Data from the Web – Part 1: Parsing Data
  2. How to: Perform Multivariate Regression with Python
  3. How to: Use Python to Solve Optimization Problems
  4. How to: Use Python and Social Network Analysis to Find New Twitter Friends
  5. How Python Can Help Political Methodologist

2 comments to How to: Use Python to Collect Data from the Web – Part 2: Saving the Data

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Technorati Profile