How to Download & Unzip Zip Files in Python
Processing a large amount of data on a local machine can be a hassle sometimes, especially if our primary purpose is rather exploratory. I ran into this problem when I wanted to see the overall trend of open policing data.
The Stanford Open Policing Project gathers data on vehicular and pedestrian stops made by the police across the country. The data is well organized, with a separate dataset for each location. Rather than collecting every file and building one large compilation of all available records, I needed only part of each dataset and could discard the rest.
So here I outline how I batch-downloaded multiple zip files from their website, extracted the CSV file from each, filtered it to a specific timeframe, and merged the results into a single data frame.
Web Scraping Links
First, we will collect all the download links available on their site. There are separate files for each available location.
BeautifulSoup allows searching within the HTML page by tag and attribute. Here, we are looking for all <a> tags with the title of 'Download data as CSV', then extracting the href attribute from each of these tags. This gives us the links to all zip files on this page.
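The link collection described above could look roughly like the sketch below. It is not the original code: the page URL and the function names extract_zip_links and fetch_zip_links are assumptions made for illustration, and the title text is taken from the description above. The download uses the standard library's urllib.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumed location of the download page; adjust if the site layout differs.
DATA_PAGE_URL = "https://openpolicing.stanford.edu/data/"


def extract_zip_links(html, base_url=DATA_PAGE_URL):
    """Return the href of every <a> tag titled 'Download data as CSV'."""
    soup = BeautifulSoup(html, "html.parser")
    # Keyword arguments filter on attributes; href=True keeps only tags
    # that actually carry an href attribute.
    anchors = soup.find_all("a", title="Download data as CSV", href=True)
    # Resolve relative links against the page URL.
    return [urljoin(base_url, a["href"]) for a in anchors]


def fetch_zip_links(page_url=DATA_PAGE_URL):
    """Download the page and collect the zip links found on it."""
    with urlopen(page_url, timeout=30) as response:
        html = response.read()
    return extract_zip_links(html, base_url=page_url)
```

Calling fetch_zip_links() would then return a list of URLs, one per location, assuming the page still marks its download links with that title.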
Download, unzip and compile
Now, for each link we collected above, we repeat the following steps: open the link, unzip the zip file, read the CSV as a Pandas data frame, and merge it with the other datasets.
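A minimal sketch of this loop follows, under a few assumptions: each archive holds at least one CSV file, the CSV has a date column (named "date" here, which is a guess), and "merging" means stacking rows with pd.concat. The helper names read_csv_from_zip, filter_by_date, and compile_datasets are hypothetical. The archive is unzipped in memory, so nothing is written to disk.

```python
import io
import zipfile
from urllib.request import urlopen

import pandas as pd


def read_csv_from_zip(zip_bytes, **read_csv_kwargs):
    """Read the first CSV member of an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no CSV file")
        with archive.open(csv_names[0]) as csv_file:
            return pd.read_csv(csv_file, **read_csv_kwargs)


def filter_by_date(df, start, end, date_col="date"):
    """Keep rows whose date falls between start and end (inclusive)."""
    # Unparseable dates become NaT and are dropped by the mask.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))
    return df[mask]


def compile_datasets(links, start, end, date_col="date"):
    """Download each zip, filter its CSV, and stack the results."""
    frames = []
    for link in links:
        with urlopen(link, timeout=60) as response:
            zip_bytes = response.read()
        df = read_csv_from_zip(zip_bytes, low_memory=False)
        df = filter_by_date(df, start, end, date_col=date_col)
        # Record where each row came from (hypothetical extra column).
        frames.append(df.assign(source_url=link))
    return pd.concat(frames, ignore_index=True)
```

Filtering each file before concatenating keeps memory use down, since only the rows in the chosen timeframe are held across iterations.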
Voilà, now we have a filtered and randomly sampled dataset from the website, with the majority class undersampled.
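The random sampling step is not spelled out above, so the following is only one plausible way to undersample the majority class with Pandas. The function name undersample_majority and its label_col parameter are hypothetical; which column serves as the class label depends on the analysis.

```python
import pandas as pd


def undersample_majority(df, label_col, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    smallest = df[label_col].value_counts().min()
    # Rows with a missing label are dropped by groupby's default behaviour.
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)
```

Fixing random_state makes the sample reproducible between runs.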
In this article, I briefly shared how to download multiple compressed CSV files from a website and compile them as a Pandas data frame.