A searchable dataset of Michigan air polluter documents that updates daily

Browse through documents—like violation notices and inspection reports—that EGLE's Air Quality Division has sent to air polluters throughout the state.

by Shelby Jouppi

Michigan's Department of Environment, Great Lakes and Energy (EGLE) monitors and inspects thousands of sources of air pollution throughout the state. The department publishes PDF files of inspection reports, violation notices, emissions test results and more on its database; however, the website is not particularly user-friendly.

EGLE representatives told me they are in the process of creating a front-end interface for this database, but in the meantime I created a script that parses the documents by date and type, connects them to identifying information about the facilities, and checks the database daily for updates.

PLEASE NOTE: EGLE has stated that the database is not comprehensive, meaning documents could be missing for a variety of reasons.

EGLE Database Before & After

✨ Products


1. Google Sheet - Full history

🔗 tinyurl.com/egle-air-documents

Bookmark the above URL for a Google Sheet that updates daily, and make a copy when you want to work with the data yourself.

2. Google Sheet - Past 90 days

🔗 tinyurl.com/egle-air-documents-90

If you need something a little more lightweight, you can just keep track of activity from the past 90 days.

3. CSVs of both datasets

These are available as .csv files hosted on GitHub that you can work with via URL or download.
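For example, the files can be read straight from their URL with pandas. The URL below is a placeholder, not the repository's actual path, so swap in the raw GitHub link to whichever .csv you want:

```python
# Minimal sketch: load one of the CSVs directly by URL with pandas.
# The URL is a placeholder -- replace it with the raw GitHub path to the file.
import pandas as pd

CSV_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/egle-air-documents.csv"

docs = pd.read_csv(CSV_URL)
print(docs.head())
```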

4. Free data assistance!

I'm happy to help troubleshoot or consult on how to use the data. Please reach out by submitting your question, error or request to this form and I will respond within 24 hours. Feel free to indicate if you'd like to meet about your question.

If you're a journalist with a time-sensitive question, feel free to contact me directly: hello [at] shelbyjouppi.com

Read the documentation


Browse activity from the past 90 days


Read more about the different types of documents in this dataset.


Documentation & Methodology


Important caveats to start

The dataset was created by scraping the Air Quality Division database. Please note that my scraper only looks for sources listed in a directory provided to me by EGLE in May 2022. New sources may have appeared since then, and others may have changed their names or gone out of service. I am in the process of requesting an up-to-date directory; in the meantime you can check the master list they publish.

EGLE has also said that their online database is not necessarily comprehensive. There may be missing documents or other types of reports not uploaded for certain reasons. For any questions about the specific sources or documents in this dataset, contact EGLE's public information office.

Columns in the dataset:

DOCUMENT INFO:
SOURCE INFO:
LOCATION INFO:
Use with caution: I have noticed inconsistencies. Sometimes the address and ZIP code refer to the main facility, not necessarily the specific plant.

Some definitions

To help get you started, here are some definitions of certain columns that I obtained during my reporting on this subject.

COMMON DOCUMENT TYPES
EPA CLASS

Document type code key

I created the key manually in September 2022 by reviewing examples of each document type and using the language in the documents to define each code. Please report errors via the form below. Michigan EGLE should be consulted with any questions about specific documents or naming conventions.

Caveat: When a digit is included in the code (i.e. VN2), it is not always used in a uniform way. For simplicity's sake, I titled these 1st, 2nd, 3rd, etc. However, these documents should be reviewed in context with scrutiny and not necessarily used in aggregate.

Methodology

Using Beautiful Soup, I scraped over 18,000 documents for these sources of air pollution. With regex, I extracted the URLs of the documents as well as data from the documents' names, which were all structured predictably as such:

{SOURCE ID}_{TYPE OF DOCUMENT}_{DATE ISSUED}.pdf
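
As a sketch of that parsing step, here is how a name in that format could be broken apart with a regular expression. The example filename and the group names are illustrative, not taken from the dataset:

```python
# Sketch: split a filename of the form {SOURCE ID}_{TYPE OF DOCUMENT}_{DATE ISSUED}.pdf
# into its three fields. The example filename is made up for illustration.
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<source_id>[^_]+)_(?P<doc_type>[^_]+)_(?P<date_issued>[^_.]+)\.pdf$",
    re.IGNORECASE,
)

match = FILENAME_PATTERN.match("N1234_VN_20220915.pdf")
if match:
    print(match.group("source_id"))    # N1234
    print(match.group("doc_type"))     # VN
    print(match.group("date_issued"))  # 20220915
```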

I joined the scraped data with identifying information (name, location, source type) from the master list. I also created a document code key (EGLE-AQD-document-code-key.xlsx) manually so users can easily see what kind of document they are looking at.
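
That join can be pictured as a simple merge on the source ID. The column names below are assumptions for illustration, not the dataset's actual headers:

```python
# Sketch of the join step: attach identifying info from the master list to
# each scraped document row. Column names are illustrative assumptions.
import pandas as pd

documents = pd.DataFrame({
    "source_id": ["N1234", "N5678"],
    "doc_type": ["VN", "FCE"],
    "date_issued": ["2022-09-15", "2022-08-01"],
})

master_list = pd.DataFrame({
    "source_id": ["N1234", "N5678"],
    "name": ["Example Plant A", "Example Plant B"],
    "source_type": ["MAJOR", "MINOR"],
})

# A left join keeps every scraped document, even if its source is missing
# from the master list.
joined = documents.merge(master_list, on="source_id", how="left")
print(joined)
```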

The autoscraper uses GitHub Actions, Beautiful Soup and regex to search the database for directories with new updates, go into those folders and collect URLs that are not already in the dataset. It also uses the Python library pygsheets to publish the updated Google Sheets.
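
In rough outline, a daily run of that kind looks something like the sketch below. The directory URL, file names, spreadsheet key and column names are placeholders, not the project's actual values:

```python
# Simplified sketch of a daily autoscraper run (scheduled, e.g., with GitHub
# Actions): find PDF links not yet in the dataset, append them, and republish
# the Google Sheet with pygsheets. All URLs, keys and column names here are
# placeholders for illustration.
import pandas as pd
import pygsheets
import requests
from bs4 import BeautifulSoup

existing = pd.read_csv("egle-air-documents.csv")
known_urls = set(existing["url"])

# Scrape a directory page and collect PDF links that are new to the dataset.
page = requests.get("https://example.com/egle-aqd-directory/")  # placeholder
soup = BeautifulSoup(page.text, "html.parser")
new_rows = [
    {"url": a["href"]}
    for a in soup.find_all("a", href=True)
    if a["href"].endswith(".pdf") and a["href"] not in known_urls
]

if new_rows:
    updated = pd.concat([existing, pd.DataFrame(new_rows)], ignore_index=True)
    updated.to_csv("egle-air-documents.csv", index=False)

    # Publish the refreshed dataset to the Google Sheet.
    client = pygsheets.authorize(service_file="service-account.json")
    spreadsheet = client.open_by_key("SPREADSHEET_KEY")  # placeholder
    spreadsheet.sheet1.set_dataframe(updated, (1, 1))
```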

View the code

Gratitude

Many thanks to my generous professors Jon Thirkhield and Jonathan Soma who shared their code and helped me troubleshoot along the way.

Happy searching! And don't hesitate to reach out.



