A searchable dataset of Michigan air polluter documents that updates daily

Browse through documents—like violation notices and inspection reports—that EGLE's Air Quality Division has sent to air polluters throughout the state.

by Shelby Jouppi

Michigan's Department of Environment, Great Lakes and Energy (EGLE) monitors and inspects thousands of sources of air pollution throughout the state. The department publishes PDF files of inspection reports, violation notices, emissions test results and more on its database; however, the website is not particularly user-friendly.

EGLE representatives told me they are in the process of creating a front-end interface for this database, but in the meantime I created a script that parses the documents by date and type, connects them to identifying information about the facilities, and checks the database daily for updates.

PLEASE NOTE: EGLE has stated that the database is not comprehensive, meaning documents could be missing for a variety of reasons.

EGLE Database Before & After

✨ Products


1. Google Sheet - Full history

🔗 tinyurl.com/egle-air-documents

Bookmark the above URL for a Google Sheet that updates daily, and make a copy when you want to work with the data yourself.

2. Google Sheet - Past 90 days

🔗 tinyurl.com/egle-air-documents-90

If you need something a little more lightweight, you can just keep track of activity from the past 90 days.

3. CSVs of both datasets

These are available as .csv files hosted on GitHub that you can work with via URL or download.
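For example, the files can be read straight from their URL with pandas. The URL below is a placeholder, not the repository's actual path, so swap in the raw GitHub link to whichever .csv you want:

```python
# Minimal sketch: load one of the CSVs directly by URL with pandas.
# The URL is a placeholder -- replace it with the raw GitHub path to the file.
import pandas as pd

CSV_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/egle-air-documents.csv"

docs = pd.read_csv(CSV_URL)
print(docs.head())
```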

4. Free data assistance!

I'm happy to help troubleshoot or consult on how to use the data. Please reach out by submitting your question, error or request to this form and I will respond within 24 hours. Feel free to indicate if you'd like to meet about your question.

If you're a journalist with a time-sensitive question, feel free to contact me directly: hello [at] shelbyjouppi.com

Read the documentation


Browse activity from the past 90 days


Read more about the different types of documents in this dataset.


Documentation & Methodology


Important caveats to start

The dataset was created by scraping the Air Quality Division database. Please note that my scraper only looks for sources listed in a directory provided to me by EGLE in May 2022. New sources may have appeared since then, and others may have changed their names or gone out of service. I am in the process of requesting an up-to-date directory; in the meantime you can check the master list they publish.

EGLE has also said that their online database is not necessarily comprehensive. There may be missing documents or other types of reports not uploaded for certain reasons. For any questions about the specific sources or documents in this dataset, contact EGLE's public information office.

Columns in the dataset:

DOCUMENT INFO:
SOURCE INFO:
LOCATION INFO:
Use with caution: I have noticed inconsistencies. Sometimes the address and ZIP code refer to the main facility, not necessarily the specific plant.

Some definitions

To help get you started, here are some definitions of certain columns that I obtained during my reporting on this subject.

COMMON DOCUMENT TYPES
EPA CLASS

Document type code key

I created the key manually in September 2022 by reviewing examples of each document type and using the language in the documents to define each code. Please report errors via the form below. Michigan EGLE should be consulted with any questions about specific documents or naming conventions.

Caveat: When a digit is included in the code (i.e. VN2), it is not always used in a uniform way. For simplicity's sake, I titled these 1st, 2nd, 3rd, etc. However, these documents should be reviewed in context with scrutiny and not necessarily used in aggregate.

Methodology

Using Beautiful Soup, I scraped over 18,000 documents for these sources of air pollution. With regex, I extracted the URLs of the documents as well as data from the documents' names, which were all structured predictably as such:

{SOURCE ID}_{TYPE OF DOCUMENT}_{DATE ISSUED}.pdf
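
As a sketch of that parsing step, here is how a name in that format could be broken apart with a regular expression. The example filename and the group names are illustrative, not taken from the dataset:

```python
# Sketch: split a filename of the form {SOURCE ID}_{TYPE OF DOCUMENT}_{DATE ISSUED}.pdf
# into its three fields. The example filename is made up for illustration.
import re

FILENAME_PATTERN = re.compile(
    r"^(?P<source_id>[^_]+)_(?P<doc_type>[^_]+)_(?P<date_issued>[^_.]+)\.pdf$",
    re.IGNORECASE,
)

match = FILENAME_PATTERN.match("N1234_VN_20220915.pdf")
if match:
    print(match.group("source_id"))    # N1234
    print(match.group("doc_type"))     # VN
    print(match.group("date_issued"))  # 20220915
```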

I joined the scraped data with identifying information (name, location, source type) from the master list. I also created a document code key (EGLE-AQD-document-code-key.xlsx) manually so users can easily see what kind of document they are looking at.
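
That join can be pictured as a simple merge on the source ID. The column names below are assumptions for illustration, not the dataset's actual headers:

```python
# Sketch of the join step: attach identifying info from the master list to
# each scraped document row. Column names are illustrative assumptions.
import pandas as pd

documents = pd.DataFrame({
    "source_id": ["N1234", "N5678"],
    "doc_type": ["VN", "FCE"],
    "date_issued": ["2022-09-15", "2022-08-01"],
})

master_list = pd.DataFrame({
    "source_id": ["N1234", "N5678"],
    "name": ["Example Plant A", "Example Plant B"],
    "source_type": ["MAJOR", "MINOR"],
})

# A left join keeps every scraped document, even if its source is missing
# from the master list.
joined = documents.merge(master_list, on="source_id", how="left")
print(joined)
```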

The autoscraper uses GitHub Actions, Beautiful Soup and regex to search the database for directories with new updates, go into those folders and collect URLs that are not already in the dataset. It also uses the Python library pygsheets to publish the updated Google Sheets.
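
In rough outline, a daily run of that kind looks something like the sketch below. The directory URL, file names, spreadsheet key and column names are placeholders, not the project's actual values:

```python
# Simplified sketch of a daily autoscraper run (scheduled, e.g., with GitHub
# Actions): find PDF links not yet in the dataset, append them, and republish
# the Google Sheet with pygsheets. All URLs, keys and column names here are
# placeholders for illustration.
import pandas as pd
import pygsheets
import requests
from bs4 import BeautifulSoup

existing = pd.read_csv("egle-air-documents.csv")
known_urls = set(existing["url"])

# Scrape a directory page and collect PDF links that are new to the dataset.
page = requests.get("https://example.com/egle-aqd-directory/")  # placeholder
soup = BeautifulSoup(page.text, "html.parser")
new_rows = [
    {"url": a["href"]}
    for a in soup.find_all("a", href=True)
    if a["href"].endswith(".pdf") and a["href"] not in known_urls
]

if new_rows:
    updated = pd.concat([existing, pd.DataFrame(new_rows)], ignore_index=True)
    updated.to_csv("egle-air-documents.csv", index=False)

    # Publish the refreshed dataset to the Google Sheet.
    client = pygsheets.authorize(service_file="service-account.json")
    spreadsheet = client.open_by_key("SPREADSHEET_KEY")  # placeholder
    spreadsheet.sheet1.set_dataframe(updated, (1, 1))
```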

View the code

Gratitude

Many thanks to my generous professors Jon Thirkhield and Jonathan Soma who shared their code and helped me troubleshoot along the way.

Happy searching! And don't hesitate to reach out.



