Northwest Air

Downtown Portland

Well the Northwest air brings the fast boys to town
Be like fire on the Cascades
When our feet touch the ground

Problem to Solve

Time to head west! The United States’s “Pacific Northwest” is a northwest region including the states of Oregon, Washington, and Northern Idaho. The region’s cooler temperatures and cloudy skies attract athletes and outdoor adventurers of all kinds. Recently, though, the increasing prevalence of wild fires has threatened the region’s once pristine air quality.

In this problem, in a file called air.R in a folder called air, you’ll learn about indicators of air quality and use R to analyze common air pollutants in the state of Oregon.

Distribution Code

For this problem, you’ll need to download several .R files and air.csv.

Download the distribution code

Open RStudio per the linked steps and navigate to the R console:

Next execute

getwd()

to print your working directory. Ensure your current working directory is where you’d like to download this problem’s distribution code. If using RStudio through cs50.dev the recommended directory is /workspaces/NUMBER where NUMBER is a number unique to your codespace.

If you do not see the right working directory, use setwd to change it! Try typing setwd("..") if in the working directory of another problem, which will move you one directory higher.

Next execute

download.file("https://cdn.cs50.net/r/2024/x/psets/4/air.zip", "air.zip")

in order to download a ZIP called air.zip into your codespace.

Then execute

unzip("air.zip")

to create a folder called air. You no longer need the ZIP file, so you can execute

file.remove("air.zip")

Now type

setwd("air")

followed by Enter to move yourself into (i.e., open) that directory. Your working directory should now end with

air/

If all was successful, you should execute

list.files()

and see several .R files alongside air.csv. If not, retrace your steps and see if you can determine where you went wrong!

Schema

Before jumping in, it will be helpful to get a sense for the “schema” (i.e., organization!) of the data you’re given.

Learn about this data

In air.csv, you’ll find data from the United States Environmental Protection Agency’s National Emissions Inventory. This inventory tracks all emissions of certain pollutants (i.e., harmful substances released into the air).

Some pollutants are called “criteria air pollutants” (CAPs) by the United States Environmental Protection Agency (EPA). These pollutants are more harmful than others. According to the EPA:

Criteria air pollutants are found all over the U.S. They can harm your health and the environment, and cause property damage.

Ozone, carbon monoxide, lead, and nitrogen dioxide are all examples of criteria air pollutants.

air.csv shows the amount of criteria air pollutants emitted in Oregon. The data is very detailed! For each county (i.e., municipality), you can see how much of each pollutant is emitted by different sources, like school buses or wildfires.

Pay attention to these columns:

State-County, which is the county in which the emission took place
POLLUTANT, which is the pollutant being emitted
Emissions (Tons), which is the amount of the pollutant emitted (in tons)
SCC Levels 1–4, which describes the source of the pollutant, from more general (Level 1) to more specific (Level 4)

Specification

`1.R`

In 1.R, read air.csv into a tibble called air, renaming and selecting the columns you need. In particular, ensure the air tibble includes only the following columns:

state, renamed from State
county, renamed from State-County
pollutant, renamed from POLLUTANT
emissions, renamed from Emissions (Tons)
level_1, renamed from SCC LEVEL 1
level_2, renamed from SCC LEVEL 2
level_3, renamed from SCC LEVEL 3
level_4, renamed from SCC LEVEL 4

Turns out the tidyverse has a function you can use in place of read.csv! After loading the tidyverse, use read_csv to load a CSV file directly into a tibble.

Save the resulting air tibble, using save, in a file called air.RData. You’ll use this tibble in the remaining .R files.

`2.R`

To sustainably improve air quality, analysts often focus their efforts on particular sources of pollutants. To identify which sources might need the most attention, find the largest sources of pollutants in Oregon.

In 2.R, load the air tibble from air.RData with load. Update the tibble by sorting all rows by the emissions column, highest value to lowest.

Save the resulting air tibble, using save, in a file called 2.RData.

`3.R`

In addition to focusing on the largest sources of pollutants, analysts might focus on particular geographic regions. Choose one of the counties in Oregon from this list. Find all sources of pollutants in that county.

In 3.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice.

Save the resulting air tibble, using save, in a file called 3.RData.

`4.R`

Combine your analyses from 2.R and 3.R.

In 4.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice and sorts the data by the emissions column, highest value to lowest.

Save the resulting air tibble, using save, in a file called 4.RData.

`5.R`

So far, you’ve identified the largest sources of pollutants across the entire state of Oregon, as well as within a single county. Now, find the single largest pollutant source for each county.

In 5.R, load the air tibble from air.RData with load. Transform the tibble so that it includes the single row with the highest value in the emissions column for each county.

Save the resulting air tibble, using save, in a file called 5.RData.

`6.R`

Some pollutants tend to be emitted at higher rates than others. For each pollutant, find its total emissions across the entire state of Oregon.

In 6.R, load the air tibble from air.RData with load. Summarize the data in the tibble to find the total emissions for each pollutant. Sort the pollutants from highest to lowest emissions.

The resulting tibble should have two columns, one called pollutant and one called emissions. For example:

pollutant	emissions
Carbon Monoxide	8070434.86
Volatile Organic Compounds	2368212.66
PM10 Primary (Filt + Cond)	1266915.06
…	…

Save the resulting air tibble, using save, in a file called 6.RData.

`7.R`

In professional air quality reports, analysts will often calculate total emissions for broad categories of sources. In fact, the level_1 column in the air tibble lists the broad category in which a specific source is included! For each category of source in level_1, calculate the total emissions of each pollutant.

Want to learn more about these categories?

Among these broad categories are:

Industrial Processes, which includes activities such as chemical manufacturing, metal processing, and food production
Miscellaneous Area Sources, which includes small-scale activities like residential heating, lawn mowing, and commercial cooking
Mobile Sources, which includes vehicles such as cars, trucks, airplanes, and ships
Natural Sources, which includes wildfires, volcanic activity, and biogenic emissions from plants and trees
Solvent Utilization, which includes the use of products like paints, coatings, and cleaning agents
Stationary Source Fuel Combustion, which includes the burning of fuels in fixed locations, such as power plants, industrial boilers, and residential heating
Storage and Transport, which includes the handling and movement of fuels and chemicals, leading to evaporative emissions
Waste Disposal, Treatment, and Recovery, which includes landfills, waste treatment facilities, and recycling operations

In 7.R, load the air tibble from air.RData with load. Transform the tibble to find the total emissions of each pollutant from each of the level_1 source categories. Sort the rows first alphabetically by source name, then alphabetically by pollutant name.

The resulting tibble should have three columns, one called source, one called pollutant, and one called emissions. For example:

source	pollutant	emissions
Industrial Processes	Carbon Monoxide	1460
Industrial Processes	Nitrogen Oxides	9.96
Miscellaneous Area Sources	Ammonia	161756
Miscellaneous Area Sources	Carbon Monoxide	7385998
…	…

Save the resulting air tibble, using save, in a file called 7.RData.

Advice

There are a few ways to approach these problems! Consider the below as advice to help you on your way:

Use read_csv to read data directly into a tibble

Turns out the tidyverse has a function you can use in place of read.csv! After loading the tidyverse, try using read_csv to load a CSV file directly into a tibble.

Use summarize to summarize data

In lecture, you saw that summarize can calculate the number of rows in a group, using the function n. You can use other functions with summarize too, such as sum:

summarize(DATA, sum())

where DATA is a data frame.

In fact, you can pass an argument to sum if you’d like to summarize data by summing a particular column for each group in your data:

summarize(DATA, sum(COLUMN))

where COLUMN is a column name in your data frame.

Usage

Assuming your .R files are in your working directory, execute each file individually to test your work:

source("1.R")

How to Test

Here’s how to test your code manually:

Executing 1.R should create a tibble named air with 32,015 rows and 8 columns
Executing 2.R should create a tibble named air with 32,015 rows and 8 columns, sorted from highest to lowest value in the emissions column
Executing 3.R should create a tibble named air with rows for a single county and 8 columns
Executing 4.R should create a tibble named air like the one in 3.R, but with rows sorted from highest to lowest value in the emissions column
Executing 5.R should create a tibble named air with 36 rows and 8 columns, sorted from highest to lowest value in the emissions column
Executing 6.R should create a tibble named air with 7 rows and 2 columns, sorted from highest to lowest value in the emissions column
Executing 7.R should create a tibble named air with 39 rows and 3 columns, sorted first alphabetically by source name, then alphabetically by pollutant name

`check50`

You can also check your code using check50, a program that CS50 will use to test your code when you submit. But be sure to test it yourself as well!

Run the following command in the RStudio console:

check50("cs50/problems/2024/r/air")

Be sure that you’ve created each .R file’s corresponding .RData file—it’s your .RData files that check50 will check!

Green smilies mean your program has passed a test! Red frownies will indicate your program output something unexpected. Visit the URL that check50 outputs to see the input check50 handed to your program, what output it expected, and what output your program actually gave.

How to Submit

After you submit, be sure to check your autograder results. If you see SUBMISSION ERROR: missing files (0.0/1.0), it means your file was not named exactly as prescribed (or you uploaded it to the wrong problem).

Correctness in submissions entails everything from reading the specification, writing code that is compliant with it, and submitting files with the correct name. If you see this error, you should resubmit right away, making sure your submission is fully compliant with the specification. The staff will not adjust your filenames for you after the fact!

In RStudio, select all .RData and .R files you created for this problem, as by checking the box to the left of the files’ names. With the file selected, click on the icon at the top of the file explorer. Choose Export, name your file northwest-air-solution.zip, followed by Download.

Go to CSCI E-5a’s Gradescope page.

Click Problem Set 4: Northwest Air.

Unzip your northwest-air-solution.zip file. Open the folder. Drag and drop your .RData and .R files to the area that says Drag & Drop. Be sure that your .RData and .R files are correctly named exactly as prescribed above, lest the autograder fail to run on your submission! Note that your submission is considered incomplete if any of the files are missing—be sure they’re all there!

Click Upload.

You should see a message that says “Problem Set 4: Northwest Air submitted successfully!”

Be sure to double-check your autograder results before moving on!

Acknowledgements

Data retrieved from the United States Environmental Protection Agency’s National Emissions Inventory.