Northwest Air
Well the Northwest air brings the fast boys to town
Be like fire on the Cascades
When our feet touch the ground
– Coming Home (Oregon), by Mat Kearney
Problem to Solve
Time to head west! The United States’s “Pacific Northwest” is a region in the northwest of the country that includes the states of Oregon and Washington, as well as northern Idaho. The region’s cooler temperatures and cloudy skies attract athletes and outdoor adventurers of all kinds. Recently, though, the increasing prevalence of wildfires has threatened the region’s once pristine air quality.
In this problem, in a series of .R files in a folder called air, you’ll learn about indicators of air quality and use R to analyze common air pollutants in the state of Oregon.
Distribution Code
For this problem, you’ll need to download several .R files and air.csv.
Download the distribution code
Open RStudio per the linked steps and navigate to the R console:
>
Next execute getwd() to print your working directory. Ensure your current working directory is where you’d like to download this problem’s distribution code. If using RStudio through cs50.dev, the recommended directory is /workspaces/NUMBER, where NUMBER is a number unique to your codespace.
If you do not see the right working directory, use setwd to change it! Try typing setwd("..") if in the working directory of another problem, which will move you one directory higher.
Next execute
download.file("https://cdn.cs50.net/r/2024/x/psets/4/air.zip", "air.zip")
in order to download a ZIP called air.zip into your codespace.
Then execute
unzip("air.zip")
to create a folder called air. You no longer need the ZIP file, so you can execute
file.remove("air.zip")
Now type
setwd("air")
followed by Enter to move yourself into (i.e., open) that directory. Your working directory should now end with
air/
If all was successful, executing list.files() should show several .R files alongside air.csv. If not, retrace your steps and see if you can determine where you went wrong!
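To get a feel for how getwd, setwd, and list.files fit together, here is a minimal sketch that practices on a throwaway folder instead of your real workspace. The folder name air-demo and the empty files inside it are made up for illustration; nothing here downloads the actual distribution code.

```r
# Remember where we started so we can return at the end
start <- getwd()

# Hypothetical scratch folder standing in for "air"
scratch <- file.path(tempdir(), "air-demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("1.R", "air.csv")))

setwd(scratch)               # move into the folder
here <- basename(getwd())    # last component of the working directory
found <- list.files()        # files in the current working directory

setwd("..")                  # move one directory higher
parent <- basename(getwd())

setwd(start)                 # restore the original working directory
```

Printing here and found after running this shows the folder you moved into and the files it contains, which is the same check you perform after setwd("air").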
Schema
Before jumping in, it will be helpful to get a sense for the “schema” (i.e., organization!) of the data you’re given.
Learn about this data
In air.csv, you’ll find data from the United States Environmental Protection Agency’s National Emissions Inventory. This inventory tracks all emissions of certain pollutants (i.e., harmful substances released into the air).
Some pollutants are called “criteria air pollutants” (CAPs) by the United States Environmental Protection Agency (EPA). These pollutants are more harmful than others. According to the EPA:
Criteria air pollutants are found all over the U.S. They can harm your health and the environment, and cause property damage.
Ozone, carbon monoxide, lead, and nitrogen dioxide are all examples of criteria air pollutants.
air.csv shows the amount of criteria air pollutants emitted in Oregon. The data is very detailed! For each county (i.e., municipality), you can see how much of each pollutant is emitted by different sources, like school buses or wildfires.
Pay attention to these columns:
- State-County, which is the county in which the emission took place
- POLLUTANT, which is the pollutant being emitted
- Emissions (Tons), which is the amount of the pollutant emitted (in tons)
- SCC Levels 1–4, which describes the source of the pollutant, from more general (Level 1) to more specific (Level 4)
Specification
1.R
In 1.R, read air.csv into a tibble called air, renaming and selecting the columns you need. In particular, ensure the air tibble includes only the following columns:
- state, renamed from State
- county, renamed from State-County
- pollutant, renamed from POLLUTANT
- emissions, renamed from Emissions (Tons)
- level_1, renamed from SCC LEVEL 1
- level_2, renamed from SCC LEVEL 2
- level_3, renamed from SCC LEVEL 3
- level_4, renamed from SCC LEVEL 4
Turns out the tidyverse has a function you can use in place of read.csv! After loading the tidyverse, use read_csv to load a CSV file directly into a tibble.
Save the resulting air tibble, using save, in a file called air.RData. You’ll use this tibble in the remaining .R files.
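The read, rename, select, and save steps can be sketched on a small made-up CSV. The file contents, column names, and the readings variable below are all hypothetical; the sketch only shows the general shape of the technique, assuming the readr and dplyr packages (both attached by library(tidyverse)) are installed.

```r
library(readr)  # read_csv(); library(tidyverse) would attach this too
library(dplyr)  # select()

# Write a tiny made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Site Name,Reading (ppm),Notes",
  "Alpha,1.5,ok",
  "Beta,2.25,recheck"
), csv_path)

# Read into a tibble, then keep and rename two columns in one step.
# Backticks are needed around names containing spaces or punctuation.
readings <- read_csv(csv_path, show_col_types = FALSE) |>
  select(site = `Site Name`, reading = `Reading (ppm)`)

# Store the tibble on disk under its variable name
rdata_path <- tempfile(fileext = ".RData")
save(readings, file = rdata_path)
```

Note that select can rename as it selects, so a separate call to rename is optional; either approach should produce the same columns.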
2.R
To sustainably improve air quality, analysts often focus their efforts on particular sources of pollutants. To identify which sources might need the most attention, find the largest sources of pollutants in Oregon.
In 2.R, load the air tibble from air.RData with load. Update the tibble by sorting all rows by the emissions column, highest value to lowest.
Save the resulting air tibble, using save, in a file called 2.RData.
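Here is a rough sketch of the load-then-sort pattern, using a made-up scores tibble saved to a temporary file in place of air.RData. It assumes dplyr is installed; arrange with desc is one common way to sort from highest to lowest.

```r
library(dplyr)

# Made-up data saved to a temporary .RData file, standing in for air.RData
scores <- tibble(name = c("a", "b", "c"), points = c(10, 30, 20))
path <- tempfile(fileext = ".RData")
save(scores, file = path)
rm(scores)

# load() restores the object under the name it was saved with
load(path)

# desc() reverses arrange()'s default ascending order
scores <- arrange(scores, desc(points))
```

Because load brings back the object under its original name, there is no need to assign the result of load to a variable.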
3.R
In addition to focusing on the largest sources of pollutants, analysts might focus on particular geographic regions. Choose one of the counties in Oregon from this list. Find all sources of pollutants in that county.
In 3.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice.
Save the resulting air tibble, using save, in a file called 3.RData.
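Keeping only the rows that match one value can be sketched with filter. The visits tibble and its region values below are invented for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up data: three rows spread over two regions
visits <- tibble(
  region = c("North", "South", "North"),
  count = c(4, 7, 2)
)

# List the distinct values first, so the comparison below matches
# the spelling and capitalization actually used in the data
available <- unique(visits$region)

# Keep only the rows whose region is exactly "North"
north <- filter(visits, region == "North")
```

Since == compares text exactly, it may help to copy the value straight from the output of unique rather than typing it from memory.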
4.R
Combine your analyses from 2.R and 3.R.
In 4.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice and sorts the data by the emissions column, highest value to lowest.
Save the resulting air tibble, using save, in a file called 4.RData.
5.R
So far, you’ve identified the largest sources of pollutants across the entire state of Oregon, as well as within a single county. Now, find the single largest pollutant source for each county.
In 5.R, load the air tibble from air.RData with load. Transform the tibble so that it includes the single row with the highest value in the emissions column for each county.
Save the resulting air tibble, using save, in a file called 5.RData.
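One possible way to keep a single top row per group is group_by followed by slice_max, sketched below on made-up sales data. This is only one of several approaches and may not be the one shown in lecture; it assumes dplyr 1.0 or later.

```r
library(dplyr)

# Made-up data: several items sold at two stores
sales <- tibble(
  store = c("A", "A", "B", "B", "B"),
  item = c("pen", "ink", "pen", "ink", "pad"),
  amount = c(5, 9, 3, 1, 8)
)

top <- sales |>
  group_by(store) |>
  # Keep the one row with the largest amount in each store;
  # with_ties = FALSE guarantees exactly one row per group
  slice_max(amount, n = 1, with_ties = FALSE) |>
  ungroup() |>
  arrange(desc(amount))
```

Calling ungroup afterwards returns an ordinary tibble, so later steps such as arrange act on all rows rather than within each group.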
6.R
Some pollutants tend to be emitted at higher rates than others. For each pollutant, find its total emissions across the entire state of Oregon.
In 6.R
, load the air
tibble from air.RData
with load
. Summarize the data in the tibble to find the total emissions for each pollutant. Sort the pollutants from highest to lowest emissions.
The resulting tibble should have two columns, one called pollutant and one called emissions. For example:
pollutant | emissions |
---|---|
Carbon Monoxide | 8070434.86 |
Volatile Organic Compounds | 2368212.66 |
PM10 Primary (Filt + Cond) | 1266915.06 |
… | … |
Save the resulting air tibble, using save, in a file called 6.RData.
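Totaling a column per group and then sorting the totals can be sketched as follows. The orders tibble is made up, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up data: individual orders for two products
orders <- tibble(
  product = c("tea", "coffee", "tea", "coffee", "tea"),
  cost = c(2, 5, 3, 4, 1)
)

totals <- orders |>
  group_by(product) |>
  # One row per product; the summed column keeps the name "cost"
  summarize(cost = sum(cost)) |>
  arrange(desc(cost))
```

The result has one row per group and only the grouping column plus the summary column, which is why the output ends up with two columns.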
7.R
In professional air quality reports, analysts will often calculate total emissions for broad categories of sources. In fact, the level_1 column in the air tibble lists the broad category in which a specific source is included! For each category of source in level_1, calculate the total emissions of each pollutant.
Want to learn more about these categories?
Among these broad categories are:
- Industrial Processes, which includes activities such as chemical manufacturing, metal processing, and food production
- Miscellaneous Area Sources, which includes small-scale activities like residential heating, lawn mowing, and commercial cooking
- Mobile Sources, which includes vehicles such as cars, trucks, airplanes, and ships
- Natural Sources, which includes wildfires, volcanic activity, and biogenic emissions from plants and trees
- Solvent Utilization, which includes the use of products like paints, coatings, and cleaning agents
- Stationary Source Fuel Combustion, which includes the burning of fuels in fixed locations, such as power plants, industrial boilers, and residential heating
- Storage and Transport, which includes the handling and movement of fuels and chemicals, leading to evaporative emissions
- Waste Disposal, Treatment, and Recovery, which includes landfills, waste treatment facilities, and recycling operations
In 7.R, load the air tibble from air.RData with load. Transform the tibble to find the total emissions of each pollutant from each of the level_1 source categories. Sort the rows first alphabetically by source name, then alphabetically by pollutant name.
The resulting tibble should have three columns, one called source, one called pollutant, and one called emissions. For example:
source | pollutant | emissions |
---|---|---|
Industrial Processes | Carbon Monoxide | 1460 |
Industrial Processes | Nitrogen Oxides | 9.96 |
Miscellaneous Area Sources | Ammonia | 161756 |
Miscellaneous Area Sources | Carbon Monoxide | 7385998 |
… | … | … |
Save the resulting air tibble, using save, in a file called 7.RData.
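Grouping by two columns at once, renaming one of them, and sorting by both can be sketched like this. The trips tibble and the new column name vehicle are made up for illustration; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up data: trips by mode of transport and day
trips <- tibble(
  mode = c("bus", "bike", "bus", "bike", "bus"),
  day = c("Tue", "Mon", "Mon", "Mon", "Tue"),
  miles = c(4, 2, 6, 3, 1)
)

by_mode_day <- trips |>
  group_by(mode, day) |>
  # One row per (mode, day) pair; .groups = "drop" returns an ungrouped tibble
  summarize(miles = sum(miles), .groups = "drop") |>
  rename(vehicle = mode) |>
  # Sort by the first column, breaking ties with the second
  arrange(vehicle, day)
```

Passing several columns to arrange sorts by the first and uses each later column only to break ties, which matches a "first by X, then by Y" requirement.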
Advice
There are a few ways to approach these problems! Consider the below as advice to help you on your way:
Use read_csv to read data directly into a tibble
Turns out the tidyverse has a function you can use in place of read.csv! After loading the tidyverse, try using read_csv to load a CSV file directly into a tibble.
Use summarize to summarize data
In lecture, you saw that summarize can calculate the number of rows in a group, using the function n. You can use other functions with summarize too, such as sum:
summarize(DATA, sum())
where DATA is a data frame.
In fact, you can pass an argument to sum if you’d like to summarize data by summing a particular column for each group in your data:
summarize(DATA, sum(COLUMN))
where COLUMN is a column name in your data frame.
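As a small illustration of the calls above, the sketch below sums a column of a made-up pets tibble, first without and then with a name for the result. The data and the names total and rows are hypothetical; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up data
pets <- tibble(
  kind = c("cat", "dog", "cat"),
  weight = c(4, 20, 5)
)

# Without a name, the result column is labeled after the expression itself
unnamed <- summarize(pets, sum(weight))

# Naming each argument gives the result columns friendlier names;
# n() counts the rows being summarized
named <- summarize(pets, total = sum(weight), rows = n())
```

Comparing names(unnamed) with names(named) shows why naming the summary is usually worthwhile when the output needs specific column names.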
Usage
Assuming your .R files are in your working directory, execute each file individually to test your work:
source("1.R")
How to Test
Here’s how to test your code manually:
- Executing 1.R should create a tibble named air with 32,015 rows and 8 columns
- Executing 2.R should create a tibble named air with 32,015 rows and 8 columns, sorted from highest to lowest value in the emissions column
- Executing 3.R should create a tibble named air with rows for a single county and 8 columns
- Executing 4.R should create a tibble named air like the one in 3.R, but with rows sorted from highest to lowest value in the emissions column
- Executing 5.R should create a tibble named air with 36 rows and 8 columns, sorted from highest to lowest value in the emissions column
- Executing 6.R should create a tibble named air with 7 rows and 2 columns, sorted from highest to lowest value in the emissions column
- Executing 7.R should create a tibble named air with 39 rows and 3 columns, sorted first alphabetically by source name, then alphabetically by pollutant name
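If you would like to check a saved .RData file by hand, one option is to load it into a separate environment and inspect its dimensions and ordering, as in the sketch below. The two-row tibble is a made-up stand-in for your real output, and the temporary file stands in for a file such as 6.RData; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up stand-in for one of the saved files
air <- tibble(pollutant = c("x", "y"), emissions = c(2, 1))
path <- tempfile(fileext = ".RData")
save(air, file = path)

# Load into a fresh environment so nothing in the workspace is overwritten;
# load() returns the names of the objects it restored
checked <- new.env()
loaded_names <- load(path, envir = checked)

result <- checked$air
dims <- dim(result)  # number of rows, then number of columns

# TRUE when emissions never increases from one row to the next
sorted_desc <- !is.unsorted(rev(result$emissions))
```

Comparing dims against the row and column counts listed above gives a quick sanity check before running check50.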
check50
You can also check your code using check50, a program that CS50 will use to test your code when you submit. But be sure to test it yourself as well!
Run the following command in the RStudio console:
check50("cs50/problems/2024/r/air")
Be sure that you’ve created each .R file’s corresponding .RData file: it’s your .RData files that check50 will check!
Green smilies mean your program has passed a test! Red frownies will indicate your program output something unexpected. Visit the URL that check50 outputs to see the input check50 handed to your program, what output it expected, and what output your program actually gave.
How to Submit
After you submit, be sure to check your autograder results. If you see SUBMISSION ERROR: missing files (0.0/1.0), it means your file was not named exactly as prescribed (or you uploaded it to the wrong problem).
Correctness in submissions entails reading the specification, writing code that complies with it, and submitting files with the correct names. If you see this error, you should resubmit right away, making sure your submission is fully compliant with the specification. The staff will not adjust your filenames for you after the fact!
In RStudio, select all .RData and .R files you created for this problem, as by checking the box to the left of the files’ names. With the files selected, click on the icon at the top of the file explorer. Choose Export, name your file northwest-air-solution.zip, then click Download.
Go to CSCI E-5a’s Gradescope page.
Click Problem Set 4: Northwest Air.
Unzip your northwest-air-solution.zip file. Open the folder. Drag and drop your .RData and .R files to the area that says Drag & Drop. Be sure that your .RData and .R files are named exactly as prescribed above, lest the autograder fail to run on your submission! Note that your submission is considered incomplete if any of the files are missing, so be sure they’re all there!
Click Upload.
You should see a message that says “Problem Set 4: Northwest Air submitted successfully!”
Be sure to double-check your autograder results before moving on!
Acknowledgements
Data retrieved from the United States Environmental Protection Agency’s National Emissions Inventory.