Northwest Air
Well the Northwest air brings the fast boys to town
Be like fire on the Cascades
When our feet touch the ground
– Coming Home (Oregon), by Mat Kearney
Problem to Solve
Time to head west! The United States’s “Pacific Northwest” is a region in the country’s northwest that includes the states of Oregon and Washington, as well as northern Idaho. The region’s cooler temperatures and cloudy skies attract athletes and outdoor adventurers of all kinds. Recently, though, the increasing prevalence of wildfires has threatened the region’s once-pristine air quality.
In this problem, in several .R files in a folder called air, you’ll learn about indicators of air quality and use R to analyze common air pollutants in the state of Oregon.
Distribution Code
For this problem, you’ll need to download several .R files and air.csv.
Download the distribution code
Open RStudio per the linked steps and navigate to the R console:
>
Next execute getwd() to print your working directory. Ensure your current working directory is where you’d like to download this problem’s distribution code. If using RStudio through cs50.dev, the recommended directory is /workspaces/NUMBER, where NUMBER is a number unique to your codespace.
If you do not see the right working directory, use setwd to change it! If you are in the working directory of another problem, try typing setwd("..") to move one directory higher.
Next execute
download.file("https://cdn.cs50.net/r/2024/x/psets/4/air.zip", "air.zip")
in order to download a ZIP called air.zip into your codespace.
Then execute
unzip("air.zip")
to create a folder called air. You no longer need the ZIP file, so you can execute
file.remove("air.zip")
Now type
setwd("air")
followed by Enter to move yourself into (i.e., open) that directory. Your working directory should now end with
air/
If all was successful, you should execute
list.files()
and see several .R files alongside air.csv. If not, retrace your steps and see if you can determine where you went wrong!
Schema
Before jumping in, it will be helpful to get a sense for the “schema” (i.e., organization!) of the data you’re given.
Learn about this data
In air.csv, you’ll find data from the United States Environmental Protection Agency’s National Emissions Inventory. This inventory tracks all emissions of certain pollutants (i.e., harmful substances released into the air).
Some pollutants are called “criteria air pollutants” (CAPs) by the United States Environmental Protection Agency (EPA). These pollutants are more harmful than others. According to the EPA:
Criteria air pollutants are found all over the U.S. They can harm your health and the environment, and cause property damage.
Ozone, carbon monoxide, lead, and nitrogen dioxide are all examples of criteria air pollutants.
air.csv shows the amount of criteria air pollutants emitted in Oregon. The data is very detailed! For each county (i.e., municipality), you can see how much of each pollutant is emitted by different sources, like school buses or wildfires.
Pay attention to these columns:
- State-County, which is the county in which the emission took place
- POLLUTANT, which is the pollutant being emitted
- Emissions (Tons), which is the amount of the pollutant emitted (in tons)
- SCC Levels 1–4, which describes the source of the pollutant, from more general (Level 1) to more specific (Level 4)
Specification
1.R
In 1.R, read air.csv into a tibble called air, renaming and selecting the columns you need. In particular, ensure the air tibble includes only the following columns:
- state, renamed from State
- county, renamed from State-County
- pollutant, renamed from POLLUTANT
- emissions, renamed from Emissions (Tons)
- level_1, renamed from SCC LEVEL 1
- level_2, renamed from SCC LEVEL 2
- level_3, renamed from SCC LEVEL 3
- level_4, renamed from SCC LEVEL 4
Save the resulting air tibble, using save, in a file called air.RData. You’ll use this tibble in the remaining .R files.
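The read, rename, select, and save steps can be sketched on a small, made-up CSV. This is only an illustration under stated assumptions: the file contents, the column names (Station Name, READING (ppm), Notes), and the tibble name readings are all hypothetical and unrelated to air.csv.

```r
library(readr)
library(dplyr)

# Write a tiny, made-up CSV to a temporary file (a stand-in for real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Station Name,READING (ppm),Notes",
    "Alpha,1.5,ok",
    "Beta,2.25,check"
  ),
  csv_path
)

# Read the CSV directly into a tibble
readings <- read_csv(csv_path, show_col_types = FALSE)

# select() keeps only the listed columns and renames them as new_name = old_name;
# backticks are needed around names that contain spaces or punctuation
readings <- select(
  readings,
  station = `Station Name`,
  reading = `READING (ppm)`
)

# Save the tibble to an .RData file, remove it, then load it back by name
rdata_path <- tempfile(fileext = ".RData")
save(readings, file = rdata_path)
rm(readings)
load(rdata_path)
```

Note that load restores the object under the name it was saved with, which is why later files can keep referring to the same tibble name.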
2.R
To sustainably improve air quality, analysts often focus their efforts on particular sources of pollutants. To identify which sources might need the most attention, find the largest sources of pollutants in Oregon.
In 2.R, load the air tibble from air.RData with load. Update the tibble by sorting all rows by the emissions column, highest value to lowest.
Save the resulting air tibble, using save, in a file called 2.RData.
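As a minimal sketch of sorting from highest to lowest, assuming a made-up tibble called scores (its names and values are hypothetical), one option is dplyr’s arrange together with desc:

```r
library(dplyr)

# A made-up tibble for illustration only
scores <- tibble(
  player = c("A", "B", "C"),
  points = c(12, 40, 7)
)

# arrange() sorts rows; desc() reverses the order so the largest value comes first
scores <- arrange(scores, desc(points))
```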
3.R
In addition to focusing on the largest sources of pollutants, analysts might focus on particular geographic regions. Choose one of the counties in Oregon from this list. Find all sources of pollutants in that county.
In 3.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice.
Save the resulting air tibble, using save, in a file called 3.RData.
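Keeping only the rows that match one value can be sketched with dplyr’s filter. The visits tibble below, along with its place names and counts, is hypothetical and stands in for whatever data you have loaded:

```r
library(dplyr)

# A made-up tibble for illustration only
visits <- tibble(
  place = c("Library", "Park", "Library", "Museum"),
  count = c(3, 5, 8, 2)
)

# filter() keeps only the rows for which the condition is TRUE
visits <- filter(visits, place == "Library")
```

The comparison is an exact, case-sensitive string match, so the value you compare against must be spelled exactly as it appears in the data.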
4.R
Combine your analyses from 2.R and 3.R.
In 4.R, load the air tibble from air.RData with load. Transform the tibble so that it only includes data for the county of your choice and sorts the data by the emissions column, highest value to lowest.
Save the resulting air tibble, using save, in a file called 4.RData.
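One way to chain a filter and a sort is R’s native pipe, |>, which passes the result of one step to the next. The sketch below assumes R 4.1 or later and uses a hypothetical visits tibble:

```r
library(dplyr)

# A made-up tibble for illustration only
visits <- tibble(
  place = c("Library", "Park", "Library", "Museum"),
  count = c(3, 5, 8, 2)
)

# First keep the matching rows, then sort what remains from highest to lowest
visits <- visits |>
  filter(place == "Library") |>
  arrange(desc(count))
```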
5.R
So far, you’ve identified the largest sources of pollutants across the entire state of Oregon, as well as within a single county. Now, find the single largest pollutant source for each county.
In 5.R, load the air tibble from air.RData with load. Transform the tibble so that it includes the single row with the highest value in the emissions column for each county.
Save the resulting air tibble, using save, in a file called 5.RData.
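Finding the single top row within each group can be sketched with group_by and slice_max. This is one possible approach, not the only one, and the sales tibble and its columns are hypothetical:

```r
library(dplyr)

# A made-up tibble for illustration only
sales <- tibble(
  store = c("North", "North", "South", "South", "South"),
  item = c("pens", "ink", "pens", "ink", "paper"),
  amount = c(10, 25, 7, 3, 9)
)

top_sales <- sales |>
  group_by(store) |>
  # Keep exactly one row per group: the one with the largest amount
  slice_max(order_by = amount, n = 1, with_ties = FALSE) |>
  # Remove the grouping so later steps treat the tibble as a whole
  ungroup() |>
  arrange(desc(amount))
```

Setting with_ties = FALSE guarantees one row per group even when two rows share the same maximum; without it, slice_max returns every tied row.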
6.R
Some pollutants tend to be emitted at higher rates than others. For each pollutant, find its total emissions across the entire state of Oregon.
In 6.R, load the air tibble from air.RData with load. Summarize the data in the tibble to find the total emissions for each pollutant. Sort the pollutants from highest to lowest emissions.
The resulting tibble should have two columns, one called pollutant and one called emissions. For example:
pollutant | emissions |
---|---|
Carbon Monoxide | 8070435 |
Volatile Organic Compounds | 2368213 |
PM10 Primary (Filt + Cond) | 1266915 |
… | … |
Save the resulting air tibble, using save, in a file called 6.RData.
7.R
In professional air quality reports, analysts will often calculate total emissions for broad categories of sources. In fact, the level_1 column in the air tibble lists the broad category in which a specific source is included! For each category of source in level_1, calculate the total emissions of each pollutant.
Want to learn more about these categories?
Among these broad categories are:
- Industrial Processes, which includes activities such as chemical manufacturing, metal processing, and food production
- Miscellaneous Area Sources, which includes small-scale activities like residential heating, lawn mowing, and commercial cooking
- Mobile Sources, which includes vehicles such as cars, trucks, airplanes, and ships
- Natural Sources, which includes wildfires, volcanic activity, and biogenic emissions from plants and trees
- Solvent Utilization, which includes the use of products like paints, coatings, and cleaning agents
- Stationary Source Fuel Combustion, which includes the burning of fuels in fixed locations, such as power plants, industrial boilers, and residential heating
- Storage and Transport, which includes the handling and movement of fuels and chemicals, leading to evaporative emissions
- Waste Disposal, Treatment, and Recovery, which includes landfills, waste treatment facilities, and recycling operations
In 7.R, load the air tibble from air.RData with load. Transform the tibble to find the total emissions of each pollutant from each of the level_1 source categories. Sort the rows first alphabetically by source name, then alphabetically by pollutant name.
The resulting tibble should have three columns, one called source, one called pollutant, and one called emissions. For example:
source | pollutant | emissions |
---|---|---|
Industrial Processes | Carbon Monoxide | 1460 |
Industrial Processes | Nitrogen Oxides | 9.96 |
Miscellaneous Area Sources | Ammonia | 161756 |
Miscellaneous Area Sources | Carbon Monoxide | 7385998 |
… | … | … |
Save the resulting air tibble, using save, in a file called 7.RData.
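Totals for each combination of two columns, with one column renamed along the way, can be sketched as below. The orders tibble and the names region, area, product, and units are hypothetical choices for illustration:

```r
library(dplyr)

# A made-up tibble for illustration only
orders <- tibble(
  region = c("West", "East", "West", "East", "West"),
  product = c("tea", "tea", "coffee", "tea", "tea"),
  units = c(4, 2, 6, 1, 5)
)

totals <- orders |>
  # rename() uses new_name = old_name
  rename(area = region) |>
  # Grouping by two columns yields one group per (area, product) pair
  group_by(area, product) |>
  # Collapse each group to a single row; .groups = "drop" removes the grouping
  summarize(units = sum(units), .groups = "drop") |>
  # Sort by the first column, breaking ties with the second
  arrange(area, product)
```

Here totals ends up with three columns (area, product, units) and one row for each pair that actually occurs in the data.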
Advice
There are a few ways to approach these problems! Consider the below as advice to help you on your way:
Use read_csv to read data directly into a tibble
Turns out the tidyverse has a function you can use in place of read.csv! After loading the tidyverse, try using read_csv to load a CSV file directly into a tibble.
Use summarize to summarize data
In lecture, you saw that summarize can calculate the number of rows in a group, using the function n. You can use other functions with summarize too, such as sum:
summarize(DATA, sum())
where DATA is a data frame.
In fact, you can pass an argument to sum if you’d like to summarize data by summing a particular column for each group in your data:
summarize(DATA, sum(COLUMN))
where COLUMN is a column name in your data frame.
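To make the difference between ungrouped and grouped summaries concrete, here is a small sketch on a made-up rainfall tibble (the names rainfall, month, mm, and total are hypothetical). Writing total = sum(mm) also gives the summary column a name of your choosing:

```r
library(dplyr)

# A made-up tibble for illustration only
rainfall <- tibble(
  month = c("Jan", "Jan", "Feb"),
  mm = c(10, 5, 8)
)

# Without groups, summarize() collapses the whole tibble to a single row
overall <- summarize(rainfall, total = sum(mm))

# With groups, summarize() returns one row per group
by_month <- summarize(group_by(rainfall, month), total = sum(mm))
```

If you leave out the name (as in sum(mm) alone), the resulting column is named after the expression itself, which is usually harder to work with afterward.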
Usage
Assuming your .R files are in your working directory, execute each file individually to test your work:
source("1.R")
How to Test
Here’s how to test your code manually:
- Executing 1.R should create a tibble named air with 32,015 rows and 8 columns
- Executing 2.R should create a tibble named air with 32,015 rows and 8 columns, sorted from highest to lowest value in the emissions column
- Executing 3.R should create a tibble named air with rows for a single county and 8 columns
- Executing 4.R should create a tibble named air like the one in 3.R, but with rows sorted from highest to lowest value in the emissions column
- Executing 5.R should create a tibble named air with 36 rows and 8 columns, sorted from highest to lowest value in the emissions column
- Executing 6.R should create a tibble named air with 7 rows and 2 columns, sorted from highest to lowest value in the emissions column
- Executing 7.R should create a tibble named air with 39 rows and 3 columns, sorted first alphabetically by source name and then alphabetically by pollutant name
check50
You can also check your code using check50, a program that CS50 will use to test your code when you submit. But be sure to test it yourself as well!
Run the following command in the RStudio console:
check50("cs50/problems/2024/r/air")
Be sure that you’ve created each .R file’s corresponding .RData file: it’s your .RData files that check50 will check!
Green smilies mean your program has passed a test! Red frownies will indicate your program output something unexpected. Visit the URL that check50 outputs to see the input check50 handed to your program, what output it expected, and what output your program actually gave.
How to Submit
You can submit your code using submit50.
Keeping in mind the course’s policy on academic honesty, run the following command in the RStudio console:
submit50("cs50/problems/2024/r/air")
Acknowledgements
Data retrieved from the United States Environmental Protection Agency’s National Emissions Inventory.