Lecture 2
- Welcome!
- Outliers
- Logical Expressions
- Subsets with Logical Vectors
- Subsets of Data Frames
- Menus
- Escape Characters
- Conditionals
- Combining Data Sources
- Summing Up
Welcome!
- Welcome back to CS50’s Introduction to Programming with R!
- We will learn how to remove portions of data, find specific pieces of data, and how to take different data from different sources and combine them.
Outliers
- In statistics, outliers are data that are outside an expected range.
- Typically, statisticians and data scientists want to identify outliers for special consideration. Sometimes you will want to remove outliers from a calculation. Other times, you might want to analyze the data with outliers included.
- To illustrate how we can work with outliers in R, you can create a new file in RStudio by typing `file.create("temps.R")` in the R console. Further, you will need to download a file called `temps.RData` into your working directory.
- To load data, we can write code as follows:
    # Demonstrates loading data from an .RData file
    load("temps.RData")
    mean(temps)
  Notice how the `load` function loads our data file called `temps.RData`. Next, `mean` will average this data.
- Running this script, you can see the results of this calculation.
- However, as stated before, there are outliers in this underlying data. Let’s discover those.
- Looking at temps overall, as illustrated in the lecture video, we want to be able to directly access these outlier temperatures.
- Recall during Week 1 how we indexed into data in a vector. Modify your code as follows:
    # Demonstrates identifying outliers by index
    load("temps.RData")
    temps[2]
    temps[4]
    temps[7]
    temps[c(2, 4, 7)]
  Notice how `temps[2]` will directly access one of the outlier temperatures. The final line of code takes a subset of the `temps` vector, which includes only the elements at the 2nd, 4th, and 7th indexes.
- As a next step, we can remove the outlier data:
    # Demonstrates removing outliers by index
    load("temps.RData")
    no_outliers <- temps[-c(2, 4, 7)]
    mean(no_outliers)
    mean(temps)
  Notice that the data is loaded. Then, `no_outliers` is a new vector which includes only the temperatures that are not outliers. The vector called `temps` does still include the outlier data.
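Since `temps.RData` is not reproduced in these notes, the following is a minimal sketch of the same indexing technique using a small made-up vector. The values are purely hypothetical and were chosen so that the outliers sit at positions 2, 4, and 7, as in the lecture.

```r
# Hypothetical temperatures standing in for the contents of temps.RData
temps <- c(15, -42, 18, 99, 21, 20, -50, 17)

# Positive indexes select elements; negative indexes drop them
outliers <- temps[c(2, 4, 7)]
no_outliers <- temps[-c(2, 4, 7)]

print(outliers)
print(no_outliers)

# The original vector is unchanged, so the two means differ
print(mean(temps))
print(mean(no_outliers))
```

Because subsetting returns a new vector, `temps` keeps all eight values while `no_outliers` holds only five.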
Logical Expressions
- Logical expressions are a means of programmatically answering yes-or-no questions. Logical expressions make use of logical operators, which are used for comparing values.
- There are many logical operators you can use in R, including:
    ==  !=  >  >=  <  <=
- For example, you could ask whether 1 is equal to 2 by typing `1 == 2` in the R console. The result should be `FALSE` (or “no!”). However, `1 < 2` should be `TRUE` (or “yes!”).
- Logicals are the response provided by a logical expression. Logicals can be `TRUE` or `FALSE`. These values can alternatively be expressed in a more abbreviated form as `T` or `F`.
- Using logical operators within your code, you can modify your code as follows:
    # Demonstrates identifying outliers with logical expressions
    load("temps.RData")
    temps[1] < 0
    temps[2] < 0
    temps[3] < 0
  Notice how running this code will result in answers in terms of `TRUE` and `FALSE` in the R console.
- This code can be further improved as follows:
    # Demonstrates comparison operators are vectorized
    load("temps.RData")
    temps < 0
  Notice how running this code will create a logical vector (i.e., a vector of logicals). Each value in the logical vector answers whether its corresponding value is less than 0.
- To identify the indexes for which some logical expression is true, you can modify your code as follows:
    # Demonstrates `which` to return indices for which a logical expression is TRUE
    load("temps.RData")
    which(temps < 0)
  Notice that now the indexes of the temperatures in the vector that are less than 0 are output to the R console. The function `which` takes a logical vector as input and returns the indexes of the values that are `TRUE`.
- When working with outliers, a common desire is to show data that is below or above a threshold. You can accomplish this in your code as follows:
    # Demonstrates identifying outliers with compound logical expressions
    load("temps.RData")
    temps < 0 | temps > 60
  Notice that the character `|` symbolizes or in the expression. This logical expression will be `TRUE` for any value in `temps` that is less than `0` or greater than `60`.
- In addition to the logical operators we discussed earlier, we now add two new ones to our vocabulary:
    |  &
  Notice how `|` expresses or, while `&` expresses and.
- You can further improve your code as follows:
    # Demonstrates `any` and `all` to test for outliers
    load("temps.RData")
    any(temps < 0 | temps > 60)
    all(temps < 0 | temps > 60)
  Notice how `any` and `all` take logical vectors as input. `any` answers the question, “are any of these logical values true?” `all` answers the question, “are all of these logical values true?”
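To see these pieces working together without `temps.RData`, here is a small sketch on a made-up vector. The temperatures and the variable names `is_outlier` and `in_range` are assumptions for illustration, not part of the lecture code.

```r
# Hypothetical temperatures standing in for the contents of temps.RData
temps <- c(15, -42, 18, 99, 21, 20, -50, 17)

# Comparison operators are vectorized: one logical per element
is_outlier <- temps < 0 | temps > 60
print(is_outlier)

# which() turns the logical vector into the positions that are TRUE
print(which(is_outlier))

# any() and all() collapse a logical vector to a single logical
print(any(is_outlier))
print(all(is_outlier))

# & requires both conditions to hold for an element
in_range <- temps >= 0 & temps <= 60
print(in_range)
```

Here `in_range` is simply the element-wise opposite of `is_outlier`, which shows how `&` and `|` can express the same idea from two directions.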
Subsets with Logical Vectors
- As illustrated before, we can use a logical vector to take a subset of `temps`, here selecting only the outliers:
    # Demonstrates subsetting a vector with a logical vector
    load("temps.RData")
    filter <- temps < 0 | temps > 60
    temps[filter]
  Notice how a new subsetting vector called `filter` is created based on a logical expression. Thus, `filter` can now be provided to `temps` to request only those items in `temps` that evaluated as `TRUE` in the logical expression.
- Similarly, the code can be modified to keep only those values that are not outliers:
    # Demonstrates negating a logical expression with !
    load("temps.RData")
    filter <- !(temps < 0 | temps > 60)
    temps[filter]
  Notice that the added `!` means not: it negates the logical expression, so `filter` is now `TRUE` for the temperatures that are not outliers.
- This negation can be leveraged to remove outliers entirely from the data:
    # Demonstrates removing outliers
    load("temps.RData")
    no_outliers <- temps[!(temps < 0 | temps > 60)]
    save(no_outliers, file = "no_outliers.RData")
    outliers <- temps[temps < 0 | temps > 60]
    save(outliers, file = "outliers.RData")
  Notice how two files are now saved. One excludes the outliers. The other contains only the outliers. These files are saved in the working directory.
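The sketch below walks through the same subset, negate, and save steps on made-up data. It is an illustration under stated assumptions: the temperatures are hypothetical, the logical vector is named `is_outlier` rather than `filter` purely as a naming choice, and the file is written to a temporary path from `tempfile()` instead of the working directory.

```r
# Hypothetical temperatures standing in for the contents of temps.RData
temps <- c(15, -42, 18, 99, 21, 20, -50, 17)

# A logical vector used as a subsetting vector
is_outlier <- temps < 0 | temps > 60
outliers <- temps[is_outlier]
no_outliers <- temps[!is_outlier]
print(outliers)
print(no_outliers)

# Save to a temporary file so nothing is left in the working directory
path <- tempfile(fileext = ".RData")
save(no_outliers, file = path)

# Load into a separate environment to confirm the round trip
restored <- new.env()
load(path, envir = restored)
print(identical(restored$no_outliers, no_outliers))
```

Loading into a fresh environment is only a way to check the saved file without overwriting the variable already in memory; calling `load(path)` directly, as the lecture does, restores `no_outliers` under its original name.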
Subsets of Data Frames
- How can we find a subset of data we are interested in from a dataset?
- Imagine a table of data that logs each chick (a baby chicken!), the feed each chick is fed, and the weight of each chick. You can download `chicks.csv` from the lecture source code to see this data.
- Closing our previous file in RStudio, let’s create a new file in the R console by typing `file.create("chicks.R")`. Ensure you have `chicks.csv` in your working directory, then select `chicks.R` and write your code as follows:
    # Reads a CSV of data
    chicks <- read.csv("chicks.csv")
    View(chicks)
  Notice that `read.csv` reads the CSV file into a data frame called `chicks`. Then, `chicks` is viewed.
- Looking at the output of the above, notice that there are many `NA` values, representing data that is not available. Consider how this may impact a calculation of the average chick weight. Modify your code as follows:
    # Demonstrates `mean` calculation with NA values
    chicks <- read.csv("chicks.csv")
    average_weight <- mean(chicks$weight)
    average_weight
  Notice how running this code will result in `NA` rather than a number, as some values are not available to be mathematically evaluated.
- Missing data is an expected problem within statistics. You, as the programmer, need to make a decision about how to treat missing data. You can calculate the average chick weight while removing `NA` values as follows:
    # Demonstrates na.rm to remove NA values from mean calculation
    chicks <- read.csv("chicks.csv")
    average_weight <- mean(chicks$weight, na.rm = TRUE)
    average_weight
  Notice how `na.rm = TRUE` will remove all `NA` values for the purpose of computing an average with `mean`. Per the documentation, `na.rm` can be set as `TRUE` or `FALSE`.
- Now, let’s figure out how the food each chick eats impacts their weight:
    # Demonstrates computing casein average with explicit indexes
    chicks <- read.csv("chicks.csv")
    casein_chicks <- chicks[c(1, 2, 3), ]
    mean(casein_chicks$weight)
  Notice that a subset of the `chicks` data frame is created by explicitly specifying the appropriate indexes.
- This is not a robust way of programming, since we should expect our data to change. How can we modify our code so that it is more flexible? We can use logical expressions to dynamically subset a data frame.
    # Demonstrates logical expression to identify rows with casein feed
    chicks <- read.csv("chicks.csv")
    chicks$feed == "casein"
  Notice how the logical expression identifies whether each value in the feed column is equal to “casein.”
- We can leverage this logical expression within our code as follows:
    # Demonstrates subsetting data frame with logical vector
    chicks <- read.csv("chicks.csv")
    filter <- chicks$feed == "casein"
    casein_chicks <- chicks[filter, ]
    mean(casein_chicks$weight)
  As demonstrated earlier in the lecture, notice how a logical vector called `filter` is created. Then, only those rows that are `TRUE` in `filter` are brought into the data frame `casein_chicks`.
- We now have a subset of our data frame.
- You can accomplish the same result by using the function `subset`:
    # Demonstrates subsetting with `subset`
    chicks <- read.csv("chicks.csv")
    casein_chicks <- subset(chicks, feed == "casein")
    mean(casein_chicks$weight, na.rm = TRUE)
  This data frame, called `casein_chicks`, is created with the `subset` function.
- Now, one may wish to filter out all `NA` values at the start. Consider the following code:
    # Demonstrates identifying NA values with `is.na`
    chicks <- read.csv("chicks.csv")
    is.na(chicks$weight)
    !is.na(chicks$weight)
    chicks$chick[is.na(chicks$weight)]
  Notice how this code will use `is.na` to find `NA` values.
- Records can be entirely removed, leveraging `is.na` as follows:
    # Demonstrates removing NA values and resetting row names
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    rownames(chicks)
    rownames(chicks) <- NULL
    rownames(chicks)
  Notice how this code creates a subset of `chicks` where `is.na(weight)` is equal to `FALSE`. That is, `chicks` only includes rows where `NA` is not present in the `weight` column. If you care about your data frame’s row names, though, be aware that when you removed certain rows, you also removed those rows’ corresponding `rownames`. You can ensure the names of your rows still ascend sequentially by running `rownames(chicks) <- NULL`, which resets the names of all of your rows.
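As a self-contained illustration of the ideas in this section, the sketch below builds a tiny data frame in place of `chicks.csv`. The six rows, their feeds, and their weights are invented for the example; only the column names `chick`, `feed`, and `weight` come from the lecture code.

```r
# Hypothetical stand-in for chicks.csv
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "fava", "fava", "casein", "fava"),
  weight = c(368, NA, 179, NA, 260, 160)
)

# With NA present, mean() returns NA unless na.rm = TRUE
print(mean(chicks$weight))
print(mean(chicks$weight, na.rm = TRUE))

# Two ways to take the same subset of rows
casein_by_index <- chicks[chicks$feed == "casein", ]
casein_by_subset <- subset(chicks, feed == "casein")
print(mean(casein_by_subset$weight, na.rm = TRUE))

# Drop rows with missing weights, then reset the row names
clean <- subset(chicks, !is.na(weight))
print(rownames(clean))   # "1" "3" "5" "6"
rownames(clean) <- NULL
print(rownames(clean))   # "1" "2" "3" "4"
```

The row names printed before the reset show the gaps left by the removed rows, which is why resetting them can be worthwhile when later code relies on sequential row names.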
Menus
- In R, you can present users with options. For example, you can let the user choose which type of feed to filter the chicks by.
- Consider this code:
    # Demonstrates interactive program to view data by feed type
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Prompt user with options
    cat("1.", feed_options[1])
    cat("2.", feed_options[2])
    cat("3.", feed_options[3])
    cat("4.", feed_options[4])
    cat("5.", feed_options[5])
    cat("6.", feed_options[6])
    feed_choice <- as.integer(readline("Feed type: "))
  Notice how this code uses `unique` to discover the individual unique feed options. Each of these feed options is then outputted with `cat`.
- This code works in the sense that it shows the various options of feed, but it’s not very well formatted. How can we output the different options on their own line in the R console?
Escape Characters
- Escape characters are characters whose output differs from the way you type them.
- For instance, some commonly used escape characters are `\n`, which prints a new line, or `\t`, which prints a tab.
- Leveraging escape characters, we can modify our code as follows:
    # Demonstrates \n
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Prompt user with options
    cat("1.", feed_options[1], "\n")
    cat("2.", feed_options[2], "\n")
    cat("3.", feed_options[3], "\n")
    cat("4.", feed_options[4], "\n")
    cat("5.", feed_options[5], "\n")
    cat("6.", feed_options[6], "\n")
    feed_choice <- as.integer(readline("Feed type: "))
  Notice how this outputs all the options of feed on individual lines.
- While we have the right sort of menu being displayed, we can still improve our code from a design perspective. For example, why should we repeat all of these `cat` lines? Simplify your code as follows:
    # Demonstrates interactive program to view data by feed type
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
  Notice how `formatted_options` includes all the individual feed options. Each element of the `formatted_options` vector is printed and separated by a new line using `cat(formatted_options, sep = "\n")`.
- Now, as we indicated earlier, our intention is to create an interactive program. Thus, we can now prompt the user with options:
    # Demonstrates interactive program to view data by feed type
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    # Print selected option
    selected_feed <- feed_options[feed_choice]
    print(subset(chicks, feed == selected_feed))
  Notice how the user is prompted with `Feed type:`, and the number they enter is converted to an integer and stored in `feed_choice`. Then, the feed option at that index of `feed_options` is assigned to `selected_feed`. Finally, the subset corresponding to `selected_feed` is outputted to the user.
- However, you can imagine how your user may not behave as expected. For example, if the user inputted `0`, which is not a potential choice, the output of our program will be strange. How can we ensure our user inputs a valid choice?
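The menu-building step can be tried on its own, without `chicks.csv` or a live prompt. In this sketch the three feed names are made up, the user's answer is hard-coded as `2L` because `readline` only waits for input in an interactive session, and `seq_along(feed_options)` is swapped in for `1:length(feed_options)` as an assumption about style (it also behaves sensibly when the vector is empty).

```r
# Hypothetical feed options; the lecture derives these with unique(chicks$feed)
feed_options <- c("casein", "fava", "linseed")

# paste0 is vectorized, so each number is glued to its option
formatted_options <- paste0(seq_along(feed_options), ". ", feed_options)

# sep = "\n" prints each element on its own line
cat(formatted_options, sep = "\n")

# Stand-in for: feed_choice <- as.integer(readline("Feed type: "))
feed_choice <- 2L
selected_feed <- feed_options[feed_choice]
print(selected_feed)

# Indexing with 0 returns an empty vector, which is why 0 gives strange output
print(feed_options[0])
```

The last line hints at the problem raised above: an index of `0` is not an error in R, so the program carries on with an empty `selected_feed` rather than stopping.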
Conditionals
- Conditionals are ways to determine `if` a condition has been met.
- Consider the following code:
    # Demonstrates interactive program to view data by feed type
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    # Invalid choice?
    if (feed_choice < 1 || feed_choice > length(feed_options)) {
      cat("Invalid choice.")
    }
    selected_feed <- feed_options[feed_choice]
    print(subset(chicks, feed == selected_feed))
  Notice how `if (feed_choice < 1 || feed_choice > length(feed_options))` determines if the user’s input falls outside a range of values. If so, the program displays “Invalid choice.” However, there’s still a problem: the program will continue to run, even with that invalid choice.
- `if` and `else` can be leveraged as follows to only run the final calculation if the user inputs a valid choice:
    # Demonstrates interactive program to view data by feed type
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    # Determine feed options
    feed_options <- unique(chicks$feed)
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    # Invalid choice?
    if (feed_choice < 1 || feed_choice > length(feed_options)) {
      cat("Invalid choice.")
    } else {
      selected_feed <- feed_options[feed_choice]
      print(subset(chicks, feed == selected_feed))
    }
  Notice that the code wrapped in `if` runs only if there is an invalid choice. The code wrapped in `else` runs only if the condition in `if` was not met.
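To try the `if` and `else` branches without typing at a prompt, the check can be wrapped in a small helper and called with different inputs. This is a sketch under several assumptions: `describe_choice` is a hypothetical function that is not part of the lecture code (writing functions is the topic of the next lecture), the feed names are made up, and the extra `is.na` test is an addition that covers non-numeric input, for which `as.integer` returns `NA`.

```r
# Hypothetical feed options
feed_options <- c("casein", "fava", "linseed")

# Hypothetical helper: returns the chosen option or an error message
describe_choice <- function(feed_choice, options) {
  # is.na() comes first so the comparisons never see a missing value
  if (is.na(feed_choice) || feed_choice < 1 || feed_choice > length(options)) {
    "Invalid choice."
  } else {
    options[feed_choice]
  }
}

print(describe_choice(2L, feed_options))
print(describe_choice(0L, feed_options))

# Non-numeric input becomes NA (with a warning, suppressed here)
not_a_number <- suppressWarnings(as.integer("abc"))
print(describe_choice(not_a_number, feed_options))
```

Because `||` stops evaluating as soon as one condition is `TRUE`, placing `is.na(feed_choice)` first keeps the `if` from failing on a missing value.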
Combining Data Sources
- As a final matter of this lecture, let’s examine how to combine sources of data.
- Imagine a table that represents sales to customers, like Amazon might have.
- You can imagine scenarios where that data is spread across many tables. How can these data be combined from many sources?
- Consider the following code called `sales.R`:
    # Reads 4 separate CSVs
    Q1 <- read.csv("Q1.csv")
    Q2 <- read.csv("Q2.csv")
    Q3 <- read.csv("Q3.csv")
    Q4 <- read.csv("Q4.csv")
  Notice how each quarter of financial data, such as `Q1` and `Q2`, is read into its own data frame.
- Now, let’s combine this data from these four data frames:
    # Combines data frames with `rbind`
    Q1 <- read.csv("Q1.csv")
    Q2 <- read.csv("Q2.csv")
    Q3 <- read.csv("Q3.csv")
    Q4 <- read.csv("Q4.csv")
    sales <- rbind(Q1, Q2, Q3, Q4)
  Notice that `rbind` is used to gather together the data from each of these data frames.
- It’s worth mentioning that `rbind` is usable in this case because all four data frames are structured the same way.
- The result of the previously run program is that `sales` includes each row from each data frame. Instead of showing `Q1`, `Q2`, etc. for each customer, it simply appends each data frame’s rows below those of the previous one. Hence, the data frame becomes longer and longer as more and more data is combined into it. It’s entirely unclear in what quarter each sales value occurred.
- Our code can be improved such that a column for the financial quarter is created for each record as follows:
    # Adds quarter column to data frames
    Q1 <- read.csv("Q1.csv")
    Q1$quarter <- "Q1"
    Q2 <- read.csv("Q2.csv")
    Q2$quarter <- "Q2"
    Q3 <- read.csv("Q3.csv")
    Q3$quarter <- "Q3"
    Q4 <- read.csv("Q4.csv")
    Q4$quarter <- "Q4"
    sales <- rbind(Q1, Q2, Q3, Q4)
  Notice how each quarter is added to a specific `quarter` column. Thus, when `rbind` combines the data frames into `sales`, each row records its quarter in the `quarter` column.
- As a final flourish, let’s add a `value` column where high returns and regular returns are noted:
    # Demonstrates flagging sales as high value
    Q1 <- read.csv("Q1.csv")
    Q1$quarter <- "Q1"
    Q2 <- read.csv("Q2.csv")
    Q2$quarter <- "Q2"
    Q3 <- read.csv("Q3.csv")
    Q3$quarter <- "Q3"
    Q4 <- read.csv("Q4.csv")
    Q4$quarter <- "Q4"
    sales <- rbind(Q1, Q2, Q3, Q4)
    sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")
  Notice how the final line of code assigns “High Value” when the `sale_amount` is greater than `100`. Otherwise, the transaction is assigned “Regular.”
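The same combine-and-flag steps can be sketched with two small data frames built in code rather than read from CSV files. The customers and amounts are invented, and the `customer` column name is an assumption; `sale_amount`, `quarter`, and `value` follow the lecture code.

```r
# Hypothetical quarterly data standing in for Q1.csv and Q2.csv
Q1 <- data.frame(customer = c("A", "B"), sale_amount = c(50, 150))
Q2 <- data.frame(customer = c("A", "C"), sale_amount = c(120, 80))

# Tag each data frame with its quarter before combining
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"

# rbind stacks rows; it works here because the columns match
sales <- rbind(Q1, Q2)

# ifelse is vectorized: one label per row
sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")
print(sales)
```

Adding the `quarter` column before calling `rbind` matters, since once the rows are stacked there is no longer any record of which data frame each row came from.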
Summing Up
In this lesson, you learned how to transform data in R. Specifically, you learned about…
- Outliers
- Logical Expressions
- Subsets
- Menus
- Escape Characters
- Conditionals
- Combining Data Sources
See you next time when we discuss how to write functions of our own.