Lecture 2
- Welcome!
- Outliers
- Logical Expressions
- Subsets with Logical Vectors
- Subsets of Data Frames
- Menus
- Escape Characters
- Conditionals
- Combining Data Sources
- Summing Up
Welcome!
- Welcome back to CS50’s Introduction to Programming with R!
- We will learn how to remove portions of data, find specific pieces of data, and how to take different data from different sources and combine them.
Outliers
- In statistics, outliers are data that are outside an expected range.
- Typically, statisticians and data scientists want to identify outliers for special consideration. Sometimes you will want to remove outliers from a calculation. Other times, you might want to run your analysis with the outliers included.
- To illustrate how we can work with outliers in R, you can create a new file in RStudio by typing `file.create("temps.R")` in the R console. Further, you will need to download a file called `temps.RData` into your working directory.
- To load data, we can write code as follows:

      # Demonstrates loading data from an .RData file
      load("temps.RData")
      mean(temps)

  Notice how the `load` function loads our data file called `temps.RData`. Next, `mean` will average this data.
- Running this script, you can see the results of this calculation.
- However, as stated before, there are outliers in this underlying data. Let’s discover those.
- Looking at temps overall, as illustrated in the lecture video, we want to be able to directly access these outlier temperatures.
- Recall during Week 1 how we indexed into data in a vector. Modify your code as follows:

      # Demonstrates identifying outliers by index
      load("temps.RData")
      temps[2]
      temps[4]
      temps[7]
      temps[c(2, 4, 7)]

  Notice how `temps[2]` will directly access one of the outlier temperatures. The final line of code takes a subset of the `temps` vector, which includes only the elements at the 2nd, 4th, and 7th indexes.
- As a next step, we can remove the outlier data:

      # Demonstrates removing outliers by index
      load("temps.RData")
      no_outliers <- temps[-c(2, 4, 7)]
      mean(no_outliers)
      mean(temps)

  Notice that the data is loaded. Then, the negative indexes in `temps[-c(2, 4, 7)]` drop the 2nd, 4th, and 7th elements, so `no_outliers` is a new vector which includes only the temperatures that are not outliers. The vector called `temps` still includes the outlier data.
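Since `temps.RData` may not be at hand, the same indexing steps can be tried on a small vector typed in by hand. This is only a sketch: the temperatures below are made up for illustration and are not the values from the lecture's data file.

```r
# Hypothetical temperatures; the values at indexes 2, 4, and 7 play the role of outliers
temps <- c(15, -15, 20, 65, 18, 22, 70)

# Positive indexes select elements
outliers <- temps[c(2, 4, 7)]

# Negative indexes drop elements and leave `temps` itself unchanged
no_outliers <- temps[-c(2, 4, 7)]

mean(temps)        # average with the outliers included
mean(no_outliers)  # average with the outliers removed
```

Because `no_outliers` is a separate vector, both averages remain available for comparison.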
Logical Expressions
- Logical expressions are means by which to programmatically answer yes and no questions. Logical expressions make use of logical operators, which are used for comparing values.
- There are many logical operators you can use in R, including:

      ==
      !=
      >
      >=
      <
      <=

- For example, you could ask whether 1 is equal to 2 by typing `1 == 2` in the R console. The result should be `FALSE` (or “no!”). However, `1 < 2` should be `TRUE` (or “yes!”).
- Logicals are the values that a logical expression produces. Logicals can be `TRUE` or `FALSE`. These values can alternatively be expressed in a more abbreviated form as `T` or `F`.
- You can use logical operators within your code as follows:

      # Demonstrates identifying outliers with logical expressions
      load("temps.RData")
      temps[1] < 0
      temps[2] < 0
      temps[3] < 0

  Notice how running this code will result in answers in terms of `TRUE` and `FALSE` in the R console.
- This code can be further improved as follows:

      # Demonstrates comparison operators are vectorized
      load("temps.RData")
      temps < 0

  Notice how running this code will create a logical vector (i.e., a vector of logicals). Each value in the logical vector answers whether its corresponding value in `temps` is less than 0.
- To identify the indexes for which some logical expression is true, you can modify your code as follows:

      # Demonstrates `which` to return indices for which a logical expression is TRUE
      load("temps.RData")
      which(temps < 0)

  Notice that now the indexes of the temperatures in the vector that are less than 0 are output to the R console. The function `which` takes a logical vector as input and returns the indexes of the values that are `TRUE`.
- When working with outliers, a common desire is to show data that is below or above a threshold. You can accomplish this in your code as follows:

      # Demonstrates identifying outliers with compound logical expressions
      load("temps.RData")
      temps < 0 | temps > 60

  Notice that the character `|` symbolizes *or* in the expression. This logical expression will be `TRUE` for any value in `temps` that is less than `0` or greater than `60`.
- In addition to the logical operators we discussed earlier, we now add two new ones to our vocabulary:

      |
      &

  Notice how these operators express *or* (`|`) and *and* (`&`).
- You can further improve your code as follows:

      # Demonstrates `any` and `all` to test for outliers
      load("temps.RData")
      any(temps < 0 | temps > 60)
      all(temps < 0 | temps > 60)

  Notice how `any` and `all` take logical vectors as input. `any` answers the question, “are any of these logical values true?”. `all` answers the question, “are all of these logical values true?”.
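The `&` operator is introduced above but not shown in use. As a brief sketch on made-up temperatures (not the lecture's data), the following contrasts `&` with `|` and feeds the results to `which`, `any`, and `all`. The names `in_range` and `out_of_range` are chosen here for illustration.

```r
# Hypothetical temperatures, made up for illustration
temps <- c(15, -15, 20, 65, 18, 22, 70)

# `&` is TRUE only where both comparisons are TRUE
in_range <- temps >= 0 & temps <= 60

# `|` is TRUE where at least one comparison is TRUE
out_of_range <- temps < 0 | temps > 60

which(in_range)    # indexes of the in-range temperatures
any(out_of_range)  # is at least one temperature out of range?
all(in_range)      # is every temperature in range?
```

For these values, `in_range` is the element-by-element negation of `out_of_range`, so either vector can be used to split the data.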
Subsets with Logical Vectors
- As illustrated before, we can take a subset of a vector. This time, a logical vector selects only the outliers:

      # Demonstrates subsetting a vector with a logical vector
      load("temps.RData")
      filter <- temps < 0 | temps > 60
      temps[filter]

  Notice how a new logical vector called `filter` is created based on a logical expression. Thus, `filter` can now be provided to `temps` to request only those items in `temps` that evaluated as `TRUE` in the logical expression.
- Similarly, the code can be modified to keep only those values that are not outliers:

      # Demonstrates negating a logical expression with !
      load("temps.RData")
      filter <- !(temps < 0 | temps > 60)
      temps[filter]

  Notice that the added `!` means *not*: it negates the logical expression in parentheses.
- This negation can be leveraged to remove outliers entirely from the data:

      # Demonstrates removing outliers
      load("temps.RData")
      no_outliers <- temps[!(temps < 0 | temps > 60)]
      save(no_outliers, file = "no_outliers.RData")
      outliers <- temps[temps < 0 | temps > 60]
      save(outliers, file = "outliers.RData")

  Notice how two files are now saved. One excludes the outliers. The other contains only the outliers. These files are saved in the working directory.
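To see what `save` and `load` do with an object's name, here is a minimal round-trip sketch. It assumes made-up temperatures and writes to a temporary file from `tempfile` rather than to the working directory. The variable names `path` and `loaded_names` are chosen here for illustration.

```r
# Hypothetical temperatures, made up for illustration
temps <- c(15, -15, 20, 65, 18, 22, 70)
no_outliers <- temps[!(temps < 0 | temps > 60)]

# Save to a temporary file so this sketch leaves nothing in the working directory
path <- tempfile(fileext = ".RData")
save(no_outliers, file = path)

# Remove the object, then restore it from the file under its original name
rm(no_outliers)
loaded_names <- load(path)

loaded_names  # names of the objects that `load` restored
no_outliers   # the vector is available again
```

Note that `load` restores objects under the names they were saved with, so there is no need to assign its result to recover the data.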
Subsets of Data Frames
- How can we find a subset of data we are interested in from a dataset?
- Imagine a table of data that logs each chick (a baby chicken!), the feed each chick is fed, and the weight of each chick. You can download `chicks.csv` from the lecture source code to see this data.
- Closing our previous file in RStudio, let’s create a new file by typing `file.create("chicks.R")` in the R console. Ensure you have `chicks.csv` in your working directory, then select `chicks.R` and write your code as follows:

      # Reads a CSV of data
      chicks <- read.csv("chicks.csv")
      View(chicks)

  Notice that `read.csv` reads the CSV file into a data frame called `chicks`. Then, `chicks` is viewed.
- Looking at the output of the above, notice that there are many `NA` values, representing data that is not available. Consider how this may impact a calculation of the average chick weight. Modify your code as follows:

      # Demonstrates `mean` calculation with NA values
      chicks <- read.csv("chicks.csv")
      average_weight <- mean(chicks$weight)
      average_weight

  Notice how running this code will result in `NA` rather than a number, as some values are not available to be mathematically evaluated.
- Missing data is an expected problem within statistics. You, as the programmer, need to make a decision about how to treat missing data. You can calculate the average chick weight while removing `NA` values as follows:

      # Demonstrates na.rm to remove NA values from mean calculation
      chicks <- read.csv("chicks.csv")
      average_weight <- mean(chicks$weight, na.rm = TRUE)
      average_weight

  Notice how `na.rm = TRUE` will remove all `NA` values for the purpose of computing an average with `mean`. Per the documentation, `na.rm` can be set as `TRUE` or `FALSE`.
- Now, let’s figure out how the food each chick eats impacts its weight:

      # Demonstrates computing casein average with explicit indexes
      chicks <- read.csv("chicks.csv")
      casein_chicks <- chicks[c(1, 2, 3), ]
      mean(casein_chicks$weight)

  Notice that a subset of the `chicks` data frame is created by explicitly specifying the appropriate row indexes.
- This approach is brittle, since we should expect our data to change. How can we modify our code so that it is more flexible? We can use logical expressions to dynamically subset a data frame.

      # Demonstrates logical expression to identify rows with casein feed
      chicks <- read.csv("chicks.csv")
      chicks$feed == "casein"

  Notice how the logical expression identifies whether each value in the `feed` column is equal to “casein.”
- We can leverage this logical expression within our code as follows:

      # Demonstrates subsetting data frame with logical vector
      chicks <- read.csv("chicks.csv")
      filter <- chicks$feed == "casein"
      casein_chicks <- chicks[filter, ]
      mean(casein_chicks$weight)

  As demonstrated earlier in the lecture, notice how a logical vector called `filter` is created. Then, only those rows that are `TRUE` in `filter` are brought into the data frame `casein_chicks`.
- We now have a subset of our data frame.
- You can accomplish the same result by using the function `subset`:

      # Demonstrates subsetting with `subset`
      chicks <- read.csv("chicks.csv")
      casein_chicks <- subset(chicks, feed == "casein")
      mean(casein_chicks$weight, na.rm = TRUE)

  This data frame, called `casein_chicks`, is created with the `subset` function.
- Now, one may wish to filter out all `NA` values at the start. Consider the following code:

      # Demonstrates identifying NA values with `is.na`
      chicks <- read.csv("chicks.csv")
      is.na(chicks$weight)
      !is.na(chicks$weight)
      chicks$chick[is.na(chicks$weight)]

  Notice how this code will use `is.na` to find `NA` values.
- Records can be entirely removed, leveraging `is.na` as follows:

      # Demonstrates removing NA values and resetting row names
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))
      rownames(chicks)
      rownames(chicks) <- NULL
      rownames(chicks)

  Notice how this code creates a subset of `chicks` where `is.na(weight)` is equal to `FALSE`. That is, `chicks` only includes rows where `NA` is not present in the `weight` column. If you care about your data frame’s row names, though, be aware that when you removed certain rows, you also removed those rows’ corresponding row names, leaving gaps in the numbering. You can ensure the names of your rows still ascend sequentially by running `rownames(chicks) <- NULL`, which resets the names of all of your rows.
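The steps in this section can be tried without `chicks.csv` by building a small data frame by hand. The sketch below assumes the same three column names (`chick`, `feed`, `weight`) that the lecture's code uses. The rows themselves are made up for illustration.

```r
# Hypothetical data frame standing in for chicks.csv
chicks <- data.frame(
  chick = 1:5,
  feed = c("casein", "casein", "fava", "fava", "casein"),
  weight = c(368, NA, 179, NA, 332)
)

mean(chicks$weight)                # NA, because some weights are missing
mean(chicks$weight, na.rm = TRUE)  # average of the weights that are present

# Drop rows with a missing weight; the surviving rows keep their old row names
chicks <- subset(chicks, !is.na(weight))
rownames(chicks)

# Reset row names so they count up from 1 again
rownames(chicks) <- NULL
rownames(chicks)

# Subset with a logical vector, as in the casein example
casein_chicks <- chicks[chicks$feed == "casein", ]
mean(casein_chicks$weight)
```

Removing the `NA` rows once, up front, means later calls to `mean` no longer need `na.rm = TRUE`.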
Menus
- In R, you can present users with options. For example, you can offer the user the type of feed they wish to filter for the chicks.
- Consider this code:

      # Demonstrates interactive program to view data by feed type

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Prompt user with options
      cat("1.", feed_options[1])
      cat("2.", feed_options[2])
      cat("3.", feed_options[3])
      cat("4.", feed_options[4])
      cat("5.", feed_options[5])
      cat("6.", feed_options[6])
      feed_choice <- as.integer(readline("Feed type: "))

  Notice how this code uses `unique` to discover the individual unique feed options. Each of these feed options is then output with `cat`.
- This code works in the sense that it shows the various options of feed, but it’s not very well formatted. How can we output each option on its own line in the R console?
Escape Characters
- Escape characters are characters whose output differs from the way you type them.
- For instance, some commonly used escape characters are `\n`, which prints a new line, or `\t`, which prints a tab.
- Leveraging escape characters, we can modify our code as follows:

      # Demonstrates \n

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Prompt user with options
      cat("1.", feed_options[1], "\n")
      cat("2.", feed_options[2], "\n")
      cat("3.", feed_options[3], "\n")
      cat("4.", feed_options[4], "\n")
      cat("5.", feed_options[5], "\n")
      cat("6.", feed_options[6], "\n")
      feed_choice <- as.integer(readline("Feed type: "))

  Notice how this outputs all the options of feed on individual lines.
- While we have the right sort of menu being displayed, we can still improve our code from a design perspective. For example, why should we repeat all of these `cat` lines? Simplify your code as follows:

      # Demonstrates interactive program to view data by feed type

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Format feed options
      formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

      # Prompt user with options
      cat(formatted_options, sep = "\n")
      feed_choice <- as.integer(readline("Feed type: "))

  Notice how `formatted_options` holds one string per feed option, each prefixed with its number. Each element of `formatted_options` is printed and separated by a new line using `cat(formatted_options, sep = "\n")`.
- Now, as we indicated earlier, our intention is to create an interactive program. Thus, we can now act on the option the user chooses:

      # Demonstrates interactive program to view data by feed type

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Format feed options
      formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

      # Prompt user with options
      cat(formatted_options, sep = "\n")
      feed_choice <- as.integer(readline("Feed type: "))

      # Print selected option
      selected_feed <- feed_options[feed_choice]
      print(subset(chicks, feed == selected_feed))

  Notice how the user is prompted with `Feed type: ` and the text they enter is converted to an integer, `feed_choice`. That number is used to index into `feed_options`, and the matching feed name is assigned to `selected_feed`. Finally, the subset corresponding to `selected_feed` is output to the user.
- However, you can imagine how your user may not behave as expected. For example, if the user inputs `0`, which is not a potential choice, the output of our program will be strange. How can we ensure our user inputs a valid choice?
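As a small sketch of the formatting ideas in this section, the following builds a menu from made-up feed names (not the contents of `chicks.csv`) and prints a few escape sequences. It uses `seq_along(feed_options)` in place of `1:length(feed_options)`; the two produce the same numbers for a non-empty vector.

```r
# Hypothetical feed options, made up for illustration
feed_options <- c("casein", "fava", "linseed")

# paste0 is vectorized: it produces one formatted string per option
formatted_options <- paste0(seq_along(feed_options), ". ", feed_options)

# Print each option on its own line, then end the last line
cat(formatted_options, sep = "\n")
cat("\n")

# "\t" inserts a tab; "\\" prints a single literal backslash
cat("feed\tcount\n")
cat("A backslash: \\\n")

# An escape sequence is typed as two characters but stored as one
nchar("\n")
```

Note that `sep = "\n"` only places new lines between elements, which is why a final `cat("\n")` is used to end the last line.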
Conditionals
- Conditionals are ways to determine `if` a condition has been met.
- Consider the following code:

      # Demonstrates interactive program to view data by feed type

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Format feed options
      formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

      # Prompt user with options
      cat(formatted_options, sep = "\n")
      feed_choice <- as.integer(readline("Feed type: "))

      # Invalid choice?
      if (feed_choice < 1 || feed_choice > length(feed_options)) {
        cat("Invalid choice.")
      }

      selected_feed <- feed_options[feed_choice]
      print(subset(chicks, feed == selected_feed))

  Notice how `if (feed_choice < 1 || feed_choice > length(feed_options))` determines if the user’s input falls outside a range of values. If so, the program displays “Invalid choice.” However, there’s still a problem: the program will continue to run, even with that invalid choice.
- `if` and `else` can be leveraged as follows to only run the final calculation if the user inputs a valid choice:

      # Demonstrates interactive program to view data by feed type

      # Read and clean data
      chicks <- read.csv("chicks.csv")
      chicks <- subset(chicks, !is.na(weight))

      # Determine feed options
      feed_options <- unique(chicks$feed)

      # Format feed options
      formatted_options <- paste0(1:length(feed_options), ". ", feed_options)

      # Prompt user with options
      cat(formatted_options, sep = "\n")
      feed_choice <- as.integer(readline("Feed type: "))

      # Invalid choice?
      if (feed_choice < 1 || feed_choice > length(feed_options)) {
        cat("Invalid choice.")
      } else {
        selected_feed <- feed_options[feed_choice]
        print(subset(chicks, feed == selected_feed))
      }

  Notice that the code wrapped in `if` runs only if the choice is invalid. The code wrapped in `else` runs only if the condition in `if` was not met.
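One case the range check does not cover is input that is not a number at all: `as.integer` returns `NA` for text such as `"casein"`, and an `if` whose condition is `NA` stops with an error. The sketch below is one possible way to handle that, not part of the lecture's program. The helper name `is_valid_choice` and its parameters are hypothetical.

```r
# Hypothetical helper: TRUE only when `choice` is not NA and lies between 1 and n_options
is_valid_choice <- function(choice, n_options) {
  # Check for NA first; `||` stops at the first TRUE, so the comparisons
  # below are never evaluated on a missing value
  if (is.na(choice) || choice < 1 || choice > n_options) {
    FALSE
  } else {
    TRUE
  }
}

is_valid_choice(as.integer("2"), 6)
is_valid_choice(as.integer("0"), 6)

# suppressWarnings hides the coercion warning that non-numeric text triggers
is_valid_choice(suppressWarnings(as.integer("casein")), 6)
```

Putting the `is.na` test first matters because `||` evaluates its operands from left to right and stops as soon as the result is known.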
Combining Data Sources
- As a final matter of this lecture, let’s examine how to combine sources of data.
- Imagine a table that represents sales to customers, like Amazon might have.
- You can imagine scenarios where that data is spread across many tables. How can these data be combined from many sources?
- Consider the following code in a file called `sales.R`:

      # Reads 4 separate CSVs
      Q1 <- read.csv("Q1.csv")
      Q2 <- read.csv("Q2.csv")
      Q3 <- read.csv("Q3.csv")
      Q4 <- read.csv("Q4.csv")

  Notice how each quarter of financial data, such as `Q1` and `Q2`, is read into its own data frame.
- Now, let’s combine the data from these four data frames:

      # Combines data frames with `rbind`
      Q1 <- read.csv("Q1.csv")
      Q2 <- read.csv("Q2.csv")
      Q3 <- read.csv("Q3.csv")
      Q4 <- read.csv("Q4.csv")
      sales <- rbind(Q1, Q2, Q3, Q4)

  Notice that `rbind` is used to gather together the data from each of these data frames.
- It’s worth mentioning that `rbind` is usable in this case because all four data frames are structured the same way.
- The result of the previously run program is that `sales` includes each row from each data frame. Instead of showing `Q1`, `Q2`, etc. for each customer, `rbind` simply appends each data frame’s rows below the rows of the one before it. Hence, the combined data frame becomes longer and longer as more data is combined into it. It’s entirely unclear in what quarter each sales value occurred.
- Our code can be improved such that a column for the financial quarter is created for each record as follows:

      # Adds quarter column to data frames
      Q1 <- read.csv("Q1.csv")
      Q1$quarter <- "Q1"
      Q2 <- read.csv("Q2.csv")
      Q2$quarter <- "Q2"
      Q3 <- read.csv("Q3.csv")
      Q3$quarter <- "Q3"
      Q4 <- read.csv("Q4.csv")
      Q4$quarter <- "Q4"
      sales <- rbind(Q1, Q2, Q3, Q4)

  Notice how each quarter’s label is added to a `quarter` column. Thus, when `rbind` combines the data frames into `sales`, each row records the quarter in which the sale occurred.
- As a final flourish, let’s add a `value` column where high-value and regular sales are noted:

      # Demonstrates flagging sales as high value
      Q1 <- read.csv("Q1.csv")
      Q1$quarter <- "Q1"
      Q2 <- read.csv("Q2.csv")
      Q2$quarter <- "Q2"
      Q3 <- read.csv("Q3.csv")
      Q3$quarter <- "Q3"
      Q4 <- read.csv("Q4.csv")
      Q4$quarter <- "Q4"
      sales <- rbind(Q1, Q2, Q3, Q4)
      sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")

  Notice how the final line of code assigns “High Value” when the `sale_amount` is greater than `100`. Otherwise, the transaction is assigned “Regular.”
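To experiment with `rbind` and `ifelse` without the quarterly CSV files, the data frames can be built by hand. In this sketch the rows are made up, and the `customer_id` column is an assumption about what such a file might contain. Only `sale_amount`, `quarter`, and `value` come from the code above.

```r
# Hypothetical quarterly data with the same columns in each data frame
Q1 <- data.frame(customer_id = c(1, 2), sale_amount = c(50, 150))
Q2 <- data.frame(customer_id = c(1, 3), sale_amount = c(120, 80))

# Label each data frame before combining so the quarter is not lost
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"

# rbind stacks the rows of data frames whose column names match
sales <- rbind(Q1, Q2)

# ifelse is vectorized: it produces one label per row
sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")

sales
table(sales$quarter, sales$value)  # count of each label within each quarter
```

Adding the `quarter` column before calling `rbind` is what keeps each row traceable to its source once the data frames are stacked.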
Summing Up
In this lesson, you learned how to transform data in R. Specifically, you learned…
- Outliers
- Logical Expressions
- Subsets
- Menus
- Escape Characters
- Conditionals
- Combining Data Sources
See you next time when we discuss how to write functions of our own.