Lecture 2

Welcome!

  • Welcome back to CS50’s Introduction to Programming with R!
  • We will learn how to remove portions of data, find specific pieces of data, and how to take different data from different sources and combine them.

Outliers

  • In statistics, outliers are data that are outside an expected range.
  • Typically, statisticians and data scientists want to identify outliers for special consideration. Sometimes, you will want to remove outliers from a calculation. Other times, you might want to analyze the data with outliers included.
  • To illustrate how we can work with outliers in R, you can create a new file in RStudio by typing file.create("temps.R") in the R console. Further, you will need to download a file called temps.RData into your working directory.
  • To load data, we can write code as follows:

    # Demonstrates loading data from an .RData file
    
    load("temps.RData")
    mean(temps)
    

    Notice how the load function loads our data file called temps.RData. Next, mean will average this data.

  • Running this script, you can see the results of this calculation.
  • However, as stated before, there are outliers in this underlying data. Let’s discover those.
  • Looking at temps overall, as illustrated in the lecture video, we want to be able to directly access these outlier temperatures.
  • Recall during Week 1 how we indexed into data in a vector. Modify your code as follows:

    # Demonstrates identifying outliers by index
    
    load("temps.RData")
    
    temps[2]
    temps[4]
    temps[7]
    
    temps[c(2, 4, 7)]
    

    Notice how temps[2] will directly access one of the outlier temperatures. The final line of code takes a subset of the temps vector, which includes only the elements at the 2nd, 4th, and 7th indexes.

  • As a next step, we can remove the outlier data:

    # Demonstrates removing outliers by index
    
    load("temps.RData")
    no_outliers <- temps[-c(2, 4, 7)]
    
    mean(no_outliers)
    mean(temps)
    

    Notice that the data is loaded. Then, no_outliers is a new vector which includes only the temperatures that are not outliers. The vector called temps does still include the outlier data.
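
As a self-contained sketch of the same idea, the example below uses a made-up temps vector (an assumption, since the contents of temps.RData are not shown here) in which positions 2, 4, and 7 hold the outliers.

```r
# Hypothetical temperatures; positions 2, 4, and 7 hold the outliers
temps <- c(15, -15, 20, 65, 18, 22, -20, 21)

# A negative index drops the listed positions instead of selecting them
no_outliers <- temps[-c(2, 4, 7)]

print(no_outliers)
print(mean(no_outliers))

# The original vector is unchanged and still has all 8 values
print(length(temps))
```

Because subsetting returns a new vector, temps keeps its outliers unless you assign the result back to temps.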

Logical Expressions

  • Logical expressions are means by which to programmatically answer yes and no questions. Logical expressions make use of logical operators, which are used for comparing values.
  • There are many logical operators you can use in R, including:

    ==
    !=
    >
    >=
    <
    <=
    
  • For example, you could ask whether 1 is equal to 2 by typing 1 == 2 in the R console. The result should be FALSE (or “no!”). However, 1 < 2 should be TRUE (or “yes!”).
  • Logicals are the response provided by a logical expression. Logicals can be TRUE or FALSE. These values can alternatively be expressed in a more abbreviated form as T or F, though T and F are ordinary variables that can be reassigned, so TRUE and FALSE are the safer choice in scripts.
  • Using logical operators within your code, you can modify your code as follows:

    # Demonstrates identifying outliers with logical expressions
    
    load("temps.RData")
    
    temps[1] < 0
    temps[2] < 0
    temps[3] < 0
    

    Notice how running this code will result in answers in terms of TRUE and FALSE in the R console.

  • This code can be further improved as follows:

    # Demonstrates comparison operators are vectorized
    
    load("temps.RData")
    
    temps < 0
    

    Notice how running this code will create a logical vector (i.e., a vector of logicals). Each value in the logical vector answers whether the corresponding value in temps is less than 0.

  • To identify the indexes for which some logical expression is true, you can modify your code as follows:

    # Demonstrates `which` to return indices for which a logical expression is TRUE
    
    load("temps.RData")
    
    which(temps < 0)
    

    Notice that now the indexes of the temperatures in the vector that are less than 0 are output to the R console. The function which takes a logical vector as input and returns the indexes of the values that are TRUE.
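
To see the vectorized comparison and which side by side without the data file, here is a small sketch that assumes a made-up temps vector.

```r
# Hypothetical temperatures standing in for temps.RData
temps <- c(15, -15, 20, 65, 18, 22, -20, 21)

# The comparison is applied to every element, producing a logical vector
below_zero <- temps < 0
print(below_zero)

# which() converts the logical vector into the positions that are TRUE
print(which(below_zero))

# TRUE counts as 1 in arithmetic, so sum() counts the matches
print(sum(below_zero))
```

Counting with sum works because R treats TRUE as 1 and FALSE as 0 in arithmetic.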

  • When working with outliers, a common desire is to show data that is below or above a threshold. You can accomplish this in your code as follows:

    # Demonstrates identifying outliers with compound logical expressions
    
    load("temps.RData")
    temps < 0 | temps > 60
    

    Notice that the character | symbolizes or in the expression. This logical expression will be TRUE for any value in temps that is less than 0 or greater than 60.

  • In addition to the logical operators we discussed earlier, we now add two new ones to our vocabulary:

    |
    &
    

    Notice how these operators let you express “or” (|) and “and” (&) when combining logical expressions.

  • You can further improve your code as follows:

    # Demonstrates `any` and `all` to test for outliers
    
    load("temps.RData")
    
    any(temps < 0 | temps > 60)
    all(temps < 0 | temps > 60)
    

    Notice how any and all take logical vectors as input. any answers the question, “are any of these logical values true?”. all answers the question, “are all of these logical values true?”.
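
The sketch below, again on a made-up temps vector, shows how | and & combine element by element and how any and all summarize the result. The names is_outlier and in_range are illustrative only.

```r
# Hypothetical temperatures standing in for temps.RData
temps <- c(15, -15, 20, 65, 18, 22, -20, 21)

# TRUE where a value is below 0 OR above 60
is_outlier <- temps < 0 | temps > 60

# TRUE where a value is at least 0 AND at most 60
in_range <- temps >= 0 & temps <= 60

print(any(is_outlier))  # is at least one value an outlier?
print(all(is_outlier))  # is every value an outlier?

# With no missing values, "in range" is the negation of "outlier"
print(identical(in_range, !is_outlier))
```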

Subsets with Logical Vectors

  • Using logical vectors, we can select only the outliers from temps as follows:

    # Demonstrates subsetting a vector with a logical vector
    
    load("temps.RData")
    filter <- temps < 0 | temps > 60
    temps[filter]
    

    Notice how a new subsetting vector called filter is created based on a logical expression. Thus, filter can now be provided to temps to request only those items in temps that evaluated as TRUE in the logical expression.

  • Similarly, the code can be modified to keep only those values that are not outliers:

    # Demonstrates negating a logical expression with !
    
    load("temps.RData")
    filter <- !(temps < 0 | temps > 60)
    temps[filter]
    

    Notice that the ! operator means “not”: it negates the logical expression, turning each TRUE into FALSE and each FALSE into TRUE.

  • This negation can be leveraged to remove outliers entirely from the data:

    # Demonstrates removing outliers
    load("temps.RData")
    
    no_outliers <- temps[!(temps < 0 | temps > 60)]
    save(no_outliers, file = "no_outliers.RData")
    
    outliers <- temps[temps < 0 | temps > 60]
    save(outliers, file = "outliers.RData")
    

    Notice how two files are now saved. One contains the data without the outliers. The other contains only the outliers. These files are saved in the working directory.
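
As a rough illustration of the save and load round trip, the sketch below writes to a temporary file rather than the working directory. The temps values and the path variable are assumptions made for the example.

```r
# Hypothetical temperatures standing in for temps.RData
temps <- c(15, -15, 20, 65, 18, 22, -20, 21)

# Build the logical vector once and reuse it for both subsets
is_outlier <- temps < 0 | temps > 60
outliers <- temps[is_outlier]
no_outliers <- temps[!is_outlier]

# Save one object to a temporary .RData file
path <- tempfile(fileext = ".RData")
save(no_outliers, file = path)

# Remove the object, then restore it under its original name with load()
rm(no_outliers)
loaded_names <- load(path)

print(loaded_names)
print(no_outliers)
```

Note that load restores objects under the names they were saved with and returns those names, so it can silently overwrite an existing variable of the same name.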

Subsets of Data Frames

  • How can we find a subset of data we are interested in from a dataset?
  • Imagine a table of data that logs each chick (a baby chicken!), the feed each chick is fed, and the weight of each chick. You can download chicks.csv from the lecture source code to see this data.
  • Closing our previous file in RStudio, let’s create a new file in the R console by typing file.create("chicks.R"). Ensure you have chicks.csv in your working directory, then select chicks.R and write your code as follows:

    # Reads a CSV of data
    
    chicks <- read.csv("chicks.csv")
    View(chicks)
    

    Notice that read.csv reads the CSV file into a data frame called chicks. Then, chicks is viewed.

  • Looking at the output of the above, notice that there are many NA values, representing data that is not available. Consider how this may impact a calculation of the average chick weight. Modify your code as follows:

    # Demonstrates `mean` calculation with NA values
    
    chicks <- read.csv("chicks.csv")
    average_weight <- mean(chicks$weight)
    average_weight
    

    Notice how running this code does not produce a number. Instead, the result is NA, since the mean cannot be computed when some of the values are not available.

  • Missing data is an expected problem within statistics. You, as the programmer, need to make a decision about how to treat missing data. You can calculate the average chick weight while removing NA values as follows:

    # Demonstrates na.rm to remove NA values from mean calculation
    
    chicks <- read.csv("chicks.csv")
    average_weight <- mean(chicks$weight, na.rm = TRUE)
    average_weight
    

    Notice how na.rm = TRUE will remove all NA values for the purpose of computing an average with mean. Per the documentation, na.rm can be set as TRUE or FALSE.
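
A minimal sketch of this behavior, using a made-up weights vector rather than the real chicks.csv data:

```r
# Hypothetical weights with two missing values
weights <- c(368, NA, 260, 318, NA)

# Any NA in the input makes the mean NA
print(mean(weights))

# na.rm = TRUE drops the NA values before averaging
print(mean(weights, na.rm = TRUE))

# Count how many values were missing
print(sum(is.na(weights)))
```

The average with na.rm = TRUE is taken over the three available values only, so it describes just the chicks that were actually weighed.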

  • Now, let’s figure out how the food each chick eats impacts their weight:

    # Demonstrates computing casein average with explicit indexes
    
    chicks <- read.csv("chicks.csv")
    casein_chicks <- chicks[c(1, 2, 3), ]
    mean(casein_chicks$weight)
    

    Notice that a subset of the chicks data frame is created by explicitly specifying the appropriate indexes.

  • This is not a robust way of programming, since the row numbers are hard-coded and our data may change. How can we modify our code so that it is more flexible? We can use logical expressions to dynamically subset a data frame.

    # Demonstrates logical expression to identify rows with casein feed
    
    chicks <- read.csv("chicks.csv")
    
    chicks$feed == "casein"
    

    Notice how the logical expression identifies whether each value in the feed column is equal to “casein.”

  • We can leverage this logical expression within our code as follows:

    # Demonstrates subsetting data frame with logical vector
    
    chicks <- read.csv("chicks.csv")
    
    filter <- chicks$feed == "casein"
    casein_chicks <- chicks[filter, ]
    mean(casein_chicks$weight, na.rm = TRUE)
    

    As demonstrated earlier in the lecture, notice how a logical vector called filter is created. Then, only those rows that are TRUE in filter are brought into the data frame casein_chicks.

  • We now have a subset of our data frame.
  • You can accomplish the same result by using the function subset:

    # Demonstrates subsetting with `subset`
    
    chicks <- read.csv("chicks.csv")
    
    casein_chicks <- subset(chicks, feed == "casein")
    mean(casein_chicks$weight, na.rm = TRUE)
    

    This data frame, called casein_chicks, is created with the subset function.
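
To compare the two approaches on data you can inspect directly, here is a sketch with a small made-up chicks data frame. The column names follow the notes, but the values are assumptions.

```r
# Hypothetical stand-in for chicks.csv
chicks <- data.frame(
  chick = 1:6,
  feed = c("casein", "casein", "fava", "fava", "casein", "linseed"),
  weight = c(368, NA, 200, 227, 260, NA)
)

# Approach 1: logical vector in the row position, empty column position
by_index <- chicks[chicks$feed == "casein", ]

# Approach 2: subset() evaluates the condition inside the data frame
by_subset <- subset(chicks, feed == "casein")

# Both approaches select the same chicks
print(by_index$chick)
print(by_subset$chick)

# One casein weight is missing, so na.rm = TRUE is needed for a number
print(mean(by_index$weight))
print(mean(by_subset$weight, na.rm = TRUE))
```

One difference worth knowing: if the condition itself evaluates to NA for some row, subset drops that row, while bracket indexing returns a row filled with NA.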

  • Now, one may wish to filter out all NA values at the start. Consider the following code:

    # Demonstrates identifying NA values with `is.na`
    
    chicks <- read.csv("chicks.csv")
    
    is.na(chicks$weight)
    !is.na(chicks$weight)
    
    chicks$chick[is.na(chicks$weight)]
    

    Notice how this code will use is.na to find NA values.

  • Records can be entirely removed, leveraging is.na as follows:

    # Demonstrates removing NA values and resetting row names
    
    chicks <- read.csv("chicks.csv")
    
    chicks <- subset(chicks, !is.na(weight))
    rownames(chicks)
    
    rownames(chicks) <- NULL
    rownames(chicks)
    

    Notice how this code creates a subset of chicks where is.na(weight) is equal to FALSE. That is, chicks only includes rows where NA is not present in the weight column. If you care about your data frame’s row names, though, be aware that—when you removed certain rows—you also removed those rows’ corresponding rownames. You can ensure the names of your rows still ascend sequentially by running rownames(chicks) <- NULL, which resets the names of all of your rows.
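
The effect on row names is easier to see on a tiny example. The data frame below is made up for illustration.

```r
# Hypothetical stand-in for chicks.csv; rows 1 and 3 have no weight
chicks <- data.frame(
  chick = 1:5,
  feed = c("casein", "casein", "fava", "fava", "linseed"),
  weight = c(NA, 368, NA, 200, 227)
)

# Keep only rows where weight is available
chicks <- subset(chicks, !is.na(weight))

# The surviving rows keep their original names, leaving gaps
names_before <- rownames(chicks)
print(names_before)

# Assigning NULL resets the row names to 1, 2, 3, ...
rownames(chicks) <- NULL
names_after <- rownames(chicks)
print(names_after)
```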

  • In R, you can write interactive programs that present users with options. For example, you can let the user choose which type of feed to filter the chicks by.
  • Consider this code:

    # Demonstrates interactive program to view data by feed type
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Prompt user with options
    cat("1.", feed_options[1])
    cat("2.", feed_options[2])
    cat("3.", feed_options[3])
    cat("4.", feed_options[4])
    cat("5.", feed_options[5])
    cat("6.", feed_options[6])
    feed_choice <- as.integer(readline("Feed type: "))
    

    Notice how this code uses unique to discover the individual unique feed options. Each of these feed options is then outputted with cat.

  • This code works in the sense that it shows the various options of feed, but it’s not very well formatted. How can we output the different options on their own line in the R console?

Escape Characters

  • Escape characters are sequences, written with a backslash followed by another character, that represent a character different from what you literally type.
  • For instance, some commonly used escape characters are \n, which prints a new line, or \t, which prints a tab.
  • Leveraging escape characters, we can modify our code as follows:

    # Demonstrates \n
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Prompt user with options
    cat("1.", feed_options[1], "\n")
    cat("2.", feed_options[2], "\n")
    cat("3.", feed_options[3], "\n")
    cat("4.", feed_options[4], "\n")
    cat("5.", feed_options[5], "\n")
    cat("6.", feed_options[6], "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    

    Notice how this outputs all the options of feed on individual lines.
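
As a brief sketch of how R treats escape characters, the example below builds a made-up two-line menu string. The variable name menu_text is illustrative only.

```r
# "\n" is typed as two characters but stored as a single newline character
print(nchar("\n"))

# A hypothetical menu with an embedded newline and a tab
menu_text <- "1. casein\n2.\tfava"

# print() shows the string with its escape sequences still visible
print(menu_text)

# cat() writes the characters themselves, so the newline and tab take effect
cat(menu_text, "\n")

# Splitting on the newline confirms the string holds two lines
menu_lines <- strsplit(menu_text, "\n")[[1]]
print(menu_lines)
```

This difference between print and cat is why the notes use cat for the menu.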

  • While we have the right sort of menu being displayed, we can still improve our code from a design perspective. For example, why should we repeat all of these cat lines? Simplify your code as follows:

    # Demonstrates interactive program to view data by feed type
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    

    Notice how formatted_options includes all the individual feed options. Each element of this vector of formatted_options is printed and separated by a new line using cat(formatted_options, sep = "\n").
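
The sketch below shows the same formatting step on a made-up feed_options vector. It uses seq_along in place of 1:length(...), which is a swapped-in alternative rather than what the lecture code does, because seq_along also behaves sensibly when the vector is empty.

```r
# Hypothetical feed options standing in for unique(chicks$feed)
feed_options <- c("casein", "fava", "linseed")

# paste0 is vectorized: it pairs each number with the matching feed name
formatted_options <- paste0(seq_along(feed_options), ". ", feed_options)
print(formatted_options)

# sep = "\n" places each element on its own line
cat(formatted_options, sep = "\n")
cat("\n")

# For an empty vector, 1:length(x) counts down from 1 to 0,
# while seq_along(x) returns an empty sequence
empty <- character(0)
print(1:length(empty))
print(seq_along(empty))
```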

  • Now, as we indicated earlier, our intention is to create an interactive program. Thus, we can now prompt the user with options:

    # Demonstrates interactive program to view data by feed type
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    
    # Print selected option
    selected_feed <- feed_options[feed_choice]
    print(subset(chicks, feed == selected_feed))
    

    Notice how the user is prompted with Feed type: and is expected to enter a number, which as.integer converts from text to an integer. Then, feed_options[feed_choice] looks up the name of the feed at that position and assigns it to selected_feed. Finally, the subset corresponding to the selected_feed is outputted to the user.

  • However, you can imagine how your user may not behave as expected. For example, if the user inputted 0, which is not a potential choice, the output of our program will be strange. How can we ensure our user inputs a valid choice?

Conditionals

  • Conditionals let a program run certain code only if a condition has been met.
  • Consider the following code:

    # Demonstrates interactive program to view data by feed type
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    
    # Invalid choice?
    if (feed_choice < 1 || feed_choice > length(feed_options)) {
      cat("Invalid choice.")
    }
    
    selected_feed <- feed_options[feed_choice]
    print(subset(chicks, feed == selected_feed))
    

    Notice how if (feed_choice < 1 || feed_choice > length(feed_options)) determines if the user’s input falls outside a range of values. If so, the program displays “Invalid choice.” However, there’s still a problem: the program will continue to run, even with that invalid choice.

  • if and else can be leveraged as follows to only run the final calculation if the user inputs a valid choice:

    # Demonstrates interactive program to view data by feed type
    
    # Read and clean data
    chicks <- read.csv("chicks.csv")
    chicks <- subset(chicks, !is.na(weight))
    
    # Determine feed options
    feed_options <- unique(chicks$feed)
    
    # Format feed options
    formatted_options <- paste0(1:length(feed_options), ". ", feed_options)
    
    # Prompt user with options
    cat(formatted_options, sep = "\n")
    feed_choice <- as.integer(readline("Feed type: "))
    
    # Invalid choice?
    if (feed_choice < 1 || feed_choice > length(feed_options)) {
      cat("Invalid choice.")
    } else {
      selected_feed <- feed_options[feed_choice]
      print(subset(chicks, feed == selected_feed))
    }
    

    Notice that the code wrapped in if runs only if there is an invalid choice. The code wrapped in else runs only if the condition in if was not met.
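
One case the range check does not cover is non-numeric input: as.integer returns NA for text such as "abc", and an if condition that evaluates to NA stops the program with an error. The sketch below is one possible way to handle this, using a hypothetical helper named is_valid_choice and simulated inputs in place of readline.

```r
# Hypothetical helper: TRUE only for a non-missing choice within 1..n_options
is_valid_choice <- function(choice, n_options) {
  !is.na(choice) && choice >= 1 && choice <= n_options
}

# Hypothetical feed options standing in for unique(chicks$feed)
feed_options <- c("casein", "fava", "linseed")

# Simulated user inputs in place of readline("Feed type: ")
inputs <- c("2", "0", "7", "abc")

for (input in inputs) {
  # as.integer("abc") yields NA with a warning, which is silenced here
  choice <- suppressWarnings(as.integer(input))

  if (is_valid_choice(choice, length(feed_options))) {
    cat(input, "->", feed_options[choice], "\n")
  } else {
    cat(input, "-> Invalid choice.\n")
  }
}
```

Checking is.na first matters because && stops evaluating as soon as the result is known, so the comparisons never see the NA.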

Combining Data Sources

  • As a final matter of this lecture, let’s examine how to combine sources of data.
  • Imagine a table that represents sales to customers, like Amazon might have.
  • You can imagine scenarios where that data is spread across many tables. How can these data be combined from many sources?
  • Consider the following code called sales.R:

    # Reads 4 separate CSVs
    
    Q1 <- read.csv("Q1.csv")
    Q2 <- read.csv("Q2.csv")
    Q3 <- read.csv("Q3.csv")
    Q4 <- read.csv("Q4.csv")
    

    Notice how each quarter of financial data, such as Q1 and Q2, is read into its own data frame.

  • Now, let’s combine this data from these four data frames:

    # Combines data frames with `rbind`
    
    Q1 <- read.csv("Q1.csv")
    Q2 <- read.csv("Q2.csv")
    Q3 <- read.csv("Q3.csv")
    Q4 <- read.csv("Q4.csv")
    
    sales <- rbind(Q1, Q2, Q3, Q4)
    

    Notice that rbind is used to gather together the data from each of these data frames.

  • It’s worth mentioning that rbind is usable in this case because all four data frames are structured the same way.
  • The result of the previously run program is that sales includes each row from each data frame. Instead of showing Q1, Q2, etc. for each customer, rbind simply appends each data frame’s rows to the bottom of the previous one. Hence, the combined data frame becomes longer and longer as more data is added to it, and it’s entirely unclear in what quarter each sales value occurred.
  • Our code can be improved such that a column for the financial quarter is created for each record as follows:

    # Adds quarter column to data frames
    
    Q1 <- read.csv("Q1.csv")
    Q1$quarter <- "Q1"
    
    Q2 <- read.csv("Q2.csv")
    Q2$quarter <- "Q2"
    
    Q3 <- read.csv("Q3.csv")
    Q3$quarter <- "Q3"
    
    Q4 <- read.csv("Q4.csv")
    Q4$quarter <- "Q4"
    
    sales <- rbind(Q1, Q2, Q3, Q4)
    

    Notice how each data frame gets a quarter column holding the name of its quarter. Thus, when rbind combines the data frames into sales, every row still records the quarter in which the sale occurred.

  • As a final flourish, let’s add a value column where high-value and regular sales are noted:

    # Demonstrates flagging sales as high value
    
    Q1 <- read.csv("Q1.csv")
    Q1$quarter <- "Q1"
    
    Q2 <- read.csv("Q2.csv")
    Q2$quarter <- "Q2"
    
    Q3 <- read.csv("Q3.csv")
    Q3$quarter <- "Q3"
    
    Q4 <- read.csv("Q4.csv")
    Q4$quarter <- "Q4"
    
    sales <- rbind(Q1, Q2, Q3, Q4)
    
    sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")
    

    Notice how the final line of code assigns “High Value” when the sale_amount is greater than 100. Otherwise, the transaction is assigned “Regular.”
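
To try these steps without the CSV files, the sketch below builds two small made-up data frames. The sale_amount column name comes from the code above, while customer_id and all the values are assumptions.

```r
# Hypothetical stand-ins for Q1.csv and Q2.csv
Q1 <- data.frame(customer_id = c(1, 2), sale_amount = c(50, 150))
Q2 <- data.frame(customer_id = c(1, 3), sale_amount = c(120, 80))

# Label every row with its quarter before combining
Q1$quarter <- "Q1"
Q2$quarter <- "Q2"

# rbind stacks the rows; both data frames share the same columns
sales <- rbind(Q1, Q2)

# ifelse is vectorized: it checks each sale_amount and picks a label per row
sales$value <- ifelse(sales$sale_amount > 100, "High Value", "Regular")

print(sales)
```

Unlike if and else, which test a single condition, ifelse works on a whole vector at once, which is why it suits filling a column.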

Summing Up

In this lesson, you learned how to transform data in R. Specifically, you learned about…

  • Outliers
  • Logical Expressions
  • Subsets
  • Menus
  • Escape Characters
  • Conditionals
  • Combining Data Sources

See you next time when we discuss how to write functions of our own.