Lecture 5
Welcome!
- Welcome back to CS50’s Introduction to Programming with R!
 - Today, we will be learning about visualizing data. A good visualization can help us interpret and understand data in a whole new way.
 
ggplot2
- The 
plotinggplotmeans that we are going to plot our data. - The 
gginggplotreferences a grammar of graphics where individual components of graphics can be gathered together to visualize data. - There are many components that make up this grammar of graphics, starting with data.
 - Another component is geometries. These are the various types of graphical representation options for plots. These include columns, points, and lines.
 - Finally, aesthetic mappings are the relationships between the data and visual features of our plot. For example, in a plot, a horizontal 
xaxis may represent each candidate. Then, a verticalyaxis may be associated with the number of votes for each candidate. It is through this relationship between data and geometries that we are able to visualize and understand the aesthetic map of our plot. You may be able to imagine times when poorly designed plots have been shown to you: when the mapping is incorrect, data is more challenging to interpret and understand. - 
    
Download the lecture’s source files and run
library("tidyverse")in the R console such that thetidyverseis loaded into memory. Then, create a visualization as follows:# Create a blank visualization votes <- read.csv("votes.csv") ggplot()Notice how
votes.csvis loaded intovotes. Whenggplotis run, nothing is currently visualized. - 
    
We can provide inputs to
ggplotas follows:# Supply data votes <- read.csv("votes.csv") ggplot(votes)Notice how
votesis provided toggplot. Still, nothing is visualized. - 
    
We need to tell
ggplotwhat type of plot we want:# Add first geometry votes <- read.csv("votes.csv") ggplot(votes) + geom_col()Notice that
geom_colspecifies that the data should be visualized with a column geometry. However, at this point, an error will result. The error indicates that we need to designate aesthetic mappings. - Notice, too, that the 
+operator has a new meaning: using the+operator adds a layer on top of the base layer of your plot. - 
    
To designate aesthetic mappings, we can define them as follows:
# Add x and y aesthetics votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col()Notice how various aesthetic mappings, designated by
aes, are defined within the parenthesis. For example,x = candidateandy = votesare both aesthetic mappings. Now,ggplotknows which data maps to which aesthetic features of our plot. - Running the above code, our first visualization finally appears!
 
Scales
- Notice how 
ggplothas decided that the values of thevotesaxis range from0to200. What if we wanted to provide more headroom, such that we could visualize up to250? Let’s learn about Scales. - Scales can be continuous, ranging from one number to another, or discrete, which means categorical.
 - 
    
Continuous scales have limits. For example, the data provided in
votesranges from0to200. Hence, we can modify these limits as follows:# Adjust y scale votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col() + scale_y_continuous(limits = c(0, 250))Notice how the scale of
yis modified viascale_y_continuousto range from0to250. This, again, is provided via a new layer with the+operator. 
Labels
- 
    
Additionally, one can add labels to the plot. Consider the following:
# Add labels votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col() + scale_y_continuous(limits = c(0, 250)) + labs( x = "Candidate", y = "Votes", title = "Election Results" )Notice how labels are provided for
x,y, andtitle. These are added as a new layer via the+operator. 
Fill
- 
    
The fill color can also be changed, depending on the
candidatename. Consider the following:# Add fill aesthetic mapping for geom_col votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col(aes(fill = candidate)) + scale_y_continuous(limits = c(0, 250)) + labs( x = "Candidate", y = "Votes", title = "Election Results" )Notice that the
fillis dependent upon thecandidatevia theaesfunction. - 
    
We may wish to adjust the
fillcolor to be friendly for color blindness. We can do so as follows:# Use viridis scale to design for color blindness votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col(aes(fill = candidate)) + scale_fill_viridis_d("Candidate") + scale_y_continuous(limits = c(0, 250)) + labs( x = "Candidate", y = "Votes", title = "Election Results" )Notice how the
viridisscale is provided viascale_fill_viridis_dfunction. 
Themes
- 
    
One can also modify the themes being used by
ggplot. You can do so as follows:# Adjust ggplot theme votes <- read.csv("votes.csv") ggplot(votes, aes(x = candidate, y = votes)) + geom_col(aes(fill = candidate)) + scale_fill_viridis_d("Candidate") + scale_y_continuous(limits = c(0, 250)) + labs( x = "Candidate", y = "Votes", title = "Election Results" ) + theme_classic()Notice that
theme_classicis provided. ggplot2 offers several themes. 
Saving Your Plot
- 
    
Finally, plots can be saved.
# Save file votes <- read.csv("votes.csv") p <- ggplot(votes, aes(x = candidate, y = votes)) + geom_col(aes(fill = candidate)) + scale_fill_viridis_d("Candidate") + scale_y_continuous(limits = c(0, 250)) + labs( x = "Candidate", y = "Votes", title = "Election Results" ) + theme_classic() ggsave( "votes.png", plot = p, width = 1200, height = 900, units = "px" )Notice how the entire plot is designated as
p. Then,ggsaveis employed, naming the file name, the plot (in this case,p), height, width, and the units. - 
    
By executing this code, you have saved your first plot. Congratulations!
 
Point
- Now, let’s look at a new type of geometry called 
point. - Imagine data where candy’s price percentile and sugar percentile are represented.
 - You can imagine how the 
sugarpercentile can be mapped on theyaxis while thepricepercentile can be noted on thexaxis. - 
    
This can be realized in code form as follows:
# Introduce geom_point load("candy.RData") ggplot( candy, aes(x = price_percentile, y = sugar_percentile) ) + geom_point()Notice how the data
candyis provided to theggplotfunction. Then, the aesthetic mappings are set with theaesfunction.price_percentile, for example, is assigned to thexaxis. Finally, thegeom_pointfunction is run. - Running this code results in points being represented in a plot.
 - 
    
Labels can be added as follows:
# Add labels and theme load("candy.RData") ggplot( candy, aes(x = price_percentile, y = sugar_percentile) ) + geom_point() + labs( x = "Price", y = "Sugar", title = "Price and Sugar" ) + theme_classic()Notice how
labs(labels) forx,y, andtitleare provided. Also, a theme is named. - 
    
Now, a number of points do overlap.
jittercan be used to help visualize points that overlap:# Introduce geom_jitter ggplot( candy, aes(x = price_percentile, y = sugar_percentile) ) + geom_jitter() + labs( x = "Price", y = "Sugar", title = "Price and Sugar" ) + theme_classic()Notice how
geom_pointis replaced withgeom_jitter. This allows for the visualization of points that overlap. - 
    
We can add color aesthetics to our points:
# Introduce size and color aesthetic ggplot( candy, aes(x = price_percentile, y = sugar_percentile) ) + geom_jitter( color = "darkorchid", size = 2 ) + labs( x = "Price", y = "Sugar", title = "Price and Sugar" ) + theme_classic()Notice how all points are changed to one color.
 - 
    
Further, we can change the size and shape of our points:
# Introduce point shape and fill color ggplot( candy, aes(x = price_percentile, y = sugar_percentile) ) + geom_jitter( color = "darkorchid", fill = "orchid", shape = 21, size = 2 ) + labs( x = "Price", y = "Sugar", title = "Price and Sugar" ) + theme_classic()Notice how
shapeandsizeare changed. You can reference the documentation to learn more about which numbers correspond to which shapes. 
Visualizing Over Time
- You can imagine how data can be represented over time.
 - For example, consider how data regarding Hurricane Anita may be represented over time.
 - 
    
We could plot as we did prior with points:
# Visualize with geom_point load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_point()Notice how
timestampandwindspeed are placed in points over time. - 
    
While this visualization is useful, it could be more useful to present with lines showing whether wind speed increased or decreased. Each point can be connected with a line as follows:
# Introduce geom_line load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_line()Notice
geom_lineis employed as a new layer. - 
    
What results is a series of lines that change direction at each timestamp. What if we could combine both
pointandline? Well, indeed, we can!# Combine geom_line and geom_point load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_line() + geom_point(color = "deepskyblue4")Notice how a layer with lines is added via
geom_line. Then,geom_pointis added as a layer usingdeepskyblue4. - 
    
Aesthetics can be modified in various ways:
# Experiment with geom_line and geom_point aesthetics load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_line( linetype = 1, linewidth = 0.5 ) + geom_point( color = "deepskyblue4", size = 2 )Notice how the
linetypeandlinewidthare modified. Then, thesizeof the points is changed. You can reference the documentation to learn more about various line types. - 
    
As with our prior plots today, we can add labels and a theme:
# Add labels and adjust theme load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_line( linetype = 1, linewidth = 0.5 ) + geom_point( color = "deepskyblue4", size = 2 ) + labs( y = "Wind Speed (Knots)", x = "Date", title = "Hurricane Anita" ) + theme_classic()Notice how
labsallows us to designate labels fory,x, andtitle. Then,theme_classicis enabled. - 
    
As a final flourish, we can also add a horizontal line to demarcate the hurricane status. When did Hurricane Anita become a hurricane?
# Add horizontal line to demarcate hurricane status load("anita.RData") ggplot(anita, aes(x = timestamp, y = wind)) + geom_line( linetype = 1, linewidth = 0.5 ) + geom_point( color = "deepskyblue4", size = 2 ) + geom_hline( linetype = 3, yintercept = 64 ) + labs( y = "Wind Speed (Knots)", x = "Date", title = "Hurricane Anita" ) + theme_classic()Notice how a new layer is added to display a line at
yintercept = 64, to designate that anything 64 or higher is considered a hurricane. Thelinetypeis designated as3or dotted. 
Summing Up
In this lesson, you learned how to visualize data in R. Specifically, you learned about:
- ggplot2
 - Scales
 - Labels
 - Fill
 - Themes
 - Point
 - Visualizing over time
 
See you next time when we discuss how to test our programs.