Lecture 8

Spreadsheets

  • Let’s create a sample spreadsheet for users in a web application. The first row of a spreadsheet generally includes a header.

    username name email address phone age
    malan David Malan malan@harvard.edu 33 Oxford Street, Cambridge, MA 02138 617-495-5000  
  • Each column has a data type, like a number or some text. For the username and email fields, the values are additionally unique.

  • We can also relate spreadsheets with one another by having some sort of common identifier; perhaps we would like a spreadsheets with customer information, invoices, products, or any number of other types of data. To relate the data in these spreadsheets, we can use a customer ID.

  • We might start needing larger and larger spreadsheets, and eventually, we might have more rows of data than these programs, like Microsoft Excel or Google Sheets, can handle. In that case, we might want to use a database instead.

Databases

  • A database is a piece of software that can run on our operating systems. Commonly, however, it runs on a server or somewhere else in the cloud to which our software connects.

  • Relational databases mimic the design of spreadsheets—data is stored in rows and columns, and we get more and more rows as we have more and more data.

    • Instead of having spreadsheets, we have a database, and instead of calling individual tabs sheets, we call them tables.

    • Additionally, we have to choose between various data types for each column. This will make searching and sorting the data much more efficient.

    • Examples include Oracle, Microsoft Access, SQL Server, My SQL, Postgres, or SQLite.

Data Types

  • Some of the data types we can use in SQLite include integer, real, numeric, text, and blob.

    • We can use an integer if we would like to represent something like 1, 2, 3, or a negative.

    • We can use real if we’d like to store a floating point value.

    • We can use numeric if we would like to represent dates, times, or other types that are numbers but have more formal structures.

    • We can use text if we have words, phrases, or even whole paragraphs.

    • We can use blob if we would like to store binary large objects, or binary data, like actual files!

  • Other databases support more than these data types, which allows the database to make smarter decisions when storing this data, so when we query the data, the database can respond quickly.

    • In some databases, the integer type can be subdivided into smallint, integer, and bigint, where the smallint type takes up 2 bytes, the integer type takes up 4 bytes, and the bigint type takes up 8 bytes.

    • The real data type can be subdivided into real and double precision, where a real type takes up 4 bytes and double precision takes up 8 bytes.

    • The numeric data type can include boolean, date, datetime, numeric(scale, precision), time, and timestamp.

      • An example of numeric(scale, precision) might be dollar amounts, where we want to store these to two decimal points.

      • timestamp counts the number of milliseconds or seconds from a certain starting point. It became conventional to start counting time from January 1st, 1970, and since timestamp has 4 bytes, when we get to 2038, we’ll run out of bits to represent time! To fix this, we’ll just have to use more space.

    • The text data type can be subdivided into char(n), varchar(n), and text.

      • We use char(n) when we would like to store a fixed number of characters, like state abbreviations, for example. Storing values as char can be very efficient, as indexing is a lot simpler knowing that after each n characters, we have a new value, which allows the database to sort and search with ease.

      • We use varchar(n) when we would like to store values containing up to n characters.

      • We use text whenever we have particularly large text.

  • Taking a look at our spreadsheet again, we might assign these data types:

    username name email address phone date of birth
    varchar(255) varchar(255) varchar(255) varchar(255) char(10) date
    • Note that we’ve used varchar(255) for many of these fields. Historically, many databases had limits of 255 for a varchar, so this is used very commonly.

    • These data types are not the only ones that can be used. Based on the data we’re storing, these types can be different.

SQL

  • SQL, a database language, has some fundamental operations that allow us to work with data inside databases. These operations include:

    • CREATE
    • SELECT
    • UPDATE, INSERT
    • DELETE, DROP
  • We will use SQLite, an implementation of SQL, which stores its tables locally, as a file. Additionally, a number of programs can be used to communicate with a SQL database. We will use DB Browser, which can also be installed on a Mac or PC.

  • When we open DB Browser, we’ll see a list of tables, as shown below.

    listoftables

  • If we open a few tables, we can see the columns inside each, as shown.

    columns

  • We might note that we see a lot of columns with various IDs. This will help us avoid redundancy.

    • For example, let us reconsider our spreadsheet. We currently have the address of the user “malan” as “33 Oxford Street, Cambridge, MA 02138.”

    • It would be rather inefficient for us to perform simple queries on that address column, like searching for the zip code “02138.” We would have to search all of the values in the column and also ignore the extraneous information stored in that address.

      • To fix this, we can change the address field to multiple fields. For example, we can have this table instead:

        username name email street city state zip phone age
        malan David Malan malan@harvard.edu 33 Oxford Street Cambridge MA 02138 617-495-5000  
      • We might assign these data types to the new fields:

        username name email street city state zip phone age
        varchar(255) varchar(255) varchar(255) varchar(255) varchar(255) char(2) char(5) char(10) date
    • Additionally, it turns out that multiple people have that address (we leave them unnamed here). Then, this address would appear repeatedly in our table.

      username name email street city state zip phone age
      malan David Malan malan@harvard.edu 33 Oxford Street Cambridge MA 02138 617-495-5000  
            33 Oxford Street Cambridge MA 02138    
            33 Oxford Street Cambridge MA 02138    
            33 Oxford Street Cambridge MA 02138    
            33 Oxford Street Cambridge MA 02138    
      • To avoid this redundancy, we can create a separate table called “cities.” Inside this table, we’ll have to include an ID to be able to link the previous spreadsheet to this spreadsheet. We might format the cities table like this:

        id city state zip
        integer varchar(255) char(2) char(5)
        1 Cambridge MA 02138
      • Now, if we go back to our original table, we can just write this:

        username name email street city_id phone age
        malan David Malan malan@harvard.edu 33 Oxford Street 1 617-495-5000  
              33 Oxford Street 1    
              33 Oxford Street 1    
              33 Oxford Street 1    
              33 Oxford Street 1    
              33 Oxford Street 1    
      • It’s much more efficient for the computer to perform operations on and relate these small numbers we’ve used for IDs.

  • Back to our music database, we’ll notice that the Album table is related to the Artist table via an ArtistId. This way, multiple albums can have the same artist without storing the artist’s name multiple times.

  • Additionally, if an artist were to change their name, we can simply change their name in the Artist table, without having to go through the entire Album table, searching for that artist.

Executing SQL Queries

  • If we want to see all the albums in the table, we can write SELECT * FROM Album; where SELECT * means to return all columns. After executing, we’ll get a table with the information we would like back. The first 5 entries are shown here (the leftmost column is the row number):

      AlbumId Title ArtistId
    1 1 For Those About To Rock We Salute You 1
    2 2 Balls to the Wall 2
    3 3 Restless and Wild 2
    4 4 Let There Be Rock 1
    5 5 Big Ones 3
  • If we’re curious about who wrote “For Those About To Rock We Salute You,” we’ll take note of the ArtistId for this song, or 1. Then, we can execute another SQL query: SELECT * FROM Artist WHERE ArtistId = 1;. This is our result:

      ArtistId Name
    1 1 AC/DC
    • Note that we used WHERE ArtistId = 1 as a predicate in this case.
  • We can also join data on different tables. Suppose we wanted to know the artist’s name for every album. Our SQL query might look like this (the spacing is for readability):

      SELECT * FROM Album
      JOIN Artist ON Album.ArtistId = Artist.ArtistId;
    
    • In this case, we’re joining the Album table and the Artist table. To join them, we’re matching the ArtistId column of the Album table with the ArtistId column of the Artist table. Our first five rows might look like this:

        AlbumId Title ArtistId ArtistId Name
      1 1 For Those About to Rock We Salute You 1 1 AC/DC
      2 2 Balls to the Wall 2 2 Accept
      3 3 Restless and Wild 2 2 Accept
      4 4 Let There Be Rock 1 1 AC/DC
      5 5 Big Ones 3 3 Aerosmith
    • We might not want to see all the columns in the middle, though. Instead, we can write:

      SELECT Title, Name FROM Album
      JOIN Artist ON Album.ArtistId = Artist.ArtistId;
      
      • Our first five rows now look like this:
        Title Name
      1 For Those About to Rock We Salute You AC/DC
      2 Balls to the Wall Accept
      3 Restless and Wild Accept
      4 Let There Be Rock AC/DC
      5 Big Ones Aerosmith
  • To return a table with album counts for each artist, we can execute this SQL query:

    SELECT Name, COUNT(Name) FROM Album
    JOIN Artist ON Album.ArtistId = Artist.ArtistId
    GROUP BY Name;
    
    • We use the GROUP BY keyword to group albums with the same artist name together.

    • We use the SELECT COUNT(Name) function to return the number of times this artist appears.

    • The first five rows look like this:

        Name COUNT(Name)
      1 AC/DC 2
      2 Aaron Copland & London Symphony Orchestra 1
      3 Aaron Goldberg 1
      4 Academy of St. Martin in the Fields & Sir Nevil… 1
      5 Academy of St. Martin in the Fields Chamber … 1
  • To return the same table as above but only for counts greater than 1, we can write:

    SELECT Name, COUNT(Name) FROM Album
    JOIN Artist ON Album.ArtistId = Artist.ArtistId
    GROUP BY Name HAVING COUNT(Name) > 1;
    
    • The first five rows look like this:

        Name COUNT(Name
      1 AC/DC 2
      2 Accept 2
      3 Amy Winehouse 2
      4 Antonio Carlos Jobim 2
      5 Audioslave 3
    • There are 56 rows in this table, and they were returned in 1 millisecond!

  • We’ve provided some specifications that help the computer respond to these queries quickly.

    • If we designate a column as a primary key, then this column uniquely identifies every row. Generally, the primary key is an integer, since it is much easier to ensure that there are no duplicates with integers than with characters, for example.

      • For example, the id column in the cities table is a primary key.
    • If we define a column to be a primary key in one table, in another table, a column with the same value is called a foreign key.

      • For example, we had an Album table and an Artist table earlier. The Artist table had an ArtistID column. Within the Artist table, that column is the primary key. In the Album table, there was also an ArtistID column, and that column, in this context, is a foreign key.
    • These two keys in conjunction help the database know how to link the two tables efficiently.

    • We can also specify that a column is unique. This doesn’t have to be the primary key; for example, for the users table we created, we would like the usernames and emails to be unique.

    • We can index a column, which will allow future queries on that column to be much quicker. By indexing a column, we’re telling the database beforehand that we plan to search and sort this column frequently. The database will then create a 2D tree structure from the data in that column, thereby ensuring that it doesn’t take as many steps for searching or sorting in the future.

    • We can also autoincrement a column. Instead of manually assigning Cambridge the id of 1 and the next city the id of 2, the autoincrement function will, each time we add a row, increment the previous value. This provides us with unique values that we can use as our primary key.

    • The not null keyword will ensure that the rows being added to our database aren’t only sparsely filled with real data.

  • SQL also provides functions like AVG, COUNT, MAX, MIN, SUM, among many others. We can simply pass in data to these functions to return the value we would like.

Race Conditions

  • We’ll introduce race conditions with an example. Suppose Alice and Bob exist on this planet, and both have just seen David’s latest Instagram post.

  • First, we’ll look at what Instagram can do with a “like.” When someone likes a post, Instagram can use a SELECT statement to see how many likes this post currently has. Then, an UPDATE statement can be executed to increment that number.

  • Now, suppose that Alice and Bob both like David’s instagram post at the same time. Suppose David’s latest post has 2141 likes.

  • Separately, the SELECT statements are processed. Since the likes were sent at the same time, both SELECT statements return 2141, for the current number of likes. Then, both UPDATE statements will change David’s post’s number of likes to 2141 + 1 = 2142, instead of 2143.

  • In this case, Alice and Bob’s inputs triggered a “race” to update data.

  • Now, let’s consider a different example. Suppose Alice and Bob are roommates, and Alice comes home, opens the fridge and finds that there is no milk. She goes to the store to buy some milk. Bob comes home while Alice is gone and finds that there is no milk. He, then, goes to the store and buys some milk. Now, they have twice as much milk.

    • This problem occurred because both Alice and Bob inspected the state of some value and made a decision on it before the other person was done acting on that information.

    • To fix this problem, Alice could write a note on the fridge, notifying Bob that she is gone for milk. Or, more dramatically, Alice could simply lock the fridge, so Bob can’t even inspect that state.

    • Databases provide atomicity, or the ability to do multiple things together or nothing at all. This means that the database cannot be interrupted by someone else’s update.

    • SQL databases also support transactions, where one can execute one or more commands again and again, back to back. All of these transactions have to go through before another person’s command is allowed to get executed.

SQL Injection Attacks

  • An adversary can trick our database into executing a query that we did not intend. This type of attack is a SQL injection attack.

  • Suppose we have a program such that when a user types in an album title, it searches our database for that title.

  • If someone queries our database and types in “Let There Be Rock; Delete” as the title, then, our entire table might be gone! SQL may interpret the semicolon in the input as the end of that line and read “Delete” as its own line, thereby dropping our table.

  • People have designed many packages to prevent SQL injection attacks. These packages often “escape” these dangerous characters, such as semicolons or apostrophes, by adding a backslash before it. Now, if a user were to type in “Let There Be Rock; Delete”, the query will be changed to “Let There Be Rock\; Delete.” This query will be literally interpreted, and the database will search for an album with the title “Let There Be Rock; Delete,” which would likely not return any results.

Scalability

  • If we have only one database, then we have a single point of failure. We could fix this by having two databases, but note that if we store half of our data in one database and half in the other, we’ll actually have two single points of failure. The second database will have to be a duplicate of the first database to resolve this issue.

    • Note that the second database is a duplicate of the first, not a backup of the first. We don’t want to waste time restoring from backup, as we’d like to maintain uptime. As a duplicate, whenever we write data to one database, we write this data to the other as well.
  • If we have more and more data to store, we’ll require more than one primary database. Then, we’ll have to duplicate this additional database as well. Eventually, we might hit a ceiling on vertical scaling—there is only so much space on our servers. We might employ a different method of scaling.

    • Currently, our databases are doing both reading and writing. To read data from a database means to take data from its memory. To write data into a database means to save such data into the database.

    • If we have two primary databases, we can connect databases to each, as their replicas. However, instead of having both read and write capabilities on these replicas, we can simply have read capabilities. Then, these databases exist solely for the purpose to read from them again and again.

    • An example of when this might be useful is on social media. On Facebook, there are likely more reads than there are writes. If the data for a business follows this pattern, where reads are way more common than writes, we can use this model instead.