Lecture 8
- Spreadsheets
- Databases
- Data Types
- SQL
- Executing SQL Queries
- Race Conditions
- SQL Injection Attacks
- Scalability
Spreadsheets
-
Let’s create a sample spreadsheet for users in a web application. The first row of a spreadsheet generally includes a header.
username name email address phone age malan David Malan malan@harvard.edu 33 Oxford Street, Cambridge, MA 02138 617-495-5000 -
Each column has a data type, like a number or some text. For the
username
andemail
fields, the values are additionally unique. -
We can also relate spreadsheets with one another by having some sort of common identifier; perhaps we would like a spreadsheets with customer information, invoices, products, or any number of other types of data. To relate the data in these spreadsheets, we can use a customer ID.
-
We might start needing larger and larger spreadsheets, and eventually, we might have more rows of data than these programs, like Microsoft Excel or Google Sheets, can handle. In that case, we might want to use a database instead.
Databases
-
A database is a piece of software that can run on our operating systems. Commonly, however, it runs on a server or somewhere else in the cloud to which our software connects.
-
Relational databases mimic the design of spreadsheets—data is stored in rows and columns, and we get more and more rows as we have more and more data.
-
Instead of having spreadsheets, we have a database, and instead of calling individual tabs sheets, we call them tables.
-
Additionally, we have to choose between various data types for each column. This will make searching and sorting the data much more efficient.
-
Examples include Oracle, Microsoft Access, SQL Server, My SQL, Postgres, or SQLite.
-
Data Types
-
Some of the data types we can use in SQLite include
integer
,real
,numeric
,text
, andblob
.-
We can use an
integer
if we would like to represent something like 1, 2, 3, or a negative. -
We can use
real
if we’d like to store a floating point value. -
We can use
numeric
if we would like to represent dates, times, or other types that are numbers but have more formal structures. -
We can use
text
if we have words, phrases, or even whole paragraphs. -
We can use
blob
if we would like to store binary large objects, or binary data, like actual files!
-
-
Other databases support more than these data types, which allows the database to make smarter decisions when storing this data, so when we query the data, the database can respond quickly.
-
In some databases, the
integer
type can be subdivided intosmallint
,integer
, andbigint
, where thesmallint
type takes up 2 bytes, theinteger
type takes up 4 bytes, and thebigint
type takes up 8 bytes. -
The
real
data type can be subdivided intoreal
anddouble precision
, where areal
type takes up 4 bytes anddouble precision
takes up 8 bytes. -
The
numeric
data type can includeboolean
,date
,datetime
,numeric(scale, precision)
,time
, andtimestamp
.-
An example of
numeric(scale, precision)
might be dollar amounts, where we want to store these to two decimal points. -
timestamp
counts the number of milliseconds or seconds from a certain starting point. It became conventional to start counting time from January 1st, 1970, and sincetimestamp
has 4 bytes, when we get to 2038, we’ll run out of bits to represent time! To fix this, we’ll just have to use more space.
-
-
The
text
data type can be subdivided intochar(n)
,varchar(n)
, andtext
.-
We use
char(n)
when we would like to store a fixed number of characters, like state abbreviations, for example. Storing values aschar
can be very efficient, as indexing is a lot simpler knowing that after eachn
characters, we have a new value, which allows the database to sort and search with ease. -
We use
varchar(n)
when we would like to store values containing up ton
characters. -
We use
text
whenever we have particularly large text.
-
-
-
Taking a look at our spreadsheet again, we might assign these data types:
username name email address phone date of birth varchar(255) varchar(255) varchar(255) varchar(255) char(10) date -
Note that we’ve used
varchar(255)
for many of these fields. Historically, many databases had limits of 255 for a varchar, so this is used very commonly. -
These data types are not the only ones that can be used. Based on the data we’re storing, these types can be different.
-
SQL
-
SQL, a database language, has some fundamental operations that allow us to work with data inside databases. These operations include:
CREATE
SELECT
UPDATE
,INSERT
DELETE
,DROP
-
We will use SQLite, an implementation of SQL, which stores its tables locally, as a file. Additionally, a number of programs can be used to communicate with a SQL database. We will use DB Browser, which can also be installed on a Mac or PC.
-
When we open DB Browser, we’ll see a list of tables, as shown below.
-
If we open a few tables, we can see the columns inside each, as shown.
-
We might note that we see a lot of columns with various IDs. This will help us avoid redundancy.
-
For example, let us reconsider our spreadsheet. We currently have the address of the user “malan” as “33 Oxford Street, Cambridge, MA 02138.”
-
It would be rather inefficient for us to perform simple queries on that address column, like searching for the zip code “02138.” We would have to search all of the values in the column and also ignore the extraneous information stored in that address.
-
To fix this, we can change the address field to multiple fields. For example, we can have this table instead:
username name email street city state zip phone age malan David Malan malan@harvard.edu 33 Oxford Street Cambridge MA 02138 617-495-5000 -
We might assign these data types to the new fields:
username name email street city state zip phone age varchar(255) varchar(255) varchar(255) varchar(255) varchar(255) char(2) char(5) char(10) date
-
-
Additionally, it turns out that multiple people have that address (we leave them unnamed here). Then, this address would appear repeatedly in our table.
username name email street city state zip phone age malan David Malan malan@harvard.edu 33 Oxford Street Cambridge MA 02138 617-495-5000 33 Oxford Street Cambridge MA 02138 33 Oxford Street Cambridge MA 02138 33 Oxford Street Cambridge MA 02138 33 Oxford Street Cambridge MA 02138 -
To avoid this redundancy, we can create a separate table called “cities.” Inside this table, we’ll have to include an ID to be able to link the previous spreadsheet to this spreadsheet. We might format the cities table like this:
id city state zip integer varchar(255) char(2) char(5) 1 Cambridge MA 02138 -
Now, if we go back to our original table, we can just write this:
username name email street city_id phone age malan David Malan malan@harvard.edu 33 Oxford Street 1 617-495-5000 33 Oxford Street 1 33 Oxford Street 1 33 Oxford Street 1 33 Oxford Street 1 33 Oxford Street 1 -
It’s much more efficient for the computer to perform operations on and relate these small numbers we’ve used for IDs.
-
-
-
Back to our music database, we’ll notice that the
Album
table is related to theArtist
table via an ArtistId. This way, multiple albums can have the same artist without storing the artist’s name multiple times. -
Additionally, if an artist were to change their name, we can simply change their name in the
Artist
table, without having to go through the entireAlbum
table, searching for that artist.
Executing SQL Queries
-
If we want to see all the albums in the table, we can write
SELECT * FROM Album;
whereSELECT *
means to return all columns. After executing, we’ll get a table with the information we would like back. The first 5 entries are shown here (the leftmost column is the row number):AlbumId Title ArtistId 1 1 For Those About To Rock We Salute You 1 2 2 Balls to the Wall 2 3 3 Restless and Wild 2 4 4 Let There Be Rock 1 5 5 Big Ones 3 -
If we’re curious about who wrote “For Those About To Rock We Salute You,” we’ll take note of the
ArtistId
for this song, or 1. Then, we can execute another SQL query:SELECT * FROM Artist WHERE ArtistId = 1;
. This is our result:ArtistId Name 1 1 AC/DC - Note that we used
WHERE ArtistId = 1
as a predicate in this case.
- Note that we used
-
We can also join data on different tables. Suppose we wanted to know the artist’s name for every album. Our SQL query might look like this (the spacing is for readability):
SELECT * FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId;
-
In this case, we’re joining the
Album
table and theArtist
table. To join them, we’re matching theArtistId
column of theAlbum
table with theArtistId
column of theArtist
table. Our first five rows might look like this:AlbumId Title ArtistId ArtistId Name 1 1 For Those About to Rock We Salute You 1 1 AC/DC 2 2 Balls to the Wall 2 2 Accept 3 3 Restless and Wild 2 2 Accept 4 4 Let There Be Rock 1 1 AC/DC 5 5 Big Ones 3 3 Aerosmith -
We might not want to see all the columns in the middle, though. Instead, we can write:
SELECT Title, Name FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId;
- Our first five rows now look like this:
Title Name 1 For Those About to Rock We Salute You AC/DC 2 Balls to the Wall Accept 3 Restless and Wild Accept 4 Let There Be Rock AC/DC 5 Big Ones Aerosmith
-
-
To return a table with album counts for each artist, we can execute this SQL query:
SELECT Name, COUNT(Name) FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId GROUP BY Name;
-
We use the
GROUP BY
keyword to group albums with the same artist name together. -
We use the
SELECT COUNT(Name)
function to return the number of times this artist appears. -
The first five rows look like this:
Name COUNT(Name) 1 AC/DC 2 2 Aaron Copland & London Symphony Orchestra 1 3 Aaron Goldberg 1 4 Academy of St. Martin in the Fields & Sir Nevil… 1 5 Academy of St. Martin in the Fields Chamber … 1
-
-
To return the same table as above but only for counts greater than 1, we can write:
SELECT Name, COUNT(Name) FROM Album JOIN Artist ON Album.ArtistId = Artist.ArtistId GROUP BY Name HAVING COUNT(Name) > 1;
-
The first five rows look like this:
Name COUNT(Name 1 AC/DC 2 2 Accept 2 3 Amy Winehouse 2 4 Antonio Carlos Jobim 2 5 Audioslave 3 -
There are 56 rows in this table, and they were returned in 1 millisecond!
-
-
We’ve provided some specifications that help the computer respond to these queries quickly.
-
If we designate a column as a primary key, then this column uniquely identifies every row. Generally, the primary key is an integer, since it is much easier to ensure that there are no duplicates with integers than with characters, for example.
- For example, the
id
column in thecities
table is a primary key.
- For example, the
-
If we define a column to be a primary key in one table, in another table, a column with the same value is called a foreign key.
- For example, we had an
Album
table and anArtist
table earlier. TheArtist
table had anArtistID
column. Within theArtist
table, that column is the primary key. In theAlbum
table, there was also anArtistID
column, and that column, in this context, is a foreign key.
- For example, we had an
-
These two keys in conjunction help the database know how to link the two tables efficiently.
-
We can also specify that a column is unique. This doesn’t have to be the primary key; for example, for the
users
table we created, we would like the usernames and emails to be unique. -
We can index a column, which will allow future queries on that column to be much quicker. By indexing a column, we’re telling the database beforehand that we plan to search and sort this column frequently. The database will then create a 2D tree structure from the data in that column, thereby ensuring that it doesn’t take as many steps for searching or sorting in the future.
-
We can also autoincrement a column. Instead of manually assigning Cambridge the
id
of 1 and the next city theid
of 2, the autoincrement function will, each time we add a row, increment the previous value. This provides us with unique values that we can use as our primary key. -
The not null keyword will ensure that the rows being added to our database aren’t only sparsely filled with real data.
-
-
SQL also provides functions like
AVG
,COUNT
,MAX
,MIN
,SUM
, among many others. We can simply pass in data to these functions to return the value we would like.
Race Conditions
-
We’ll introduce race conditions with an example. Suppose Alice and Bob exist on this planet, and both have just seen David’s latest Instagram post.
-
First, we’ll look at what Instagram can do with a “like.” When someone likes a post, Instagram can use a
SELECT
statement to see how many likes this post currently has. Then, anUPDATE
statement can be executed to increment that number. -
Now, suppose that Alice and Bob both like David’s instagram post at the same time. Suppose David’s latest post has 2141 likes.
-
Separately, the
SELECT
statements are processed. Since the likes were sent at the same time, bothSELECT
statements return 2141, for the current number of likes. Then, bothUPDATE
statements will change David’s post’s number of likes to 2141 + 1 = 2142, instead of 2143. -
In this case, Alice and Bob’s inputs triggered a “race” to update data.
-
Now, let’s consider a different example. Suppose Alice and Bob are roommates, and Alice comes home, opens the fridge and finds that there is no milk. She goes to the store to buy some milk. Bob comes home while Alice is gone and finds that there is no milk. He, then, goes to the store and buys some milk. Now, they have twice as much milk.
-
This problem occurred because both Alice and Bob inspected the state of some value and made a decision on it before the other person was done acting on that information.
-
To fix this problem, Alice could write a note on the fridge, notifying Bob that she is gone for milk. Or, more dramatically, Alice could simply lock the fridge, so Bob can’t even inspect that state.
-
Databases provide atomicity, or the ability to do multiple things together or nothing at all. This means that the database cannot be interrupted by someone else’s update.
-
SQL databases also support transactions, where one can execute one or more commands again and again, back to back. All of these transactions have to go through before another person’s command is allowed to get executed.
-
SQL Injection Attacks
-
An adversary can trick our database into executing a query that we did not intend. This type of attack is a SQL injection attack.
-
Suppose we have a program such that when a user types in an album title, it searches our database for that title.
-
If someone queries our database and types in “Let There Be Rock; Delete” as the title, then, our entire table might be gone! SQL may interpret the semicolon in the input as the end of that line and read “Delete” as its own line, thereby dropping our table.
-
People have designed many packages to prevent SQL injection attacks. These packages often “escape” these dangerous characters, such as semicolons or apostrophes, by adding a backslash before it. Now, if a user were to type in “Let There Be Rock; Delete”, the query will be changed to “Let There Be Rock\; Delete.” This query will be literally interpreted, and the database will search for an album with the title “Let There Be Rock; Delete,” which would likely not return any results.
Scalability
-
If we have only one database, then we have a single point of failure. We could fix this by having two databases, but note that if we store half of our data in one database and half in the other, we’ll actually have two single points of failure. The second database will have to be a duplicate of the first database to resolve this issue.
- Note that the second database is a duplicate of the first, not a backup of the first. We don’t want to waste time restoring from backup, as we’d like to maintain uptime. As a duplicate, whenever we write data to one database, we write this data to the other as well.
-
If we have more and more data to store, we’ll require more than one primary database. Then, we’ll have to duplicate this additional database as well. Eventually, we might hit a ceiling on vertical scaling—there is only so much space on our servers. We might employ a different method of scaling.
-
Currently, our databases are doing both reading and writing. To read data from a database means to take data from its memory. To write data into a database means to save such data into the database.
-
If we have two primary databases, we can connect databases to each, as their replicas. However, instead of having both read and write capabilities on these replicas, we can simply have read capabilities. Then, these databases exist solely for the purpose to read from them again and again.
-
An example of when this might be useful is on social media. On Facebook, there are likely more reads than there are writes. If the data for a business follows this pattern, where reads are way more common than writes, we can use this model instead.
-