Lecture 7

Welcome!
Databases
Flat-File Databases
Relational Databases
SQL
SQLite
SELECT
DELETE
INSERT
UPDATE
IMDb
Primary Keys and Foreign Keys
One-to-One Relationships
Joins
One-to-Many Relationships
Nested Queries
Many-to-Many Relationships
Indexes
NoSQL
Race Conditions
SQL Injection Attacks
Summing Up

Welcome!

In our previous sessions, we learned about the Internet and the web, exploring how data travels between computers and how web pages are built.
Today is all about deploying databases, particularly when we have lots and lots of data that we want to store at scale.
What do we mean by data? Really just information, and that information can be textual in nature or perhaps binary in nature, that is to say files and such.
At the end of the day, we want to store all of this data in a so-called database, which is a collection of data in a computer system.
Databases specifically tend to be actual products that you can download and install on a computer, for instance, Oracle and SQL Server, MySQL, and Postgres.

Databases

There are different terms you might encounter in this world of data more broadly.
A data warehouse is really a database of databases, all of your data somehow combined in one place.
A data mart is actually just a subset thereof. This might just be your marketing database, financial database, or something else more narrowly defined.
A data lake is a hot mess of data whereby you might just dump all of your data in PDFs and files and CSV files and more, generally with the intent of coming back to it later. But it’s not necessarily as organized as other formats might take.
Today we’ll focus indeed on databases, which are organized collections of data, and those collections themselves can take different forms.

Flat-File Databases

Perhaps the simplest way of storing a lot of data is inside of a so-called flat-file database, which literally means just storing your data in a file, maybe a text file, maybe a binary file.
That file might be structured. A very common format for flat-file databases are CSV files for comma-separated values, whereby all of the textual data in your file is somehow organized effectively into rows and columns.
The rows are straightforward to implement in a CSV file because you just put one line of data per row, hitting Enter at the end of each of those lines.
Within each of those lines or rows, you can indicate the presence of a column, or more specifically an individual cell, by separating the data on that line with commas.
Alternatives include TSV files (tab-separated values) or using vertical bars to delineate data. The choice of character is up to you, but it does influence exactly what type of software or how the software reads that file.
Consider the following example of a phonebook stored in a CSV file:
```
name,number
Brian,+1-617-495-1000
David,+1-617-495-1000
Doug,+1-617-495-1000
Eric,+1-617-495-1000
John,+1-949-468-2750
```
Notice that the first line is the header row, indicating that this file has two columns: name and number. The comma indicates where one column ends and another begins.
If any of your data in a CSV file actually needed to contain actual commas grammatically, you can put all of the data in one cell in double quotes. It just requires that whatever software you’re using knows to treat those quotes appropriately.
The limitation of a flat-file database is that we would essentially have to read the entire file searching for a specific name. Moreover, making changes, additions, or deletions is a fairly manual process. There’s no inherent functionality offered to us.

Relational Databases

To get more user-friendly functionality, we can transition from a flat-file database to a relational database.
A relational database generally refers to software that not only stores your data for you, but also provides you with some core functionality. Moreover, that software allows you to relate data within your set of data to each other in different ways.
Consider a simple example with schools and cities. Harvard and MIT are both in Cambridge, while Oxford is in Oxford. Now suppose we add the University of Cambridge, also in Cambridge (but in the UK).
We have two problems: an inefficiency insofar as we’re storing “Cambridge” multiple times, and a problem of disambiguation since Cambridge, UK is different from Cambridge, Massachusetts, USA.
Let me propose that we explode this single table of data into two tables: one for the schools, one for the cities. Then we assign a unique identifier to each city, for instance, a simple number starting with 1.
Now in the schools table, instead of storing the city name, we store a city ID instead. Both Harvard and MIT would have city ID 1, Oxford would have city ID 2, and Cambridge University would have city ID 3, disambiguating between the two Cambridges.
This is how relational tables work inside of a relational database. It’s a collection of data, and it’s relational insofar as we use unique identifiers to relate some data to other data.

SQL

SQL, or Structured Query Language, is typically the language used with relational databases via which to query data and also update your data and delete data and insert more data.
SQL allows you with code to automate the process of reading and writing data.
There’s an acronym that might help you remember the four basic operations: CRUD.
C stands for Create, R stands for Read, U stands for Update, and D stands for Delete.
In SQL, reading data uses a command called SELECT. CREATE, UPDATE, and DELETE are also used. INSERT exists when we want to insert new data, and DROP can delete an entire table.
SQL is a declarative language where it’s closer to using English to just declare what question you have without worrying as much about how to get from that question to the answer.
You needn’t worry so much about loops, variables, or conditions. You can instead say, closer to English, exactly what data you would like to select, update, delete, or insert.

SQLite

We’re going to use a fairly simple version of SQL called SQLite. The command we’ll use to interact with this lightweight version of SQL is sqlite3.
SQLite is not a toy version of SQL. It’s very commonly used in web browsers and mobile applications, but it doesn’t have as many features as commercial tools suited for the largest sets of data.
To create a database, type sqlite3 phonebook.db in the terminal. This creates an empty database file.
To import data from a CSV file, you can use these SQLite-specific commands:
```
.mode csv
.import phonebook.csv phonebook
.quit
```
Notice that these dot commands are specific to SQLite. The .mode csv puts SQLite into CSV mode, and .import loads the CSV file into a table called phonebook.
To see the schema of your database, use .schema:
```
CREATE TABLE IF NOT EXISTS phonebook(name TEXT, number TEXT);
```
Notice that this shows the structure of the table with two columns: name and number, both of type TEXT.

SELECT

You can read items from a table using the syntax SELECT columns FROM table.
For example, SELECT * FROM phonebook; will print every row in the phonebook table. The * is a wildcard meaning “all columns.”
You can get a subset of the data:
```
SELECT name FROM phonebook;
SELECT number FROM phonebook;
```
Notice that you can specify exactly which columns you want to see.
SQL supports many functions to access data:
```
AVG
COUNT
DISTINCT
LOWER
MAX
MIN
UPPER
```
For example, SELECT COUNT(*) FROM phonebook; returns the count of rows. SELECT DISTINCT number FROM phonebook; returns only unique numbers. You can combine these: SELECT COUNT(DISTINCT number) FROM phonebook; returns a count of unique numbers.

SQL offers additional keywords for queries:

WHERE       -- adding a Boolean expression to filter our data
LIKE        -- filtering responses more loosely
ORDER BY    -- ordering responses
LIMIT       -- limiting the number of responses
GROUP BY    -- grouping responses together

Notice that -- is how you write a comment in SQL.

To find a specific person’s number:
```
SELECT number FROM phonebook WHERE name = 'John';
```
Notice the use of single quotes for text values.

To find everyone with a specific number:

SELECT name FROM phonebook WHERE number = '+1-617-495-1000';

To group and count by number:
```
SELECT number, COUNT(*) FROM phonebook GROUP BY number;
```
Notice how this groups all like numbers together and counts how many names share each number.

DELETE

The DELETE command removes data from a table:
```
DELETE FROM phonebook WHERE name = 'David';
```
Notice the importance of the WHERE clause. Without it, DELETE FROM phonebook; would delete everything in the table.

INSERT

The INSERT command adds new rows:
```
INSERT INTO phonebook (name) VALUES ('David');
INSERT INTO phonebook (name, number) VALUES ('David', '+1-617-495-1000');
```
Notice that if you omit a column, its value becomes NULL, a special sentinel value indicating the deliberate or accidental omission of data.

UPDATE

The UPDATE command modifies existing data:
```
UPDATE phonebook SET number = '+1-617-495-1000' WHERE name = 'David';
```
Notice that without the WHERE clause, this would update every row in the table.

To order results alphabetically:

SELECT * FROM phonebook ORDER BY name;
SELECT * FROM phonebook ORDER BY name ASC;
SELECT * FROM phonebook ORDER BY name DESC;

Notice that ASC is ascending (the default) and DESC is descending order.

IMDb

Let’s transition from our tiny phonebook to an actual real-world database from IMDb, the Internet Movie Database, which has information on actors, TV shows, movies, and more.
If you have thousands or millions of rows, you’ll want to store things as efficiently as possible.
Consider storing TV show data. You could put the show name in one column and then star, star, star for each actor. But different shows have different numbers of stars, creating ragged data. In relational databases, if you’re in the habit of adding more and more columns, you’re probably doing something wrong.
A better approach uses separate tables with unique identifiers. One table for shows (with ID, title, year, episodes), one for people (with ID, name, birth), and intermediate tables to create relationships.
Here is a diagram representing the IMDb database structure:
You’ll see tables like shows, people, stars, writers, ratings, and genres. Lines between tables indicate relationships.

Primary Keys and Foreign Keys

A primary key is a column in a table that uniquely identifies each of its rows, typically with a simple integer value like 1, 2, 3, or much larger.
A foreign key is the presence of a primary key in another table. Foreign keys enable us to create relations across tables.

In SQLite, we have five primary data types:

BLOB       -- binary large objects that are groups of ones and zeros
INTEGER    -- an integer
NUMERIC    -- for numbers formatted specially like dates
REAL       -- like a float
TEXT       -- for strings and the like

Columns can have constraints like NOT NULL (the database won’t let you omit a value) and UNIQUE (the database won’t let you insert duplicate values).

One-to-One Relationships

Between shows and ratings, we have a one-to-one relationship. Every row in the shows table corresponds to one row in the ratings table.
The shows table’s ID column lines up with the ratings table’s show_id column as a foreign key.
To see this in action with the shows database:
```
SELECT * FROM shows LIMIT 10;
SELECT * FROM ratings LIMIT 10;
SELECT COUNT(*) FROM shows;
```
Notice the use of LIMIT to restrict output to the first 10 rows.

Joins

A join allows you to take two tables and join them together so you have access to all the data you care about.

To join shows with ratings:

SELECT * FROM shows
    JOIN ratings ON shows.id = ratings.show_id
    WHERE shows.id = 386676;

Notice how we specify which columns to match on with ON shows.id = ratings.show_id.

To get just the title and rating:

SELECT title, rating FROM shows
    JOIN ratings ON shows.id = ratings.show_id
    WHERE shows.id = 386676;

Notice that this returns “The Office” with a rating of 9.0.

One-to-Many Relationships

Between shows and genres, we have a one-to-many relationship. One show can have many genres (comedy, drama, thriller, etc.).
To find genres for a specific show:
```
SELECT genre FROM genres WHERE show_id = 63881;
```
Notice this returns adventure, comedy, and family for Catweazle.

Nested Queries

You can nest SQL commands together. One query can figure out a unique ID, and then that result feeds into another query:
```
SELECT genre FROM genres
    WHERE show_id = (
        SELECT id FROM shows WHERE title = 'Catweazle'
    );
```
Notice that the inner query in parentheses executes first, returning the ID, which is then used by the outer query.

You could also use a join to solve the same problem:

SELECT genre FROM shows
    JOIN genres ON shows.id = genres.show_id
    WHERE shows.id = 63881;

Many-to-Many Relationships

Between shows and people (via the stars table), we have a many-to-many relationship. One person could be in multiple shows, and one show has multiple people.
The stars table has two columns: show_id and person_id. This junction table relates one table to another.

To find all people who starred in The Office (2005):

SELECT name FROM people
    WHERE id IN (
        SELECT person_id FROM stars
        WHERE show_id = (
            SELECT id FROM shows WHERE title = 'The Office' AND year = 2005
        )
    );

Notice the nested queries: we first find the show’s ID, then find all person_ids from stars, then get those people’s names.

To find all shows Steve Carell has been in using nested queries:

SELECT title FROM shows
    WHERE id IN (
        SELECT show_id FROM stars
        WHERE person_id = (
            SELECT id FROM people WHERE name = 'Steve Carell'
        )
    );

The same query using explicit joins:

SELECT title FROM shows
    JOIN stars ON shows.id = stars.show_id
    JOIN people ON stars.person_id = people.id
    WHERE name = 'Steve Carell';

Or using an implicit join:

SELECT title FROM shows, stars, people
    WHERE shows.id = stars.show_id
    AND people.id = stars.person_id
    AND name = 'Steve Carell';

Notice how the comma-separated table names create an implicit join, with the WHERE clause specifying the relationships.

Indexes

Not all queries perform the same. If you have not indexed your database tables, queries can be slow.
An index allows you to optimize the underlying representation of data by creating a search structure for specific columns:
```
CREATE INDEX title_index ON shows (title);
```
Notice that you give the index a name and specify which table and column(s) to index.
When you create an index, the database builds a B-tree data structure in memory:
A B-tree is a fairly wide tree with leaf nodes as close to the root as possible, giving logarithmic running time instead of linear search.
To see the effect, you can enable timing in SQLite:
```
.timer on
```
Before creating an index, a query like SELECT * FROM shows WHERE title = 'The Office'; might take 0.038 seconds. After creating the index on the title column, the same query takes essentially 0.000 seconds.
The trade-off: indexes take up additional space in memory, and for tables that are frequently updated, they might slow down insertions, updates, and deletions. But for the most common queries like searching, indexes provide significant speedups.

NoSQL

An alternative approach to storing data is NoSQL, which stands for “not SQL” or “not only SQL.”
Instead of rows and columns, NoSQL stores documents or objects, often in JSON format:
```
{
    "_id": 386676,
    "title": "The Office",
    "stars": [
        {"id": 136797, "name": "Steve Carell"},
        ...
    ]
}
```
Notice that all related data is bundled together in one document rather than spread across multiple tables.
The appeal of NoSQL is that you can grab all the data you care about right away. But you don’t necessarily benefit from the same constraints that a SQL database can impose, like uniqueness and referential integrity. Nor do you have features like indexes for efficient searching.

Race Conditions

Race conditions can occur when multiple requests access the same database simultaneously.
Consider a social media “like” button. The code might:
1. Select the current number of likes: SELECT likes FROM posts WHERE id = 13;
2. Add one and update: UPDATE posts SET likes = 51 WHERE id = 13;
If two users click “like” at essentially the same time on different servers, both might read “50 likes,” both add 1 to get 51, and both update to 51, when the correct answer should be 52.
This is like two roommates both checking an empty fridge, both going to buy milk, and both coming home with milk.
The solution is transactions and locks:
```
BEGIN TRANSACTION;
SELECT likes FROM posts WHERE id = 13;
UPDATE posts SET likes = 51 WHERE id = 13;
COMMIT;
```
Notice that transactions ensure both queries execute atomically, without interruption from other processes.

SQL Injection Attacks

SQL injection attacks occur when programmers trust user input without sanitizing it.
Consider a login form where someone enters their email as malan@harvard.edu'--:
If the server naively constructs a query like:
```
SELECT * FROM users WHERE username = 'malan@harvard.edu'--' AND password = '...'
```
Notice that -- starts a SQL comment, causing the password check to be ignored entirely.
The attacker could log in without knowing the password.
The solution is to never trust user input. Use parameterized queries with placeholders:
```
SELECT * FROM users WHERE username = ? AND password = ?
```
Notice that the ? placeholders let the library handle proper escaping and sanitization.
Stand on the shoulders of others and use well-tested libraries to format SQL queries safely.

Summing Up

In this lesson, you learned about deploying databases at scale. Specifically, you learned…

About different types of data storage: databases, data warehouses, data marts, and data lakes.
How flat-file databases like CSV files store data in rows and columns.
How relational databases use unique identifiers to relate data across tables.
About SQL, Structured Query Language, for querying and manipulating data.
The CRUD operations: Create, Read, Update, Delete.
How to use SELECT, INSERT, UPDATE, and DELETE commands.
About primary keys and foreign keys for creating relationships.
The difference between one-to-one, one-to-many, and many-to-many relationships.
How to use joins and nested queries to combine data from multiple tables.
How indexes speed up queries using B-tree data structures.
About NoSQL as an alternative to relational databases.
About race conditions and how transactions prevent them.
About SQL injection attacks and how to prevent them with parameterized queries.

See you next time!