your.harvard
The my.harvard logo
Problem to Solve
If you’re not already familiar, Harvard has a course shopping tool called my.harvard, with which students explore (and ultimately register for!) classes. To keep track of courses, students, and their registrations, my.harvard presumably uses some kind of underlying database. And yet, if you’ve ever used it, you’ll know that my.harvard isn’t especially… quick.
Here’s your chance to make my.harvard just a little bit faster! In this problem, take some Harvard course data and create indexes to speed up typical queries on the database. Keep in mind that indexing every column isn’t always the best solution: you’ll need to consider trade-offs in terms of space and time, ultimately representing Harvard’s courses and students in the most efficient way possible.
Demo
Distribution Code
For this problem, you’ll need to download harvard.db
and an indexes.sql
file. In indexes.sql
, you’ll write SQL statements to create your indexes and leave comments to explain your reasoning.
Download the distribution code
Log into cs50.dev, click on your terminal window, and execute cd
by itself. You should find that your terminal window’s prompt resembles the below:
$
Next execute
wget https://cdn.cs50.net/sql/2023/x/psets/5/harvard.zip
in order to download a ZIP called harvard.zip
into your codespace.
Then execute
unzip harvard.zip
to create a folder called harvard
. You no longer need the ZIP file, so you can execute
rm harvard.zip
and respond with “y” followed by Enter at the prompt to remove the ZIP file you downloaded.
Now type
cd harvard
followed by Enter to move yourself into (i.e., open) that directory. Your prompt should now resemble the below.
harvard/ $
If all was successful, you should execute
ls
and see a database named harvard.db
alongside a SQL file named indexes.sql
. If not, retrace your steps and see if you can determine where you went wrong!
Schema
erDiagram
"Student" }o--|{ "Course" : "enrolls in"
"Course" }|--o{ "Requirement" : "satisfies"
Within harvard.db
, you’ll find five tables that implement the relationships described in the ER diagram above. Click the drop-downs below to learn more about the schema of each individual table.
students
table
The students
table contains the following columns:
id
, which is the student’s ID.name
, which is the student’s name.
courses
table
The courses
table contains the following columns:
id
, which is the courses’s ID.department
, which is the department in which the course is taught (e.g., “Computer Science”, “Economics”, “Philosophy”).number
, which is the course number (e.g., 50, 12, 330).semester
, which is the semester in which the class was taught (e.g., “Spring 2024”, “Fall 2023”).title
, which is the title of the course (e.g., “Introduction to Computer Science”).
enrollments
table
The enrollments
table contains the following columns:
id
, which is the ID to identify the enrollment.student_id
, which is the ID of the student enrolled.course_id
, which is the ID of the course in which the student is enrolled.
requirements
table
The requirements
table contains the following columns:
id
, which is the ID of the requirement.name
, which is the name of the requirement.
satisfies
table
The satisfies
table contains the following columns:
id
, which is the ID of the course-requirement pair.course_id
, which is the ID of a given course.requirement_id
, which is the ID of the requirement which the given course satisfies.
Specification
In indexes.sql
, write a set of SQL statements that create indexes to optimize typical queries on the harvard.db
database. The number of indexes you create, as well as the tables and columns they include, is entirely up to you. For each index, be sure to leave a comment that explains your reasoning about why the index will optimize typical queries on harvard.db
.
Good comments should:
- Be concise (i.e., no longer than 3 sentences)
- Identify the type of query the index will optimize
- Justify the implementation of the index, such as the tables and columns you’ve chosen to include or not include
Before you create your indexes, take a look at some of the typical queries run on harvard.db
! Click the spoiler tag below to see the set of typical SELECT
queries run on harvard.db
.
Typical SELECT
queries on harvard.db
-
Find a student’s historical course enrollments, based on their ID:
SELECT "courses"."title", "courses"."semester" FROM "enrollments" JOIN "courses" ON "enrollments"."course_id" = "courses"."id" JOIN "students" ON "enrollments"."student_id" = "students"."id" WHERE "students"."id" = 3;
-
Find all students who enrolled in Computer Science 50 in Fall 2023:
SELECT "id", "name" FROM "students" WHERE "id" IN ( SELECT "student_id" FROM "enrollments" WHERE "course_id" = ( SELECT "id" FROM "courses" WHERE "courses"."department" = 'Computer Science' AND "courses"."number" = 50 AND "courses"."semester" = 'Fall 2023' ) );
-
Sort courses by most- to least-enrolled in Fall 2023:
SELECT "courses"."id", "courses"."department", "courses"."number", "courses"."title", COUNT(*) AS "enrollment" FROM "courses" JOIN "enrollments" ON "enrollments"."course_id" = "courses"."id" WHERE "courses"."semester" = 'Fall 2023' GROUP BY "courses"."id" ORDER BY "enrollment" DESC;
-
Find all computer science courses taught in Spring 2024:
SELECT "courses"."id", "courses"."department", "courses"."number", "courses"."title" FROM "courses" WHERE "courses"."department" = 'Computer Science' AND "courses"."semester" = 'Spring 2024';
-
Find the requirement satisfied by “Advanced Databases” in Fall 2023:
SELECT "requirements"."name" FROM "requirements" WHERE "requirements"."id" = ( SELECT "requirement_id" FROM "satisfies" WHERE "course_id" = ( SELECT "id" FROM "courses" WHERE "title" = 'Advanced Databases' AND "semester" = 'Fall 2023' ) );
-
Find how many courses in each requirement a student has satisfied:
SELECT "requirements"."name", COUNT(*) AS "courses" FROM "requirements" JOIN "satisfies" ON "requirements"."id" = "satisfies"."requirement_id" WHERE "satisfies"."course_id" IN ( SELECT "course_id" FROM "enrollments" WHERE "enrollments"."student_id" = 8 ) GROUP BY "requirements"."name";
-
Search for a course by title and semester:
SELECT "department", "number", "title" FROM "courses" WHERE "title" LIKE "History%" AND "semester" = 'Fall 2023';
Be sure to consider the Advice section as you get started!
Advice
In this problem, you’ll take the opposite perspective you did while working on In a Snap: rather than design a query that takes advantage of existing indexes, your task is to design indexes which existing queries can take advantage of.
Use EXPLAIN QUERY PLAN
on each SELECT
query to assess where best to create indexes
Begin by assessing where best to create indexes by understanding the plan for each typical query on my.harvard’s database.
For example, try revealing the plan for the first typical query, as by executing the following:
EXPLAIN QUERY PLAN
SELECT "courses"."title", "courses"."semester"
FROM "enrollments"
JOIN "courses" ON "enrollments"."course_id" = "courses"."id"
JOIN "students" ON "enrollments"."student_id" = "students"."id"
WHERE "students"."id" = 3;
The output of the above is as follows:
QUERY PLAN
|--SEARCH students USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN enrollments
`--SEARCH courses USING INTEGER PRIMARY KEY (rowid=?)
Notice that, while the database engine is already SEARCH
ing the students
and courses
tables using their primary key indexes, there are still improvements to be made: the database engine is SCAN
ning the enrollments
table without an index. Recall that to SCAN
means that the database engine must search through all rows, one by one—a process that is much slower than searching an index!
Experiment now by creating an index which could turn that SCAN
into a SEARCH
that uses an index. Then, repeat the same process for each of the typical queries on my.harvard’s database until you’ve arrived at a set of indexes which ensure all queries are using indexes to their full potential.
Minimize the number of indexes you've created
Keep in mind that indexes take up additional space, and that they can slow INSERT
, UPDATE
, and DELETE
queries. Once you’ve arrived at an initial set of indexes, start paring them down until you’ve created the minimum required for each query to use indexes optimally. How to start this process? Consider the following questions:
- Do any of your indexes include the same columns? If so, it’s likely you need only one index on that particular column.
- Do any of your indexes include columns unused by the given queries? If so, it’s likely you can remove those columns from your indexes.
- Does removing an index have any impact on each query’s plan? If not, might be best to remove it!
Through the iterative process above, you’ll refine the indexes you’ve chosen to create.
Usage
To load your indexes as you write them in indexes.sql
, you can use
.read indexes.sql
Keep in mind you can also use
DROP INDEX name;
where name
is the name of your index, to remove an index before creating it anew.
You may want to use VACUUM
to free up disk space after you delete an index!
How to Test
While check50
is available for this problem, the problem’s open-ended nature means there aren’t clear right or wrong answers. Instead, try following the advice and use what you’ve learned from lecture.
Correctness
check50 cs50/problems/2024/sql/harvard
How to Submit
After you submit, be sure to check your autograder results. If you see SUBMISSION ERROR: missing files (0.0/1.0)
, it means your file was not named exactly as prescribed (or you uploaded it to the wrong problem).
Correctness in submissions entails everything from reading the specification, writing code that is compliant with it, and submitting files with the correct name. If you see this error, you should resubmit right away, making sure your submission is fully compliant with the specification. The staff will not adjust your filenames for you after the fact!
Ensure your terminal’s working directory is the harvard
folder. Your prompt should resemble the below:
harvard/ $
When you type ls
to list the files in your working directory, you should see your .sql
files for harvard
.
Zip up your solution files by executing the following:
zip harvard-solution.zip *.sql
Want to learn about this command?
Notice this command has three parts:
zip
, which is the name of the tool which will create a.zip
fileharvard-solution.zip
, which is the name of the.zip
file to create*.sql
, which represents the file(s) to include.*.sql
matches all files that end with.sql
, as*
is a “wildcard” character which matches any set of characters (similar to%
in SQL!).
Next, download harvard-solution.zip
by control-clicking or right-clicking on the file in your codespace’s file browser and choosing Download.
Go to CSCI E-151’s Gradescope page.
Click Problem Set 5: your.harvard.
Drag and drop your .zip
file to the area that says Drag & Drop. Be sure that each .sql
file is correctly named exactly as prescribed above, lest the autograder fail to run on your submission! Note that your submission is considered incomplete if any of the files are missing—be sure they’re all there!
Click Upload.
You should see a message that says “Problem Set 5: your.harvard submitted successfully!”
Be sure to double-check your autograder results before moving on!