Moneyball
It’s about getting things down to one number. Using stats the way we read them, we’ll find the value in players that nobody else can see.
Peter Brand in Moneyball
Problem to Solve
The year is 2001. You’ve been hired to help make the most of the Oakland Athletics baseball team’s dwindling player budget. Each year, teams like the “A’s” hire new baseball players. Unfortunately, you’re low on star players—and on funds. Though, with a bit of SQL and some luck, who says you can’t still create a team that defies expectations?
Given a database called moneyball.db
—one that contains information on players, their performances, and their salaries—help the Oakland Athletics find the value in players others might miss.
Demo
Distribution Code
For this problem, you’ll need to download moneyball.db
, along with several .sql
files in which you’ll write your queries.
Download the distribution code
Log into cs50.dev, click on your terminal window, and execute cd
by itself. You should find that your terminal window’s prompt resembles the below:
$
Next execute
wget https://cdn.cs50.net/sql/2024/x/psets/1/moneyball.zip
in order to download a ZIP called moneyball.zip
into your codespace.
Then execute
unzip moneyball.zip
to create a folder called moneyball
. You no longer need the ZIP file, so you can execute
rm moneyball.zip
and respond with “y” followed by Enter at the prompt to remove the ZIP file you downloaded.
Now type
cd moneyball
followed by Enter to move yourself into (i.e., open) that directory. Your prompt should now resemble the below.
moneyball/ $
If all was successful, you should execute
ls
and see a database named moneyball.db
alongside several .sql
files. Executing sqlite3 moneyball.db
should open the database in sqlite3
, via which you’ll execute SQL queries. If not, retrace your steps and see if you can determine where you went wrong!
Schema
moneyball.db
represents all of Major League Baseball’s players, teams, salaries, and performances up until 2001. In particular, moneyball.db
represents the following entities:
- A player, which includes anyone who’s played Major League Baseball for any amount of time
- A team, which includes all teams, past and present, in Major League Baseball
- A performance, which describes the types of hits a player made for their team in a given year
- A salary, which is the amount of money a team paid one of their players in a given year
These entities are related per the entity relationship (ER) diagram below:
erDiagram
Player ||--|{ Salary : earns
Player ||--|{ "Performance" : generates
Team ||--|{ Salary : "pays"
"Performance" }|--|| Team : with
Within moneyball.db
, you’ll find several tables that implement the relationships described in the ER diagram above. Click the drop-downs below to learn more about the schema of each individual table.
players
table
The players
table contains the following columns:
id
, which is the ID of the playerfirst_name
, which is the first name of the playerlast_name
, which is the last name of the playerbats
, which is the side (“R” for right or “L” for left) the player bats onthrows
, which is the hand (“R” for right or “L” for left) the player throws withweight
, which is the player’s weight in poundsheight
, which is the player’s height in inchesdebut
, which is the date (expressed asYYYY-MM-DD
) the player began their career in the MLBfinal_game
, which is the date (expressed asYYYY-MM-DD
) the player played their last game in the MLBbirth_year
, which is the year the player was bornbirth_month
, which is the month (expressed as an integer) the player was bornbirth_day
, which is the day the player was bornbirth_city
, which is the city in which the player was bornbirth_state
, which is the state in which the player was bornbirth_country
, which is the country in which the player was born
teams
table
The teams
table contains the following columns:
id
, which is the ID of each teamyear
, which is the year the team was foundedname
, which is the name of the teampark
, which is name of the park at which the team plays (or played)
performances
table
The performances
table contains the following columns:
id
, which is the ID of the performanceplayer_id
, which is the ID of the player who generated the performanceteam_id
, which is the ID of the team for which the player generated the performanceyear
, which is the year in which the player generated the performanceG
, which is the number of games played by the player, for the given team, in the given yearAB
, which is the player’s number of “at bats” (i.e., times they went up to bat), for the given team, in the given yearH
, which is the player’s number of hits, for the given team, in the given year2B
, which is the player’s number of doubles (two-base hits), for the given team, in the given year3B
, which is the player’s number of triples (three-base hits), for the given team, in the given yearHR
, which is the player’s number of home runs, for the given team, in the given yearRBI
, which is the player’s number of “runs batted in” (i.e., runs scored), for the given team, in the given yearSB
, which is the player’s number of stolen bases, for the given team, in the given year
salaries
table
The salaries
table contains the following columns:
id
, which is the ID of the salaryplayer_id
, which is the ID of the player earning the salaryteam_id
, which is the ID of the team paying the salaryyear
, which is the year during which the salary was paidsalary
, which is the salary itself in US dollars (not adjusted for inflation)
Specification
1.sql
You should start by getting a sense for how average player salaries have changed over time. In 1.sql
, write a SQL query to find the average player salary by year.
- Sort by year in descending order.
- Round the salary to two decimal places and call the column “average salary”.
- Your query should return a table with two columns, one for year and one for average salary.
2.sql
Your general manager (i.e., the person who makes decisions about player contracts) asks you whether the team should trade a current player for Cal Ripken Jr., a star player who’s likely nearing his retirement. In 2.sql
, write a SQL query to find Cal Ripken Jr.’s salary history.
- Sort by year in descending order.
- Your query should return a table with two columns, one for year and one for salary.
3.sql
Your team is going to need a great home run hitter. Ken Griffey Jr., a long-time Silver Slugger and Gold Glove award winner, might be a good prospect. In 3.sql
, write a SQL query to find Ken Griffey Jr.’s home run history.
- Sort by year in descending order.
- Note that there may be two players with the name “Ken Griffey.” This Ken Griffey was born in 1969.
- Your query should return a table with two columns, one for year and one for home runs.
4.sql
You need to make a recommendation about which players the team should consider hiring. With the team’s dwindling budget, the general manager wants to know which players were paid the lowest salaries in 2001. In 4.sql
, write a SQL query to find the 50 players paid the least in 2001.
- Sort players by salary, lowest to highest.
- If two players have the same salary, sort alphabetically by first name and then by last name.
- If two players have the same first and last name, sort by player ID.
- Your query should return three columns, one for players’ first names, one for their last names, and one for their salaries.
5.sql
It’s a bit of a slow day in the office. Though Satchel no longer plays, in 5.sql
, write a SQL query to find all teams that Satchel Paige played for.
- Your query should return a table with a single column, one for the name of the teams.
6.sql
Which teams might be the biggest competition for the A’s this year? In 6.sql
, write a SQL query to return the top 5 teams, sorted by the total number of hits by players in 2001.
- Call the column representing total hits by players in 2001 “total hits”.
- Sort by total hits, highest to lowest.
- Your query should return two columns, one for the teams’ names and one for their total hits in 2001.
7.sql
You need to make a recommendation about which player (or players) to avoid recruiting. In 7.sql
, write a SQL query to find the name of the player who’s been paid the highest salary, of all time, in Major League Baseball.
- Your query should return a table with two columns, one for the player’s first name and one for their last name.
8.sql
How much would the A’s need to pay to get the best home run hitter this past season? In 8.sql
, write a SQL query to find the 2001 salary of the player who hit the most home runs in 2001.
- Your query should return a table with one column, the salary of the player.
9.sql
What salaries are other teams paying? In 9.sql
, write a SQL query to find the 5 lowest paying teams (by average salary) in 2001.
- Round the average salary column to two decimal places and call it “average salary”.
- Sort the teams by average salary, least to greatest.
- Your query should return a table with two columns, one for the teams’ names and one for their average salary.
10.sql
The general manager has asked you for a report which details each player’s name, their salary for each year they’ve been playing, and their number of home runs for each year they’ve been playing. To be precise, the table should include:
- All player’s first names
- All player’s last names
- All player’s salaries
- All player’s home runs
- The year in which the player was paid that salary and hit those home runs
In 10.sql
, write a query to return just such a table.
- Your query should return a table with five columns, per the above.
- Order the results, first and foremost, by player’s IDs (least to greatest).
- Order rows about the same player by year, in descending order.
- Consider a corner case: suppose a player has multiple salaries or performances for a given year. Order them first by number of home runs, in descending order, followed by salary, in descending order.
- Be careful to ensure that, for a single row, the salary’s year and the performance’s year match.
Example table
To help you visualize what the general manager would like, they gave you an example table:
+------------+-----------+--------+------+----+
| first_name | last_name | salary | year | HR |
+------------+-----------+--------+------+----+
| Don | Aase | 400000 | 1989 | 0 |
| Don | Aase | 675000 | 1988 | 0 |
| Don | Aase | 625000 | 1987 | 0 |
| Don | Aase | 600000 | 1986 | 0 |
| Jeff | Abbott | 300000 | 2001 | 0 |
| Jeff | Abbott | 255000 | 2000 | 3 |
| Jeff | Abbott | 255000 | 1999 | 2 |
+------------+-----------+--------+------+----+
If all goes well, you might also see two rows like this in your final table:
+-------------+---------------+----------+----+------+
| first_name | last_name | salary | HR | year |
+-------------+---------------+----------+----+------+
| Todd | Zeile | 3700000 | 9 | 1995 |
| Todd | Zeile | 3700000 | 5 | 1995 |
+-------------+---------------+----------+----+------+
As an aside, based on the schema of the database, why do you think Todd Zeile appears to have two different salaries (and two different HR counts) for the same year?
11.sql
You need a player that can get hits. Who might be the most underrated? In 11.sql
, write a SQL query to find the 10 least expensive players per hit in 2001.
- Your query should return a table with three columns, one for the players’ first names, one of their last names, and one called “dollars per hit”.
- You can calculate the “dollars per hit” column by dividing a player’s 2001 salary by the number of hits they made in 2001. Recall you can use
AS
to rename a column. - Dividing a salary by 0 hits will result in a
NULL
value. Avoid the issue by filtering out players with 0 hits. - Sort the table by the “dollars per hit” column, least to most expensive. If two players have the same “dollars per hit”, order by first name, followed by last name, in alphabetical order.
- As in
10.sql
, ensure that the salary’s year and the performance’s year match. - You may assume, for simplicity, that a player will only have one salary and one performance in 2001.
12.sql
Hits are great, but so are RBIs! In 12.sql
, write a SQL query to find the players among the 10 least expensive players per hit and among the 10 least expensive players per RBI in 2001.
- Your query should return a table with two columns, one for the players’ first names and one of their last names.
- You can calculate a player’s salary per RBI by dividing their 2001 salary by their number of RBIs in 2001.
- You may assume, for simplicity, that a player will only have one salary and one performance in 2001.
- Order your results by player ID, least to greatest (or alphabetically by last name, as both are the same in this case!).
- Keep in mind the lessons you’ve learned in
10.sql
and11.sql
!
Usage
To test your queries as you write them in your .sql
files, you can query the database by running
.read FILENAME
where FILENAME
is the name of the file containing your SQL query. For example,
.read 1.sql
You can also run
$ cat FILENAME | sqlite3 moneyball.db > output.txt
to redirect the output of the query to a text file called output.txt
. (This can be useful for checking how many rows are returned by your query!)
How to Test
While check50
is available for this problem, you’re encouraged to instead test your code on your own for each of the following. If you’re using the moneyball.db
database provided in this problem’s distribution, you should find that…
- Executing
1.sql
results in a table with 2 columns and 17 rows. - Executing
2.sql
results in a table with 2 columns and 17 rows. - Executing
3.sql
results in a table with 2 columns and 13 rows. - Executing
4.sql
results in a table with 3 columns and 50 rows. - Executing
5.sql
results in a table with 1 column and 3 rows. - Executing
6.sql
results in a table with 2 columns and 5 rows. - Executing
7.sql
results in a table with 2 columns and 1 row. - Executing
8.sql
results in a table with 1 column and 1 row. - Executing
9.sql
results in a table with 2 columns and 5 rows. - Executing
10.sql
results in a table with 5 columns and 14,915 rows. - Executing
11.sql
results in a table with 3 columns and 10 rows. - Executing
12.sql
results in a table with 2 columns and 6 rows.
Note that row counts do not include header rows that only show column names.
Correctness
For performance’s sake, check50
uses a smaller database than you’ve downloaded. For that reason, results may not match between the two databases. If your query runs correctly on the database you’ve downloaded, it will also run correctly on check50
’s database! If your query runs incorrectly on check50
’s database, it’s also running incorrectly on your own database.
check50 cs50/problems/2024/sql/moneyball
Is check50
running into an error?
Be sure that your queries are returning the correct number of rows per the above. If your query for 10.sql
in particular is returning a different number of rows, it may cause check50
to time out and produce an error. You can temporarily comment out your lines in 10.sql
in order to let check50
test all of the others.
How to Submit
In your terminal, execute the below to submit your work.
submit50 cs50/problems/2024/sql/moneyball
Acknowledgements
Narrative adapted from the book by Michael Lewis, Moneyball: The Art of Winning an Unfair Game. Data adapted from the Lahman Baseball Database, licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.