Speller
Problem to Solve
For this problem, youâll implement a program that spell-checks a file, a la the below, using a hash table.
Demo
Distribution Code
For this problem, youâll extend the functionality of code provided to you by CS50âs staff.
Download the distribution code
Log into cs50.dev, click on your terminal window, and execute cd
by itself. You should find that your terminal windowâs prompt resembles the below:
$
Next execute
wget https://cdn.cs50.net/2022/fall/psets/5/speller.zip
in order to download a ZIP called speller.zip
into your codespace.
Then execute
unzip speller.zip
to create a folder called speller
. You no longer need the ZIP file, so you can execute
rm speller.zip
and respond with âyâ followed by Enter at the prompt to remove the ZIP file you downloaded.
Now type
cd speller
followed by Enter to move yourself into (i.e., open) that directory. Your prompt should now resemble the below.
speller/ $
Execute ls
by itself, and you should see a few files and folders:
dictionaries/ dictionary.c dictionary.h keys/ Makefile speller.c speller50 texts/
If you run into any trouble, follow these same steps again and see if you can determine where you went wrong!
Background
Given the many files in this program, itâs important to read this section in its entirety before starting. Youâll then know what to do and how to do it!
Theoretically, on input of size n, an algorithm with a running time of n is âasymptotically equivalent,â in terms of O, to an algorithm with a running time of 2n. Indeed, when describing the running time of an algorithm, we typically focus on the dominant (i.e., most impactful) term (i.e., n in this case, since n could be much larger than 2). In the real world, though, the fact of the matter is that 2n feels twice as slow as n.
The challenge ahead of you is to implement the fastest spell checker you can! By âfastest,â though, weâre talking actual âwall-clock,â not asymptotic, time.
In speller.c
, weâve put together a program thatâs designed to spell-check a file after loading a dictionary of words from disk into memory. That dictionary, meanwhile, is implemented in a file called dictionary.c
. (It could just be implemented in speller.c
, but as programs get more complex, itâs often convenient to break them into multiple files.) The prototypes for the functions therein, meanwhile, are defined not in dictionary.c
itself but in dictionary.h
instead. That way, both speller.c
and dictionary.c
can #include
the file. Unfortunately, we didnât quite get around to implementing the loading part. Or the checking part. Both (and a bit more) we leave to you! But first, a tour.
Understanding
dictionary.h
Open up dictionary.h
, and youâll see some new syntax, including a few lines that mention DICTIONARY_H
. No need to worry about those, but, if curious, those lines just ensure that, even though dictionary.c
and speller.c
(which youâll see in a moment) #include
this file, clang
will only compile it once.
Next notice how we #include
a file called stdbool.h
. Thatâs the file in which bool
itself is defined. Youâve not needed it before, since the CS50 Library used to #include
that for you.
Also notice our use of #define
, a âpreprocessor directiveâ that defines a âconstantâ called LENGTH
that has a value of 45
. Itâs a constant in the sense that you canât (accidentally) change it in your own code. In fact, clang
will replace any mentions of LENGTH
in your own code with, literally, 45
. In other words, itâs not a variable, just a find-and-replace trick.
Finally, notice the prototypes for five functions: check
, hash
, load
, size
, and unload
. Notice how three of those take a pointer as an argument, per the *
:
bool check(const char *word);
unsigned int hash(const char *word);
bool load(const char *dictionary);
Recall that char *
is what we used to call string
. So those three prototypes are essentially just:
bool check(const string word);
unsigned int hash(const string word);
bool load(const string dictionary);
And const
, meanwhile, just says that those strings, when passed in as arguments, must remain constant; you wonât be able to change them, accidentally or otherwise!
dictionary.c
Now open up dictionary.c
. Notice how, atop the file, weâve defined a struct
called node
that represents a node in a hash table. And weâve declared a global pointer array, table
, which will (soon) represent the hash table you will use to keep track of words in the dictionary. The array contains N
node pointers, and weâve set N
equal to 26
for now, to match with the default hash
function as described below. You will likely want to increase this depending on your own implementation of hash
.
Next, notice that weâve implemented load
, check
, size
, and unload
, but only barely, just enough for the code to compile. Notice too that weâve implemented hash
with a sample algorithm based on the first letter of the word. Your job, ultimately, is to re-implement those functions as cleverly as possible so that this spell checker works as advertised. And fast!
speller.c
Okay, next open up speller.c
and spend some time looking over the code and comments therein. You wonât need to change anything in this file, and you donât need to understand its entirety, but do try to get a sense of its functionality nonetheless. Notice how, by way of a function called getrusage
, weâll be âbenchmarkingâ (i.e., timing the execution of) your implementations of check
, load
, size
, and unload
. Also notice how we go about passing check
, word by word, the contents of some file to be spell-checked. Ultimately, we report each misspelling in that file along with a bunch of statistics.
Notice, incidentally, that we have defined the usage of speller
to be
Usage: speller [dictionary] text
where dictionary
is assumed to be a file containing a list of lowercase words, one per line, and text
is a file to be spell-checked. As the brackets suggest, provision of dictionary
is optional; if this argument is omitted, speller
will use dictionaries/large
by default. In other words, running
./speller text
will be equivalent to running
./speller dictionaries/large text
where text
is the file you wish to spell-check. Suffice it to say, the former is easier to type! (Of course, speller
will not be able to load any dictionaries until you implement load
in dictionary.c
! Until then, youâll see Could not load
.)
Within the default dictionary, mind you, are 143,091 words, all of which must be loaded into memory! In fact, take a peek at that file to get a sense of its structure and size. Notice that every word in that file appears in lowercase (even, for simplicity, proper nouns and acronyms). From top to bottom, the file is sorted lexicographically, with only one word per line (each of which ends with \n
). No word is longer than 45 characters, and no word appears more than once. During development, you may find it helpful to provide speller
with a dictionary
of your own that contains far fewer words, lest you struggle to debug an otherwise enormous structure in memory. In dictionaries/small
is one such dictionary. To use it, execute
./speller dictionaries/small text
where text
is the file you wish to spell-check. Donât move on until youâre sure you understand how speller
itself works!
Odds are, you didnât spend enough time looking over speller.c
. Go back one square and walk yourself through it again!
texts/
So that you can test your implementation of speller
, weâve also provided you with a whole bunch of texts, among them the script from La La Land, the text of the Affordable Care Act, three million bytes from Tolstoy, some excerpts from The Federalist Papers and Shakespeare, and more. So that you know what to expect, open and skim each of those files, all of which are in a directory called texts
within your pset5
directory.
Now, as you should know from having read over speller.c
carefully, the output of speller
, if executed with, say,
./speller texts/lalaland.txt
will eventually resemble the below.
Belowâs some of the output youâll see. For informationâs sake, weâve excerpted some examples of âmisspellings.â And lest we spoil the fun, weâve omitted our own statistics for now.
MISSPELLED WORDS
[...]
AHHHHHHHHHHHHHHHHHHHHHHHHHHHT
[...]
Shangri
[...]
fianc
[...]
Sebastian's
[...]
WORDS MISSPELLED:
WORDS IN DICTIONARY:
WORDS IN TEXT:
TIME IN load:
TIME IN check:
TIME IN size:
TIME IN unload:
TIME IN TOTAL:
TIME IN load
represents the number of seconds that speller
spends executing your implementation of load
. TIME IN check
represents the number of seconds that speller
spends, in total, executing your implementation of check
. TIME IN size
represents the number of seconds that speller
spends executing your implementation of size
. TIME IN unload
represents the number of seconds that speller
spends executing your implementation of unload
. TIME IN TOTAL
is the sum of those four measurements.
Note that these times may vary somewhat across executions of speller
, depending on what else your codespace is doing, even if you donât change your code.
Incidentally, to be clear, by âmisspelledâ we simply mean that some word is not in the dictionary
provided.
Makefile
And, lastly, recall that make
automates compilation of your code so that you donât have to execute clang
manually along with a whole bunch of switches. However, as your programs grow in size, make
wonât be able to infer from context anymore how to compile your code; youâll need to start telling make
how to compile your program, particularly when they involve multiple source (i.e., .c
) files, as in the case of this problem. And so weâll utilize a Makefile
, a configuration file that tells make
exactly what to do. Open up Makefile
, and you should see four lines:
- The first line tells
make
to execute the subsequent lines whenever you yourself executemake speller
(or justmake
). - The second line tells
make
how to compilespeller.c
into machine code (i.e.,speller.o
). - The third line tells
make
how to compiledictionary.c
into machine code (i.e.,dictionary.o
). - The fourth line tells
make
to linkspeller.o
anddictionary.o
in a file calledspeller
.
Be sure to compile speller
by executing make speller
(or just make
). Executing make dictionary
wonât work!
Specification
Alright, the challenge now before you is to implement, in order, load
, hash
, size
, check
, and unload
as efficiently as possible using a hash table in such a way that TIME IN load
, TIME IN check
, TIME IN size
, and TIME IN unload
are all minimized. To be sure, itâs not obvious what it even means to be minimized, inasmuch as these benchmarks will certainly vary as you feed speller
different values for dictionary
and for text
. But therein lies the challenge, if not the fun, of this problem. This problem is your chance to design. Although we invite you to minimize space, your ultimate enemy is time. But before you dive in, some specifications from us.
- You may not alter
speller.c
orMakefile
. - You may alter
dictionary.c
(and, in fact, must in order to complete the implementations ofload
,hash
,size
,check
, andunload
), but you may not alter the declarations (i.e., prototypes) ofload
,hash
,size
,check
, orunload
. You may, though, add new functions and (local or global) variables todictionary.c
. - You may change the value of
N
indictionary.c
, so that your hash table can have more buckets. - You may alter
dictionary.h
, but you may not alter the declarations ofload
,hash
,size
,check
, orunload
. - Your implementation of
check
must be case-insensitive. In other words, iffoo
is in dictionary, thencheck
should return true given any capitalization thereof; none offoo
,foO
,fOo
,fOO
,fOO
,Foo
,FoO
,FOo
, andFOO
should be considered misspelled. - Capitalization aside, your implementation of
check
should only returntrue
for words actually indictionary
. Beware hard-coding common words (e.g.,the
), lest we pass your implementation adictionary
without those same words. Moreover, the only possessives allowed are those actually indictionary
. In other words, even iffoo
is indictionary
,check
should returnfalse
givenfoo's
iffoo's
is not also indictionary
. - You may assume that any
dictionary
passed to your program will be structured exactly like ours, alphabetically sorted from top to bottom with one word per line, each of which ends with\n
. You may also assume thatdictionary
will contain at least one word, that no word will be longer thanLENGTH
(a constant defined indictionary.h
) characters, that no word will appear more than once, that each word will contain only lowercase alphabetical characters and possibly apostrophes, and that no word will start with an apostrophe. - You may assume that
check
will only be passed words that contain (uppercase or lowercase) alphabetical characters and possibly apostrophes. - Your spell checker may only take
text
and, optionally,dictionary
as input. Although you might be inclined (particularly if among those more comfortable) to âpre-processâ our default dictionary in order to derive an âideal hash functionâ for it, you may not save the output of any such pre-processing to disk in order to load it back into memory on subsequent runs of your spell checker in order to gain an advantage. - Your spell checker must not leak any memory. Be sure to check for leaks with
valgrind
. - The hash function you write should ultimately be your own, not one you search for online.
Alright, ready to go?
- Implement
load
. - Implement
hash
. - Implement
size
. - Implement
check
. - Implement
unload
.
Hints
Implement load
Complete the load
function. load
should load the dictionary into memory (in particular, into a hash table!). load
should return true
if successful and false
otherwise.
Consider that this problem is just composed of smaller problems:
- Open the dictionary file
- Read each word in the file
- Add each word to the hash table
- Close the dictionary file
Write some pseudocode to remind yourself to do just that:
bool load(const char *dictionary)
{
// Open the dictionary file
// Read each word in the file
// Add each word to the hash table
// Close the dictionary file
}
Consider first how to open the dictionary file. fopen
is a natural choice. You can use mode r
, given that you need only read words from the dictionary file (not write or append them).
bool load(const char *dictionary)
{
// Open the dictionary file
FILE *source = fopen(dictionary, "r");
// Read each word in the file
// Add each word to the hash table
// Close the dictionary file
}
Before moving on, you should write code to check whether the file opened correctly. Thatâs up to you! Itâs also best to ensure you close every file you open, so nowâs a good time to write the code to close the dictionary file:
bool load(const char *dictionary)
{
// Open the dictionary file
FILE *source = fopen(dictionary, "r");
// Read each word in the file
// Add each word to the hash table
// Close the dictionary file
fclose(source);
}
What remains is to read each word in the file and to add each word to the hash table. Return true
when the entire operation is successful and false
if it ever fails. Consider following this problemâs walkthrough and continue to break sub-problems into even smaller problems. For example, adding each word to the hash table might only be a matter of implementing a few, even smaller, steps:
- Create space for a new hash table node
- Copy the word into the new node
- Hash the word to obtain its hash value
- Insert the new node into the hash table (using the index specified by its hash value)
Of course, thereâs more one way to approach this problem, each with their own design trade-offs. For that reason, the rest of the code is up to you!
Implement hash
Complete the hash
function. hash
should take a string, word
, as input and return a positive (âunsignedâ) int
.
The hash function given to you returns an int
between 0 and 25, inclusive, based on the first character of word
. However, there are many ways to implement a hash function beyond using the first character (or characters) of a word. Consider a hash function that uses a sum of ASCII values or the length of a word. A good hash function reduces âcollisionsâ and has a (mostly!) even distribution across hash table âbucketsâ.
Implement size
Complete the size
function. size
should return the number of words loaded in the dictionary. Consider two approaches to this problem:
- Count each word as you load it into the dictionary. Return that count when
size
is called. - Each time
size
is called, iterate through the words in the hash table to count them up. Return that count.
Which seems most efficient to you? Whichever you choose, weâll leave the code up to you.
Implement check
Complete the check
function. check
should return true
if a word is located in the dictionary, otherwise false
.
Consider that this problem is also composed of smaller problems. If youâve implemented a hash table, finding a word takes only a few steps:
- Hash the word to obtain its hash value
- Search the hash table at the location specified by the wordâs hash value
- Return
true
if the word is found
- Return
- Return
false
if no word is found
To compare two strings case-insensitively, you may find strcasecmp
(declared in strings.h
) useful! Youâll likely also want to ensure that your hash function is case-insensitive, such that foo
and FOO
have the same hash value.
Implement unload
Complete the unload
function. Be sure to free
in unload
any memory that you allocated in load
!
Recall that valgrind
is your newest best friend. Know that valgrind
watches for leaks while your program is actually running, so be sure to provide command-line arguments if you want valgrind
to analyze speller
while you use a particular dictionary
and/or text, as in the below. Best to use a small text, though, else valgrind
could take quite a while to run.
valgrind ./speller texts/cat.txt
If you run valgrind
without specifying a text
for speller
, your implementations of load
and unload
wonât actually get called (and thus analyzed).
If unsure how to interpret the output of valgrind
, do just ask help50
for help:
help50 valgrind ./speller texts/cat.txt
Walkthroughs
Note that there are 6 videos in this playlist.
How to Test
How to check whether your program is outting the right misspelled words? Well, youâre welcome to consult the âanswer keysâ that are inside of the keys
directory thatâs inside of your speller
directory. For instance, inside of keys/lalaland.txt
are all of the words that your program should think are misspelled.
You could therefore run your program on some text in one window, as with the below.
./speller texts/lalaland.txt
And you could then run the staffâs solution on the same text in another window, as with the below.
./speller50 texts/lalaland.txt
And you could then compare the windows visually side by side. That could get tedious quickly, though. So you might instead want to âredirectâ your programâs output to a file, as with the below.
./speller texts/lalaland.txt > student.txt
./speller50 texts/lalaland.txt > staff.txt
You can then compare both files side by side in the same window with a program like diff
, as with the below.
diff -y student.txt staff.txt
Alternatively, to save time, you could just compare your programâs output (assuming you redirected it to, e.g., student.txt
) against one of the answer keys without running the staffâs solution, as with the below.
diff -y student.txt keys/lalaland.txt
If your programâs output matches the staffâs, diff
will output two columns that should be identical except for, perhaps, the running times at the bottom. If the columns differ, though, youâll see a >
or |
where they differ. For instance, if you see
MISSPELLED WORDS MISSPELLED WORDS
TECHNO TECHNO
L L
> Thelonious
Prius Prius
> MIA
L L
that means your program (whose output is on the left) does not think that Thelonious
or MIA
is misspelled, even though the staffâs output (on the right) does, as is implied by the absence of, say, Thelonious
in the lefthand column and the presence of Thelonious
in the righthand column.
Correctness
check50 cs50/problems/2023/summer/speller
Style
style50 dictionary.c
Staffâs Solution
How to assess just how fast (and correct) your code is? Well, as always, feel free to play with the staffâs solution, as with the below, and compare its numbers against yours.
./speller50 texts/lalaland.txt
How to Submit
Per Step 4 below, after you submit, be sure to check your autograder results. If you see SUBMISSION ERROR: missing files (0.0/1.0)
, it means your file was not named exactly as prescribed (or you uploaded it to the wrong problem).
Correctness in submissions entails everything from reading the specification, writing code that is compliant with it, and submitting files with the correct name. If you see this error, you should resubmit right away, making sure your submission is fully compliant with the specification. The staff will not adjust your filenames for you after the fact!
- Download your
dictionary.c
anddictionary.h
files by control-clicking or right-clicking on the file in your codespaceâs file browser and choosing Download. - Go to CS50âs Gradescope page.
- Click Problem Set 5: Speller.
- Drag and drop your
dictionary.c
anddictionary.h
files to the area that says Drag & Drop. Be sure they have those exact filenames! If you upload a file with a different name, the autograder likely will fail when trying to run it. Ensuring you have uploaded files with the correct filename is your responsibility! - Click Upload.
You should see a message that says âProblem Set 5: Speller submitted successfully!â