Count on It
Recall that ASCII is just one way to represent characters.
- (1 point.) How many total characters can be represented in ASCII, if each character is represented using 7 bits?
If we want to represent more characters than ASCII allows, we can use Unicode, which uses more bits than ASCII to represent some characters. One implementation of Unicode, UTF-8, uses "variable-width encoding" to represent characters: each character is represented by one, two, three, or four bytes.
Read up on UTF-8 at fileformat.info/info/unicode/utf8.htm.
- (2 points.) When reading, as via fread, a text file encoded as UTF-8, how can you determine how many bytes a character will take?
- (2 points.) If you're reading a file encoded as UTF-8, and you read a byte, how can you determine if that byte is a continuation of an existing character, rather than the beginning of a new character?
- (4 points.) The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8.
#include <stdbool.h>
#include <stdio.h>

typedef unsigned char BYTE;

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }

    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }

    int count = 0;
    while (true)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);
}
In addition to submitting this subquestion using the instructions contained in the "How to Submit" section of the Test, you must also:
- Write your program in a file called count.c
- Submit count.c by running submit50 cs50/problems/2020/summer/test/count.