Count on It

Recall that ASCII is just one way to represent characters.

  1. (1 point.) How many total characters can be represented in ASCII, if each character is represented using 7 bits?

If we want to represent more characters than ASCII allows, we can use Unicode, which uses more bits than ASCII to represent some characters. One implementation of Unicode, UTF-8, uses “variable-width encoding” to represent characters: each character is represented by one, two, three, or four bytes.
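To make “variable-width” concrete, here is a minimal sketch that measures how many bytes a few sample characters occupy in UTF-8. The helper encoded_size and the four sample constants are hypothetical names chosen for illustration; they are not part of the program given later in this question. Each sample is written as explicit byte escapes so that the result does not depend on how the source file itself is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not part of the test's starter code): returns how
// many bytes a NUL-terminated string occupies, not counting the terminator.
size_t encoded_size(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

// Sample characters spelled out as their UTF-8 byte sequences.
const char *const LETTER_A = "\x41";             // U+0041, LATIN CAPITAL LETTER A
const char *const E_ACUTE = "\xC3\xA9";          // U+00E9, LATIN SMALL LETTER E WITH ACUTE
const char *const EURO_SIGN = "\xE2\x82\xAC";    // U+20AC, EURO SIGN
const char *const GRINNING = "\xF0\x9F\x98\x80"; // U+1F600, GRINNING FACE
```

Calling encoded_size on LETTER_A, E_ACUTE, EURO_SIGN, and GRINNING yields 1, 2, 3, and 4 respectively, even though each string holds exactly one character. This is why counting bytes and counting characters are different tasks for a UTF-8 file.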

Read up on UTF-8 at fileformat.info/info/unicode/utf8.htm.

  1. (2 points.) When reading, as via fread, a text file encoded as UTF-8, how can you determine how many bytes a character will take?

  2. (2 points.) If you’re reading a file encoded as UTF-8, and you read a byte, how can you determine if that byte is a continuation of an existing character, rather than the beginning of a new character?

  3. (4 points.) The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8.

#include <stdbool.h>
#include <stdio.h>
typedef unsigned char BYTE;
int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }
    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;
    while (true)
    {
        BYTE b;
        // Stop at end of file or on a read error
        if (fread(&b, 1, 1, file) != 1)
        {
            break;
        }
        count++;
    }
    fclose(file);
    printf("Number of characters: %i\n", count);
}

In addition to submitting this subquestion using the instructions contained in the “How to Submit” section of the Test, you must also:

  • Write your program in a file called count.c
  • Submit count.c by running submit50 cs50/problems/2020/summer/test/count.