Less is More
Recall that, in Problem Set 6, you wrote code to analyze sequences of DNA. Those sequences of DNA were stored in text files that might resemble the below.
AAAAAAATTTTTAAAGGGGCCCCCCAAACCCC
Some of those text files, though, were quite large. And we gave you quite a few such files!
It turns out we could have compressed those files by representing those sequences differently. Rather than store one character per nucleotide (A
, C
, G
, or T
), we could have stored one character (plus a number) per “run” of nucleotides, whereby a run is a sequence of identical values. For instance,
A
could instead be encoded asA1
,TTT
could instead be encoded asT3
, andCCCCCCCCCC
could instead be encoded asC10
.
This type of compression is known as “run-length encoding,” wherein runs of values are encoded by storing the number of repetitions.
Our original sequence of 32 nucleotides, AAAAAAATTTTTAAAGGGGCCCCCCAAACCCC
, could thus be encoded A7T5A3G4C6A3C4
with only 14 characters, thereby decreasing the length of our sequence (and number of bytes) by more than half!
-
(1 point.) Consider the DNA sequence below.
TTTAAAACCGAAA
How could that sequence instead be encoded using run-length encoding?
While run-length encoding will result in smaller file sizes for some DNA sequences, for other sequences this algorithm might actually increase the size of the file.
- (2 points.) For what types of sequences would this algorithm likely increase the file size, rather than decrease it? Include in your answer an example of a DNA sequence for which the “compressed” version actually requires more characters than the original.
It turns out that run-length encoding can be used to compress images as well. Recall that (24-bit) BMP files can be thought of as a sequence of rows, where each row consists of pixels, and each pixel is represented using 1 byte for an amount of red, 1 byte for an amount of green, and 1 byte for an amount of blue.
Consider what might happen if, instead of storing every pixel, we were to compress images using run-length encoding, where runs of the same pixel within a row were stored as a pixel, followed by a count.
- (2 points.) Imagine what might happen if we compressed BMPs of the flags of Romania and Germany, below, and compressed each using run-length encoding for each row of pixels. Assuming both flags have the same resolution (i.e., rows and columns of pixels), which image could be compressed more (i.e., take up less space once compressed)? Why?
Let’s now revisit our encoding of DNA and consider a different method altogether for storing a DNA sequence using fewer bytes. Recall that Problem Set 6’s files were stored using ASCII, where each character in the text file took up 1 byte (8 bits) of space. According to ASCII,
A
is represented as01000001
(which is 65 in decimal),C
is represented as01000011
(which is 67 in decimal),G
is represented as01000111
(which is 71 in decimal), andT
is represented as01010100
(which is 84 in decimal).
An ASCII encoding is useful if our text files will contain many different characters: uppercase letters, lowercase letters, numbers, punctuation, and/or the like. But if the only characters we need our file to store are A
, C
, G
, and T
, then we could use fewer bits to represent each nucleotide.
Consider the following encoding instead:
A
is represented as0
C
is represented as1
G
is represented as10
T
is represented as11
Thus, the sequence AGCC
would be represented as 01011
. Only 5 bits, instead of the original 32!
- (1 point.) Using this new encoding, how would you represent the sequence
CCCAGTTA
in binary? - (1 point.) Using this new encoding, how would you represent the sequence
TGCATCCA
in binary? - (2 points.) What problem does this encoding thus have? Propose another encoding for DNA nucleotides that fixes that problem, while still using fewer bits than storing nucleotides as ASCII characters.