- Dec 14, 2004
2 bits would be all you needed as raw (uncompressed). You have 4 possible answers. A, T, C, G. If A = 00, T = 01, C = 10, and G = 11, 2 bits. 4 pairs per byte. Remember that you do not need to store the pair. You only need one side and the sequence.
Originally posted by: spidey07
one byte would equal one pair. Not letters.
Even then you could fit multiple pairs in a single byte. You can do the math: one byte = 256 combinations = 4^4, so with 4 possible values per position you can fit 4 of them in a byte.
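To make the 2-bits-per-base idea concrete, here is a minimal Python sketch. It uses the mapping from the post above (A = 00, T = 01, C = 10, G = 11). The names `pack_bases` and `unpack_bases` are made up for illustration, and it assumes the input contains only A, T, C and G. Real sequence files may contain other symbols (such as N for an unknown base), which this does not handle.

```python
# Mapping from the post above: A=00, T=01, C=10, G=11.
CODES = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASES = "ATCG"  # position in this string = the 2-bit code


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/T/C/G into bytes, 4 bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(data: bytes, length: int) -> str:
    """Reverse pack_bases; `length` is needed to drop padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

Note that the sequence length has to be stored separately, since the last byte may be padded.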
Originally posted by: eLiu
I assume they don't put these into a straight text file. For one thing, a one-byte character can take 256 different values (plain ASCII only defines 128 of them). Well, in DNA (let's count RNA too for kicks), there are only 5 letters: A, C, G, T, and U.
And then sequences of 3 bases (codons) encode single amino acids (I don't remember much biology), and there are 20 standard ones.
So yeah there's tons of room for compression just by shrinking it down this way. Not to mention actual compression algorithms that can take these alphabets and assign them smaller codewords.
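As a rough illustration of how much room that leaves, the sketch below computes how many bits a fixed-width code needs per symbol for the alphabet sizes mentioned above. The function name `bits_per_symbol` is hypothetical, and this covers only fixed-width codes, not the variable-length codewords a real compression algorithm would assign.

```python
def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest fixed width (in bits) that can distinguish `alphabet_size` symbols."""
    if alphabet_size < 1:
        raise ValueError("alphabet_size must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (alphabet_size - 1).bit_length()


# Alphabet sizes taken from the thread: 4 DNA letters, 5 with RNA's U,
# 20 amino acids, and the 256 values of a full byte for comparison.
for name, size in [("DNA", 4), ("DNA + RNA", 5), ("amino acids", 20), ("byte", 256)]:
    print(f"{name}: {bits_per_symbol(size)} bits per symbol")
```

So a one-byte-per-letter text file spends 8 bits where 2 or 3 would do for bases.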
:thumbsup:
Originally posted by: gsellis
2 bits would be all you needed as raw (uncompressed). You have 4 possible answers. A, T, C, G. If A = 00, T = 01, C = 10, and G = 11, 2 bits. 4 pairs per byte. Remember that you do not need to store the pair. You only need one side and the sequence.
Originally posted by: Gigantopithecus
I have the sequences of the chromosomes downloaded from Project Gutenberg, but the files only total 400 MB, which I thought was way too small.
Originally posted by: JHutch
Originally posted by: Gigantopithecus
I have the sequences of the chromosomes downloaded from Project Gutenberg, but the files only total 400 MB, which I thought was way too small.
Are you sure you downloaded everything? Chromosome 1 is 273.78 MB alone. Chromosome 2 is 245.99 MB, etc ... Quick glance shows that each chromosome averages about 200MB. Plus there is a separate X and Y sequence. Quite a bit more than 400MB of data there...
See http://www.gutenberg.org/etext/3501 thru 3524 for Gutenberg files.
JHutch
EDIT - Granted these are uncompressed text files. Simple zip compression would make it MUCH smaller.
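For a feel of the difference, here is a small Python sketch comparing one-byte-per-base text, zlib compression (the DEFLATE method zip commonly uses), and plain 2-bit packing. It runs on a synthetic random sequence, not on the Gutenberg files, so the numbers are only indicative: real chromosome data has repeats and other structure and will compress differently. The function name `compare_sizes` and the test sequence length are assumptions for illustration.

```python
import random
import zlib


def compare_sizes(n_bases: int, seed: int = 0) -> dict:
    """Return byte counts for a synthetic A/C/G/T sequence in three encodings."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    text = "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")
    compressed = zlib.compress(text, 9)
    # Sanity check: compression must be lossless.
    assert zlib.decompress(compressed) == text
    return {
        "text": len(text),              # 1 byte per base
        "zlib": len(compressed),        # DEFLATE at maximum compression level
        "two_bit": (n_bases + 3) // 4,  # 4 bases per byte, rounded up
    }


sizes = compare_sizes(100_000)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```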