The best way to archive data might not be to preserve it electronically, but to store it in DNA. This idea occurred to Nick Goldman and Ewan Birney of the European Bioinformatics Institute when they were deciding what to do with the large amount of data their research generates. As the amount of data to be archived increases, the hard-drive capacity needed to hold it grows as well, and so does the cost of storage. Faced with this problem, Goldman and Birney suggested, in research published in Nature, that the easiest way to store the data might be to encode it in strands of DNA.
DNA seems like a natural choice because it is already nature's own storage mechanism: it holds the genetic information of every living organism on Earth. In fact, scientists have been trying to store information in DNA for some time. However, most previous attempts were prone to errors when converting the information from DNA back into readable data.
Goldman and Birney hoped to solve this problem by changing the way data is translated into DNA. Most earlier schemes used a binary encoding. DNA holds information in the form of four chemical bases: adenine (A), thymine (T), cytosine (C), and guanine (G). In a typical binary scheme, two of the bases represent a 0 and the other two represent a 1. For example, adenine and cytosine might represent a 0, and guanine and thymine a 1. However, this can lead to the same base being repeated several times in a row when the code is written out as DNA by a DNA-synthesis machine, and such repeats cause errors when the information is translated back into binary.
The solution to this problem might be a ternary code, a system of 0s, 1s, and 2s, instead of a binary one. The coding scheme is trickier than most binary systems. Instead of each base standing for a fixed number, each ternary digit is determined by the previous base together with the current base. For instance, if the previous base was a G, the digit is 0 if the current base is T, 1 if it is A, and 2 if it is C. There is a rule for every combination of previous base and digit, and the current base always differs from the previous one, so a sequence of identical digits is never encoded as a sequence of identical bases.
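The rule described above can be sketched in a few lines of Python. This is only an illustration: the row for a previous G comes from the text, while the other three rows of the table, the starting base, and the function names are assumptions made for the example.

```python
# For each previous base, the base chosen for ternary digit 0, 1 or 2.
# Only the "G" row is stated in the article; the other rows are assumed.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def encode_trits(trits, start="A"):
    """Turn a list of ternary digits into bases; `start` is an assumed seed base."""
    prev = start
    bases = []
    for trit in trits:
        prev = NEXT_BASE[prev][trit]  # never equal to the previous base
        bases.append(prev)
    return "".join(bases)


def decode_bases(bases, start="A"):
    """Recover the ternary digits by looking at each base and the one before it."""
    prev = start
    trits = []
    for base in bases:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    return trits
```

Under these assumed rules, `encode_trits([0, 0, 0, 0])` gives `CGTA`: four identical digits, but no base appears twice in a row.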
When it came time to actually generate DNA strands, however, there was another problem: DNA-synthesis machines cannot generate one long strand of DNA for every file, which would have been the simplest approach. Instead, the scientists divided each file into pieces that were each 117 bases long. Of those, 100 bases stored information, and the other 17 recorded where in the file the piece belongs.
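A rough sketch of this splitting step is shown below. The article does not say how the 17 index bases are laid out, so the format used here (the piece number written as 17 ternary digits with an assumed no-repeat table) is hypothetical, as are all the names.

```python
# Assumed no-repeat table: previous base -> base for ternary digit 0, 1, 2.
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
PAYLOAD_LEN = 100  # information-carrying bases per piece (from the article)
INDEX_LEN = 17     # bases that say where the piece belongs (from the article)


def to_trits(number, width):
    """Write `number` as `width` ternary digits, most significant first."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, 3)
        digits.append(remainder)
    return digits[::-1]


def encode_index(position, prev_base):
    """Hypothetical index format: the piece number as 17 no-repeat bases."""
    bases = []
    for trit in to_trits(position, INDEX_LEN):
        prev_base = NEXT_BASE[prev_base][trit]
        bases.append(prev_base)
    return "".join(bases)


def split_into_pieces(dna):
    """Cut a long base string into 117-base pieces (100 payload + 17 index)."""
    pieces = []
    for position, start in enumerate(range(0, len(dna), PAYLOAD_LEN)):
        payload = dna[start:start + PAYLOAD_LEN]
        pieces.append(payload + encode_index(position, payload[-1]))
    return pieces


def reassemble(pieces):
    """Read each piece's index, sort by it, and join the payloads."""
    def position_of(piece):
        payload, index = piece[:-INDEX_LEN], piece[-INDEX_LEN:]
        prev, number = payload[-1], 0
        for base in index:
            number = number * 3 + NEXT_BASE[prev].index(base)
            prev = base
        return number

    return "".join(p[:-INDEX_LEN] for p in sorted(pieces, key=position_of))
```

Because every piece carries its own position, the pieces can be read back in any order and still be put together correctly.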
Another way the researchers guarded against copying errors was to make the pieces overlap, so that each 25-base stretch of the data appears in four different 100-base segments. Then, if any copying errors did occur, the four copies of the same information could be compared, and the reading supported by the majority of the copies would be taken as the correct one.
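The majority vote can be illustrated with a sliding window. The sketch below assumes that each 100-base segment starts 25 bases after the previous one and that every segment's starting position is already known; both are assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter

SEGMENT_LEN = 100  # bases per segment
STEP = 25          # assumed offset between the starts of neighbouring segments


def overlapping_segments(dna):
    """Return (start, segment) pairs; interior bases fall in four segments."""
    return [
        (start, dna[start:start + SEGMENT_LEN])
        for start in range(0, len(dna) - SEGMENT_LEN + 1, STEP)
    ]


def majority_consensus(segments, total_len):
    """Rebuild the sequence by letting the copies of each position vote."""
    votes = [Counter() for _ in range(total_len)]
    for start, segment in segments:
        for offset, base in enumerate(segment):
            votes[start + offset][base] += 1
    # Assumes every position is covered by at least one segment.
    return "".join(v.most_common(1)[0][0] for v in votes)
```

In this simplified layout the first and last 75 bases are covered by fewer than four segments, so a single bad copy is only guaranteed to be outvoted in the interior of the sequence.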
The researchers first tested their idea on five different computer files, including a recording of part of Martin Luther King's “I have a dream” speech and the 1953 paper by Francis Crick and James Watson about the structure of DNA. A small problem with the machines led to two segments of 25 bases going missing. Everything else was converted to readable data flawlessly.
A benefit of this method is the amount of data DNA can store. The researchers have already set a new record for DNA storage by storing 739.3 kilobytes of unique information. However, they estimate that DNA could store far more. At a density of 2.2 petabytes per gram, where a petabyte is 10^15 bytes, it should be able to hold all the digital data that currently exists in the world, about 3 zettabytes, or 3 billion trillion bytes.
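To get a feel for what that density means, the two figures quoted above can be combined in a quick back-of-the-envelope calculation. The variable and function names are only illustrative, and the result ignores any practical overhead.

```python
WORLD_DATA_BYTES = 3 * 10**21          # 3 zettabytes, as quoted above
DENSITY_BYTES_PER_GRAM = 2.2 * 10**15  # 2.2 petabytes per gram, as quoted above


def grams_needed(num_bytes):
    """Mass of DNA needed to hold `num_bytes` at the quoted density."""
    return num_bytes / DENSITY_BYTES_PER_GRAM


grams = grams_needed(WORLD_DATA_BYTES)
tonnes = grams / 1_000_000  # one tonne is a million grams
print(f"{grams:,.0f} g, or about {tonnes:.2f} tonnes")
```

Taken at face value, the two numbers imply that all of the world's digital data would fit in roughly 1.4 tonnes of DNA.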
One drawback, however, is the time it takes to read data back from DNA: it took Goldman and his team two weeks. This might be less of a problem for archives that are rarely accessed, such as the one CERN keeps for data collected by the Large Hadron Collider. Another downside is cost: storing one megabyte costs about $12,400, significantly more than storing the data digitally. However, digital storage technology can degrade or become outmoded, while DNA can be preserved for tens of thousands of years and is unlikely ever to become obsolete as a format.