A team of scientists led by Nick Goldman of the European Bioinformatics Institute (EBI) has demonstrated a more practical and scalable way to store information in DNA.
Goldman et al. demonstrated their storage technique by encoding a variety of digital files in DNA: all 154 Shakespeare sonnets, Watson and Crick’s 1953 Nature paper as a PDF document (Watson and Crick, 1953), Martin Luther King’s “I Have a Dream” speech as an MP3 file, and a JPEG 2000 image of the EBI.
The EBI team stored 757 kilobytes of information in 337 pg of DNA, an information storage density of ~2.2 petabytes per gram (1 petabyte = 10^15 bytes).
At this density, one could store the ~100 TB held in 2011 by the US National Archives and Records Administration’s Electronic Records Archives in about 0.05 grams of DNA, the 2 petabyte archive of websites from the Internet Archive’s Wayback Machine in ~1 gram of DNA, and CERN’s 80 petabytes of LHC data in ~35 grams of DNA.
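The arithmetic behind these figures is easy to check. The short Python sketch below recomputes the density from the two numbers reported above and then the mass of DNA needed for each archive. It assumes decimal units (1 kB = 1000 bytes, 1 TB = 10^12 bytes), and the function name grams_needed is made up for illustration.

```python
# Figures quoted in the article; decimal units are an assumption.
BYTES_STORED = 757e3      # 757 kilobytes
DNA_MASS_GRAMS = 337e-12  # 337 picograms

# Storage density in bytes per gram of DNA.
DENSITY = BYTES_STORED / DNA_MASS_GRAMS


def grams_needed(n_bytes, density=DENSITY):
    """Mass of DNA (in grams) needed to hold n_bytes at the given density."""
    return n_bytes / density


if __name__ == "__main__":
    print(f"density: {DENSITY / 1e15:.2f} PB per gram")
    archives = {
        "NARA Electronic Records Archives (100 TB)": 100e12,
        "Internet Archive Wayback Machine (2 PB)": 2e15,
        "CERN LHC data (80 PB)": 80e15,
    }
    for name, size in archives.items():
        print(f"{name}: {grams_needed(size):.3f} g")
```

The results come out at roughly 0.045 g, 0.9 g and 36 g, in line with the rounded values quoted above.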
This is not the first time scientists have tried to exploit DNA for information storage. Just last year, a Harvard team led by George Church showed how information can be stored in DNA. However, that and other previous attempts to use DNA for information storage were either not scalable or error-prone.
Encoding Information as non-repetitive DNA
In the paper published in Nature this week, Goldman et al. made a big stride towards making DNA practical for storing information. The first step in storing information in DNA is a scheme to convert the document into a string of DNA. Goldman et al. came up with an encoding procedure that converts each byte of the original information into a short string of trits (base-3 digits: 0, 1 or 2) and then each trit into a nucleotide.
In the case of a text file, each character is one byte in size and is encoded as five nucleotides. For example, in the above figure, the letter “B” in “Birney” is encoded as “TAGTA”.
The cleverness of their method lies in an encoding scheme that converts the base-3 trits to DNA while making sure that the same nucleotide never appears twice in a row. Note that the trits corresponding to “B” are “20100”, with two consecutive zeroes at the end, but the encoded DNA is “TAGTA”, in which no two adjacent nucleotides are the same. Non-repetitive DNA is easier to sequence accurately, which in turn makes it easier to reconstruct the original message. In contrast, the approach of Church et al. simply converted the binary version of the text/information to A/T/G/C, which resulted in many repeated nucleotides and made the approach error-prone.
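The no-repeat rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' published code. It assumes a rotating lookup in which each trit picks one of the three nucleotides that differ from the previous one, with the bases ordered A, C, G, T and an assumed starting "previous" base of A. These assumptions were chosen because they reproduce the example in this article (“20100” becomes “TAGTA”); the function names are hypothetical.

```python
BASES = "ACGT"  # assumed ordering of the four nucleotides


def trits_to_dna(trits, prev="A"):
    """Encode a string of trits ('0', '1', '2') as DNA with no base repeated in a row.

    Each trit selects one of the three bases that differ from the previous
    base, so two adjacent bases can never be equal. The starting value of
    `prev` is an assumption made for this sketch.
    """
    out = []
    for t in trits:
        digit = int(t)
        if digit not in (0, 1, 2):
            raise ValueError(f"not a trit: {t!r}")
        # Step 1, 2 or 3 places forward from the previous base, never 0 places.
        nxt = BASES[(BASES.index(prev) + digit + 1) % 4]
        out.append(nxt)
        prev = nxt
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna; raises ValueError if a base is repeated in a row."""
    out = []
    for base in dna:
        digit = (BASES.index(base) - BASES.index(prev) - 1) % 4
        if digit == 3:  # only happens when base == prev
            raise ValueError("adjacent repeated base; not a valid encoding")
        out.append(str(digit))
        prev = base
    return "".join(out)


if __name__ == "__main__":
    print(trits_to_dna("20100"))  # the trits for "B" quoted in the article
    print(dna_to_trits("TAGTA"))
```

Because the same trit maps to a different base depending on what came before it, a run such as “00” never produces a run of identical nucleotides, and a repeated base in the sequencing output is immediately recognisable as an error.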
Goldman et al. also investigated the practicality of using DNA as a storage option. Their work showed that, with current technology, encoding information as DNA costs about $12,400 per MB and decoding the DNA costs about $220 per MB. The high encoding cost makes this technology viable only for long-term storage. Ewan Birney, a member of the project, said in his blog post:
Using tape storage for comparison, we estimate that it is currently cheaper to store digital information as DNA only if you plan to store the information for a long time (in our model between 600 to 5000 years). If the cost of DNA synthesis falls by two orders of magnitude (which is kind of what happened over the past decade), it will become a sensible way to store data for the medium term (below 50 years).
In the blog post, Birney said that the idea for this project came up during a discussion with Nick Goldman at a bar after a long day (the Gaswerk hotel bar in Hamburg, to be precise), and that he is extremely happy to see the idea come to fruition.
Birney and Goldman do not want to stop here; they want to build an “archive that stored a substantial amount of information for a future civilization to read”. Exciting times ahead.