The institute at http://www.tigr.org/ runs a set of gene-sequencing applications that analyze large amounts of data from a DNA sample. Under a procedure pioneered by the institute, the sample is fractured into many small parts so that each bite-sized chunk can be identified.
"The bits need to be put back together to identify the entire gene. That's the essence of the computational problem," says Vadim Sapiro, IT director at the Rockville, Md., institute. On the institute's aging servers, whose origins go back to the Digital Equipment Corp.'s Alpha architecture, "it would sometimes take months to babysit one assembly to completion."
By finding the parts that share some precise nucleotide overlap, researchers can slowly build out the sequence of nucleotides in the gene until they've mapped its complete, unique structure. It's like matching up the sequence 2, 3, 4, 5 with the sequence 3, 4, 5, 6. By finding the match, you've extended the map by one nucleotide.
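The matching idea can be sketched in a few lines of Python. This is only an illustration of the overlap principle described above, not the institute's assembly software; the function names `overlap_length` and `merge_fragments` and the `min_overlap` parameter are made up for the example.

```python
def overlap_length(left, right, min_overlap=1):
    """Return the length of the longest suffix of `left` that equals
    a prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_fragments(left, right, min_overlap=1):
    """Join two fragments on their overlap; return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep all of `left`, then append only the non-overlapping tail of `right`.
    return left + right[size:]


# The article's own example: 2, 3, 4, 5 matched with 3, 4, 5, 6.
merged = merge_fragments([2, 3, 4, 5], [3, 4, 5, 6], min_overlap=3)
print(merged)  # [2, 3, 4, 5, 6]
```

The same functions work on strings of nucleotide letters, so `merge_fragments("ACGTAC", "TACGGA", min_overlap=3)` joins the two reads on their shared `TAC`. A real assembler has to do this across millions of fragments and tolerate read errors, which is where the computing time goes.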
It might sound easy, but the number of possibilities is mind-boggling, Sapiro says. Three billion nucleotides need to be mapped to come up with the composite genome of 20,000-plus human genes. The same sequences are easily found on different parts of a single gene, so additional software needs to sort through the matches, looking for errors