Assembling Raspberry on Raspberry Pi

A group of scientists has managed to assemble the hefty 400 million base pairs of the raspberry genome on . . . the eponymous Raspberry Pi, a mini-computer with only 512 MB of RAM. The feat was meant to demonstrate the dramatically low memory footprint of a new generation of software designed by GenScale, a bioinformatics project team at the Inria research center in Rennes, Brittany, France. This technology will be made available to biologists through GATB, a soon-to-be-released toolbox dedicated to genome assembly.

Featuring an 800 MHz CPU and 512 MB of RAM, the Raspberry Pi is less powerful than a smartphone.
It weighs a mere 45 g, runs Linux and costs only $35.

Sequencing a genome has become a rather routine operation in recent years thanks to next-generation sequencing (NGS) machines. These relatively inexpensive machines generate heretofore unseen volumes of data: a DNA molecule is recorded as millions of strings in the ‘A C G T’ alphabet. Yet a problem remains: the output is not readily exploitable. It is like a puzzle made of short fragments, called genomic reads, that need to be compared and assembled in order to reconstruct the sequence. Such a task is the raison d'être of specific applications called assemblers. Alas, these tools come with a drawback: one needs a cluster to run them. In a context of data deluge, assembly is thus emerging as the bottleneck in biologists' workflow.

“Memory footprint is really the crux of the matter here,” sums up Dominique Lavenier, head of GenScale ⁽¹⁾, a research team that has taken it upon itself to redesign the algorithms underpinning such tools. As a result of a lengthy effort of code optimization, the scientists are coming up with a suite of applications that run with little RAM ⁽²⁾. “Roughly 20 to 50 times less memory compared to previous tools in the field.” Such a leap could prove a game-changer for biologists. “They will be able to assemble on desktops instead of clusters.”

In order to demonstrate the RAM frugality of their novel algorithms, the researchers went as far as trying to run one application on the Raspberry Pi, a famous credit-card-sized rudimentary computer sporting a paltry 512 MB of memory. The first major experiment was conducted on C. elegans, a microscopic roundworm whose 100-million-base-pair genome was assembled in about 19 hours. “It didn't work the first time, which prompted us to undertake further code optimization. But at the end of the day, we succeeded.” Credit is due to postdoc Guillaume Collet, who managed to implement the software on so small a device. Shortly thereafter, the idea popped up of trying to assemble the raspberry genome and its daunting 400 million base pairs on the eponymous mini-computer. Joshua Udall, a professor at Brigham Young University in Utah, graciously supplied the sequence dataset. The assembly then took a solid week, but it worked, opening the door to more convenient assembly on run-of-the-mill desktop computers.

Instrumental to this effort were two PhD students. “Rayan Chikhi and Guillaume Rizk were the ones who brought up the seminal ideas ⁽³⁾. Three computer scientists at Paris-Est Marne-la-Vallée University also joined in, adding a very valuable contribution ⁽⁴⁾ to our code optimization. After a while, it became apparent that their research had laid the foundation for real software that could be of service to the scientific community.” At this juncture, the French research agency ANR stepped up to the plate with €180,000 in funding meant to help turn what was still a scientific prototype into fully operational software. The hope is also to foster technology transfer to industry. Interestingly, the ANR funding enabled GenScale to recruit an engineer who completely refashioned the software structure. “Erwan Drezen put a strong emphasis on building libraries as bases for all our applications. In hindsight, this proved very relevant, as creating specific tools from these libraries now only requires a few days of work.”

Many Tools in the Box

Released under the AGPL license, the GATB suite comprises a variety of applications that address the distinct phases of genome assembly, namely k-mer counting, contig construction and scaffolding. These successive steps could be described as assembling small parts, then bigger chunks, then the final puzzle.
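
To make the first of these phases concrete, here is a minimal sketch of k-mer counting in Python. It is only an illustration of the concept, not the method used by the GATB tools: it keeps every k-mer in memory, which is precisely what low-memory software tries to avoid, and it ignores reverse complements for simplicity. The function name `count_kmers` and the toy reads are assumptions made for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads.

    Naive in-memory version for illustration only; reads shorter
    than k contribute nothing.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read, one base at a time.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Three overlapping toy reads (hypothetical data).
reads = ["ACGTAC", "CGTACG", "GTACGT"]
kmer_counts = count_kmers(reads, k=4)

for kmer, n in sorted(kmer_counts.items()):
    print(kmer, n)
```

K-mers seen in several reads are likely to be genuine, while k-mers seen only once in a deeply sequenced dataset are often the product of a sequencing error, which is one reason counting comes first.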

For k-mer counting, biologists will use DSK and then switch to BLOOCOO, which is a read error corrector. Since errors depend on the sequencer used, several NGS-specific correctors will eventually be added to the toolbox. Next in line stands Minia, an application designed for contig construction, which incidentally happens to be the most costly phase. As a stand-alone tool, Minia has already been downloaded 10,000 times. “At the moment, we are working on the scaffolding step as well as on filling the gaps between the scaffolds. Other applications will follow, including MapSembler, which is an assembler targeted at given regions of interest.”
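
The publications cited in notes (3) and (4) build on Bloom filters to store a de Bruijn graph compactly. The Python sketch below illustrates the general idea only, under stated assumptions: it is not Minia's code, it leaves out the extra step that makes the published representation exact, and the names `BloomFilter` and `successors` as well as the filter sizes are hypothetical. K-mers are stored as bits in a small array, and the edges of the graph are never stored at all: they are recovered by asking the filter which of the four possible extensions of a k-mer are present.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array answering 'possibly present' or 'definitely absent'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def successors(kmer, kmer_filter):
    """Return the k-mers that may follow `kmer` in the implicit graph.

    May include false positives, but never misses a stored k-mer.
    """
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_filter]


# Toy k-mers (hypothetical data) stored in 1024 bits, i.e. 128 bytes.
kmers = ["ACGT", "CGTA", "GTAC"]
kmer_filter = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in kmers:
    kmer_filter.add(kmer)

print(successors("ACGT", kmer_filter))
```

The memory cost depends on the number of bits chosen rather than on the length of the k-mers, at the price of occasional false positives that a real assembler has to handle.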

Full toolbox profile at http://gatb.inria.fr

The libraries will be part of the toolbox, thus enabling computer programmers to swiftly compose their own high-throughput, low-memory applications. “And we very much encourage them to do so,” Dominique Lavenier concludes. “Indeed, assembling the genome of a bacterium or a mammal calls for different approaches and different tools. Therefore, our vision is that genomic assembly should rely on a multitude of custom-made assemblers targeted at specific species.”

Publication: E. Drezen, G. Rizk, R. Chikhi, C. Deltel, C. Lemaitre, P. Peterlongo, D. Lavenier, GATB: Genome Assembly & Analysis Tool Box, Bioinformatics, 2014.
- - - -

Notes:
(1) Scalable Optimized and Parallel Algorithms for Genomics: GenScale is a joint project team of Inria/ENS Cachan/Rennes 1 University, common to Irisa (UMR CNRS 6074). Guillaume Collet is a member of Dyliss, a joint project team of Inria/CNRS/Rennes 1 University, common to Irisa.
(2) Random Access Memory. RAM is used as temporary storage for the applications.
(3) Read: R. Chikhi, G. Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter, Algorithms for Molecular Biology 2013, 8:22.
(4) Read: K. Salikhov, G. Sacomoto, G. Kucherov, Using Cascading Bloom Filters to Improve the Memory Usage for de Bruijn Graphs, Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 8126, 2013, pp 364-376.