Developing computational algorithms for sequence analysis
The motivation behind developing new algorithms stems from the relentless advances in sequencing technologies and a concomitant precipitous drop in cost. The first draft of the human genome and its approximately 3 billion base pairs was published in 2001 after 11 years of labour and carried a price tag of roughly $3 billion (£2.1 billion). Two years later, in 2003, the project was considered complete. This endeavour involved scientists from 20 research institutions in six different countries. In parallel, a private company called Celera, backed by private funding, independently sequenced the human genome at a fraction of the price ($300 million / £210 million) and in a fraction of the time (9 months) compared to the publicly funded effort. Their draft paper was also published in 2001.
Today, 15 years later, sequencing all of the approximately three billion base pairs of a human being thirty times over can be done on a single sequencing machine with minimal staff in just three days, for a total cost of under $1,000 / £700. This monumental achievement constitutes a price drop of >3,000,000-fold and an increase in throughput rate of >24,000-fold in less than two decades (Figure 1a). This free-fall in total cost and these astronomical advances in throughput have ushered in an era of data generation never before witnessed in medicine.
In fact, with continuous breakthroughs in sequencing technologies, the number of sequenced genomes could reach anywhere between 100 million and 2 billion (25% of the world population by that time) by the year 2025 (Figure 1b).
Genomic data contains massive amounts of information. For example, a single human genome sequenced to an average depth of 30-fold occupies some 100 gigabytes of compressed storage space. Large-scale resequencing efforts today involve anywhere from many thousands of genomes (e.g. the 1000 Genomes Project and deCODE), to tens of thousands (The Cancer Genome Atlas, the International Cancer Genome Consortium, and UK10K), to hundreds of thousands (The 100,000 Genomes Project and the Precision Medicine Initiative). Collectively, and separately, these gargantuan datasets of dense and complex information, the so-called “big data”, provide an unprecedented opportunity to investigate human variation in both health and disease.
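To put these figures in perspective, the storage arithmetic implied above can be sketched directly. The per-genome size of roughly 100 gigabytes is taken from the text; the cohort sizes below are round illustrative numbers, not exact project counts:

```python
# Back-of-the-envelope storage estimates for large resequencing cohorts,
# assuming ~100 GB of compressed data per 30-fold human genome (as above).
# The cohort sizes are round illustrative numbers, not exact project counts.
GB_PER_GENOME = 100

cohorts = {
    "thousands of genomes": 2_500,
    "tens of thousands": 25_000,
    "hundreds of thousands": 100_000,
}

for label, n_genomes in cohorts.items():
    total_gb = n_genomes * GB_PER_GENOME
    print(f"{label}: {n_genomes:,} genomes ~ {total_gb / 1_000_000:.2f} petabytes")
```

Even under these conservative assumptions, a cohort of one hundred thousand genomes occupies on the order of ten petabytes of compressed storage.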
This incessant stream of genomic information will require novel computational approaches to address specific biological questions in reasonable timeframes. In many settings, the technological advancements have already outpaced the development of fast and memory frugal software used to analyse the data. Therefore, there is an urgent need to develop computational frameworks that are rapid, easy-to-use (especially for users without extensive domain-specific knowledge), and are highly modular.
POLYGON: an ultra-fast and unified computational framework for sequencing data
Raw sequencing data streaming off the machine generally require considerable amounts of polishing prior to being usable for downstream analysis. Confounding factors include systematic experimental errors, both avoidable and unavoidable, and random artefacts arising during sequencing. Because of the paramount importance of overcoming these problems, quality control analyses and pre-processing steps are inextricably linked to most, if not all, next-generation sequencing workflows.
In addition to quality control issues, investigators must frequently assess a number of metrics for suitability, summarize data for further analysis, and remove attached barcodes used in identification or stratification strategies. To this end, a plethora of tools have been developed to address these challenges.
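As a minimal sketch of what such pre-processing involves, the following shows two of the steps named above: removing an attached barcode from the 5' end of a read and trimming low-quality bases from its 3' end. The barcode length (6 bp) and the Phred threshold (20, assuming Phred+33 quality encoding) are illustrative assumptions, not the defaults of any particular tool:

```python
# Sketch of two common pre-processing steps for raw sequencing reads.
# The barcode length and quality threshold are illustrative assumptions.

def strip_barcode(seq, qual, barcode_len=6):
    """Remove an attached barcode used for sample identification,
    together with its corresponding quality characters."""
    return seq[barcode_len:], qual[barcode_len:]

def trim_low_quality(seq, qual, min_phred=20, offset=33):
    """Truncate the read at the first base whose Phred score
    (Phred+33 encoding) falls below min_phred."""
    for i, q in enumerate(qual):
        if ord(q) - offset < min_phred:
            return seq[:i], qual[:i]
    return seq, qual

# Toy read: 6 bp barcode followed by the insert; '#' encodes Phred 2.
seq, qual = strip_barcode("ACGTACGGATTACA", "IIIIIIIIIIII#I")
seq, qual = trim_low_quality(seq, qual)
```

In practice such steps are combined with adapter removal, duplicate marking, and per-base quality summaries, which is why they recur across nearly all sequencing workflows.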
These tools are generally run either standalone or in a chain where each tool accepts as input the output of the previous one—a so-called computational pipeline. Because these tools take as input identical, or nearly identical, data, there is unnecessary, and completely avoidable, redundancy in processing. In stark contrast, we developed POLYGON (Figure 2), a unified computational framework that handles the most common computational operations applied to sequencing data. This unified design results in considerable savings in time and space (disk storage and memory usage).
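The advantage of unifying these operations can be illustrated by collecting several summary metrics in a single traversal of the data, rather than having each standalone tool independently re-read the same input. This is a conceptual sketch of the single-pass idea only, not POLYGON's actual implementation:

```python
# Conceptual sketch of single-pass processing: several summary metrics are
# gathered in one traversal of the reads, instead of one pass per tool.
# This illustrates the idea only; it is not POLYGON's implementation.
from collections import Counter

def single_pass_summary(reads):
    stats = {"n_reads": 0, "n_bases": 0, "base_counts": Counter()}
    for seq in reads:                     # a single pass over the input
        stats["n_reads"] += 1
        stats["n_bases"] += len(seq)
        stats["base_counts"].update(seq)  # composition, GC content, etc.
    gc = stats["base_counts"]["G"] + stats["base_counts"]["C"]
    stats["gc_fraction"] = gc / stats["n_bases"] if stats["n_bases"] else 0.0
    return stats

stats = single_pass_summary(["ACGT", "GGCC", "ATAT"])
```

Running three separate tools for read counts, base composition, and GC content would traverse the input three times; the unified pass reads it once.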
Djinn: a compression framework for genotyping data
Genotypes (literally gene + type) describe what combination of alleles (genetic variants) a person carries at a given locus (position in the genome). Human beings are genetically diploid—having two copies of each chromosome—and thus a genotype describes the genetic makeup at a given locus on both chromosome copies.
Simplified, a human genotype can consist of: (1) both reference; (2) one non-reference and one reference; (3) one reference and one non-reference (the mirrored case); or (4) neither reference. Reference and non-reference refer to the expected base at that position in the idealized reference human genome. If we abbreviate R for reference and N for non-reference, this situation can be rewritten as: (1) RR; (2) NR; (3) RN; and (4) NN. The non-reference allele can be any base (A, T, G, or C) that is not the reference, or can be a short insertion or deletion (indel for short).
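Because there are only four genotype classes, each genotype fits into two bits, so four genotypes can be packed into a single byte. The bit assignment in this sketch is an illustrative choice, not the encoding used by any particular tool:

```python
# Pack the four genotype classes (RR, RN, NR, NN) into two bits each,
# four genotypes per byte. The bit assignment is an illustrative choice.
CODES = {"RR": 0b00, "RN": 0b01, "NR": 0b10, "NN": 0b11}
NAMES = {v: k for k, v in CODES.items()}

def pack_genotypes(genotypes):
    """Pack a list of genotype strings into a bytearray, 4 per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        packed[i // 4] |= CODES[g] << (2 * (i % 4))
    return packed

def unpack_genotype(packed, i):
    """Recover the i-th genotype from the packed representation."""
    return NAMES[(packed[i // 4] >> (2 * (i % 4))) & 0b11]

packed = pack_genotypes(["RR", "NR", "RN", "NN", "RR"])
```

At two bits per genotype, this representation alone shrinks a plain-text genotype table roughly sixteen-fold before any general-purpose compression is applied.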
These tables describing genotypes for many thousands of genomes are very large. For example, the genotypes for over 79 million sites across the 2,535 genomes from 26 genetically distinct populations described in the 1000 Genomes Project phase III data occupy almost 1,000 gigabytes of uncompressed disk space. In order to speed up frequent computations applied to genotyping data, we developed a computational framework termed Djinn. Our framework compresses this information into a small fraction of its original size and allows users to find a locus of interest in a specific population virtually instantaneously and to calculate a large body of metrics frequently used in the analysis of these data.
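One reason such tables compress so well is that, at most sites, the vast majority of individuals are homozygous reference (RR). A simple run-length encoding is enough to illustrate the principle; this sketch should not be taken as Djinn's actual compression scheme, which is not described here:

```python
# Run-length encoding of a genotype column as an illustration of why
# genotype tables compress well. This is not Djinn's actual scheme.

def rle_encode(genotypes):
    """Encode a genotype column as [genotype, run length] pairs."""
    runs = []
    for g in genotypes:
        if runs and runs[-1][0] == g:
            runs[-1][1] += 1
        else:
            runs.append([g, 1])
    return runs

def rle_decode(runs):
    """Expand [genotype, run length] pairs back into a genotype column."""
    return [g for g, n in runs for _ in range(n)]

# A rare variant: a single heterozygous carrier among 2,501 genomes.
column = ["RR"] * 2000 + ["RN"] + ["RR"] * 500
runs = rle_encode(column)
```

Here a column of 2,501 genotypes collapses to three runs; rarer variants compress better still, which is why population-scale genotype data shrinks to a small fraction of its uncompressed size.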