Current research

My research interests are in the use and development of computational algorithms and mathematical models to investigate biological processes in both healthy and diseased individuals. Primary research areas include computational cancer genomics, human structural variation, integration of multi-omics data, and the development of algorithms for compressing, processing, and analysing next- and third-generation sequencing data. Collectively, I work at the interface between the computational, biological, and mathematical sciences, making heavy use of genome-wide technologies to advance basic and translational medical research. A large fraction of this research translates naturally to other research fields and to the private sector.

Developing computational algorithms for sequence analysis

The motivation behind developing new algorithms stems from the relentless advances in sequencing technologies and a concomitant precipitous drop in cost. The first draft of the human genome and its approximately 3 billion base pairs was published in 2001 after 11 years of labour, with a price tag of roughly $3 billion (£2.1 billion). Two years later, in 2003, the project was considered complete. This endeavour involved scientists from 20 research institutions in six different countries. In parallel, the privately funded company Celera independently sequenced the human genome at a fraction of the price ($300 million / £210 million) and in a fraction of the time (9 months) compared to the publicly funded effort. Their draft paper was also published in 2001.

Today, 15 years later, sequencing all the approximately three billion base pairs of a human being thirty times over can be done on a single sequencing machine with minimal staff in just three days, for a total cost of under $1,000 / £700. This monumental achievement constitutes a price drop of >3,000,000-fold and an increase in throughput of >24,000-fold in less than two decades (Figure 1a). This free fall in total cost and these astronomical advances in throughput have ushered in an era of data generation previously unwitnessed in medicine.

[Figure 1]

Figure 1 a) How the approximate total cost (labour, instrument cost, consumables, and informatics) of sequencing a human genome has evolved over time. Data from the US National Human Genome Research Institute (NHGRI) Genome Sequencing Programme (GSP). Major historical events are highlighted and the cost at those timepoints is displayed beneath in magenta. Colours were chosen to match those in b). Released under CC BY 4.0. b) Modified from Nature (doi:10.1038/527S2a), which was in turn adapted from Stephens, Z. D. et al. PLoS Biol. 13, e1002195 (2015), CC BY 4.0.

In fact, with continuous breakthroughs in sequencing technologies, the number of sequenced genomes could potentially reach anywhere between 100 million and 2 billion (25% of the world population by that time) by the year 2025 (Figure 1b).

Genomic data contains massive amounts of information. For example, a human genome sequenced to an average depth of 30-fold occupies some 100 gigabytes of compressed storage space. Large-scale resequencing efforts today involve anywhere from many thousands of genomes (e.g. the 1000 Genomes Project and deCODE), to tens of thousands (The Cancer Genome Atlas, the International Cancer Genome Consortium, and UK10K), to hundreds of thousands (the 100,000 Genomes Project and the Precision Medicine Initiative). Both collectively and separately, these gargantuan datasets of dense and complex information, the so-called “big data”, provide an unprecedented opportunity to investigate human variation in both health and disease.
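The 100-gigabyte figure can be recovered from a rough back-of-the-envelope calculation. The Python sketch below uses illustrative assumptions only; the exact footprint depends on read length, quality-score resolution, and the compression scheme used.

    # Back-of-the-envelope estimate of compressed storage for a 30-fold human genome.
    # Assumptions (illustrative only): ~3.1 billion bases in the genome, 30-fold
    # average depth, and roughly one byte per base call plus its quality score
    # after typical compression of FASTQ/BAM data.
    GENOME_SIZE_BP = 3.1e9   # approximate haploid human genome size
    DEPTH = 30               # average sequencing depth
    BYTES_PER_BASE = 1.0     # assumed compressed bytes per sequenced base

    total_bases = GENOME_SIZE_BP * DEPTH
    total_gb = total_bases * BYTES_PER_BASE / 1e9
    print(f"~{total_bases:.1e} sequenced bases, roughly {total_gb:.0f} GB compressed")
    # -> ~9.3e+10 sequenced bases, roughly 93 GB compressed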

This incessant stream of genomic information will require novel computational approaches to address specific biological questions within reasonable timeframes. In many settings, the technological advancements have already outpaced the development of fast and memory-frugal software for analysing the data. There is therefore an urgent need for computational frameworks that are rapid, easy to use (especially for users without extensive domain-specific knowledge), and highly modular.

Implementations

POLYGON: an ultra-fast and unified computational framework for sequencing data

Figure 2 POLYGON is a multiphasic implementation of our proposed unified computational framework. Unlike all other software used to analyse next-generation sequencing data, POLYGON can apply most processing and analytical functions in a single pass of the data, resulting in considerable savings in runtime. Consolidating multiple functions into a single tool also greatly simplifies, or even supplants, existing complex computational pipelines.

Raw sequencing data streaming off the machine generally requires considerable polishing before it is usable for downstream analysis. Confounding factors include systematic experimental errors, both avoidable and unavoidable, and random artefacts arising during sequencing. Because of the paramount importance of overcoming these problems, quality-control analyses and pre-processing steps are inextricably linked to most, if not all, next-generation sequencing workflows.

In addition to quality-control issues, investigators must frequently assess a number of metrics for suitability, summarise data for further analysis, and remove attached barcodes used in identification or stratification strategies. A plethora of tools have been developed to address these challenges.

These tools are generally run either standalone or in a chain in which each tool accepts as input the output of the previous one, a so-called computational pipeline. Because these tools take identical, or nearly identical, data as input, there is unnecessary, and entirely avoidable, redundancy in processing. In stark contrast, we developed POLYGON, a unified computational framework that handles most common computational operations applied to sequencing data within a single tool (Figure 2). This results in considerable savings in time and space (disk storage and memory usage).
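To make the single-pass idea concrete, the Python sketch below streams each sequencing record once and feeds it to several independent analyses during the same traversal, instead of each tool re-reading the data from disk. The record layout and the analysis classes are hypothetical illustrations, not part of POLYGON's actual interface.

    # Minimal sketch of single-pass processing: every registered analysis sees
    # each record during one traversal of the data, so the input is read only once.
    # The record format and the two toy analyses are illustrative only.

    class MeanQuality:
        """Accumulates the mean base quality across all records."""
        def __init__(self):
            self.total, self.count = 0, 0
        def consume(self, record):
            self.total += sum(record["qualities"])
            self.count += len(record["qualities"])
        def report(self):
            return {"mean_quality": self.total / max(self.count, 1)}

    class GCContent:
        """Accumulates the overall GC fraction."""
        def __init__(self):
            self.gc, self.bases = 0, 0
        def consume(self, record):
            seq = record["sequence"]
            self.gc += seq.count("G") + seq.count("C")
            self.bases += len(seq)
        def report(self):
            return {"gc_fraction": self.gc / max(self.bases, 1)}

    def single_pass(records, analyses):
        """Stream the records once, feeding every analysis in turn."""
        for record in records:
            for analysis in analyses:
                analysis.consume(record)
        results = {}
        for analysis in analyses:
            results.update(analysis.report())
        return results

    # Toy usage with two in-memory records standing in for a real FASTQ/BAM stream.
    records = [
        {"sequence": "ACGTGGCA", "qualities": [30, 32, 35, 31, 29, 33, 34, 30]},
        {"sequence": "TTGCACGT", "qualities": [28, 30, 31, 29, 33, 35, 32, 30]},
    ]
    print(single_pass(records, [MeanQuality(), GCContent()]))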

Djinn: a compression framework for genotyping data

Genotypes (literally gene + type) describe which combination of alleles (genetic variants) a person carries at a given locus (position in the genome). Human beings are genetically diploid, carrying two copies of each chromosome, and a genotype therefore describes the genetic makeup at a given locus on each of the two chromosome copies.

Simplified, a human genotype can consist of: (1) both reference; (2) one non-reference and one reference; (3) one reference and one non-reference (the mirrored case); or (4) neither reference. Reference and non-reference refer to the expected base at that position in the idealised reference human genome. If we abbreviate reference as R and non-reference as N, these cases can be rewritten as: (1) RR; (2) NR; (3) RN; and (4) NN. The non-reference allele can be any base (A, T, G, or C) other than the reference base, or a short insertion or deletion (indel for short).
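Because there are only four such states, each genotype can be stored in as little as two bits. The Python sketch below shows a generic two-bit packing of the RR/RN/NR/NN codes; it is an illustration of the principle, not necessarily the internal representation used in Djinn.

    # Illustrative 2-bit genotype encoding: each diploid genotype (RR, RN, NR, NN)
    # maps to one of four codes, so four genotypes fit in a single byte.
    # This is a generic packing scheme, not necessarily Djinn's internal format.
    CODES = {"RR": 0b00, "RN": 0b01, "NR": 0b10, "NN": 0b11}
    DECODE = {v: k for k, v in CODES.items()}

    def pack(genotypes):
        """Pack a list of genotype strings into a bytearray, four genotypes per byte."""
        packed = bytearray((len(genotypes) + 3) // 4)
        for i, gt in enumerate(genotypes):
            packed[i // 4] |= CODES[gt] << (2 * (i % 4))
        return packed

    def unpack(packed, n):
        """Recover the first n genotypes from a packed bytearray."""
        return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

    genotypes = ["RR", "RN", "NN", "RR", "NR"]
    packed = pack(genotypes)
    assert unpack(packed, len(genotypes)) == genotypes
    print(f"{len(genotypes)} genotypes stored in {len(packed)} bytes")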

Tables describing the genotypes of many thousands of genomes are very large. For example, genotypes for over 79 million sites across the 2,535 genomes from 26 genetically distinct populations in the 1000 Genomes Project phase III data occupy almost 1,000 gigabytes of uncompressed disk space. In order to speed up frequent computations applied to genotyping data, we developed a computational framework termed Djinn. Our framework compresses this information into a small fraction of its original size, allows users to find a locus of interest in a specific population virtually instantaneously, and calculates a large body of metrics frequently used in the analysis of such data.
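One reason such tables compress so well is that, at most sites, the vast majority of samples carry the reference genotype, so a genotype column is dominated by long runs of identical values. The Python sketch below shows run-length encoding of a single site's column as a minimal example of this principle; it is one of many schemes a framework like Djinn might combine and is not its actual codec.

    # Illustrative run-length encoding of one site's genotype column.
    # Real genotype columns are highly compressible because most samples carry
    # the reference genotype at most sites; this sketch is not Djinn's actual codec.

    def rle_encode(genotypes):
        """Collapse runs of identical genotypes into [genotype, run_length] pairs."""
        runs = []
        for gt in genotypes:
            if runs and runs[-1][0] == gt:
                runs[-1][1] += 1
            else:
                runs.append([gt, 1])
        return runs

    def rle_decode(runs):
        """Expand [genotype, run_length] pairs back into the original column."""
        return [gt for gt, length in runs for _ in range(length)]

    # Toy column for one rare variant across 2,535 samples: almost everyone is RR.
    column = ["RR"] * 1200 + ["RN"] + ["RR"] * 1330 + ["NN"] + ["RR"] * 3
    runs = rle_encode(column)
    assert rle_decode(runs) == column
    print(f"{len(column)} genotypes -> {len(runs)} runs")
    # -> 2535 genotypes -> 5 runs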