I'm a scientist in Richard Durbin's group at the Sanger Institute working with population-scale data from humans and other vertebrates to perform 'big data' analysis in genomics and medicine. I work at the interface between the computational, mathematical, and biological sciences.
In short, I use genetic and genomic approaches to improve our understanding of the aetiology of rare and complex disease, to characterise healthy variation in humans of different ancestry, and to advance knowledge of human population evolution, demography, and history. Currently I am developing efficient and robust algorithms in the context of several ongoing large-scale driver projects, including the Earth BioGenome Project, the Haplotype Reference Consortium, the UK100K Genomes Project, The Birds 10K Project, the Sanger VGP Project, and the Genome 10K Project.
I am interested in 'Big Data' science, which offers unprecedented opportunities to disentangle complex biological problems.
We use machine learning approaches to address statistical problems that are hard to model exactly. Currently we are investigating the use of AI in indel variant calling.
We aim to: (1) develop algorithms to work efficiently with sequence graphs; (2) exploit genetic information to assemble reference genomes; and (3) address the computational complexity of de novo assemblies, especially in the context of variation graphs.
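To make the sequence-graph aims concrete, here is a toy variation graph in Python. This is an illustrative sketch only, not the group's actual implementation: nodes hold sequence fragments, edges record allowed adjacencies, and a haplotype is a path whose node sequences concatenate into one allele of the region.

```python
class VariationGraph:
    """Minimal variation graph: sequence-labelled nodes plus directed edges."""

    def __init__(self):
        self.seq = {}   # node id -> DNA fragment
        self.succ = {}  # node id -> list of successor node ids

    def add_node(self, nid, s):
        self.seq[nid] = s
        self.succ.setdefault(nid, [])

    def add_edge(self, u, v):
        self.succ[u].append(v)

    def spell(self, path):
        """Concatenate node sequences along a path of node ids."""
        return "".join(self.seq[n] for n in path)

# A single SNP site: shared flanks (nodes 1 and 4) with an A/G bubble.
g = VariationGraph()
g.add_node(1, "ACGT"); g.add_node(2, "A")
g.add_node(3, "G");    g.add_node(4, "TTCA")
g.add_edge(1, 2); g.add_edge(1, 3)
g.add_edge(2, 4); g.add_edge(3, 4)

print(g.spell([1, 2, 4]))  # reference allele: ACGTATTCA
print(g.spell([1, 3, 4]))  # alternate allele: ACGTGTTCA
```

Both alleles share the flank nodes, so variation is stored once per site rather than once per haplotype, which is what makes graph references compact.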
Javelin uses the linkage disequilibrium structure in population data, estimated by Tomahawk, to scaffold assemblies.
Uses a graph-based, assembly-like approach to find an optimal path through contigs.
The model tries to find the correct orientation for contigs when building scaffolds.
It takes into consideration hundreds of millions to billions of genome-wide LD values to build the model.
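The graph-based chaining idea can be sketched as follows. This is an illustrative simplification, not Javelin's actual algorithm: contigs become graph nodes, inter-contig LD values are aggregated into edge weights, and contigs are greedily joined end-to-end along the heaviest remaining edges, rejecting joins that would give a contig more than two neighbours or close a cycle.

```python
def greedy_scaffold(edges):
    """edges: {(contig_u, contig_v): summed LD weight}. Returns contig chains."""
    joins = sorted(edges.items(), key=lambda kv: -kv[1])  # heaviest first
    neighbours = {}  # contig -> contigs it is joined to (at most 2)
    comp = {}        # union-find parent, to detect cycles

    def find(x):
        comp.setdefault(x, x)
        while comp[x] != x:
            comp[x] = comp[comp[x]]  # path halving
            x = comp[x]
        return x

    for (u, v), w in joins:
        if len(neighbours.get(u, [])) >= 2 or len(neighbours.get(v, [])) >= 2:
            continue  # a contig interior already has two neighbours
        if find(u) == find(v):
            continue  # joining would create a cycle
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
        comp[find(u)] = find(v)

    # Walk each chain from a degree-1 endpoint to recover scaffold order.
    seen, scaffolds = set(), []
    for start in neighbours:
        if len(neighbours[start]) == 1 and start not in seen:
            chain, prev, cur = [], None, start
            while cur is not None:
                chain.append(cur)
                seen.add(cur)
                nxt = [n for n in neighbours[cur] if n != prev]
                prev, cur = cur, (nxt[0] if nxt else None)
            scaffolds.append(chain)
    return scaffolds

# Joins A-B then B-C; the weak A-C edge is rejected because it closes a cycle.
chains = greedy_scaffold({("A", "B"): 10.0, ("B", "C"): 8.0, ("A", "C"): 1.0})
```

Orientation handling (which strand of each contig faces the join) is omitted here; a fuller model would carry a sign per contig end, as the blurb above describes.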
Specialized algorithms calculate pairwise LD in large-scale population cohorts, achieving speeds in excess of a trillion genotypes per second per workstation.
Linkage disequilibrium can be calculated from variant calls generated with any technology.
Two different algorithms maximize throughput, with specialized code for CPU-specific SIMD instructions: support ranges from SSE2 to the future AVX-1024.
The algorithms are embarrassingly parallel and ship with scatter-gather reduction functionality.
All equations for calculating LD are closed-form.
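The closed-form nature of the LD equations can be seen in a few lines. This is a textbook sketch rather than Tomahawk's actual (SIMD-vectorized) code: given phased 0/1 haplotypes at two biallelic sites, r² follows directly from the allele and haplotype frequencies, with no iteration required.

```python
def ld_r2(hap_a, hap_b):
    """Closed-form r^2 between two sites from phased 0/1 haplotype vectors."""
    n = len(hap_a)
    p_a = sum(hap_a) / n   # alternate-allele frequency at site A
    p_b = sum(hap_b) / n   # alternate-allele frequency at site B
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n  # freq of the 1/1 haplotype
    d = p_ab - p_a * p_b   # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0  # monomorphic sites give r^2 = 0

# Two identical sites are in perfect LD:
print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))  # -> 1.0
```

Because every quantity reduces to counting set bits across haplotype vectors, the same computation maps naturally onto the popcount and SIMD instructions mentioned above.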
Djinn is a framework for population-scale haplotype compression that exploits linkage disequilibrium.
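The intuition behind LD-exploiting haplotype compression can be sketched in a few lines. This is an illustrative toy, not Djinn's actual codec: because nearby variants are strongly correlated, XOR-ing each variant column against the previous one produces mostly zeros, which a simple run-length encoder then compresses well.

```python
def rle(bits):
    """Run-length encode a 0/1 list as [value, length] pairs."""
    out = []
    for b in bits:
        if out and out[-1][0] == b:
            out[-1][1] += 1
        else:
            out.append([b, 1])
    return out

def compress(columns):
    """columns: one 0/1 vector per variant site (samples as entries)."""
    prev = [0] * len(columns[0])
    encoded = []
    for col in columns:
        delta = [a ^ b for a, b in zip(col, prev)]  # sparse under high LD
        encoded.append(rle(delta))
        prev = col
    return encoded

def decompress(encoded, n_samples):
    prev = [0] * n_samples
    cols = []
    for runs in encoded:
        delta = [b for b, length in runs for _ in range(length)]
        col = [a ^ b for a, b in zip(delta, prev)]
        cols.append(col)
        prev = col
    return cols

# Three nearly identical variant columns collapse to a handful of runs.
cols = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
assert decompress(compress(cols), 4) == cols  # lossless round trip
```

Real haplotype codecs use far stronger models (e.g. positional reordering of samples so correlated haplotypes become adjacent), but the gain comes from the same source: LD makes consecutive columns nearly identical.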
POLYGON is an ultra-fast, unified computational framework for next-generation sequencing data. It is built on lightweight, memory-frugal algorithms with low overhead and extensive capabilities.