Hello, my name is Marcus!

I'm a scientist in Richard Durbin's group at the Sanger Institute working with population-scale data from humans and other vertebrates to perform 'big data' analysis in genomics and medicine. I work at the interface between the computational, mathematical, and biological sciences.

In short, I use genetic and genomic approaches to improve our understanding of the aetiology of rare and complex disease, to characterise healthy variation in humans of different ancestry, and to advance knowledge of human population evolution, demography and history. Currently I am developing efficient and robust algorithms in the context of several ongoing large-scale driver projects, including the Earth BioGenome Project, the Haplotype Reference Consortium, the UK100K Genomes Project, the Birds 10K Project, the Sanger VGP Project, and the Genome 10K Project.

Research interests

I am interested in 'Big Data' science, which offers unprecedented opportunities to disentangle complex biological problems.

Artificial Intelligence

We use machine learning approaches to address statistical problems that are hard to model exactly. Currently we're investigating the use of AI in indel variant calling.

Genome Assembly and Variation Graphs

We aim to: (1) develop algorithms to work efficiently with sequence graphs; (2) exploit genetic information to assemble reference genomes; and (3) address the computational complexity of de novo assemblies, especially in the context of variation graphs.

Developing Algorithms

Our focus is on developing methods for sharing, integrating and analysing diverse datasets at speeds and scales that are currently unachievable.

Assembling scaffolds

Javelin uses the linkage disequilibrium structure in population data, as estimated by Tomahawk, to scaffold assemblies.

Assemble contigs into scaffolds

Uses a graph-based, assembly-like approach to find an optimal path through contigs.

Orients contigs and scaffolds

The model tries to find the correct orientation for contigs when building scaffolds.

Uses billions of associations

Takes hundreds of millions to billions of genome-wide LD values into consideration to build a model.

Calculate linkage disequilibrium

Specialized algorithms for calculating pairwise LD in large-scale population cohorts. Achieves speeds in excess of a trillion genotypes per second per workstation.

Intrinsic genetic properties

Linkage disequilibrium can be calculated from variant calls generated with any technology.

Fast algorithms and optimized code

Two different algorithms are provided to maximize throughput, with specialized code for CPU-specific SIMD instruction sets, from SSE2 through to a future AVX-1024.

Parallel code execution

The algorithms are embarrassingly parallel and ship with scatter-gather reduction functionality.

Closed form mathematics

All equations for calculating LD are closed form.
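
To illustrate what a closed-form LD calculation looks like, here is a minimal sketch in Python of the standard textbook statistics D and r² for two biallelic sites, computed from phased haplotypes. It is not Tomahawk's code or API: the function name `pairwise_ld` and the 0/1 list encoding are assumptions made for this example, and it ignores the bit-packing and SIMD techniques mentioned above.

```python
import math


def pairwise_ld(hap_a, hap_b):
    """Return (D, r2) for two biallelic sites.

    hap_a and hap_b are equal-length sequences of 0/1 alleles, with one
    entry per phased haplotype (hypothetical encoding for this sketch).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")

    # Allele frequencies of the '1' allele at each site.
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying the '1' allele at both sites.
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n

    # D = p_AB - p_A * p_B
    d = p_ab - p_a * p_b
    # r2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)); undefined if a site is monomorphic.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else math.nan
    return d, r2
```

Because every step is a direct function of three counts (carriers at site A, carriers at site B, and joint carriers), each pair of sites can be evaluated independently, which is consistent with the embarrassingly parallel design described above.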

Compressing haplotypes

Djinn is a framework for population-scale haplotype compression that exploits linkage disequilibrium.

Quality control

POLYGON is an ultra-fast, unified computational framework for next-generation sequencing data. It is built from lightweight, memory-frugal algorithms with low overhead and extensive capabilities.