Untangling graphical pangenomics

Jul 9, 2019

For five years I have worked on a simple, universal, graphical model for pangenomic reference systems: the variation graph. I haven’t been alone. This work began as part of the GA4GH data working group (GA4GH-DWG), which spent several years debating and proposing various graphical pangenomic models. After I built a working system based on the variation graph model, vg, many members of that groups then joined me to continue its development, forming the vgteam. Although the data formats produced by our effort was never ratified as a standard by the GA4GH, they are now in wide use.

This group continues to build and refine reference implementations of bioinformatic methods based on the variation graph model. Together, we have demonstrated that these approaches can significantly improve read mapping to variant alleles, and although this benefit is subtle when the reference graph is built from small variants detected by alignment against a linear reference, it becomes substantial when the graph is built from whole genome assemblies representing all types and scales of variation. These methods and the lessons learned developing them are thus of critical importance to the construction of unified data models for de novo assemblies of large genomes that are rapidly being generated using third generation sequencing technologies.

A model for graph genomes

A variation graph combines three types of elements in a pangenomic data structure. We have DNA sequences (nodes), allowed linkages between them (edges), and genomes (paths) that are walks through the graph. Nodes have identifiers, which are numeric, and paths have names, which are text strings. One concession is made to reflect the genomic use of these graphs. They are bidirectional, and represent both strands of DNA, so positions refer to either the forward or reverse complement orientation of nodes. This means that there are four kinds of edges (+/+, +/-, -/-, -/+), each of which implies its own reverse complement.

There are many ways to visualize these graphs, but perhaps the most instructive here is the first one that I developed, based on graphviz’s dot. This rendering (from my thesis, Graphical pangenomics) shows a fragment of a variation graph built from a progressive alignment of the GRCh38 ALT haplotypes for HLA gene H-3136. We don’t have any inverting edges or cycles, but we do see a feature that’s impossible to express in VCF: nested variation. Above, we see the core of the graphical structure. Below, paths are represented by the sequence of node identifiers they pass through.

Example variation graph visualization with dot for H-3136

Schema based data formats for variation graphs

Along with Adam Novak (who helped to make it bidirectional), I first implemented this model in a short protobuf schema. By compiling this schema into a set of class libraries, we built both an API and a set of data types for graph-based genomics directly from the schema. This approach was very much in line with the activities of the GA4GH-DWG community discussions (for instance: 1 2), which aimed to build schema-based interchange formats rather than basic textual ones.

The schema includes both the basic graph implementation (the .vg format, which is equivalent to a subset of GFAv1), and Alignment object definition that allows us to use these graphs in resequencing. Others have extended this schema to support particular applications such as genotyping and graph structure decomposition. Many more have used it in their own projects. Of particular note are two graph based read aligners, GraphAligner and PaSGAL, which implement efficient heuristic and exact long read alignment against sequence graphs. These methods, as well as vg, emit the GAM (graphical alignment/map) format. GAM extends the concept of a pairwise sequence alignment record to the case of sequence alignments to the graph. As with all data formats in vg, GAM has both textual (in JSON) and binary (protobuf) versions. Mirroring the .vg equivalence with GFA, GAM is semantically equivalent to the nascent line-based GAF format that has recently been proposed. The vg schema and an independent API for reading and writing files in it now lives in libvgio.

In vg, the Path data type plays double duty. It represents both walks through the graph and the base level alignment of a sequence to the variation graph (it’s one of the components of the Alignment object). We see the same pattern in GFA/GAF. Path steps in GFA have cigar strings, as can alignment records in GAF. However, it’s important to note what this allows us to do. We can support the “fundamental” operation of variation graphs, edit. Here, we augment the graph on the left (G) with alignment z, that includes a SNP allele, yielding the graph on the right:

Editing a variation graph

With this operation and an aligner, we can progressively build variation graphs (as in vg msga). If we build a variant caller that emits genotypes as sets of Paths, then we can directly use the output of the variant caller to extend the reference system with the same function (as in vg call).

Pangenomic model choice

The variation graph model is designed to be as simple as possible. It makes no assertions about graph structure or coordinates. This design was intentional, because simplicity allows for generality. It is possible to use virtually any kind of sequence graph as the basis for a variation graph. In vg, we have implemented constructors for variation graphs that start from VCF files and references, de Bruijn graphs, string graphs, multiple sequence alignments, RNA splicing graphs, and whole genome alignments. If you can build a sequence graph, vg can use it as a reference system.

Although graphs built from linear references and VCF files were our primary focus, in our paper on read alignment to variation graphs, we demonstrate that we can use vg to align long PacBio reads from a held out strain to a whole genome alignment graph (made by Cactus) built from six other de novo S. cerevisiae assemblies. My thesis covers many other examples, including a freshwater viral metagenome, a bacterial pangenome, a structural variation graph, a yeast splicing graph, and a human gut microbiome.

This generality came at a cost of increased development time, as there was more for us to learn and discover. However, as the collaboration driving vg progressed, we learned how to handle this complexity. We found that, while the variation graph is an ideal data integration system, we may need to construct technical transformations of genome graphs to enable efficient read alignment against them. In general, these graphs are strict subsets of larger graphs, with reduced complexity. But they may also involve the unfolding or duplication of complex regions, so as to reduce the path explosion that can occur when many variants occur in proximity. We preserve a mapping between the technical and base graphs, using each in its optimal application. It would trivialize our work to suggest that this is always easy. Our success indicates that it is possible and more expedient than was expected by groups that chose to restrict their pangenomic model a priori.

Paths provide coordinates and memory to genome graphs

Many genomics researchers, when confronted with genome graphs, immediately worry that there will no longer be any stable coordinates to place annotations on the graph or relate it to known genomes. Graphs are just representations of multiple sequence alignments. Alignments are capricious and can change depending on scoring parameters, order, or the structure of input sequences and their overlaps. We should not trust that the same input sequences will result in the same graph.

In the GA4GH-DWG, we spent most of a year (2014-2015) debating different data models for reference pangenomes that attempted to resolve this issue. Many participants proposed models that preserved a preexisting reference coordinate system directly in the structure of the graph. However, few provided working implementations to back up their argument, and so these conversations mostly addressed hypothetical issues.

In response, I proposed the first prototype of vg itself and demonstrated that I was able to use it to align reads to a graphical genome model built from the whole 1000 Genomes Project phase 3 release. Because of its generality, the variation graph model could consume any of the other data models that had been proposed, which made it of immediate practical use to drive the comparative evaluations between different graph models that were the primary research output of the GA4GH-DWG.

Variation graphs address the pangenome coordinate problem in the most simple way possible. As long as they embed reference paths that cover most of their space, such that no graph position is “far” from a reference position, we can derive reference relative coordinates for positions on the graph. If we build the graph a different way from the same sequences, the relationship between coordinates in different reference paths changes, but we do not gain or lose any coordinates. This solution is inelegant (we don’t get a single, immutable coordinate hierarchy), but practical and extremely flexible.

A genome graph is just an alignment of linear sequences, and these remain linear even if they are embedded in a graph. We can use this property to project results obtain relative to genome graphs to linear sequences, such as by “surjecting” alignments from the graph into a subset of reference paths that overlap most of the graph (from GAM to BAM). Many analyses in my thesis used this feature, and it is a key aspect of our ongoing work on ancient DNA. In this context, the graph is simply a black box that is used to reduce reference bias during alignment. To illustrate this concept, the following figure shows a surjection function in terms of node space for a graph and the subgraph corresponding to its embedded path. This kind of projection will work for any pair of graphs with the same embedded path.

Surjection between variation graphs

This feature also addresses another key shortcoming of genome graphs. Graphs built from sequences and linkages between them are memoryless, and allow recombinations of alleles that are very unlikely to exist in reality. The paths in the variation graph address exactly this issue, and provide long range structure to the object that would be lost if it were simply a graphical model.

Although paths are expensive to store, in a given species most haplotypes at a given locus will be similar, and we can exploit this to compress genomes into haplotype indexes like the GBWT that use only a fraction of a bit per basepair of stored sequence while providing linear-time haplotype matching and extraction functionality. This result suggests that it might be tenable and even desirable to store all the sequences that we used to build a pangenome, even if these number in the many millions.

The next phase of graphical pangenomics

Now, five years into my experience in this subfield, we are at an inflection point. Far from being a curious way to reduce reference bias against small variants, vg and other graphical genomic methods are poised to become a key toolset for managing and understanding an oncoming deluge of whole, large, genomes from vertebrates, including humans. To address this need, I have spent much of the last year working on a series of tools that allow us to construct and manipulate variation graphs representing whole genome alignments of large numbers of large genomes.

The first, seqwish, consumes alignments made by minimap2 over a set of sequences and produces a variation graph (in GFA format) that losslessly encodes all the sequences (as paths) and their base pair exact alignments (in the graph topology itself). This method is several orders of magnitude faster than equivalent methods like Cactus, and is more flexible than de Bruijn graph based methods like SibeliaZ. Although it is very much still a prototype, I am now reliably able to apply seqwish to collections of eukaryotic genomes such as human, medaka, and cichlids. For testing, I construct pangenomes for yeast on my laptop in a few minutes, a task that used to take up to a day on a large compute node. Still, its performance can be improved greatly, and the complete range-based compression of its data structures (using Heng Li’s awesome implicit interval tree) will allow it to scale to the construction of a lossless pangenome from hundreds of human genomes. Because it exactly respects its input alignments, seqwish can act as a self-contained kernel in a pangenomic construction pipeline. Once it is sufficiently optimized, the difficulty will lie in structuring and selecting the best set of alignments for a given pangenome.

A key failing of vg has been its in-memory graph model, which can consume up to 100 bytes of memory per input base of the graph. Although acceptable for a starting PhD project and proof of principle methods, this is not scalable to the problems that we routinely apply vg to. We have had to work around this in various ways, such as by subsetting graphs to single (human scale) chromosomes, by using larger memory machines, or by transforming the graph into static succinct indexes like xg. Last winter, at the NBDC/DBCLS BioHackathon in Matsue, Jordan Eizenga and I decided to tackle this issue by building a set of dynamic succinct variation graph models. These present a consistent HandleGraph C++ interface, and can be used interchangeably with any algorithms written to work on this interface, of which there are now many. The result of my own work on this topic is odgi, which provides efficient postprocessing for the large graphs built with seqwish. Currently, this method supports the difficult steps of graph topological sorting, pruning, simplification, and kmer indexing. I plan to extend it to support read alignment and progressive graph construction, to complement seqwish and other emerging methods like minigraph.

To illustrate these two tools, consider the set of genomes from the Yeast Population Reference Panel. Here, taking only the cerevisiae genomes, we run minimap2, filter the alignments to remove short alignments <10kb, then induce the variation graph with seqwish and render an image of the output with Bandage:

Seqwish yeast Bandage visualization

This shows a relatively open graph, with some collapse and some apparently unaligned chromosome ends (perhaps because of our length filter).

Rendering with Bandage can take an extremely long time and the resulting graphs are difficult to interpret because it is not easy to view the relationship between different embedded paths in the graph. However, with odgi viz, we can obtain an image of how the embedded chromosomes relate to each other and to the topology of the graph. This linear time rendering method displays the graph topology at the bottom of the image, using rectangular motifs to show where edges (hung below) link two positions in the sorted sequence space of the graph (the black line dividing the top and bottom of the image). Paths are displayed above this topology, with one path at each position on the y axis. The layout is nonlinear, and only shows what positions in the graph are touched by a given path. But, because genome graphs are built from linear sequences, they have a manifold linear property and we can often apply a linear intuition to interpreting them.

In the following figure, genomes are ordered from top to bottom: S288c, DBVPG6765, UWOPS034614, Y12, YPS128, SK1, DBVPG6044. The chromosome order partly results from the initial sequential assignment of node ids by seqwish. We immediately see that a chromosome in UWOPS034614 (a highly diverged strain) has been rearranged into other chromosomes. Also notable are the frequent structural variations and CNVs that appear to be embedded in the ends of chromosomes. This confirms another finding from the paper describing these assemblies, whose authors observed that subtelomeric regions were hotspots of structural variation.

Seqwish yeast odgi viz

If you’re curious how this works, I’ve documented steps to do this for the whole collection. In the coming weeks, I’ll be reporting more on these methods and their applications to collections of human genome assemblies.

Working with other pangenomic methods

I was partly motivated to write this post by Heng Li’s release of a similar toolchain for the construction of pangenomic reference graphs. This approach is the first practical implementation of ideas about coordinate systems and reference graphs that arose during the GA4GH graph genome conversations. Heng proposes that we need a new data model to encode stable coordinates in pangenomes. This model (implemented in rGFA) is akin to the “side graph” that was discussed in those days, and also to the hierarchical model used by Seven Bridges Genomics. It annotates GFA elements with information that indicates their origin in a coordinate hierachy that has been constructed progressively. In effect, rGFA will be able to support the full range of semantics that we maintain in GFA, but it simplifies the process of relating stable coordinate systems into the graph using these annotations. rGFA provides a formal specification for the exchange of path relative coordinates that we have long been caching in various indexes used by tools in the vg ecosystem, and that may be very important for users of pangenomic models.

Heng also puts forward a prototype progressive pangenome construction algorithm (minigraph) that builds the graph to contain only structurally novel sequences relative to an established reference or graph that is being extended. minigraph is beautiful manifestation of Heng’s commitment to simple, efficient, and conceptually clean bioinformatic methods. Its performance is particularly motivating (I measure it as 5-10x faster than the seqwish pipeline), and provided it is based on minimap2 chaining, I would expect the resulting graphs to be of high quality in terms of capturing the large scale relationships between sequences. I am very happy that there is now another method to build pangenomes that can scale to the problem sizes that I’m working at, and I expect to be working with the output of minigraph in my own workflows.

However, I want to point out that there is a stark difference between what minigraph produces and the pangenomic models I’ve laid out here. Users of this method should understand that this is not a general solution to recording collections of genomes in a pangenome, but a way of deriving a coordinate hierarchy that relates to novel sequences according to a given progressive alignment model. I fear that, if taken up as the primary pangenomic model, the minigraph approach in its current form will perpetuate a long cycle of reference bias that has dominated many aspects of genomics since the establishment of resequencing based methods.

Minigraph pangenomes are lossy relative to their input sequences. It will not be possible to build a minigraph that contains all of the sequence in a set of samples unless there are no small variants between these sequences. The sequence included in the graph will be order dependent. This exposes us to reference bias for all kinds of variation that are not sufficiently large or divergent to frustrate the progressive alignment algorithm. The exact configuration that will trigger this may be opaque to end users. In these graphs we will have trouble knowing when we should expect to find a variant on this or that side of a bubble, or knowing for certain that a given variant has not been seen before based on read alignment to the graph.

In minigraph based graphs, we don’t have any small variation. This simplifies things and allows us to use legacy algorithms (sequence to sequence chaining and mapping) in the graph, but our results show that this will come at a cost in terms of accuracy for alignment to variants excluded from the graph. This also means that working with small variation in the context of minigraph references will require generalizations of VCF, and will retain all of the complexity related to that format. The effect of this will be to shift the difficulty of genotyping from alignment (as in graph genomes) back to variant calling (as in linear genomes).

The coordinate model that Heng proposes will provide a stable hierarchy, but only if the graph was constructed in the same way, by the same algorithm, or if the graph is a strict extension of a previous graph. Changes to the base reference will make the graphs incompatible. Because of this inflexibility, I do not believe that we should embed the coordinate system we use in the structure of the graph. Rather the two should be independent, allowing diverse graph structures to be built and used as needed during research even as we maintain a common coordinate space. All that is needed to provide coordinates (in variation graph terms) is a set of paths that mostly cover the graph. And that suggests how these models can work together. With minigraph, we can build a base rGFA that provides a covering coordinate space for the graph. Then, embedding these coordinate spaces as paths in our graph, we can decorate, extend, or rebuild the graph as we please, with whatever variation and genomes are useful for our analysis. This is all easy, because we’re already reading and writing the same format (GFA), although this exchange may not always be bidirectional due to algorithm limitations.

To make all this concrete, I leave you with two versions of the GRCh38 ALTs for DRB1-3123 in HLA. This gene is highly divergent in the human population, and I’ve long used it as a test during development and exploration with vg. The top graph was produced by minigraph, using a shorter (2kb) alignment threshold than default, while the bottom was produced by seqwish, with the same 2kb alignment filter. They are different in size (minigraph yields 20% more sequence in the graph) and complexity (seqwish is much more complex topologically) but suggest the same structure.

minigraph for DRB1-3123 seqwish for DRB1-3123

I look forward to working with Heng and the rest of the community to understand the best way to build and use these objects. I believe that this will require competitive but supportive and participant-oriented projects in which we not only build, but use pangenomes to drive genomic inference. Together, I am sure we will figure out how to bring these ideas to fruition and build scalable and helpful pangenomic methods. The important thing is that we learn to read and write the same data types. And, crucially, I hope that these data types are general enough to support all the things that researchers might need to do. There are exciting times ahead!

corrections

2019-07-10: I have updated this post to properly distinguish rGFA and minigraph and to resolve an error in the variation graph editing figure.

How to freebayes

Dec 8, 2015

Modern DNA sequencing would be impossible if not for high performance inference methods

As the price of doing so has decreased, sequencing genomes has become an essential part of biology. The revolutionary step in this space has been the development of short, “noisy” DNA sequence reads. Although designed conceptually in the 1990s, it was only in the early 2000s that technology allowed the implementation of these massively-multiplex sequencing approaches. The key was cheap, fast computers. These approaches would not work if algorithms to organize and clean the data were expensive to run.

The short (50-150bp) reads of DNA derived from the lowest-cost and most-popular contemporary sequencing technology must be organized (most frequently by alignment against a reference genome) so that we can then extract information about the genome(s) of the sample(s) we are examining in an experiment. The reads are relatively noisy, with error rates (1%) around an order of magnitude above the actual rate of variation we’d expect to find in a human sample (0.1%), so we cannot simply trust the information that comes out of them. We need to aggregate data across experiments and apply classifiers to the putative variations that we observe in order to reduce the error rate to something which can be used for downstream analysis. In the field the classifier that we run to do this is called a “variant caller”

So the typicaly contemporary sequencing workflow goes: sequence, align, call variants. The final result can be reduced to a matrix of genotypes over sites (and individuals, if many have been sequenced at the same time), and this is then generally usable for a wide variety of genomics tasks.

What does variant calling do?

In variant calling we start with alignments (typically in BAM format) made by an aligner like bwa:

alignments as seen by samtools tview

Then for each position in the reference we run a process that detects canditate variants and determines genotypes over a set of samples that we’ve defined in the input alignments:

Variant calling process

The result is a file (a VCF file) that lists a series of variable sites an annotations that let us interpret the results, such as the call quality, the reference and alternate alleles, and genotypes for all the samples that are mentioned in the BAM file. You can do this in one step by giving freebayes a reference and BAM file, like so:

➜  test git:(master) ✗ freebayes -f tiny/q.fa tiny/NA12878.chr22.tiny.bam | grep -v "^##"
#CHROM  POS    ID  REF         ALT     QUAL         FILTER  INFO              FORMAT  1
q       186    .   T           C       340.765      .       AC=1;AO=16;RO=25  GT:DP   0/1:41
q       1008   .   C           T       484.25       .       AC=1;AO=22;RO=17  GT:DP   0/1:39
q       1817   .   GCCC        ACCT    104.428      .       AC=1;AO=9;RO=19   GT:DP   0/1:28
q       1917   .   A           G       466.669      .       AC=1;AO=19;RO=18  GT:DP   0/1:37
q       2350   .   T           C       3.30386e-05  .       AC=1;AO=4;RO=14   GT:DP   0/1:18
q       3263   .   C           A       7.50468e-06  .       AC=1;AO=3;RO=12   GT:DP   0/1:15
q       4449   .   G           A       298.354      .       AC=1;AO=13;RO=18  GT:DP   0/1:31
q       5009   .   C           T       761.008      .       AC=1;AO=30;RO=14  GT:DP   0/1:46
q       5080   .   A           C       0.000474156  .       AC=1;AO=9;RO=27   GT:DP   0/1:37
q       5095   .   A           G       8.19916e-11  .       AC=1;AO=7;RO=27   GT:DP   0/1:34
q       5638   .   CATA        CA      270.936      .       AC=1;AO=17;RO=13  GT:DP   0/1:30
q       6418   .   G           A       326.813      .       AC=1;AO=14;RO=12  GT:DP   0/1:26
q       8846   .   T           C       329.36       .       AC=1;AO=15;RO=30  GT:DP   0/1:45

For example, here we see that on the contig named “q” we have a high quality SNP at position 1008, a low-quality one at position 2350, and an indel at position 5638.

We can think of the variant caller as a kind of compressor for genomic data. In this example, which is a standard test run in freebayes, we take 300KB of alignment data and reference sequence and reduce it to a 16KB description of the variation:

➜  test git:(master) ✗ freebayes -f tiny/q.fa tiny/NA12878.chr22.tiny.bam | wc -c
16101
➜  test git:(master) ✗ ls -sh tiny/q.fa tiny/NA12878.chr22.tiny.bam
284K tiny/NA12878.chr22.tiny.bam   16K tiny/q.fa

freebayes is a Bayesian haplotype-based variant caller

For the past five years I’ve worked on a variant caller, freebayes. The project continues the work of Gabor Marth, who wrote the first variant caller of this type, Polybayes in 1999. He had updated it several times, eventually producing BamBayes, which was written in C++ and read the alignment format developed in the early phase of the 1000 Genomes Project. When I joined his lab in 2010 he encouraged me to start working on the next version of this software, which became freebayes.

Bayesian inference methods produce probabilistic classifications that combine new information (in the form of observations) with preexisting models (our priors) that could be derived from knowledge or our expectations of how a system works. When applying Bayesian models to genomics, we model the genome or genomes we’re interested in to derive prior estimates of the distribution of genotypes and allele frequencies, and then incorporate read evidence by estimating how likely it is that a given set of reads derived from each one of our potential genotypes for each sample. The basic idea follows directly from Bayes Theorem: P(Genotype|Data) = P(Data|Genotype)P(Genotype)/P(Data). We call P(Genotype|Data) our posterior— it’s our result, and we call P(Data|Genotype) our data likelihood— it explains how likely it would be to see a given set of observations given a particular underlying genotype. P(Genotype) is our prior. We use it to link in our expectations of how much variation there is in the genome(s) we’re considering and what kind of distribution of genotypes we’d expect to see.

It’s simple enough to set up a few models that let us think about a collection of individuals which are diploid or haploid, but in nature many organisms have higher genomic copy number, or are best thought of as having an unknown copy number. Also, it’s straightforward to work with only two alleles (such as biallelic SNPs), but in many cases we find more than one allele at a given genomic position, and this only gets worse as we think about larger regions of the genome and more individuals. So as to not exclude these kinds of contexts from our inference, we had to generalize the model that Gabor had developed to handle handle arbitrary ploidy and arbitrary numbers of alleles. This extension was the genesis of freebayes.

Then, I found that local alignment artifacts caused many problems when attempting to detect indels. I realized, as others had, that the read sequence itself is a more stable thing to work with than the output of the aligner. And so I generalized freebayes to operate directly haplotypes, or small bits of sequence, rather than just abstract SNPs and indels. The result was a kind of ultra-local assembler that would throw away the alignment information before trying to figure out what alleles it was looking at. Provided that errors are weakly correlated, considering haplotypes also has a strong effect on the signal to noise ratio of the process. The number of possible errors increases exponentially with the length of the haplotype, while the number of true underlying states remains constant. By considering longer haplotypes we tend disperse sequencing noise into many erroneous haplotypes that are unlikely to occur more than once, while the real underlying alleles continue to have strong support.

direct haplotype detection

After working on freebayes for several years, I saw that many erroneous variant calls would have apparently strong support, but the reads at the genomic locus would look very strange. The supporting observations might only lie on one strand, or be aligned only to the left or right of the locus. The read quality might degrade as the sequencing process progressed, and a given allele might only be supported in the “tails” of the reads. These sources of error aren’t incorporated into the standard Bayesian variant calling model, and in order to account for them others have developed complex and costly post-processing steps which provide limited benefit to results but are now considered standard by many people working in bioinformatics. Instead of implementing a post-processing step, I decided to extend the prior model used in freebayes to include an estimate of the sequencability of the locus and its alleles, which is termed P(S) in the following illustration:

sequencability priors

This extension improves the performance of the method to be in line with what you’d get by modeling these factors post-hoc and recalibrating the quality scores. By including this directly in the model we greatly improve the usability of freebayes. In a single step, we can go from appropriately processed alignments to a highly accurate description of the genomes we’ve sequenced.

If you’re interested in the methods in freebayes, you can read the initial documentation on arXiv: Haplotype-based variant detection from short-read sequencing. I’ve been working on an update describing all of the extensions I mention here, but right now it’s only available in the paper directory in freebayes’ git repo (cd paper && make will produce a PDF).

freebayes is a free software project

I have always been inspired by free software. It is a manifestation of the power of the commons to create value for society. Although 10 or 15 years ago this may have been a point of contention, it is now impossible to ignore the importance of the commons to the public good. Every day we use free and open source software. As of February 2015, 96.6% of web servers ran Linux, and so it is extremely unlikely that you have done anything on the internet today that was not enabled by a vast network of free software.

Free software is critically important to science, where reproducibility can be impossible without the genuine artifact of code that enabled a certain analysis. Virtually all bioinformatics is supported by open source software, and run in open source operating systems. Only a few holdouts in this space use proprietary licensing or distribute closed source products.

When I started working with Gabor on freebayes, he needed little convincing to allow me to release it under the extremely permissive MIT license, which effectively releases the code into the public domain under a waiver of warranty and attribution. If I were to do it again, I might have elected to use the GPLv3, as employed by other popular bioinformatics software, but this permissive license has ensured wide use of freebayes in a huge array of contexts (for instance, a performance-optimized fork of freebayes runs inside of the Ion Torrent device). The decision to put freebayes into the commons has provided huge returns on investment to our funders. However, the small space of variant detection is crowded and has been dominated by a proprietary platform, and we were unable to obtain new funding to specifically develop freebayes. Despite the lack of funding, interest in freebayes and its use has grown continuously as more biologists apply high-throughput sequencing techniques in their work. I have attempted to stay on top of maintenance even as I have begun to work on a new paradigm in resequencing informatics, but this has been difficult without support.

I’m working with DNAnexus to maintain freebayes as an open source project!

I am very happy to announce that I have been collaborating with the research team at DNAnexus to develop new extensions to freebayes! Their support has already helped me implement basic gVCF support in freebayes, and we have a bunch of improvements queued up that should ensure that freebayes continues to be useful to the community well into the future. This post coincides with DNAnexus’ announcement about this work on their blog.

In addition to supporting me in the development of new features, they’ve asked me to continue providing technical support via public channels, such as on the freebayes issues page and mailing list, and also on this blog, where I plan to dive deeper into documentation of the many facets of the method.

Our arrangement obligates us to keep everything in the open, and I think it can serve as a model for how industry can support the public commons for the benefit of all. I’d like to say “thank you” to the team at DNAnexus, and also to all of the researchers who are using freebayes to further our understanding of the natural world! I’m looking forward to enabling some really cool research.

(By the way, I should make clear that the views on this blog are mine, and are not meant to represent those of DNAnexus.)

To begin

Nov 23, 2015

Hi

I’m Erik.

I am proud to be from the warm-hearted Commonwealth of Kentucky.

I grew up in the state’s tiny capital city, Frankfort. A childhood of exposure to government and psychology primed me to study society, which I did when I went to college. But I began to follow an unusual path. A persistent curiosity about computing led me to take a handful of computer science courses. I stated to write programs to enable my social science research. I installed Linux on my laptop and started learning everything I could about open source software. My thesis (since lost ಥل͟ಥ) focused on peer production in Wikipedia, reddit , and Linux and the ways that the pattern of communication that a technology enables affect the societies that use it.

After graduating I worked on an assortment of research projects as a contractor, programming with open source tools and writing free software.

The gig that caught my mind on fire was an idealistic but ill-fated attempt to bring the Polony sequencer to market as a the first fully open source DNA sequencing platform. After another job embedded in free software (at One Laptop Per Child), I found myself unemployed, making music and looking for a nice job. With a friend’s help I found my way back into biology, with a job focused on building computational techniques for handling sequencing data.

I still work in genomics. It has been an amazing experience to be part of front lines in the effort to observe the universe of DNA that defines life on our small planet. In these years the cost of observing DNA has fallen precipitously, and we have really extended our understanding of the biology of the world and ourselves. To a serious bio nerd, he only thing more cool than this is what happens next, as the need to make and refine microscopes wanes and we begin to focus ourselves on creating and changing things using the knowledge that genomics makes easy to come to.

I find it strange that I never really had a blog. I have been addicted to the internet since the web was unleashed on Prodigy in 1994. I spent years living with active bloggers and going to iron blogger parties they threw. I’ve run my own could host for ages. All that should matter, but until now I’ve not had much that I needed to explain that couldn’t be expressed in code and through direct communication.

This is changing. Today I live in England as a PhD student, and a big part of why I am here is that I want to focus on writing. I am also beginning to work on larger projects, and as these start to become more diffuse I find I am spending lots of time explaining ideas by text and email. Sometimes this format is the right way to share a small, clean idea. With enough care, a string of posts can become a paper or a book. I’d like to work in the open as much as possible. As time allows, I’ll try to do that here.

What will I write about? Eventually, everything! But first, I have interests in science, art, and communication which writing can help focus and I hope to explore here. More importantly, there are a pile of ideas that I want to unpack from the last half decade of effort. I’ve worked on applying Bayesian filters to detecting and genotyping genomic variation (known as variant calling). I’m also proud to have been part of the just-completed 1000 Genomes Project. I committed the last year to developing technology that uses the results of that project as a reference system in further analysis: variation graphs. With these we feed our knowledge of genomic variation back into the thing we align new reads to, which stands to make a huge array of problems in observing DNA vanish. More on that soon!