Bernie Pope: Software

Varlap

Summary

Varlap is a quality control tool for genetic variants arising from high throughput DNA sequencing, where the variants have been called by aligning DNA sequencing reads to a reference genome.

Repository

https://github.com/bjpop/varlap

Description

Varlap is primarily a quality control tool for genetic variants arising from high throughput DNA sequencing, where the variants have been called by aligning DNA sequencing reads to a reference genome. It takes as input a set of DNA variants and one or more BAM files. Varlap considers the genomic locus of each variant in each of the supplied BAM files and records information about the corresponding alignment context at that locus. For example, one of the metrics it calculates is the average edit distance of reads overlapping the variant locus. This can be a useful metric because regions with significantly higher average edit distance are more likely to contain erroneous variant calls. Varlap outputs a CSV file containing one row per input variant, with columns recording the various computed metrics about that variant. Subsequent analysis of this output (such as outlier detection) can be used to identify potentially problematic variants and samples.

Common use cases are to consider somatic variants in the context of tumour and normal alignments, or germline variants against normal alignments. However, Varlap is quite flexible and allows the use of any number of BAM files as input.
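The average edit distance metric mentioned above can be illustrated with a minimal sketch. This is not Varlap's implementation: the Read type and the mean_edit_distance function are hypothetical names introduced here, and the reads are hand-written records rather than alignments loaded from a BAM file. The edit distance field stands in for the value a real alignment would carry in its SAM NM tag.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class Read:
    """A simplified aligned read (hypothetical type, not part of Varlap)."""
    chrom: str
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive
    edit_distance: int  # stands in for the SAM NM tag of a real alignment


def mean_edit_distance(reads: Iterable[Read], chrom: str, pos: int) -> Optional[float]:
    """Mean edit distance of reads overlapping a 0-based position.

    Returns None when no read overlaps the position.
    """
    overlapping = [
        r.edit_distance
        for r in reads
        if r.chrom == chrom and r.start <= pos < r.end
    ]
    return mean(overlapping) if overlapping else None


reads = [
    Read("chr1", 100, 200, 1),
    Read("chr1", 150, 250, 3),
    Read("chr1", 300, 400, 9),  # does not overlap position 160
    Read("chr2", 100, 200, 7),  # different chromosome
]

# Only the first two reads overlap chr1:160, so only they contribute.
print(mean_edit_distance(reads, "chr1", 160))
```

A locus whose value is much higher than that of other variants in the same sample would be a candidate for closer inspection in the downstream outlier analysis.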

Gurita

Summary

Gurita is a command line program for plotting, transforming and analysing tabular data (CSV, TSV files).

Repository

https://github.com/bjpop/gurita

Description

At its core Gurita provides a suite of commands, each of which carries out a common data analytics or plotting task. Additionally, Gurita allows commands to be chained together into flexible analysis pipelines.

It is designed to be fast and convenient, and is particularly suited to data exploration tasks. Input files with large numbers of rows (millions or more) are readily supported.

Gurita commands are highly customisable, but sensible defaults are applied. Therefore simple tasks are easy to express and complex tasks are possible.

Bionitio

Summary: Bionitio provides a template for command line bioinformatics tools in various programming languages, and automates the creation of new software repositories following best practices.
Repository: https://github.com/bionitio-team/bionitio
Description: The purpose of Bionitio is to provide an easy-to-understand working example that is built on best-practice software engineering principles. It can be used as a basis for learning and as a solid foundation for starting new projects. We provide a script called bionitio-boot.sh for starting new projects from Bionitio, which saves time and ensures good programming practices are adopted from the beginning.
Publication: Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software, GigaScience.

UNDR ROVER

Summary: A fast and accurate variant caller for targeted DNA sequencing.
Repository: https://github.com/bjpop/undr_rover
Description: UNDR ROVER is an improved version of our ROVER variant calling tool for targeted DNA sequencing. It enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. It calls the same variants as the ROVER tool but at a significantly reduced runtime. It achieves its higher performance by avoiding read alignment before variant calling, and can be applied directly to input FASTQ files.
Publication: UNDR ROVER - a fast and accurate variant caller for targeted DNA sequencing, BMC Bioinformatics.

HiTIME

Summary

High-resolution Twin-Ion Metabolic Extraction.

Repository

https://github.com/bjpop/HiTIME

Description

HiTIME is a software tool for detecting twin ion signals in high resolution liquid chromatography mass spectrometry (LCMS) data.

This is a collaboration with Andrew Isaac, Michael Leeming, Richard O'Hair and William Alexander Donald.

Publication

SRST2

Summary

Short Read Sequence Typing for Bacterial Pathogens.

Repository

http://katholt.github.io/srst2/

Description

SRST2 takes Illumina sequence data, an MLST (Multi-Locus Sequence Type) database and/or a database of gene sequences (e.g. resistance genes, virulence genes, etc.) and reports the presence of STs and/or reference genes.

This is a collaboration with Kat Holt, Mike Inouye and others.

Publication

SRST2: Rapid genomic surveillance for public health and hospital microbiology labs, Genome Medicine.

Methpat

Summary: A program for summarising and visualising CpG methylation patterns.
Repository: https://github.com/bjpop/methpat
Description: Methpat summarises DNA methylation pattern data from the output of the Bismark methylation extractor. The DNA methylation positions for each amplicon, the methylation patterns observed within each amplicon, and their abundance counts are summarised in a tab-delimited text file suitable for downstream statistical analysis and visualisation.
Publication: MethPat: a tool for the analysis and visualisation of complex methylation patterns obtained by massively parallel sequencing, BMC Bioinformatics.
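The kind of per-amplicon summary described above can be sketched as follows. This is an illustration only, not Methpat's code or output format: the function names, the column headings, and the encoding of a pattern as a string with one character per CpG site ('1' methylated, '0' unmethylated) are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarise_patterns(records):
    """Count how often each methylation pattern occurs within each amplicon.

    records: iterable of (amplicon, pattern) pairs. The pattern encoding
    ('1' = methylated, '0' = unmethylated) is a hypothetical choice.
    """
    counts = defaultdict(Counter)
    for amplicon, pattern in records:
        counts[amplicon][pattern] += 1
    return counts


def to_tsv(counts):
    """Render the counts as tab-delimited text, one row per (amplicon, pattern)."""
    lines = ["amplicon\tpattern\tcount"]
    for amplicon in sorted(counts):
        # most_common() lists the most abundant patterns first.
        for pattern, n in counts[amplicon].most_common():
            lines.append(f"{amplicon}\t{pattern}\t{n}")
    return "\n".join(lines)


records = [
    ("amp1", "1100"),
    ("amp1", "1100"),
    ("amp1", "0000"),
    ("amp2", "111"),
]

counts = summarise_patterns(records)
print(to_tsv(counts))
```

The tab-delimited layout keeps one observation per row, which is convenient for loading into statistical software.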

Annokey

Summary: Gene-based search for key-terms in the NCBI gene database and associated PubMed abstracts.
Repository: http://bjpop.github.io/annokey/
Description: Annokey is a command line tool for annotating gene lists with the results of a key-term search of the NCBI Gene database and linked PubMed article abstracts. Its purpose is to help users prioritise genes by relevance to a domain of interest, such as "breast cancer" or "DNA repair". The user steers the search by specifying a ranked list of keywords and terms that are likely to be highly correlated with their domain of interest.
Publication: Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database, Source Code for Biology and Medicine.

ROVER

Summary: Read-pair overlap considerate variant-calling software for PCR-based massively parallel sequencing datasets.
Repository: https://github.com/bjpop/rover
Description: ROVER-PCR Variant Caller enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. The open-source availability of the software and its tailorable thresholds enable broad access for a range of PCR-MPS users.
Publication: ROVER variant caller: read-pair overlap considerate variant-calling software applied to PCR-based massively parallel sequencing datasets, Source Code for Biology and Medicine.

Blip

Summary: A bytecode compiler for Python 3.
Repository: https://github.com/bjpop/blip
Description: Blip compiles Python 3 source files to bytecode. The output bytecode is compatible with the CPython interpreter.
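To illustrate what compiling Python source to bytecode means, the sketch below uses CPython's own built-in compile function and the standard dis module. It does not use Blip and says nothing about how Blip works internally; the source string and the add function are made up for the example.

```python
import dis

# A small Python 3 source fragment (invented for this example).
source = "def add(a, b):\n    return a + b\n"

# CPython's built-in compiler turns the source into a code object
# whose co_code attribute holds the raw bytecode.
code = compile(source, "<example>", "exec")

# Executing the code object defines add() in the given namespace.
namespace = {}
exec(code, namespace)
result = namespace["add"](2, 3)

# Show a human-readable listing of the bytecode instructions.
dis.dis(code)
```

The exact instructions printed by dis vary between CPython versions, since the bytecode format is specific to each release of the interpreter.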

FAVR

Summary: Filtering and Annotation of Variants that are Rare.
Repository: https://github.com/bjpop/favr
Description: Characterizing genetic diversity through the analysis of massively parallel sequence (MPS) data offers enormous potential in terms of our understanding of predisposition to complex human disease. Great challenges remain, however, regarding our ability to resolve those genetic variants that are genuinely associated with disease from the millions of "bystanders" and artefactual signals. FAVR is designed to assist in the resolution of some of these issues in the context of rare germline variants by facilitating "platform-steered" artefact filtering.
Publication: FAVR (Filtering and Annotation of Variants that are Rare): methods to facilitate the analysis of rare germline genetic variants from massively parallel sequencing datasets, BMC Bioinformatics.

berp

Summary

A compiler and interpreter for Python 3.

Repository

http://github.com/bjpop/berp

Description

Berp is an implementation of Python 3. At its heart is a translator, which takes Python code as input and generates Haskell code as output. The Haskell code is fed into a Haskell compiler (GHC) for compilation to machine code or interpretation as byte code.

Berp provides both a compiler and an interactive interpreter. For the most part it can be used in the same way as CPython (the main Python implementation).

haskell-mpi

Summary: A Haskell interface to the MPI distributed parallel library.
Repository: http://hackage.haskell.org/package/haskell-mpi
Description: MPI is defined by the Message-Passing Interface Standard, as specified by the Message Passing Interface Forum. These Haskell bindings were written against MPI-2, the latest release of the standard at the time, and are designed to work with any standards-compliant implementation of MPI-2.
Publication: High Performance Haskell with MPI, The Monad Reader.

language-python

Summary: A lexer, parser and pretty printer for Python programs, written in Haskell.
Repository: http://hackage.haskell.org/package/language-python
Description: This package provides a parser (and lexer) for Python written in Haskell. It supports versions 2 and 3 of Python. The parser is implemented using the happy parser generator and the alex lexer generator. The package also provides a pretty printer, which also makes it suitable for generating Python code.

ministg

Summary: An interpreter for the small-step operational semantics of the STG machine.
Repository: http://www.haskell.org/haskellwiki/Ministg
Description: Ministg is an interpreter for a high-level, small-step, operational semantics for the STG machine. The STG machine is the abstract machine at the core of GHC. The operational semantics used in Ministg is taken from the paper "Making a fast curry: push/enter versus eval/apply for higher-order languages" by Simon Marlow and Simon Peyton Jones. Ministg implements both sets of evaluation rules from the paper.
Publication: Ministg, Haskell Wiki
