Research Software

GitHub: github.com/bjpop

Varlap

Summary
Varlap is a quality control tool for genetic variants arising from high throughput DNA sequencing, where the variants have been called by aligning DNA sequencing reads to a reference genome.
Repository
https://github.com/bjpop/varlap
Description

Varlap is primarily a quality control tool for genetic variants arising from high throughput DNA sequencing, where the variants have been called by aligning DNA sequencing reads to a reference genome. It takes as input a set of DNA variants and one or more BAM files. Varlap considers the genomic locus of each variant in each of the supplied BAM files and records information about the corresponding alignment context at that locus. For example, one of the metrics it calculates is the average edit distance of reads overlapping the variant locus. This can be a useful metric because regions with significantly higher average edit distance are more likely to contain erroneous variant calls. Varlap outputs a CSV file containing one row per input variant, with columns recording the various computed metrics about that variant. Subsequent analysis of this output (such as outlier detection) can be used to identify potentially problematic variants and samples.
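
As a rough illustration of the average edit distance metric, the sketch below computes it over the reads that overlap a single locus. It is not Varlap's implementation: the Read tuple, its fields and the mean_edit_distance function are hypothetical names, and the per-read edit distance is assumed to be already known (for example, taken from the NM tag of a BAM record).

```python
from typing import NamedTuple, Optional, Sequence


class Read(NamedTuple):
    # Hypothetical, simplified stand-in for an aligned read.
    chrom: str
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive
    edit_distance: int  # assumed known, e.g. taken from the NM tag


def mean_edit_distance(
    reads: Sequence[Read], chrom: str, pos: int
) -> Optional[float]:
    """Average edit distance of the reads overlapping a single-base locus.

    Returns None when no read overlaps the locus.
    """
    overlapping = [
        r for r in reads if r.chrom == chrom and r.start <= pos < r.end
    ]
    if not overlapping:
        return None
    return sum(r.edit_distance for r in overlapping) / len(overlapping)


reads = [
    Read("chr1", 100, 200, 1),
    Read("chr1", 150, 250, 3),
    Read("chr1", 300, 400, 9),  # does not overlap position 160
]
print(mean_edit_distance(reads, "chr1", 160))
```

A locus whose value is far above that of most other loci would be a candidate for closer inspection in the downstream outlier analysis.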

Common use cases are to consider somatic variants in the context of tumour and normal alignments, or germline variants against normal alignments. However, Varlap is flexible and accepts any number of BAM files as input.

Gurita

Summary
Gurita is a command line program for plotting, transforming and analysing tabular data (CSV, TSV files).
Repository
https://github.com/bjpop/gurita
Description

At its core Gurita provides a suite of commands, each of which carries out a common data analytics or plotting task. Additionally, Gurita allows commands to be chained together into flexible analysis pipelines.

It is designed to be fast and convenient, and is particularly suited to data exploration tasks. Input files with large numbers of rows (millions or more) are readily supported.

Gurita commands are highly customisable, but sensible defaults are applied, so simple tasks are easy to express and complex tasks remain possible.

Bionitio

Summary
Bionitio provides a template for command line bioinformatics tools in various programming languages, and automates the creation of new software repositories following best practices.
Repository
https://github.com/bionitio-team/bionitio
Description

The purpose of Bionitio is to provide an easy-to-understand working example that is built on best-practice software engineering principles. It can be used as a basis for learning and as a solid foundation for starting new projects. We provide a script called bionitio-boot.sh for starting new projects from Bionitio, which saves time and ensures good programming practices are adopted from the beginning.

Publication
Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software, GigaScience.

UNDR ROVER

Summary
A fast and accurate variant caller for targeted DNA sequencing.
Repository
https://github.com/bjpop/undr_rover
Description

UNDR ROVER is an improved version of our ROVER variant calling tool for targeted DNA sequencing. It enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. It calls the same variants as the ROVER tool but with a significantly shorter runtime. It achieves its higher performance by avoiding read alignment before variant calling, and can be applied directly to input FASTQ files.

Publication
UNDR ROVER - a fast and accurate variant caller for targeted DNA sequencing, BMC Bioinformatics.

HiTIME

Summary
High-resolution Twin-Ion Metabolic Extraction.
Repository
https://github.com/bjpop/HiTIME
Description

HiTIME is a software tool for detecting twin ion signals in high resolution liquid chromatography mass spectrometry (LCMS) data.

This is a collaboration with Andrew Isaac, Michael Leeming, Richard O'Hair and William Alexander Donald.

Publication

SRST2

Summary
Short Read Sequence Typing for Bacterial Pathogens.
Repository
http://katholt.github.io/srst2/
Description

SRST2 takes Illumina sequence data, an MLST (Multi-Locus Sequence Type) database and/or a database of gene sequences (e.g. resistance genes, virulence genes, etc.) and reports the presence of STs and/or reference genes.

This is a collaboration with Kat Holt, Mike Inouye and others.

Publication
SRST2: Rapid genomic surveillance for public health and hospital microbiology labs, Genome Medicine.

Methpat

Summary
A program for summarising and visualising CpG methylation patterns.
Repository
https://github.com/bjpop/methpat
Description

Methpat summarises the DNA methylation pattern data produced by the Bismark methylation extractor. For each amplicon, the DNA methylation positions, the methylation patterns observed and their abundance counts are summarised into a tab-delimited text file amenable to downstream statistical analysis and visualisation.
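
The kind of per-amplicon pattern counting described above can be sketched as follows. This is an illustration under stated assumptions, not Methpat's code: the pattern encoding ('1' for a methylated CpG site, '0' for an unmethylated one), the column names and the function names are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def summarise_patterns(
    observations: Iterable[Tuple[str, str]],
) -> Dict[str, Counter]:
    """Count how often each methylation pattern occurs within each amplicon.

    Each observation is (amplicon_name, pattern). The pattern holds one
    character per CpG site: '1' methylated, '0' unmethylated (an encoding
    assumed for this example only).
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for amplicon, pattern in observations:
        counts[amplicon][pattern] += 1
    return counts


def to_tsv(counts: Dict[str, Counter]) -> str:
    """Render the counts as tab-delimited text, most abundant pattern first."""
    lines: List[str] = ["amplicon\tpattern\tcount"]
    for amplicon in sorted(counts):
        ordered = sorted(
            counts[amplicon].items(), key=lambda kv: (-kv[1], kv[0])
        )
        for pattern, n in ordered:
            lines.append(f"{amplicon}\t{pattern}\t{n}")
    return "\n".join(lines)


observations = [
    ("ampA", "110"), ("ampA", "110"), ("ampA", "000"),
    ("ampB", "01"),
]
print(to_tsv(summarise_patterns(observations)))
```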

Publication
MethPat: a tool for the analysis and visualisation of complex methylation patterns obtained by massively parallel sequencing, BMC Bioinformatics.

Annokey

Summary
Gene-based search for key-terms in the NCBI gene database and associated PubMed abstracts.
Repository
http://bjpop.github.io/annokey/
Description

Annokey is a command line tool for annotating gene lists with the results of a key-term search of the NCBI Gene database and linked PubMed article abstracts. Its purpose is to help users prioritise genes by relevance to a domain of interest, such as "breast cancer" or "DNA repair". The user steers the search by specifying a ranked list of keywords and terms that are likely to be highly correlated with their domain of interest.
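
One way to picture a search steered by ranked key terms is the sketch below. It is not Annokey's algorithm or interface: the function names are hypothetical, the gene names and descriptions are invented toy text, and matching is a plain case-insensitive substring test chosen for simplicity.

```python
from typing import Dict, List, Optional, Tuple

Hit = Optional[Tuple[int, str]]  # (rank, term), or None when nothing matched


def best_keyword_hit(text: str, ranked_terms: List[str]) -> Hit:
    """Return (rank, term) for the highest-ranked term found in the text.

    Rank 1 is the most important term.
    """
    lowered = text.lower()
    for rank, term in enumerate(ranked_terms, start=1):
        if term.lower() in lowered:
            return rank, term
    return None


def prioritise(
    gene_text: Dict[str, str], ranked_terms: List[str]
) -> List[Tuple[str, Hit]]:
    """Order genes by their best keyword rank; genes with no hit come last."""
    annotated = [
        (gene, best_keyword_hit(text, ranked_terms))
        for gene, text in gene_text.items()
    ]
    return sorted(
        annotated,
        key=lambda item: (
            item[1] is None,
            item[1][0] if item[1] else 0,
            item[0],
        ),
    )


# Invented toy descriptions, for illustration only.
gene_text = {
    "GENE_A": "Structural protein of the cytoskeleton.",
    "GENE_B": "Participates in DNA repair pathways.",
    "GENE_C": "Reported in breast cancer and DNA repair studies.",
}
ranked_terms = ["breast cancer", "DNA repair"]

for gene, hit in prioritise(gene_text, ranked_terms):
    print(gene, hit)
```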

Publication
Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database, Source Code for Biology and Medicine.

ROVER

Summary
Read-pair overlap considerate variant-calling software for PCR-based massively parallel sequencing datasets.
Repository
https://github.com/bjpop/rover
Description

ROVER-PCR Variant Caller enables users to quickly and accurately identify genetic variants from PCR-targeted, overlapping paired-end MPS datasets. The open-source availability of the software and its tailorable thresholds enable broad access for a range of PCR-MPS users.
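
The idea of taking read-pair overlap into account, combined with adjustable thresholds, might be sketched as below. This is a simplified reading of the description rather than ROVER's actual code: it assumes a variant is counted only when both reads of a pair report it, that all pairs cover the same amplicon, and the two threshold parameters and their default values are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Variant = Tuple[int, str, str]  # (position, reference allele, alternate allele)


def call_variants(
    read_pairs: List[Tuple[Set[Variant], Set[Variant]]],
    min_pairs: int = 2,
    min_proportion: float = 0.05,
) -> List[Variant]:
    """Keep variants that both reads of a pair agree on, often enough.

    min_pairs and min_proportion are hypothetical thresholds on the number
    and the fraction of read pairs supporting a variant.
    """
    support: Counter = Counter()
    for read1_variants, read2_variants in read_pairs:
        # Count only variants reported by both reads of the pair.
        for variant in read1_variants & read2_variants:
            support[variant] += 1
    total_pairs = len(read_pairs)
    return sorted(
        variant
        for variant, n in support.items()
        if n >= min_pairs and n / total_pairs >= min_proportion
    )


snv = (101, "A", "G")
noise = (150, "C", "T")  # seen in one read of one pair only
pairs = [
    ({snv}, {snv}),
    ({snv, noise}, {snv}),
    (set(), set()),
]
print(call_variants(pairs))
```

A variant seen in only one read of a pair, like the second one in the example, never gains support and is treated as a likely sequencing or PCR artefact.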

Publication
ROVER variant caller: read-pair overlap considerate variant-calling software applied to PCR-based massively parallel sequencing datasets, Source Code for Biology and Medicine.

Blip

Summary
A bytecode compiler for Python 3.
Repository
https://github.com/bjpop/blip
Description

Blip compiles Python 3 source files to bytecode. The output bytecode is compatible with the CPython interpreter.
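
For readers unfamiliar with Python bytecode, the sketch below uses CPython's own built-in compile function and the standard dis module to show what a compiled code object looks like. It does not use Blip; it only illustrates the kind of output that a bytecode compiler for Python has to produce.

```python
import dis

source = "x = 1 + 2\nprint(x)\n"

# compile() is CPython's built-in compiler. It returns a code object whose
# co_code attribute holds the raw bytecode as a bytes object.
code = compile(source, "<example>", "exec")
print(type(code.co_code))

# Human-readable listing of the instructions; the exact opcodes vary
# between CPython versions.
dis.dis(code)

# The interpreter can execute the code object directly.
namespace = {}
exec(code, namespace)
```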

FAVR

Summary
Filtering and Annotation of Variants that are Rare.
Repository
https://github.com/bjpop/favr
Description

Characterizing genetic diversity through the analysis of massively parallel sequence (MPS) data offers enormous potential in terms of our understanding of predisposition to complex human disease. Great challenges remain, however, regarding our ability to resolve those genetic variants that are genuinely associated with disease from the millions of "bystanders" and artefactual signals. FAVR is designed to assist in the resolution of some of these issues in the context of rare germline variants by facilitating "platform-steered" artefact filtering.

Publication
FAVR (Filtering and Annotation of Variants that are Rare): methods to facilitate the analysis of rare germline genetic variants from massively parallel sequencing datasets, BMC Bioinformatics.

berp

Summary
A compiler and interpreter for Python 3.
Repository
http://github.com/bjpop/berp
Description

Berp is an implementation of Python 3. At its heart is a translator, which takes Python code as input and generates Haskell code as output. The Haskell code is fed into a Haskell compiler (GHC) for compilation to machine code or interpretation as byte code.

Berp provides both a compiler and an interactive interpreter. For the most part it can be used in the same way as CPython (the main Python implementation).

haskell-mpi

Summary
A Haskell interface to the MPI distributed parallel library.
Repository
http://hackage.haskell.org/package/haskell-mpi
Description

MPI is defined by the Message-Passing Interface Standard, as specified by the Message Passing Interface Forum. These Haskell bindings target the MPI-2 release of the standard and are designed to work with any standards-compliant implementation of MPI-2.

Publication
High Performance Haskell with MPI, The Monad Reader.

language-python

Summary
A lexer, parser and pretty printer for Python programs, written in Haskell.
Repository
http://hackage.haskell.org/package/language-python
Description

This package provides a lexer and parser for Python, written in Haskell. It supports versions 2 and 3 of Python. The parser is implemented using the happy parser generator and the alex lexer generator. The package also provides a pretty printer, which makes it suitable for generating Python code as well.
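
The value of pairing a parser with a pretty printer can be illustrated by analogy with Python's own standard library ast module (ast.unparse requires Python 3.9 or later). This is not the Haskell package's API; it only shows the same parse, transform and print workflow that such a combination makes possible.

```python
import ast

source = "def  add(a,b):\n    return a+b\n"

# Parse the source text into an abstract syntax tree.
tree = ast.parse(source)

# Transform the tree: rename the function.
tree.body[0].name = "plus"

# Pretty print the modified tree back to source code.
print(ast.unparse(tree))
```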

ministg

Summary
An interpreter for the small-step operational semantics of the STG machine.
Repository
http://www.haskell.org/haskellwiki/Ministg
Description

Ministg is an interpreter for a high-level, small-step, operational semantics for the STG machine. The STG machine is the abstract machine at the core of GHC. The operational semantics used in Ministg is taken from the paper "Making a fast curry: push/enter versus eval/apply for higher-order languages" by Simon Marlow and Simon Peyton Jones. Ministg implements both sets of evaluation rules from the paper.

Publication
Ministg, Haskell Wiki.