We distribute the software we develop to the wider scientific community free of charge. Below you can find links to the software developed by members of this lab and our collaborators as part of the research projects we are involved in.
Identifying cryptic haplotypes from metagenomic datasets
These are twin software tools written by Sam Nichols as part of his PhD. Hansel implements a graph-inspired data structure for determining likely chains of sequences from breadcrumbs of evidence, and Gretel implements an algorithm that uses Hansel to recover haplotypes from metagenomes. The preprint of the manuscript describing Hansel and Gretel can be found here: http://biorxiv.org/content/early/2017/03/17/117838
Links: Hansel & Gretel.
An iterative approach for large metagenome assemblies.
Spherical is an iterative approach to assembling metagenomic datasets, written by Tom Hitch as part of his PhD. Spherical has been designed to produce a more complete assembly from deeply sequenced metagenomic data. Using multiple iterations of assembly allows regions that would otherwise be missed to be assembled without reducing contig accuracy. Spherical can also produce metagenomic assemblies from a subset of the initial input file, so a metagenome can be assembled using a fraction of the RAM that would otherwise be required.
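The iterative idea can be illustrated with a toy loop. This is only a sketch, not Spherical's own code: it assumes that each round assembles a random subset of the remaining reads and then passes on only the reads that the new contigs do not account for. The function name, the `assemble` and `is_mapped` callables (stand-ins for an assembler and a read aligner) and the `fraction` parameter are all hypothetical.

```python
import random


def iterative_assembly(reads, assemble, is_mapped,
                       fraction=0.25, max_iterations=5, seed=0):
    """Toy iterative assembly loop (hypothetical helper, not Spherical's API).

    assemble(subset) -> list of contigs     (stand-in for an assembler)
    is_mapped(read, contigs) -> bool        (stand-in for a read aligner)
    """
    rng = random.Random(seed)
    remaining = list(reads)
    contigs = []
    for _ in range(max_iterations):
        if not remaining:
            break
        # Assemble only a subset of the remaining reads to limit memory use.
        sample_size = max(1, int(len(remaining) * fraction))
        subset = rng.sample(remaining, sample_size)
        new_contigs = assemble(subset)
        if not new_contigs:
            break
        contigs.extend(new_contigs)
        # Carry forward only the reads the new contigs do not explain.
        remaining = [r for r in remaining if not is_mapped(r, new_contigs)]
    return contigs, remaining
```

Because each round sees only a fraction of the reads, peak memory in this sketch is bounded by the subset size rather than by the full dataset, which is the trade-off the description above refers to.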
Metagenomic Framework for the Study of Microbial Communities
While metagenomics has been used extensively to study microbial communities from a taxonomic and functional perspective, little has been done to address how the species in a microbiome are adapted to and maintain specific roles in dynamic environments like the rumen.
To address this issue, we have developed a framework for the robust analysis of metagenomic data that includes fully automated analysis from next-generation sequencing (NGS) reads to assembly, gene prediction and taxonomic identification. Furthermore, we implement approaches to estimate SNP diversity in metagenomic samples and carry out statistical tests to identify genes where sequence diversity exists.
The framework allows easy customisation of any metagenomic workflow by providing the functions and scripts needed to manipulate data from NGS pipelines and to carry out bespoke analyses of the data. MGKit does not enforce a specific pipeline on the user, but leverages analysis patterns and common file formats to make it easier to experiment with different types of analyses.
MGKit is implemented in Python and uses libraries common in the scientific Python community, such as NumPy, SciPy, Matplotlib and pandas, along with packages used in NGS data analysis, such as HTSeq and pysam.
Links: Website; Documentation.
Software for the annotation of both novel and known single nucleotide polymorphisms (SNPs), developed by Anthony Doran and Chris Creevey. It is specifically designed for use with organisms that are either not supported by other tools or have a small number of annotated SNPs available. However, it can also be used to analyse datasets from organisms that are densely sampled for SNPs.
The method used is detailed in Doran and Creevey (2012). The source code and manuals can be downloaded here.
Construction of Supertrees and exploration of phylogenomic information from partially overlapping datasets.
The software, developed by Chris Creevey, implements methods for determining the optimal phylogenetic supertrees given a set of input source trees. All of the implemented methods allow the data to be investigated in a phylogenomic context. You can download the latest version of Clann and the manual at Clann's GitHub site.
Fast heuristic methods of detecting adaptive evolution in protein-coding genes.
You can download the latest version of Crann and manuals here.
If you use Crann, cite:
Creevey, C. and J. O. McInerney (2003). CRANN: Detecting adaptive evolution in protein-coding DNA sequences. Bioinformatics 19: 1726.
Creevey, C. and J. O. McInerney (2002). An algorithm for detecting directional and non-directional positive selection, neutrality and negative selection in protein coding DNA sequences. Gene 300: 43-51.
Automated Quality improvement for multiple sequence alignments
Chris Creevey co-developed this protocol and software with Jean Muller while working at the Bork Group in EMBL. The protocol carries out the automatic identification of the most reliable multiple sequence alignment for a given protein family. The implementation relies on two alignment programs (MUSCLE and MAFFT), one refinement program (RASCAL) and one assessment program (NORMD), but other programs could be incorporated at any of the three steps.
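The three steps can be sketched as a small selection routine. This is an illustration only, not AQUA's implementation: the function and parameter names are hypothetical, the callables stand in for external programs (for example MUSCLE or MAFFT for alignment, RASCAL for refinement and NORMD for assessment), and it assumes that a higher assessment score means a more reliable alignment.

```python
def pick_most_reliable_alignment(sequences, aligners, refine, score):
    """Return (label, alignment, score) for the best-scoring candidate.

    aligners: dict mapping a label to a callable sequences -> alignment
    refine:   callable alignment -> refined alignment
    score:    callable alignment -> number (assumed: higher is better)
    """
    candidates = {}
    for name, align in aligners.items():
        alignment = align(sequences)                        # step 1: align
        candidates[name] = alignment
        candidates[name + "+refined"] = refine(alignment)   # step 2: refine
    # Step 3: assess every candidate and keep the highest-scoring one.
    scored = [(score(aln), label, aln) for label, aln in candidates.items()]
    best_score, best_label, best_alignment = max(scored, key=lambda item: item[0])
    return best_label, best_alignment, best_score
```

Passing the programs in as callables mirrors the point made above that other tools could be substituted at any of the three steps.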
More details about the method here.
If you use AQUA, cite:
Muller J., Creevey C.J., Thompson J.D., Arendt D., Bork P. 2010. Aqua: Automated Quality Improvement for Multiple Sequence Alignments. Bioinformatics 26:263-265.
Identify unstable taxa in phylogenies
This method and its implementation were developed by Chris Creevey in collaboration with Karen Siu-Ting, Mark Wilkinson and Davide Pisani. The method is a heuristic extension to Safe Taxonomic Reduction for identifying unstable taxa in phylogenies. It uses a compatibility approach to test whether or not taxa can be equivalent in their character information. The implementation also uses Cytoscape to visualise taxonomic equivalents in a network.
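A compatibility test of this kind can be sketched as follows. This is a simplified illustration, not the Concatabominations code: it assumes that each taxon is a string of character states, that `?` marks missing data, and that two taxa count as potentially equivalent when they agree at every character scored in both. All names are hypothetical.

```python
from itertools import combinations

MISSING = "?"  # assumed symbol for missing character data


def compatible(a, b):
    """True if two taxa never conflict at a character scored in both."""
    return all(x == y for x, y in zip(a, b) if MISSING not in (x, y))


def potential_equivalents(matrix):
    """Return pairs of taxon names whose character rows are compatible.

    matrix: dict mapping taxon name -> string of character states
    """
    return [
        (s, t)
        for s, t in combinations(sorted(matrix), 2)
        if compatible(matrix[s], matrix[t])
    ]
```

The resulting list of pairs is effectively an edge list, which is the kind of input a network viewer needs to draw taxonomic equivalents as a graph.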
If you use Concatabominations, cite:
Siu-Ting K., Pisani D., Creevey C., Wilkinson M. 2014. Concatabominations: Identifying Unstable Taxa in Morphological Phylogenetics Using a Heuristic Extension to Safe Taxonomic Reduction. Systematic Biology (accepted).
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e. derived from the original COG/KOG categories).
eggNOG's database currently counts 1.7 million orthologous groups in 3686 species, covering over 7.7 million proteins (built from 9.6 million proteins).
Access the publication here.