The Creevey Lab: Software and Tools

We develop bioinformatics software to help better understand microbiomes and evolution. Below are links to the software developed by members of this lab and our collaborators as part of our research projects.

AMPLY

A computational tool for identifying novel Antimicrobial Peptides (AMPs).
AMPLY is a bioinformatics pipeline designed to take any form of digital biological data and retrieve novel antimicrobial peptides (AMPs) for synthesis and screening against multi-drug resistant (MDR) strains of bacteria and fungi. Developed by Ben Thomas, this work was funded by Life Science Research Network Wales.
Links: Amply Website.

Clan_Check

Check trees for compatibility with defined groupings in unrooted trees - "The incontestable clan test"
Clan_Check analyses single-copy phylogenetic trees to assess if they violate clans defined by the user. This is designed for large-scale phylogenomic analyses where the user may have thousands of phylogenetic trees. This tool can help enrich the data for orthologs, by identifying where paralogy has caused violation of "well known" clans in outgroups.
Published in: Siu-Ting, Karen, et al. "Inadvertent paralog inclusion drives artefactual topologies and timetree estimates in phylogenomics." Molecular biology and evolution (2019).
Links: Clan_Check Website.
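
The idea of a clan test can be illustrated with a minimal sketch. This is not Clan_Check's implementation: it assumes a tree given as nested Python tuples of taxon names (a hypothetical representation chosen for brevity) and treats a clan as respected when its taxa form one side of a split in the unrooted tree. Taxa missing from a tree are ignored, which is also an assumption.

```python
def leaf_sets(tree):
    """Return (all leaves, list of leaf sets below each branch) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    clusters = []
    for child in tree:
        child_leaves, child_clusters = leaf_sets(child)
        leaves |= child_leaves
        clusters.extend(child_clusters)
        clusters.append(child_leaves)  # the branch above this child defines a split
    return leaves, clusters


def respects_clan(tree, clan):
    """True if the clan's taxa present in the tree form one side of some split."""
    all_leaves, clusters = leaf_sets(tree)
    present = frozenset(clan) & all_leaves
    # Fewer than two taxa, or every taxon in the tree, cannot be violated.
    if len(present) < 2 or present == all_leaves:
        return True
    return any(present == c or present == all_leaves - c for c in clusters)


# Hypothetical unrooted tree written with an arbitrary trifurcating root.
tree = (("A", "B"), ("C", "D"), "E")
print(respects_clan(tree, {"A", "B"}))  # True: A and B sit together on one side of a branch
print(respects_clan(tree, {"A", "C"}))  # False: no branch separates A and C from the rest
```

Applied over thousands of gene trees, a check of this kind would flag the trees in which a "well known" grouping is broken, which is where paralogy is a likely explanation.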

CowPI

A rumen microbiome focussed version of the PICRUSt functional inference software
Built using 16S rDNA profiles from the Global Rumen Census and almost 500 fully sequenced microbial genomes from the Hungate 1000 project, CowPI is a rumen-focused version of the PICRUSt tool for functional inference from 16S metataxonomic data.
CowPI is published in: Wilkinson, Toby J., et al. "CowPI: a rumen microbiome focussed version of the PICRUSt functional inference software." Frontiers in microbiology 9 (2018).
Links: CowPI website

Hansel & Gretel

Identifying cryptic haplotypes from metagenomic datasets
These are twin software tools written by Sam Nicholls as part of his PhD. Hansel implements a graph-inspired data structure for determining likely chains of sequences from breadcrumbs of evidence, and Gretel implements an algorithm that uses Hansel to recover haplotypes from metagenomes. The preprint of the manuscript describing Hansel and Gretel can be found here: http://biorxiv.org/content/early/2017/03/17/117838
Links: Hansel & Gretel.
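
As a toy illustration of following "breadcrumbs of evidence", the sketch below counts how often alleles at neighbouring variant sites are seen together on the same read, then greedily follows the best-supported chain. It is a deliberately simplified stand-in, not the Hansel data structure or the Gretel algorithm; the read representation (a dict from site index to allele) and both function names are hypothetical.

```python
from collections import Counter


def count_pairs(reads):
    """Count co-occurrences of alleles at adjacent sites (i, i + 1) on the same read."""
    pairs = Counter()
    for read in reads:  # each read: {site_index: allele}
        for site, allele in read.items():
            next_allele = read.get(site + 1)
            if next_allele is not None:
                pairs[(site, allele, next_allele)] += 1
    return pairs


def greedy_chain(pairs, start_allele, n_sites):
    """Extend a chain site by site, always taking the best-supported next allele."""
    chain = [start_allele]
    for site in range(n_sites - 1):
        candidates = {b: n for (s, a, b), n in pairs.items()
                      if s == site and a == chain[-1]}
        if not candidates:
            break  # no read links this allele to the next site
        chain.append(max(candidates, key=candidates.get))
    return "".join(chain)


# Hypothetical reads covering three variant sites.
reads = [
    {0: "A", 1: "C"},
    {0: "A", 1: "C"},
    {0: "A", 1: "T"},
    {1: "C", 2: "G"},
    {1: "T", 2: "A"},
]
print(greedy_chain(count_pairs(reads), "A", 3))  # ACG
```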

Spherical

An iterative approach for large metagenome assemblies.
Spherical is an iterative approach to assembling metagenomic datasets written by Tom Hitch as part of his PhD. Spherical has been designed to produce a more complete assembly from deeply sequenced metagenomic data. Using multiple iterations of assembly allows regions that would otherwise be missed to be assembled without a reduction in contig accuracy. Spherical can also produce metagenomic assemblies from a subset of the initial input file, so a metagenome can be assembled using a fraction of the RAM that would otherwise be required.
Spherical is published in: Hitch, Thomas CA, and Christopher J. Creevey. "Spherical: an iterative workflow for assembling metagenomic datasets." BMC bioinformatics 19.1 (2018): 20.
Links: Website.
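
The general shape of an iterative, subset-based assembly can be sketched as a loop. This is an illustration under stated assumptions, not Spherical's code: it assumes each round assembles a subset of the remaining reads and passes on only the reads that the new contigs do not explain. The `assemble` and `is_explained` callables, and the toy stand-ins used for them, are hypothetical placeholders for a real assembler and read mapper.

```python
def iterative_assembly(reads, assemble, is_explained, subset_size, max_rounds=5):
    """Assemble in rounds; each round only sees reads not yet explained by contigs."""
    contigs = []
    remaining = list(reads)
    for _ in range(max_rounds):
        if not remaining:
            break
        subset = remaining[:subset_size]  # only this subset is held by the assembler
        new_contigs = assemble(subset)
        if not new_contigs:
            break
        contigs.extend(new_contigs)
        # Keep only reads that none of the new contigs account for.
        remaining = [r for r in remaining
                     if not any(is_explained(r, c) for c in new_contigs)]
    return contigs, remaining


# Toy stand-ins: the "assembler" returns the longest read as a contig,
# and a read is "explained" if it is a substring of a contig.
def toy_assemble(subset):
    return [max(subset, key=len)]


def toy_is_explained(read, contig):
    return read in contig


reads = ["ACGT", "CG", "TTTT", "TT", "GGA"]
contigs, leftover = iterative_assembly(reads, toy_assemble, toy_is_explained, subset_size=2)
print(contigs)   # ['ACGT', 'TTTT', 'GGA']
print(leftover)  # []
```

Because each round works on at most `subset_size` reads, peak memory in the assembly step is bounded by the subset rather than by the full dataset.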

MGKit

Metagenomic Framework for the Study of Microbial Communities
While metagenomics has been used extensively to study microbial communities from a taxonomic and functional perspective, little has been done to address how the species in a microbiome are adapted to and maintain specific roles in dynamic environments like the rumen.
To address this issue, we have developed a framework for the robust analysis of metagenomic data that includes fully automated analysis from next-generation sequencing (NGS) reads to assembly, gene prediction and taxonomic identification. Furthermore, we implement approaches to estimate SNP diversity in metagenomic samples and carry out statistical tests to identify genes where sequence diversity exists.
The framework allows easy customisation of any metagenomic workflow by providing the functions and scripts needed to manipulate data from NGS pipelines and to run bespoke analyses of the data. MGKit does not enforce a specific pipeline on the user, but leverages analysis patterns and common file formats to make it easier to experiment with different types of analyses.
MGKit is implemented in Python and uses libraries common in the scientific Python community, like NumPy, SciPy, Matplotlib and pandas, along with packages used in NGS data analysis, like HTSeq and pysam.
Links: Website; Documentation.

SNPdat

Software for annotation of both novel and known single nucleotide polymorphisms (SNPs), developed by Anthony Doran and Chris Creevey. It is specifically designed for use with organisms that are either not supported by other tools or have a small number of annotated SNPs available; however, it can also be used to analyse datasets from organisms that are densely sampled for SNPs.

SNPdat is published in: Doran, Anthony G., and Christopher J. Creevey. "Snpdat: easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms." BMC bioinformatics 14.1 (2013): 45.
The source code and manuals can be downloaded here.

Clann

Construction of Supertrees and exploration of phylogenomic information from partially overlapping datasets.

The software, developed by Chris Creevey, implements methods for determining the optimal phylogenetic supertree given a set of input source trees. All of the implemented methods allow the data to be investigated in a phylogenomic context.
Clann is published in: Creevey, C. J., and James O. McInerney. "Clann: investigating phylogenetic information through supertree analyses." Bioinformatics 21.3 (2004): 390-392.
You can download the latest version of Clann and the manual at Clann's GitHub site.

Crann

Fast heuristic methods of detecting adaptive evolution in protein-coding genes.

You can download the latest version of Crann and manuals here.

If you use Crann, cite:

Creevey, C. and J. O. McInerney (2003). CRANN: Detecting adaptive evolution in protein-coding DNA sequences. Bioinformatics 19: 1726.

Creevey, C. and J. O. McInerney (2002). An algorithm for detecting directional and non-directional positive selection, neutrality and negative selection in protein coding DNA sequences. Gene 300: 43-51.

AQUA

Automated Quality improvement for multiple sequence alignments

Chris Creevey co-developed this protocol and software with Jean Muller while working in the Bork Group at EMBL. The protocol automatically identifies the most reliable multiple sequence alignment for a given protein family. The implementation relies on two alignment programs (MUSCLE and MAFFT), one refinement program (RASCAL) and one assessment program (NORMD), but other programs could be incorporated at any of the three steps.
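
The three-step pattern described above (align, refine, assess, then keep the best candidate) can be sketched with pluggable functions. This is a minimal illustration, not the AQUA implementation: the toy aligners, refiner and scoring function below are hypothetical stand-ins for MUSCLE/MAFFT, RASCAL and NORMD, and the assumption that both raw and refined alignments are scored is made only for this example.

```python
def best_alignment(sequences, aligners, refiners, score):
    """Run every aligner, refine each result, and return the best-scoring (name, alignment)."""
    candidates = []
    for name, align in aligners.items():
        alignment = align(sequences)
        candidates.append((name, alignment))
        for refiner_name, refine in refiners.items():
            candidates.append((f"{name}+{refiner_name}", refine(alignment)))
    return max(candidates, key=lambda item: score(item[1]))


# Toy stand-ins for the three kinds of program.
def pad_right(seqs):
    width = max(map(len, seqs))
    return [s.ljust(width, "-") for s in seqs]


def pad_left(seqs):
    width = max(map(len, seqs))
    return [s.rjust(width, "-") for s in seqs]


def strip_gap_columns(alignment):
    keep = [i for i in range(len(alignment[0]))
            if any(row[i] != "-" for row in alignment)]
    return ["".join(row[i] for i in keep) for row in alignment]


def identical_columns(alignment):
    """Toy quality score: number of columns in which every row has the same character."""
    return sum(len(set(column)) == 1 for column in zip(*alignment))


name, alignment = best_alignment(
    ["ACGT", "CGT"],
    aligners={"pad_right": pad_right, "pad_left": pad_left},
    refiners={"strip": strip_gap_columns},
    score=identical_columns,
)
print(name, alignment)  # pad_left ['ACGT', '-CGT']
```

Passing the programs in as dictionaries mirrors the point made above that other tools could be swapped in at any of the three steps.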

Download AQUA

More details about the method here.

If you use AQUA, cite:

Muller J., Creevey C.J., Thompson J.D., Arendt D., Bork P. 2010. Aqua: Automated Quality Improvement for Multiple Sequence Alignments. Bioinformatics 26:263-265.

Concatabominations

Identify unstable taxa in phylogenies

Method and implementation developed by Chris Creevey in collaboration with Karen Siu-Ting, Mark Wilkinson and Davide Pisani. The method is a heuristic extension of the Safe Taxonomic Reduction method for identifying unstable taxa in phylogenies: it uses a compatibility approach to test whether taxa can be equivalent in their character information. The implementation also uses Cytoscape to visualise taxonomic equivalents in a network.

The method used is detailed here. The source code and manuals can be downloaded here.

If you use Concatabominations, cite:

Siu-Ting, Karen, et al. "Concatabominations: identifying unstable taxa in morphological phylogenetics using a heuristic extension to safe taxonomic reduction." Systematic biology 64.1 (2014): 137-143.

eggNOG

eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (derived from the original COG/KOG categories).

eggNOG's database currently counts 1.7 million orthologous groups in 3686 species, covering over 7.7 million proteins (built from 9.6 million proteins).

Access the publication here.
