Software | The Smith Lab

RNA-protein interactions

Zagros

Zagros is a motif discovery tool for CLIP-Seq high-throughput protein-RNA interaction data. Given regions of significant read enrichment, Zagros characterizes the binding site of the RNA-binding protein (RBP) under study. Zagros includes two additional programs, one that calculates the base-pairing probabilities of the input sequences and one that extracts experiment-specific events, so that this information can be incorporated for more accurate motif discovery. Go to the Zagros Homepage.

Piranha

Piranha is a peak-caller for CLIP- and RIP-Seq high-throughput protein-RNA interaction data. It accepts input in BED or BAM format and identifies regions of significant enrichment for reads. Piranha can optionally incorporate external covariates into the peak-calling process, and can identify sites of differential binding occupancy between cell types, conditions or developmental stages. Go to the Piranha Homepage.

DNA Methylation

MethPipe

The MethPipe software package is a computational pipeline for analyzing bisulfite sequencing data (WGBS and RRBS). MethPipe provides tools for mapping bisulfite sequencing reads and estimating methylation levels at individual cytosine sites. MethPipe also includes tools for identifying higher-level methylation features, including hypo-methylated regions (HMR), partially methylated domains (PMD), hyper-methylated regions (HyperMR), and allele-specific methylated regions (AMR). Go to the MethPipe Homepage.
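
As a rough illustration of what a methylation level at an individual cytosine means: after bisulfite conversion, unmethylated cytosines are read as T while methylated cytosines are still read as C, so a simple estimate is the fraction of covering reads that show C. The sketch below is not MethPipe code; the function name and the site labels are hypothetical and exist only for the example.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of covering reads that still show C after bisulfite conversion."""
    coverage = c_reads + t_reads
    if coverage == 0:
        return None  # no reads cover this site, so there is no estimate
    return c_reads / coverage


# Hypothetical sites mapped to (reads showing C, reads showing T).
sites = {"chr1:1000": (18, 2), "chr1:1042": (0, 15), "chr1:1100": (0, 0)}
for name, (c_reads, t_reads) in sites.items():
    print(name, methylation_level(c_reads, t_reads))
```

Returning None for uncovered sites keeps "no data" distinct from "unmethylated", which matters when levels are later averaged over regions.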

abismal

Abismal is a mapper for short (50 to 1000 bp) WGBS sequences. It aligns reads in FASTQ files to a FASTA reference genome. Mapped reads are written as SAM or BAM output, which is typically the input for MethPipe. Abismal requires less than 4 GB of RAM to map reads to the genomes of most organisms (including human) and can be installed and run on most UNIX machines. The latest version can be found on the abismal GitHub page or through bioconda.

RADMeth

RADMeth (Regression Analysis of Differential Methylation) is a software package for identifying individual differentially methylated sites and genomic regions in whole genome bisulfite sequencing (WGBS) data. Go to the RADMeth Homepage.

MLML

MLML is a tool for simultaneously estimating hydroxymethylation (5hmC) and methylation (5mC) levels from BS-seq, oxBS-seq and TAB-seq experiments. It generates estimates that are consistent across experiment types. Go to the MLML homepage.
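
To see why consistency matters, consider a single site measured by BS-seq (C is read for both 5mC and 5hmC) and by oxBS-seq (C is read for 5mC only). Subtracting the two observed fractions can give a negative 5hmC level. The sketch below instead maximizes a binomial likelihood subject to both levels being non-negative and summing to at most 1. It uses a plain grid search for clarity and is only an illustration under these assumptions, not MLML's own algorithm; all function names are hypothetical.

```python
import math


def _term(count, p):
    """count * log(p), using the convention 0 * log(0) = 0."""
    if count == 0:
        return 0.0
    return count * math.log(p) if p > 0 else float("-inf")


def log_likelihood(pm, total, bs_c, bs_t, ox_c, ox_t):
    # total = 5mC + 5hmC is the chance of reading C in BS-seq;
    # pm (5mC alone) is the chance of reading C in oxBS-seq.
    return (_term(bs_c, total) + _term(bs_t, 1 - total)
            + _term(ox_c, pm) + _term(ox_t, 1 - pm))


def estimate_levels(bs_c, bs_t, ox_c, ox_t, steps=100):
    """Grid search over pm >= 0, ph >= 0, pm + ph <= 1; returns (pm, ph)."""
    best = (float("-inf"), 0.0, 0.0)
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            ll = log_likelihood(i / steps, (i + j) / steps,
                                bs_c, bs_t, ox_c, ox_t)
            if ll > best[0]:
                best = (ll, i / steps, j / steps)
    return best[1], best[2]


# BS-seq: 40 C / 60 T; oxBS-seq: 60 C / 40 T.
# Naive subtraction would give a 5hmC level of 0.4 - 0.6 = -0.2.
print(estimate_levels(40, 60, 60, 40))
```

In this inconsistent case the constrained estimate puts the 5hmC level at 0 and pools both experiments to estimate the 5mC level.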

amrfinder

amrfinder implements a novel probabilistic model to predict allele-specific DNA methylation (ASM) in mammals in the absence of SNP data. Given a set of mapped reads, it uses the distribution of methylation on the reads to compute the likelihood of allele-specific methylation in read-sized regions of the genome, and then ties them together to form contiguous allele-specific methylated regions (AMRs). Go to the amrfinder Homepage.

PRESEQ: Predicting Library Complexity

The preseq package predicts the number of distinct reads in a library, and how many more would be expected from additional sequencing, based on an initial sequencing experiment. The estimates can then be used to assess the value of further sequencing, optimize the sequencing depth, or screen multiple libraries to avoid low-complexity samples. Go to the preseq homepage.

The same approach applies to estimating the expected number of species as a function of the number of captures, known in ecology as the species discovery curve or species accumulation curve. For those who prefer to work in the R statistical computing environment, we provide an R package called preseqR, which makes the functionality of preseq available in R. Download preseqR from CRAN.
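
For intuition about this kind of prediction, the classical Good-Toulmin estimator from the species-sampling literature gives the expected number of new distinct items after t times more sampling as the alternating sum of (-1)^(j+1) t^j n_j, where n_j is the number of items seen exactly j times. The sketch below implements only that plain series, which is reliable for t up to 1; it is not preseq's own method, which is more elaborate. The read labels are hypothetical.

```python
from collections import Counter


def counts_histogram(observations):
    """Map j to the number of distinct items observed exactly j times."""
    per_item = Counter(observations)
    return Counter(per_item.values())


def good_toulmin(hist, t):
    """Expected new distinct items after t times more sampling, 0 <= t <= 1."""
    if not 0 <= t <= 1:
        raise ValueError("the plain series is only reliable for 0 <= t <= 1")
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in hist.items())


# Hypothetical reads: three seen once, two seen twice, one seen three times.
reads = ["a", "b", "c", "d", "d", "e", "e", "f", "f", "f"]
hist = counts_histogram(reads)
print(good_toulmin(hist, 1.0))  # doubling the sequencing effort
print(good_toulmin(hist, 0.5))  # half as much again
```

A library dominated by singletons predicts many new reads from further sequencing, while one dominated by duplicates predicts few.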

RSEG

The RSEG software package analyzes ChIP-Seq data and is designed in particular to identify genomic regions, and their boundaries, marked by diffuse histone modifications such as H3K36me3 and H3K27me3. It can work with or without a control sample. It can also find regions with differential histone modification patterns, either between two cell types or between two kinds of histone modification. Go to RSEG Homepage.

RMAP

RMAP accurately maps reads from next-generation sequencing technology. RMAP can map reads with or without error probability information (quality scores) and supports mapping of paired-end reads and bisulfite-treated reads. There are no limitations on read widths or number of mismatches. RMAP can now map more than 8 million reads in an hour with full sensitivity for up to 2 mismatches. Go to RMAP Homepage.

Ribotricer

Ribotricer is a method for detecting actively translating ORFs by directly leveraging the three-nucleotide periodicity of Ribo-seq data. It accurately identifies both short and long active ORFs. Visit the Ribotricer page on GitHub.
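
Three-nucleotide periodicity means that, within an actively translated ORF, read counts concentrate at one of the three codon positions. A very simple way to look for this signal is to sum a per-nucleotide read profile by position modulo 3. This sketch is only an illustration of the signal and is not Ribotricer's scoring method; the function name and the profile values are hypothetical.

```python
def frame_fractions(profile):
    """Share of reads falling in each of the three codon positions of an ORF."""
    totals = [0, 0, 0]
    for position, count in enumerate(profile):
        totals[position % 3] += count
    covered = sum(totals)
    if covered == 0:
        return None  # nothing to measure on an uncovered ORF
    return [total / covered for total in totals]


# Hypothetical profile over three codons, starting at the first ORF base.
profile = [9, 1, 0, 8, 1, 1, 10, 0, 0]
print(frame_fractions(profile))
```

A strongly periodic profile puts most of its reads in one frame, whereas a profile with no periodicity gives fractions near one third each.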

Falco

Falco is a drop-in emulation of FastQC for UNIX machines. It reimplements FastQC in C++ and usually runs faster while generating identical results. The most recent version of Falco can be found on its GitHub page or through bioconda.

Regulatory Sequence Analysis

DME

DME is a program that discovers transcription factor binding site motifs in nucleotide sequences. DME identifies motifs, represented as position weight matrices, that are overrepresented in one set of sequences relative to another set. The ability to directly optimize relative overrepresentation is a unique feature of DME, making DME an ideal tool for analyzing promoters of transcripts found to have differential expression in a particular context. The optimization procedure is based on an enumerative algorithm that is guaranteed to identify optimal motifs from a discrete space of matrices with a specific lower bound on information content. This strategy scales very well with the number and length of the sequences used, and is well-suited to analyzing very large data sets. Go to DME Homepage.
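
The information content that bounds the search space can be illustrated with the usual definition for a position weight matrix over A, C, G, T: each column contributes 2 plus the sum of p log2 p bits against a uniform background. The sketch below assumes that common definition, which may differ in detail from the one DME uses; the function names and the example matrix are hypothetical.

```python
import math


def column_information(column):
    """Information content in bits of one column, uniform background assumed."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def matrix_information(pwm):
    """Total information content of a matrix given as a list of columns."""
    return sum(column_information(column) for column in pwm)


# Hypothetical 3-column matrix; each column holds probabilities for A, C, G, T.
pwm = [
    [1.0, 0.0, 0.0, 0.0],      # always A: 2 bits
    [0.5, 0.5, 0.0, 0.0],      # A or C: 1 bit
    [0.25, 0.25, 0.25, 0.25],  # uninformative: 0 bits
]
print(matrix_information(pwm))
```

A lower bound on this total excludes nearly uniform matrices, which match almost any sequence and so cannot distinguish one sequence set from another.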

CREAD

CREAD is a framework for studying regulatory elements in a genome. Currently focusing on patterns involved in transcriptional regulation, CREAD includes efficient tools for performing fundamental tasks in motif discovery and regulatory sequence analysis. CREAD also includes code libraries to facilitate the implementation of new tools. In addition to these fundamental tools, CREAD includes an implementation of the MARS machine learning algorithm, and a suffix tree implementation designed for repeated searching of large amounts of sequence data using position weight matrices, a common representation for transcription factor binding sites. Go to CREAD Homepage.

Metagenomics

Amordad

Amordad is a database engine for comparing metagenomic data at massive scale. It first computes a sequence signature for each metagenome and organizes the signatures as points in a high-dimensional space. This geometric space is then searched using a combination of random hashing and a nearest-neighbor graph, which answers queries efficiently even when the number of data points reaches the millions. Amordad thus offers a solution for the next generation of metagenomic studies, where alignment-based methods may be too time-consuming. To learn more about the database engine and download code for setting it up on your own server, you can go to the Amordad Software Page. To use our own Amordad server, which indexes the MG-RAST database of metagenomes, go to the Amordad homepage.
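
The general idea of turning sequences into points and hashing them can be sketched as follows. The example assumes that a signature is a vector of k-mer frequencies and that the random hashing is of the random-hyperplane kind, where each bit records which side of a random plane the point falls on. Both are assumptions made for illustration and may not match Amordad's implementation; all names and parameters are hypothetical.

```python
import itertools
import random


def kmer_signature(sequence, k=2):
    """Vector of k-mer frequencies over A, C, G, T, in a fixed k-mer order."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing other characters, e.g. N
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def make_hyperplanes(dim, bits, seed=0):
    """One random Gaussian normal vector per hash bit, reproducible via seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def hash_signature(vector, planes):
    """One bit per plane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(v * w for v, w in zip(vector, plane)) >= 0) for plane in planes
    )


planes = make_hyperplanes(dim=16, bits=8)
query = kmer_signature("ACGTACGTTTGACA")
print(hash_signature(query, planes))
```

Points that hash to the same bit pattern become candidate neighbors, so a query only has to be compared against a small bucket rather than every stored signature.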

Smith Lab source code can also be found on GitHub here.