RNA-protein interactions


Zagros is motif discovery software for CLIP-Seq high-throughput protein-RNA interaction data. Given regions of significant read enrichment, Zagros characterizes the binding site of the given RNA-binding protein (RBP). Zagros includes two additional programs, one that calculates the base-pairing probabilities of the input sequences and one that extracts experiment-specific events, so that this information can be incorporated to improve the accuracy of motif discovery. Go to the Zagros Homepage.


Piranha is a peak-caller for CLIP- and RIP-Seq high-throughput protein-RNA interaction data. It accepts input in BED or BAM format and identifies regions of significant enrichment for reads. Piranha can also optionally incorporate additional external covariates into the peak-calling process, and identify sites of differential binding occupancy between cell types, conditions or developmental stages. Go to the Piranha Homepage.

DNA Methylation


The MethPipe software package is a computational pipeline for analyzing bisulfite sequencing data (WGBS and RRBS). MethPipe provides tools for mapping bisulfite sequencing reads and estimating methylation levels at individual cytosine sites. MethPipe also includes tools for identifying higher-level methylation features, including hypo-methylated regions (HMR), partially methylated domains (PMD), hyper-methylated regions (HyperMR), and allele-specific methylated regions (AMR). Go to the MethPipe Homepage.
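As a rough illustration of what a methylation level at an individual cytosine site means, the sketch below computes the fraction of reads covering a site that report it as methylated. The function name and its arguments are hypothetical and are not part of MethPipe; this is a minimal sketch of the basic point estimate, not the package's implementation.

```python
from typing import Optional


def methylation_level(methylated_reads: int, unmethylated_reads: int) -> Optional[float]:
    """Fraction of reads at one cytosine that support methylation.

    Hypothetical helper for illustration only (not MethPipe's API).
    Returns None when no reads cover the site, since the level is undefined.
    """
    if methylated_reads < 0 or unmethylated_reads < 0:
        raise ValueError("read counts must be non-negative")
    coverage = methylated_reads + unmethylated_reads
    if coverage == 0:
        return None
    return methylated_reads / coverage


if __name__ == "__main__":
    # A site covered by 10 reads, 8 of which remained unconverted (methylated).
    print(methylation_level(8, 2))
```

Sites with low coverage give noisy estimates, so in practice such a ratio is usually reported together with the coverage.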


RADMeth: Regression Analysis of Differential Methylation is a software package for computing individual differentially methylated sites and genomic regions in whole genome bisulfite sequencing (WGBS) data. Go to the RADMeth Homepage.


MLML simultaneously estimates hydroxymethylation (5hmC) and methylation (5mC) levels from BS-seq, oxBS-seq and TAB-seq experiments. It generates estimates that are consistent across experiment types. Go to the MLML homepage.
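To see why consistency across experiment types matters, consider the naive alternative of subtracting one experiment's estimate from another's. BS-seq reports 5mC and 5hmC together, while oxBS-seq reports 5mC alone, so a simple 5hmC estimate is the difference of the two. The sketch below shows that this difference can fall outside the valid range. It illustrates the problem only and is not MLML's method; all names are hypothetical.

```python
from typing import Dict


def naive_levels(bs_level: float, oxbs_level: float) -> Dict[str, float]:
    """Naive per-site estimates by subtraction (illustration only, not MLML).

    bs_level:   fraction of unconverted reads in BS-seq   (~ 5mC + 5hmC)
    oxbs_level: fraction of unconverted reads in oxBS-seq (~ 5mC only)
    """
    return {
        "5mC": oxbs_level,
        "5hmC": bs_level - oxbs_level,  # can be negative due to sampling noise
        "unmethylated": 1.0 - bs_level,
    }


def is_consistent(levels: Dict[str, float], tol: float = 1e-9) -> bool:
    """True when every level lies within [0, 1], allowing for rounding error."""
    return all(-tol <= value <= 1.0 + tol for value in levels.values())


if __name__ == "__main__":
    # Sampling noise makes the oxBS-seq estimate exceed the BS-seq estimate.
    noisy = naive_levels(bs_level=0.6, oxbs_level=0.7)
    print(noisy, is_consistent(noisy))
```

Here the naive 5hmC estimate is negative, which is exactly the kind of inconsistent result that a joint estimate across experiments avoids.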


amrfinder implements a novel probabilistic model to predict allele-specific DNA methylation (ASM) in mammals in the absence of SNP data. Given a set of mapped reads, it uses the distribution of methylation on the reads to compute the likelihood of allele-specific methylation in read-sized regions of the genome, and then ties them together to form contiguous allele-specific methylated regions (AMRs). Go to the amrfinder Homepage.

PRESEQ: Predicting Library Complexity

The preseq package predicts the number of distinct reads, and how many more can be expected from additional sequencing, using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or screen multiple libraries to avoid low-complexity samples. Go to the preseq homepage.

The same approach applies to estimating the expected number of species as a function of the number of captures, known in ecology as the species discovery curve or species accumulation curve. For those who prefer to work in the R statistical computing environment, we provide an R package called preseqR, which makes the functionality of PRESEQ available in R. Download preseqR from CRAN.
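For intuition about this kind of prediction, the sketch below uses the classical Good-Toulmin estimator, which predicts how many new distinct items a further sample would reveal from the counts of items seen exactly once, twice, and so on. This is a textbook estimator chosen for illustration and is not presented as preseq's implementation; the function names are hypothetical. The alternating series is only reliable when the additional sample is no larger than the initial one, so the sketch restricts `t` to the range 0 to 1.

```python
from collections import Counter
from typing import Dict, Iterable


def counts_histogram(read_ids: Iterable[str]) -> Dict[int, int]:
    """Map j -> number of distinct reads observed exactly j times."""
    per_read = Counter(read_ids)
    return dict(Counter(per_read.values()))


def good_toulmin_new_distinct(hist: Dict[int, int], t: float) -> float:
    """Expected number of new distinct reads after sequencing t times more.

    t is the size of the additional sample relative to the initial one.
    Classical Good-Toulmin series: sum over j of (-1)^(j+1) * t^j * n_j.
    Restricted to 0 <= t <= 1, where the series is well behaved.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("this sketch only supports 0 <= t <= 1")
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in hist.items())


if __name__ == "__main__":
    reads = ["a", "a", "b", "c", "c", "c", "d"]  # toy read identifiers
    hist = counts_histogram(reads)
    print(hist)
    print(good_toulmin_new_distinct(hist, 0.5))
```

A library in which most reads are seen once yields a large predicted gain from further sequencing, while a library dominated by duplicates yields a small one.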


The RSEG software package analyzes ChIP-Seq data, and is especially suited to identifying genomic regions and their boundaries marked by diffuse histone modification marks, such as H3K36me3 and H3K27me3. It can work with or without a control sample. It can also be used to find regions with differential histone modification patterns, comparing either two cell types or two kinds of histone modification. Go to the RSEG Homepage.


RMAP accurately maps reads from next-generation sequencing technology. RMAP can map reads with or without error probability information (quality scores) and supports mapping of paired-end reads and bisulfite-treated reads. There are no limitations on read widths or number of mismatches. RMAP can now map more than 8 million reads in an hour at full sensitivity to 2 mismatches. Go to the RMAP Homepage.

Regulatory Sequence Analysis


DME is a program that discovers transcription factor binding site motifs in nucleotide sequences. DME identifies motifs, represented as position weight matrices, that are overrepresented in one set of sequences relative to another set. The ability to directly optimize relative overrepresentation is a unique feature of DME, making DME an ideal tool for analyzing promoters of transcripts found to have differential expression in a particular context. The optimization procedure is based on an enumerative algorithm that is guaranteed to identify optimal motifs from a discrete space of matrices with a specific lower bound on information content. This strategy scales very well with the number and length of the sequences used, and is well suited to analyzing very large data sets. Go to the DME Homepage.
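The information content of a position weight matrix can be made concrete with a short sketch. Assuming a uniform background over A, C, G and T, each column contributes between 0 bits (all bases equally likely) and 2 bits (one base with probability 1). The code below is a generic illustration of that standard measure; the function names are hypothetical and it does not reproduce DME's search procedure.

```python
import math
from typing import List, Sequence

BACKGROUND = 0.25  # assumed uniform background frequency for A, C, G, T


def column_information(column: Sequence[float]) -> float:
    """Information content of one matrix column in bits (0 to 2)."""
    if len(column) != 4 or abs(sum(column) - 1.0) > 1e-6:
        raise ValueError("column must hold four probabilities summing to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return sum(p * math.log2(p / BACKGROUND) for p in column if p > 0)


def matrix_information(matrix: List[Sequence[float]]) -> float:
    """Total information content of a matrix, summed over its columns."""
    return sum(column_information(col) for col in matrix)


if __name__ == "__main__":
    motif = [
        (1.0, 0.0, 0.0, 0.0),      # fully conserved position
        (0.5, 0.5, 0.0, 0.0),      # two equally likely bases
        (0.25, 0.25, 0.25, 0.25),  # uninformative position
    ]
    print(matrix_information(motif))
```

A lower bound on this quantity excludes matrices that are too degenerate to describe a specific binding site.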


CREAD is a framework for studying regulatory elements in a genome. Currently focusing on patterns involved in transcriptional regulation, CREAD includes efficient tools for performing fundamental tasks in motif discovery and regulatory sequence analysis. CREAD also includes code libraries to facilitate the implementation of new tools. In addition to these fundamental tools, CREAD includes an implementation of the MARS machine learning algorithm and a suffix tree implementation designed for repeated searching of large amounts of sequence data using position weight matrices, a common representation for transcription factor binding sites. Go to the CREAD Homepage.



Amordad is a database engine for comparing metagenomic data at massive scale. It first obtains the sequence signatures of metagenomes and organizes them as points in a high-dimensional space. This geometric space is then searched using a combined strategy of random hashing and a nearest-neighbor graph, which answers queries efficiently even when the number of data points reaches the millions. It therefore offers a solution for the next generation of metagenomic studies, where alignment-based methods may be too time-consuming to use. To learn more about the database engine and download code for setting it up on your own server, go to the Amordad Software Page. To use our own Amordad server, which indexes the MG-RAST database of metagenomes, go to the Amordad homepage.

Smith Lab source code can also be found on GitHub here.