CREAD: Comprehensive Regulatory Element Analysis and Discovery

Understanding how DNA and RNA elements participate in regulating gene expression, at both the transcriptional and translational levels, is among the most important immediate challenges in genome science. It is also an area where computational machinery must work in tandem with bench work. Regulatory patterns are diverse, and no single abstract representation or discovery algorithm can adequately capture all of them. A framework of low-level standards, designed to address the specific challenges of working with biological information, can simplify the design of tools and in silico experiments, especially in the areas of pattern discovery, machine learning, and pattern visualization.

Providing such a framework is the purpose of CREAD. CREAD currently focuses on patterns involved in transcriptional regulation and includes efficient tools for performing fundamental tasks in motif discovery and regulatory sequence analysis. It also includes code libraries to facilitate the implementation of new tools. In addition to these fundamental tools, CREAD includes an implementation of the MARS machine learning algorithm and a suffix tree implementation designed for repeated searching of large amounts of sequence data using position-weight matrices, a common representation for transcription factor binding sites.
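To make concrete what searching sequence data with a position-weight matrix involves, the following is a minimal sketch of a naive sliding-window scan. It is not CREAD's suffix tree method and does not use CREAD's libraries; the names Pwm, PwmColumn, base_index and scan_pwm are hypothetical, and the matrix is assumed to hold one column of additive (for example log-odds) scores per motif position in the order A, C, G, T.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for this sketch: one column of additive scores per
// motif position, indexed in the order A, C, G, T.
using PwmColumn = std::array<double, 4>;
using Pwm = std::vector<PwmColumn>;

// Map a nucleotide to its column index; return -1 for any other
// character (for example N).
int base_index(char base) {
  switch (base) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default: return -1;
  }
}

// Return the start offset of every window whose summed score is at least
// `threshold`. Windows containing a non-ACGT character are skipped.
std::vector<std::size_t> scan_pwm(const Pwm& pwm, const std::string& seq,
                                  double threshold) {
  assert(!pwm.empty());
  std::vector<std::size_t> hits;
  for (std::size_t start = 0; start + pwm.size() <= seq.size(); ++start) {
    double score = 0.0;
    bool valid = true;
    for (std::size_t j = 0; j < pwm.size(); ++j) {
      const int b = base_index(seq[start + j]);
      if (b < 0) {
        valid = false;
        break;
      }
      score += pwm[j][static_cast<std::size_t>(b)];
    }
    if (valid && score >= threshold) hits.push_back(start);
  }
  return hits;
}
```

This scan does work proportional to the sequence length times the motif width for every matrix. An index such as a suffix tree is built once so that repeated searches with many matrices can share work across windows that begin with the same bases.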

The Pattern-Feature-Model framework underlying the machine learning programs of CREAD was designed to directly address the problem of exploratory data analysis, in which the user wants to identify how different types of patterns can be used to characterize sets of sequences that have common functions, such as regulatory sequences that control expression in a specific context. Precise characterizations of the properties of certain types of sequences can suggest hypotheses about how these sequences function in the cell. Patterns in CREAD are representations, such as position-weight matrices or regular expressions, that describe a class of sequence elements. Features are functions of the patterns, such as the number of subsequences matching a particular regular expression. Models are constructed from the features using machine learning methods, for the purpose of using those features to make predictions about the function of sequences.
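The pattern, feature and model layers described above can be illustrated with a small sketch. It is not taken from CREAD: the function names count_matches, feature_vector and linear_score are hypothetical, the patterns are plain regular expressions, and a toy linear score with hand-chosen weights stands in for a learned model such as MARS.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Feature: the number of non-overlapping matches of one pattern
// (here a regular expression) in a sequence.
std::size_t count_matches(const std::regex& pattern, const std::string& seq) {
  const std::sregex_iterator first(seq.begin(), seq.end(), pattern);
  const std::sregex_iterator last;
  return static_cast<std::size_t>(std::distance(first, last));
}

// One feature value per pattern, in the order the patterns are given.
std::vector<double> feature_vector(const std::vector<std::regex>& patterns,
                                   const std::string& seq) {
  std::vector<double> features;
  features.reserve(patterns.size());
  for (const std::regex& pattern : patterns) {
    features.push_back(static_cast<double>(count_matches(pattern, seq)));
  }
  return features;
}

// Model: a toy linear score over the features. In practice the weights
// would be fit by a machine learning method; here they are supplied by hand.
double linear_score(const std::vector<double>& weights, double bias,
                    const std::vector<double>& features) {
  assert(weights.size() == features.size());
  double score = bias;
  for (std::size_t i = 0; i < features.size(); ++i) {
    score += weights[i] * features[i];
  }
  return score;
}
```

Keeping the three layers separate means a new pattern type only needs a new feature function, while the model code is unchanged.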

CREAD currently includes a set of state-of-the-art tools for motif discovery and regulatory sequence analysis. The programs of CREAD use a common set of file formats, which facilitates pipeline construction. CREAD programs and libraries are written in C++, and there are plans to add pipeline scripts written in Python in the near future.

References

Andrew D. Smith, Pavel Sumazin, Zhenyu Xuan, and Michael Q. Zhang
DNA motifs in human and mouse proximal promoters predict tissue specific expression.
PNAS, 103(16):6275-6280 (2006)

Andrew D. Smith, Pavel Sumazin, Debopriya Das, and Michael Q. Zhang
Mining ChIP-chip data for transcription factor and cofactor binding sites.
Bioinformatics, 21(Suppl 1):i403-i412 (2005)

Dustin E. Schones, Pavel Sumazin, and Michael Q. Zhang
Similarity of position frequency matrices for transcription factor binding sites.
Bioinformatics, 21(3):307-313 (2005)