Zagros — Motif discovery using CLIP-Seq data
Version 1.1.0
Zagros is motif discovery software for CLIP-Seq high-throughput protein-RNA interaction data. Given the regions of significant read enrichment, Zagros characterizes the binding site of the given RNA-binding protein (RBP). The package includes two additional programs: one calculates the base-pairing probabilities of the input sequences, and the other extracts experiment-specific events. Zagros can incorporate both kinds of information to improve the accuracy of motif discovery. The latest Zagros distribution is available for download, and the latest development branch is on GitHub.
System requirements
64-bit machine and GCC version ≥ 4.1 (to support TR1).
Install
To install Zagros, download the compressed archive and unpack it with a command like:
> tar -xvf zagros-X.Y.Z.tar.gz
To build the binaries, type
> make all
To install the binaries, type
> make install
This will place the binaries in the bin directory under the package root. They can be used directly from there without any additional steps. You can add that directory to your PATH environment variable to avoid having to specify their full paths, or you can copy the binaries to another directory of your choice in your PATH.
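As a minimal sketch of the PATH option, the snippet below prepends the bin directory to PATH unless it is already present. ZAGROS_ROOT is a hypothetical variable introduced here for illustration; set it to wherever the archive was unpacked.

```shell
# ZAGROS_ROOT is an assumed name, not something the package defines.
# Point it at the unpacked package root.
ZAGROS_ROOT="${ZAGROS_ROOT:-$HOME/zagros-X.Y.Z}"

# Prepend the bin directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$ZAGROS_ROOT/bin:"*) ;;
  *) PATH="$ZAGROS_ROOT/bin:$PATH" ;;
esac
export PATH
```

Placing these lines in a shell startup file would make the setting persistent across sessions.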
Basic Usage
Zagros has four modes of operation.
1) Sequence only
In this mode, only the sequence information is used for motif discovery. The input can be either a set of sequences in FASTA format or a set of genomic regions in BED format. These regions or sequences correspond to the locations of significant read enrichment in the experiment.
If the input is a set of sequences, you can simply run:
> ./zagros input.fa
If the input is a set of genomic regions, the target genome sequences must also be
provided so that Zagros can extract the sequences:
> ./zagros -c path/to/chrom_directory input.bed
The chromosome directory can be downloaded from the UCSC Genome Browser website.
The latest versions can be found here:
http://hgdownload.soe.ucsc.edu/downloads.html
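Before downloading, it may help to know which chromosomes a BED file actually refers to. The sketch below lists the distinct names in the first column. The function name bed_chroms is hypothetical, and the script assumes a standard BED layout in which "track" and "browser" header lines and "#" comments can be skipped.

```shell
# Print the distinct chromosome names (column 1) used in a BED file,
# skipping blank lines, comments, and UCSC "track"/"browser" header lines.
# bed_chroms is an illustrative helper, not part of Zagros.
bed_chroms() {
  awk '!/^(#|track|browser)/ && NF > 0 { print $1 }' "$1" | sort -u
}
```

Calling bed_chroms input.bed prints one chromosome name per line, which can then be compared against the contents of the chromosome directory.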
2) Sequence and Structure
In this mode, secondary structure information is used in addition to the target sequences. The secondary structure data must first be computed and saved using the “thermo” program:
> ./thermo -o input.str input.fa
or
> ./thermo -c path/to/chrom_directory -o input.str input.bed
After this step, provide both the target file and the secondary structure file to
Zagros, and motif discovery is performed using both:
> ./zagros -t input.str input.fa
or
> ./zagros -c path/to/chrom_directory -t input.str input.bed
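Because the structure file must exist before zagros is run with -t, a small guard can catch a missing or empty file early. This is only a sketch: require_file is a hypothetical helper, not part of the Zagros distribution.

```shell
# Succeed only if the named file exists and is non-empty;
# otherwise report the problem on stderr and return a failure status.
# require_file is an illustrative helper, not a Zagros command.
require_file() {
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
}
```

It could be combined with the command above as require_file input.str && ./zagros -t input.str input.fa, so that zagros only starts when thermo has produced output.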
3) Sequence and Diagnostic events
In this mode, information about cross-link modification events is used in addition to the target sequences. The diagnostic events information must first be obtained and saved using the “extractDEs” program.
The input to the extractDEs program is the set of mapped reads. The user must specify the technology used to obtain the reads (hCLIP, pCLIP or iCLIP), the mapper used to map the reads, and the genomic regions of significant enrichment that are given to Zagros as input. extractDEs then produces the set of diagnostic events corresponding to the regions of interest. Zagros can interpret mapped reads from three mappers: bowtie (native output format), novoalign (native output format) and RMAP (BED format).
> ./extractDEs -m novoalign -t iCLIP -o input.des -r input.bed mapped_reads.novo
Then run the zagros program, passing the diagnostic events file as one of the options:
> ./zagros -d input.des input.fa
or
> ./zagros -c path/to/chrom_directory -d input.des input.bed
4) Sequence, Structure and Diagnostic events
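In this mode, the sequence, secondary structure and diagnostic events are all used. Run the three programs in order:
> ./thermo -o input.str input.fa
> ./extractDEs -m novoalign -t iCLIP -o input.des -r input.bed mapped_reads.novo
> ./zagros -t input.str -d input.des input.fa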
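The three steps of this mode can be chained in a small wrapper, sketched below. It uses only the options shown in this document and assumes novoalign reads from an iCLIP experiment, as in the example above. The function name zagros_full and the RUN variable are hypothetical: leaving RUN empty executes the commands, while RUN=echo prints them without running anything.

```shell
# Run thermo, extractDEs and zagros in sequence on one data set.
# Arguments: FASTA file, BED file of enriched regions, novoalign read file.
# zagros_full and RUN are illustrative names, not part of Zagros.
zagros_full() {
  fasta=$1; bed=$2; reads=$3
  base=${fasta%.*}            # input.fa -> input

  # Each step runs only if the previous one succeeded, so zagros
  # never starts with a missing structure or diagnostic events file.
  $RUN ./thermo -o "$base.str" "$fasta" &&
  $RUN ./extractDEs -m novoalign -t iCLIP -o "$base.des" -r "$bed" "$reads" &&
  $RUN ./zagros -t "$base.str" -d "$base.des" "$fasta"
}
```

Setting RUN=echo and calling zagros_full input.fa input.bed mapped_reads.novo shows the exact commands that would be run, which is a cheap way to check file names before starting a long computation.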