The preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
Here is a link to the latest development branch at Github.
For users doing single-cell DNA sequencing: The Preseq package includes new functionality designed for single-cell genome resequencing projects, in particular to predict the genomic coverage for single-cell DNA sequencing experiments. This functionality, which is called gc_extrap in the software, was presented at ISMB/ECCB 2013 and has recently been published online in Bioinformatics. The poster can be found here, the paper can be found here, and basic usage is explained in the manual.
Installing a pre-compiled binary
We provide pre-compiled binary releases for version 2.0 for your convenience. This is to save time and effort from compiling and building preseq yourself. Please choose the appropriate binary for your platform.
In the following example, we use the Linux binary. Unpack the compressed archive
$ tar -jxvf preseq_linux_v2.0.tar.bz2
you can now change into the directory
$ cd preseq_linux_v2.0/
and can start using preseq. See the usage guide or manual below for more information.
Building preseq from the source code
Click to download the latest preseq source code (version 3.1.2), or you can clone the latest version from our Github repository. In the latter case recursive cloning is required to ensure the necessary submodules are included.
System requirements
64-bit machine, GCC version >= 4.1
Installation
To install preseq, download the compressed archive. Unpack it using
$ tar -zxvf preseq-3.1.2.tar.gz
Move into the preseq directory and configure:
$ cd preseq-3.1.2
$ ./configure --prefix=/where/to/install
If you would like to read in bam input , then HTSLib needs to be installed (http://www.htslib.org/download/). Configure with the following flag:
$ ./configure --enable-hts
Compile and install the tools:
$ make $ make install
Quick usage guide
There are five main programs in the preseq package: c_curve, lc_extrap, gc_extrap, bound_pop, and pop_size. The first two require as input a sorted bed or bam file, with duplicate reads included. bound_pop requires a sorted bed or mapped read (mr) file. The bam file can be sorted with the samtools (http://samtools.sourceforge.net/) sort functions. The bed or mr file should be sorted by chromosome, start position, end position, and strand and can be done with the command line function sort
$ sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 input.bed > input.sort.bed
c_curve computes the expected yield of distinct reads for experiments smaller than the input experiment in a .bed or .bam file through resampling. The full set of parameters can be outputed by simply typing the program name. If output.txt is the desired output file name and input.bed is the input .bed file, then simply type
$ preseq c_curve -o output.txt input.sort.bed
lc_extrap computes the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals. The -o parameter specifies the output file for the expected future yield. Our new module, pop_size, estimates the number of unique species in a sample, and receives input identical to lc_extrap.
$ preseq lc_extrap -o yield.txt input.sort.bed
If the input is in sorted bam format, then the option -B must be included. If the input is paired end, the option -P should be included. In this case only concordantly mapped reads are counted. c_curve and lc_extrap take count files, either as a text file histogram of duplicate counts or a text file column of duplicate counts, as input. The corresponding flags are -H and -V. gc_extrap computes the expected genomic coverage for deeper sequencing for single cell sequencing experiments. The input should be a mr or bed file. The tool to-mr is provided to convert sorted bam or sam files to mapped read format.
$ preseq gc_extrap -o coverage_curve.txt input.sort.mr
For fast estimates, lc_extrap or gc_extrap may be run in quick mode with the option -Q. This predicts the complexity without bootstrapping, so that confidence intervals will be missing. This significantly reduces the computation time.
bound_pop estimates the total size of the population in the sample, known as the species richness, with a nonparametric moment-based estimator. Typical input will be a text file with a single column of duplicate counts.
$ preseq bound_pop -o estim_species_richness.txt -V input.txt
Preseq uses rational function approximations to Good and Toulmin’s (1956) non-parametric empirical Bayes power series estimator. For more information, see the preseq manual or paper, published in Nature Methods in the April 2013 issue. For more information on gc_extrap, see our poster presented at ISMB/ECCB 2013 in Berlin.
Questions, comments, advice, or bugs can be sent to Andrew D. Smith at andrewds@usc.edu. Thank you for using preseq.
R package
An R package called preseqR makes the functionality of preseq available in the R statistical computing environment. It estimates the expected number of species as a function of the number of captures, which is called the species discovery curve or species accumulation curve in Ecology. Not only does preseqR covers the basic functionality of the preseq program, but it also functions to fit a zero-truncated negative binomial (ZTNB) distribution to the number of individuals captured for each observed species and extrapolate species accumulation curve. You can download the R package from CRAN.
The library complexity challenge
You are always welcome to compare the performance of preseq with other software on predicting library complexity. See this page for more information.