Preseq

The preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.

For users doing single-cell DNA sequencing: The Preseq package includes new functionality designed for single-cell genome resequencing projects, in particular to predict the genomic coverage for single-cell DNA sequencing experiments. This functionality, which is called gc_extrap in the software, was presented at ISMB/ECCB 2013 and has recently been published online in Bioinformatics. The poster can be found here, the paper can be found here, and basic usage is explained in the manual.

Installing a pre-compiled binary

We provide pre-compiled binary releases for version 2.0 for your convenience. This is to save time and effort from compiling and building preseq yourself, in case you do not have the SAMTools or GSL libraries installed. Please choose the appropriate binary for your platform.

preseq_linux_v2.0.tar.bz2

preseq_osx_v2.0.tar.bz2

In the following example, we use the Linux binary. Unpack the compressed archive

$ tar -jxvf preseq_linux_v2.0.tar.bz2

you can now change into the directory

$ cd preseq_linux_v2.0/

and can start using preseq. See the usage guide or manual below for more information.

Building preseq from the source code

Click to download the latest preseq source code (version 2.0), or you can clone the latest version from our github repository.  In the latter case recursive cloning is required to ensure the necessary submodules are included.

System requirements

64-bit machine, GCC version >= 4.1, and GSL version 1.15 available here.

Installation

To install preseq, download the compressed archive, unpack it using

$ tar -zxvf preseq-2.0.tar.gz

change directory into the unpacked source directory and type

$ make all

If the input file is in bam format, then samtools (http://samtools.sourceforge.net/) is requred. If the root directory of samtools is $samtools, instead run

$ make all SAMTOOLS_DIR=$samtools

Quick usage guide

There are three main programs in the preseq package: c_curve, lc_extrap, and gc_extrap. The first two require as input a sorted bed or bam file, with duplicate reads included.  The last function requires a sorted bed or mapped read(mr) file.  The bam file can be sorted with the samtools (http://samtools.sourceforge.net/) sort functions. The bed or mr file should be sorted by chromosome, start position, end position, and strand and can be done with the command line function sort

$ sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 input.bed > input.sort.bed

c_curve computes the expected yield of distinct reads for experiments smaller than the input experiment in a .bed or .bam file through resampling. The full set of parameters can be outputed by simply typing the program name. If output.txt is the desired output file name and input.bed is the input .bed file, then simply type

$ preseq c_curve -o output.txt input.sort.bed

lc_extrap computes the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals. The -o parameter specifies the output file for the expected future yield.

$ preseq lc_extrap -o yield.txt  input.sort.bed

If the input is in sorted bam format, then the option -B must be included. If the input is paired end, the option -P should be included. In this case only concordantly mapped reads are counted. c_curve and lc_extrap take count files, either as a text file histogram of duplicate counts or a text file column of duplicate counts, as input. The corresponding flags are -H and -V. gc_extrap computes the expected genomic coverage for deeper sequencing for single cell sequencing experiments.  The input should be a mr or bed file.  The tool bam2mr is provided to convert sorted bam or sam files to mapped read format.

$ preseq gc_extrap -o coverage_curve.txt  input.sort.nr

For fast estimates, lc_extrap or gc_extrap may be run in quick mode with the option -Q. This predicts the complexity without bootstrapping, so that confidence intervals will be missing. This significantly reduces the computation time.

bound_pop estimates the total size of the population in the sample, known as the species richness, with a nonparametric moment-based estimator.  Typical input will be a text file with a single column of duplicate counts.

$ preseq bound_pop -o estim_species_richness.txt -V input.txt

Preseq uses rational function approximations to Good and Toulmin’s (1956) non-parametric empirical Bayes power series estimator. For more information, see the preseq manual or paper, published in Nature Methods in the April 2013 issue.  For more information on gc_extrap, see our poster presented at ISMB/ECCB 2013 in Berlin.

Questions, comments, advice, or bugs can be sent to Timothy Daley at tdaley@usc.edu. Thank you for using preseq.

R package

An R package called preseqR makes the functionality of PRESEQ available in the R statistical computing environment. It estimates the expected number of species as a function of the number of captures, which is called the species discovery curve or species accumulation curve in Ecology. Not only does preseqR covers the basic functionality of the preseq program, but it also functions to fit a zero-truncated negative binomial (ZTNB) distribution to the number of individuals captured for each observed species and extrapolate species accumulation curve. You can download the R package from CRAN.

The library complexity challenge

It is always welcome to compare the performance of preseq with other softwares on predicting library complexity. See this page for more information.