Predicting Library Complexity

The preseq package predicts and estimates the complexity of a genomic sequencing library. Equivalently, it uses an initial sequencing experiment to estimate the number of redundant reads at a given sequencing depth and how many to expect from additional sequencing. The estimates can then be used to assess the utility of further sequencing, optimize the sequencing depth, or screen multiple libraries to avoid low-complexity samples.

Click to download the latest preseq source code (version 0.1.0).

System requirements

A 64-bit machine, GCC version 4.1 or later, and GSL version 1.15.

Installation

To install preseq, download the compressed archive, unpack it using

$ tar -jxvf preseq-0.1.0.tar.bz2

change directory into the unpacked source directory and type

$ make all

If the input file is in BAM format, then samtools (http://samtools.sourceforge.net/) is required. If the root directory of samtools is $samtools, instead run

$ make all SAMTOOLS_DIR=$samtools

Quick usage guide

There are two programs in the preseq package, c_curve and lc_extrap. Both require as input a sorted BED or BAM file with duplicate reads included. A BAM file can be sorted with the samtools (http://samtools.sourceforge.net/) sort command. A BED file should be sorted by chromosome, start position, end position, and strand, which can be done with the command-line tool sort

$ sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 input.bed > input.sort.bed
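Before passing a BED file to preseq, it can be useful to confirm that it really is in the required order. The following is a minimal sketch, assuming a POSIX sort that supports the -c (check) option; the three-read sample file and the scratch directory are hypothetical and stand in for your own data.

```shell
# Hypothetical three-read BED file in a scratch directory, deliberately out of order.
workdir=$(mktemp -d)
printf 'chr1\t100\t150\tr1\t0\t+\nchr1\t100\t150\tr2\t0\t+\nchr1\t90\t140\tr3\t0\t-\n' > "$workdir/input.bed"

# Sort by chromosome, start, end, and strand (the same keys as the command above).
sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 "$workdir/input.bed" > "$workdir/input.sort.bed"

# sort -c only checks the order: it exits non-zero at the first out-of-order line.
if sort -c -k 1,1 -k 2,2n -k 3,3n -k 6,6 "$workdir/input.sort.bed" 2>/dev/null; then
    sorted_status="sorted"
else
    sorted_status="unsorted"
fi
echo "input.sort.bed is $sorted_status"
```

The check uses exactly the same keys as the sort itself, so a file that passes is ordered by chromosome, start, end, and strand.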

c_curve computes, through resampling, the expected yield of distinct reads for experiments smaller than the input experiment, which is given as a .bed or .bam file. The full set of parameters can be printed by typing the program name alone. If output.txt is the desired output file name and input.sort.bed is the sorted input .bed file, then type

$ preseq c_curve -o output.txt input.sort.bed

lc_extrap computes the expected future yield of distinct reads, bounds on the total number of distinct reads in the library, and the associated confidence intervals. The -o parameter specifies the output file for the expected future yield.

$ preseq lc_extrap -o yield.txt input.sort.bed

If the input is in sorted BAM format, then the option -B must be included. If the input is paired-end, the option -P should be included, in which case only concordantly mapped reads are counted.

Both programs also accept count files as input, either as a text-file histogram of duplicate counts (flag -H) or as a text-file column of duplicate counts (flag -V).
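To illustrate what such count files might look like, the sketch below derives both from a sorted BED file with standard Unix tools. It treats reads with the same chromosome, start, end, and strand as duplicates. The exact layout expected by -H and -V is not described here, so the one-count-per-line column and the two-column histogram (duplicate count, then number of distinct reads with that count) are assumptions to check against the preseq manual. The sample data and file names are hypothetical.

```shell
# Hypothetical sorted BED file: one read seen twice, two reads seen once.
workdir=$(mktemp -d)
printf 'chr1\t90\t140\tr3\t0\t-\nchr1\t100\t150\tr1\t0\t+\nchr1\t100\t150\tr2\t0\t+\nchr2\t5\t55\tr4\t0\t+\n' > "$workdir/input.sort.bed"

# Column of duplicate counts: one line per distinct read (chromosome, start,
# end, strand), holding the number of times that read was observed.
cut -f 1,2,3,6 "$workdir/input.sort.bed" | sort | uniq -c | awk '{print $1}' > "$workdir/counts.txt"

# Histogram of duplicate counts (assumed layout): column 1 is a duplicate
# count j, column 2 is the number of distinct reads observed exactly j times.
sort -n "$workdir/counts.txt" | uniq -c | awk '{print $2 "\t" $1}' > "$workdir/hist.txt"

cat "$workdir/hist.txt"
```

Under these assumptions, hist.txt would be passed with the -H flag and counts.txt with the -V flag in place of the sorted BED file.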

For fast estimates, lc_extrap can be run in quick mode with the option -Q. This predicts the complexity without bootstrapping, which significantly reduces the computation time but produces no confidence intervals.
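Quick mode lends itself to screening several libraries in one pass. The loop below is only a sketch: the helper quick_cmd, the library file names, and the .yield.txt output naming are hypothetical, and it uses only the options described above (-Q, -B, and -o). Each command is printed, and is run only when preseq is on the PATH and the input file exists.

```shell
# Hypothetical helper: build the quick-mode lc_extrap command for one library.
quick_cmd() {
    input="$1"
    flags="-Q"
    case "$input" in
        *.bam) flags="$flags -B" ;;   # sorted BAM input needs -B
    esac
    # Output name: the input name with its last extension replaced.
    echo "preseq lc_extrap $flags -o ${input%.*}.yield.txt $input"
}

# Hypothetical library files; names must not contain spaces, because the
# command string is split on whitespace when it is run.
for f in libA.sort.bed libB.sort.bam; do
    cmd=$(quick_cmd "$f")
    echo "$cmd"
    if command -v preseq >/dev/null 2>&1 && [ -f "$f" ]; then
        $cmd
    fi
done
```

Libraries that look promising in the quick screen can then be rerun without -Q to obtain confidence intervals.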

For more information, see the preseq manual or the paper, published in the April 2013 issue of Nature Methods. Questions, comments, advice, or bug reports can be sent to Timothy Daley at tdaley@usc.edu. Thank you for using preseq.