ArthropodEST: K-State Bioinformatics EST analysis pipeline

ArthropodEST is an expressed sequence tag (EST) analysis pipeline (not limited to arthropods) developed and hosted at the Bioinformatics Center, Kansas State University.
ArthropodEST is supported by the Arthropod Genomics Center, Kansas State University.

Research group: Sanjay Chellapilla, Profs. Yoonseong Park, Doina Caragea, Susan J. Brown.
Design, development, implementation and maintenance: Sanjay Chellapilla.

ArthropodEST: K-State Bioinformatics EST analysis pipeline (poster, MS PowerPoint)

The initial desktop-computer version of the pipeline (Jan 2008 - ) is here. It integrates "pre-pairing" (2008-2009) and, for paired-end/mate-pair reads, post-assembly single-linkage clustering of contigs and singletons based on read names (2010). The Beocat version of the pipeline (Sep 2010 - ), which runs on K-State's compute cluster, Beocat, is here. If one version is unavailable due to updates or maintenance, the other may be used.

Please contact Prof. Susan J. Brown (Director, Bioinformatics Center & Arthropod Genomics Center) and/or Sanjay Chellapilla for more information or with questions and comments. Thank you.

"Pre-pairing" of paired-end/mate-pair reads 2008-2009 [plain-text version]

Overlapping forward and reverse (5' and 3') EST reads of the same mate-pair, which could have assembled into one consensus sequence, are often not assembled at the high stringency that most contig-assembly programs use. As a result, the reads are either put into different contigs or remain as singletons. ESTs usually have low-quality sequence at the ends (within about 20% of each end) and significantly better quality towards the middle [1,2]. We found that the two reads of a mate-pair can be assembled ("pre-paired") into a consensus sequence at a lower stringency prior to downstream contig assembly, without losing quality.

We have developed a simple algorithmic procedure ("pre-pairing") that determines a consensus sequence for each mate-pair using the widely used BLAST program. The two reads of a mate-pair are aligned to each other using 'blastn' with an expect value (e-value) threshold of 1e-04, and the alignment overlap of the top hit (if any) is examined. If the overlap is at least 20 bases long *and* occurs completely within 20% of the ends of both reads of the pair, a non-degenerate consensus sequence is determined for the pair. If the alignment does not satisfy these thresholds (i.e., it is shorter than 20 bases *or* does not occur within 20% of the ends of both reads), the individual reads of the mate-pair remain as they are (not "pre-paired").
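The acceptance test for a candidate overlap can be sketched in Python as follows. This is a minimal illustration and not the pipeline's actual code. It assumes that the top blastn hit is available in 12-column tabular form, and it reads "within 20% of the ends" as "the aligned interval lies entirely inside the first or last 20% of the read", which is one possible interpretation. The names Hit, parse_tabular_hit, within_end and can_prepair are hypothetical. Building the consensus sequence itself is not shown.

```python
from dataclasses import dataclass

MIN_OVERLAP = 20      # minimum overlap length in bases (threshold from the text)
END_FRACTION = 0.20   # terminal fraction of each read (threshold from the text)


@dataclass
class Hit:
    """Coordinates of the top blastn hit (1-based, inclusive)."""
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_tabular_hit(line: str) -> Hit:
    """Parse one line of 12-column tabular BLAST output (assumed input format).

    Columns 7-10 hold the query start/end and subject start/end.
    """
    fields = line.rstrip("\n").split("\t")
    return Hit(int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]))


def within_end(start: int, end: int, read_len: int) -> bool:
    """True if the interval lies entirely in the first or last 20% of the read.

    Assumption: this is our reading of "within 20% of the ends".
    """
    lo, hi = min(start, end), max(start, end)  # minus-strand hits have start > end
    margin = END_FRACTION * read_len
    return hi <= margin or lo > read_len - margin


def can_prepair(hit: Hit, qlen: int, slen: int) -> bool:
    """Apply both thresholds: overlap length and position near the read ends."""
    q_overlap = abs(hit.qend - hit.qstart) + 1
    s_overlap = abs(hit.send - hit.sstart) + 1
    long_enough = min(q_overlap, s_overlap) >= MIN_OVERLAP
    return (long_enough
            and within_end(hit.qstart, hit.qend, qlen)
            and within_end(hit.sstart, hit.send, slen))
```

For two reads of 500 bases each, a hit covering bases 451-500 of the forward read and 500-451 of the reverse read passes both thresholds under this interpretation, while a hit covering bases 200-300 fails the position check and the reads would be left unpaired.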

References

  1. Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform 2007;8(1):6-21. First published online May 23, 2006. doi:10.1093/bib/bbl015
    http://bib.oxfordjournals.org/content/8/1/6.full
  2. Aaronson JS, Eckman B, Blevins RA, et al. Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res 1996;6:829-45.
    http://genome.cshlp.org/content/6/9/829.abstract


Sanjay Chellapilla / sanjayc at ksu dot edu
K-State Bioinformatics Center
last updated Mar 2012