ArthropodEST is an EST analysis pipeline (not limited to arthropods) developed and hosted at
the Bioinformatics Center, Kansas State University.
ArthropodEST is supported by the Arthropod Genomics Center, Kansas State University.
Research group: Sanjay Chellapilla, Profs. Yoonseong Park, Doina Caragea, Susan J. Brown.
Design, development, implementation and maintenance: Sanjay Chellapilla.
ArthropodEST: K-State Bioinformatics EST analysis pipeline (poster, MS-Powerpoint)
The initial desktop-computer version (Jan 2008 - ) of the pipeline, integrating "pre-pairing" (2008-2009) and post-assembly single-linkage clustering of contigs and singlets for paired-end/mate-pair reads based on read-names (2010), is here. The Beocat version (Sep 2010 - ) of the pipeline running on K-State's compute-cluster, Beocat, is here. If one version is unavailable due to updates/maintenance, the other may be used.
Please contact Prof. Susan J. Brown (Director, Bioinformatics Center & Arthropod Genomics Center) and/or Sanjay Chellapilla for more information, questions, comments etc. Thank you.
Overlapping forward and reverse (5' and 3') EST reads of the same mate-pair, which could have been assembled into one consensus sequence, are often left unassembled at the high stringency that most contig-assembly programs use. The reads are then either placed in different contigs or remain as singletons. ESTs usually have low-quality sequence at the ends (within about 20% from each end) and significantly better quality towards the middle [1,2]. We found that the two reads of a mate-pair can be assembled ("pre-paired") into a consensus sequence at a lower stringency prior to downstream contig-assembly, without losing quality.
We have developed a simple algorithmic procedure ("pre-pairing") to determine a consensus sequence for each mate-pair using the widely used BLAST program. The two reads of a mate-pair are aligned to each other using 'blastn' with an expect value (e-value) of 1e-04, and the alignment overlap of the top hit (if any) is examined. If the overlap is at least 20 bases long *and* occurs completely within 20% of the ends of both reads of the pair, a non-degenerate consensus sequence is determined for the pair. If the alignment does not satisfy these thresholds (i.e., it is shorter than 20 bases *or* does not occur within 20% of the ends of both reads), the individual reads of the mate-pair remain as they are (not "pre-paired").
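The threshold test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (as in BLAST tabular output), the overlap length is taken as the shorter of the two aligned spans, and "completely within 20% of the ends" is read as "the aligned interval lies entirely inside the first or last 20% of each read". Other readings of that phrase are possible.

```python
def within_end_region(start, end, read_len, end_fraction=0.20):
    """Return True if the aligned interval lies entirely inside the first
    or the last `end_fraction` of the read (assumed interpretation)."""
    # BLAST reports minus-strand hits with start > end, so normalise first.
    lo, hi = min(start, end), max(start, end)
    margin = end_fraction * read_len
    return hi <= margin or lo > read_len - margin


def can_pre_pair(q_start, q_end, q_len, s_start, s_end, s_len,
                 min_overlap=20, end_fraction=0.20):
    """Apply the two thresholds from the text to the top hit of a
    read-versus-read alignment (hypothetical helper)."""
    # Assumption: overlap length is the shorter of the two aligned spans.
    overlap_len = min(abs(q_end - q_start), abs(s_end - s_start)) + 1
    if overlap_len < min_overlap:
        return False
    return (within_end_region(q_start, q_end, q_len, end_fraction)
            and within_end_region(s_start, s_end, s_len, end_fraction))
```

For two 500-base reads whose last 51 bases align to each other, `can_pre_pair(450, 500, 500, 500, 450, 500)` passes both tests, whereas an 11-base overlap or an overlap in the middle of a read does not.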
References
The forward and reverse paired-end/mate-pair sequences in the input FASTA file *must* be labeled consistently and clearly (in the FASTA label/identifier line) with names ending in a direction suffix, i.e., ".forward-read suffix" or ".reverse-read suffix", including the "." (dot). For reads of the same mate-pair, the part of the name before the direction suffix must be identical. The direction suffixes are chosen by the user and can be any string of one or more alphanumeric characters.
For example, using the direction suffixes ".fwd" and ".rev", reads could be labeled as follows:
>read1.fwd
sequence
>read1.rev
sequence
>read2.fwd
sequence
>read2.rev
sequence
...
or, using direction suffixes ".F" and ".R", the reads could be labeled
>read1.F
sequence
>read1.R
sequence
>read2.F
sequence
>read2.R
sequence
...
and so on.
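The naming convention above makes it straightforward to match the two reads of a mate-pair by name. The following is a minimal sketch of that matching, assuming the identifier is the first whitespace-delimited token of the label line; the function name `group_mate_pairs` and its return format are hypothetical and are not part of ArthropodEST.

```python
from collections import defaultdict


def group_mate_pairs(labels, fwd_suffix="fwd", rev_suffix="rev"):
    """Group FASTA label lines into mate-pairs by the name that precedes
    the direction suffix. Returns (pairs, unpaired)."""
    by_base = defaultdict(dict)
    unpaired = []
    for label in labels:
        tokens = label.lstrip(">").split()
        if not tokens:
            continue  # skip empty label lines
        name = tokens[0]
        # Split on the last dot: "read1.fwd" -> ("read1", ".", "fwd")
        base, dot, suffix = name.rpartition(".")
        if dot and base and suffix == fwd_suffix:
            by_base[base]["fwd"] = name
        elif dot and base and suffix == rev_suffix:
            by_base[base]["rev"] = name
        else:
            unpaired.append(name)  # no recognised direction suffix
    pairs = {}
    for base, reads in by_base.items():
        if len(reads) == 2:
            pairs[base] = (reads["fwd"], reads["rev"])
        else:
            unpaired.extend(reads.values())  # mate is missing
    return pairs, unpaired
```

Called with the ".F"/".R" labels from the example, `group_mate_pairs(labels, "F", "R")` maps each shared name such as `read1` to its forward and reverse read, and lists any read without a mate or without a direction suffix separately.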
Paired-end/mate-pair sequences that were pre-paired have a FASTA label/identifier line ending in ".pre-paired". Those that were not pre-paired retain their original labels from the input file.
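Because pre-paired sequences are marked only by the ending of their label line, an output file can be summarised by inspecting the labels. The sketch below assumes the label line ends exactly with ".pre-paired" (ignoring trailing whitespace); the function name and the read names used with it are hypothetical.

```python
def split_by_prepaired(lines, marker=".pre-paired"):
    """Separate FASTA labels into pre-paired and unchanged ones,
    based only on how the label line ends."""
    prepaired, unchanged = [], []
    for line in lines:
        if line.startswith(">"):
            label = line[1:].strip()
            if label.endswith(marker):
                prepaired.append(label)
            else:
                unchanged.append(label)
    return prepaired, unchanged
```

For a gzip-compressed FASTA file, the lines can be supplied by opening the file in text mode with Python's standard `gzip.open(path, "rt")`.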
An example FASTA file of paired-end / mate-pair sequences is here.
The gzip-compressed output FASTA file of pre-paired sequences, produced after preliminary sequence cleaning/trimming and vector/contaminant screening of the example file above, is here.
An example analysis using the example file above is here.
Sanjay Chellapilla / sanjayc at ksu dot edu
K-State Bioinformatics Center
last updated Mar 2012