Research group: Sanjay Chellapilla, Yoonseong Park, Doina Caragea, Susan Brown.
Kansas State University, Manhattan KS

-----------------------------------------------------
"Pre-pairing" of paired-end/mate-pair reads 2008-2009
-----------------------------------------------------

Overlapping forward-reverse (5' and 3') EST reads of the same mate-pair that
could have assembled together into one consensus sequence are often not
assembled at the high stringency that most contig-assembly programs use.
As a result, the two reads are either put into different contigs or remain
as singletons.

ESTs usually have low-quality sequence at the ends (within about 20% of the
read length from each end) and significantly better quality towards the
middle [1,2]. We found that the two reads of a mate-pair can be assembled
("pre-paired") into a consensus sequence at a lower stringency prior to
downstream contig-assembly, without losing quality.

We have developed a simple algorithmic procedure ("pre-pairing") to
determine a consensus sequence for each mate-pair, using the well-known,
widely used BLAST program. The two reads of a mate-pair are aligned to each
other using 'blastn' with an expect value (e-value) of 1e-04, and the
alignment overlap of the top hit (if any) is examined to determine the
non-degenerate consensus sequence.

If the overlap is at least 20 bases long *and* occurs completely within 20%
of the ends of both reads of the pair, then a non-degenerate consensus
sequence for the pair is determined. If the alignment doesn't satisfy these
thresholds (i.e., it is not at least 20 bases long *or* does not occur
within 20% of the ends of both reads), the individual reads in the mate-pair
remain as is (not "pre-paired").

References:

[1] Nagaraj SH, Gasser RB, Ranganathan S. A hitchhiker's guide to expressed
    sequence tag (EST) analysis.
    Brief Bioinform 2007;8(1):6-21. First published online May 23, 2006.
    doi:10.1093/bib/bbl015
    http://bib.oxfordjournals.org/content/8/1/6.full

[2] Aaronson JS, Eckman B, Blevins RA, et al. Toward the development of a
    gene index to the human genome: an assessment of the nature of
    high-throughput EST sequence data. Genome Res 1996;6:829-45.
    http://genome.cshlp.org/content/6/9/829.abstract

------------------------
Input FASTA file format:
------------------------

The forward-reverse mate-pair sequences in the input FASTA file *must* be
consistently and clearly labeled (in the FASTA label/identifier line) with
names ending in a direction suffix, i.e., a ".forward-read suffix" or a
".reverse-read suffix", including the "." (dot). For reads of the same
mate-pair, the part of the name before the direction suffix must be
identical. The forward and reverse direction suffixes are chosen by the
user and can be any string of one or more alphanumeric characters.

For example, using the direction suffixes ".fwd" and ".rev", reads could be
labeled as follows:

>read1.fwd
sequence
>read1.rev
sequence
>read2.fwd
sequence
>read2.rev
sequence
...

or, using the direction suffixes ".F" and ".R", the reads could be labeled

>read1.F
sequence
>read1.R
sequence
>read2.F
sequence
>read2.R
sequence
...

and so on.

----------------------------
Output pre-paired sequences:
----------------------------

Mate-pair sequences that were pre-paired have a FASTA label/identifier line
ending in ".pre-paired". Those that were not pre-paired retain their
original labels from the input file.

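The labeling rules above can be illustrated with a small Python sketch. It
is not the pipeline's own code: the function names (split_label,
group_mate_pairs, is_pre_paired) are hypothetical, and it assumes that each
label line holds only the read name, optionally preceded by ">".

```python
from collections import defaultdict


def split_label(label, fwd_suffix, rev_suffix):
    """Split a FASTA label into (pair_name, direction).

    Returns None when the label does not end in either direction suffix.
    The suffixes include the leading dot, e.g. ".fwd" and ".rev".
    """
    name = label.lstrip(">").strip()
    for suffix, direction in ((fwd_suffix, "forward"), (rev_suffix, "reverse")):
        # Require a non-empty pair name in front of the suffix.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], direction
    return None


def group_mate_pairs(labels, fwd_suffix=".fwd", rev_suffix=".rev"):
    """Map each pair name to its forward and reverse labels.

    Only names that have both a forward and a reverse read are returned.
    """
    pairs = defaultdict(dict)
    for label in labels:
        parsed = split_label(label, fwd_suffix, rev_suffix)
        if parsed is not None:
            pair_name, direction = parsed
            pairs[pair_name][direction] = label
    return {name: mates for name, mates in pairs.items() if len(mates) == 2}


def is_pre_paired(label):
    """True if an output label marks a pre-paired consensus sequence."""
    return label.strip().endswith(".pre-paired")
```

For instance, group_mate_pairs([">read1.F", ">read1.R"], ".F", ".R") groups
both reads under the pair name "read1".
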
------------------
Examples and usage
------------------

"Pre-pairing" has been integrated into, and is available in, the
ArthropodEST analyses pipeline at
http://arthropodest.ksu.edu/e/

An example FASTA file of paired-end / mate-pair sequences is at
http://arthropodest.ksu.edu/examples/p.fsa

The gzip-compressed output FASTA file of pre-paired sequences, produced
after preliminary sequence cleaning/trimming and vector/contaminant
screening of the example file above, is at
http://arthropodest.ksu.edu/examples/test-pp-pa-cl.T7Uwt09YOrHrCB8Mp34MwKVm/p.fsa.clean.pre-paired.fsa.gz

An example analysis using the example file above is at
http://arthropodest.ksu.edu/examples/test-pp-pa-cl.T7Uwt09YOrHrCB8Mp34MwKVm/test-pp-pa-cl.report.txt
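
The acceptance test described in the "Pre-pairing" section at the top of
this file (overlap of at least 20 bases, lying within 20% of the ends of
both reads) can be sketched in Python as follows. This is only one literal
reading of that description, not the pipeline's implementation. The
function and parameter names are hypothetical, the coordinates are assumed
to be the 1-based, inclusive positions of the top blastn hit, and the
overlap length is assumed to be the shorter of the two aligned spans.

```python
def within_end_fraction(start, end, read_length, fraction=0.20):
    """True if the aligned interval lies entirely in the first or the last
    `fraction` of the read.

    Coordinates are 1-based and inclusive. They may be given in either
    order, since blastn reports start > end for minus-strand hits.
    """
    lo, hi = min(start, end), max(start, end)
    margin = fraction * read_length
    in_leading_end = hi <= margin
    in_trailing_end = lo > read_length - margin
    return in_leading_end or in_trailing_end


def should_pre_pair(q_start, q_end, q_len, s_start, s_end, s_len,
                    min_overlap=20, end_fraction=0.20):
    """Apply both thresholds to one alignment between the two mates.

    q_* describe one read of the pair and s_* the other.
    """
    # Assumption: measure the overlap by the shorter of the two spans.
    overlap_len = min(abs(q_end - q_start), abs(s_end - s_start)) + 1
    if overlap_len < min_overlap:
        return False
    return (within_end_fraction(q_start, q_end, q_len, end_fraction)
            and within_end_fraction(s_start, s_end, s_len, end_fraction))
```

Under this reading, a 50-base overlap covering positions 451-500 of two
500-base reads passes both thresholds, while a 10-base overlap, or an
overlap in the middle of either read, leaves the pair as two separate reads.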