TXGP RNAseq assembly

From Marcotte Lab


Raw data used for TXGP RNA-seq assembly

Dataset | Contributor | Samples | Reads | Assembled Tx (raw) | X. laevis genes | X. tropicalis genes | H. sapiens genes
Amin201106_XENLA | Nirav Amin, Frank Conlon (UNC) | 2 (no rep) | 75bp, single; 61M total | ~591k | 13,523 | 10,225 | 11,540
Park201106_XENLA | Tae Joo Park, Richard Harland (UC Berkeley) | 5 (no rep) | 50bp, single; 500M total | ~1,480k | 14,890 | 12,648 | 13,328
 | | (1x2 rep) | 100bp, paired; 400M total | ~1,677k | 14,441 | 12,482 | 12,986
Chung201110_XENLA | Meii Chung, John Wallingford (UT Austin) | 4 (2x2 rep) | 50bp, paired; 222M total | ~600k | 11,198 | 7,871 | 9,134
Quigley201112_XENLA | Ian Quigley, Christopher R. Kintner (Salk Institute) | 9 (unknown rep) | 311M total | ~647k | 13,291 | 10,790 | 11,383
Jarikji201201_XENLA | Zeina Jarikji, Marko Horb (MBL) | 9 (3x3 rep) | 932M total | ~3,254k | 14,613 | 12,342 | 13,218
TeperekTkacz201202_XENLA | Marta Teperek-Tkacz, John Gurdon (Gurdon Institute) | 1 (no rep) | 200M total | ~436k | 13,838 | 10,559 | 11,409
  • Assembled Tx (raw) is the number of peptide query sequences used for the BLAST search.


  • Filter out reads that contain a no-call (N) base.
  • Trim the 5' or 3' end where necessary.
  • For paired-end libraries, compile the read pairs in which neither mate was filtered out.
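The read filtering steps above can be sketched in Python as follows. This is a minimal illustration and not the lab's actual script: the function names, the (name, sequence, quality) tuple layout, and the trim lengths are all assumptions, and a no-call is assumed to be encoded as N.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical in-memory read representation: (name, sequence, quality).
Read = Tuple[str, str, str]


def has_no_call(seq: str) -> bool:
    """Return True if the read contains at least one no-call base (assumed 'N')."""
    return "N" in seq.upper()


def trim(read: Read, trim5: int = 0, trim3: int = 0) -> Read:
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end."""
    name, seq, qual = read
    end = len(seq) - trim3
    return name, seq[trim5:end], qual[trim5:end]


def filter_single(reads: Iterable[Read]) -> Iterator[Read]:
    """Keep single-end reads that have no no-call base."""
    return (r for r in reads if not has_no_call(r[1]))


def filter_pairs(pairs: Iterable[Tuple[Read, Read]]) -> Iterator[Tuple[Read, Read]]:
    """Keep a pair only if neither mate contains a no-call base."""
    for r1, r2 in pairs:
        if not has_no_call(r1[1]) and not has_no_call(r2[1]):
            yield r1, r2
```

Dropping the whole pair when either mate fails keeps the two mates in sync, which paired-end assemblers generally expect.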

Tx Assembly

  • We currently use the Velvet + Oases pipeline with several k-mer sizes (25, 29, 33, 37, 41, 45).
  • After the first-round assembly, run a second-round assembly that merges the contigs from each k-mer size, using k-mer size 33.
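As a rough sketch of the two rounds, the Python below only builds the command lines; it does not run them. The directory names (k25, ..., k33_merged) and the input file name are hypothetical, paired reads are assumed to be interleaved in one FASTQ file, and the flags follow the multi-k merge recipe in the Oases manual, so they should be checked against the installed versions.

```python
from typing import List

KMERS = [25, 29, 33, 37, 41, 45]  # first-round k-mer sizes from the text
MERGE_K = 33                      # k-mer size for the second (merge) round


def first_round_commands(reads_fastq: str, kmers=KMERS, paired: bool = False) -> List[List[str]]:
    """Build velveth/velvetg/oases commands, one assembly per k-mer size."""
    read_type = "-shortPaired" if paired else "-short"
    cmds = []
    for k in kmers:
        out = f"k{k}"  # hypothetical output directory name
        cmds.append(["velveth", out, str(k), "-fastq", read_type, reads_fastq])
        cmds.append(["velvetg", out, "-read_trkg", "yes"])
        cmds.append(["oases", out])
    return cmds


def merge_commands(kmers=KMERS, merge_k: int = MERGE_K, out_dir: str = "k33_merged") -> List[List[str]]:
    """Build the second-round commands that merge the per-k transcripts."""
    inputs = [f"k{k}/transcripts.fa" for k in kmers]
    return [
        ["velveth", out_dir, str(merge_k), "-long", *inputs],
        ["velvetg", out_dir, "-read_trkg", "yes", "-conserveLong", "yes"],
        ["oases", out_dir, "-merge", "yes"],
    ]
```

Each command list can be passed to subprocess.run(cmd, check=True) once Velvet and Oases are on the PATH.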

Post-processing with orthology

  • Translate the k33_merged assembled transcripts into peptides with the standard codon table. Take the longest peptide sequence from the 6-frame translation.
  • BLAST the peptides against model organism protein sequences:
    • EnsEMBL-63: HUMAN, MOUSE, DANRE (zebrafish), XENTR (X. tropicalis)
    • XenBase: XENLA (2011-dec version)
  • Filter the BLAST hits, keeping only those that satisfy the following conditions:
    • E-value < 0.01
    • Alignment fraction (aligned length / min(len(query_seq), len(target_seq))) > 0.50
    • len(query_seq)/len(target_seq) > 0.50 (to get rid of short peptides from assembled Tx)
  • Group the sequences by model organism sequence; each group is a putative ortho-group.
  • Do multiple sequence alignment of the ortho-groups with MUSCLE.
  • Based on the second-iteration tree generated by MUSCLE, select representative sequences per cluster (up to 6 sequences per group).
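The translation, hit filtering, and grouping steps above might look like the following in Python. This is a sketch under stated assumptions rather than the pipeline's code: "longest peptide" is taken to mean the longest stop-free stretch across the six frames, the three thresholds are read as conditions a hit must meet to be kept, and the hit dictionary keys are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard codon table (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(seq: str) -> str:
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_peptide(seq: str) -> str:
    """Longest stop-free peptide over all six frames (assumed definition)."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    candidates = []
    for strand in (seq, rev):
        for frame in range(3):
            candidates.extend(translate(strand[frame:]).split("*"))
    return max(candidates, key=len)


def passes_filter(evalue: float, aligned_len: int, query_len: int, target_len: int) -> bool:
    """Apply the three hit conditions listed above."""
    return (evalue < 0.01
            and aligned_len / min(query_len, target_len) > 0.50
            and query_len / target_len > 0.50)


def group_by_target(hits):
    """Collect passing queries per model organism sequence (putative ortho-groups)."""
    groups = defaultdict(list)
    for h in hits:  # h is a dict with hypothetical keys, see lead-in
        if passes_filter(h["evalue"], h["aligned_len"], h["query_len"], h["target_len"]):
            groups[h["target"]].append(h["query"])
    return dict(groups)
```

Each resulting group could then be written to a FASTA file and passed to MUSCLE for the alignment step.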

(Further steps are under development.)