Difference between revisions of "Xenopus Genome Project"

From Marcotte Lab
m (moved Texas Xenopus Genome Project to Xenopus Genome Project: It becomes much more popular than we expect.)
(Web server)
 
(45 intermediate revisions by 3 users not shown)
Line 1: Line 1:
[[File:Xenopus-PV.jpg||100px|left]] <br>
+
[[File:Xenopus-PV.jpg||200px|left]]
''Xenopus laevis'' is an essential model organism in several areas of biology. In addition to the key attributes of these embryos for ''in vivo'' imaging, cell-free extracts from ''Xenopus'' provide among the most powerful ''in vitro'' systems for studies of cell and molecular biology. A complete sequence of the ''X. laevis'' genome is an essential resource for accurate identification of peptides for mass-spec analyses, for cloning of an ORFeome, for identifying evolutionarily conserved regulatory regions, and for design of morpholino-oligonucleotides for gene knockdowns.
+
''Xenopus laevis'' is an essential model organism in several areas of biology (see [http://www.cell.com/trends/genetics/abstract/S0168-9525(11)00136-3 Harland & Grainger, TIG (2011)] for review). In addition to the key attributes of these embryos for ''in vivo'' imaging, cell-free extracts from ''Xenopus'' provide among the most powerful ''in vitro'' systems for studies of cell and molecular biology. A complete sequence of the ''X. laevis'' genome is an essential resource for accurate identification of peptides for mass-spec analyses, for cloning of an ORFeome, for identifying evolutionarily conserved regulatory regions, and for design of morpholino-oligonucleotides for gene knockdowns.
  
The [http://www.bio.utexas.edu/faculty/wallingford/ Wallingford] and Marcotte labs have obtained funding from the [http://www.ti3d.utexas.edu/ Texas Institute for Drug and Diagnostic Development] (TI3D), in coordination with projects funded by the National Institutes of Health, to begin sequencing of the ''X. laevis'' genome. We are primarily working with [https://wikis.utexas.edu/display/GSAF/About+Us Scott Hunicke-Smith] at the [https://wikis.utexas.edu/display/GSAF/Home+Page University of Texas Genome Sequencing and Analysis facility], with funding sufficient for ~20x coverage of the ''X. laevis genome'' using ABI SOLiD next-generation sequencing.  
+
The [http://www.bio.utexas.edu/faculty/wallingford/ Wallingford] and Marcotte labs obtained funding from the [http://www.ti3d.utexas.edu/ Texas Institute for Drug and Diagnostic Development] (TI3D), in conjunction with projects funded by the National Institutes of Health, to begin sequencing of the ''X. laevis'' genome. We began the project with [https://wikis.utexas.edu/display/GSAF/About+Us Scott Hunicke-Smith] at the [https://wikis.utexas.edu/display/GSAF/Home+Page University of Texas Genome Sequencing and Analysis facility], with funding sufficient for ~20x coverage of the ''X. laevis genome'' using ABI SOLiD next-generation sequencing.
 +
 
 +
The project rapidly expanded to include ''de novo'' reconstruction of ''X. laevis'' transcripts, in collaboration with groups around the world donating Illumina Hi-Seq RNA sequencing datasets, coordinating these efforts with genome sequencing by the [http://mcb.berkeley.edu/labs/harland/ Harland] and [http://mcb.berkeley.edu/index.php?option=com_mcbfaculty&name=rokhsard Rokhsar] groups at UC Berkeley and with [http://www.biol.s.u-tokyo.ac.jp/english/labs/tairamasanori_lab-e.html Taira] and collaborators at the University of Tokyo, Japan.  We're posting our intermediate datasets here in advance of publication for use by the wider community. See [[Xenopus_Genome_Project_Consortium]] page for the members & contributors of the project.  
  
= Assembled transcripts from RNA-seq data =
 
 
{|border="1"
 
{|border="1"
 
|
 
|
Line 13: Line 14:
 
|}
 
|}
  
 +
If you have any question about this data in general, please contact [mailto:taejoon.kwon@marcottelab.org Taejoon Kwon].
  
If you have any question about this data in general, please contact to [mailto:taejoon.kwon@marcottelab.org Taejoon Kwon].
+
= Genome =
 +
You can download ''X. laevis'' genomes at [ftp://ftp.xenbase.org/pub/Genomics/JGI/ XenBase FTP site]. Alternatively, you can download them at [http://daudin.icmb.utexas.edu/pub/genome/ UTA 'daudin' web repository]. The current version is named [ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla7.1/ 'JGI 7.1'] at XenBase, and 'JGIv7b' at UTA web repository('daudin'). See [[XENLA_Genome]] page for detailed information about each version.
  
== TXGP201107_XENLA_EGG ==
+
* The main difference between the XenBase/JGI genome and UTA genome is the scaffold name. For example, I renamed 'Scaffold102974' in XenBase/JGI 7.1 genome to 'JGIv7b.000102974'. I converted original scaffold name to prevent confusion in comparison between different versions of the draft genome, because 'Scaffold102974' at XenBase/JGI 6.1 genome is different to 'Scaffold102974' at XenBase/JGI 7.2 genome.
Contact info:[mailto:edward.marcotte@gmail.com Edward Marcotte],[mailto:taejoon.kwon@marcottelab.org Taejoon Kwon].
+
<pre>XENLA_JGIv6a.seqlen:JGIv6a.000102974 325
 +
XENLA_JGIv7b.seqlen:JGIv7b.000102974 21560636</pre>
  
* [[xdata:/tx/TXGP201107_XENLA_EGG.pub_all.fasta]] (37,470 sequences)
+
= Web server =
** 9,780 pub_all sequences are mapped to 6,705 v2_ref sequences.
+
** 13,362 pub_all sequences are mapped to 9,920 v3_ref sequences.
+
  
* [[xdata:/tx/TXGP201107_XENLA_EGG.pub_long.fasta]] (20,005 sequences; only > 400 bp)
+
"Development version" of Xenopus gene annotation, with [http://jbrowse.org/ JBrowse] genome browser (Here is [[Xenopus_Genome_Browser|short intro]] if you want to mirror this site, or host your own Xenopus genome browser).
** 7,309 pub_long sequences are mapped to 6,082 v2_ref sequences.
+
* ''Xenopus laevis'' - http://daudin.icmb.utexas.edu/XENLA_JGIv72/
** 9,439 pub_long sequences are mapped to 8,137 v3_ref sequences.
+
* ''Xenopus tropicalis'' - http://daudin.icmb.utexas.edu/XENTR_JGIv80/
  
== Park201106_XENLA ==
+
Official gene annotation/genome browser is served at XenBase.
(Courtesy of Tae Joo Park & Richard Harland, University of California at Berkeley)
+
* ''Xenopus laevis'' - http://gbrowse.xenbase.org/fgb2/gbrowse/xl7_1/
Contact info: [mailto:tjpark01@gmail.com Tae Joo Park], [mailto:harland@berkeley.edu Richard Harland].  
+
* ''Xenopus tropicalis'' - http://gbrowse.xenbase.org/fgb2/gbrowse/xt8_0/
  
* [[xdata:/tx/Park201106_XENLA.pub_all.fasta]] (109,667 sequences)
+
Old websites for UT Austin gene model
** 41,847 pub_all sequences are mapped to 7,522 v2_ref sequences.  
+
* http://daudin.icmb.utexas.edu/Mayball/ - UT Austin "MayBall" gene model
** 59,419 pub_all sequences are mapped to 16,332 v3_ref sequences.  
+
* http://daudin.icmb.utexas.edu/Oktoberfest/ - UT Austin "Oktoberfest" gene model
  
* [[xdata:/tx/Park201106_XENLA.pub_long.fasta]] (19,716 sequences)
+
= Assembled transcripts =
** 10,790 pub_long sequences are mapped to 5,283 v2_ref sequences.
+
* [[XENLA_GeneModel2012]] - raw sequences from individual projects (before releasing Oktoberfest)
** 14,007 pub_long sequences are mapped to 7,977 v3_ref sequences.
+
* [[XENLA_Oktoberfest]] - released on October, 2012 (code name "Oktoberfest")
 
+
* [[XENLA_Mayball]] - released on May, 2013 (code name "MayBall")
== Amin201106_XENLA ==
+
(Courtesy of Nirav Amin & Frank Conlon, University of North Carolina at Chapel Hill)
+
Contact info: [mailto:nmamin@email.unc.edu Nirav Amin], [mailto:frank_conlon@med.unc.edu Frank Conlon]
+
<b>Only control samples are used for this assembly.</b>
+
 
+
* [[xdata:/tx/Amin201106_XENLA.pub_all.fasta]] (106,189 sequences)
+
** 26,154 pub_all sequences are mapped to 8,312 v2_ref sequences.
+
** 38,535 pub_all sequences are mapped to 17,135 v3_ref sequences.
+
 
+
* [[xdata:/tx/Amin201106_XENLA.pub_long.fasta]] (27,252 sequences)
+
** 10,893 pub_long sequences are mapped to 6,757 v2_ref sequences.
+
** 15,342 pub_long sequences are mapped to 10,955 v3_ref sequences.
+
 
+
== Chung201110_XENLA ==
+
(Courtesy of Meii Chung & John Wallingford, University of Texas at Austin)
+
Contact info: [mailto:meii@utexas.edu Meii Chung], [mailto:wallingford@mail.utexas.edu John Wallingford]
+
* [[xdata:/tx/Chung201110_XENLA.pub_all.fasta]] (109,258 sequences)
+
** 31,287 pub_all sequences are mapped to 7,126 v2_ref sequences.
+
** 44,577 pub_all sequences are mapped to 15,240 v3_ref sequences.
+
 
+
* [[xdata:/tx/Chung201110_XENLA.pub_long.fasta]] (20,682 sequences)
+
** 4,817 pub_long sequences are mapped to 8,614 v2_ref sequences.
+
** 7,680 pub_long sequences are mapped to 11,735 v3_ref sequences.
+
  
 
= CHORI-219 BAC sequencing =
 
= CHORI-219 BAC sequencing =
We have started the first runs by sequencing 96 BACs from the [http://bacpac.chori.org/library.php?id=323 CHORI-219] library (vector: [http://www.sanger.ac.uk/Teams/Team53/psub/sequences/pbacgk.shtml pBACGK1.1]) at ~100X coverage. The selected BACs include ~70 genes of interest (Shroom3, Wnt5a, Glypican-4, Noggin, Gremlin, Pax6, Formin, etc., as initially identified by the group of [http://www.jgi.doe.gov/whoweare/cheng.html Jan-Fang Cheng] via probing the CHORI-219 library), as well as 10 BACs that have already been sequenced by the [http://www.jgi.doe.gov/ DOE Joint Genome Institute]/[http://www.hudsonalpha.org/genome-sequencing-center HudsonAlpha Genome Sequencing Center] to serve as positive controls for the sequencing and assembly pipeline.   
+
We started the first runs by sequencing 96 BACs from the [http://bacpac.chori.org/library.php?id=323 CHORI-219] library (vector: [http://www.sanger.ac.uk/Teams/Team53/psub/sequences/pbacgk.shtml pBACGK1.1]) at ~100X coverage. The selected BACs include ~70 genes of interest (Shroom3, Wnt5a, Glypican-4, Noggin, Gremlin, Pax6, Formin, etc., as initially identified by the group of [http://www.jgi.doe.gov/whoweare/cheng.html Jan-Fang Cheng] via probing the CHORI-219 library), as well as 10 BACs that have already been sequenced by the [http://www.jgi.doe.gov/ DOE Joint Genome Institute]/[http://www.hudsonalpha.org/genome-sequencing-center HudsonAlpha Genome Sequencing Center] to serve as positive controls for the sequencing and assembly pipeline.   
 
* CHORI-219 BACs: [[xdata:BACsFor1Percent.xls| List of 96 test BACs]] (MS Excel file)
 
* CHORI-219 BACs: [[xdata:BACsFor1Percent.xls| List of 96 test BACs]] (MS Excel file)
  
Line 73: Line 52:
 
This (very roughly) corresponds to >600X coverage by raw data, ~50X coverage by high quality data, of the BAC set.
 
This (very roughly) corresponds to >600X coverage by raw data, ~50X coverage by high quality data, of the BAC set.
 
* Given that we currently see better mapping of the shotgun SA09023 reads to ''X. tropicalis'' than to ''X. laevis'' (both to BACs and mRNAs), we're confirming the sample identity before continuing with whole genome sequencing. See the 'sanity check' [[/Species_Identification]] for details.
 
* Given that we currently see better mapping of the shotgun SA09023 reads to ''X. tropicalis'' than to ''X. laevis'' (both to BACs and mRNAs), we're confirming the sample identity before continuing with whole genome sequencing. See the 'sanity check' [[/Species_Identification]] for details.
 
= J-strain whole genome sequencing =
 
In addition, we are generating several mate pair libraries of different sizes from genomic DNA prepared by [http://tropicalis.yale.edu/ Mustafa Khokha] from J strain frogs obtained from [http://www.urmc.rochester.edu/web/index.cfm?event=doctor.profile.show&person_id=1001617&display=for_researchers Jacques Robert], sequencing each to multiple-fold coverage of the genome.
 
 
The primary data from this project will be made available as soon as possible for use by the community. We plan to periodically post reports on our progress below.
 
  
 
= See also =
 
= See also =
Line 83: Line 57:
  
 
= References =
 
= References =
 +
* [[XENLA_Genome]] - Current status of <i>Xenopus</i> genome
 
* [[TXGP_reference]] - Public resources compiled to be used in TXGP.
 
* [[TXGP_reference]] - Public resources compiled to be used in TXGP.
 
* [[TXGP_ens63_reference]] - Some statistics derived from EnsEMBL-63 (used as a reference in TXGP).
 
* [[TXGP_ens63_reference]] - Some statistics derived from EnsEMBL-63 (used as a reference in TXGP).

Latest revision as of 10:06, 14 October 2014


Xenopus laevis is an essential model organism in several areas of biology (see Harland & Grainger, TIG (2011) for review). In addition to the key attributes of its embryos for in vivo imaging, cell-free extracts from Xenopus provide some of the most powerful in vitro systems for studies of cell and molecular biology. A complete sequence of the X. laevis genome is an essential resource for accurate identification of peptides in mass-spec analyses, for cloning of an ORFeome, for identifying evolutionarily conserved regulatory regions, and for designing morpholino oligonucleotides for gene knockdowns.

The Wallingford and Marcotte labs obtained funding from the Texas Institute for Drug and Diagnostic Development (TI3D), in conjunction with projects funded by the National Institutes of Health, to begin sequencing of the X. laevis genome. We began the project with Scott Hunicke-Smith at the University of Texas Genome Sequencing and Analysis facility, with funding sufficient for ~20x coverage of the X. laevis genome using ABI SOLiD next-generation sequencing.

The project rapidly expanded to include de novo reconstruction of X. laevis transcripts, in collaboration with groups around the world who donated Illumina HiSeq RNA sequencing datasets. We coordinate these efforts with genome sequencing by the Harland and Rokhsar groups at UC Berkeley and with Taira and collaborators at the University of Tokyo, Japan. We're posting our intermediate datasets here in advance of publication for use by the wider community. See the Xenopus_Genome_Project_Consortium page for the members and contributors of the project.

Disclaimer

  • Data users may freely download and analyze the data. They may use the data in publications focused on individual genes.
  • Data users may use the data to analyze their own data, e.g. as a reference database for MS/MS proteomics and/or RNA-seq data.
  • Publication or presentation of global analyses of these sequences is not allowed until the 'data owner' has published the corresponding paper. As soon as the paper is accepted, we will post that information on this website.

If you have any questions about these data in general, please contact Taejoon Kwon.


Genome

You can download X. laevis genomes from the XenBase FTP site or, alternatively, from the UTA 'daudin' web repository. The current version is named 'JGI 7.1' at XenBase and 'JGIv7b' at the UTA web repository ('daudin'). See the XENLA_Genome page for detailed information about each version.

  • The main difference between the XenBase/JGI genome and the UTA genome is the scaffold naming. For example, I renamed 'Scaffold102974' in the XenBase/JGI 7.1 genome to 'JGIv7b.000102974'. I converted the original scaffold names to prevent confusion when comparing different versions of the draft genome, because 'Scaffold102974' in the XenBase/JGI 6.1 genome is different from 'Scaffold102974' in the XenBase/JGI 7.2 genome. The sequence lengths recorded under this number in two versions illustrate the problem:
XENLA_JGIv6a.seqlen:JGIv6a.000102974	325
XENLA_JGIv7b.seqlen:JGIv7b.000102974	21560636
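
The renaming convention described above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the UTA repository: the function names, the regular expression, and the nine-digit zero padding are assumptions inferred from the single example 'Scaffold102974' -> 'JGIv7b.000102974'.

```python
import re

# Assumed pattern for original JGI scaffold names, e.g. 'Scaffold102974'.
SCAFFOLD_RE = re.compile(r"^Scaffold(\d+)$")


def rename_scaffold(name, prefix="JGIv7b", width=9):
    """Return a version-tagged scaffold name (hypothetical helper).

    Names that do not look like 'Scaffold<number>' are returned unchanged.
    """
    match = SCAFFOLD_RE.match(name)
    if match is None:
        return name
    return f"{prefix}.{int(match.group(1)):0{width}d}"


def rename_fasta_headers(lines, prefix="JGIv7b"):
    """Yield FASTA lines with the sequence ID in each header renamed."""
    for line in lines:
        if line.startswith(">"):
            # Split the header into the sequence ID and an optional description.
            parts = line[1:].strip().split(None, 1)
            if parts:
                parts[0] = rename_scaffold(parts[0], prefix)
            yield ">" + " ".join(parts) + "\n"
        else:
            yield line


if __name__ == "__main__":
    print(rename_scaffold("Scaffold102974"))
```

Because the version tag is part of every name, the same scaffold number from two assemblies can no longer be mistaken for the same sequence when files are compared side by side.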

Web server

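The "development version" of the Xenopus gene annotation is served with the JBrowse genome browser (here is a short intro if you want to mirror this site or host your own Xenopus genome browser).

  • Xenopus laevis - http://daudin.icmb.utexas.edu/XENLA_JGIv72/
  • Xenopus tropicalis - http://daudin.icmb.utexas.edu/XENTR_JGIv80/

The official gene annotation and genome browser are served at XenBase.

  • Xenopus laevis - http://gbrowse.xenbase.org/fgb2/gbrowse/xl7_1/
  • Xenopus tropicalis - http://gbrowse.xenbase.org/fgb2/gbrowse/xt8_0/

Old websites for UT Austin gene models

  • http://daudin.icmb.utexas.edu/Mayball/ - UT Austin "MayBall" gene model
  • http://daudin.icmb.utexas.edu/Oktoberfest/ - UT Austin "Oktoberfest" gene model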

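Assembled transcripts

  • XENLA_GeneModel2012 - raw sequences from individual projects (before the Oktoberfest release)
  • XENLA_Oktoberfest - released in October 2012 (code name "Oktoberfest")
  • XENLA_Mayball - released in May 2013 (code name "MayBall")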

CHORI-219 BAC sequencing

We started the first runs by sequencing 96 BACs from the CHORI-219 library (vector: pBACGK1.1) at ~100X coverage. The selected BACs include ~70 genes of interest (Shroom3, Wnt5a, Glypican-4, Noggin, Gremlin, Pax6, Formin, etc., as initially identified by the group of Jan-Fang Cheng via probing the CHORI-219 library), as well as 10 BACs that have already been sequenced by the DOE Joint Genome Institute/HudsonAlpha Genome Sequencing Center to serve as positive controls for the sequencing and assembly pipeline.

See /XENLA_SA09023 for more details. Three mate-pair libraries were sequenced:

  • X_laevis_WG - the X. laevis whole genome library, 5kb insert size - about 4.4GB raw data, 0.4GB high quality data
  • X_laevis_2kb - the set of 96 BACs, with 2kb insert size - about 3.6GB raw data, 0.3GB high quality data
  • X_laevis_5kb - the set of 96 BACs, with 5kb insert size - about 2.8GB raw data, 0.2GB high quality data

This corresponds (very roughly) to >600X coverage of the BAC set by raw data and ~50X coverage by high quality data, combining the two BAC libraries.
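
As a rough check on these coverage figures, the arithmetic can be written out as below. This is a back-of-the-envelope sketch under two assumptions that are not stated on this page: that each BAC carries an insert of about 100 kb on average, and that "GB" can be read as roughly 10^9 sequenced bases. The function name and default values are illustrative only.

```python
def fold_coverage(total_bases, n_bacs=96, mean_insert_bp=100_000):
    """Estimate fold coverage as sequenced bases divided by target size.

    mean_insert_bp is an assumed average BAC insert size, not a measured value.
    """
    return total_bases / (n_bacs * mean_insert_bp)


GB = 1e9  # assumption: 1 GB of data is about 1e9 bases

# Sum the two BAC libraries (X_laevis_2kb and X_laevis_5kb) listed above.
raw_bases = (3.6 + 2.8) * GB
high_quality_bases = (0.3 + 0.2) * GB

print(f"raw: ~{fold_coverage(raw_bases):.0f}X")
print(f"high quality: ~{fold_coverage(high_quality_bases):.0f}X")
```

Under these assumptions the estimates land in the same range as the quoted >600X and ~50X; a different average insert size would shift both numbers proportionally.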

  • Given that we currently see better mapping of the shotgun SA09023 reads to X. tropicalis than to X. laevis (both to BACs and mRNAs), we're confirming the sample identity before continuing with whole genome sequencing. See the 'sanity check' /Species_Identification for details.

See also

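References

  • XENLA_Genome - Current status of the Xenopus genome
  • TXGP_reference - Public resources compiled to be used in TXGP.
  • TXGP_ens63_reference - Some statistics derived from EnsEMBL-63 (used as a reference in TXGP).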

Protocol