MSblender
MSblender is a statistical tool that merges peptide identification results from multiple database search engines using a multivariate modelling approach. We will present this work at RECOMB-CP 2011 in March 2011.
Authors
- Hyungwon Choi (hwchoi at umich dot edu)
- Taejoon Kwon (taejoon dot kwon at mail dot utexas dot edu)
Prerequisites
(We tested our code on Mac OS X (10.5 Leopard) and Ubuntu Linux (10.04 and later). MS Windows is not supported yet.) To run MSblender, install the following programs/packages on your machine.
- python (2.5 or later)
- gcc (we used version 4.4.3, but we believe that our ANSI C code does not depend on a specific version of gcc).
- GNU Scientific Library (version 1.13 or later)
- If you use Ubuntu (or Debian) Linux, install the 'gsl-bin' and 'libgsl0-*' packages.
- (Optional) matplotlib (python graph library). Only required for 'pre/plot-his_list.py' script.
You also need search engine results to run MSblender. All searches should be conducted against the same concatenated database (target + decoy). The current script recognizes an 'xf_' prefix in the protein ID as a decoy sequence, but you can easily change this in 'make-msblender_in.py'.
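As a rough illustration of the decoy convention described above, here is a minimal Python sketch. The helper name `is_decoy` and the constant `DECOY_PREFIX` are hypothetical and are not taken from 'make-msblender_in.py'. Only the 'xf_' prefix comes from this document.

```python
# Hypothetical helper (not part of MSblender): flag decoy proteins by ID prefix.
DECOY_PREFIX = "xf_"  # prefix named in this README; change it to match your database


def is_decoy(protein_id, prefix=DECOY_PREFIX):
    """Return True if the protein ID starts with the decoy prefix."""
    return protein_id.startswith(prefix)


# The second ID is an example target ID; the first is the same ID with the decoy prefix added.
print(is_decoy("xf_CAH1_HUMAN_UPS|P00915|5000|50000|260"))
print(is_decoy("CAH1_HUMAN_UPS|P00915|5000|50000|260"))
```

If your decoy database uses a different prefix, only the constant would need to change in a helper like this.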
Installation
- Download source code from GitHub. Alternatively, you can download it from http://www.marcottelab.org/users/MSblender/src/MSblender-current.tgz .
- Enter the 'c/' directory and execute the './compile' script. The GNU Scientific Library must be installed before you run this script. It generates the 'msblender' and 'msblender.h.gch' files in the same directory.
- That's it. Now you are ready to run MSblender.
How to use
MSblender works in three steps: pre-processing, modelling, and post-processing.
Pre-processing
First, MSblender converts the results of various search engines into a unified tab-delimited text format called 'hit_list'. It then converts the 'hit_list' files into the input file for the MSblender modelling program. You can see a 'test' dataset and its output at http://www.marcottelab.org/users/MSblender/test/.
Currently, MSblender supports the following search engine results (and scores).
- SEQUEST, Xcorr (If you get SRF file from Thermo BioWorks, you can convert it to pepxml directly using Mspire, developed by John T. Prince).
- X!Tandem, k-score (a.k.a COMET search engine) based -log(E-value)
- OMSSA, -log(E-value)
- InsPecT, MQscore
- MyriMatch, mvh
- MSGFDB, -log(SpecProb)
For example, you can convert an X!Tandem pepxml file to a logE_hit_list file as below:
$ ../src/MSblender-20110130/pre/tandem_pepxml-to-logE_hit_list.py test.tandem_k.pepxml
Write test.tandem_k.logE_hit_list ...
The resulting hit_list file looks like this:
# pepxml: test.tandem_k.pepxml
#Spectrum_id Charge PrecursorMz MassDiff Peptide Protein MissedCleavages Score(-log10[E-value])
MSups_5ul.07228.07228.4 4 689.596425 0.004000 SLLSNVEGDNAVPMQHNNRPTQPLK CAH1_HUMAN_UPS|P00915|5000|50000|260 0 1.795880
MSups_5ul.11647.11647.2 2 592.839650 0.000000 ADGLAVIGVLMK CAH1_HUMAN_UPS|P00915|5000|50000|260 0 1.148742
MSups_5ul.06405.06405.2 2 524.279350 0.003000 DLFNAIATGK CATA_HUMAN_UPS|P04040|5000|5000|526 0 0.327902
....
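If you want to inspect a hit_list file yourself, the following Python sketch shows one way to read it. It assumes the eight tab-delimited columns shown in the example above and treats every line starting with '#' as a comment or header. The function name `parse_hit_list` and the lowercase column names are our own choices for illustration, not part of MSblender.

```python
# Column names assumed from the example header in this README (hypothetical labels).
HIT_LIST_COLUMNS = [
    "spectrum_id", "charge", "precursor_mz", "mass_diff",
    "peptide", "protein", "missed_cleavages", "score",
]


def parse_hit_list(lines):
    """Parse tab-delimited hit_list lines into a list of dicts."""
    psms = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' comment/header lines
        fields = line.split("\t")
        if len(fields) != len(HIT_LIST_COLUMNS):
            continue  # skip anything that does not match the assumed layout
        psm = dict(zip(HIT_LIST_COLUMNS, fields))
        psm["charge"] = int(psm["charge"])
        psm["score"] = float(psm["score"])
        psms.append(psm)
    return psms


example = [
    "# pepxml: test.tandem_k.pepxml",
    "#Spectrum_id\tCharge\tPrecursorMz\tMassDiff\tPeptide\tProtein\tMissedCleavages\tScore(-log10[E-value])",
    "MSups_5ul.07228.07228.4\t4\t689.596425\t0.004000\tSLLSNVEGDNAVPMQHNNRPTQPLK\t"
    "CAH1_HUMAN_UPS|P00915|5000|50000|260\t0\t1.795880",
]
psms = parse_hit_list(example)
print(psms[0]["peptide"], psms[0]["score"])
```

To read a real file, you could pass an open file handle instead of the `example` list, since the function only iterates over lines.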
Some search engines report multiple PSMs for a single spectrum (mainly because of different charge state estimates). For example, with default settings MyriMatch reports the best hits for both the +2 and +3 charge states, so its total number of PSMs is almost twice that of the other search engines. To remove this imbalance, you can keep only the best PSM per spectrum, based on the score you defined, with the 'select-best-PSM.py' script:
$ wc test.myrimatch.mvh_hit_list
10888 87099 1168772 test.myrimatch.mvh_hit_list
$ ../src/MSblender-20110130/pre/select-best-PSM.py test.myrimatch.mvh_hit_list
$ wc test.myrimatch.mvh_hit_list_best
5516 44123 598964 test.myrimatch.mvh_hit_list_best
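The idea of keeping one PSM per spectrum can be sketched in a few lines of Python. This is not the code of 'select-best-PSM.py'. It assumes that a higher score is better and that PSMs from the same scan share a spectrum ID up to the trailing charge suffix. The function name and the toy records are made up for illustration.

```python
def select_best_psm(psms):
    """Keep the highest-scoring (spectrum_id, peptide, score) record per scan."""
    best = {}
    for spectrum_id, peptide, score in psms:
        # Assumption: dropping the trailing '.<charge>' makes +2 and +3 hits
        # from the same scan compete with each other.
        scan_key = spectrum_id.rsplit(".", 1)[0]
        if scan_key not in best or score > best[scan_key][2]:
            best[scan_key] = (spectrum_id, peptide, score)
    return sorted(best.values())


# Toy records (invented, not from the test dataset).
toy_psms = [
    ("run1.00100.00100.2", "PEPTIDEA", 12.5),
    ("run1.00100.00100.3", "PEPTIDEB", 20.1),
    ("run1.00200.00200.2", "PEPTIDEC", 7.3),
]
best = select_best_psm(toy_psms)
for record in best:
    print(record)
```

For a score where lower is better (for example a raw E-value), the comparison would have to be reversed.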
Then you can combine multiple 'hit_list' files into an msblender input file. This requires a text configuration file that lists each search engine name and its hit_list file, as below:
InsPect test.inspect.MQscore_hit_list_best
MyriMatch test.myrimatch.mvh_hit_list_best
SEQUEST test.sequest.xcorr_hit_list_best
X!Tandem test.tandem_k.logE_hit_list_best
Then run the 'make-msblender_in.py' script:
$ ../src/MSblender-20110130/pre/make-msblender_in.py msblender.conf > test.msblender_in
Output looks like this:
sp_pep_id decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score
MSups_5ul.00439.00439.3.ASLSNTPSIGQ 0 0.031000 NA NA NA
MSups_5ul.00439.00439.3.LDELRDEGK 0 NA 18.090108 0.914975 -0.832509
MSups_5ul.00444.00444.1.GQFVK 1 NA 2.598828 NA NA
MSups_5ul.00446.00446.3.LDELRDEGK 0 NA 13.341218 0.930569 -0.579784
MSups_5ul.00461.00461.3.ADDKETCFAEEGKK 0 NA 16.846330 1.834260 -0.770852
...
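The layout above can be reproduced with a short Python sketch. This is only an illustration of the table structure, not the code of 'make-msblender_in.py'. It assumes that 'sp_pep_id' is the spectrum ID joined to the peptide with a dot, that the decoy flag comes from the 'xf_' protein prefix, and that missing scores are written as 'NA'. The function name and the protein IDs 'PROT_A' and 'xf_PROT_B' are placeholders.

```python
def build_msblender_rows(engine_hits, decoy_prefix="xf_"):
    """Merge per-engine hits into rows of sp_pep_id, decoy flag and one score per engine.

    engine_hits maps an engine name to a list of
    (spectrum_id, peptide, protein, score) tuples.
    """
    engines = sorted(engine_hits)
    table = {}
    for engine, hits in engine_hits.items():
        for spectrum_id, peptide, protein, score in hits:
            sp_pep_id = "%s.%s" % (spectrum_id, peptide)
            entry = table.setdefault(sp_pep_id, {"decoy": 0, "scores": {}})
            if protein.startswith(decoy_prefix):
                entry["decoy"] = 1
            entry["scores"][engine] = score
    rows = [["sp_pep_id", "decoy"] + ["%s_score" % e for e in engines]]
    for sp_pep_id in sorted(table):
        entry = table[sp_pep_id]
        scores = [
            "%.6f" % entry["scores"][e] if e in entry["scores"] else "NA"
            for e in engines
        ]
        rows.append([sp_pep_id, str(entry["decoy"])] + scores)
    return rows


# Scores are taken from the example output; protein IDs are placeholders.
hits = {
    "MyriMatch": [
        ("MSups_5ul.00439.00439.3", "LDELRDEGK", "PROT_A", 18.090108),
        ("MSups_5ul.00444.00444.1", "GQFVK", "xf_PROT_B", 2.598828),
    ],
    "SEQUEST": [
        ("MSups_5ul.00439.00439.3", "LDELRDEGK", "PROT_A", 0.914975),
    ],
}
rows = build_msblender_rows(hits)
for row in rows:
    print("\t".join(row))
```

Keying the table on spectrum plus peptide means two engines only share a row when they agree on the peptide for that spectrum, which matches the two separate rows for spectrum 00439 in the example output.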
Multivariate Modeling
Feed the 'msblender_in' file to the 'msblender' executable in the 'c/' directory as below:
$ ~/git/MSblender/c/msblender test.msblender_in 100
1 4469.537280 0.2852
2 67673.492372 0.4619
3 83020.543621 0.5275
4 82494.877698 0.5496
5 82243.485441 0.5601
6 81891.150707 0.5654
7 81745.917044 0.5676
8 81717.914272 0.5684
9 81732.128261 0.5686
10 81756.373959 0.5686
$
The program terminates when the model converges. If the number of iterations reaches the limit you set (100 here), run the program again with a larger limit.
Now you can see the output file named 'test.msblender_in_msblender' in the same directory. The file looks like this:
Spectrum                                Decoy  InsPect_score  MyriMatch_score  SEQUEST_score  X!Tandem_score  mvScore
MSups_5ul.00439.00439.3.ASLSNTPSIGQ     F      0.03                                                           0.006
MSups_5ul.00439.00439.3.LDELRDEGK       F                     18.09            0.91           -0.83           1.000
MSups_5ul.00444.00444.1.GQFVK           D                     2.60                                            0.085
MSups_5ul.00446.00446.3.LDELRDEGK       F                     13.34            0.93           -0.58           1.000
MSups_5ul.00461.00461.3.ADDKETCFAEEGKK  F                     16.85            1.83           -0.77           1.000
MSups_5ul.00590.00590.2.AAFTECCQAADK    F      4.80           34.62            3.17           1.39            1.000
...
Post-processing
Based on target/decoy hits, you can estimate the empirical false discovery rate (FDR). Given the msblender output, the 'filter-msblender-001.py' script in the 'post/' directory reports PSMs with FDR < 0.01. It also reports the total number of PSMs with 'mvscore=1.0' (a perfect multivariate score). If this number is about the same as the total number of selected PSMs, as in the example below, the model is not sensitive enough to resolve PSMs below FDR 0.01.
$ ../src/MSblender-20110130/post/filter-msblender-001.py test.msblender_in_msblender
...
MSups_5ul.05806.05806.2.YAAELHLVHWNTK F 1.43 18.82 -0.61 1.000 0.0123
MSups_5ul.06533.06533.2.AFYVNVLNEEQR F 4.19 66.30 4.15 3.64 1.000 0.0123
MSups_5ul.04145.04145.2.SADFTNFDPR F 1.76 0.999 0.0123
#target=3705,decoy=46,total=3751,fdr=0.012
#N(mvscore=1.0): 3750
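To make the target/decoy estimate concrete, here is a small Python sketch of a running empirical FDR over PSMs ranked by mvScore. It assumes FDR is computed as decoys divided by all PSMs accepted so far, which is consistent with the summary line above (46 / 3751 is about 0.0123) but is not confirmed to be the exact formula in 'filter-msblender-001.py'. The function name and the toy data are invented.

```python
def running_fdr(psms):
    """Rank (mv_score, is_decoy) pairs by score and attach a running FDR.

    Assumption: FDR = decoys / (targets + decoys) among PSMs ranked so far.
    """
    ranked = sorted(psms, key=lambda p: -p[0])  # best score first
    results = []
    n_decoy = 0
    for rank, (score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            n_decoy += 1
        results.append((score, is_decoy, float(n_decoy) / rank))
    return results


# Toy data (invented): three target hits and one decoy hit.
toy = [(1.0, False), (0.99, False), (0.9, True), (0.8, False)]
fdr_table = running_fdr(toy)
for score, is_decoy, fdr in fdr_table:
    print("%.2f\t%s\t%.4f" % (score, "D" if is_decoy else "F", fdr))
```

When many PSMs tie at mvScore 1.000, they cannot be ranked against each other, so the estimated FDR cannot drop below the value reached at the end of that tied block. This is the situation the 'N(mvscore=1.0)' count is meant to reveal.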
Citation
- T. Kwon*, H. Choi*, C. Vogel, A.I. Nesvizhskii, and E.M. Marcotte, MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines. J. Proteome Research, in press, 2011.
See also
- https://github.com/MarcotteLabGit/MSblender (GitHub source repository)