A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides

An Erratum to this article was published on 07 August 2015

This article has been updated

Abstract

Fewer than half of all tandem mass spectrometry (MS/MS) spectra acquired in shotgun proteomics experiments are typically matched to a peptide with high confidence. Here we determine the identity of unassigned peptides using an ultra-tolerant Sequest database search that allows peptide matching even with modifications of unknown masses up to ± 500 Da. In a proteome-wide data set on HEK293 cells (9,513 proteins and 396,736 peptides), this approach matched an additional 184,000 modified peptides, which were linked to biological and chemical modifications representing 523 distinct mass bins, including phosphorylation, glycosylation and methylation. We localized all unknown modification masses to specific regions within a peptide. Known modifications were assigned to the correct amino acids with frequencies >90%. We conclude that at least one-third of unassigned spectra arise from peptides with substoichiometric modifications.
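The difference between a closed and a mass-tolerant (open) search can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Sequest implementation: the function names, the toy database and the narrow-window value are assumptions, and only the ±500 Da window comes from the abstract. Residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the residues used in this toy example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056  # H2O added to the residue sum to give the free peptide mass


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def candidates(observed_mass, peptides, tolerance_da):
    """Return (peptide, delta_mass) pairs within +/- tolerance_da of observed_mass."""
    hits = []
    for pep in peptides:
        delta = observed_mass - peptide_mass(pep)
        if abs(delta) <= tolerance_da:
            hits.append((pep, delta))
    return hits


# Hypothetical database and a precursor carrying a phosphate (+79.96633 Da).
database = ["SAVTELK", "GASPVR", "TLEELK"]
observed = peptide_mass("SAVTELK") + 79.96633

# Narrow window (assumed 0.005 Da, about 6 ppm at this mass): nothing fits.
closed = candidates(observed, database, tolerance_da=0.005)
# Open window of +/- 500 Da: every peptide is a candidate, each with a delta mass.
opened = candidates(observed, database, tolerance_da=500.0)
```

In this sketch the closed search returns no candidates because the modified precursor is about 80 Da heavier than any database peptide, whereas the open search keeps all three peptides and leaves it to MS/MS scoring to choose among them. The Δmass recorded for the correct peptide is the mass of the modification.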

Figure 1: A very wide precursor ion (open) search setting identified 185,000 modified peptides.
Figure 2: Averaging many independent events provides accurate net modification mass differences (sub-p.p.m.).
Figure 3: Many peptides of negative Δmass values are generated via in-source dissociation.
Figure 4: Analysis of 185,000 peptides provides insight into rare biological modifications and amino acid variants and variations.

Change history

  • 18 June 2015

    In the version of this article initially published online, in Figure 3c, the label “unidentified b-type ions” should have been deleted, and in Figure 4, the symbols over the peptide sequences were misplaced. The errors have been corrected for the print, PDF and HTML versions of this article.

Acknowledgements

We thank all members of S.P.G.'s lab for fruitful discussions about this work. This work was funded in part by US National Institutes of Health grants HG3456 and GM67945 to S.P.G.

Author information

Contributions

B.Z. collected the proteomic data set. J.M.C. and S.P.G. implemented the search strategy, performed the data analysis and interpreted the results. D.P.N. provided Gaussian modeling analysis. D.K. and E.L.H. provided computational support. E.L.H. provided statistical expertise for FDR analysis. R.R. performed the Ascore localization analysis. J.M.C. and S.P.G. conceived the idea, discussed and wrote the manuscript.

Corresponding author

Correspondence to Steven P Gygi.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The mass accuracy of fragment ions is important for peptide recovery in wide-tolerance searches.

The same data as in Figure 1B were searched with varying fragment ion tolerances (FIT), so that peaks in MS/MS spectra were required to match with differing tolerances. A 1.0 Da FIT is typically used to search spectra collected at lower resolution. For the ±500-Da precursor ion search, only 43.9% of peptides were recovered with a FIT of 1.0 Da. In contrast, the ±500-Da precursor ion search with a FIT of 0.01 Da recovered 85.5% of peptides.
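One way to see why the fragment ion tolerance matters is to count how many theoretical fragments of a wrong peptide find a partner peak at each tolerance. The Python sketch below does this for made-up m/z values; the function name and all numbers are illustrative assumptions, and only the two tolerances (1.0 Da and 0.01 Da) are taken from the caption.

```python
from bisect import bisect_left


def count_matches(theoretical_mz, observed_mz, tolerance_da):
    """Count theoretical fragment m/z values with an observed peak within tolerance_da."""
    peaks = sorted(observed_mz)
    matched = 0
    for mz in theoretical_mz:
        # First observed peak at or above the lower edge of the tolerance window.
        i = bisect_left(peaks, mz - tolerance_da)
        if i < len(peaks) and peaks[i] <= mz + tolerance_da:
            matched += 1
    return matched


# Hypothetical values: the observed peaks come from a different peptide,
# so every match counted here is a chance match.
theoretical = [175.119, 288.203, 401.287, 530.330]
observed = [175.600, 287.800, 401.900, 529.700]

loose = count_matches(theoretical, observed, tolerance_da=1.0)
tight = count_matches(theoretical, observed, tolerance_da=0.01)
```

With these toy numbers the 1.0 Da tolerance accepts all four chance matches and the 0.01 Da tolerance accepts none. A ±500-Da precursor window admits many more candidate peptides per spectrum, so a tight fragment tolerance is what keeps such chance matches from outscoring the correct peptide.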

Supplementary Figure 2 The open search approach assigns peptide matches without violating the target-decoy strategy.

A) First-ranked peptides are more commonly derived from the target (forward) database. Triplicate LC-MS/MS analyses of mouse brain peptides (same as in Figure 1B) were searched with either a 5 ppm or a 500 Da precursor ion search tolerance. Matches were partitioned (regardless of score) by database origin (forward or reversed). Because of the large number of correctly matched peptides, the target database is selected more frequently. B) Tenth-ranked peptides from these same searches are equally distributed between the forward and reversed databases. Matches at this rank would overwhelmingly be random. The 5 ppm closed search at a Sequest rank of 10 had an even split of 49.2% forward and 50.8% reversed matches. Likewise, the 500 Da search showed 49.6% forward and 50.4% reversed matches.
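The reasoning behind this check can be reproduced with a small simulation in Python. It is a sketch under stated assumptions: random matches are taken to hit the forward and reversed halves of the database with equal probability, and the 60% share of correct first-ranked matches is an arbitrary illustrative number, not a value from the study.

```python
import random


def forward_fraction(origins):
    """Fraction of matches whose database origin is 'forward'."""
    return sum(1 for origin in origins if origin == "forward") / len(origins)


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Random matches (as expected at rank 10): forward and reversed equally likely.
random_matches = [rng.choice(["forward", "reverse"]) for _ in range(10_000)]

# First-ranked matches: assume 60% are correct (always forward) and the
# remaining 40% are random, split evenly between forward and reversed.
top_matches = [
    "forward" if rng.random() < 0.6 else rng.choice(["forward", "reverse"])
    for _ in range(10_000)
]

random_fraction = forward_fraction(random_matches)
top_fraction = forward_fraction(top_matches)
```

Under these assumptions the random matches split close to 50/50, as reported for the tenth-ranked peptides, while the first-ranked matches skew toward the forward database (about 80% here). A tolerance setting that biased random matches toward either half of the database would show up as a departure from the even split at rank 10.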

Supplementary Figure 3 Comparisons using a closed search directed at the 15 most frequently detected modifications.

The same data as in Figure 1B were subjected to three different Closed searches (±5 ppm), each specifying a set of 5 modifications. The modifications were the 15 most abundantly detected in the Open search, taken in order of abundance. The three sets follow, with the affected amino acids shown in parentheses. Set 1: Oxidation (M), deamidation (N, Q), phosphorylation (ST), pyro-glutamate (N-term Q), carbamylation (K). Set 2: Formylation (ST), iron (ED), iodoacetamide (M), N-terminal methionine cleavage and acetylation. Set 3: Acetylation (K), dihydroxy tryptophan (W), methylation (K), iodination (Y), N-terminal methionine cleavage.

A) Breakdown for each search comparing the number of modified peptides found in each search with the overlap to the Open search results. Note that the Open search identified ~50% or fewer of the same peptides compared to the directed closed search. Similar to Figures 1F-1I, the sensitivity of the Open search for any modification is ~50%.

B) Comparison of the overlap between directed Closed and Open searches. Three directed searches with 5 modifications in each identified 145,138 modified peptides. The overlap with the 184,982 modified matches from the Open search was 71,995. Most assigned spectra from an Open search (117,073) correspond to one of the remaining ~500 ∆M bins.

Supplementary Figure 4 In-source dissociation produces fragment ions that can be selected as precursor ions for MS/MS analysis.

During the ionization process, we often detected evidence of in-source fragmentation. An example from an Open search shows that the intact peptide and its in-source fragment 1) co-elute, and 2) differ in mass by the removal of one or more amino acids from one terminus. In this example, a peptide from HSPA8 appears to co-elute with another peptide. The Open search identified the correct tryptic sequence with a ∆M value corresponding to the loss of two isoleucines (-226.1683 Da). A-C) MS/MS spectrum, extracted ion chromatogram, and predicted and observed fragment ions for the in-source dissociation species. D-F) MS/MS spectrum, extracted ion chromatogram, and predicted and observed fragment ions for the unmodified species. The intact species was recorded at 66-fold higher abundance than the in-source dissociation species.
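A negative ∆M of this kind can be checked against terminal truncations of the matched sequence. The Python sketch below is an illustration only: the peptide "IIGASK" is a hypothetical sequence chosen to keep the mass table short (it is not the HSPA8 peptide), the function name and the 0.005 Da tolerance are assumptions, and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the residues in the toy peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "I": 113.08406, "K": 128.09496,
}


def truncation_explanations(sequence, delta_mass, tolerance_da=0.005):
    """List terminal truncations whose summed residue loss matches delta_mass."""
    hits = []
    for n in range(1, len(sequence)):
        # Try removing n residues from the N terminus, then from the C terminus.
        for terminus, removed in (("N", sequence[:n]), ("C", sequence[-n:])):
            loss = -sum(RESIDUE_MASS[aa] for aa in removed)
            if abs(loss - delta_mass) <= tolerance_da:
                hits.append((terminus, removed))
    return hits


# Delta mass from the caption: loss of two isoleucines.
hits = truncation_explanations("IIGASK", -226.1683)
```

For the toy peptide the only truncation consistent with -226.1683 Da is the removal of two isoleucines from the N terminus, since 2 × 113.08406 Da = 226.16812 Da.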

Supplementary Figure 5 The ±500-Da search identified ~2,000 phosphopeptides.

Phosphorylation was a frequently detected modification. Likely due to high stoichiometry, many of these peptides were detected only in their phosphorylated forms. Some examples of phosphorylation sites follow. A) Serine/arginine repetitive matrix protein 1 was identified with phosphorylation of serine 696, which was not observed in an unmodified form. B) Ras GTPase-activating protein-binding protein 2 (G3bp2) was identified with phosphorylation of threonine 227; 76% of the peptides with that sequence were phosphorylated. Other modifications were also present on this peptide, so the total does not sum to 100%. C) Progesterone receptor membrane component 2 (Pgrmc2) was identified frequently with phosphorylation at threonine 205.

Supplementary Figure 6 Characterization of protein N termini identified through the 500-Da search in HEK293 cells.

A-D) N-terminal peptides were identified with four main distinct ∆mass values (-89 Da, -131 Da, +42 Da and unmodified). These values corresponded to protein N-terminal processing events which included methionine cleavage and/or acetylation. E) Summary of protein N-terminal modifications. The majority of proteins (78%) are acetylated in HEK293 cells. F) Venn diagram representation of all the N-terminal peptides demonstrating that a small fraction of protein N termini were actually identified with more than one processing type. In a few cases, all 4 possible N-terminal modifications were present.
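The four ∆mass values follow from two standard monoisotopic masses: a methionine residue (131.04049 Da) and an acetyl group (42.01057 Da). The Python sketch below assumes that ∆mass is measured relative to the unprocessed peptide that still carries the initiator methionine; the dictionary, the function name and the 0.01 Da tolerance are illustrative choices, not the authors' code.

```python
MET_RESIDUE = 131.04049  # monoisotopic mass of a methionine residue (Da)
ACETYL = 42.01057        # monoisotopic mass added by acetylation, C2H2O (Da)

# Expected delta mass for each N-terminal processing state.
N_TERMINAL_STATES = {
    "unmodified": 0.0,
    "acetylation": ACETYL,
    "Met cleavage": -MET_RESIDUE,
    "Met cleavage + acetylation": -MET_RESIDUE + ACETYL,
}


def classify(delta_mass, tolerance_da=0.01):
    """Return the processing state whose expected delta mass matches, or None."""
    for name, expected in N_TERMINAL_STATES.items():
        if abs(delta_mass - expected) <= tolerance_da:
            return name
    return None


state = classify(-89.03)
```

The combined event gives -131.04049 + 42.01057 = -89.02992 Da, which is the -89 Da class in the caption.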

Supplementary Figure 7 Glycosylated peptides were detected in the 500-Da search approach.

Host cell factor 1 was identified with 11 glycosylation sites, all of which were O-GlcNAc modifications. A) Diagrammatic representation of Host cell factor 1 and its domains showing the position of each GlcNAc modification. These GlcNAc sites cluster around the known protein-protein interaction domains. B-E) Example MS/MS spectra for several GlcNAc sites showing the matching of fragment ions and the detection of the GlcNAc-specific ion at 204 Da with subsequent water losses (186 Da and 168 Da). F-H) Three Notch receptors were identified with various types of glycosylation throughout their EGF repeats. F) Notch 2 contained nine different peptides with several types of glycosylation, including fucose, Glc, Glc-Xyl, Glc-Xyl-Xyl, GlcNAc and GlcNAc-Glc. G) Notch 1 was identified with both fucose and Glc-Xyl-Xyl on five peptides. H) Notch 3 was identified with fucose, Glc and Glc-Xyl-Xyl on four different peptides.
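The diagnostic ions named in panels B-E can be screened for programmatically. The Python sketch below uses the commonly quoted HexNAc oxonium ion at m/z 204.0867 and successive water losses of 18.0106 Da; the function name, the 0.01 Da tolerance, the example peak lists and the choice to require all three ions are assumptions made for illustration.

```python
HEXNAC_OXONIUM = 204.0867  # m/z of the HexNAc (GlcNAc) oxonium ion
WATER = 18.0106            # mass of H2O (Da)

# The oxonium ion and its first and second water losses.
DIAGNOSTIC_IONS = [HEXNAC_OXONIUM - n * WATER for n in range(3)]


def has_glcnac_signature(peaks_mz, tolerance_da=0.01):
    """True if every diagnostic ion has an observed peak within tolerance_da."""
    return all(
        any(abs(peak - ion) <= tolerance_da for peak in peaks_mz)
        for ion in DIAGNOSTIC_IONS
    )


# Hypothetical peak lists (m/z values only).
with_signature = has_glcnac_signature([168.066, 186.076, 204.087, 500.300])
without_signature = has_glcnac_signature([204.087, 500.300])
```

The three values computed here round to 204, 186 and 168, matching the nominal masses in the caption. Requiring all three ions is a stricter rule than looking for the 204 ion alone and could be relaxed for low-intensity spectra.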

Supplementary Figure 8 Example spectra showing the identification of the same peptide from histone H3 with four different modifications.

A) mono-methylation, B) di-methylation, C) tri-methylation, and D) acetylation.

Supplementary Figure 9 Detection of a diphthamide modification on His715 in EF2.

A) Chemical formula and structure of the modified histidine. B) Predicted and identified fragment ions with a modified histidine residue. C) MS/MS spectrum with b- and y-type ions labeled with the addition of +83.0651 Da to the histidine residue. The large diphthamide modification subsequently fragments to release trimethylamine, leaving the histidine modified by a ∆mass of only +83.0651 Da.

Supplementary Figure 10 Example MS/MS spectra for single amino acid variants and polyalanine insertions in HEK293 proteins.

A-D) Examples of four detected mutations and their corresponding MS/MS spectra. Fragment ions are labeled based on the modification and position shown. E-F) Ribosomal protein L14 was identified with polyalanine insertions of three to six alanines. Blue circles designate b-type ions that resulted from the loss of water from the threonine residue (b2).

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 (PDF 898 kb)

Supplementary Table 1

Closed search analysis of HEK 293 cells. (XLSX 67472 kb)

Supplementary Table 2

Open search analysis of HEK 293 cells. (XLSX 111019 kb)

Supplementary Table 3

Gaussian mixed model analysis of HEK 293 cells searched using the open search. (XLSX 34 kb)

Cite this article

Chick, J., Kolippakkam, D., Nusinow, D. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol 33, 743–749 (2015). https://doi.org/10.1038/nbt.3267

  • DOI: https://doi.org/10.1038/nbt.3267
