Posted by Ville Koskinen (September 16, 2019)

Common myths about protein scores

Mascot Server is used in many different application areas by both mass spectrometry experts and non-experts. Over the years, we’ve spotted a few recurring misconceptions about how protein scores are interpreted and used. All the examples come from recent peer-reviewed papers.

Protein scores in PMF searches

The very first thing to check is what type of experiment is being reported. If it’s peptide mass fingerprinting (PMF), Mascot calculates a statistical score for each identified protein. The score reflects the probability that the match between the observed molecular masses and the digested database entry is a random event. Mascot also reports a score threshold based on the selected significance level (by default 0.05). A protein hit is statistically significant if its score is above the threshold. The example PMF search illustrates these points.

A paper might say the authors accepted “all proteins with score > 51” or “protein scores greater than 67 (p<0.05)”. You could even see the phrase “protein score at significance level (p<0.05)”. These are all valid ways of accepting protein hits in a PMF search. Another, perhaps simpler way is to look at the expect value. If the protein hit’s expect value (say 2.2e-14) is below the significance level (say 0.05), the hit is statistically significant.

Protein scores in MS/MS searches

If the experiment uses MS/MS data to identify peptides, the situation is more complicated. We haven’t always succeeded in making the documentation as clear and unambiguous as possible, so there is room for misunderstanding.

When Mascot compares an MS/MS peak list to the in silico fragmented peptide, it gives the match a statistical score. The score reflects the probability that the match between the observed and calculated fragment masses is a random event. The match is given an expect value, which is a function of the score and the threshold. If the expect value is below the significance level, the match is statistically significant. Have a look at the example MS/MS search for a sample of peptide scores and expect values.
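As a rough illustration of how an expect value can depend on score and threshold, the sketch below assumes the relationship E = significance level × 10^((threshold − score) / 10), which is what you would expect if scores sit on a −10·log10(P) scale. The function names and the sample numbers are made up for illustration; the expect values printed in the results report remain the authoritative ones.

```python
def expect_value(score: float, threshold: float, significance_level: float = 0.05) -> float:
    """Assumed relationship, not taken from this article:
    E equals the significance level when score == threshold and
    drops tenfold for every 10 score units above the threshold."""
    return significance_level * 10 ** ((threshold - score) / 10)


def is_significant(score: float, threshold: float, significance_level: float = 0.05) -> bool:
    # A match is significant when its expect value is below the significance level,
    # which under the assumed formula is the same as score > threshold.
    return expect_value(score, threshold, significance_level) < significance_level


# Hypothetical peptide matches: (score, threshold)
for score, threshold in [(20, 30), (30, 30), (40, 30), (60, 30)]:
    e = expect_value(score, threshold)
    print(f"score={score} threshold={threshold} expect={e:.2g} significant={is_significant(score, threshold)}")
```

Under this assumption, comparing the expect value with the significance level and comparing the score with the threshold are two views of the same test.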

Mascot also reports a protein score in an MS/MS search. However, this is not a statistical score. There are two different types:

  • MudPIT: sum of the score in excess of the threshold for each significant peptide match, plus the average threshold of these matches
  • Standard: sum of scores of non-duplicate peptide matches, minus a small correction

Protein scores in MS/MS searches are only used for ranking protein hits. The goal is to put proteins with lots of strong peptide evidence at the top of the list. The MudPIT score is the only available type in Protein Family Summary. For small searches, the default report is Peptide Summary, where you can choose the protein score type.

Unfortunately, you sometimes see MS/MS papers with phrases like “all proteins identified with a Mascot score higher than 60 [...] were considered reliable” or “peptides and proteins with a Mascot score higher than 35 and 50, respectively, were automatically accepted.” Occasionally, you even see a mention of protein expect values, which Mascot does not calculate in MS/MS searches. As you can see from the definitions of MudPIT and standard scores, thresholding by score has no clear meaning. A MudPIT score of 60 could mean the protein has one significant peptide match with score 60, or the protein could have 47 peptide matches each with score 14 and threshold 13. Similarly, you can get a standard score of 60 with just one peptide match, which doesn’t even need to be a significant match.
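The arithmetic behind the MudPIT example can be checked with a few lines of Python. This is a minimal sketch of the definition given above, not Mascot’s implementation: the function name and the input format are hypothetical, a match is treated as significant simply when its score exceeds its threshold, and a protein with no significant matches is given a score of 0 as an arbitrary choice. The threshold of 13 for the single strong match is also an assumption; any threshold below 60 gives the same result, because the threshold is subtracted once and added back once.

```python
def mudpit_score(matches):
    """matches: list of (peptide_score, threshold) pairs for one protein hit.

    Sum of the score above threshold over the significant matches,
    plus the average threshold of those matches.
    """
    # Assumption for this sketch: significant means score > threshold.
    significant = [(score, threshold) for score, threshold in matches if score > threshold]
    if not significant:
        return 0.0  # arbitrary choice for the sketch
    excess = sum(score - threshold for score, threshold in significant)
    average_threshold = sum(threshold for _, threshold in significant) / len(significant)
    return excess + average_threshold


one_strong_match = [(60, 13)]          # one peptide match with score 60
many_weak_matches = [(14, 13)] * 47    # 47 matches, each only 1 above threshold

print(mudpit_score(one_strong_match))   # 60.0
print(mudpit_score(many_weak_matches))  # 60.0
```

Two very different bodies of peptide evidence give the same protein score, which is why a fixed protein score cutoff says little about reliability.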

Unintentionally accepting one-hit wonders

This brings us to another error sometimes seen in the scientific literature. Here’s a sample of methods from papers using LC-MS/MS:

“Proteins were accepted if they had at least one ‘rank 1’ peptide with a peptide ion score of more than 50.0 (p < 0.05).”

“The presence of at least one peptide with a significant ion score was required for positive protein identification.”

“Proteins that met our criteria for ‘identified proteins’ exhibited ≥ 1 peptide with an individual Mascot score of p < 0.05.”

“Proteins with a score of at least 30 for single high-confidence peptides were considered positive identifications.”

The intent is, without doubt, to filter out potential false positive proteins. Very few papers say which results report or protein score type was used, but assuming the above methods are accurate descriptions, they all let through one-hit wonders.

To see this, let’s look at Protein Family Summary and MudPIT scoring. Protein clustering ensures that every family member has at least one significant peptide match, which is almost always the rank 1 match. The absolute value of the peptide match score doesn’t matter as long as the match is significant. The second and third methods therefore provide no filtering at all in this case. The first and fourth methods are inadequate: they will accept some one-hit wonders (when the peptide score is high enough) and reject some protein hits that are identified by more than one peptide.
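A toy comparison makes the point concrete. The two protein hits, their peptide sequences, scores and thresholds below are invented for illustration, and significance is simplified to score above threshold; the two functions are hypothetical stand-ins for the styles of acceptance criteria quoted above.

```python
# Hypothetical protein hits: lists of (peptide_sequence, score, threshold)
hits = {
    "one_hit_wonder": [("LVNELTEFAK", 55, 30)],
    "two_modest_peptides": [("YLYEIAR", 35, 30), ("AEFVEVTK", 38, 30)],
}


def accepted_by_score_cutoff(peptides, cutoff=50):
    # Style of the first and fourth methods: at least one peptide above a fixed score.
    return any(score > cutoff for _, score, _ in peptides)


def accepted_by_any_significant(peptides):
    # Style of the second and third methods: at least one significant peptide match.
    return any(score > threshold for _, score, threshold in peptides)


by_cutoff = {name: accepted_by_score_cutoff(p) for name, p in hits.items()}
by_significance = {name: accepted_by_any_significant(p) for name, p in hits.items()}

print(by_cutoff)        # {'one_hit_wonder': True, 'two_modest_peptides': False}
print(by_significance)  # {'one_hit_wonder': True, 'two_modest_peptides': True}
```

The fixed cutoff keeps the one-hit wonder and drops the protein supported by two significant peptides, while the “any significant peptide” rule accepts everything that is already in the report.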

There is a straightforward procedure to control for false positive protein hits. First of all, do a target-decoy search so that you get a reliable estimate of the peptide false discovery rate. Choose a target FDR appropriate for your experiment. Now filter out protein hits that were identified by a single peptide sequence. This gets rid of one-hit wonders and also controls for false positive peptides. Have a look at Creating a list of confidently identified proteins for an example of how to do this in Report Builder.
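If you post-process exported results yourself, the same procedure can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data layout, the function names, the accessions and the sequences are invented, and the FDR estimate uses the simple decoy-over-target ratio, which is one common convention and may not match how your report calculates it.

```python
def peptide_fdr(decoy_matches: int, target_matches: int) -> float:
    """Simple FDR estimate: significant decoy matches / significant target matches."""
    if target_matches == 0:
        return 0.0
    return decoy_matches / target_matches


def confident_proteins(protein_hits, min_sequences=2):
    """protein_hits: dict of accession -> list of (peptide_sequence, is_significant).

    Keeps hits supported by at least `min_sequences` distinct significant
    peptide sequences; repeated matches to the same sequence count once.
    """
    kept = {}
    for accession, peptides in protein_hits.items():
        distinct = {sequence for sequence, significant in peptides if significant}
        if len(distinct) >= min_sequences:
            kept[accession] = sorted(distinct)
    return kept


# Hypothetical counts at the chosen significance threshold
print(peptide_fdr(decoy_matches=5, target_matches=500))  # 0.01

# Hypothetical protein hits
protein_hits = {
    "PROT_A": [("LVNELTEFAK", True), ("YLYEIAR", True), ("AEFVEVTK", False)],
    "PROT_B": [("HLVDEPQNLIK", True), ("HLVDEPQNLIK", True)],  # same sequence twice
    "PROT_C": [("QTALVELLK", True)],
}
print(confident_proteins(protein_hits))
# {'PROT_A': ['LVNELTEFAK', 'YLYEIAR']}
```

Counting distinct sequences rather than matches is the important design choice: PROT_B has two significant matches but only one peptide sequence, so it is still a one-hit wonder.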


3 comments on “Common myths about protein scores”

  1. Ger Ros said:

    Great article!

  2. Ten years passed, but some people are still talking about filtering one-hit wonders as the way to “eliminate false protein identifications”… :`((

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398614/

  3. Ejvind Mortz said:

    Very good description!
    Allowing a one-peptide hit as a positive protein identification can be acceptable if you know what you are doing and you are evaluating the data manually. If the peptide has the correct Mw with a high mass accuracy, and many fragment ions in a Y-ion or B-ion series, it means that you have identified a peptide with this amino acid sequence with a high certainty.
    Assuming that the rest of this protein is also present in the sample is a matter of believing, or other evidence. It could be another protein containing the same peptide sequence :-)
    Ejvind Mortz, www.alphalyse.com
