Posted by John Cottrell (November 22, 2013)

Does protein FDR have any meaning?

It's easy to grasp the concept of using a target/decoy search to estimate peptide false discovery rate. You search against a decoy database where there are no true matches available, so the number of observed matches provides a good estimate of the number of false matches in the results from the target. People debate implementation details, such as whether the target and decoy should be concatenated or how to create the decoy sequences, but these things are not important if you only want to know whether your peptide FDR is 1% or 5% or 25%.
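As a rough illustration of the arithmetic, here is a minimal sketch that assumes separate target and decoy searches at the same significance threshold. The function name and the counts are hypothetical, and a concatenated search would need a slightly different formula.

```python
def estimate_peptide_fdr(target_matches: int, decoy_matches: int) -> float:
    """Estimate peptide FDR as decoy matches divided by target matches.

    Assumes separate target and decoy searches, so the decoy count
    stands in for the number of false matches among the target matches.
    """
    if target_matches <= 0:
        raise ValueError("need at least one target match")
    return decoy_matches / target_matches


# Hypothetical counts: 100,000 target matches and 1,000 decoy matches
# above the same threshold.
fdr = estimate_peptide_fdr(target_matches=100_000, decoy_matches=1_000)
print(f"Estimated peptide FDR: {fdr:.1%}")
```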

Protein false discovery rate is not so easily estimated. First of all, what exactly do we mean by false proteins? One definition might be database entries that have only false peptide matches. These are clearly unwanted, so it is best to filter them out by requiring every protein to have significant matches to two or more distinct peptide sequences. This eliminates the ‘one hit wonder’ proteins, which make up the bulk of the false proteins according to this definition.

A slightly more sophisticated approach is to calculate the distribution of false peptide matches using Poisson statistics. The current SwissProt (2013_10) has 20,278 human entries. If we searched these and got 100,000 matches at 1% peptide FDR, this would correspond to 1000 false peptide matches. The Poisson distribution predicts that, on average, 952 entries will get one false match, 23 entries will get two, and less than one will get three. If an entry has two false matches to different sequences, it will pass a ‘one hit wonder’ filter, so we could have as many as 23 false proteins in our report. If this is too many, we raise the bar, require significant matches to three or more distinct peptide sequences, and the anticipated number of false proteins drops to less than one.
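The figures above can be reproduced with a few lines of code. This is only a sketch, and it assumes that each false match lands on a database entry at random with equal probability (protein length is ignored), so that the number of false matches per entry follows a Poisson distribution with mean 1000/20,278. The function name is made up for illustration.

```python
import math


def expected_entries_with_k_false(n_entries: int, n_false: int, k: int) -> float:
    """Expected number of entries receiving exactly k false matches.

    Assumes false matches are spread uniformly at random over the
    entries, so the per-entry count is Poisson with mean n_false / n_entries.
    """
    lam = n_false / n_entries
    return n_entries * math.exp(-lam) * lam**k / math.factorial(k)


# 20,278 human entries and 1000 false peptide matches, as in the text.
for k in (1, 2, 3):
    expected = expected_entries_with_k_false(20_278, 1_000, k)
    print(f"{k} false match(es): {expected:.1f} entries")
```

The three values come out at roughly 952, 23 and 0.4, in line with the numbers quoted. Changing `n_entries` or `n_false` shows how quickly the picture shifts for a smaller database or a higher peptide FDR.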

Problem solved? Not if our goal is to present an accurate list of the proteins that were present in the sample. Try this thought experiment: Take the 3 or 4 peptide matches that constitute a confident hit in a typical shotgun experiment and append the sequences to a totally unrelated protein of similar length, even a decoy protein. Would any report based on conventional protein inference differentiate between the correct and unrelated proteins? No, because protein inference only considers the peptide matches. It ignores the unmatched parts of the sequence and there is no penalty for the matches we fail to observe.

So, when multiple database entries share the same set of peptide matches, and there is no evidence to differentiate them, we report them as a ‘same-set’ protein group. By parsimony, we suppose that only one of the group was present in the sample, but we don’t know which one, and we cannot rule out the possibility that two or more of them were present. Yet, the proteins in the group might be very different in any biological sense. If we report all of them when just one was present, how do we account for this in our protein FDR? What if we choose one and it’s the wrong one? Some might call this ambiguity, rather than false proteins, but this is just semantics.
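To make the ‘same-set’ idea concrete, here is a small sketch that groups database entries by their set of matched peptide sequences. It is not how Mascot or any other search engine implements grouping; the function name, accessions and peptide strings are all invented for illustration.

```python
from collections import defaultdict


def same_set_groups(protein_to_peptides):
    """Group proteins whose sets of matched peptides are identical.

    protein_to_peptides maps an accession to an iterable of matched
    peptide sequences. Returns a list of groups (sorted accession lists).
    """
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        # frozenset ignores order and duplicates, so only the set matters.
        groups[frozenset(peptides)].append(protein)
    return [sorted(members) for members in groups.values()]


# Hypothetical matches: P1 and P2 cannot be told apart by the evidence.
matches = {
    "P1": ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC"],
    "P2": ["PEPTIDEC", "PEPTIDEA", "PEPTIDEB"],
    "P3": ["PEPTIDED", "PEPTIDEE"],
}
print(same_set_groups(matches))
```

Nothing in the grouping looks at the unmatched parts of each sequence, which is exactly why P1 and P2 end up indistinguishable however different they may be biologically.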

Protein inference in shotgun proteomics is subject to some very serious and fundamental limitations:

No protein level information

When we analyse a pure protein from a 2D gel spot, protein inference is much easier. If you can identify one peptide, you should be able to identify several, and with high coverage, one database entry becomes the clear winner. Other entries may contain some of the same peptides, but unless they also have similar protein mass and pI, they can be ruled out. In shotgun proteomics, the protein level information is discarded in the interests of speed and scale, and protein inference comes to rely mainly on parsimony.

Low or unknown coverage

In most experiments, shotgun proteomics data are under-sampled. That is, MS/MS scans are acquired for the stronger peptide signals but the weaker ones get overlooked, and the number of different peptides observed for a particular protein depends on its abundance as well as its length. On the plus side, this is the basis of spectral counting as a method of quantitation. On the minus side, it means we can’t assume that a protein with low coverage is a false protein. It could be a true protein that happens to be present at a low level. Not that we actually know what the coverage is, because we don’t have masses for the proteins. When we talk about coverage, this means coverage for the database entry, not for the protein. Any attempt to use coverage in protein inference simply favours the shortest database entry that contains the observed matches.

Generic databases

Some day, it may become routine to create a custom database for the individual proteome under analysis using a technique such as RNA-Seq. Right now, most searches are against the public protein databases, and these will not contain perfectly correct sequences for many of the proteins in the sample. In the absence of the correct sequence, matches are assigned to a set of homologous entries. This search result provides a nice example. Much too small to estimate the peptide FDR with any accuracy, but the significance threshold has been set to a level where no decoy peptide matches survive. Expand hit 1 and you’ll see that most of the peptide matches are shared apart from three sequences that are unique to gi|76363596 (ATVFDQFTPLVEEPK, CCGAEDKEACFAEEGPK, and PPACYATVFDQFTPLVEEPK) and two that are unique to gi|126723507 (SALELDEGYVPK and RPCFSALELDEGYVPK). BLAST alignment between the two protein sequences shows them to be 99% identical. The alignments for the ‘unique’ peptides look like this:

CCGAEDKEACFAEEGPK - gi|76363596
CCGREDKEACFAEEGPK - gi|126723507
 
PPACYATVFDQFTPLVEEPK - gi|76363596
PPACYRTVFDQFTPLVEEPK - gi|126723507
 
RPCFSALELDEGYIPK - gi|76363596
RPCFSALELDEGYVPK - gi|126723507

It seems clear that we don’t have two distinct proteins, just a variant that is not 100% identical to either of the two entries in the database.
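The single-residue differences in the three alignments can be checked mechanically. The helper below is only an illustrative sketch (its name is made up) and it assumes the two peptides are already aligned and of equal length, as they are here.

```python
def substitutions(seq_a: str, seq_b: str):
    """List (position, residue_a, residue_b) wherever two aligned,
    equal-length peptide sequences differ. Positions are 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Peptide pairs from the alignments: gi|76363596 first, gi|126723507 second.
pairs = [
    ("CCGAEDKEACFAEEGPK", "CCGREDKEACFAEEGPK"),
    ("PPACYATVFDQFTPLVEEPK", "PPACYRTVFDQFTPLVEEPK"),
    ("RPCFSALELDEGYIPK", "RPCFSALELDEGYVPK"),
]
for first, second in pairs:
    print(first, second, substitutions(first, second))
```

Each pair differs at exactly one position (A/R, A/R and I/V respectively).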

One way to simplify things is to search a non-redundant database. If your sample is from a well characterised organism, then SwissProt is always a good choice. Some peptide matches will be lost, which could lead to the loss of true proteins that had very low coverage, but the list of proteins with reasonable coverage will be more reliable in that you are less likely to over-report.

Artefacts from modifications

Including an unnecessary modification in a search or omitting a modification that is actually present in the sample can cause a false peptide match that leads to the wrong protein being inferred. The most frequent culprit is deamidation. The same peptide sequence may occur in two different proteins except that in one it has a D at a particular position and in the other an N. If the true protein is the one with the D, but the search included deamidation, we get an equally good match for the false protein. If the true protein is the one with the N, but it is mostly deamidated, we may not see the match for the true protein unless the search includes deamidation.

Deamidation is insidious because the substituted residue is also the site for the modification. There are many other cases where a common modification exactly compensates for a residue substitution, such as A + oxidation = S, S + acetyl = E, and A + carbamyl = N. In these cases, though, the residue itself is not a common site for the modification, so the score for the match will suffer unless the modification can be located adjacent to the substitution, which will happen less frequently. The other common example is M + oxidation = F. The mass difference is 0.033 Da, so this can give an equally good match if the mass tolerance is not too tight.
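The mass arithmetic behind these coincidences is easy to verify. The sketch below uses standard monoisotopic residue and modification masses rounded to five decimal places; the dictionary and function names are invented for illustration.

```python
# Monoisotopic residue masses in Da, rounded to 5 decimal places.
RESIDUE = {
    "A": 71.03711, "S": 87.03203, "E": 129.04259, "N": 114.04293,
    "D": 115.02694, "M": 131.04049, "F": 147.06841,
}
# Monoisotopic mass shifts of the modifications in Da, rounded likewise.
MOD = {
    "oxidation": 15.99491, "acetyl": 42.01057,
    "carbamyl": 43.00581, "deamidation": 0.98402,
}
# (modified residue, modification, residue it can be mistaken for)
CASES = [
    ("N", "deamidation", "D"),
    ("A", "oxidation", "S"),
    ("S", "acetyl", "E"),
    ("A", "carbamyl", "N"),
    ("M", "oxidation", "F"),
]


def mass_gap(residue: str, mod: str, target: str) -> float:
    """Mass of the target residue minus mass of the modified residue."""
    return RESIDUE[target] - (RESIDUE[residue] + MOD[mod])


for residue, mod, target in CASES:
    gap = mass_gap(residue, mod, target)
    print(f"{residue} + {mod} vs {target}: {abs(gap):.3f} Da apart")
```

The first four cases agree to within rounding, because the elemental compositions are identical. Only M + oxidation versus F leaves a real gap, about 0.033 Da.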

Protein inference is a complex problem, which can be made more difficult by conflicting goals. A shotgun survey of the total protein complement of a complex sample is one thing. Detailed characterisation of individual proteins of interest is another. We cannot expect to get both from a single experiment. If the primary aim is an accurate list of the proteins in a complex sample, there are several steps we can take to minimise over-reporting:

  • Set the peptide FDR to 1% or less.
  • Filter out the ‘totally’ false proteins by requiring significant matches to a minimum number of distinct sequences (not just a minimum number of matches).
  • Use an absolute minimum of variable modifications. In particular, don’t include deamidation.
  • Search a non-redundant database.
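The distinction in the second point, distinct sequences rather than matches, can be sketched in a few lines. This is a simplified illustration with a hypothetical input format (pairs of accession and peptide sequence for significant matches only), not the filter used by any particular search engine.

```python
def filter_by_distinct_sequences(matches, min_distinct=2):
    """Keep proteins with at least min_distinct different peptide sequences.

    matches is an iterable of (protein, peptide_sequence) pairs, one per
    significant match. Repeat matches to the same sequence count once.
    """
    distinct = {}
    for protein, sequence in matches:
        distinct.setdefault(protein, set()).add(sequence)
    return {
        protein: sequences
        for protein, sequences in distinct.items()
        if len(sequences) >= min_distinct
    }


# Hypothetical data: P1 has three matches but only one distinct sequence.
significant = [
    ("P1", "PEPTIDEA"), ("P1", "PEPTIDEA"), ("P1", "PEPTIDEA"),
    ("P2", "PEPTIDEB"), ("P2", "PEPTIDEC"),
]
print(sorted(filter_by_distinct_sequences(significant)))
```

With the default threshold, P1 is dropped despite its three matches, while P2 survives. Raising `min_distinct` to 3 corresponds to the stricter filter discussed earlier.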

If there is still ambiguity for a protein of interest, additional experiments will be required. The protein family summary, introduced in Mascot 2.3, attempts to present the search results as clearly as possible, so that you can make up your own mind about what to believe. Sadly, there is no magic solution that can recover the information lost when we do a shotgun experiment.

And the question in the title? I think the answer is no, unless you are willing to accept the (not very useful) definition of false proteins as database entries that have only false peptide matches. If we are trying to present a list of proteins that has some biological significance, it is very important to be aware of the issues associated with protein inference in shotgun proteomics, as discussed above, but it is hard to see how a meaningful numerical estimate can be made of the extent of over-reporting.

7 comments on “Does protein FDR have any meaning?”

  1. I really appreciated this article!! But there is one thing I want to make clear. You mentioned “The current SwissProt (2013_10) has 20,278 human entries. If we searched these and got 100,000 matches at 1% peptide FDR, this would correspond to 1000 false peptide matches. The Poisson distribution predicts that, on average, 952 entries will get one false match, 23 entries will get two, and less than one will get three.” Does this experiment have any reference or publication that I can cite? Thanks very much!! -Peter

    • John Cottrell said:

      I didn’t think this had appeared in the literature, but a Google search shows it was discussed in ‘Exploring the Human Plasma Proteome’, edited by Gilbert S. Omenn (http://eu.wiley.com/WileyCDA/WileyTitle/productCd-3527609423.html)

      • Pater said:

        Thanks a lot!! My other question is: how many confident entries can you get from 100,000 matches? If there are 8,000, can I say that the protein FDR is (952+23+1)/8000?
        Many thanks,
        -Peter

        • John Cottrell said:

          952+23+1 is specific to the case of 1000 false peptides and a database of 20,278 entries. If you wanted to estimate the number of totally false proteins before filtering out one-hit wonders, you could perform a similar calculation.

  2. Maarten Dhaenens said:

    How is it that, for years now, people have blindly reported protein FDRs without thinking about it? This issue is well-known (and was first posted by John in 2013), yet reviewers keep on asking for (controlled) protein FDRs and people keep on saying they found e.g. 5000 proteins, just because the number is high. Thank you Matrix Science, for focusing on quality and not quantity! You are one of the few that can make a large scale difference here!

  3. He, Si-Min said:

    After nearly four years, how would you comment on this post of yours, especially considering the new progress in the field of protein FDR estimation:
    (1) A picked target-decoy strategy was proposed by Savitski et al. in 2015 in their MCP paper entitled “A scalable approach for protein false discovery rate estimation in large proteomic data sets” (http://www.mcponline.org/content/early/2015/05/17/mcp.M114.046995.abstract), which cited, and of course argued against, this post of yours.
    (2) Lukas Kall et al. released Percolator 3.0 in 2016 in their JASMS paper entitled “Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0” (https://link.springer.com/article/10.1007/s13361-016-1460-7), which adopted the picked TDS for protein FDR estimation as a new feature of Percolator 3.0. In particular, Mascot 2.6 has adopted Percolator 3.0; I wonder whether Mascot has also adopted the protein FDR output of Percolator 3.0?
    (3) The HPP MS Data Interpretation Guideline 2.1 (http://pubs.acs.org/doi/abs/10.1021/acs.jproteome.6b00392) was released in 2016, which stipulated that “Present large-scale results thresholded at equal to or lower than 1% protein-level global FDR,” although not specifying which kinds of estimation methods were allowed.
    Thanks!

    • John Cottrell said:

      I don’t think anything fundamental has changed. If I was writing the article today, it would probably make very similar points. None of the three papers that you cite defines what is meant by a false protein. Until there is a real consensus on this point, protein FDR is just a vague attempt at quality control.
