Posted by John Cottrell (December 17, 2019)

Protein FDR in Mascot Server 2.7

One of the new features in Mascot Server 2.7, now running on this web site, is an estimate of protein FDR. This is displayed in the Protein Family Summary for FASTA searches whenever the automatic decoy option is selected.

The basis is the number of proteins inferred in the target database compared with the number in the decoy database. Conceptually, this is similar to peptide FDR, but counting proteins calls for a number of additional definitions and assumptions.

  • Only peptide sequence matches (PSMs) with significant scores are used as evidence for proteins. Proteins with shared PSMs are grouped into families. Each distinct family member contains at least one unique peptide sequence, not shared with other family members.
  • A family member may represent multiple same-set proteins, one of which is given prominence as the anchor protein. Sub-set and intersection proteins are relegated to a lower level list, where they belong to the family but not to any particular family member.
  • The protein count used for FDR is a count of family members. That is, if the report contains 2 families, one with 4 members and the other with a single member, this counts as a total of 5 proteins. Same-set, sub-set and intersection proteins are not counted.
  • A protein identification is considered to be true positive if it contains at least one true positive PSM. A protein is a false positive only when all of its PSMs are false positives.
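The counting rules above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the names `families`, `protein_count` and `is_false_protein` are assumptions made for the example, not part of Mascot.

```python
# Hypothetical report: each family maps to its distinct family members.
# Same-set, sub-set and intersection proteins are deliberately not listed,
# because they are not counted for protein FDR.
families = {
    "family 1": ["member 1", "member 2", "member 3", "member 4"],
    "family 2": ["member 1"],
}


def protein_count(families):
    """Protein count used for FDR: the total number of family members."""
    return sum(len(members) for members in families.values())


def is_false_protein(psm_is_true):
    """A protein is a false positive only if all of its PSMs are false."""
    return not any(psm_is_true)


print(protein_count(families))           # 5
print(is_false_protein([False, True]))   # False: one true PSM is enough
print(is_false_protein([False, False]))  # True: every PSM is false
```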

This last point is crucial. If there is good evidence for a protein, this isn’t negated just because a false match has also been assigned to it. In the decoy database, all of the PSMs are false so that, by definition, all of the proteins are false. We should not assume that the number of false proteins in the target database is exactly the same, because some of the proteins that contain false matches may also contain true matches, and should be counted as true. We have taken a similar approach to that used in MAYU, from the Aebersold group. Given the number of proteins and the numbers of true and false peptide sequences, we can use a hypergeometric model to estimate the number of proteins that are entirely false. The correction will be small for most searches, but can become significant when a large fraction of the entries in the database are true hits. (In the limiting case that all of the proteins in the database are present in the sample, and are measured and identified, the protein FDR is identically zero, no matter how many false PSMs are being reported.)
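To make the idea concrete, here is a much simplified sketch of a hypergeometric correction. It assumes that false matches fall on database entries at random, and that the number of true proteins is already known. Neither MAYU nor Mascot is claimed to work exactly this way, and the function names and arguments are invented for the illustration.

```python
from math import comb


def false_protein_pmf(n_db, n_true, n_hit):
    """Distribution of the number of wholly false proteins.

    n_hit proteins receive false matches, drawn without replacement from
    n_db database entries, of which n_true are true proteins. The result
    maps k to P(k of the n_hit proteins are not true proteins).
    """
    n_absent = n_db - n_true
    total = comb(n_db, n_hit)
    k_min = max(0, n_hit - n_true)
    k_max = min(n_hit, n_absent)
    return {
        k: comb(n_absent, k) * comb(n_true, n_hit - k) / total
        for k in range(k_min, k_max + 1)
    }


def expected_false_proteins(n_db, n_true, n_hit):
    """Mean of the distribution above, equal to n_hit * (n_db - n_true) / n_db."""
    pmf = false_protein_pmf(n_db, n_true, n_hit)
    return sum(k * p for k, p in pmf.items())


# Limiting case: every database entry is a true protein, so no protein
# can be wholly false, however many false matches there are.
print(expected_false_proteins(1000, 1000, 50))  # 0.0
```

In this simplified picture, the correction only matters when `n_true` is a sizeable fraction of `n_db`.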

The default significance threshold for a Mascot search is usually 0.05 and this will often give a peptide FDR in the region of 5%. If the decoy search reveals the actual FDR to be excessively high or low, or if some other value of FDR is required, the significance threshold can be adjusted manually or automatically to achieve the required value.
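As a rough illustration of what an automatic adjustment has to do, the sketch below scans candidate cut-offs and keeps the loosest one whose decoy/target ratio does not exceed the required FDR. It assumes each match carries an expect value that is compared with the significance threshold. The function `threshold_for_fdr` and the toy data are hypothetical, and this is not a description of how Mascot performs the adjustment.

```python
def threshold_for_fdr(target_evalues, decoy_evalues, max_fdr):
    """Return the loosest cut-off with decoy/target <= max_fdr, or None."""
    best = None
    for cutoff in sorted(set(target_evalues) | set(decoy_evalues)):
        n_target = sum(e <= cutoff for e in target_evalues)
        n_decoy = sum(e <= cutoff for e in decoy_evalues)
        # Skip cut-offs with no target matches to avoid dividing by zero.
        if n_target and n_decoy / n_target <= max_fdr:
            best = cutoff
    return best


# Toy expect values, invented for the example.
target = [0.001, 0.002, 0.003, 0.01, 0.02, 0.04]
decoy = [0.015, 0.03]

print(threshold_for_fdr(target, decoy, 0.2))  # 0.02
print(threshold_for_fdr(target, decoy, 0.1))  # 0.01
```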

Take a look at an example of typical search results. When first loaded, the FDR for PSMs is 2.33% and for distinct peptide sequences 3.53%. If we want to report results at an FDR of 1% for peptide sequences, we can select sequences and Adjust to 1%, which causes the number of sequences to drop from 9797 to 8418. As you might expect, the protein FDR tracks the peptide FDR, and falls from 7.15% (4487 proteins) to 1.98% (4036 proteins).

Maybe 2% is still too high? If you care to experiment, you’ll find that a peptide sequence FDR of 0.5% gives a protein FDR of just under 1%. Follow the nearby link to open the decoy search results in a new browser tab, and you’ll see the list contains 38 proteins. The count used for the protein FDR is 36 decoy proteins, the difference being the correction for false peptides assigned to true proteins. The correction is small because the count of true proteins is just 3820 out of a database of 92,910 entries. It is quite unlikely that a false PSM will be assigned to a true protein by chance.
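A back-of-envelope check on these figures, assuming that false matches land on database entries at random, so that each decoy protein corresponds to a target entry that is not a true protein with probability (92910 - 3820) / 92910. This is one reading of the numbers quoted above, not necessarily the exact calculation performed by Mascot.

```python
db_entries = 92910     # entries in the database
true_proteins = 3820   # count of true proteins quoted above
decoy_proteins = 38    # proteins listed in the decoy search results

# Chance that a randomly placed false match misses every true protein.
fraction_not_true = (db_entries - true_proteins) / db_entries

# Expected number of target proteins that are wholly false.
corrected = decoy_proteins * fraction_not_true

print(round(fraction_not_true, 4))  # 0.9589
print(round(corrected, 1))          # 36.4
```

Under these assumptions, the estimate rounds to the 36 decoy proteins used for the protein FDR.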

There is a second control that can be used to adjust the protein FDR: the Min. number of sig. unique sequences. At the default setting of 1, we are reporting ‘one-hit wonders’. For large searches, conventional wisdom is that it is safer to exclude these, as explained in an earlier article. For this search, the precise counts of peptide sequences and proteins for 1% protein FDR look like this for two values of this setting:

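| Min. sig. seq. | Sig. thresh. | Target seq. | Decoy seq. | Peptide seq. FDR | Target prot. | Decoy prot. | Protein FDR |
|---|---|---|---|---|---|---|---|
| 1 | 0.00215 | 7763 | 41 | 0.53% | 3832 | 39 | 1.02% |
| 2 | 0.16 | 10065 | 618 | 6.14% | 2281 | 23 | 1.01% |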
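The FDR columns are consistent with a simple ratio of decoy count to target count, which can be checked directly. The snippet below only re-derives the percentages from the counts in the table. It assumes the decoy protein counts shown are the ones actually used for the protein FDR, and the variable names are invented for the example.

```python
# (min. sig. seq., target seq., decoy seq., target prot., decoy prot.)
rows = [
    (1, 7763, 41, 3832, 39),
    (2, 10065, 618, 2281, 23),
]


def fdr_percent(decoy, target):
    """FDR as a percentage, taken as decoy count / target count."""
    return round(100 * decoy / target, 2)


for min_seq, t_seq, d_seq, t_prot, d_prot in rows:
    print(min_seq, fdr_percent(d_seq, t_seq), fdr_percent(d_prot, t_prot))
# 1 0.53 1.02
# 2 6.14 1.01
```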

In fact, we can report a lot more proteins at 1% FDR by retaining the one-hit wonders. This is partly because the numbers of peptides and proteins being reported are both small compared with the size of the database. It is also a function of the peptide match score distribution. If we were to search a much smaller database, keeping everything else the same, we might find the situation would reverse, and we could report more proteins for a given FDR by setting Min. number of sig. unique sequences to a higher value. Note that this setting only affects the count of proteins; it doesn’t change the count of significant peptides. Significant peptide matches assigned to proteins that are dropped are moved to the unassigned list.

Often, database redundancy causes protein inference ambiguity, meaning that we could account for the peptide evidence using different sets of proteins. A protein FDR of 1% only tells us that an estimated 1% of the proteins listed are wholly false. This doesn’t mean the other 99% are "correct". In particular, a family member may represent a number of same-set and near same-set proteins. Unless the PSMs provide near complete coverage, two same-set proteins could have major differences in the regions for which we have no evidence. In such cases, it is important to remember that a protein accession in the summary report doesn’t mean "this is the correct protein"; it means "the correct protein is likely to be very similar to one of the set of proteins represented by this family member".
