Posted by John Cottrell (January 19, 2015)

Creating a list of confidently identified proteins

This can be done very easily using Report Builder:

  1. Select the Decoy checkbox when submitting the search
  2. Open the result report as a Protein Family Summary
  3. Switch to the Report Builder tab
  4. Expand the decoy search section and set the peptide FDR to 1%
  5. Expand the filters section and set ‘Num of significant unique sequences’ > 1
  6. Optionally, expand the columns section and choose which columns you require and their order
  7. Print the table or export it as CSV

At step 4, you may discover that you have too few matches to get an accurate peptide FDR. Since the main purpose of the exercise is to avoid reporting proteins based on false peptide matches, as long as you have very few false peptide matches, this won’t be a problem.

If you choose to use Percolator, make sure the significance threshold is left at the default setting of 0.05. The arguments passed to Percolator set the peptide FDR goal to 1%. If there is a problem and Percolator cannot get close to this goal, it isn’t a good idea to try to force things by changing the significance threshold.

At step 5, you may need to set a higher threshold. Consider a search with 10,000 significant target matches. While we might be happy to report these results with a 1% peptide FDR, meaning 100 false peptides, few of us would be comfortable reporting 100 false proteins. Filtering out ‘one-hit wonders’ by requiring significant matches to more than one peptide sequence only works well if the number of false matches is small compared with the number of database entries, as discussed in ‘Does protein FDR have any meaning?’.

As a rough approximation, you don’t have to worry about this if the number of target matches at 1% FDR is less than the number of target database entries. You can find the count of entries in the report header. If you used a taxonomy filter, it is the count after the filter that matters. If the numbers are similar, or if the number of matches is greater than the number of entries, you need to use the Poisson distribution to decide where to draw the line. There are many online calculators, or you can use this spreadsheet. If you plug in the number of entries in SwissProt mouse (16,727), 18,000 true and 180 false matches, you’ll see that, by chance, 178 entries get 1 false match and only 1 entry gets 2. In such a case, setting ‘Num of significant unique sequences’ > 1 is a safe choice. If you increase the number of false matches by a factor of 10, to 1,800, you’ll see that 87 entries have 2 false matches and 3 entries have 3. If these were the numbers for your search, you might want to set ‘Num of significant unique sequences’ > 2.
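If you would rather script the Poisson estimate than use a calculator, the following is a minimal sketch in Python. It assumes that false matches land on database entries independently and uniformly at random, so the number of false matches per entry is Poisson distributed with mean (false matches / entries). The function name and its parameters are made up for this illustration and are not part of Mascot.

```python
import math


def expected_entries_with_k_false(n_entries, n_false, k):
    """Expected number of database entries that pick up exactly k false
    matches by chance (hypothetical helper, not a Mascot function).

    Assumes false matches are spread uniformly at random over the entries,
    so the count per entry is Poisson with mean n_false / n_entries.
    """
    lam = n_false / n_entries
    return n_entries * math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    n_entries = 16727  # SwissProt mouse, the figure used in the text
    for n_false in (180, 1800):
        for k in (1, 2, 3):
            expected = expected_entries_with_k_false(n_entries, n_false, k)
            print(f"{n_false} false matches, k={k}: {expected:.1f} entries")
```

Rounded to whole entries, this gives 178 entries with 1 false match and 1 entry with 2 for 180 false matches, and 87 entries with 2 and 3 entries with 3 for 1,800 false matches, which are the figures quoted above.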

If you are running Mascot 2.4, the closest available filter term is ‘Num of significant sequences’, which is slightly less stringent. For example, if family member 6.2 had significant matches to 9 sequences, but 8 of these were also matched by 6.1, it would pass a filter of ‘Num of significant sequences’ > 1 but fail ‘Num of significant unique sequences’ > 1.
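The difference between the two counts can be sketched with sets. This is only an illustration of the example above, on the assumption that a ‘unique’ sequence is one not matched by any other member of the same family; the peptide labels and the function are invented, and this is not how Mascot computes the values internally.

```python
def count_sequences(member, other_members):
    """Return (significant sequences, significant unique sequences) for one
    family member. Hypothetical helper: 'unique' is taken to mean 'not
    matched by any other member of the family'.
    """
    shared_pool = set().union(*other_members)
    return len(member), len(member - shared_pool)


# Invented peptide labels: 6.2 has 9 sequences, 8 of them shared with 6.1
member_6_1 = {f"PEP{i}" for i in range(1, 9)}
member_6_2 = member_6_1 | {"PEP9"}

total, unique = count_sequences(member_6_2, [member_6_1])
print(total > 1)   # passes 'Num of significant sequences' > 1
print(unique > 1)  # fails 'Num of significant unique sequences' > 1
```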

Many other useful filters are available. For example, you can filter by database so as to remove contaminants from the final table. If this is important work, it can be interesting to load the report for the decoy matches (link in the decoy section), apply the same filters, and see how many false proteins you would report, if any.
