High FDRs for methylated peptides II
In a previous article, we discussed how the false discovery rate (FDR) for modified peptides would be higher than the global FDR for all PSMs if the proportion of modified peptides in the search space for false matches was higher than for true matches. This is only one factor in the very high FDRs for methylated peptides reported in the MCP paper "Large Scale Mass Spectrometry-based Identifications of Enzyme-mediated Protein Methylation Are Subject to High False Discovery Rates", in which a match to the correct peptide with the correct modification but on the wrong site was counted as false. Potentially, a match to the correct peptide with the correct modification on the correct site could be counted as false because it could be shown that the modification was an artefact, and not post-translational. This is a very challenging definition of false.
Stricter than usual definitions of true and false
The UNSW paper is concerned with methylated peptides: whether the modification is post-translational or an artefact, and whether it has been localised to the correct residue. Multiple searches were performed using discrete sets of variable modifications and the results consolidated. All searches included carbamidomethyl (C) and oxidation (M), combined with one of the following:
- methyl (K), dimethyl (K), trimethyl (K)
- methyl (R), dimethyl (R)
- methyl (DE)
- ethyl (DE)
- isopropyl (DE)
- propionamide (C)
The authors determined the set of true, enzymatic methylation sites by growing their yeast cells on media containing labeled methionine. One such true positive is NVSVK*EIR from Elongation factor 1-alpha, where the lysine may be modified by 1, 2, or 3 methyl groups. Let’s assume that a search using the first set of mods gave a number of strong matches to this sequence with dimethyl on K5. If so, then it is likely that the fourth set of mods would give weaker but still significant matches for ethyl on E6. In the UNSW study, no competition is allowed between searches, so these would be counted as false positives. If the second set of mods gave any significant scores for dimethyl on R8, these would also be counted as false positives.
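To make the ambiguity concrete, here is a minimal sketch that lists every residue in NVSVKEIR that could carry a +28.0313 Da shift under the modification sets described above. The function name and the grouping of modifications into a single dictionary are assumptions made for illustration; in the study each set was searched separately.

```python
# Illustrative sketch only: enumerate the sites in a peptide that could
# explain a +28.0313 Da mass shift (elemental composition C2H4).
# Dimethyl (K), dimethyl (R) and ethyl (DE) come from different search
# sets in the article; they are pooled here purely for illustration.
MODS_28 = {
    "Dimethyl": "KR",
    "Ethyl": "DE",
}

def candidate_sites(peptide, mods):
    """Return (modification, residue+position) pairs, positions 1-based."""
    sites = []
    for name, residues in mods.items():
        for pos, aa in enumerate(peptide, start=1):
            if aa in residues:
                sites.append((name, f"{aa}{pos}"))
    return sites

sites = candidate_sites("NVSVKEIR", MODS_28)
for name, site in sites:
    print(f"{name} on {site}")
```

This prints dimethyl on K5, dimethyl on R8 and ethyl on E6: three placements with the same precursor mass, of which only one is counted as true.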
This is unusually demanding. The higher the incidence of esterification on D and E, the more false positives can be expected from the first two searches whenever there is a K or R close to the modified residue. The authors also describe other sources of false positives, such as invoking a false methyl close to a cysteine because the alkylation was artefactual Propionamide (C) rather than the intended Carbamidomethyl (C), a difference of 14 Da.
Modifications carry additional ambiguity
Let’s review what we mean by true and false positives in the context of a typical database search. The Mascot score measures the probability of a match being a chance event, that is, a match to an unrelated sequence. What happens when two sequences are very similar? Consider a high scoring, highly significant match to a long peptide. If we were to interchange an adjacent pair of residues, the spectrum would hardly change and hence the score would hardly change; it would still be a highly significant match even though the sequence is no longer 100% correct. Fortunately, we don’t often see this type of mutation. A more common cause of false positives is when a SNP and a modification, or two modifications, balance out so that the precursor mass is unchanged. For example, the analyte sequence contains --M*A-- and the database sequence contains --MS--, where M* represents oxidised methionine. Strictly speaking, this is an incorrect match, but any real-life scoring algorithm will give similar scores to both. These are not unrelated sequences; they are very closely related. If both sequences were in the database, the best you could hope for is that the correct one would get a slightly higher score, but even this cannot be guaranteed.
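The M*A versus MS example can be checked with a few lines of arithmetic. This is a small sketch using standard monoisotopic residue masses; the variable names are made up for illustration.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE = {"M": 131.040485, "A": 71.037114, "S": 87.032028}
OXIDATION = 15.994915  # Oxidation (M), one added oxygen

# Analyte: oxidised Met followed by Ala. Database: unmodified Met followed by Ser.
analyte = RESIDUE["M"] + OXIDATION + RESIDUE["A"]
database = RESIDUE["M"] + RESIDUE["S"]
diff = abs(analyte - database)

print(f"analyte  M*A: {analyte:.6f} Da")
print(f"database MS : {database:.6f} Da")
print(f"difference  : {diff:.6f} Da")
```

The difference is effectively zero, because Ser differs from Ala by exactly one oxygen, so precursor mass alone cannot separate the two interpretations.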
It's a similar situation with modifications. Detecting the presence of a modification on a peptide is usually clear cut because it causes a mass shift. But, if there are many ways of creating a particular mass shift, we might get a good match even though the modification selected for the search is incorrect. An example would be searching with Phospho (ST) and getting a match to a sulfated peptide, or searching with methyl (K) and getting a match to a peptide that is actually modified on a nearby D or E.
Even when we search with the correct modification, determining the site of modification will usually be more difficult than detecting its presence. A peptide might have two adjacent S or T residues and the difference in the spectrum when a phosphate is moved from one to the other is the presence or absence of a single peak, not necessarily a very strong peak. Certainly, you can use the Mascot score for site analysis, but the greater the proximity of the sites, the greater the uncertainty.
What target/decoy can and cannot tell us
When FDR is estimated by target/decoy, the count of false positives is limited to significant matches to unrelated sequences. The FDR estimate does not represent the following types of false positive because the decoy database doesn’t contain correct sequences or even sequences that are highly homologous to correct sequences:
- A homologous sequence which has a peptide mass within tolerance, possibly through the addition or loss of spurious modifications
- The correct sequence with the correct modification on a nearby but incorrect site
- The correct sequence with an incorrect modification or combination of modifications that have a mass delta within tolerance, e.g. phospho vs. sulfo or propionamide vs. carbamidomethyl + methyl
- The correct sequence with a modification that has the correct elemental composition but the wrong structure, e.g. dimethyl vs. ethyl
- The correct sequence with a modification that has the wrong origin, e.g. post-translational vs. artefactual
If you are lucky, and the correct match is available, competition may come to the rescue in the first three cases. That is, both the true and false matches may have significant scores, but the true match has the higher score and only the highest scoring match to each spectrum gets reported and counted. Database search cannot distinguish the last two cases from true matches.
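The mass coincidences named in the list above can be illustrated with a short calculation. This is a sketch using standard monoisotopic deltas; the helper names, the 2000 Da peptide mass and the ppm tolerances are assumptions chosen for illustration, not values from the study.

```python
# Monoisotopic mass deltas in Da for the modifications discussed above.
MONO = {
    "Methyl": 14.015650,
    "Dimethyl": 28.031300,
    "Ethyl": 28.031300,
    "Carbamidomethyl": 57.021464,
    "Propionamide": 71.037114,
    "Phospho": 79.966331,
    "Sulfo": 79.956815,
}

def delta_da(a, b):
    """Absolute mass difference between two combinations of modifications."""
    return abs(sum(MONO[m] for m in a) - sum(MONO[m] for m in b))

def within_tolerance(a, b, peptide_mass, tol_ppm):
    """True if the two combinations differ by less than the precursor tolerance."""
    return delta_da(a, b) <= peptide_mass * tol_ppm * 1e-6

pairs = [
    (["Phospho"], ["Sulfo"]),
    (["Propionamide"], ["Carbamidomethyl", "Methyl"]),
    (["Dimethyl"], ["Ethyl"]),
]
for a, b in pairs:
    # Hypothetical 2000 Da peptide searched at 10 ppm.
    ok = within_tolerance(a, b, peptide_mass=2000.0, tol_ppm=10.0)
    print(f"{' + '.join(a)} vs {' + '.join(b)}: "
          f"{delta_da(a, b):.6f} Da, within 10 ppm at 2000 Da: {ok}")
```

Propionamide versus carbamidomethyl plus methyl, and dimethyl versus ethyl, are identical in elemental composition, so no mass tolerance can separate them. Phospho versus sulfo differs by about 0.0095 Da, which is inside or outside the window depending on peptide mass and instrument accuracy.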
Competition is essential
If competition were allowed, and only the highest scoring match to each spectrum counted, the FDRs in the UNSW study would be substantially lower. It is possible to simulate competition when merging search results, but an error tolerant search is an easier way to test for a large number of modifications in a single search. The error tolerant search tests all the modifications in Unimod serially, so will fail to find a match that requires two different modifications unless one of them has been selected as a variable modification. Looking at the list of true positive methyl-PSMs in the supplementary data (Tables SII and SIII), there is only one such peptide: NDYGPPR*GSYGGSR**GGYDGPR (methyl on R7, dimethyl on R14).
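Simulating competition when merging results from separate searches amounts to keeping only the top-scoring match for each spectrum. The sketch below shows the idea. The `PSM` record, the spectrum identifiers and all scores are invented for illustration and do not come from the study or from any Mascot output format.

```python
from typing import NamedTuple

class PSM(NamedTuple):
    spectrum_id: str   # hypothetical identifier for the MS/MS spectrum
    search: str        # which variable-modification set produced the match
    assignment: str    # peptide plus modifications
    score: float       # search engine score, higher is better

def best_per_spectrum(psms):
    """Keep only the highest scoring match for each spectrum (ties keep the first seen)."""
    best = {}
    for psm in psms:
        current = best.get(psm.spectrum_id)
        if current is None or psm.score > current.score:
            best[psm.spectrum_id] = psm
    return best

# Invented example: one spectrum matched in three separate searches.
merged = [
    PSM("scan_1", "methyl (K)", "AEQLYEGPADDANCIAIK, Methyl K18 + Carbamidomethyl C14", 41.0),
    PSM("scan_1", "methyl (DE)", "AEQLYEGPADDANCIAIK, Methyl D11 + Carbamidomethyl C14", 38.0),
    PSM("scan_1", "propionamide (C)", "AEQLYEGPADDANCIAIK, Propionamide C14", 55.0),
]
winners = best_per_spectrum(merged)
for psm in winners.values():
    print(psm.spectrum_id, "->", psm.assignment)
```

With competition, only the propionamide assignment would be reported for this spectrum; without it, all three would be counted, two of them as false positives.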
To get an idea of how the error tolerant search might perform on these files, the merged peak lists from nostainbands_orbi_1.raw through 28 were searched with Carbamidomethyl (C) as a fixed modification plus Oxidation (M) and Propionamide (C) as variable modifications. The significance threshold was set to give 1% FDR for the first pass matches.
As a specific example of how competition helps, these are the significant matches to AEQLYEGPADDANCIAIK in the UNSW study:
22 x Carbamidomethyl on C14
6 x Methyl on K18, Carbamidomethyl on C14
22 x Carbamidomethyl on C14
22 x Carbamidomethyl on C14
7 x Methyl on D11, Carbamidomethyl on C14
The PSMs to Methyl (K) were counted as false positives. In the error tolerant results, there are no rank 1 matches to Methyl for this peptide because matches to Propionamide on C14 always get a higher score:
22 x Carbamidomethyl on C14
1 x Cys->Dha on C14
4 x Dehydrated on D11, Carbamidomethyl on C14
3 x Deamidation on N13, Carbamidomethyl on C14
6 x Propionamide on C14
1 x Oxidation on Y5, Carbamidomethyl on C14
4 x Carbamidomethyl on D10, Carbamidomethyl on C14
The UNSW paper was concerned with post-translational methylation on K and R. In the error tolerant results, there are 39 cases of methylation on the peptide C-terminus, which database search cannot distinguish from methylation on the C-terminal K or R side chain. For these matches, we will follow the rule in the paper that cleavage is allowed for mono-methylation but inhibited by di- or tri-methylation, and assume the latter must be C-term esters. This leaves 91 true matches (i.e. matches found in Tables SII and SIII) and 24 other matches with methylation on K or R, assumed false.
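As a rough check on these counts, the PSM-level methylation FDR can be worked out directly, assuming it is taken as false matches divided by all reported methyl-K/R matches. The variable names are illustrative.

```python
true_matches = 91    # matches found in Tables SII and SIII
false_matches = 24   # other matches with methylation on K or R, assumed false

# Fraction of reported methyl-K/R PSMs that are assumed false.
fdr = false_matches / (true_matches + false_matches)
print(f"Methylation FDR at the PSM level: {fdr:.1%}")  # 20.9%
```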
This is better than without competition, though still a long way from 1%. In the UNSW paper, the methylation FDRs were for counts of distinct peptides, not PSMs, and this introduces a further complication that will be the subject of the next article.