Posted by John Cottrell (September 22, 2015)

Mass-tolerant vs Error tolerant

"A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides" in Nature Biotechnology is from Steven Gygi’s lab at Harvard Medical School. It describes the use of a very wide precursor mass tolerance, +/- 500 Da, to identify modified peptides in a Sequest search.

How does this approach, which the authors call an open search, compare with a "conventional" multi-pass search, such as the Mascot error tolerant search? To find out, we downloaded Gygi’s HEK 293 cell data set, consisting of 24 Q-Exactive Orbitrap raw files, from Pride project PXD001468. Mascot Daemon was used to automate peak picking of the files using Mascot Distiller, merge them, and submit an error tolerant search to Mascot Server 2.5.1. The Distiller processing options can be downloaded here. The sequence database was identical to that used in the paper (GRCh37.61.pep.all) and the other search parameters were:

Enzyme                  :	Trypsin/P
Fixed modifications     :	Carbamidomethyl (C)
Variable modifications  :	Oxidation (M)
Mass values             :	Monoisotopic
Protein mass            :	Unrestricted
Peptide mass tolerance  :	± 5 ppm (# 13C = 1)
Fragment mass tolerance :	± 15 ppm
Max missed cleavages    :	2
Instrument type         :	ESI-TRAP
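
To make the contrast with the open search concrete, here is a minimal Python sketch of the two precursor windows. The function names and the way the 13C setting is handled are illustrative assumptions, not a description of how Mascot or Sequest implement the test.

```python
C13_SHIFT = 1.0033548  # mass difference between 13C and 12C, in Da

def within_precursor_tolerance(observed, theoretical, ppm=5.0, max_13c=1):
    """True if the observed mass is within +/- ppm of the theoretical mass,
    also allowing the precursor to have been picked from a 13C isotope peak."""
    for n in range(max_13c + 1):
        expected = theoretical + n * C13_SHIFT
        if abs(observed - expected) <= expected * ppm * 1e-6:
            return True
    return False

def within_open_tolerance(observed, theoretical, window_da=500.0):
    """True if the observed mass falls inside a wide +/- window_da open search window."""
    return abs(observed - theoretical) <= window_da

# A peptide carrying an unsuspected oxidation (+15.9949 Da):
print(within_precursor_tolerance(2015.9949, 2000.0))  # narrow window
print(within_open_tolerance(2015.9949, 2000.0))       # open window
```

The narrow window rejects the oxidised precursor unless the modification is specified or found in a second pass, while the open window accepts it and leaves the 15.9949 Da difference to be interpreted afterwards.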

A separate search was used to determine the significance threshold to give a peptide FDR of 1% for the first pass search. The most abundant modifications, with more than 1000 instances each, from the open search (as listed in Supplementary Table 3) and the error tolerant search are as follows:

Mass-tolerant (open) Search
Bin	Delta	Count	Assignment
234	-0.0002	339578	(unmodified)
252	15.9944	21171	Oxidation
277	43.0059	13660	Carbamyl
236	1.0259	12741	13C
235	0.9608	11747	Deamidated
237	1.9755	7614	Should be 2.01, 13C2?
216	-17.0255	6627	Ammonia-loss, Gln->pyro-Glu
399	301.9864	5600	?
233	-0.9464	4521	artefact
287	53.9190	3326	Cation:Fe[II]
264	27.9946	3285	Formyl
232	-1.0281	3185	artefact
230	-2.0534	2599	artefact
269	31.9893	2561	Dioxidation
333	183.0367	2290	AEBS
254	16.9961	2030	Oxidation+13C?
189	-89.0305	1934	Met-loss+Acetyl
305	79.9666	1866	Phospho
318	128.0964	1588	Lys
231	-1.9276	1573	artefact
239	3.0216	1514	13C3?
238	2.9008	1272	artefact
369	249.9803	1254	?
292	57.0227	1108	Carbamidomethyl

Error tolerant Search
Modification	Site	Delta	Count	Notes
Carbamidomethyl	C	57.0214	136316	Fixed mod in search
Oxidation	M	15.9949	79590	Variable mod in search
Non-specific cleavage	-	-	16836
Carbamyl	N-term	43.0058	13056
Gln->pyro-Glu	N-term	-17.0265	8094
Deamidated	N	0.9840	7295
AEBS	Y	183.0354	4472
Dioxidation	W	31.9898	3984
Formyl	S	27.9949	3761
Ammonia-loss	N-term	-17.0265	2919	pyro-carbamidomethyl
Phospho	S	79.9663	2669
AEBS	K	183.0354	2529
Acetyl	N-term	42.0106	2510
Formyl	T	27.9949	2153
Oxidation	W	15.9949	2117
Deamidated	Q	0.9840	1848
Carbamyl	K	43.0058	1699
Glu->Gln	E	-0.9840	1514	same as amidation
Arg	N-term	156.1011	1275	ISD / non-specific cleavage
Carbamyl	T	43.0058	1224
Cation:Fe[II]	D	53.9193	1172
Iodo	Y	125.8966	1138
Cation:Fe[II]	E	53.9193	1132
Delta:H(2)C(2)	N-term	26.01565	1121
Carbamyl	S	43.0058	1091
Ammonia-loss	N	-17.0265	1030

The frequency distributions for the bins illustrated in Figure 2 of the paper are narrow Gaussians, but some of the other bins with high counts extend over a very wide mass range and are not well fitted by a Gaussian, so have been labelled artefact. An example would be bin 231, which tails from -1.98 to -1.80. Bin 237 is listed with a mass of 1.9755 and a count of 7614, but is actually a broad distribution extending from 1.8 to 2.1 with a spike at 2.01.

[Figures: bin 231 histogram; bin 237 histogram]

This is a concern, because the mass accuracy is good (low ppm). The channels in these histograms are 0.01 Da wide, so any particular modification can only account for one or two channels at most. There is almost a continuum of delta mass values in certain ‘busy’ regions, and it is difficult to imagine coming up with any sort of assignment for most of them. A first guess might be that these are mostly false matches, but the authors argue strongly that the peptide FDR is well below 1%. Further investigation is clearly required.
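
A rough calculation shows why the broad bins are surprising. The sketch below assumes a precursor error of 5 ppm, borrowed from the error tolerant search settings purely for illustration, and the function names are hypothetical.

```python
CHANNEL_WIDTH = 0.01  # Da, the histogram channel width quoted in the text

def error_band_width(peptide_mass, ppm):
    """Full width in Da of a +/- ppm error band at the given peptide mass."""
    return 2 * peptide_mass * ppm * 1e-6

def channels_spanned(low, high, width=CHANNEL_WIDTH):
    """Number of histogram channels covered by the range low..high."""
    return round((high - low) / width)

for mass in (1000.0, 2000.0, 3000.0):
    band = error_band_width(mass, ppm=5.0)
    print(f"{mass:.0f} Da peptide: band {band:.3f} Da")

# Bin 231 tails from -1.98 to -1.80
print("bin 231 spans", channels_spanned(-1.98, -1.80), "channels")
```

Under this assumption a 2000 Da peptide gives a band of 0.02 Da, or two channels, whereas bin 231 covers about 18.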

Otherwise, many of the modifications appear in both lists and are the "usual suspects". Those with a question mark are not discussed or assigned in the paper, although I’m sure the authors must have puzzled over them.

For small mass values, you might hope the mass accuracy would be sufficient to give an elemental composition. Unfortunately, these are mass differences, and the counts of some elements may be negative (e.g. deamidation is H(-1) N(-1) O). There are some nice online tools to find elemental compositions from mass values, such as ChemCalc, but I haven’t found one that can handle negative counts.
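
For very small deltas, a brute-force enumeration that allows negative counts is easy to sketch. The following is an illustration only: the element set, the count limits, the tolerance and the function name are all my assumptions, and there is no check for chemical plausibility.

```python
from itertools import product

# Monoisotopic masses in Da
MONO = {"C": 12.0, "H": 1.0078250, "N": 14.0030740, "O": 15.9949146, "S": 31.9720710}

# Assumed search limits (min, max) per element; negative counts are allowed
DEFAULT_LIMITS = {"C": (-3, 3), "H": (-6, 6), "N": (-2, 2), "O": (-2, 3), "S": (-1, 1)}

def signed_compositions(delta, tol=0.002, limits=DEFAULT_LIMITS):
    """Return signed element counts whose mass is within tol Da of delta,
    best fit first."""
    elements = list(limits)
    ranges = [range(lo, hi + 1) for lo, hi in limits.values()]
    hits = []
    for counts in product(*ranges):
        mass = sum(n * MONO[e] for e, n in zip(elements, counts))
        error = abs(mass - delta)
        if error <= tol:
            hits.append((error, dict(zip(elements, counts))))
    hits.sort(key=lambda hit: hit[0])
    return [composition for _, composition in hits]

for composition in signed_compositions(0.9840):
    print(composition)
```

The deamidation composition is among the hits for 0.9840. The number of candidates grows quickly with mass and with wider limits, so this is only practical for small deltas.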

SNPs are not a reasonable assignment when the matches are scattered across a large number of different sequences, so it seems unlikely that 16.9961 is Asn->Met, even though the mass is a good fit. Highly abundant delta masses that are not in Unimod are more likely to be combinations of common modifications than truly novel moieties. For an assignment to a combination to be credible, the individual modifications need to be even more abundant. On these grounds, Oxidation + ¹³C, with a calculated delta of 16.998, might be a reasonable assignment for 16.9961.

As mentioned, 2.01 is a more representative mass value for bin 237 than 1.9755, in which case, we could assign it as ¹³C₂ and maybe bin 239 is ¹³C₃.
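
The combination argument can be sketched as a search over pairs of common deltas. The mass values below are taken from the error tolerant table above, except for 13C, which is the rounded 13C minus 12C difference. The short list, the 0.005 Da tolerance and the function name are assumptions for illustration.

```python
from itertools import combinations_with_replacement

# Common deltas in Da; 13C is the 13C - 12C mass difference
COMMON = {
    "Oxidation": 15.9949,
    "Carbamyl": 43.0058,
    "Deamidated": 0.9840,
    "13C": 1.0034,
    "Formyl": 27.9949,
    "Phospho": 79.9663,
    "Carbamidomethyl": 57.0214,
}

def pair_candidates(delta, mods=COMMON, tol=0.005):
    """Return (name_a, name_b, summed_mass) for every pair of deltas,
    including a delta paired with itself, that sums to within tol of delta."""
    hits = []
    for a, b in combinations_with_replacement(mods, 2):
        total = mods[a] + mods[b]
        if abs(total - delta) <= tol:
            hits.append((a, b, round(total, 4)))
    return hits

print(pair_candidates(16.9961))  # bin 254
print(pair_candidates(2.01))     # the spike in bin 237
```

With this list, the only pair found for 16.9961 is Oxidation with 13C, and the only pair found for 2.01 is two 13C. For deltas as large as 249.98 or 301.99, the list of modifications and the number of terms would both have to grow, and the number of candidate combinations grows with them.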

This leaves question marks against two of the most abundant delta masses from the open search: 249.98 and 301.99. For such large masses, there are very many possible combinations. Any good suggestions out there? This illustrates a significant drawback of the open search – if you find an abundant modification that isn’t in Unimod, how do you figure out what it is?

Among the less abundant modifications, all those discussed in the paper are in Unimod except an unidentified 72.005 Da modification to N-terminal tryptophan, an unidentified 103.063 Da cysteine modification, and some polyalanine insertions found in ribosomal protein L14. Diphthamide (called diphthalamide in the paper) was in Unimod but with one too many hydrogens (since corrected).

The authors describe how matches in an open search are weaker because only unmodified fragments are used for the match. For spectra that have strong b and y ions, this isn’t a huge problem. For spectra that are mostly y ions, there is a bias against modifications towards the C-terminus, because this takes out most of the potential fragment peak matches. For spectra that are mostly b ions, the bias is against modifications towards the N-terminus. Also, the search engine cannot make use of known neutral loss behaviour, such as loss of 98 from phosphate. On the other hand, modifications that are lost in their entirety on fragmentation, such as glycosylation or sulfation, so that fragments revert to their unmodified masses, should give matches that are just as strong as in a standard search. (Although sulfate is not one of the modifications identified in the open search.)
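
The positional bias is easy to see by counting the fragments that do not contain the modified residue. This is a minimal sketch, assuming a single modification and simple b and y series; the function name is hypothetical.

```python
def unmodified_fragments(length, mod_position):
    """Count the b and y ions that do not contain the modified residue.

    Positions are 1-based from the N-terminus. For a peptide of n residues,
    b_i covers residues 1..i and y_i covers the last i residues.
    """
    if not 1 <= mod_position <= length:
        raise ValueError("mod_position must lie within the peptide")
    b_unmodified = mod_position - 1       # b_1 .. b_(p-1)
    y_unmodified = length - mod_position  # y_1 .. y_(n-p)
    return b_unmodified, y_unmodified

# A 10-residue peptide, modified near each terminus
print(unmodified_fragments(10, 2))  # near the N-terminus
print(unmodified_fragments(10, 9))  # near the C-terminus
```

For a 10-residue peptide modified at position 9, eight b ions but only one y ion keep their unmodified masses, so a spectrum dominated by y ions has almost nothing left to match in an open search.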

In an error tolerant search, although the search is limited to peptides with a single unsuspected modification, the matches are just as strong as if the modification had been specified as a variable modification in a standard search. Modified fragments can be matched and neutral loss information applied.

On balance, it is difficult to see the open search becoming widely used for shotgun proteomics because it requires so much more time and effort for results interpretation compared with a multi-pass search. On the other hand, an important potential application of the open search is not mentioned anywhere in the paper: characterising modifications on endogenous peptides. Multi-pass searches are of limited use in this case because a protein will often be represented by a single peptide. If that peptide is modified, it is very likely to be missed. The open search may provide a more efficient alternative to the current strategy of de novo followed by an error tolerant sequence tag search.

Keywords: error tolerant, mass tolerant, modification, open search, Unimod

2 comments on “Mass-tolerant vs Error tolerant”

Paul Gershon on October 29, 2015 at 02:23 said:

Maybe UniMod can be updated to encompass experimentally-discovered open mods such as 72.005, 103.063, etc., annotated, at least temporarily, as “Unknown” with a reference.

That way, error-tolerant approach can never be a substantially worse approach.

Qingtao Lu on November 26, 2015 at 07:41 said:

Maybe it is a complementary search method, different from a direct search for modifications.

Matrix Science
