Posted by David Creasy (August 17, 2018)

Back to Basics 2: Common mistakes

When a search on our public web site fails, we are often asked for help. These are some of the common mistakes we have seen. Most of the focus here is on Peptide Mass Fingerprint (PMF) searches, which are still very popular on our public web site.

If you are new to database searching, you may find this tutorial helpful.

1. Choice of database and taxonomy

A common and very understandable mistake is to search more databases than necessary. For example, on our public web site, we often see searches against both NCBIprot and SwissProt. Since all of the sequences in SwissProt are also in NCBIprot, this is a waste of time. It can also result in a failure to get a significant match if the spectrum is not of sufficient quality. The Mascot score for a specific match remains constant regardless of the size of the database(s). However, the significance threshold (i.e. the score required for the match to be significant, rather than just a random match) depends on the number of entries in the database. For example:

This PMF search against SwissProt gets a score of 82 for PML_HUMAN, and the significance threshold is 70 (for p<0.05). However, if you repeat the search against SwissProt and NCBIprot, then the significance threshold goes up to 94 and there is no longer a significant match. Since the search space is larger, there are now many more random matches, some with scores above 80. Indeed, a match where the taxonomy is “synthetic construct” now scores 84, which is higher than the ‘correct’ match but still not significant. Without the significance threshold, it’s possible to be convinced that this “synthetic construct” protein is the correct match.
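
The thresholds quoted above are consistent with a threshold of the form -10 * log10(alpha / N), where N is the number of database entries searched and alpha is the significance level. The sketch below assumes that relationship. The function name and the round entry counts are illustrative placeholders, not Mascot's own code or the actual sizes of these databases.

```python
import math

def significance_threshold(num_entries, alpha=0.05):
    """Score a match must reach to be significant at level alpha,
    assuming threshold = -10 * log10(alpha / num_entries)."""
    return -10 * math.log10(alpha / num_entries)

# Illustrative round numbers only, not real database sizes.
examples = [
    ("taxonomy-restricted search", 20_000),
    ("single large database", 500_000),
    ("two databases combined", 125_000_000),
]
for label, n in examples:
    print(f"{label}: {n} entries, threshold ~ {significance_threshold(n):.0f}")
```

Under this assumption, the score of a given match does not change, but every tenfold increase in the number of entries searched raises the bar by 10.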

If your sample is from a species that is well represented in SwissProt, then it generally makes sense to search SwissProt using the appropriate taxonomy and a contaminants database. So in this case, I would have chosen to search SwissProt with human taxonomy and also the contaminants database. To select two databases, use the control key when selecting the second database. The results of the search are here. The score of the PML_HUMAN protein is still 82, but the significance threshold is now 56. If you restrict the search to a specific taxonomy, then it’s important to include the contaminants database because the spectra may turn out to be from a contaminant, such as trypsin, rather than actually from your sample, and it’s normally very useful to see this rather than just get no match.

If your sample is from a species that is not well represented in the databases, then you may need to search NCBIprot. You should also check how many protein sequences there are for the species in NCBI using the NCBI taxonomy browser. For example, if your sample is from Hystricidae (Old World porcupines), you will see that there are only 181 protein sequences. If you have your own Mascot Server, you could add Hystricidae to the taxonomy list and just search those sequences. Alternatively, you could select “Other Rodentia”, but this also only has a few sequences. A better option would be to search Rodentia in the hope that the porcupines have some homologous sequences with their less prickly cousins.

2. Insufficient mass values

A very common error is to submit a PMF search with a single mass value. This could be because the person hasn’t understood that a peptide mass fingerprint relies on digesting a single protein with an enzyme to produce a number of peptides of different masses, which together are effectively a ‘fingerprint’ for that protein. If the instrument has only produced a single mass value, something fundamental has gone wrong. It isn’t possible to get a significant protein match from a single mass value because it can occur any number of times in the database by pure chance.
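
To make the idea of a fingerprint concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P, allows no missed cleavages, and uses standard monoisotopic residue masses. The protein sequence and the function names are made up for illustration and are not part of Mascot.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i == len(sequence) - 1
        if at_end or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

protein = "MAGKRPLSEKTWYR"  # made-up sequence, for illustration only
for pep in tryptic_digest(protein):
    print(pep, round(peptide_mass(pep), 4))
```

Even this short sequence yields several peptides of different masses. It is the combination of masses that identifies a protein, which is why a single value cannot.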

There’s a similar issue with MS-MS searches where an MS-MS spectrum has just one or two fragment peaks. Ideally, there should be at least one peak for each residue, so if there are fewer fragment ions than residues, you cannot expect a high score. In practice, there will often be peaks for both b series and y series ions as well as possibly some neutral losses and other fragments. With a few noise peaks as well, 100 peaks per spectrum will not be unusual.

3. Impossible or low scoring mass values

We have also had failures reported where a peptide mass fingerprint has been submitted with mass values all greater than 5000 Da. Assuming the enzyme was trypsin, there are very few tryptic peptides with such a high mass, so such a search will result in no matches to any protein. In this case, it’s most likely that the sample hasn’t been digested properly. It may be possible in such a case to get a match by specifying a very large number of missed cleavages, but it’s much better to repeat the analysis. At the other end of the scale, mass values for very short peptides contribute little to the score. It is the long peptides, which are unlikely to occur in multiple proteins, that provide the greatest specificity, so aim to get as many peptide masses as possible in the range 1000 to 3500 Da.
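
The checks described in this post can be run on a peak list before submitting a PMF search. The following is a minimal sketch: the function name, the warning texts and the default limits (taken from the figures quoted above) are assumptions for illustration, not part of Mascot.

```python
def check_pmf_masses(masses, low=1000.0, high=3500.0, too_heavy=5000.0):
    """Return a list of warnings for a PMF peak list (hypothetical helper)."""
    warnings = []
    if len(masses) < 2:
        warnings.append("fewer than two mass values: this is not a fingerprint")
    if masses and all(m > too_heavy for m in masses):
        warnings.append(
            f"all values above {too_heavy:.0f} Da: the digest may have failed"
        )
    if masses and not any(low <= m <= high for m in masses):
        warnings.append(f"no values between {low:.0f} and {high:.0f} Da")
    return warnings

# Made-up peak lists, for illustration only.
print(check_pmf_masses([1045.56, 1479.80, 2211.10, 842.51]))
print(check_pmf_masses([5230.7, 6120.3, 7411.9]))
```

An empty list of warnings does not guarantee a match. It only means that none of these particular problems is present.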

For MS-MS searches, fragment masses (m/z * charge) under 50 Daltons or above the precursor mass will not contribute to the score and indicate a problem with the spectrum or peak detection.
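
A similar check can be sketched for MS-MS peak lists. It uses the fragment mass as defined above (m/z * charge) and flags peaks under 50 Da or above the precursor mass. The function name and the input layout, a list of (m/z, charge) pairs, are assumptions for illustration.

```python
def out_of_range_fragments(peaks, precursor_mass, min_mass=50.0):
    """Return the (m/z, charge) pairs whose m/z * charge is below min_mass
    or above the precursor mass (hypothetical helper)."""
    flagged = []
    for mz, charge in peaks:
        mass = mz * charge
        if mass < min_mass or mass > precursor_mass:
            flagged.append((mz, charge))
    return flagged

# Made-up peak list, for illustration only.
peaks = [(30.0, 1), (175.119, 1), (600.3, 2), (1300.0, 1)]
print(out_of_range_fragments(peaks, precursor_mass=1250.0))
```

If many peaks are flagged, it is worth checking the peak detection settings and the charge state assignment before repeating the search.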

Keywords: contaminants, PMF, taxonomy, tutorial


Matrix Science
