Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Peptide Mass Fingerprint search

If you have no time to read this tutorial, these are the most important do’s and don’ts:

  • You cannot search raw data; it must be converted into a peak list.
  • Search parameters are critical and should be determined by running a standard, such as a BSA digest.
  • If you are not sure which database to search, start with Swiss-Prot.
  • If you use a taxonomy filter, or search a single organism database, include a contaminants database in the search.
  • Never specify more than two variable modifications.
  • Always choose a specific enzyme (usually trypsin).
  • A protein hit is only significant (reliable) if it has an expect value below 0.05 (less than a 5% chance of being false).

Peak list

The first requirement for a Peptide Mass Fingerprint (PMF) search is a peak list; you cannot upload a raw data file. Raw data is converted into a peak list by a process called peak picking or peak detection. Often, the instrument data system takes care of this, and you can submit a Mascot search directly from the data system or save a peak list to a disk file for submission using the web browser search form.

If the instrument data system doesn’t provide it, or if you have a raw data file and no access to the data system, you’ll need to find a utility to convert it into a peak list. We recommend Mascot Distiller, which has been designed to work with raw data from any instrument.

Peak lists are text files and come in a variety of formats. You can also copy and paste a list of values into the query area of the search form, or even type them in. Each m/z value goes on a separate line. If you also have an intensity value for the peak, this follows the m/z value, separated by a space or a tab.
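As a rough illustration of the simple pasted layout described above (one peak per line, m/z first, optional intensity after a space or tab), the following sketch reads such text into a list of pairs. The function name parse_peak_list and the sample values are hypothetical and are not part of Mascot.

```python
def parse_peak_list(text):
    """Parse a pasted peak list: one peak per line, m/z then optional intensity."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue  # skip blank lines
        mz = float(fields[0])
        # Intensity is optional; None marks a peak given as m/z only.
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mz, intensity))
    return peaks


# Illustrative values only, not taken from a real spectrum.
example = "927.4934 1520\n1163.6307\t880\n\n1479.7954\n"
for mz, intensity in parse_peak_list(example):
    print(mz, intensity)
```

A check of this kind can be useful before pasting values into the search form, because a stray header line or non-numeric field will raise an error here rather than silently spoiling a search.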

Mass values for very short peptides contribute little to the score. It is the long peptides, which are unlikely to occur in multiple proteins, that provide the greatest specificity, so aim to get as many peptide masses as possible in the range 1000 to 3500 Da. High mass accuracy is good, but sequence coverage is equally important. You will get a better score from 20 mass values at modest accuracy than 5 mass values at very high accuracy.

Search parameters

A peak list, by itself, is not sufficient. There are also a number of search parameters that must be set appropriately. Follow this link to open the search form in a new browser tab. The labels for each control on the search form are also links to help topics. Note that you can set your own defaults for the web browser search form by following the link at the bottom of the Access Mascot Server page.

The form looks much the same whether you have your own Mascot server, in-house, or whether you are connected to the free, public Mascot Server. If you are using the free, public Mascot Server, there are some restrictions, one of which is that you have to provide a name and email address so that we can email a link to your search results if the connection is broken. Whether you enter a search title is your choice. It is displayed at the top of the result report, and can be a useful way of identifying the search at a later date.

Using a standard sample

If at all possible, run a standard sample and use this to set all the search parameters. By standard sample, we mean something like a BSA digest, which will give a strong match and where you know what the answer is supposed to be. Trying to set search parameters on an unknown is much more difficult, and can lead to false positives.

Choosing a sequence database

The first choice you have to make is which database to search. The free public web site has just a few of the more popular public databases, but an in-house server may have a hundred or more. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.

If your target organism is well characterised, such as human, mouse, yeast, or Arabidopsis, Swiss-Prot is the recommended choice. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small, which makes it easier to get a statistically significant match.

Contaminants

If you think you know what is in the sample, you can restrict the search to an organism or family by means of the taxonomy filter, but remember that you can never rule out contaminants. When searching entries for a single organism, always include a database of common contaminants. Otherwise, you might fail to get a match, or you could end up reporting your sample is human serum albumin when it is really BSA.

In the web browser form, to select two databases, first click on your target database then hold down the control key and click on a contaminants database. If the search includes a taxonomy filter, that’s not a problem because taxonomy is not configured for the contaminants databases, so all the entries will always be searched.

Bacteria and plants

If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBIprot and UniRef100. These are very large databases, and you will always want to select a limited taxonomy.

However, never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 27,987 entries for rodentia, but 17,212 are mouse and 8,199 are rat, leaving only 2,576 for other rodents. So, even if your target organism is hamster, it isn’t a good idea to choose ‘other rodentia’. Better to search rodentia and hope to get a match to a homologous protein from mouse or rat.

Enzyme

You must always choose an enzyme for a PMF. The number of allowed missed cleavages should be set empirically, by running a standard and trying different values to see which gives the best score.
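To make the effect of the missed cleavages setting concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the usual trypsin rule (cleave C-terminal to K or R, but not before P) and standard monoisotopic residue masses; the function names and the test sequence are hypothetical, and this is not how Mascot itself is implemented.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R, except before P, allowing up to N missed cleavages."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    peptides = []
    for i in range(len(fragments)):
        # Join up to missed_cleavages additional neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


def mh_plus(peptide):
    """Singly protonated monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


# Hypothetical sequence, chosen only to exercise the cleavage rule.
for allowed in (0, 1):
    peptides = tryptic_peptides("AKRPGKCDR", missed_cleavages=allowed)
    print(allowed, [(p, round(mh_plus(p), 4)) for p in peptides])
```

Each extra missed cleavage adds more candidate peptides per protein, which helps to explain incompletely digested samples but also increases the number of chance matches. This is why the value is best chosen empirically against a standard.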

Fixed and variable modifications

Modifications in database searching are handled in two ways. First, there are the fixed or quantitative modifications. The most common example is the alkylation of cysteine. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity. The most widely used alkylation agents are iodoacetamide (select modification carbamidomethyl), iodoacetic acid (carboxymethyl), and MMTS (methylthio).
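The statement that a fixed modification is effectively just a change in the mass of cysteine can be shown with a few lines of arithmetic. The sketch below assumes the standard monoisotopic mass deltas for the three modifications named above; the dictionary and function names are hypothetical.

```python
CYSTEINE = 103.00919  # monoisotopic residue mass of unmodified cysteine, Da

# Monoisotopic mass deltas in Da for the modifications named in the text.
FIXED_MOD_DELTA = {
    "Carbamidomethyl": 57.02146,  # iodoacetamide
    "Carboxymethyl": 58.00548,    # iodoacetic acid
    "Methylthio": 45.98772,       # MMTS
}


def modified_cysteine_mass(modification):
    """Residue mass used for every cysteine when the fixed modification is set."""
    return CYSTEINE + FIXED_MOD_DELTA[modification]


for name in FIXED_MOD_DELTA:
    print(name, round(modified_cysteine_mass(name), 5))
```

Because every cysteine takes the same new mass, no alternatives have to be tested, which is why a fixed modification costs nothing in search speed or specificity.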

In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to permute out all the possible arrangements of modified and unmodified residues that fit to the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
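The growth in search space can be illustrated by enumerating every arrangement of modified and unmodified sites for a single peptide. This is a sketch only: the peptide is invented, the function name is hypothetical, and the mass deltas are the standard monoisotopic values for oxidation (+15.99491 Da) and phosphorylation (+79.96633 Da).

```python
from itertools import product


def modification_arrangements(peptide, variable_mods):
    """List every arrangement of modified/unmodified sites.

    variable_mods maps a residue letter to its mass delta in Da. Each
    arrangement is a tuple with one delta (0.0 if unmodified) per residue.
    """
    options = []
    for residue in peptide:
        if residue in variable_mods:
            options.append((0.0, variable_mods[residue]))  # off or on
        else:
            options.append((0.0,))  # residue cannot carry the modification
    return list(product(*options))


peptide = "MSTMK"  # hypothetical peptide
oxidation_only = {"M": 15.99491}
with_phospho = {"M": 15.99491, "S": 79.96633, "T": 79.96633}

print(len(modification_arrangements(peptide, oxidation_only)))
print(len(modification_arrangements(peptide, with_phospho)))
```

With n modifiable sites there are 2^n arrangements, so this short peptide goes from 4 arrangements with oxidation alone to 16 once phosphorylation of serine and threonine is added. Across a whole database the same doubling applies to every peptide.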

Post-translational modifications cannot be identified by PMF; that requires MS/MS. The best advice is therefore to use a minimum of variable modifications, or none at all. In most cases, the only variable modification you need to consider is oxidation of methionine. Try searching the data from your standard with and without this modification to see which gives the higher score.

Protein mass

Protein mass is applied as a sliding window. That is, for each database entry, Mascot looks for the highest scoring set of peptide mass matches within a contiguous stretch of sequence less than or equal to the specified protein mass. Usually, this adds little to the score, and the general advice is to leave this field blank.

Mass tolerances

Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot Protein View report includes graphs of mass errors. Just run a standard and look at the error graphs for the correct match. Ignore outliers, which are chance mass matches, then add a safety margin to the spread of the remaining errors; this is your error estimate. You can also use these graphs to decide whether Da or ppm is the best choice for the tolerance unit.
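The difference between the two tolerance units is easiest to see numerically. The following sketch computes a mass error in Da and in ppm and tests it against a tolerance; the function names and the example masses are hypothetical.

```python
def mass_error_da(observed, calculated):
    """Absolute mass error in Da."""
    return observed - calculated


def mass_error_ppm(observed, calculated):
    """Relative mass error in parts per million of the calculated mass."""
    return (observed - calculated) / calculated * 1e6


def within_tolerance(observed, calculated, tolerance, unit="ppm"):
    """True if the error magnitude is inside the tolerance for the given unit."""
    if unit == "ppm":
        error = mass_error_ppm(observed, calculated)
    elif unit == "Da":
        error = mass_error_da(observed, calculated)
    else:
        raise ValueError("unit must be 'ppm' or 'Da'")
    return abs(error) <= tolerance


# The same 0.05 Da error is a larger relative error at low mass.
print(mass_error_ppm(1000.05, 1000.0))
print(mass_error_ppm(3000.05, 3000.0))
```

If the errors for your standard grow in proportion to mass, a ppm tolerance describes them better; if they are roughly constant across the mass range, Da is the more natural unit.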

Charge state

In most cases, PMF data comes from a MALDI experiment, and the mass values are MH+. Your peak list will only contain Mr values (relative molecular mass) if the peak picking software has ‘de-charged’ the measured m/z values, possibly because the data contained a mixture of charge states.
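The relationship between a measured m/z value and Mr is a simple calculation, sketched below under the assumption that every charge is a proton (mass 1.00728 Da). The function name and example values are hypothetical.

```python
PROTON = 1.00728  # mass of a proton, Da


def neutral_mass(mz, charge=1):
    """Convert a measured m/z value to Mr, assuming protonation only."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON)


# MH+ (charge 1) and a doubly charged ion of the same peptide give the same Mr.
print(round(neutral_mass(1001.00728, 1), 4))
print(round(neutral_mass(501.00728, 2), 4))
```

Choosing the wrong setting shifts every mass by about 1 Da per charge, which is usually far outside the mass tolerance, so a mismatch here tends to produce no significant match at all.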

Automatic target-decoy search

Mascot automatically runs a target-decoy search. The decoy search is done against a database in which each protein sequence has been randomised. If you have a score close to the significance threshold and are wondering whether the match is reliable, it can help to see the best score from the randomised, decoy database. If the decoy score is similar to the target score, or higher, treat the match with caution.
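The idea of a randomised decoy entry can be sketched as a shuffle that keeps the length and amino acid composition of the protein but destroys its real sequence. This is an illustration only: the exact randomisation Mascot uses is not described here, and the function name, seed parameter, and example sequence are hypothetical.

```python
import random


def randomised_decoy(sequence, seed=None):
    """Return a shuffled copy of a protein sequence (same length and composition)."""
    residues = list(sequence)
    # A private Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(residues)
    return "".join(residues)


target = "MKWVTFISLLLLFSSAYSRGVFRR"  # hypothetical sequence
decoy = randomised_decoy(target, seed=1)
print(decoy)
```

Because the decoy has the same composition as the target, any score it achieves comes from chance mass matches alone, which is what makes it a useful yardstick for a borderline hit.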