MS/MS Results Interpretation
Other help pages describe the format and content of the various result reports. In particular, refer to Result Report Overview and Summary Reports for MS/MS. This page attempts to explain some of the underlying concepts, especially those relating to protein inference.
In Mascot, the ions score for an MS/MS match is based on the calculated probability, P, that the observed match between the experimental data and the database sequence is a random event. The reported score is -10Log(P), where Log is the base-10 logarithm.
During a search, if 1500 peptides fell within the mass tolerance window about the precursor mass, and the significance threshold was chosen to be 0.05 (a 1 in 20 chance of being a false positive), the corresponding score threshold should be -10Log(1/(20 x 1500)), which is approximately 45. Extensive testing with large target-decoy searches showed this to be too high, and the identity threshold displayed in reports has always had an empirical correction of -13 applied.
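The arithmetic above can be checked with a short sketch. The function name `identity_threshold` and its parameters are illustrative, not part of Mascot; only the numbers 1500, 0.05 and -13 come from the text.

```python
import math

def identity_threshold(n_candidates, significance=0.05, empirical_correction=0.0):
    """Score at which a random match has probability `significance`
    of occurring among `n_candidates` candidate peptides."""
    p = significance / n_candidates          # 1 / (20 x 1500) for the example
    return -10 * math.log10(p) + empirical_correction

raw = identity_threshold(1500)                                   # uncorrected
corrected = identity_threshold(1500, empirical_correction=-13)   # as reported
print(round(raw), round(corrected))   # prints: 45 32
```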
If the quality of an MS/MS spectrum is poor, particularly if the signal to noise ratio is low, a match to the "correct" sequence might not exceed this absolute score threshold. Even so, the best match could have a relatively high score, which is well separated from the distribution of 1500 random scores. In other words, the score is an outlier. This would indicate that the match is not a random event and, if tested using a method such as a target-decoy search, such matches can be shown to be reliable. For this reason, Mascot also attempts to characterise the distribution of random scores, and provide a second, lower threshold to highlight the presence of any outlier. The lower, relative threshold is reported as the homology threshold while the higher threshold is reported as the identity threshold.
The identity threshold is still useful because it is not always possible to estimate a homology threshold. If the instrument accuracy is very high or the database is very small, there may only be a small handful of candidate sequences, so that it is not possible to say whether a match is an outlier.
For a search of at least 1000 spectra, where an automatic decoy search was used, you can choose to process the Mascot scores through Percolator. This uses machine learning to re-rank the matches, so as to obtain an optimum false discovery rate. The revised probabilities are converted to scores for reporting purposes, together with a single score threshold to indicate significance.
The protein score in the result report from an MS/MS search is derived from the ions scores. For a search that contains a small number of queries, the protein score is the sum of the highest ions score for each distinct sequence; the scores of duplicate matches, which are shown in parentheses, are excluded. A small correction is applied to reduce the contribution of low-scoring random matches. This correction is a function of the total number of molecular mass matches for each query and is usually very small, except in no enzyme searches.
This protein score works well for small searches, and provides a logical order to the report. If multiple queries match to a single protein, but the individual ions scores are below threshold, the combined ions scores can still place the protein high in the report. However, the standard protein score is less satisfactory for searches with very large numbers of queries, such as MudPIT data sets. For each MS/MS query, Mascot retains up to 10 peptide matches. When the number of queries is comparable with the number of entries in the database, this means that there can be random, low-scoring matches for every entry. Although the average number of random matches per entry might be low, the actual number will follow a distribution, and some entries will have large numbers of low scoring matches, leading to large protein scores.
While it is obvious from a detailed study of the report that these are meaningless matches, it would be better to eliminate them entirely. So, if the ratio between the number of queries and the number of entries in the database exceeds a pre-determined threshold, the basis for calculating the protein score is changed. Only those ions scores that exceed one or both significance thresholds contribute to the score, so that low scoring, random matches have no effect. This gives a much cleaner report for a large scale search. This threshold is 0.001 by default, and can be changed on a global basis in the configuration file, mascot.dat, or changed for a single report by using the format controls at the top of the report. Note that, when calculating this threshold, if a taxonomy filter is being used, the number of entries in the database is the number remaining after the taxonomy filter.
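As a rough illustration of the two scoring modes, the sketch below sums the best ions score per distinct sequence and switches to threshold-filtered scoring when the ratio of queries to entries exceeds the switch value. It makes several assumptions: the function and argument names are invented for this example, each match carries a single significance threshold, duplicate sequences are still collapsed in the large-search mode, and the small correction for random matches is omitted because its form is not given here.

```python
def protein_score(matches, n_queries, n_entries, mudpit_switch=0.001):
    """matches: list of (peptide_sequence, ions_score, threshold) for one protein.

    n_entries should be the entry count after any taxonomy filter.
    """
    use_mudpit = n_queries / n_entries > mudpit_switch
    best = {}
    for sequence, score, threshold in matches:
        # In the large-search mode, matches below threshold contribute nothing.
        if use_mudpit and score < threshold:
            continue
        # Keep only the highest score for each distinct sequence.
        if sequence not in best or score > best[sequence]:
            best[sequence] = score
    return sum(best.values())

matches = [
    ("PEPTIDEK", 55, 40),   # significant match
    ("PEPTIDEK", 30, 40),   # duplicate of the same sequence, lower score
    ("SAMPLER", 20, 40),    # low-scoring, possibly random
    ("ANOTHERK", 15, 40),   # low-scoring, possibly random
]
print(protein_score(matches, n_queries=100, n_entries=500_000))     # standard scoring
print(protein_score(matches, n_queries=50_000, n_entries=500_000))  # large-search scoring
```

With these made-up numbers, standard scoring gives 90 while the large-search mode gives 55, because the two low-scoring matches no longer contribute.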
When MS/MS spectra are searched against a sequence database, we are matching peptides, not proteins. In most cases, the matched peptides will not be unique to a single protein. Yet, we usually want to know which proteins were present in the sample. So, we are faced with the challenge of protein inference: given a set of peptide matches, which proteins do we believe were present in the sample?
The usual approach is based on the "Principle of Parsimony". We report the minimum set of proteins that account for the observed peptide matches. If we had four peptide matches, two of which occurred in protein A and two in protein B but all four were found in protein C, we would report that protein C had been identified. Proteins A and B might be listed as "sub-set" proteins. It is perfectly possible that our sample actually contained a mixture of proteins A and B, but there is no evidence for this.
The Peptide Summary and Select Summary use a very simple algorithm. First, we take the protein with the highest protein score, and call this hit number 1. We then take all other proteins that share the same set of peptide matches or a sub-set and include these in the same hit. In the report, they are listed as same-set and sub-set proteins. With these proteins removed from the list, we now take the remaining protein with the highest score and repeat the process until all the significant peptide matches are accounted for.
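A minimal sketch of this grouping, using the proteins A, B and C from the parsimony example, might look as follows. The data structure (a dict of protein score and peptide set) and the function name `group_hits` are assumptions made for illustration; this is not Mascot's implementation.

```python
def group_hits(proteins):
    """proteins: dict name -> (protein_score, set of significant peptide matches)."""
    remaining = dict(proteins)
    hits = []
    while remaining:
        # The highest scoring remaining protein becomes the next hit.
        anchor = max(remaining, key=lambda name: remaining[name][0])
        anchor_peptides = remaining[anchor][1]
        same_set, sub_set = [], []
        for name, (_, peptides) in remaining.items():
            if name == anchor:
                continue
            if peptides == anchor_peptides:
                same_set.append(name)
            elif peptides < anchor_peptides:   # proper sub-set
                sub_set.append(name)
        hits.append({"anchor": anchor,
                     "same_set": sorted(same_set),
                     "sub_set": sorted(sub_set)})
        for name in [anchor, *same_set, *sub_set]:
            del remaining[name]
    return hits

proteins = {
    "A": (100, {"p1", "p2"}),
    "B": (100, {"p3", "p4"}),
    "C": (200, {"p1", "p2", "p3", "p4"}),
}
print(group_hits(proteins))
# [{'anchor': 'C', 'same_set': [], 'sub_set': ['A', 'B']}]
```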
This sounds simple enough, and works well for small datasets, but larger search results create difficulties:
- What if two proteins have many strong matches in common but one has an additional weak match? Should we treat one as the outright winner, and relegate the other to the status of sub-set?
- What if we have intersections? That is, the protein is not a sub-set of any other one protein, but all the matches can be found in a set of proteins, each of which has additional matches.
- In many cases, the exact sequence of the protein that was analysed is not in the database. All the peptide sequences are present, but spread across several homologous proteins, which might be splice variants or represent different combinations of SNPs.
The Protein Family Summary tries to address these difficulties by clustering proteins into families. The algorithm works as follows:
1. Create a list of proteins, ordered by protein score
2. Take the highest scoring protein
3. Find all the family members for this protein:
   - select all matches with a score at or above the homology threshold
   - for each match, select all the other proteins that contain this match (using the score as a test to include matches that are identical matches though not identical sequences, e.g. I to L substitution or other differences that have no impact on the score)
   - for each new protein, select all new matches with a score at or above the homology threshold
   - loop until all related proteins and matches have been found
   Note that this grouping into families is based on significant matches. Non-significant matches are ignored.
4. Report this family as a single hit. All these proteins can be removed from the list
5. For each protein in the family, make a list of the distinct peptide sequences. That is, ignore differences in score, modifications, charge, etc. Where there are duplicate matches, use the highest score
6. Divide and group the proteins into same-set proteins and sub-set proteins; sub-sets include intersections
   - Where there are same-set proteins, collapse into a single family member
   - Move any proteins that are sub-sets or intersections to the sub-sets list
7. Perform hierarchical clustering on the family members, using the score excess over threshold of the non-shared matches as the distance metric
8. Loop from step 2 until no more proteins remain that contain matches with homology score or better
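The family-finding part of the procedure (taking the highest scoring protein, collecting everything linked to it through significant matches, and repeating) amounts to finding connected groups of proteins. The sketch below illustrates only that part, under stated assumptions: the function name and inputs are hypothetical, every protein appears in both input dicts, and each match is represented by an opaque identifier (for instance a query number paired with a score) that has already been filtered to the homology threshold or better.

```python
from collections import defaultdict

def build_families(protein_scores, significant_matches):
    """protein_scores: dict protein -> protein score
    significant_matches: dict protein -> set of significant match ids"""
    proteins_by_match = defaultdict(set)
    for protein, matches in significant_matches.items():
        for match in matches:
            proteins_by_match[match].add(protein)

    # Proteins without any significant match never start or join a family.
    unassigned = {p for p, m in significant_matches.items() if m}
    families = []
    for seed in sorted(protein_scores, key=lambda p: protein_scores[p], reverse=True):
        if seed not in unassigned:
            continue
        family, frontier = {seed}, [seed]
        while frontier:                       # loop until nothing new is found
            protein = frontier.pop()
            for match in significant_matches[protein]:
                for other in proteins_by_match[match]:
                    if other not in family:
                        family.add(other)
                        frontier.append(other)
        unassigned -= family                  # remove the family from the list
        families.append(sorted(family, key=lambda p: protein_scores[p], reverse=True))
    return families

scores = {"A": 120, "B": 90, "C": 40, "D": 70, "E": 10}
matches = {"A": {"m1", "m2"}, "B": {"m2", "m3"}, "C": {"m3"}, "D": {"m4"}, "E": set()}
print(build_families(scores, matches))   # [['A', 'B', 'C'], ['D']]
```

In this toy data, A and C end up in the same family even though they share no match directly; they are linked through B.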
The goal is to present the possible protein assignments clearly, so that someone with knowledge of the biology can make an informed decision as to which proteins are present. In most cases, there will be some ambiguity about precisely which proteins are present. As mentioned earlier, the exact sequence of an analyte may not be in the database, and peptide matches may be distributed across multiple, homologous database entries. If it is essential to characterise the complete protein sequence, or to choose between splice variants, or to confirm a SNP, it is likely that additional, targeted experiments will be required.
To cluster proteins into families, we use the score of the non-shared matches as the distance between two proteins. More precisely, we use the score excess over the significance threshold, since a score below significance threshold could be random, and should not be taken as evidence for two different proteins being present. This means that matches below threshold play no part in the clustering process. Each distinct peptide sequence is represented once by the match with the highest score. Matches to the same sequence with different charge states or with different modifications are considered duplicates.
If two proteins have the same set of peptide matches, the distance between them is zero. If they have just a single shared match, the distance between them is the sum of the score excesses of all the non-shared matches in one protein, since discarding these would make the protein a sub-set of the other, based on the single shared match.
There are some subtleties to this procedure. Consider the case of two proteins which have different peptide matches to the same query with the same score. Only one of these matches can be correct, but we don’t know which. One obvious example is where the two sequences differ only in exchange of I and L. In terms of the mass spectrum, these sequences are identical. Unless the mass accuracy is high, the same is true for exchange of Q and K or F and oxidised M. Clearly, a sequence containing F at a particular position is very different, in biological terms, from one containing M at the same position. But, if the scores are the same, there is simply no evidence from the mass spectrometry data for two proteins. In terms of a distance matrix, we must treat it as if there was no match to either peptide.
Now, consider the case where we have two proteins with different peptide matches to the same query and the scores are not the same. Assume the threshold is 40 and one has a score of 50 and the other has a score of 60. Again, only one of these matches can be correct; it is not the same as if they were independent matches to different queries. Extending the logic that matches to the same query with the same score correspond to a distance of zero, matches to the same query with different scores correspond to a distance that is the score difference. In this example, the distance would be 10. If the two matches came from different queries, and could be treated independently, the distance would be (60 – 40) + (50 – 40) = 30.
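The worked numbers can be reproduced with a small sketch. It is an interpretation rather than Mascot's code: the names are hypothetical, a single threshold is used for all matches, each protein's matches are assumed to be already reduced to one entry per distinct sequence and keyed by query, and the excesses of non-shared matches from both proteins are summed, as in the arithmetic of the example.

```python
def score_excess(score, threshold):
    """Score excess over threshold; matches below threshold count as zero."""
    return max(0.0, score - threshold)

def protein_distance(a, b, threshold):
    """a, b: dict mapping query id -> (peptide sequence, ions score)."""
    distance = 0.0
    for query in a.keys() | b.keys():
        in_a, in_b = a.get(query), b.get(query)
        if in_a and in_b:
            if in_a[0] != in_b[0]:
                # Different peptides for the same query: only one can be
                # correct, so the distance is the score difference.
                distance += abs(score_excess(in_a[1], threshold)
                                - score_excess(in_b[1], threshold))
            # Same peptide for the same query is a shared match: distance 0.
        else:
            _, score = in_a or in_b
            distance += score_excess(score, threshold)   # non-shared match
    return distance

# Same query, different peptides, scores 60 and 50, threshold 40
print(protein_distance({1: ("PEPTIDEFK", 60)}, {1: ("PEPTIDEMK", 50)}, 40))  # 10.0
# Independent matches from different queries
print(protein_distance({1: ("PEPTIDEFK", 60)}, {2: ("SAMPLEMK", 50)}, 40))   # 30.0
# Same query, I/L exchange, equal scores: no evidence for two proteins
print(protein_distance({1: ("PEPTIDEIK", 55)}, {1: ("PEPTIDELK", 55)}, 40))  # 0.0
```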
To create the dendrogram, we first compute a distance matrix, which holds the distance between each pair of proteins. The two proteins separated by the smallest distance are joined to create a node, and the length of the branches from the node is the score distance between the proteins. The two joined proteins are removed from the list and replaced by the node, and the distances between the new node and all other remaining proteins (or nodes) are computed. The process is repeated until only one node remains.
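The joining procedure can be sketched as a simple agglomerative loop. How the distance from a new node to the remaining proteins is computed is not stated here, so the sketch assumes the smaller of the two children's distances (single linkage) and lets the caller swap in another rule. The function name and the frozenset-keyed distance table are likewise illustrative choices.

```python
def build_dendrogram(names, distance, linkage=min):
    """names: list of protein names
    distance: dict mapping frozenset({a, b}) -> distance between a and b
    Returns a nested tuple (left, right, join_distance)."""
    nodes = list(names)
    dist = dict(distance)
    while len(nodes) > 1:
        # Find the closest pair among the current proteins and nodes.
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        merged = (a, b, dist[frozenset((a, b))])
        nodes = [n for n in nodes if n not in (a, b)]
        # Distance from the new node to everything else (assumed rule).
        for other in nodes:
            dist[frozenset((merged, other))] = linkage(
                dist[frozenset((a, other))], dist[frozenset((b, other))]
            )
        nodes.append(merged)
    return nodes[0]

distance = {
    frozenset(("A", "B")): 0,    # same set of matches
    frozenset(("A", "C")): 30,
    frozenset(("B", "C")): 25,
}
print(build_dendrogram(["A", "B", "C"], distance))
# ('C', ('A', 'B', 0), 25)
```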
When the dendrogram (or tree) is drawn, the order is chosen to avoid any branches crossing. There is no other significance to the order of the branches, and there are many possible ways to order the branches so as to avoid crossings. In the tabular part of the report, proteins are sorted in order of decreasing score, and this will often be different from the dendrogram order.
Note that, if you select a pair of family members from a large family, it is perfectly possible that they will have no shared matches. Each family member will have shared matches with at least one other family member, or they would not have been grouped into the same family, but this doesn’t mean that there are going to be shared matches between every pair.
The Preferred Taxonomy format control allows you to specify a preferred taxonomy for the anchor protein in cases where there is a choice of indistinguishable proteins.
Imagine we are studying dormice, which are not well represented in any protein database. We choose the broader taxonomy of Rodentia so that we can get matches to homologous proteins from other rodents. But, if a hit contains same-set matches to proteins from rat, mouse and dormouse, we can ensure the dormouse entry will be selected as the anchor protein by specifying Gliridae as the Preferred Taxonomy.
Another situation where Preferred Taxonomy can come in useful is for a database like NCBI nr, where each entry represents multiple proteins. By default, it is always the first protein in the title line that is selected as anchor protein. You might search with a taxonomy filter of dog and pull out an entry for a protein that was found in both cat and dog and happened to have cat listed first. Setting a Preferred Taxonomy of dog will ensure the dog accession and description are selected for display in such cases.
In Mascot 2.4, the additional taxonomy information required for this function is saved in the result file, and the preferred taxonomy control will always be available for new searches of databases for which taxonomy is defined. If the result file comes from Mascot 2.3 or earlier, the databases that were used in the search need to be online. Otherwise the control will be hidden, because there would be no way to retrieve the required taxonomy information.
Finally, note that the default taxonomy list shipped with Mascot is limited to a small number of well characterised organisms, and this doesn’t include either cat or dog. So, for the second example, you would need to edit the file called taxonomy in the Mascot config directory to add the required entries. For example, the categories under mammals in the default file might look like this:

Title:. . . . . . . . . . . . Mammalia (mammals)
Include: 40674
Exclude:
*
Title:. . . . . . . . . . . . . . Primates
Include: 9443
Exclude:
*
Title:. . . . . . . . . . . . . . . . Homo sapiens (human)
Include: 9606
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other primates
Include: 9443
Exclude: 9606
*
Title:. . . . . . . . . . . . . . Rodentia (Rodents)
Include: 9989
Exclude:
*
Title:. . . . . . . . . . . . . . . . Mus.
Include: 10088
Exclude:
*
Title:. . . . . . . . . . . . . . . . . . Mus musculus (house mouse)
Include: 10090
Exclude:
*
Title:. . . . . . . . . . . . . . . . Rattus
Include: 10114
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989
*
To add dog to the list of choices, insert a Canis familiaris entry before Other mammalia and add 9615 to the Exclude list of Other mammalia:

Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Canis familiaris
Include: 9615
Exclude:
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989, 9615
*
The NCBI Taxonomy Browser is invaluable for looking up TaxID codes and finding where a particular organism fits into the tree of life. It also lists the number of entries in GenBank for each taxonomy, which is a useful way to discover whether a particular taxonomy might be too narrow. Never choose a taxonomy that has fewer than two thousand proteins; move to a higher level so as to search a reasonable number of entries.
The Protein Family Summary is expressly designed for large search results. Because it is a paged report that initially displays only the first ten families, it will usually succeed on a 32-bit platform and always on a 64-bit platform. If, for some reason, you need to view results using the earlier Select Summary report, this section contains some tips.
The format controls near the top of the report can help streamline the results from a large search by eliminating most of the "junk". If the report is too large to open in the first place, these options can also be specified by adding URL switches to the report URL.
- View the report on a client with plenty of free physical RAM. Do not try to view the report in a browser running on the Mascot server
Select Summary: Ensure you are using the Select Summary. If you are using a third party client that has specified Peptide Summary, override the report type with the URL switch given next.
Add this to the URL before opening the file: &REPTYPE=select
Don’t specify too many hits: Use AUTO to report only protein hits that contain significant peptide matches
Add this to the URL before opening the file: &REPORT=AUTO
MudPIT Protein Scoring: By default, large searches will switch to using more aggressive protein scoring. This removes many of the junk protein hits, which have high protein scores but no high scoring peptide matches. Do not be tempted to switch back to standard scoring.
Add this to the URL before opening the file: &_server_mudpit_switch=0.000000001
Require Bold Red: The Select Summary report does not detect intersections. Red and bold typefaces are used to highlight the most logical assignment of peptides to proteins. The first time a peptide match to a query appears in the report, it is shown in bold face. Whenever the top ranking peptide match appears, it is shown in red. Thus, a bold red match is the highest scoring match to a particular query listed under the highest scoring protein containing that match. This means that protein hits with many peptide matches that are both bold and red are the most likely assignments. Conversely, a protein that does not contain any bold red matches is an intersection of proteins listed higher in the report.
Requiring a protein hit to include at least one bold red peptide match is a good way to filter homologous proteins from a report. The down-side is that you may sometimes throw out the wrong protein! For example, imagine you are searching with a taxonomy of mammals but are mainly interested in yeti proteins. If the same strong peptide matches are found in a yeti protein and also in the human homolog, and one or more junk peptide matches prevent the two proteins collapsing into a single hit, but give the human protein a slightly higher score, that is the one that will feature in the report.
Add this to the URL before opening the file: &_requireboldred=1
Ignore Ions Score Below: You can minimise the previous problem by judicious use of the Ions score cut-off field. By setting this to a value of 1 or more, you filter out all of the matches with lower scores. When set to a value between 0 and 1, it becomes an expect value cut-off, filtering out matches with higher expect values. Removing random matches means that homologous proteins are more likely to collapse into a single hit. (Note that this control is not displayed by default. The default control is a checkbox labelled Display non-significant matches which should be left unchecked. For more information, search the Installation & setup manual for DisplayNonSignificantMatches.)
Add this to the URL before opening the file: &_ignoreionsscorebelow=0.5
Add this to the URL before opening the file: &_showpopups=FALSE
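Several switches can be combined on one report URL. The sketch below simply appends them to an existing URL that already carries a query string. The base URL is a placeholder, not a real Mascot path, and the helper name is hypothetical; the switch names and values are the ones listed above. Values are passed as strings so that a number such as 0.000000001 is not rewritten in exponent notation.

```python
def add_report_switches(report_url, switches):
    """Append report switches to a URL that already has a query string."""
    extra = "".join(f"&{name}={value}" for name, value in switches.items())
    return report_url + extra

# Placeholder URL: substitute the report URL produced by your own server.
base_url = "http://mascot.example.org/report?file=F000001.dat"

switches = {
    "REPTYPE": "select",
    "REPORT": "AUTO",
    "_server_mudpit_switch": "0.000000001",
    "_requireboldred": "1",
    "_ignoreionsscorebelow": "0.5",
    "_showpopups": "FALSE",
}
print(add_report_switches(base_url, switches))
```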