Peptide match duplicates

Warning – this is a complex issue that only applies to peptide summary reports, and only needs to be understood by advanced users. The default flag MSRES_DUPE_DEFAULT is suitable for most use cases.

Duplicate peptides arise from data

In a typical results file derived from an LC-MS-MS data set, the same peptide will appear multiple times. In a perfect system, identical peptides would rarely be seen. Ideally, peptides would be separated perfectly by chromatography, and even if that failed, peak detection software would combine similar peptides before the data were submitted to Mascot Server.

Back to reality... The standard Mascot Server reports display most 'duplicate' peptides, but the scores are shown with brackets (parentheses) to indicate that these peptides don't affect the total protein score. Mascot Parser has full flexibility to customise the treatment of such 'duplicate' peptides.

How Mascot Parser detects duplicate peptides

In peptide summary reports, each candidate protein contains one or more peptides. Some of these peptides will be duplicates of others. Duplicates generally do not add to the confidence of the protein match, so they can optionally be ignored.

There are four possible situations that will cause Mascot Parser to mark a peptide as a duplicate:

Same query number. For example, if a protein contains the two peptides ABCDIK and BACIDK, and query 3 matches both of these (possibly with different scores), then the two peptides are duplicates of each other. Alternatively, the same query could match the same peptide twice but with different modifications.
Same peptide sequence. The peptide matches may or may not be from the same query. Also, they may or may not have the same start and end position in case there are repeated peptides in the protein.

If the match is a crosslinked match, it can only be marked as a duplicate of another crosslinked match whose alpha and beta sequences are the same and alpha and beta peptides are in the same protein (an intralinked match).
Same modifications. If the peptide sequences are different, then (to reduce permutations) the modifications are defined as being different.
Same start and end positions in the protein sequence. Normally caused by repeats in the sequence. If the peptide sequences are different, then the start and end positions are defined as being different. (This is necessary because for a Unigene entry, the start and end positions could be the same for different peptides. Each Unigene entry is made up from different EST fragments.)

If the match is a crosslinked match, and alpha and beta are in different proteins (an interlinked match), then their positions are always treated as different.

Each of the four criteria above can be 'same' or 'different', which gives 16 combinations. However, because of the definitions described above, only 9 combinations are possible:

Query	Sequence	Modifications	Position	`Rule_ID`
same	same	same	different	`A`
same	same	different	same	`B`
same	same	different	different	`C`
same	different	different	different	`D`
different	same	same	same	`E`
different	same	same	different	`F`
different	same	different	same	`G`
different	same	different	different	`H`
different	different	different	different	`I`

Use of the 'Rule ID' column is described below.
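The table can be read as a small decision procedure. The following is a minimal sketch of that reading, not Mascot Parser's implementation. The PeptideMatch struct and the duplicateRuleId() function are hypothetical names introduced for illustration, and crosslinked matches are ignored.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for one peptide match within a protein hit.
// The field names are illustrative and are not Mascot Parser types.
struct PeptideMatch {
    int query;             // query number
    std::string sequence;  // peptide sequence
    std::string mods;      // modifications, in any consistent encoding
    int start;             // start position in the protein
    int end;               // end position in the protein
};

// Returns the rule ID ('A'..'I') from the table for a pair of matches,
// or '\0' when all four properties are the same (no row in the table).
char duplicateRuleId(const PeptideMatch& a, const PeptideMatch& b) {
    assert(a.start <= a.end && b.start <= b.end);  // positions must be ordered

    const bool sameQuery = a.query == b.query;
    const bool sameSeq = a.sequence == b.sequence;
    // Different sequences are defined as having different modifications
    // and different positions, whatever the stored values are.
    const bool sameMods = sameSeq && a.mods == b.mods;
    const bool samePos = sameSeq && a.start == b.start && a.end == b.end;

    if (!sameSeq) return sameQuery ? 'D' : 'I';
    if (sameQuery) {
        if (sameMods) return samePos ? '\0' : 'A';
        return samePos ? 'B' : 'C';
    }
    if (sameMods) return samePos ? 'E' : 'F';
    return samePos ? 'G' : 'H';
}
```

For the example given earlier, query 3 matching both ABCDIK and BACIDK in the same protein falls under rule D in this sketch.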

When the rules are applied

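There are two groups of flags that control how these rules are applied to the ms_mascotresults object. In each flag name, [RULE_ID] stands for a rule ID from the table above, for example MSRES_DUPE_REMOVE_A: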

MSRES_DUPE_REMOVE_[RULE_ID]

Each protein initially contains all possible matching peptides. If this flag is given to the constructor, duplicate peptides are then removed according to the specified rules.

A removed peptide will never be considered to be part of that protein – so, for example, it won't be included in the score. In rare cases, this will also affect how proteins are grouped together. (See Grouping proteins together). If no duplicates are ever to be removed, specify MSRES_DUPE_REMOVE_NONE.

Crosslinked matches are subject to duplicate detection. Intralinked matches (alpha and beta in same protein) are removed according to the specified MSRES_DUPE_REMOVE_[RULE_ID]. Interlinked matches (alpha and beta in different proteins) may be flagged as duplicates but cannot be removed using any of the existing rule IDs. If you can think of a use case for suppressing or removing duplicate interlinked matches, please contact us at support@matrixscience.com!

MSRES_DUPE_INCL_IN_SCORE_[RULE_ID]

When calculating the protein score, it may be desirable to include some duplicates in the score. If this flag is given to the constructor, duplicates matching the rule will be included in protein scoring. These flags apply to standard Mascot protein scoring and not to MudPIT scoring.

For the standard Mascot Server reports, no duplicates are included when calculating the protein score. (And duplicates are shown in brackets.) Any peptides removed by MSRES_DUPE_REMOVE_[RULE_ID] have already been removed before protein scoring is performed, so it is pointless to try and override, for example, MSRES_DUPE_REMOVE_D by using MSRES_DUPE_INCL_IN_SCORE_D.
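The order of the two stages explains why an include flag cannot bring back a removed peptide. The sketch below models that order under stated assumptions: ScoredMatch, removeDuplicates() and sketchProteinScore() are hypothetical names, rule IDs are passed as plain letters rather than as Parser flags, and a plain sum stands in for the real protein score calculation.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical model of one peptide match in a protein hit.
struct ScoredMatch {
    double score;   // score of the match
    char dupeRule;  // '\0' if not a duplicate, otherwise rule ID 'A'..'I'
};

// Stage 1: discard duplicates whose rule ID is in removeRules.
std::vector<ScoredMatch> removeDuplicates(const std::vector<ScoredMatch>& matches,
                                          const std::set<char>& removeRules) {
    std::vector<ScoredMatch> kept;
    std::copy_if(matches.begin(), matches.end(), std::back_inserter(kept),
                 [&](const ScoredMatch& m) { return removeRules.count(m.dupeRule) == 0; });
    return kept;
}

// Stage 2: a plain sum, used here as a stand-in for the real protein score.
// Non-duplicates always count; duplicates count only if their rule is included.
double sketchProteinScore(const std::vector<ScoredMatch>& matches,
                          const std::set<char>& inclRules) {
    double total = 0.0;
    for (const ScoredMatch& m : matches) {
        assert(m.score >= 0.0);  // this sketch assumes non-negative scores
        if (m.dupeRule == '\0' || inclRules.count(m.dupeRule) != 0) total += m.score;
    }
    return total;
}
```

With removal rules A and D applied first, asking stage 2 to include rule D has no effect in this model, because no rule D duplicate is left to include.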

When displaying or storing a list of peptides, it may be desirable to inhibit the display of some duplicate peptides. ms_protein::getPeptideDuplicate() can be used to that end. Note that peptides that have been removed due to the MSRES_DUPE_REMOVE_[RULE_ID] setting need never be specifically inhibited by client code, since they will not be present anyway.

ms_peptidesummary::getProteinsWithThisPepMatch() returns a list of all proteins that contain this peptide. The list only includes proteins from which the peptide was not discarded due to the MSRES_DUPE_REMOVE_[RULE_ID] setting. There is currently no other override for this function.

Rules for standard Mascot reports

For Mascot 1.9 and Mascot 2.0, the default for MSRES_DUPE_REMOVE_[RULE_ID] is MSRES_DUPE_REMOVE_A | MSRES_DUPE_REMOVE_D. This means that peptides with

same query, same sequence, same mods, different position

or

same query, different sequence, different mods, different position

are not included in any proteins.

No duplicates are ever added into the score, so MSRES_DUPE_INCL_IN_SCORE_NONE needs to be specified. MSRES_DUPE_DEFAULT is defined as MSRES_DUPE_REMOVE_A | MSRES_DUPE_REMOVE_D | MSRES_DUPE_INCL_IN_SCORE_NONE.

Compatibility with previous versions

This functionality was introduced in Mascot Parser 1.2. No changes are required to client code that was used before Mascot Parser 1.2:

If no MSRES_DUPE_REMOVE_[RULE_ID] flags are supplied, then the default MSRES_DUPE_REMOVE_A | MSRES_DUPE_REMOVE_D is assumed. (In the unlikely event that it is required that no duplicates are ever to be removed, then MSRES_DUPE_REMOVE_NONE must be specified.)
The default is to never add duplicates to the protein score, so MSRES_DUPE_INCL_IN_SCORE_NONE needs to be specified. However, since this is defined as '0', it does not need to be passed to the ms_peptidesummary constructor.
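The two compatibility rules above can be sketched with bit flags. The constants below are stand-ins invented for this example and are not the values in the Mascot Parser headers. The only value taken from the text is that the "include none" flag is 0. That the "remove none" flag is nonzero is an assumption, inferred from the requirement that it must be specified explicitly. effectiveRemoveRules() is a hypothetical helper.

```cpp
#include <cstdint>

// Stand-in bit values for this sketch only. They are NOT the values from the
// Mascot Parser headers, apart from INCL_IN_SCORE_NONE, which the text says is 0.
constexpr std::uint32_t SKETCH_INCL_IN_SCORE_NONE = 0x0;
constexpr std::uint32_t SKETCH_REMOVE_A = 0x1;
constexpr std::uint32_t SKETCH_REMOVE_D = 0x2;
// Assumed to be nonzero so that it differs from "no removal flag supplied".
constexpr std::uint32_t SKETCH_REMOVE_NONE = 0x4;
constexpr std::uint32_t SKETCH_REMOVE_MASK =
    SKETCH_REMOVE_A | SKETCH_REMOVE_D | SKETCH_REMOVE_NONE;
constexpr std::uint32_t SKETCH_DEFAULT =
    SKETCH_REMOVE_A | SKETCH_REMOVE_D | SKETCH_INCL_IN_SCORE_NONE;

// OR-ing in a zero-valued flag changes nothing, so it can be omitted.
static_assert(SKETCH_DEFAULT == (SKETCH_REMOVE_A | SKETCH_REMOVE_D),
              "a zero-valued flag is a no-op");

// Resolves the removal rules as the compatibility notes describe:
// no removal flag means the default, an explicit NONE means no removal.
constexpr std::uint32_t effectiveRemoveRules(std::uint32_t flags) {
    return (flags & SKETCH_REMOVE_MASK) == 0 ? (SKETCH_REMOVE_A | SKETCH_REMOVE_D)
         : (flags & SKETCH_REMOVE_NONE) != 0 ? 0u
         : (flags & SKETCH_REMOVE_MASK);
}
```

In this model, client code that passes no duplicate flags at all resolves to the same removal rules as the default combination.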

Earlier client code would typically check the following before displaying peptides or adding them to a database:

    prot->getPeptideDuplicate(i) != ms_protein::DUPE_DuplicateSameQuery

This test is still required since generally peptides with the same query, same sequence and different modifications are not shown in the report, but would be seen in the yellow popup.
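A loop around that test might look like the sketch below. It runs against a mock rather than the real ms_protein class, so everything except the method name getPeptideDuplicate() and the idea of a same-query duplicate status is an assumption: SketchDupe, MockProtein, getNumPeptides() as used here, displayablePeptides() and the 1-based peptide numbering are all introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the duplicate status returned by getPeptideDuplicate().
// Only the same-query status is named in the text; the others are assumed.
enum class SketchDupe { NotDuplicate, Duplicate, DuplicateSameQuery };

// Hypothetical mock of a protein hit. Peptide numbers are 1-based here,
// which is an assumption about how client code indexes peptides.
struct MockProtein {
    std::vector<SketchDupe> status;

    int getNumPeptides() const { return static_cast<int>(status.size()); }

    SketchDupe getPeptideDuplicate(int i) const {
        assert(i >= 1 && i <= getNumPeptides());
        return status[static_cast<std::size_t>(i - 1)];
    }
};

// Returns the peptide numbers that pass the same-query duplicate test.
std::vector<int> displayablePeptides(const MockProtein& prot) {
    std::vector<int> shown;
    for (int i = 1; i <= prot.getNumPeptides(); ++i) {
        if (prot.getPeptideDuplicate(i) != SketchDupe::DuplicateSameQuery) {
            shown.push_back(i);
        }
    }
    return shown;
}
```

Other duplicates stay in the list in this sketch, so client code can still show them in brackets as the standard reports do.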

Chimeric duplicates

Warning – this is a complex issue that only applies to peptide summary reports, and only needs to be understood by advanced users. It is best to always specify MSPEPSUM_REMOVE_CHIMERIC_DUPES or always use ms_mascotresfilebase::get_ms_mascotresults_params().

Mascot 2.5 added support for chimeric spectra. A chimeric spectrum contains MS/MS data from multiple precursors. Mascot divides each input spectrum with multiple precursor masses into a set of subsidiary queries, linked by the source index of the original spectrum. Each subsidiary query is matched separately, so that the number of output queries is the total number of precursor masses in the input file.

The subsidiary queries in each set can have duplicate matches, called chimeric duplicates. This is easiest to explain by example. Suppose we have a chimeric spectrum with two precursors, forming queries 1 and 2:

Query 1
Rank	Sequence	Delta
1	TVAGQDAVIVLLGTR	0.002411
2	ISMPDIDLNLTGPK	-0.916199	(dupe)
3	LLSENADLKKQVR	-0.992782	(dupe)
4	FLTGPLNLNDPDAK	-1.908081	(dupe)
5	AEAGLQDGISGPATAR	-0.883652	(dupe)

Query 2
Rank	Sequence	Delta
1	TVAGQDAVIVLLGTR	0.921411	(dupe)
2	ISMPDIDLNLTGPK	0.002801
3	LLSENADLKKQVR	-0.073782
4	FLTGPLNLNDPDAK	-0.989081
5	AEAGLQDGISGPATAR	0.035348

The two queries originate from the same chimeric spectrum and have nearly the same precursor masses. (The precursor mass tolerance is quite wide in this example.) As a result, both queries match the same peptide sequences. However, every match in query 1 except rank 1 has a much larger absolute mass delta than the match to the same sequence in query 2. This means the rank 2-5 matches in query 1 are all chimeric duplicates. Conversely, the rank 1 match in query 2 is a chimeric duplicate of the rank 1 match in query 1.

Parser removes chimeric duplicates from consideration if you open the results file as a peptide summary (ms_peptidesummary) and use the flag MSPEPSUM_REMOVE_CHIMERIC_DUPES. In the above case, you would only see one match in query 1 and four in query 2. Ranks are renumbered accordingly (e.g. query 2 rank 2 becomes query 2 rank 1 when the chimeric duplicate is removed). Chimeric duplicate removal is done at query level before protein grouping and before considering peptide match duplicates within protein hits.
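The effect on the example can be reproduced with a small model. This is a sketch under a simplifying assumption, not Parser's algorithm: each sequence is kept only in the subsidiary query where its absolute mass delta is smallest. RankedMatch, SubQuery and removeChimericDupes() are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: one ranked match of a subsidiary query.
struct RankedMatch {
    std::string sequence;
    double delta;  // precursor mass delta
};
using SubQuery = std::vector<RankedMatch>;  // element 0 is rank 1

// Keeps each sequence only in the subsidiary query where |delta| is smallest.
// Ranks are renumbered implicitly by the position in the returned vectors.
std::vector<SubQuery> removeChimericDupes(const std::vector<SubQuery>& queries) {
    // sequence -> (smallest |delta| seen, index of the query that holds it)
    std::map<std::string, std::pair<double, std::size_t>> best;
    for (std::size_t q = 0; q < queries.size(); ++q) {
        for (const RankedMatch& m : queries[q]) {
            assert(std::isfinite(m.delta));  // NaN would break the comparison
            const double d = std::fabs(m.delta);
            auto it = best.find(m.sequence);
            if (it == best.end()) {
                best.emplace(m.sequence, std::make_pair(d, q));
            } else if (d < it->second.first) {
                it->second = std::make_pair(d, q);
            }
        }
    }
    std::vector<SubQuery> kept(queries.size());
    for (std::size_t q = 0; q < queries.size(); ++q) {
        for (const RankedMatch& m : queries[q]) {
            if (best.at(m.sequence).second == q) kept[q].push_back(m);
        }
    }
    return kept;
}
```

Fed with the two queries from the tables above, this model leaves one match in query 1 and four in query 2, with ISMPDIDLNLTGPK becoming rank 1 of query 2.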

Chimeric duplicates are removed by default if you use ms_mascotresfilebase::get_ms_mascotresults_params() to construct the default flags to ms_peptidesummary. Otherwise you need to specify the flag yourself when creating the peptide summary object.

Note that the flag has no effect if the results file was created by Mascot 2.4 or earlier, or if the results file has no chimeric spectra.