Advanced reading: calculating the spectral library score threshold

Warning: The details of how the spectral library score threshold is calculated are not required reading. The API is described in section Spectral library search results. The sections below are intended for reference only.

Score excess over threshold

The expect value, library match "significance" and library score contribution to protein score all use the basic concept of score excess over threshold. This is simply score - thr: the higher the excess, the lower the expect value and vice versa. A match score exactly at the threshold has excess zero. Similarly, only library matches with excess above zero contribute to the protein score.

Score excess is used in a similar way with Mascot scores and FASTA matches. The main difference is that the Mascot score threshold is specific to a query.

The library score excess is calculated differently depending on which mode is active, integrated mode or SL-only mode. However, since the model in both cases is not a true statistical model, the phrase "library score is significant" should be interpreted very loosely. It simply means the score is above threshold (score excess is positive).

Score excess, threshold and expect value in integrated mode

In integrated mode, the library score excess is derived from matching the mean and standard deviation of library scores to the mean and standard deviation of Mascot score excess. This yields a formula for converting raw library scores to values equivalent to the Mascot score excess:

    excess(s) = mascot.mean + mascot.stdev * (s - library.mean) / library.stdev

The formula simply standardises library scores, then rescales them. Here s is the raw library score. The values for mascot.mean, mascot.stdev, library.mean and library.stdev are calculated from queries where the FASTA and library matches have the same sequence and the Mascot match is significant:

1. Iterate over all queries. If a query contains a significant rank 1 FASTA match and a library match at rank 2 with the same sequence, add the paired score (mascot.excess, library.score) to the list. mascot.excess = mascot.score - mascot.thr is the score excess over threshold for the rank 1 match.
2. Calculate the mean and standard deviation of mascot.excess in this list; store in mascot.mean and mascot.stdev.
3. Calculate the mean and standard deviation of library.score in this list; store in library.mean and library.stdev.
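
The steps above can be sketched in Python as follows. This is an illustration only, not the Mascot Parser implementation: the Match record, the "AA"/"SL" source labels and the function names are hypothetical. Treating "significant" as a score strictly above thr is an assumption, as is the use of the sample (n - 1) standard deviation, which was chosen because it reproduces the figures in the numerical example later in this section.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional


@dataclass
class Match:
    """Hypothetical record for one peptide match; not a Mascot Parser type."""
    sequence: str
    score: float
    source: str                   # "AA" = FASTA match, "SL" = library match
    thr: Optional[float] = None   # Mascot threshold; FASTA matches only


def collect_pairs(queries):
    """Step 1: gather (mascot.excess, library.score) pairs.

    `queries` maps a query number to its matches, ordered by rank.
    """
    pairs = []
    for matches in queries.values():
        if len(matches) < 2:
            continue  # need both a rank 1 and a rank 2 match
        top, second = matches[0], matches[1]
        if top.source != "AA" or second.source != "SL":
            continue  # rank 1 must be FASTA, rank 2 must be library
        if top.score <= top.thr:
            continue  # assumption: significant means score > thr
        if top.sequence != second.sequence:
            continue  # sequences must be identical
        pairs.append((top.score - top.thr, second.score))
    return pairs


def calibrate(pairs):
    """Steps 2 and 3: return (mascot.mean, mascot.stdev, library.mean, library.stdev)."""
    mascot = [m for m, _ in pairs]
    library = [s for _, s in pairs]
    return mean(mascot), stdev(mascot), mean(library), stdev(library)


def excess(s, mascot_mean, mascot_stdev, library_mean, library_stdev):
    """Convert a raw library score s to a Mascot-equivalent score excess."""
    return mascot_mean + mascot_stdev * (s - library_mean) / library_stdev
```

Note that stdev() needs at least two pairs, which mirrors the remark that the approach is unreliable in searches with few paired matches.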

The procedure can be justified if you look at the paired score distribution in large integrated library searches. Queries that match the same peptide sequence in both the protein sequence database and the spectral library show high correlation between Mascot score and library score. The shapes of the Mascot score distribution and the library score distribution are also quite similar, although the similarity breaks down in small searches or searches with few matches from one or the other search engine.

The formula is valid for all library scores in the search, not just those used in the above calculation. This is because MSPepSearch scores are on an absolute scale with a finite range, and scores between different spectra seem to be very well calibrated.

Now that we can compute the equivalent score excess for library matches, the expect value can be calculated using the same formula as for Mascot scores, namely

    E(s) = minProbability * 10 ** (-excess(s)/10)

Here minProbability is the significance threshold argument given to the ms_peptidesummary constructor. An excess of zero gives E(s) = minProbability, while high excess gives low expect value.
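
As a minimal sketch, the expect value formula translates directly into Python. The function name expect_value is hypothetical, and the default of 0.05 is taken from the default minProbability mentioned in the SL-only section.

```python
def expect_value(excess, min_probability=0.05):
    """Expect value for a given score excess over threshold."""
    return min_probability * 10 ** (-excess / 10)
```

Every 10 points of excess lowers the expect value by a factor of ten, so with the default minProbability an excess of 20 corresponds to an expect value of about 0.0005.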

The library score threshold is now easy to calculate: it is the score s for which score excess is zero. Equivalently, it is the score s for which E(s) = minProbability. Note that unlike with Mascot scores, the library score threshold is calculated after expect values are already known.
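
Setting excess(s) = 0 in the conversion formula and solving for s gives the threshold in closed form. The sketch below shows this rearrangement; the function name library_threshold is hypothetical, and the result is only as good as the four calibration statistics passed in.

```python
def library_threshold(mascot_mean, mascot_stdev, library_mean, library_stdev):
    """Raw library score at which the Mascot-equivalent excess is zero.

    Derived from
        0 = mascot_mean + mascot_stdev * (s - library_mean) / library_stdev
    by solving for s.
    """
    return library_mean - mascot_mean * library_stdev / mascot_stdev
```

Substituting the returned value back into excess(s) gives zero, up to floating-point rounding.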

The score excess is also used in calculating protein scores; see Protein scores below.

As a small numerical example, suppose the search contains ten queries with the following matches:

Query	Rank	Match	Source
1	(no matches)
2	1	CIPALDSLTPANEDQK, score = 51, thr = 13	AA
2	2	CIPALDSLTPANEDQK, score = 524	SL
3	1	SLNNQIETLLTPEGSR, score = 21, thr = 33	AA
4	(no matches)
5	1	ENNEQLR, score = 22, thr = 20	AA
5	2	ENNEQLR, score = 300	SL
6	1	NIHMWCAMR, score = 101	SL
7	1	TLNDELEIIEGMOK, score = 40, thr = 25	AA
7	2	TLNDELELIEGMOK, score = 255	SL
8	1	NSGGNNNTTDLK, score = 39, thr = 16	AA
8	2	NSGGNNNTTDLK, score = 428	SL
9	(no matches)
10	1	LYGTDDNTQEVEAVTNK, score = 61, thr = 36	AA
10	2	LYGTDDNTQEVEAVTNK, score = 384	SL

Queries 1, 4 and 9 contribute nothing, since they have no matches.

Queries 3 and 6 don't contribute, since one of the two matches is missing: query 3 has no library match and query 6 has no FASTA match. In query 3, the FASTA match is also below threshold (21 < 33), so it is not significant.

Query 7 doesn't contribute, since there is no library match to the same sequence (the match is to a slightly different sequence).

From the remaining queries, the data set is the pairs of scores from query 2 (38, 524), query 5 (2, 300), query 8 (23, 428) and query 10 (25, 384). The statistics will not be meaningful with such a small sample, but for the sake of completeness, we get mascot.mean = 22, mascot.stdev = 14.9, library.mean = 409 and library.stdev = 93.3. The transformed library score excesses for the matches in queries 2, 5, 6, 7, 8 and 10 are excess(524) = 40.4, excess(300) = 4.6, excess(101) = -27.2, excess(255) = -2.6, excess(428) = 25.0 and excess(384) = 18.0. For the paired queries, the transformed score excess is similar to the Mascot score excess, which is a feature of many real-life data sets as well.
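
The figures in this example can be reproduced with a few lines of Python. This is a check of the arithmetic only; the variable names are illustrative. The quoted standard deviations correspond to the sample (n - 1) form, which is what statistics.stdev computes.

```python
from statistics import mean, stdev

# (mascot.excess, library.score) pairs from queries 2, 5, 8 and 10
pairs = [(38, 524), (2, 300), (23, 428), (25, 384)]

mascot = [m for m, _ in pairs]
library = [s for _, s in pairs]

mascot_mean, mascot_stdev = mean(mascot), stdev(mascot)
library_mean, library_stdev = mean(library), stdev(library)


def excess(s):
    """Mascot-equivalent score excess for raw library score s."""
    return mascot_mean + mascot_stdev * (s - library_mean) / library_stdev


# Library matches from queries 2, 5, 6, 7, 8 and 10
for s in (524, 300, 101, 255, 428, 384):
    print(f"excess({s}) = {excess(s):.1f}")
```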

Score excess, threshold and expect value in SL-only mode

In SL-only mode, there are no Mascot scores, so there is no reference point based on an existing model. We discovered during the development of Mascot 2.6 that a library score of 300 is a rough but consistent estimate of a "universal" score threshold in MS/MS searches at approximately the 5% significance level. This observation might not be valid for small spectral libraries or small searches.

Unlike in integrated mode, score excess and the effective threshold are calculated from the following expect value formula:

    E(s) = 0.05 * 10 ** (-(s - 300)/100)

Here s is the raw library score. Note especially that minProbability does not appear in the formula.

The library score threshold is the value of s for which E(s) = minProbability; this is the same criterion as in the integrated case. If minProbability is the default 0.05, then the library score threshold is 300. Note that, as in the integrated case but unlike in FASTA-only searches, the expect value is known before the library score threshold.

The library score excess is simple: find the library score threshold thr, and let excess = score - thr. The excess is on the library score scale rather than a scale equivalent to Mascot score excess.
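
The SL-only calculations can be sketched in Python as follows. The function names are hypothetical; the threshold is obtained by solving E(s) = minProbability for s, which gives thr = 300 - 100 * log10(minProbability / 0.05).

```python
import math


def sl_expect_value(s):
    """Expect value of raw library score s in SL-only mode."""
    return 0.05 * 10 ** (-(s - 300) / 100)


def sl_threshold(min_probability=0.05):
    """Library score s at which sl_expect_value(s) equals min_probability."""
    return 300 - 100 * math.log10(min_probability / 0.05)


def sl_excess(s, min_probability=0.05):
    """Score excess over threshold, on the raw library score scale."""
    return s - sl_threshold(min_probability)
```

With the default minProbability of 0.05 the threshold is 300; tightening minProbability to 0.01 raises it to roughly 370.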

The score excess is used in calculating protein scores; see Protein scores below.

Protein scores

It's recommended to use MudPIT scoring with spectral library searches. The calculation is explained in ms_protein::getScore(). In SL-only mode, the contribution of a match is the raw library score excess. In integrated mode, the contribution is the Mascot-equivalent score excess (see excess(s) in Score excess, threshold and expect value in integrated mode).

Note that even if you select standard scoring, the protein score calculation always uses the uncorrected match score; there is no multiplicity correction.