
Quantitation: Statistical procedures

Usually, identification and quantitation are performed at the peptide level. The Mascot result report assigns the peptide matches to protein hits, and the ratios for individual peptide matches are combined to determine ratios for the protein hits. The methods provided for calculating a protein ratio from a set of peptide ratios are median, average, or ratio of summed intensities (only available for reporter protocol and referred to in method configuration as weighted average). The standard deviation of the peptide ratios provides a measure of the uncertainty in the protein ratio.

All statistical tests are performed in log space, and we assume the peptide log-ratios are normally distributed. This means we assume that ratios have a log-normal distribution. Since we are dealing with ratios, the average is the geometric mean and the standard deviation is the geometric standard deviation, which is a factor, never less than 1.

The confidence interval for n peptide ratios with geometric mean m and geometric standard deviation s is from m / s^(c/sqrt(n)) to m * s^(c/sqrt(n)), where c is the critical value for a two-sided Student’s t test with n-1 degrees of freedom.

For example, assume we have n=10 peptide ratios with a geometric mean of m=0.91 and geometric standard deviation s=1.06. From a table of critical values, c=2.26 for a 95% confidence interval, giving us the interval 0.91 / 1.06^(2.26/sqrt(10)) = 0.87 to 0.91 * 1.06^(2.26/sqrt(10)) = 0.95.
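The confidence interval arithmetic can be reproduced with a few lines of Python. This is a minimal sketch rather than Mascot's own code: the function names are illustrative, the critical value c is passed in by the caller (for example from a printed table), and the use of the sample standard deviation with an n-1 denominator is an assumption.

```python
import math

def geometric_mean_and_sd(ratios):
    """Geometric mean and geometric SD, computed from the logs of the ratios.

    Assumption: the sample standard deviation (n-1 denominator) is used.
    """
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    mean = sum(logs) / n
    variance = sum((x - mean) ** 2 for x in logs) / (n - 1)
    return math.exp(mean), math.exp(math.sqrt(variance))

def geometric_confidence_interval(m, s, n, c):
    """Interval m / s^(c/sqrt(n)) to m * s^(c/sqrt(n)).

    m: geometric mean, s: geometric SD (a factor >= 1), n: number of ratios,
    c: two-sided Student's t critical value for n-1 degrees of freedom.
    """
    factor = s ** (c / math.sqrt(n))
    return m / factor, m * factor

# Values from the worked example in the text.
lower, upper = geometric_confidence_interval(0.91, 1.06, 10, 2.26)
print(f"{lower:.2f} to {upper:.2f}")  # 0.87 to 0.95
```

Because the interval is multiplicative, it is symmetric about m on a log scale but not on a linear scale.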

Ratios for peptide matches are only reported if various quality criteria are fulfilled, the most important being:

  • Peptide modification state
  • Minimum precursor charge (default 1)
  • Strength of the peptide match, defined in terms of either a minimum score, a maximum expect value, or the score being at or above either the identity threshold or the homology threshold (default maximum expect of 0.05)
  • Method specific criteria, such as a minimum number of fragment ion pairs for multiplex

A ratio for a protein hit is only reported if the minimum number of peptide matches is reached (default 2). If the ratios for the peptide matches are not consistent with a sample from a log-normal distribution, the SD(geo) value is displayed in italics, and should be treated with caution.

Testing for normality

Testing for outliers and reporting a standard deviation for the protein ratio relies on the peptide ratios being consistent with a sample from a log-normal distribution. If the peptide ratios do not appear to be from a log-normal distribution, this may indicate that the values are meaningless, and something went systematically wrong with the analysis. On the other hand, it may indicate something interesting, for example that the peptides have been mis-assigned and actually come from two proteins with very different ratios, so that the distribution is bimodal.

Shapiro-Wilk W test

In the Shapiro-Wilk W test, the null hypothesis is that the sample is taken from a normal distribution. This hypothesis is rejected if the p-value for the test statistic W is less than 0.05. The routine used is valid for sample sizes between 3 and 2000.

References:

  1. Royston, J. P., An Extension of Shapiro and Wilk’s W Test for Normality to Large Samples, Applied Statistics 31 115-124 (1982)
  2. Royston, P., Remark AS R94: A Remark on Algorithm AS 181: The W-test for Normality, Applied Statistics 44 547-551 (1995)

Outlier removal

The available methods for testing and removing outliers are none, auto, dixons, grubbs, and rosners. Choosing auto means that Dixon’s method will be used if the number of values is between 4 and 25, while Rosner’s method will be used if the number of values is greater than 25. If the ratios for the peptide matches are not consistent with a sample from a log-normal distribution, the SD(geo) value is displayed in italics, and outlier removal is skipped.
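The auto rule amounts to a simple dispatch on the number of values. The sketch below is illustrative only: the function name is hypothetical, and the behaviour for fewer than 4 values is not stated above, so returning "none" in that case is an assumption.

```python
def choose_outlier_method(method, n_values):
    """Resolve the configured outlier method for a sample of n_values ratios.

    method is one of: none, auto, dixons, grubbs, rosners.
    """
    if method != "auto":
        return method
    if 4 <= n_values <= 25:
        return "dixons"
    if n_values > 25:
        return "rosners"
    # Fewer than 4 values: not specified in the text; this sketch assumes
    # that no outlier removal is attempted.
    return "none"
```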

Any statistician will warn of the dangers of blindly removing outliers. The general advice is to analyse the data both with and without the outlier(s) and see if the conclusions are qualitatively different.

Dixon’s method

Dixon’s r11 test, also referred to as N9, is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level 0.05 are used, as tabulated by Verma and Quiroz-Ruiz. The test is applicable to between 4 and 100 values. Each time a value is removed, the test is repeated.
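As an illustration, the r11 statistic and the repeated single-outlier removal might look like the following. This is a sketch under stated assumptions, not the Mascot implementation: the function names are hypothetical, the test is applied to log-ratios (in line with the statement that all tests are performed in log space), and the tabulated critical values are not reproduced here, so the caller supplies them as a function of sample size.

```python
import math

def dixon_r11(values):
    """Return (r11_lower, r11_upper) for a sample of at least 4 values."""
    x = sorted(values)
    if len(x) < 4:
        raise ValueError("r11 needs at least 4 values")

    def safe_ratio(gap, spread):
        # Guard against a zero spread when values are tied.
        return gap / spread if spread > 0 else 0.0

    # Upper suspect x[-1]: gap to its neighbour, over the range excluding x[0].
    upper = safe_ratio(x[-1] - x[-2], x[-1] - x[1])
    # Lower suspect x[0]: gap to its neighbour, over the range excluding x[-1].
    lower = safe_ratio(x[1] - x[0], x[-2] - x[0])
    return lower, upper

def remove_outliers_dixon(ratios, critical_value):
    """Remove one outlier at a time until the test no longer rejects.

    critical_value: callable taking the current sample size and returning
    the tabulated r11 critical value at significance level 0.05
    (caller-supplied; no values are hard-coded in this sketch).
    Returns (kept, removed).
    """
    kept = sorted(ratios)
    removed = []
    while 4 <= len(kept) <= 100:
        logs = [math.log(r) for r in kept]
        lower, upper = dixon_r11(logs)
        if max(lower, upper) <= critical_value(len(kept)):
            break
        removed.append(kept.pop(0) if lower > upper else kept.pop())
    return kept, removed
```

For the values 1, 2, 3, 4, 10 the upper statistic is (10 - 4) / (10 - 2) = 0.75 and the lower statistic is (2 - 1) / (4 - 1) = 0.33, so the upper extreme is the suspect value.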

References:

  1. Dixon, W. J., Processing Data for Outliers, Biometrics 9 74-89 (1953)
  2. Verma, S. P. and Quiroz-Ruiz, A., Critical values for six Dixon tests for outliers in normal samples up to sizes 100, and applications in science and engineering, Revista Mexicana de Ciencias Geológicas, 23 133-161 (2006)

Grubbs’ method

Grubbs’ method is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level 0.05 are used, as tabulated by Verma and Quiroz-Ruiz (Table A1 for discordancy test N1). The test is applicable to between 3 and 100 values. Each time a value is removed, the test is repeated.
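A comparable sketch for Grubbs' method is shown below. The same caveats apply: the names are hypothetical, the statistic is computed on log-ratios as an assumption, and critical values come from a caller-supplied lookup rather than being reproduced here.

```python
import math
import statistics

def grubbs_statistics(values):
    """Return (g_lower, g_upper): distance of each extreme from the mean,
    in units of the sample standard deviation (n-1 denominator)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return 0.0, 0.0
    return (mean - min(values)) / sd, (max(values) - mean) / sd

def remove_outliers_grubbs(ratios, critical_value):
    """Remove one outlier at a time, repeating the test after each removal.

    critical_value: callable taking the current sample size and returning
    the tabulated critical value at significance level 0.05.
    Returns (kept, removed).
    """
    kept = sorted(ratios)
    removed = []
    while 3 <= len(kept) <= 100:
        logs = [math.log(r) for r in kept]
        g_lower, g_upper = grubbs_statistics(logs)
        if max(g_lower, g_upper) <= critical_value(len(kept)):
            break
        removed.append(kept.pop(0) if g_lower > g_upper else kept.pop())
    return kept, removed
```

Unlike Dixon's r11, which uses only the extreme values and their neighbours, the Grubbs statistic depends on every value in the sample through the mean and standard deviation.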

References:

  1. Grubbs, F. E., Procedures for Detecting Outlying Observations in Samples, Technometrics 11 1-21 (1969)
  2. Verma, S. P. and Quiroz-Ruiz, A., Critical values for 22 discordancy test variants for outliers in normal samples up to sizes 100, and applications in science and engineering, Revista Mexicana de Ciencias Geológicas, 23 302-319 (2006)

Rosner’s method

Rosner’s method will detect and remove multiple outliers in a single pass. Critical values for a significance level 0.05 are used. The test will remove up to 10 outliers from a sample of at least 25 values.
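The generalized ESD procedure from the reference below can be sketched as follows. This is an illustration, not the Mascot code: the function name is hypothetical, working on log-ratios is an assumption, and the critical values (one per candidate outlier) must be supplied by the caller because they are not reproduced here.

```python
import math
import statistics

def rosner_outliers(ratios, critical_values, max_outliers=10):
    """Generalized ESD: return the ratios flagged as outliers.

    critical_values[i] is the critical value for the (i+1)-th extreme
    deviate at significance level 0.05; it needs at least max_outliers
    entries. At each step the value furthest from the mean is set aside
    and its studentized deviation R is recorded. The number of outliers
    is the largest i for which R_i exceeds its critical value.
    """
    if len(ratios) < 25:
        raise ValueError("this sketch expects at least 25 values")
    remaining = list(ratios)
    candidates = []
    deviates = []
    for _ in range(max_outliers):
        logs = [math.log(r) for r in remaining]
        mean = statistics.fmean(logs)
        sd = statistics.stdev(logs)
        if sd == 0:
            break
        idx = max(range(len(logs)), key=lambda i: abs(logs[i] - mean))
        deviates.append(abs(logs[idx] - mean) / sd)
        candidates.append(remaining.pop(idx))
    n_outliers = 0
    for i, r in enumerate(deviates):
        if r > critical_values[i]:
            n_outliers = i + 1
    return candidates[:n_outliers]
```

Because the decision is taken after all candidate deviates have been computed, a pair of nearby outliers that mask each other in the first step can still both be detected.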

References:

  1. Rosner, B., Percentage Points for a Generalized ESD Many-Outlier Procedure, Technometrics 25 165-172 (1983)

Protein ratio calculation

The three methods of deriving a protein hit ratio from a set of peptide ratios are median, average, and ratio of summed intensities.

  • Median: The median peptide ratio is selected to represent the protein ratio. If there is an even number of peptide ratios, the geometric mean of the median pair is used.
  • Average: The protein ratio is the geometric average of the peptide ratios.
  • Summed intensities: For each component, the intensity values are summed over the set of peptides and the protein ratio(s) calculated from the summed values. This will be the best measure if the accuracy is limited by counting statistics. This is only available for reporter protocol and, when selected, SD(geo) and p-values for the protein ratios are not available.
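The three calculations listed above can be written compactly. The function and parameter names in this sketch are illustrative; in particular, the summed-intensity function assumes the caller passes the per-peptide intensities of the two components being compared as two lists.

```python
import math

def median_ratio(ratios):
    """Median ratio; for an even count, the geometric mean of the middle pair."""
    x = sorted(ratios)
    mid = len(x) // 2
    if len(x) % 2:
        return x[mid]
    return math.sqrt(x[mid - 1] * x[mid])

def average_ratio(ratios):
    """Geometric average: exponential of the mean log-ratio."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def summed_intensity_ratio(numerator_intensities, denominator_intensities):
    """Ratio of intensities summed over the set of peptides."""
    return sum(numerator_intensities) / sum(denominator_intensities)

print(median_ratio([1.0, 2.0, 4.0]))                           # 2.0
print(summed_intensity_ratio([100.0, 300.0], [200.0, 200.0]))  # 1.0
```

In the last line the two individual peptide ratios are 0.5 and 1.5, but the more intense peptide dominates the summed ratio, which is why this measure is preferred when accuracy is limited by counting statistics.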

Significant changes

A protein ratio is reported in bold face if it is significantly different from unity. The comparison test is Student’s t statistic in log space:

|x - µ| <= t * s / sqrt(N)

If this inequality is true, then there is no significant difference at the stated confidence level. (N is the number of peptide ratios, s is the standard deviation and x the mean of the peptide ratios, both numbers calculated in log space. The true value of the ratio, µ, is 0 in log space. t is Student’s t for N-1 degrees of freedom and a two-sided confidence level of 95%.)

When the ratio type is median, there is an extra correction factor. Writing s for the geometric standard deviation and n for the number of peptide ratios, the standard error in log space is log(s)/sqrt(n) when the ratio type is average, and C * log(s)/sqrt(n) when the ratio type is median. The correction factor C is sqrt(pi/2), approximately 1.25, when sample size > 100 and between 1 and 1.25 for smaller sample sizes.
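A minimal sketch of the comparison is given below, written in terms of the geometric mean m and geometric standard deviation s. The function name and the correction parameter are illustrative. The critical value and the median correction factor are passed in by the caller, since the small-sample values of C are not listed here.

```python
import math

def differs_from_unity(m, s, n, t_crit, correction=1.0):
    """True if the protein ratio is significantly different from 1.

    m: geometric mean (or median) of the peptide ratios
    s: geometric standard deviation, n: number of peptide ratios
    t_crit: two-sided Student's t critical value for n-1 degrees of freedom
    correction: 1.0 for the average, the factor C for the median
    """
    standard_error = correction * math.log(s) / math.sqrt(n)
    return abs(math.log(m)) > t_crit * standard_error

# m=0.91, s=1.06, n=10 with the 95% critical value 2.26:
print(differs_from_unity(0.91, 1.06, 10, 2.26))  # True
```

The result is consistent with a 95% confidence interval of 0.87 to 0.95 for the same numbers, which does not include 1.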

Data normalisation strongly influences which protein ratios are shown in bold. If a large number of values are bold, this is likely to indicate that the comparison to unity has no meaning. If the ratios for the peptide matches are not consistent with a sample from a log-normal distribution, the SD(geo) value is displayed in italics, and the significance test is omitted.

References:

  1. Hojo, T. and K. Pearson (1931): Distribution of the Median, Quartiles and Interquartile Distance in Samples From a Normal Population. Biometrika 23(3/4):315-363.

Protein ratio p-values

The null hypothesis is that the protein ratio is 1.0. The code calculates the t statistic

t = (log(m) - mu)/(log(s)/sqrt(n))

where m is the observed geometric mean of peptide ratios, mu = log(1.0) = 0.0 is the population mean from the null hypothesis, s is the geometric standard deviation and n is the sample size. The p-value is then

p = 1 - (F(t; n-1) - F(-t; n-1))

where F(t; d.f.) is the cumulative distribution function of Student’s t distribution with d.f. degrees of freedom. This is the standard two-tailed p-value. Parser uses a reasonably accurate algorithm for the cumulative distribution function.

You can sometimes approximate the p-value from a table of critical values. For example, if t = 2.3 and d.f. = 9, the upper probability is a bit over 0.975, say 0.98, and because the t distribution is symmetric, the lower probability is around 1 - 0.98 = 0.02. So p = 1 - (0.98 - 0.02) = 0.04, approximately. (Exact value is 0.04699939 to 8 decimal places.)

For the median, the p-value formula is the same but the test statistic has the extra correction factor C, described earlier:

t = (log(m) - mu)/(C * log(s)/sqrt(n))
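For readers who want to check a reported p-value, the calculation can be sketched with the Python standard library alone. This is not the algorithm used by Parser: as an assumption made for simplicity, the central area F(t) - F(-t) is obtained by integrating the Student's t density with Simpson's rule, and all function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """p = 1 - (F(t) - F(-t)), using Simpson's rule (steps must be even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric, so the area over [-t, t] is twice [0, t].
    central_area = 2 * total * h / 3
    return 1.0 - central_area

def protein_ratio_p_value(m, s, n, correction=1.0):
    """p-value for the null hypothesis that the protein ratio is 1.0.

    correction is 1.0 for the average and the factor C for the median.
    Assumes n >= 2 and s > 1, so that the standard error is non-zero.
    """
    t = math.log(m) / (correction * math.log(s) / math.sqrt(n))
    return two_sided_p_value(t, n - 1)

print(round(two_sided_p_value(2.3, 9), 3))
```

With t = 2.3 and 9 degrees of freedom the printed value should agree with the exact figure quoted in the table-lookup example, to the displayed precision.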