Posted by John Cottrell (August 2, 2013)

Current challenges in quantitative proteomics

"Current challenges in software solutions for mass spectrometry-based quantitative proteomics" is a recent paper in Amino Acids by a group of expert authors that describes ten areas of particular difficulty in data processing for quantitation. Full text is available online at Springer Link.

I would argue that Mascot Distiller meets almost all of these challenges. Obviously, I have to declare an interest in making Distiller more widely used, so I’ll try to justify this claim in terms that are reasonably objective and verifiable. If you are sceptical (as a good scientist should be), please try Distiller on a free 30-day evaluation, as described under Evaluation on the download page, and decide for yourself.

As with any complex software, there is a learning curve. To help you judge the quality of the results without an excessive investment of time, we invite you to send a non-confidential raw file (or two if it’s a label-free experiment) together with a description of the type of quantitation and anything else we might need to know about the sample. We’ll process the file and send back the Distiller project file, allowing you to study the results and see all the settings that were used. Just contact support@matrixscience.com for upload instructions.

Challenge 1: software usability

Usability is a subjective quality; some people prefer a mouse, others a command line. The authors take the very reasonable approach of listing what they feel are the important usability features for two different types of user: an end-user and a bioinformatics developer. The items on the end-user list are:

Ease of installation. Is the tool at hand easy to install, or does it require expert knowledge? For example, can you use an installer, or does it require manual compiling from the source code?: Distiller is installed using a standard Windows setup wizard; no compilation is required
Presence of documentation or tutorials, which help in perceiving the software as ‘easy to use’.: Context sensitive HTML help is included and a webcast tutorial is available
Presence of a graphical user interface.: Present
Presence of interactive feedback during data processing, to allow for adjustments and ad hoc decision making.: There are plenty of progress bars, but I guess the authors mean more than this. Just about everything in Distiller is configurable, but you can’t make adjustments on the fly. There are three main stages of processing: peak picking, database search, and quantitation. At each stage, you choose your settings, process a file, then inspect the results. If something isn’t right, you have to change the relevant settings and start over.
Presence of interactive feedback during the quantification process to allow for manual validation of the quantification results or visual assessment of what went wrong in case of no results.: Distiller uses three rules to decide whether to accept or reject each peptide measurement. There are also outlier detection routines to reject statistical outliers. Users can override these decisions based on inspection of the XICs and the goodness of fit between the experimental and calculated isotope distributions.
Presence of a mailing list, for update notifications, discussion about problems and direct help from the software developers.: Technical support is available by email or phone
Storage and sharing of user data and results.: The entire work-space is saved to a project file. If not registered, Distiller operates as a free project viewer, so you can share results with colleagues who don’t have a Distiller licence.

The suggested requirements of a bioinformatics developer:

Flexibility, i.e., how well does the software follow current standards and/or does it handle multiple vendor formats?: Supports mzML, mzXML, and all mainstream native raw file formats
Modularity, i.e., can the software be easily integrated into existing pipelines or workflow management tools: There is an API to all the peak picking functions and quantitation can be performed by executing Distiller on the command line. The quantitation results including intermediate data (e.g. XIC traces) can be exported as an XML file with a published and annotated schema
Portability, i.e., can the software run on different hardware platforms?: No, Distiller is only available for Windows. All MS data systems are Windows-based and some of the vendor supplied libraries used to access the raw data are only available for Windows
Documentation.: In addition to the end-user help, the API is fully documented
Distribution terms: freeware, shareware or commercial? Open source or closed source? Web based?: Commercial, closed source, not web-based
Scaling and parallel processing, i.e., are multithreading, multiprocessing or grid-based processing possible?: Processing is multithreaded and Distiller will try to use all the resources on a multiprocessor PC, if permitted, but it is not compatible with grid or cloud processing
Batch processing, i.e., is it possible to run large batches of files in a single instance and without manual intervention?: All stages from peak picking to saving the fully processed project can be automated using Mascot Daemon

I think we can claim a score of 6.5 out of 7 from the end-user perspective. The only thing missing is the ability to make adjustments on the fly. Maybe 5 out of 7 for the bioinformatics developer, because Distiller is only available for Windows and I’m guessing that open source freeware would be the authors’ first choice for item 5.

Challenge 2: data reduction

For MS, this is described in terms of reducing the data to a form suitable for feature detection. Distiller works directly from the original, raw data files, so this doesn’t apply.

MS/MS quantitation, such as iTRAQ and TMT, is reported in the Mascot search result report, because all the required information can be contained in a conventional MS/MS peak list. Distiller is a suitable choice for creating the peak list, and fits single peaks in the reporter ion region, rather than averagine isotope distributions. The authors list five aspects to MS/MS data reduction:

Pre-processing to centroid peaks, filter out noise, deconvolute multiply charged ions to the m/z of the corresponding 1+ charge state, and deal with isotope clusters.: Distiller performs all of the above. Noise is discarded because it doesn’t fit to the shape of an isotope distribution. Charge deconvolution is optional, and can be important for instruments that generate highly charged precursors.
Detection and clustering of multiple redundant spectra of the same peptide (Beer et al. 2004; Tabb et al. 2005). From the point of view of quantification, clustering algorithms may be useful for the detection of weaker peptides.: If the acquisition method allows redundant spectra to be acquired, they can be summed according to tolerances on the precursor m/z and elution time. In the context of quantitation, it could be useful to inhibit summing if spectral similarity fell below some threshold, but it is usually safer to turn off summing altogether. If there are two spectra for the same peptide, and peptide ratios are combined using a weighted average, it makes no difference whether the spectra are summed or not. If there are two spectra which might be for the same peptide, but one is noisy and gets a non-significant match, it is better to exclude it from quantitation.
Detection of spectra of multiple co-eluting peptides (Bern et al. 2010; Houel et al. 2010) which can seriously harm identification and quantification.: Co-eluting peptides are a recognised problem for reporter ion quantitation, and limit the dynamic range. If identification suffers, this is actually a good thing, as it helps exclude such spectra. It would be possible for the output of an MS/MS spectrum to be suppressed when there were multiple precursors in the survey scan and the data was destined for isobaric quantitation. This is on the wish list for Distiller.
Elimination of low-quality spectra (Flikka et al. 2006; Junqueira et al. 2008).: This is most efficiently handled by the search engine. Rule-based pre-processing is redundant.
Reassignment of precursor charge and m/z (Mayampurath et al. 2008; Shinkawa et al. 2009).: An important reason to use Distiller, as described under the next challenge.

Challenge 3: feature detection

Feature detection is one of the key strengths of Distiller, which calculates a theoretical isotope distribution and fits it to the experimental data. The authors recognise that this is the optimum approach, but then comment "We are not aware of any current tools that double-check the isotopic pattern after the peptide assignment." In fact, this is exactly what Distiller does. The initial peak picking calculates distributions based on averagine then, during quantitation, the actual elemental composition of the matched peptide is used to get a more accurate profile. The shape is further modified by predicting the effect of under-enrichment, since no label is ever 100% pure. This is extremely important for ¹⁵N metabolic labeling, where even 1% under-enrichment causes significant tailing because of the large number of heavy atoms in a typical peptide.
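To illustrate why a small amount of under-enrichment matters so much for ¹⁵N, here is a minimal sketch. It is not Distiller's code: it simply assumes each nitrogen atom is labelled independently with probability equal to the enrichment, and it ignores the natural-abundance isotopes of the other elements. The function name and the choice of 20 nitrogen atoms are illustrative assumptions.

```python
from math import comb

def label_incorporation(n_atoms, enrichment):
    """Return P(k atoms left unlabelled) for k = 0..n_atoms.

    Simple binomial model: each atom is labelled independently
    with probability `enrichment` (an assumption of this sketch).
    """
    q = 1.0 - enrichment
    return [comb(n_atoms, k) * q**k * enrichment**(n_atoms - k)
            for k in range(n_atoms + 1)]

# Hypothetical peptide with 20 nitrogen atoms and a label that is 99% enriched
dist = label_incorporation(20, 0.99)
fully_labelled = dist[0]   # every nitrogen is 15N
one_short = dist[1]        # exactly one nitrogen is still 14N
```

With these assumed numbers, only about 82% of the heavy signal sits in the fully labelled species and roughly 17% appears one mass unit lighter, which is the tailing that has to be modelled.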

I suggest Distiller addresses all of the issues described under feature detection:

Deisotoping (and abundance measurements): Distiller calculates and fits a distribution based on actual elemental composition
Isobaric interference from isotopic clusters: When isotope distributions overlap, Distiller uses deconvolution in the intensity domain to determine the areas of the individual components. This is especially important for ¹⁸O.
Isobaric interference from co-eluting peptides: Distiller supports all of the approaches described in the paper.
Satellite peaks from partial isotope enrichment: Distiller models under-enrichment, as mentioned above. It can even model under-enrichment in more than one atom, for the brave souls doing ¹⁵N + ¹³C metabolic labeling.
Satellite peaks from proline conversion: Correction for Arg-Pro conversion in SILAC can be specified in the quantitation method in a very general way.
Detector saturation: One of the Distiller quality thresholds is the correlation coefficient between the calculated and experimental isotope distributions. This should catch cases where detector saturation causes the distribution to become seriously distorted, but I’ve never had an opportunity to test this on real, saturated data. If anyone has some data they’d be willing to share, please contact support@matrixscience.com.
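The correlation threshold mentioned under detector saturation can be pictured with a short sketch. It assumes a plain Pearson correlation between predicted and observed isotope peak intensities; whether Distiller computes exactly this statistic is an assumption, and all the intensity values are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up relative intensities for the first four isotope peaks
calculated = [1.00, 0.85, 0.42, 0.15]
observed_ok = [980.0, 840.0, 400.0, 160.0]
# Same peptide, but the two strongest peaks are clipped at 600 counts
observed_saturated = [600.0, 600.0, 400.0, 160.0]

r_ok = pearson(calculated, observed_ok)
r_sat = pearson(calculated, observed_saturated)
```

In this toy case the clipped distribution correlates less well with the prediction than the undistorted one, so a sufficiently strict threshold would reject it.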

Challenge 4: noise rejection

The authors identify three types of noise.

Random noise in the mass domain is not an issue for peak picking by fitting a calculated isotope distribution. Smoothing is simply not required.

Chemical noise in the mass domain is strongly discriminated against by the peak picking. For low resolution data, some may get through, giving rise to an elevated background in the time domain. This is largely eliminated by the way Distiller calculates a ratio. For each ratio, the pairs of component intensities from the scans in the XIC peak are fitted by a straight line using the method of least squares with errors in both co-ordinates. The gradient of the fitted line is the best estimate of the ratio and any constant background becomes the intercept. The standard error for the fit is a good measure of the reliability of the ratio, and is used as one of the quality thresholds.
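The line fit described above can be sketched as follows. This is an illustration, not Distiller's implementation: it assumes equal error variance in both co-ordinates (orthogonal regression), which has a closed-form solution, and the intensities are invented. The function name is hypothetical.

```python
from math import sqrt

def orthogonal_fit(xs, ys):
    """Fit y = intercept + slope * x by minimising perpendicular distances.

    Assumes equal error variance in x and y (an assumption of this sketch).
    Fails with a division by zero if x and y are uncorrelated.
    Returns (slope, intercept).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx

# Made-up light and heavy intensities across the scans of one XIC peak:
# the true ratio is 2.0, with a constant background of 50 in the heavy channel
light = [100.0, 400.0, 900.0, 500.0, 150.0]
heavy = [2.0 * x + 50.0 for x in light]

ratio, background = orthogonal_fit(light, heavy)
naive_ratio = sum(heavy) / sum(light)  # summed areas, background included
```

The fitted gradient recovers the ratio and the background falls out as the intercept, whereas the ratio of summed intensities is inflated by the background.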

Nonprotein contaminants behave like chemical noise and the remedies are the same.

Challenge 5: retention time alignment

It is probably true to say that, for all software, time alignment is only as good as the chromatography. In a label-free experiment, if the XIC peak for a peptide is different in shape from run to run, the measurement is bound to be unreliable. There are also severe difficulties with strongly up- or down-regulated peptides, such that one or more XICs are missing or at noise level.

Distiller also supports time-alignment for stable isotope experiments. The original deuterium ICAT is little used today, but an increasing number of groups are using deuterium in dimethylation or SILAC labels. If the deuterium causes a significant elution time shift, the ratio within any one scan is distorted, and it is necessary to ‘time align’ the light and heavy signals.

Challenge 6: peptide identification

There is no requirement for MS/MS identification of a peptide across all components. For example, in 3 component SILAC, only one of the light, medium or heavy states needs to be in the search results. For label-free, it is only necessary to identify the peptide in one of the runs.

The authors speculate on identification based on possible combinations of database search, library search and de novo. It’s hard to imagine a situation where one would wish to quantify a peptide of unknown sequence, so I’m not sure how de novo helps. In most cases, peak picking and quantitation take very much longer than the database search, so any speed advantage that library search might offer seems unimportant in this context.

Challenge 7: normalization of peptide abundances

Distiller checks most of the boxes for normalization, which can be global or based on specified proteins or peptides, but not some of the more exotic approaches: linear regression normalization, local regression normalization, quantile normalization.
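As a rough picture of global versus reference-based normalization, the sketch below scales all ratios so that the median of a chosen set becomes 1.0. Using the median as the measure of central tendency is an assumption of this example, and the function name, peptide identifiers and values are hypothetical.

```python
from statistics import median

def normalise(ratios, reference=None):
    """Scale ratios so the median of the reference set becomes 1.0.

    ratios: dict mapping peptide id -> ratio
    reference: ids to normalise on; None means use every ratio (global)
    """
    keys = list(ratios) if reference is None else list(reference)
    centre = median(ratios[k] for k in keys)
    return {k: r / centre for k, r in ratios.items()}

raw = {"pep1": 1.2, "pep2": 1.5, "pep3": 1.8, "pep4": 6.0}

global_norm = normalise(raw)                       # all peptides
spiked_norm = normalise(raw, reference=["pep1"])   # a specified peptide
```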

Challenge 8: protein inference

For a large data set, the default protein inference method is family grouping. This produces a very rigorous minimal list but recognises that there will always be some degree of ambiguity for proteins with shared peptides. If you wish, you can choose to quantify only peptides that are unique to one protein family member, but this may discard a large proportion of the measurements. It is better to let everything through, then inspect proteins where the variance of the peptide ratios is higher than expected, which might indicate that matches for differentially regulated isoforms have been incorrectly grouped.

Challenge 9: protein quantification

You can choose from all the common methods of calculating a protein ratio from a set of peptide ratios: average, median, and weighted average. The Top N approach is also implemented in the form of the average protocol. This is particularly well suited to absolute quantitation within a mixture where the amount of one or more proteins is known.
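The three ways of combining peptide ratios into a protein ratio can be compared with a small sketch. It assumes ratios are averaged in log space (a geometric mean) and that the weights are peptide intensities; both are assumptions made for illustration rather than a statement of how Distiller does it.

```python
from math import exp, log
from statistics import median

def protein_ratio(peptide_ratios, method="median", weights=None):
    """Combine peptide ratios into one protein ratio (log-space sketch)."""
    logs = [log(r) for r in peptide_ratios]
    if method == "average":
        return exp(sum(logs) / len(logs))
    if method == "median":
        return exp(median(logs))
    if method == "weighted":
        total = sum(weights)
        return exp(sum(w * l for w, l in zip(weights, logs)) / total)
    raise ValueError(f"unknown method: {method}")

# Made-up data: two strong peptides agree, one weak peptide is an outlier
ratios = [2.0, 2.0, 8.0]
intensities = [1000.0, 1000.0, 10.0]

avg = protein_ratio(ratios, "average")
med = protein_ratio(ratios, "median")
wtd = protein_ratio(ratios, "weighted", weights=intensities)
```

In this example the plain average is pulled towards the weak outlier, while the median and the intensity-weighted average both stay close to 2.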

Challenge 10: statistical significance analysis and data mining

Distiller includes many of the basic statistical methods referred to in this section: Shapiro–Wilk test for normality, Student’s t-test for significant fold change, and non-parametric outlier detection. It does not extend to more complex treatments, such as ANOVA. Distiller is a tool for processing a single data set. This may span multiple files, as in label-free or shotgun fractions, but Distiller doesn’t seek to consolidate and report data across multiple experiments, such as technical and biological replicates.
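For readers unfamiliar with the t-test in this setting, here is a minimal sketch of a one-sample test on log ratios, asking whether the mean differs from 0 (a ratio of 1). Whether Distiller formulates its test exactly this way is an assumption, and the peptide ratios are made up. The critical value is the standard two-sided 95% value for 4 degrees of freedom.

```python
from math import log2, sqrt
from statistics import mean, stdev

def t_statistic(ratios):
    """One-sample t statistic for mean log2 ratio against 0 (no change)."""
    logs = [log2(r) for r in ratios]
    return mean(logs) / (stdev(logs) / sqrt(len(logs)))

T_CRIT_DF4 = 2.776  # two-sided 95% critical value, 4 degrees of freedom

changed = [1.9, 2.1, 2.0, 2.2, 1.8]      # hypothetical regulated protein
unchanged = [0.8, 1.25, 1.0, 1.1, 0.9]   # hypothetical unregulated protein

changed_is_significant = abs(t_statistic(changed)) > T_CRIT_DF4
unchanged_is_significant = abs(t_statistic(unchanged)) > T_CRIT_DF4
```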

Analysis of complex data sets is best handled by a separate application with a more appropriate user interface. The quantitation results from Distiller can be exported as CSV and XML for processing in the statistical packages cited in the paper, such as R or MATLAB.

Keywords: Mascot Distiller, peak picking, quantitation, statistics

2 comments on “Current challenges in quantitative proteomics”

Corbin Kembel on March 17, 2014 at 18:05 said:

When viewing a Peptide Summary Report, I select a protein hit and view the Mascot Search Results. When viewing the Mascot Search Results, the identified peptides of the protein hit are displayed. However, some of these peptides are shown more than once. Is this a method of quantitation? If a peptide is shown 4 times on this report can this be extrapolated to determine a relative concentration?

Thank you.
- John Cottrell on March 19, 2014 at 20:17 said:
  
  Getting multiple matches to the same peptide is a function of the acquisition method. Maybe you are not using an exclusion list or the exclusion list settings are not quite right? While there will be some correlation between abundance and multiplicity of matches, spectral counting quantitation methods normally look at the number of distinct peptides matched per protein rather than the count of matches.

Matrix Science