Spectral libraries
[Mascot results file module]

Spectral library search results

Mascot 2.6 and later can search spectral libraries in parallel with a FASTA database. This page deals with opening spectral library search results using the standard Parser interface. If you need to read spectral library files (in MSP format), see Classes for reading and writing spectral libraries.

By default, Parser opens spectral library results in backwards-compatible mode, where spectral library matches are invisible. Enabling spectral library support in code using Parser is more involved than just setting a constructor flag. Please study the list of methods under Spectral library API summary. Many of the methods listed either take a new argument in Parser 2.6 to differentiate between FASTA and library matches, or they interpret existing arguments differently depending on mode.

For example, ms_peptidesummary::getPeptideThreshold() takes an optional rank parameter. You should always specify it, because Mascot and library matches have different score thresholds. If you don't give a rank to the method, it will choose the score threshold based on the rank 1 match.

Detecting and opening spectral library results

To detect whether a results file contains spectral library matches, the easiest test is ms_mascotresfile::anyPeptideSummaryMatches(). If the test returns true for SEC_LIBRARYPEPTIDES, the results file has a non-empty section containing spectral library matches and the file can be opened as a spectral library search.

Three modes are available:

  1. FASTA-only mode (no flag): The backwards-compatible default is to hide spectral library matches.
    If the search contains only spectral library matches, then opening the file in this mode means there are no matches available at all.
  2. SL-only mode (MSPEPSUM_SL_ONLY): Show only spectral library matches.
    If the search contains only FASTA matches, then opening the file in this mode means there are no matches available at all.
  3. Integrated mode (MSPEPSUM_SL_INTEGRATED): The integrated library mode is similar to the integrated error tolerant mode (see Integrated error tolerant search). A query can contain a mixture of up to 20 FASTA and library matches, and protein accessions can come from both the FASTA file and the spectral library (or its reference database -- see Protein inference in spectral library searches).
    If the search contains only FASTA matches or only spectral library matches, the integrated mode is equivalent to the FASTA-only or SL-only mode, respectively.
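
The visibility rules of the three modes can be summarised in a few lines of standalone C++. This is only an illustration of the rules listed above: OpenMode, Match, isVisible() and countVisible() are hypothetical names invented for this sketch and are not part of Parser. In real code the mode is selected with the constructor flags named in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for this sketch; not Parser types.
enum class OpenMode { FastaOnly, SlOnly, Integrated };

struct Match {
    bool fromLibrary;  // true: spectral library match, false: FASTA match
};

// A query holds a mixture of up to 20 FASTA and library matches.
const std::size_t kMaxMatchesPerQuery = 20;

// Returns true if a match of this kind is exposed in the given mode.
bool isVisible(OpenMode mode, const Match& match) {
    switch (mode) {
        case OpenMode::FastaOnly:  return !match.fromLibrary;
        case OpenMode::SlOnly:     return match.fromLibrary;
        case OpenMode::Integrated: return true;
    }
    return false;  // not reached for valid enum values
}

// Counts the matches of a single query that the given mode exposes.
std::size_t countVisible(OpenMode mode, const std::vector<Match>& queryMatches) {
    assert(queryMatches.size() <= kMaxMatchesPerQuery);
    std::size_t visible = 0;
    for (const Match& match : queryMatches) {
        if (isVisible(mode, match)) ++visible;
    }
    return visible;
}
```

For a query with only library matches, countVisible() yields zero in FASTA-only mode and the same count in SL-only and integrated mode, which mirrors the equivalence described for integrated mode.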

The helper function ms_mascotresfile::get_ms_mascotresults_params() does not set MSPEPSUM_SL_ONLY or MSPEPSUM_SL_INTEGRATED automatically. This means code written for Parser 2.5 and earlier always opens spectral library search results in FASTA-only mode.

Parser 2.6 and Mascot 2.6 do not support the following search types in combination with spectral libraries:

(*) Reporter quantitation may work; it depends on spectral library matches having the correct set of modifications (see How modifications are encoded). Other types of quantitation will not work with library matches.

Opening the file in SL-only mode

To open the file in SL-only mode, pass the flag MSPEPSUM_SL_ONLY to the matrix_science::ms_peptidesummary constructor.

The default value for minProbability, 0.05, corresponds to a score threshold of 300 (see Library scores and thresholds). To convert a raw score to a value suitable as minProbability, see ms_peptidesummary::getMinProbabilityForSLScore().

Opening the file in integrated mode

To open the file in integrated mode, pass the flag MSPEPSUM_SL_INTEGRATED to the matrix_science::ms_peptidesummary constructor. Observe the following points:

Library scores and thresholds

In Mascot 2.6, the spectral library search tool is NIST MSPepSearch. Library scores range from 0 to 1000, with 0 meaning the observed spectrum and the library match are entirely dissimilar, and 1000 meaning they are identical. Library scores are therefore on a very different scale from Mascot scores, which have no particular upper limit (although a score of a couple of hundred is extremely unusual). For this reason, FASTA matches and library matches must have different score thresholds.

If a query contains both FASTA and library matches, and you open the file in integrated mode, there are three thresholds of interest:

The library score threshold is global -- it applies to all queries in the search -- whereas Mascot thresholds are specific to a query. However, in order to retain a compatible API, functions that return the identity threshold may return either the Mascot identity or the library score threshold, depending on function arguments. Functions that return the homology threshold will always return zero for spectral library matches.

For example, the following functions take a new argument that determines which threshold is of interest:

The type of the score returned by ms_peptide::getIonsScore() depends on the type of the match. Use ms_peptide::getIsFromLibrary() to find out.
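
The threshold rules above can be sketched as follows. This is a standalone illustration, not Parser code: Thresholds, MascotThresholds and the three functions are hypothetical names, the fromLibrary argument stands in for the result of a call such as ms_peptide::getIsFromLibrary(), and comparing with >= is an assumption of the sketch.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in types for this sketch; not Parser classes.
struct MascotThresholds {
    double identity;
    double homology;
};

struct Thresholds {
    double libraryScoreThreshold;              // global: one value for the whole search
    std::map<int, MascotThresholds> perQuery;  // Mascot thresholds are specific to a query
};

// Identity threshold: the library threshold for library matches,
// otherwise the Mascot identity threshold of the query.
double identityThreshold(const Thresholds& t, int query, bool fromLibrary) {
    if (fromLibrary) return t.libraryScoreThreshold;
    const std::map<int, MascotThresholds>::const_iterator it = t.perQuery.find(query);
    assert(it != t.perQuery.end());
    return it->second.identity;
}

// Homology threshold: always zero for spectral library matches.
double homologyThreshold(const Thresholds& t, int query, bool fromLibrary) {
    if (fromLibrary) return 0.0;
    const std::map<int, MascotThresholds>::const_iterator it = t.perQuery.find(query);
    assert(it != t.perQuery.end());
    return it->second.homology;
}

// Compares a score against the threshold that matches its own scale.
// Using >= here is an assumption of this sketch.
bool meetsIdentityThreshold(const Thresholds& t, int query, bool fromLibrary, double score) {
    return score >= identityThreshold(t, query, fromLibrary);
}
```

The point of the sketch is that a score is only meaningful next to the threshold of the same kind: a value of 60 would be a strong Mascot score but a very weak library score.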

The expect value of a spectral library match is a function of its score and threshold, as usual.

For details on how the library score threshold is derived, see Advanced reading: calculating the spectral library score threshold.

Protein inference in spectral library searches

Protein inference in SL-only or integrated mode uses the same rules as in FASTA-only mode and previous versions of Parser. The main differences are the source of protein data for spectral library matches, and the treatment of "ties" between otherwise equivalent accessions, e.g., when choosing a representative family member or sameset protein.

During a library search, a library match can be mapped to a FASTA accession, library accession or both. FASTA accession in this context means an accession from a FASTA database that is part of the same search. Library accession means an accession from the reference database of the spectral library, or an accession from the MSP file if the sequence was not found in the reference database; see the Mascot help for more information on reference databases.

The search can be opened in integrated or SL-only mode. In SL-only mode, spectral library matches appear only under library accessions. It is as if the FASTA database was not searched at all, so protein inference is the same as in a single database search.

In integrated mode, a spectral library match can be mapped to a FASTA accession or a library accession or both. There are therefore potentially many more sameset and subset proteins. The situation is even more complicated if the spectral library has a reference database that is not part of the search.

Database number and type

Here are some of the possible database, library and reference database combinations:

A. One library (getNumberOfDatabases() == 1)
  1. NIST_S.cerevisiae_IonTrap (SL)
  2. SwissProt (SLREF of 1)
B. One library, one FASTA; FASTA is not reference db (getNumberOfDatabases() == 2)
  1. NIST_S.cerevisiae_IonTrap (SL)
  2. UniProt_Yeast (AA)
  3. SwissProt (SLREF of 1)
C. Two libraries with same reference db (getNumberOfDatabases() == 2)
  1. NIST_S.cerevisiae_IonTrap (SL)
  2. PRIDE_Contaminants (SL)
  3. SwissProt (SLREF of 1 and 2)
D. Two libraries, one FASTA that is a reference db (getNumberOfDatabases() == 3)
  1. NIST_S.cerevisiae_IonTrap (SL)
  2. PRIDE_S.cerevisiae (SL)
  3. SwissProt (AA, also ref of 1)
  4. UniProt_Yeast (SLREF of 2)

The string in parentheses (AA, SL, SLREF) is the database type. The numbers refer to the database number. This is normally in the interval 1..ms_searchparams::getNumberOfDatabases(), except that reference databases have index numbers after this range. It is possible for a FASTA database to be both part of the search and a spectral library's reference database.

Parser assigns reference accessions the database number of the relevant spectral library, so most of the time you do not need to worry about the difference. For example, in case C of two libraries with the same reference database, an accession from SwissProt (e.g., KPYK1_YEAST) can appear under either database number, as "1::KPYK1_YEAST" or "2::KPYK1_YEAST". Matches from the first library searched appear under "1::KPYK1_YEAST", and matches from the second under "2::KPYK1_YEAST". (Protein inference may of course make one a subset of the other.) Database numbers above ms_searchparams::getNumberOfDatabases() do not appear in ms_protein::getDB().
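
If you handle accessions in the prefixed string form shown above, splitting off the database number is straightforward. The following is a minimal sketch that assumes the form is a decimal database number, then "::", then the accession. QualifiedAccession and parseQualifiedAccession() are hypothetical helpers, not Parser functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type for this sketch; not part of Parser.
struct QualifiedAccession {
    int dbNumber;
    std::string accession;
};

// Splits text of the assumed form "<db number>::<accession>".
// Precondition: the text contains "::" preceded by a decimal number.
QualifiedAccession parseQualifiedAccession(const std::string& text) {
    const std::string::size_type separator = text.find("::");
    assert(separator != std::string::npos && separator > 0);
    QualifiedAccession result;
    result.dbNumber = std::stoi(text.substr(0, separator));
    result.accession = text.substr(separator + 2);
    return result;
}
```

Two strings that differ only in the prefix then compare equal on the accession field, which is a quick way to spot the same reference accession reported under two libraries.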

You can access the database type and the mapping between spectral library index and reference database index with ms_mascotresfile::getDatabaseType(), ms_mascotresfile::getReferenceDatabaseNumberOfSL() and ms_mascotresfile::getSLDatabaseNumbersOfReference().
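
To make the numbering concrete, the sketch below models cases C and D from the list above in standalone C++. DatabaseLayout and its functions are hypothetical stand-ins that only imitate the kind of mapping the Parser methods expose. Returning 0 for "no reference database" is a convention of this sketch, not documented Parser behaviour.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical model of one search's databases; not a Parser class.
struct DatabaseLayout {
    int numberOfDatabases;             // the value getNumberOfDatabases() reports
    std::map<int, int> referenceOfSl;  // library number -> reference database number
};

// Reference database number of a library, or 0 if none is recorded (sketch convention).
int referenceDatabaseOf(const DatabaseLayout& layout, int slNumber) {
    const std::map<int, int>::const_iterator it = layout.referenceOfSl.find(slNumber);
    return it == layout.referenceOfSl.end() ? 0 : it->second;
}

// All libraries that share the given reference database, in ascending order.
std::vector<int> librariesOfReference(const DatabaseLayout& layout, int refNumber) {
    std::vector<int> libraries;
    for (const auto& entry : layout.referenceOfSl) {
        if (entry.second == refNumber) libraries.push_back(entry.first);
    }
    return libraries;
}

// True if the number lies after the range 1..numberOfDatabases, i.e. the
// database is a reference database that was not itself searched.
bool isReferenceOnly(const DatabaseLayout& layout, int dbNumber) {
    assert(dbNumber >= 1);
    return dbNumber > layout.numberOfDatabases;
}

// Case C: two libraries (1, 2) sharing reference database 3.
DatabaseLayout makeCaseC() {
    DatabaseLayout layout;
    layout.numberOfDatabases = 2;
    layout.referenceOfSl[1] = 3;
    layout.referenceOfSl[2] = 3;
    return layout;
}

// Case D: libraries 1 and 2, searched FASTA 3 (reference of 1), reference 4 (of 2).
DatabaseLayout makeCaseD() {
    DatabaseLayout layout;
    layout.numberOfDatabases = 3;
    layout.referenceOfSl[1] = 3;
    layout.referenceOfSl[2] = 4;
    return layout;
}
```

In case D, database 3 is inside the searched range even though it is also a reference database, whereas database 4 lies after the range.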

FASTA accessions are preferred over reference and library accessions

Parser prioritises FASTA accessions over reference accessions and reference accessions over MSP accessions in protein inference. For example, if a FASTA accession and a reference accession have the same set of significant sequences, Parser will choose the FASTA accession as the representative protein and relegate the reference accession to sameset status. If a FASTA accession is the superset protein of a library accession (of either kind), the library accession is removed entirely. This prioritisation is necessary to simplify the output of protein inference and remove redundancies.
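
The tie-breaking order can be illustrated with a small standalone sketch. It covers only the choice of a representative among accessions that already share the same set of significant sequences. It does not reproduce Parser's protein inference, and AccessionSource, Candidate and chooseRepresentative() are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types for this sketch; lower enum value means higher priority.
enum class AccessionSource { Fasta = 0, Reference = 1, Msp = 2 };

struct Candidate {
    std::string accession;
    AccessionSource source;
};

// Picks the representative among candidates with the same set of significant
// sequences: FASTA before reference, reference before MSP. Candidates of equal
// priority keep their input order (a convention of this sketch).
const Candidate& chooseRepresentative(const std::vector<Candidate>& tied) {
    assert(!tied.empty());
    return *std::min_element(
        tied.begin(), tied.end(),
        [](const Candidate& a, const Candidate& b) {
            return static_cast<int>(a.source) < static_cast<int>(b.source);
        });
}
```

The remaining candidates in the tied group would then be reported as sameset proteins of the chosen representative.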

Reference database name matters in external lookups

The mass and description of reference accessions are not saved in the results file. This means ms_mascotresults::getProteinDescription() returns the empty string for reference accessions. The description of an MSP accession may or may not be available, depending on what the MSP file contained at the time of the search.

When you need to look up a protein attribute like mass, description, pI or taxonomy information from an external source, you need to know what the correct source database is. If the protein comes from a FASTA file (database type AA or NA), the source database is the FASTA file. But if the protein comes from a spectral library (database type SL), you need to target the query at the reference database. You can find the reference database number with ms_mascotresfile::getReferenceDatabaseNumberOfSL() and the reference database name with ms_searchparams::getDB().
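
The rule for choosing the lookup target reduces to a single branch on the database type. The sketch below is standalone and hypothetical: DbType and externalLookupDatabase() are invented for illustration, and the two names passed in are assumed to have been obtained beforehand, for example through the methods mentioned above.

```cpp
#include <cassert>
#include <string>

// Database types named in the text; the enum itself is a stand-in for this sketch.
enum class DbType { AA, NA, SL };

// Returns the name of the database an external lookup should target.
// ownName:       name of the database the protein's number refers to
// referenceName: name of that library's reference database (used only for SL)
std::string externalLookupDatabase(DbType type,
                                   const std::string& ownName,
                                   const std::string& referenceName) {
    if (type == DbType::SL) {
        assert(!referenceName.empty());  // a library lookup needs a reference database
        return referenceName;
    }
    return ownName;  // AA or NA: the FASTA file itself is the source
}
```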

How modifications are encoded

Modifications are encoded differently between FASTA matches and spectral library matches. In a FASTA-only search, you can specify both fixed and variable modifications, as well as have query-level modifications and modifications as part of a quantitation method. In an SL-only or integrated library search, none of these modifications are used for spectral library matches.

Spectral library matches can contain their own modifications, which you can access with ms_mascotresults::getLibraryModString(). The modifications originate from the spectral library entry. The results file contains a list of all possible library modifications in the spectral libraries searched; see ms_searchparams::getLibraryModName(). It is possible for fixed and variable modifications to have the same name as a library modification, and vice versa. Whether they are the same modification depends entirely on how the spectral library was created.

Other attributes specific to spectral libraries

The results file contains a back reference for each spectral library match that allows you to look up the corresponding library entry from the spectral library file. See ms_peptidesummary::getLibraryEntryId() for more details.

Spectral library API summary

For more details on spectral library specific behaviour, see the following classes and functions:

ms_searchparams

ms_mascotresfile

ms_mascotresults

ms_peptidesummary

ms_peptide

ms_libraryoptions

ms_spectral_lib_file

ms_spectral_lib_entry

ms_spectral_lib_peak


Copyright © 2022 Matrix Science Ltd.  All Rights Reserved. Generated on Thu Mar 31 2022 01:12:30