Posted by John Cottrell (December 13, 2016)

Protein inference for spectral library searches

The major new feature of Mascot Server 2.6, now running on this web site, is that searches of spectral libraries have been fully integrated with ‘conventional’ Mascot searches of Fasta files.

The search engine for spectral library searches is MSPepSearch from Steve Stein and colleagues at NIST. We didn’t have any revolutionary ideas for improving spectral library scoring so, rather than re-invent the wheel, we decided to adopt a tried and tested program. The attractions of MSPepSearch are that it is fast, has a comprehensive command line interface, and can search very large libraries.

When you submit a search through the Mascot search form, you can choose any combination of amino acid Fasta files, nucleic acid Fasta files, and spectral library files. On completion of the search, the results can be viewed and interrogated using the unique features of the protein family summary.

Protein inference for library search results presents a challenge. At risk of stating the obvious, the entries in a library are peptides, not proteins, which means that protein level information is only present as annotations. Such annotations are optional, and may be missing entirely, as in the case of most PRIDE libraries. Even if present, the reliability is unknown and annotations rarely extend to more than a single accession per library entry, which means that protein inference will be inaccurate for shared peptides.

We believe that accurate protein inference is just as important as accurate peptide identification, so we decided to require a reference Fasta database to be specified for each library file when it is added to the system. The default is SwissProt, with an appropriate taxonomy filter, but any online Fasta can be chosen. This allows Mascot to map most of the library peptides to accessions in the reference database. This mapping is done at the sequence level, with no constraints from enzyme specificity. If a library entry has a novel sequence, not found in the reference database, the accession in the library annotations is used. If there is no accession, the peptide sequence is treated as the accession, so that duplicate matches to the same peptide can be grouped, if nothing else.
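To make the fallback order concrete, here is a minimal Python sketch of the idea. It is not Mascot's actual code: the function name `map_peptide`, the dictionary layout, and the toy sequences are all assumptions made for illustration. It only shows the three steps described above: a plain sequence match with no enzyme constraint, then the library annotation, then the peptide sequence itself.

```python
from typing import Dict, List, Optional


def map_peptide(
    peptide: str,
    reference: Dict[str, str],
    annotated_accession: Optional[str] = None,
) -> List[str]:
    """Return the accession(s) to report for one library peptide.

    `reference` maps accession -> protein sequence (hypothetical layout).
    """
    # Step 1: plain substring match against every reference protein.
    # No check is made on the flanking residues, so enzyme specificity
    # plays no part in the mapping.
    hits = sorted(acc for acc, seq in reference.items() if peptide in seq)
    if hits:
        return hits
    # Step 2: novel sequence, so fall back to the library annotation.
    if annotated_accession:
        return [annotated_accession]
    # Step 3: nothing else available, so the peptide sequence stands in
    # as the accession and duplicate matches can still be grouped.
    return [peptide]


# Toy reference database with made-up sequences.
toy_reference = {
    "PROT_A": "MKTAYIAKQRQISFVK",
    "PROT_B": "GGTAYIAKQLLE",
}

shared_hit = map_peptide("TAYIAKQ", toy_reference)
annotated_hit = map_peptide("WWWW", toy_reference, "sp|X00000|TOY_YEAST")
bare_hit = map_peptide("WWWW", toy_reference)
```

A real implementation would need an index rather than a linear scan, since a reference Fasta can hold a very large number of sequences, but the decision order would be the same.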

This example shows the results of a search of a CPTAC file against a single spectral library, NIST_S.cerevesiae_IonTrap. The reference database was SwissProt with a taxonomy filter of Saccharomyces cerevisiae. The accessions in the library annotations also come from SwissProt, but are formatted differently. If you expand family 3, you’ll see that enolase 1 and 2 have many shared peptides. Click on one of them, e.g. query 261, to load a Peptide View report. Towards the bottom, you’ll see the original library annotations. These give the protein accession as sp|P00925|ENO2_YEAST, which is correct, but only tells half the story. To make an informed decision about whether the results include ENO1_YEAST, ENO2_YEAST, or both, you really need to see which matches are shared and which are unique, and this cannot be inferred from a single accession per entry.
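The shared versus unique distinction can be sketched in a few lines of Python. This is an illustration only, not how the protein family summary is computed: the function name `classify_matches` is hypothetical, and the peptide strings are placeholders rather than real enolase sequences. The point is simply that the split requires every accession a peptide maps to, not just one.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_matches(
    peptide_to_accessions: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Split peptide matches into unique and shared evidence per protein."""
    unique = defaultdict(list)
    shared = defaultdict(list)
    for peptide, accessions in peptide_to_accessions.items():
        # A peptide is unique evidence only if it maps to one protein.
        target = unique if len(accessions) == 1 else shared
        for accession in accessions:
            target[accession].append(peptide)
    return dict(unique), dict(shared)


# Placeholder peptides; the accessions are the two enolases from the text.
toy_matches = {
    "PEPTIDEONE": ["ENO1_YEAST", "ENO2_YEAST"],
    "PEPTIDETWO": ["ENO2_YEAST"],
    "PEPTIDETHREE": ["ENO1_YEAST", "ENO2_YEAST"],
}

unique_evidence, shared_evidence = classify_matches(toy_matches)
```

In this toy input, ENO2_YEAST has one unique peptide and ENO1_YEAST has none, so there would be no independent evidence for ENO1_YEAST. With only a single accession stored per library entry, that conclusion could not be drawn.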

Incidentally, expand the ‘Search parameters’ and ‘Modification statistics’ sections at the top of this report. Even though the search didn’t specify any modifications, you’ll see significant matches to a good number of modified peptides – one of the advantages that libraries have over Fasta files.

If we select a Fasta file and a library file in the same search, the accessions in the Fasta also come into play. As far as possible, library matches are mapped to accessions in the Fasta file(s) being searched, in addition to the reference file selected when the library was configured. Where there are same-set proteins, the Fasta being searched will take precedence. This example shows a search of NIST_S.cerevesiae_IonTrap plus NCBIprot with a taxonomy filter of Saccharomyces cerevisiae. It is the same peak list as the earlier example and, if you compare the result reports side by side, you can see from the descriptions that the most abundant proteins are much the same, even though the report is now using accessions from NCBIprot. If you switch to the Report builder tab, you won’t see any SwissProt IDs in the list, even though this is the reference database for the library. This is because SwissProt is a subset of NCBIprot, so there is never a case where a SwissProt accession has to be used because there is no equivalent in NCBIprot.
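As a rough sketch of the precedence rule for same-set proteins, assuming each candidate is tagged with the database it came from: the function name `pick_reported_accession`, the tuple layout, and the accession labels below are all hypothetical, and the real selection logic in Mascot may take other factors into account.

```python
from typing import List, Set, Tuple


def pick_reported_accession(
    same_set: List[Tuple[str, str]],
    searched_databases: Set[str],
) -> str:
    """Choose which accession to report for a group of same-set proteins.

    `same_set` holds (accession, source_database) pairs for proteins
    matched by an identical set of peptides.
    """
    # Prefer an accession from a Fasta that was actually searched.
    for accession, source in same_set:
        if source in searched_databases:
            return accession
    # Otherwise fall back to the first candidate, e.g. one from the
    # library's reference database.
    return same_set[0][0]


# Placeholder accessions for the same protein in two databases.
candidates = [("SP_ACCESSION", "SwissProt"), ("NCBI_ACCESSION", "NCBIprot")]

with_fasta = pick_reported_accession(candidates, {"NCBIprot"})
library_only = pick_reported_accession(candidates, set())
```

When NCBIprot is searched alongside the library, the NCBIprot accession is reported; in a library-only search, the reference database accession is all there is.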

Expand family 5 (EF3A_YEAST) in the first report and compare with family 6 in the second (EDV08578.1). These are both Elongation factor 3A and there are very few differences at the PSM level. In some cases, the library gets the stronger match and, in others, the Fasta. Very few matches are unique to one database apart from non-specific and modified peptides that were not considered in the Fasta search. In this particular family, queries 4115 and 4550 are non-specific and two queries get matches to peptides modified by Gln->pyro-Glu.

The sharp-eyed will note that the expect values for the library matches are slightly different between these two searches, even though the scores are the same. This is because the expect values are calculated differently, as explained on the help page. We’ll go into this in a bit more detail in a future blog article. If you want to see the Fasta matches alone, or the library matches alone, there is a format control for this: Report mode.

As with any new development, it is unlikely we have got everything right first time. If you spot any bugs or have comments or suggestions, please let us know, either below or via email to support@matrixscience.com.
