Posted by Ville Koskinen (August 12, 2017)

Exporting spectral library search results

Mascot 2.6 integrates spectral library searching. Today we’ll describe how these searches can be exported. Please ensure you’ve installed the Mascot 2.6.1 patch, as support for exporting library search data was not complete in the initial Mascot 2.6.0 release.

Library searches can be either library-only or integrated searches. Integrated means the search is against both a spectral library and a FASTA database. The default is to export both FASTA and library matches as an integrated report, but you also have the option of exporting only the FASTA matches or only the library matches.

Mascot can export spectral library searches in four file formats: CSV, XML, mzIdentML and mzTab. Additionally, you can download the results file (.dat) or just the peak lists (MGF). You can try the export options right now by viewing the example library-only search against NIST_S.cerevesiae_IonTrap or the example integrated search against NIST_S.cerevesiae_IonTrap and SwissProt and clicking on the Export button.

CSV and XML

CSV and XML are our custom file formats, described on the export help page. The file structure is almost exactly the same for FASTA search results and library search results. In fact, the only new addition is the peptide “source” column or field. Its value is AA if the peptide match comes from a protein sequence database, NA if it comes from a nucleic acid database, or SL if it comes from a spectral library. You may also see XA if the peptide match was found in both AA and NA databases.
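As a rough illustration of how the source field might be used downstream, here is a minimal Python sketch that tallies peptide matches by source code. It assumes the peptide table has already been isolated from the rest of the CSV export, and the column name pep_source and the sample rows are hypothetical, chosen only for the example; check the header of your own export for the actual name.

```python
import csv
import io
from collections import Counter

# Meaning of each source code, as described in the text above.
SOURCE_LABELS = {
    "AA": "protein sequence database",
    "NA": "nucleic acid database",
    "SL": "spectral library",
    "XA": "both protein and nucleic acid databases",
}


def count_sources(peptide_table_csv, source_column="pep_source"):
    """Count peptide matches per source code.

    `source_column` is an assumed column name, not taken from the export
    documentation; pass the real name from your file's header row.
    """
    reader = csv.DictReader(io.StringIO(peptide_table_csv))
    return Counter(row[source_column] for row in reader)


# Made-up rows for illustration only.
SAMPLE = """pep_seq,pep_source
YRPNCPIILVTR,SL
AAAAK,AA
CCCCK,SL
"""

counts = count_sources(SAMPLE)
for code, n in sorted(counts.items()):
    print(code, SOURCE_LABELS.get(code, "unknown source"), n)
```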

There are only a couple more details to be aware of: library modifications and the library score threshold.

Library modifications – which are part of the library annotations – are not listed under variable modifications at the top of the file, because they are not part of the search parameters or the search space. The MSPepSearch algorithm ignores the peptide sequence and modifications when it compares spectra. What you end up with in the Mascot results file is simply what appears in the annotation string of the matched library entry.

The modification name is given in the usual column or field under the peptide match, while the letter used in the varmods position string is Y. For example, a library match might have these fields in the XML export:

<pep_res_before>K</pep_res_before>
<pep_seq>YRPNCPIILVTR</pep_seq>
<pep_res_after>C</pep_res_after>
<pep_var_mod>Carbamidomethyl</pep_var_mod>
<pep_var_mod_pos>0.0000Y0000000.0</pep_var_mod_pos>
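To make the position string concrete, the following Python sketch decodes it against the peptide sequence. It assumes the usual layout of three dot-separated parts (N-terminus, one character per residue, C-terminus) with 0 meaning unmodified; the function name decode_var_mod_pos is ours and is not part of any Mascot API.

```python
def decode_var_mod_pos(pep_seq, pos_string):
    """Return (site, position, code) tuples for every modified site.

    Assumes the layout "<N-term>.<one char per residue>.<C-term>",
    where "0" means unmodified. Residue positions are 1-based.
    """
    n_term, residues, c_term = pos_string.split(".")
    if len(residues) != len(pep_seq):
        raise ValueError("position string does not match sequence length")

    sites = []
    if n_term != "0":
        sites.append(("N-term", 0, n_term))
    for position, (residue, code) in enumerate(zip(pep_seq, residues), start=1):
        if code != "0":
            sites.append((residue, position, code))
    if c_term != "0":
        sites.append(("C-term", len(pep_seq) + 1, c_term))
    return sites


# The library match from the XML fragment above.
print(decode_var_mod_pos("YRPNCPIILVTR", "0.0000Y0000000.0"))
# -> [('C', 5, 'Y')]
```

Here the letter Y marks the cysteine at position 5, which is the residue carrying the Carbamidomethyl modification named in pep_var_mod.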

Lastly, library scores are on a very different scale to Mascot scores. In a library-only export, there is no significance threshold; instead, there’s a library score threshold. The XML element sigthreshold now has an upper limit of 1000 instead of 1.0. In an integrated export, the significance threshold has the usual range between 0 and 1.

mzTab

mzTab is a table-based format, similar to CSV but with more structure and embedded metadata. Have a look at a previous blog article on mzTab for more background information. mzTab is developed by the Proteomics Standards Initiative.

Like CSV, the mzTab output is not all that different for library searches. The mzTab 1.0 standard requires that all modifications be listed in the metadata section as either fixed or variable. Although library modifications are neither, Mascot conforms to the standard and lists library modifications as “variable”. You’ll need to be aware of this in downstream analysis. Note that the search parameters in the metadata section do not include library modifications, so it is still easy to determine what was searched for in the FASTA database.

There are no changes in the protein table. In the peptide match table (PSH/PSM), there is one new column for MSPepSearch score. If the match comes from FASTA, the value in this column is null. If it comes from the library, the value is the library score, and correspondingly the Mascot score column is null. Because Mascot performs post-processing of library matches (e.g. protein inference, expect value calculation), we decided it’s best to keep the search engine column as Mascot in both cases. The best way to determine what is a library match is to look at the MSPepSearch score column.
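A minimal Python sketch of that check follows. It assumes the PSM table is tab-separated with a PSH header row, as in mzTab 1.0, and that you already know which search_engine_score[n] column holds the MSPepSearch score. In this example that is search_engine_score[2], which is purely an assumption; in a real file, the psm_search_engine_score[n] lines in the metadata section tell you which index is which. The sample rows are made up.

```python
def split_psms(mztab_text, library_score_column):
    """Split PSM rows into (library_matches, fasta_matches).

    A row counts as a library match when its value in
    `library_score_column` is anything other than "null".
    """
    header = None
    library, fasta = [], []
    for line in mztab_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "PSH":
            header = fields
        elif fields[0] == "PSM" and header is not None:
            row = dict(zip(header, fields))
            if row.get(library_score_column, "null") != "null":
                library.append(row)
            else:
                fasta.append(row)
    return library, fasta


# Illustrative rows only; a real export has many more columns.
SAMPLE_MZTAB = "\n".join([
    "\t".join(["PSH", "sequence", "PSM_ID",
               "search_engine_score[1]", "search_engine_score[2]"]),
    "\t".join(["PSM", "YRPNCPIILVTR", "1", "null", "850"]),
    "\t".join(["PSM", "ELVISLIVESK", "2", "42.1", "null"]),
])

library, fasta = split_psms(SAMPLE_MZTAB, "search_engine_score[2]")
print(len(library), "library match(es),", len(fasta), "FASTA match(es)")
```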

mzIdentML

mzIdentML is an XML-based format; see the previous blog article on mzIdentML for a general description of the file structure. Like mzTab, it is developed by the Proteomics Standards Initiative.

The changes in mzIdentML are somewhat larger than in the other formats, as it has a lot more structure. However, mzIdentML 1.1 has no formal support for encoding spectral library searches, much less support for an integrated FASTA and library search. It does support encoding results from multiple search engines, but there are not many applications that can read such files. We’ve tried to make as few changes as possible to stay compatible with existing applications but still allow integrated search results to be exported.

First of all, MSPepSearch and Mascot are listed as separate entities in the analysis software section, with Mascot following MSPepSearch. This means Mascot is treated as a postprocessing step for library matches in addition to producing its own peptide matches. In the search database section, the spectral library or libraries are tagged using the correct file format with new CV (controlled vocabulary) terms. If the search is integrated, there is another CV term to signal this.

Protein data is unchanged. Peptide data for FASTA matches has exactly the same form as before. For library matches, the main differences are that they have a CV term for MSPepSearch score, and that the value encoded as Mascot identity threshold is actually the library threshold. One way to discriminate between FASTA and library matches is to see which type of score each match has. Arguably there should be a new CV term for MSPepSearch threshold, but as it and the expect value are calculated by Mascot, it made sense to keep the existing terms. Finally, library modifications are treated in the same way as in mzTab: they are encoded as variable modifications.
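The score-based discrimination can be sketched in Python with the standard library XML parser. This is only an outline under stated assumptions: the CV term name MSPepSearch:score is our guess at how the term appears in the file, the sample fragment is a stripped-down illustration rather than a valid mzIdentML document (accessions and most required attributes are omitted), and the function name classify_items is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def classify_items(mzid_xml, library_score_name="MSPepSearch:score"):
    """Map each SpectrumIdentificationItem id to 'library' or 'FASTA'.

    `library_score_name` is an assumed cvParam name; check the cvParam
    elements in your own export and adjust it if needed.
    """
    root = ET.fromstring(mzid_xml)
    result = {}
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIdentificationItem":
            continue
        score_names = {
            child.get("name")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        is_library = library_score_name in score_names
        result[elem.get("id")] = "library" if is_library else "FASTA"
    return result


# Illustrative fragment only, not a complete or valid mzIdentML file.
SAMPLE_MZID = """
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationItem id="SII_1_1">
    <cvParam name="MSPepSearch:score" value="850"/>
  </SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_2_1">
    <cvParam name="Mascot:score" value="42.1"/>
  </SpectrumIdentificationItem>
</MzIdentML>
"""

print(classify_items(SAMPLE_MZID))
```

Matching on the local tag name keeps the sketch independent of the exact namespace URI declared in the file.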

For more on exporting search results, see the February 2017 tips and tricks article and the earlier mzIdentML and mzTab blog posts. As always, we welcome feedback on data format compatibility with third-party software.
