Posted by Ville Koskinen (August 15, 2015)

PSI file formats, part 3: repositories

We’ve talked about mzIdentML validity only in terms of file structure. Proteomics repositories, such as PRIDE or ProteoRed, of course require files to be valid in that sense, but they impose additional requirements. If you need to upload your search results to a repository, it is worth looking at this more extended idea of validity. For simplicity, I’ll only consider complete submissions to PRIDE, which consist of the search results in mzIdentML format, the peak lists (often MGF or mzML) and the raw data files.

MIAPE compliance

First, the additional requirement on mzIdentML is MIAPE compliance. In particular, when it comes to Mascot results, we need to look at MIAPE-MSI. The default export options are almost enough to comply with the guidelines. You only need to ensure that under optional protein hit information, both Description and Length in residues are ticked. The latter causes not only protein sequence length to be included in the mzIdentML file, but also protein sequence coverage. It is also highly recommended to run your searches as target-decoy searches, because the false discovery rate (FDR) of the results will be included in the exported file. Otherwise you may need to provide some other form of statistical analysis on false positive rates. The full MIAPE guidelines require additional information that Mascot cannot export in the mzIdentML file. These are details like the contact details and affiliation of the researcher(s) performing the experiment, and sample and data processing protocol. In the case of PRIDE submission, you can fill these in using the ProteomeXchange submission tool.
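
Before uploading, it can be handy to confirm that the export really did include protein descriptions and lengths. The following is only a rough sketch, not part of Mascot or the PRIDE tooling: it assumes that protein length appears as the length attribute of each DBSequence element and that the description is a cvParam with accession MS:1001088 ("protein description"). The function name is made up for illustration, and the result is no substitute for running the validator.

```python
import xml.etree.ElementTree as ET

# Assumption: PSI-MS accession for the "protein description" term.
PROTEIN_DESCRIPTION = "MS:1001088"


def local_name(tag):
    """Strip the XML namespace so the check works across mzIdentML versions."""
    return tag.rsplit("}", 1)[-1]


def proteins_missing_miape_fields(mzid_source):
    """Return {DBSequence id: [missing items]} for a file path or binary file object."""
    problems = {}
    for _, elem in ET.iterparse(mzid_source, events=("end",)):
        if local_name(elem.tag) != "DBSequence":
            continue
        missing = []
        if not elem.get("length"):
            missing.append("length")
        accessions = {
            child.get("accession")
            for child in elem
            if local_name(child.tag) == "cvParam"
        }
        if PROTEIN_DESCRIPTION not in accessions:
            missing.append("description")
        if missing:
            problems[elem.get("id")] = missing
        elem.clear()  # free memory; the attributes have already been read
    return problems
```

Calling proteins_missing_miape_fields("results.mzid") (the file name is a placeholder) returns an empty dictionary if every protein entry carries both pieces of information.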

mzIdentMLValidator supports validating the file against MIAPE-MSI as long as you choose MIAPE-compliant validation from the drop-down menu. Validation is based on a special CV mapping file, as mentioned above. The mapping is updated in each version of the validator. Version 1.3.3 is now a bit old (a new version is apparently in beta testing) and does not seem to take the CV version into account, so it may report errors for files that are in fact compliant.

Original peak lists

Second, there is an important point regarding the original peak list file: always store a copy of it in the same place as the mzIdentML file! The mzIdentML file contains references to spectra in the peak list file, and if the original is lost, it might not be possible to recreate it exactly, even if you still have the raw files. In the case of MGF, although it is possible to generate an MGF file from the Mascot search results, such an MGF file is unlikely to be identical to the original. This is because Mascot sorts input spectra by Mr and may split spectra with ambiguous charge states into separate queries. The exported MGF file will yield the same search results, but its spectra are in a different order from the original MGF file. If you try to use it with the previously exported mzIdentML file, the links between peptide matches and spectra will be correct only by accident. In a situation like this, the best thing to do is rerun the search using the exported MGF file, export a new mzIdentML file from that search, and delete the old one.
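
If you are unsure whether an MGF file on disk is the original or one regenerated from the search results, comparing the order of spectra is a quick sanity check. The sketch below rests on two assumptions: that the mzIdentML file refers to MGF spectra by their position in the file, and that both files carry the same TITLE lines. The function names are made up for illustration.

```python
def mgf_titles(lines):
    """Return the TITLE of each spectrum in file order ('' if a spectrum has none)."""
    titles = []
    current = None  # None means we are outside a BEGIN IONS / END IONS block
    for line in lines:
        line = line.strip()
        if line == "BEGIN IONS":
            current = ""
        elif line == "END IONS" and current is not None:
            titles.append(current)
            current = None
        elif current is not None and line.startswith("TITLE="):
            current = line[len("TITLE="):]
    return titles


def first_order_mismatch(original_lines, other_lines):
    """Return the first 0-based spectrum position where the files differ, or None."""
    a = mgf_titles(original_lines)
    b = mgf_titles(other_lines)
    for position, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return position
    if len(a) != len(b):
        # One file has extra spectra, e.g. after charge-state splitting.
        return min(len(a), len(b))
    return None
```

Both functions accept any iterable of lines, so open file handles can be passed directly, for example first_order_mismatch(open("original.mgf"), open("exported.mgf")) with placeholder file names. A result of None only says that the titles line up; it does not prove the peak lists are identical.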

Only Unimod modifications

A final point about PRIDE is that we have had some reports from users who are unable to submit search results in mzIdentML format, because the search includes modifications that are not in the public Unimod database. These are modifications that have been added to the local Mascot Server or defined in a quantitation method, such as Label:13C(6)+Succinyl. They are encoded in mzIdentML with the Unimod accession specified as "unknown", which PRIDE does not accept. In the short term, you may have to rerun your search without the offending modifications, if that’s possible, or petition EBI to allow submissions of this kind. In the long term, it would be good to add a mechanism to mzIdentML that allows specifying combinations of modifications or defining nonstandard modifications within the file, as there are a number of situations where such flexibility is needed (for example, when developing new analysis methods). Such change proposals need to go through the standard PSI processes – but this is again one of the benefits of an open standard.
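
To find out ahead of time whether a file is likely to hit this problem, you can scan it for modifications that carry no Unimod accession. This is a minimal sketch under stated assumptions: it treats any Modification or SearchModification element without a cvParam whose accession starts with "UNIMOD:" as suspect, and it makes no claim about the exact checks PRIDE performs. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the check works across mzIdentML versions."""
    return tag.rsplit("}", 1)[-1]


def modifications_without_unimod(mzid_source):
    """Return (element name, cvParam names) for each modification lacking a UNIMOD accession."""
    flagged = []
    for _, elem in ET.iterparse(mzid_source, events=("end",)):
        if local_name(elem.tag) not in ("Modification", "SearchModification"):
            continue
        # Only direct children: nested elements such as specificity rules are ignored.
        params = [c for c in elem if local_name(c.tag) == "cvParam"]
        has_unimod = any(
            (p.get("accession") or "").startswith("UNIMOD:") for p in params
        )
        if not has_unimod:
            names = ", ".join(p.get("name", "?") for p in params) or "(no cvParam)"
            flagged.append((local_name(elem.tag), names))
    return flagged
```

An empty list from modifications_without_unimod("results.mzid") (placeholder file name) suggests that every modification maps to a public Unimod entry; anything else is worth checking before you start the upload.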

And that is about all I can say on the topic of mzIdentML. In the next and last part, we’ll do a short tour of the other PSI file format for search results, mzTab.
