Posted by Ville Koskinen (May 14, 2015)

PSI file formats, part 1: mzIdentML

Mascot search results are saved in a .dat format, which contains all protein and peptide identifications output by Mascot, as well as auxiliary information (search parameters, mass definitions for modifications, etc.). The .dat file is almost never the final step in data analysis; for the next step, you need to import the search results into some other piece of software. Although some software can read the .dat file directly, for example by using Mascot Parser, Mascot also allows exporting the .dat file in a number of different formats. Two of the formats in Mascot 2.5, mzIdentML and mzTab, are standardised file formats developed by the Proteomics Standards Initiative (PSI).

In part 1 of this series of articles, we’ll take a look at the mzIdentML format and what kinds of tools and software can process it. Part 2 discusses mzIdentML file validation, as well as issues that can arise when importing mzIdentML or submitting to proteomics repositories. Finally, part 3 gives an overview of the new mzTab format and how it differs from mzIdentML.

Both mzIdentML and mzTab are file formats for protein and peptide identifications (mzTab also supports quantitation data). PSI has also developed a format for raw data and peak lists, called mzML, and a companion format to mzIdentML for quantitation, called mzQuantML, but these formats are outside the scope of our present little series.

Structure of mzIdentML files

The mzIdentML 1.0 specification was published in August 2009. The current recommended version, 1.1, was published in August 2011. Version 1.0 has largely fallen out of use in favour of 1.1, as the latter fixes a number of small problems and inconsistencies. The next version, 1.2, is being developed and contains further small improvements to encoding protein groupings and combining search results from multiple search engines.

The mzIdentML format is XML-based, meaning the files are XML files but with additional structure. XML describes a tree-like or hierarchical data structure, where data is encoded in nodes and nodes have parent-child or sibling relationships with other nodes. The mzIdentML document structure is defined in an XML Schema Definition (XSD), which is a formal specification for the allowed node types and node contents as well as for the hierarchical relationships. Additionally, mzIdentML has an associated metadata layer called the PSI-MS Controlled Vocabulary (CV). The CV defines the allowed names, types and range of values of node attributes. For example, the search engine that originated the data can be encoded with the CV term MS:1001207 Mascot (no explicit value specified), while the score of a peptide match could be MS:1001171 Mascot:score with value 32.1, say.
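To make the relationship between the XML tree and the CV concrete, here is a minimal Python sketch that pulls every CV term out of a small fragment. The fragment is hand-written for illustration and heavily trimmed, so it is not schema-valid; the ids and the helper name cv_params are assumptions, while the two CV terms are the ones mentioned above.

```python
import xml.etree.ElementTree as ET

# XML namespace of mzIdentML 1.1 documents.
MZID = "{http://psidev.info/psi/pi/mzIdentML/1.1}"

# Hypothetical, trimmed fragment: a real file has many more required elements.
FRAGMENT = """\
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" version="1.1.0">
  <AnalysisSoftwareList>
    <AnalysisSoftware id="AS_mascot_server">
      <SoftwareName>
        <cvParam cvRef="PSI-MS" accession="MS:1001207" name="Mascot"/>
      </SoftwareName>
    </AnalysisSoftware>
  </AnalysisSoftwareList>
  <DataCollection>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1">
        <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0">
          <SpectrumIdentificationItem id="SII_1_1" rank="1" chargeState="2">
            <cvParam cvRef="PSI-MS" accession="MS:1001171"
                     name="Mascot:score" value="32.1"/>
          </SpectrumIdentificationItem>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>
"""


def cv_params(xml_text):
    """Return (accession, name, value) for every cvParam, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (param.get("accession"), param.get("name"), param.get("value"))
        for param in root.iter(MZID + "cvParam")
    ]


for accession, name, value in cv_params(FRAGMENT):
    # A term without an explicit value, such as the search engine name,
    # comes back with value None.
    print(accession, name, value)
```

The position of each cvParam in the tree is governed by the XSD, whereas the accession, name and permitted value come from the CV.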

mzIdentML, and XML in general, is intended for machine processing, although it is possible to manually read and edit the files with a decent text editor. A good introduction to the file structure as well as some relatively simple example files are available on the PSI website.

Most of the time you need not worry about the XSD or the CV, or their precise relationship to the data stored in the mzIdentML file. However, if the file does not conform to the schema (it is not valid) or contains other errors, the validator tools report the errors in terms of the XSD or the CV, at which point it is useful to know what the terms mean. This is a topic for part 2.

Search engine support for mzIdentML

Establishing new file formats is often a chicken-and-egg problem: software vendors are reluctant to add support for reading the format until software exists for writing the files, and until users start asking for the support. Users won’t start asking until there is software support for both reading and writing the files, and writing support might not be added if the format is complicated (a lot of work) and there’s no software yet that reads the files!

Even as of April 2015, not many search engines have native support for exporting mzIdentML 1.1 files. As far as I can tell, there are only five: Mascot, MS-GF+, MyriMatch (since 2.1), Crux and PEAKS. In addition to MyriMatch, the Tabb lab software applications TagRecon and Pepitome also export mzIdentML. There are also a number of programs for converting files to mzIdentML: ProCon handles Sequest, Proteome Discoverer and ProteinScape output; ProteoWizard’s idconvert does pepXML (and is bundled with recent versions of the Trans-Proteomic Pipeline); and the mzidLibrary contains utilities for converting OMSSA and X!Tandem results to mzIdentML. So, although native support is not as widespread as you might expect, it is possible to export or convert output from most search engines to mzIdentML.

Exporting mzIdentML from Mascot

mzIdentML 1.0 export support was added in Mascot 2.3. Mascot 2.4.1 and newer (including Mascot 2.5) export mzIdentML 1.1, so if you have Mascot 2.4, this is a good reason to update to the latest patch release. To export search results, as described in the help page, simply click the Export button (in Protein Family Summary) or choose Export Search Results in the format controls (in Peptide Summary), and then choose mzIdentML as the export format. The checkboxes for optional protein and peptide information control how much additional information about protein hits and input queries is included. It is normally fine to leave the controls at their default settings, unless you require a specific item of information in the exported file. We’ll return to this topic in part 2.

What can you do with mzIdentML files?

Once you have exported search results as mzIdentML, what can you do with them? If you are a software developer, you’re in luck: there are at least jMzIdentML (Java, mzIdentML 1.1), mzID (R, 1.0 and 1.1), and OpenMS (C++, 1.1), all of which seem to provide comprehensive support for all the features of the format. The mzidLibrary package for Java probably contains an API for reading and writing mzIdentML files as well, although it is not obvious from its documentation how to access the API.
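For a quick look at a file without any of these libraries, the standard XML parser of a scripting language is often enough. The following Python sketch is not based on any of the packages above. It assumes an mzIdentML 1.1 file with Mascot scores and lists one record per peptide-spectrum match, resolving each peptide_ref to its sequence. The sample document, its ids and peptide sequence, and the function name read_psms are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

# XML namespace of mzIdentML 1.1 documents.
MZID = "{http://psidev.info/psi/pi/mzIdentML/1.1}"

# Hypothetical, trimmed sample; not schema-valid, for illustration only.
SAMPLE = """\
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" version="1.1.0">
  <SequenceCollection>
    <Peptide id="peptide_1_1">
      <PeptideSequence>LVNELTEFAK</PeptideSequence>
    </Peptide>
  </SequenceCollection>
  <DataCollection>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1">
        <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0">
          <SpectrumIdentificationItem id="SII_1_1" rank="1" chargeState="2"
                                      peptide_ref="peptide_1_1">
            <cvParam cvRef="PSI-MS" accession="MS:1001171"
                     name="Mascot:score" value="32.1"/>
          </SpectrumIdentificationItem>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>
"""


def read_psms(source):
    """Return one dict per SpectrumIdentificationItem.

    `source` is a file name or an open file object.
    """
    root = ET.parse(source).getroot()
    # Peptides are stored once and referenced by id from each match.
    sequences = {
        peptide.get("id"): peptide.findtext(MZID + "PeptideSequence")
        for peptide in root.iter(MZID + "Peptide")
    }
    psms = []
    for result in root.iter(MZID + "SpectrumIdentificationResult"):
        for item in result.findall(MZID + "SpectrumIdentificationItem"):
            score = None
            for param in item.findall(MZID + "cvParam"):
                if param.get("accession") == "MS:1001171":  # Mascot:score
                    score = float(param.get("value"))
            psms.append({
                "spectrum_id": result.get("spectrumID"),
                "sequence": sequences.get(item.get("peptide_ref")),
                "rank": int(item.get("rank")),
                "charge": int(item.get("chargeState")),
                "mascot_score": score,
            })
    return psms


for psm in read_psms(io.StringIO(SAMPLE)):
    print(psm)
```

This loads the whole document into memory, which is fine for small exports; for large files a dedicated library or a streaming parser is the better choice.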

If you are not a software developer and want to actually analyse the mzIdentML file contents, until a couple of years ago there was not much software support. The situation seems to be improving slowly but steadily. The following is a fairly comprehensive list of programs able to process mzIdentML files.

For general data processing, visualisation and interpretation, there are Scaffold (by Proteome Software), PeptideShaker and Mascot Insight. Scaffold 4 can both read and write mzIdentML files and perform a number of processing tasks on them. PeptideShaker accepts mzIdentML input files as long as the corresponding MGF file is available. Mascot Insight can import any mzIdentML file, whether from Mascot or some other source, after which it behaves like any other data set; you can view and navigate the identified proteins and peptides, generate reports (protein–protein interactions, Venn diagrams, any number of plots), etc. Insight can also import the Scaffold-generated combination of mzIdentML and sqml files for quantitative data sets, as mzIdentML on its own does not support quantitation.

For simple visualisation, ProteoIDViewer is a nifty little application. It offers a read-only view of the mzIdentML file, including both protein- and peptide-centric views and optionally displaying fragment spectra (if the corresponding MGF or mzML file is loaded separately). If you tick the Matched Fragment Ions box in the Mascot export options, the display will contain the labelled, matched peaks. The application also has a tab for basic peptide–spectrum match statistics and, in decoy search results, FDR calculation. However, if you export a Mascot decoy search and load it in ProteoIDViewer, you will not see any decoy matches. This is because Mascot exports only matches to the target database, while ProteoIDViewer expects the file to contain both target and decoy matches.
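To see why a viewer needs the decoy matches in the file, consider one common way of estimating FDR at a score threshold: the number of decoy matches at or above the threshold divided by the number of target matches. The Python sketch below uses that simple estimate; it is not necessarily the formula ProteoIDViewer applies, and the scores and the function name fdr_at_threshold are made up for illustration.

```python
def fdr_at_threshold(matches, threshold):
    """Estimate FDR as decoys / targets among matches scoring >= threshold.

    `matches` is a list of (score, is_decoy) pairs. Returns None when no
    target match passes the threshold, since the ratio is then undefined.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        return None
    return decoys / targets


# Hypothetical matches: (score, is_decoy).
MATCHES = [
    (60.2, False),
    (45.0, False),
    (41.3, True),
    (38.7, False),
    (33.5, False),
    (20.1, True),
]

# One decoy and four targets score at least 30.
print(fdr_at_threshold(MATCHES, 30.0))

# With target matches only, the estimate is always zero, whatever the data.
targets_only = [(score, is_decoy) for score, is_decoy in MATCHES if not is_decoy]
print(fdr_at_threshold(targets_only, 30.0))
```

A file containing only target matches therefore gives such a tool nothing to count, so it cannot produce a meaningful estimate.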

More specialised mzIdentML tools include IDPicker 3 (peptide filtering and protein inference), PAnalyzer (protein inference) and BiblioSpec (create spectral libraries from mzIdentML and MGF files, part of Skyline). The mzidLibrary mentioned above reads mzIdentML and contains functions for protein inference and peptide score thresholding among other things. The library is used at least by ProteoAnnotator.

And that seems to be about it, at least as far as I have been able to find out. There is also mzidValidator for validating mzIdentML files, but we’ll talk about this tool more in part 2. If you know of more mzIdentML-reading software, please leave a link in the comments!

5 comments on “PSI file formats, part 1: mzIdentML”

  1. Alejandro Aguilar said:

    Hi.

    I have a problem.

    I need mzIdentML v1.1 but, the mascot 2.3 export to v1.0
    Are there any conversion to v1.1 from v1.0?

    thanks

    • John Cottrell said:

      mzIdentML 1.1 was released well after Mascot Server 2.3. We added support for mzIdentML 1.1 in Mascot Server 2.4.1, released in 2012. I don’t know of any converter.

  2. Witold said:

    Dear John,

    I have mascot .dat files and need to convert them to mzIdentML files. Is there a stand alone command tool you could recommend?

    Thank you

    • John Cottrell said:

      You can run the export script at the command line – http://www.matrixscience.com/help/export_help.html#COMMAND

  3. Juan A. Vizcaino said:

    For data visualisation, PRIDE Inspector works really very well and it is actively maintained.

    PRIDE Inspector is open source (free to use) and can be downloaded from: https://github.com/PRIDE-Toolsuite/pride-inspector

    The most recent PRIDE Inspector publication in MCP (describing the support for mzIdentML, but also mzTab) is: http://www.ncbi.nlm.nih.gov/pubmed/26545397. You can read all the details there and test the tool.

    The original publication is here: http://www.ncbi.nlm.nih.gov/pubmed/22318026.
