Posted by Ville Koskinen (October 21, 2015)

PSI file formats, part 4: mzTab

mzTab is a relatively new file format for reporting protein and peptide search results. Its specification (1.0) was published in June 2014. Like mzIdentML, it is developed by the Proteomics Standards Initiative (PSI) and shares some of the same controlled vocabulary (CV). However, that’s where the similarities end. The biggest differences are that mzTab is table-based text, not XML, and the file can contain both identification and quantitation data.

Structure of mzTab files

mzTab files are “plain text” and can be opened in a text editor or a spreadsheet. The file starts with a metadata section, followed by four optional sections: one table for each of proteins, peptide-spectrum matches, peptides and small molecules. There are two main types, Identification and Quantification, and two modes, Summary and Complete. Type and mode determine which table columns are required and which are optional; the details can be found in the specification document. Regardless of type and mode, any or all of the tables can be present or absent depending on what kind of search results are being exported.

PSI has made available several example files, of which SILAC_SQ.mzTab is the simplest and nearly self-explanatory. This is a SILAC experiment summary reporting only the final list of proteins and their relative abundances. The metadata (MTD) section is a list of key-value pairs separated by tabs. It contains the usual search parameters, input file description and so on, and also specifies the options and assumptions used in generating the mzTab file. The protein table structure is equally simple: columns are separated by tabs, the first row is the header and specifies the column names, and the remaining rows contain the search results. Column data types are either fixed in the standard – e.g. taxid is a positive integer – or specified in the metadata section – e.g. protein_search_engine_score[1] is Mascot score.
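
As a rough illustration of how little is needed to read this layout, here is a minimal Python sketch. It assumes the line prefixes used by mzTab 1.0 (MTD for metadata, PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and data rows of the four tables, COM for comments). The function name read_mztab and the tiny example file are made up for this post; the accession and values are not taken from SILAC_SQ.mzTab.

```python
import csv
import io

# Header-line prefix -> data-row prefix for the four mzTab 1.0 tables.
TABLE_PREFIXES = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Split mzTab text into a metadata dict and a dict of tables.

    Hypothetical helper: it does no validation and keeps every value
    as a string.
    """
    metadata = {}
    headers = {}  # data-row prefix -> list of column names
    tables = {row: [] for row in TABLE_PREFIXES.values()}
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0] == "COM":
            continue  # skip blank lines and comment lines
        prefix, rest = fields[0], fields[1:]
        if prefix == "MTD":
            # Metadata line: key in the second field, value in the third.
            metadata[rest[0]] = rest[1] if len(rest) > 1 else ""
        elif prefix in TABLE_PREFIXES:
            # Table header line: remember the column names.
            headers[TABLE_PREFIXES[prefix]] = rest
        elif prefix in tables and prefix in headers:
            # Table data row: pair each value with its column name.
            tables[prefix].append(dict(zip(headers[prefix], rest)))
    return metadata, tables


# Invented three-column example, just to exercise the function.
EXAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-mode\tSummary",
    "MTD\tmzTab-type\tQuantification",
    "",
    "PRH\taccession\ttaxid",
    "PRT\tPROT_A\t9606",
])

metadata, tables = read_mztab(EXAMPLE)
print(metadata["mzTab-type"])
print(tables["PRT"][0]["taxid"])
```

A real reader would also convert column values to the types fixed in the standard; the sketch leaves that out to keep the structure visible.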

The other dimension of structure is experiment structure. mzTab uses the following concepts, translated into Mascot terms:

  • run: The Mascot results file, or strictly speaking the set of peak lists submitted for searching.
  • sample: In a search with quantitation, the components of the quantitation method are treated as samples (e.g. 114, 115, 116 and 117 in 4-plex iTRAQ). When quantitation is not in use, there is a single unnamed sample.
  • assay: Realisation or measurement of a particular sample in a particular run, so a run can contain multiple assays. If there is only one results file, as is usually the case, sample and assay are the same thing. When quantitation is not in use, there is a single unnamed assay.
  • study variable: Quantitation variable of interest, for example protein or peptide ratio. This is a function of assay abundances.

mzTab also supports combining results from multiple search engines. Certain protein and peptide columns are indexed with one or more of the above. One example is protein_search_engine_score[1], which refers to the score of the protein given by search engine 1.
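
The bracketed indices are easy to pick apart with a regular expression. The following Python sketch is only an illustration: parse_indexed is a made-up helper name, and the second and third example names are column names as I understand them from the mzTab 1.0 specification, so check them against the specification before relying on them.

```python
import re

# Matches one "name[index]" element, e.g. "assay[2]".
INDEXED = re.compile(r"([A-Za-z_]+)\[(\d+)\]")


def parse_indexed(name):
    """Return the (element, index) pairs found in a column or metadata name."""
    return [(element.strip("_"), int(index))
            for element, index in INDEXED.findall(name)]


print(parse_indexed("protein_search_engine_score[1]"))
print(parse_indexed("protein_abundance_assay[2]"))
print(parse_indexed("search_engine_score[1]_ms_run[3]"))
```

A name can carry more than one index, which is why the helper returns a list rather than a single pair.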

Exporting mzTab in Mascot

Mascot 2.5 is the first and so far the only search engine to offer native mzTab exporting. If you wish to try it out on your local Mascot installation, ensure you have installed the latest service pack (2.5.01), as it fixes a few issues caused by last-minute changes to the specification after Mascot 2.5 was released.

Mascot 2.5 supports both Identification and Quantification mzTab files, and the choice is naturally made based on the search type: searches whose quantitation protocol is Reporter or Multiplex are exported as Quantification results, while all other searches are Identification results. The exported files are always Complete, never Summary. The main difference between the two is that Complete files have to contain protein and peptide abundance columns if type is Quantification. For Identification files, the difference is minimal.
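
The rule in the previous paragraph fits in a few lines of Python. This is a sketch of the decision as described here, not the actual Mascot code: the function names are invented, and passing the protocol as a plain string (or None when quantitation is not in use) is an assumption made for the example.

```python
# Quantitation protocols that are exported as Quantification results.
QUANT_PROTOCOLS = {"Reporter", "Multiplex"}


def mztab_type(protocol):
    """Choose the mzTab type from the quantitation protocol name.

    `protocol` is None when the search has no quantitation method.
    """
    return "Quantification" if protocol in QUANT_PROTOCOLS else "Identification"


def type_and_mode(protocol):
    """Return the type and mode metadata entries; mode is always Complete."""
    return [("mzTab-type", mztab_type(protocol)), ("mzTab-mode", "Complete")]


print(type_and_mode("Reporter"))
print(type_and_mode(None))
```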

Identification files contain three sections: metadata, protein table and PSM table. The contents are pretty much the same as in the Mascot CSV export format. Quantification files contain an additional table for peptides, with the (simplistic) assumption that PSM = peptide. The reason is technical. Only the peptides table can report peptide quantitation ratios and abundances, which are not allowed in the PSM table, but peptide ratios in the Reporter and Multiplex protocols are based directly on the spectrum peak intensities. Since there is no “peptide inference” step in Mascot and PSMs are implicitly treated as peptides elsewhere in the system, this approach seemed reasonable. Feedback from mzTab users would be welcome.

Like all standards, mzTab 1.0 leaves some corner cases unspecified either for lack of consensus or lack of real-life examples. Here’s a short list of the design decisions we’ve made:

  • Full set of search parameters: We export the full set of search parameters and the full set of export options in the metadata section. This means that if you have the original peak lists, you can reproduce both the search results and the exported mzTab file.
  • Support for multiple databases: We export a custom column for the databases of the “ambiguity members” of a protein hit, which are the sameset proteins. The anchor or representative protein hit already has both accession and database columns.
  • Support for non-standard quantitation methods: The mzTab specification recommends the use of CV terms to name the quantitation method used. This only works if the method is standard enough to have a CV term. Instead, we export as close a description of the quantitation method and its components as is feasible using custom metadata entries.
  • Protein and peptide ratio formulas: We export custom metadata entries that describe how the protein and peptide ratios were calculated from the assays. The former is the name of the protein ratio statistic (e.g. median) while the latter is the actual formula of the ratio as specified in the quantitation method. It’s hard to see why this was excluded from mzTab 1.0.
  • Error tolerant and unknown modifications: If you export a Mascot results file from before version 2.2, or if you’re exporting an error tolerant search that used an old Unimod file, it’s possible some of the modifications reported in the results file no longer exist in Unimod. mzTab does not allow exporting free-text modification names in cases like this, so we can only export modification delta.

All of the above is implemented using the extensibility mechanisms provided in mzTab 1.0, which allow adding custom metadata entries and specifying new protein and peptide table columns.
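
If you are reading such a file, the custom columns are easy to tell apart, because mzTab 1.0 requires optional column names to begin with the prefix opt_. Here is a small Python sketch of that separation. The helper name split_columns and the custom column name in the example are invented for illustration and are not the names Mascot actually writes.

```python
def split_columns(header):
    """Separate standard column names from custom (opt_) ones."""
    standard = [name for name in header if not name.startswith("opt_")]
    custom = [name for name in header if name.startswith("opt_")]
    return standard, custom


# Hypothetical protein table header with one custom column.
header = ["accession", "database", "opt_global_example_column"]
standard, custom = split_columns(header)
print(standard)
print(custom)
```

Software that does not understand a custom column can ignore it and still read the standard ones, which is the point of the extensibility mechanism.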

Other software support

Software library support for mzTab is reasonable and can get you started in application development. jmzTab is the reference implementation and includes a validator. For R, the Bioconductor package MSnbase can read and write mzTab files, and there is also experimental support in the OpenMS 2.0 C++ library. The validator in jmzTab is still fairly basic and mainly examines the file and table structure and whether the required columns and metadata elements are included. It’s unlikely to ever be as complicated as mzIdentML validation, which is a good thing.

There are a couple of options for converting data to mzTab. jmzTab bundles the tools mzTabGUI and mzTabCLI, which allow converting PRIDE XML and mzIdentML files to mzTab. PRIDE Converter 2 can convert a range of formats to a skeletal mzTab file, and mzqLibrary can convert quantitation results in mzQuantML format to mzTab. The PSI website also mentions mzTab support in MaxQuant, but as the MaxQuant wiki makes no mention of it, I’m not certain if it’s export only.

What can you do with mzTab files? At the moment, unfortunately not much. As far as I can tell, the only application that both reads and writes mzTab, other than MSnbase, is MZmine 2, but it only uses the small molecules table and only for storing peak lists. If this is a repeat of the mzIdentML story, we can expect more interesting applications to appear perhaps 4-5 years after version 1.0 of the format, so expect the first mzTab “killer app” to appear around 2018. In the meantime, it’s perfectly possible to browse mzTab files in a spreadsheet (like Microsoft Excel or LibreOffice Calc) and enjoy it as an improved CSV format.

One comment on “PSI file formats, part 4: mzTab”

  1. Juan Antonio Vizcaino said:

    For data visualisation, PRIDE Inspector works very well with mzTab files.

    PRIDE Inspector can be downloaded from:
    https://github.com/PRIDE-Toolsuite/pride-inspector

    The most recent PRIDE Inspector publication in MCP (describing the support for mzTab, but also mzIdentML) is: http://www.ncbi.nlm.nih.gov/pubmed/26545397. You can read all the details there and try the tool.

    The original publication is here: http://www.ncbi.nlm.nih.gov/pubmed/22318026.

    PRIDE plans to support “complete” submissions in mzTab format in the coming months.
