Posted by Ville Koskinen (October 14, 2019)

Is your database search reproducible?

A lot has been written about the reproducibility of shotgun proteomics workflows, and about controlling measurement variability between instruments and laboratories. An equally important factor, and one that is gaining visibility, is the transparency and reproducibility of data analysis. This boils down to two essentials: recording all software parameters used, and ensuring the software produces the same output given the same input and parameters. A number of Mascot features together enable complete reproducibility.

Recording input data and metadata

Mascot automatically records the complete set of metadata describing the search conditions in the results file (Fxxxx.dat). This is a standard feature available in all versions of Mascot. The metadata consist of:

All user-specified (form-level) search parameters
All Distiller processing options, if submitted from Mascot Distiller
Relevant mascot.dat options
Filepaths to the databases or spectral libraries that were searched
Exact version of Mascot that produced the results
Enzyme definition
Instrument definition and fragmentation rules
Definitions of all fixed and variable modifications selected as parameters
Details of modifications found during an error tolerant search
Quantitation method definition, if any
Complete peak lists and embedded (query-level) parameters

You can view the embedded modification and quantitation method configuration by clicking on the links under Search Parameters in Protein Family Summary. When you export the search results as CSV, XML, mzIdentML or mzTab, the output file contains as much of the metadata as supported by the corresponding format. At minimum, this means the search parameters, export parameters and Mascot version, but may also include details about the enzyme, fragmentation rules and the quantitation method.

Although the results file stores the relevant portions of the configuration files, repeating the search from the results report doesn’t reuse the embedded configuration. Instead, the search form chooses the enzyme, instrument, modifications and quantitation method by name. If any of them has been changed, the repeat search can produce different results. The results files are just text files, so if this happens, you can always open the files in a decent text editor and compare the embedded configuration.
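If you would rather not compare the files by eye, the comparison can be scripted. The following is a minimal sketch, not a Mascot tool. It assumes that each section of the results file begins with a Content-Type header line carrying a name="..." attribute; check this against your own files before relying on it. The function names and the section names in the usage note are illustrative.

```python
import difflib
import re

# Assumption: every section starts with a header line such as
#   Content-Type: application/x-Mascot; name="enzyme"
SECTION_HEADER = re.compile(r'^Content-Type:.*\bname="([^"]+)"', re.IGNORECASE)


def split_sections(text):
    """Split results-file text into {section name: list of lines}."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = SECTION_HEADER.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            # Delimiter lines between sections are kept as-is; they are
            # the same in both files if both use the same delimiter.
            sections[current].append(line)
    return sections


def diff_sections(old_text, new_text, names):
    """Return {section name: unified diff lines} for sections that differ."""
    old = split_sections(old_text)
    new = split_sections(new_text)
    report = {}
    for name in names:
        delta = list(
            difflib.unified_diff(
                old.get(name, []),
                new.get(name, []),
                fromfile=f"old:{name}",
                tofile=f"new:{name}",
                lineterm="",
            )
        )
        if delta:
            report[name] = delta
    return report
```

To use it, read the original and the repeat results files as text and pass both to diff_sections together with the section names you care about, for example ["enzyme", "parameters"] if your files use those names. An empty dictionary means the chosen sections are identical.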

The results file stores the peak lists, but they are not necessarily identical to the input peak lists. For example, the PrecursorCutOut, CentroidWidth and CentroidWidthCount options (discussed in an earlier article) may have filtered or merged peaks. These options are currently not stored in the results file, and they probably should be. The important point is, the embedded peak lists are the ones that were used in the search to obtain the reported peptide matches. If you repeat the search, you can expect to get identical results.

Finally, Distiller processing options are only saved in the results file if the correct option is enabled. Open a project and go to Tools → Preferences → Peak list format → MGF parameters, and check “Processing options in header”. You can set this as the default behaviour for new projects by closing all projects, then checking the same option in Tools → Default Preferences → Peak list format.

Sequence databases

The results file does not store protein sequences. Embedding the whole FASTA file in every results file would greatly inflate its size and create rather nasty storage issues. So how can you ensure the peak lists will be matched against the same protein sequences in a future search?

The easiest way is to use Database Manager. For example, if your species of interest is well characterised in SwissProt, enable SwissProt as a predefined definition. Now, create a copy of the activated definition, and choose to also copy the database files. Give it a name like SwissProt_201910 (date-based) or SwissProt_Drosophila_study_123 (project-based). Run all your searches against this copy. The results file metadata contains the full filepath to this version, and any repeat search will default to using the correct database name. Naturally, it’s worth considering your backup strategy if you need to archive the search data or sequence databases long term.
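Because the results file records the path to the database rather than its contents, it can be useful to record a checksum of the archived FASTA file, so you can later confirm that the file at that path is unchanged. This is a generic sketch using the Python standard library, not a Mascot feature, and the function names and the path in the usage note are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTA files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare a file's current digest with one recorded earlier."""
    return sha256_of_file(path) == recorded_digest.strip().lower()
```

For example, call sha256_of_file("SwissProt_201910.fasta") when you create the copy (the file name here is made up), store the digest with your project notes, and call is_unchanged with the same path before repeating a search.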

Software stability

The remaining questions about data analysis reproducibility have to do with software versions and dependencies. What if you update your Mascot installation, or move it to a different PC or operating system? Will you still get the same peptide matches? The short answer is yes.

Mascot produces the same search results on all platforms. There are no functional differences between the Windows and Linux versions, or between standalone and cluster modes. Moving the software from, say, Windows 7 to Windows 10 makes no difference, because Mascot has minimal third-party dependencies. In fact, the only dependencies are standard system libraries (like the C standard library). You can update all other software on the PC and it will make no difference to Mascot search results.

We also maintain a high level of compatibility between Mascot versions. The primary goal is that updating to a new version and repeating a search should never have a negative impact on the results, although it may have a positive one. If a new feature makes some change to the results unavoidable, we always try to add a way to override or disable the new behaviour. If you spot a case where this doesn’t hold, please report it as a bug.

Finally, when you license Mascot in-house, the licence is perpetual and does not expire. You can stick to the current or any previous version of Mascot, which can be important in research projects that span many years.

Keywords: configuration files, database manager, MGF, reproducibility

Matrix Science