Mascot Parser supports two types of cache file, which speed up access to very large results files.

resfile cache: A cache for the results file itself (ms_mascotresfile_dat), i.e. the .dat file.
pepsum cache: A cache for the protein and peptide data after protein inference (ms_mascotresults/ms_peptidesummary).

Resfile cache (MSR format)

Mascot Server 3.0 introduces a new file format, Mascot Search Results (MSR). The MSR file is an SQLite database, which provides fast random access to the 'raw' search results. Unlike dat28 files (described below), MSR files do not require a resfile cache.

Parser supports saving protein inference results from an MSR file in a pepsum cache.

Resfile cache (dat28 format)

Mascot Server 2.8 and earlier saved results in a MIME-format file with .dat extension, now called the dat28 format. For new applications, we recommend using MSR files, as they are much more efficient for random access.

For dat28 files, Mascot Parser 2.3 and later support the use of cache files to speed up access. The resfile cache is an index to different sections and raw lines of the results file itself, and only needs to be rebuilt if the file changes.

Parser supports saving protein inference results from a dat28 file in a pepsum cache.

Pepsum cache (both formats)

The pepsum cache contains processed and grouped protein hits and peptide data. It is created and used by ms_peptidesummary. Without a pepsum cache, protein grouping needs to be done anew every time an ms_peptidesummary object is created. With caching, the grouped data can be stored in the cache file, and subsequent file access can skip protein grouping entirely.

For MSR files, the pepsum cache has file extension .sdb, which is an SQLite database.

For dat28 files, the pepsum cache has file extension .cdb, which is a read-only key-value database.

Specifying cache file directory

In a standard Mascot Server setup, cache files are located in a directory specified by the CacheDirectory value in the options section of mascot.dat. If CacheDirectory specifies a relative directory, then it is relative to the current working directory of the calling program. The function ms_mascotoptions::getCacheDirectory() can be used to retrieve the cache directory from mascot.dat. In all Mascot Server scripts, the return value from this function is passed to ms_mascotresfilebase::createResfile().

Cache files for a specific results file are put into their own directory under the directory specified by CacheDirectory. The name of the directory is constructed by calling getMD5Sum() and passing a string comprising the filename (or filenames if combining multiple results files; see Combining multiple results files), file size and last modified date. The function ms_mascotresfile::getCacheDirectory() can be used to retrieve the directory used for cache files for a specific results file.
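As an illustration of the naming scheme described above, the following minimal sketch derives a directory name from the filename, file size and last modified date. It uses Python's hashlib rather than Parser's getMD5Sum(), and the helper name cache_subdir_name, the '|' separator and the hexadecimal encoding are all assumptions made for the example; the exact string that Parser hashes is not specified here.

```python
import hashlib

def cache_subdir_name(filenames, size_bytes, mtime_epoch):
    """Hypothetical helper: derive a stable directory name from file identity."""
    # Assumed layout of the hashed string; Parser's real layout may differ.
    parts = list(filenames) + [str(size_bytes), str(int(mtime_epoch))]
    key = "|".join(parts)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Same file name and date, but the size differs by one byte
name_a = cache_subdir_name(["F01234.dat"], 1_000_000, 1580688000)
name_b = cache_subdir_name(["F01234.dat"], 1_000_001, 1580688000)
print(name_a != name_b)
```

Because the size and date are part of the hashed string, a modified results file maps to a different directory, so an out-of-date cache is never picked up by name alone.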

A number of special 'tokens' can be used to split the potentially large number of files/directories into more convenient subdirectories. The 'tokens' are expanded by the strftime function, using the last modified date of the .dat file. Whilst the strftime function has a large number of options, the ones most likely to be useful are:

%d - The day of the month as a decimal number (range 01 to 31)
%m - The month as a decimal number (range 01 to 12)
%y - The year as a decimal number without a century (range 00 to 99)
%Y - The year as a decimal number including the century

Other tokens accepted by strftime are documented here: http://www.cplusplus.com/reference/clibrary/ctime/strftime/

The default value is "../data/cache/%Y/%m" which specifies a new directory for each month. For example, for a file dated the 3rd February 2020, the files would be in a directory ../data/cache/2020/02.
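The token expansion can be reproduced with any strftime implementation. The sketch below uses Python's datetime.strftime, which accepts the same tokens; the helper name expand_cache_directory is hypothetical, and the date is passed in directly rather than read from a real file.

```python
from datetime import datetime

def expand_cache_directory(template, last_modified):
    """Hypothetical helper: expand strftime tokens in a CacheDirectory value."""
    # strftime replaces the % tokens and leaves the rest of the path untouched
    return last_modified.strftime(template)

# A results file last modified on 3rd February 2020
print(expand_cache_directory("../data/cache/%Y/%m", datetime(2020, 2, 3)))
# prints ../data/cache/2020/02
```

Adding %d to the template, for example "../data/cache/%Y/%m/%d", would give one directory per day instead of one per month.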

Mascot Parser will try to create a cache directory if it doesn't exist. If necessary, it will create the whole directory tree.

Filenames for the pepsum cache files

The base name for the pepsum cache for the dat28 format is

    F01234.dat.[a-z0-9]*.cdb

The base name for the pepsum cache for the MSR format is

    F01234.msr.[a-z0-9]*.sdb

The segment [a-z0-9]* will be replaced by exactly 26 numbers or lower-case letters created by calling getMD5Sum() on a string created using all parameters passed to the ms_peptidesummary constructor, including the UniGene index file path.

If, for example, two reports are created for the same results file, one with a probability threshold of 0.05 and another with a threshold of 0.001, then separate cache files with unique filenames will be created, making it fast to switch between the two reports. Separate cache files are necessary, because changing any of the constructor parameters may change protein and peptide scoring, grouping, cut-off thresholds, etc.
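The effect of the parameters on the filename can be sketched as follows. This is not Parser's implementation: the helper pepsum_cache_name, the parameter dictionary and the way the parameters are serialised are assumptions for illustration. The sketch uses base-32 encoding because a 128-bit MD5 digest then happens to yield exactly 26 lower-case letters and digits; whether Parser encodes the digest this way is not stated here.

```python
import base64
import hashlib

def pepsum_cache_name(results_file, params, extension):
    """Hypothetical helper: build a cache name from constructor parameters."""
    # Serialise the parameters in a fixed order so equal settings give equal names
    key = ";".join(f"{name}={params[name]}" for name in sorted(params))
    digest = hashlib.md5(key.encode("utf-8")).digest()  # 16 bytes
    # Base-32 without padding turns 16 bytes into 26 characters
    token = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    return f"{results_file}.{token}.{extension}"

# Illustrative parameter names and values only
report_a = {"flags": 0, "minProbability": 0.05, "unigeneIndexFile": ""}
report_b = {"flags": 0, "minProbability": 0.001, "unigeneIndexFile": ""}

name_005 = pepsum_cache_name("F01234.dat", report_a, "cdb")
name_0001 = pepsum_cache_name("F01234.dat", report_b, "cdb")
print(name_005 != name_0001)
```

Changing only the probability threshold changes the token, so the two reports get separate cache files, while repeating a report with identical parameters finds the existing file.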

The following flags do not affect the contents of the cache file and therefore are not included in the MD5Sum for the name:

The function ms_peptidesummary::getCacheFileName() can be called to retrieve the full or relative path to the cache file.

Using the resfile cache (dat28 format only)

The default constructor for ms_mascotresfile_dat has the flag RESFILE_NOFLAG, which means that when a results file is opened, no cache will be used. Simply specify RESFILE_USE_CACHE to use a cache. This flag is a "no-op" for MSR files.

Most errors relating to the cache files are 'soft'. For example, if two applications both try to create a cache file at the same time, then the first will succeed in creating the cache file, and the second one will carry on without using a cache. If an application crashes in the middle of creating a file, then the next application to try to use the cache will re-create it. If an application fails to create the cache for any reason, it will carry on without the cache.
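The 'first one wins, everyone else carries on' behaviour is a common pattern, and can be sketched with an exclusive file create. The code below is a generic illustration using Python's os.open; it makes no claim about how Parser detects or resolves the race internally, and try_claim_cache is a hypothetical helper.

```python
import os
import tempfile

def try_claim_cache(path):
    """Return True if this caller created the file, False otherwise."""
    try:
        # O_EXCL makes the create fail if the file already exists
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError:
        # Soft failure: the caller simply carries on without a cache
        return False
    os.close(fd)
    return True

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "example.cache")
    first = try_claim_cache(cache_path)   # creates the file
    second = try_claim_cache(cache_path)  # file already exists
print(first, second)  # prints True False
```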

A resfile cache may be recreated under some conditions, even when it already exists:

If the size or last modified date of a results file has changed since the cache was created, then the cache will be rebuilt.
An internal version number for the cache is saved within the file. If Matrix Science need to change the format of the cache file, then this will cause the cache to be rebuilt.
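The two rebuild conditions above amount to comparing a small stamp stored with the cache against the current state of the results file. A minimal sketch, in which CacheStamp, CACHE_FORMAT_VERSION and needs_rebuild are hypothetical names and the version number is invented for the example:

```python
from dataclasses import dataclass

CACHE_FORMAT_VERSION = 3  # hypothetical value, for illustration only

@dataclass
class CacheStamp:
    """What the cache recorded about the results file when it was built."""
    file_size: int
    last_modified: int
    format_version: int

def needs_rebuild(stamp, current_size, current_mtime):
    # Rebuild if the results file changed or the cache format is out of date
    return (stamp.file_size != current_size
            or stamp.last_modified != current_mtime
            or stamp.format_version != CACHE_FORMAT_VERSION)

stamp = CacheStamp(file_size=1_000_000, last_modified=1580688000,
                   format_version=CACHE_FORMAT_VERSION)
print(needs_rebuild(stamp, 1_000_000, 1580688000))  # unchanged file
print(needs_rebuild(stamp, 1_000_512, 1580688000))  # file has grown
```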

The resfile cache has a maximum size of 4GB. In general, the ms_mascotresfile_dat cache file is about 10% of the size of the results file.

The static function ms_mascotresfile_dat::willCreateCache() can be called before creating an ms_mascotresfile_dat object to determine if a cache file will be created. This is useful in checking whether creating the object will take a long time.

The function ms_mascotresfile::getCacheFileName() can be called to retrieve the full or relative path to the cache file.

Using the pepsum cache (MSR and dat28)

Specify MSPEPSUM_USE_CACHE when creating the ms_peptidesummary object to enable caching.

If the cache file doesn't exist, then it will be created. If it exists, then the ms_peptidesummary constructor will return without extra delay. Calling getHit() and getPeptide() will load data from the cache file on demand. This means that after the cache file has been created once, you will need much less memory to access data in the file, as the whole file does not need to be read into memory.

What is cached?

Different versions of Parser cache more or less data in the pepsum cache. The following list is an example of what is stored:

Protein grouping and family links, most common fields (except description)
ms_mascotresults::isNA()
ms_mascotresults::getProteinScoreForHistogram()
ms_mascotresults::getAvePeptideIdentityThreshold()
ms_mascotresults::getIonsScoreHistogram()
For results using a UniGene index, the date, time and size of the UniGene file are cached.
For a decoy search, calls to the following methods when using the default minProbability value:
The unassigned list. It may still be slow to sort the unassigned list by intensity, as the intensity values are not cached.

Assuming MSRES_DECOY is not specified, no actual decoy matches will be saved. The same is true in reverse: when using MSRES_DECOY, protein hits from the standard search are not cached.

If a UniGene index is in use and the UniGene file changes, the cache will be rebuilt.

Performance issues when creating the pepsum cache (dat28 format only)

If you use the resfile cache for an ms_mascotresfile_dat object, and then create a new ms_peptidesummary object, creating the pepsum cache will typically take 1.5 to 3 times longer than it would without the ms_mascotresfile_dat cache.

The static function matrix_science::ms_peptidesummary::willCreateCache() can be called before creating an ms_peptidesummary object to determine if there will be a delay while a cache file is created.

Performance issues with ms_protein objects

Once a cache file has been created, it should take less than a second to create a new ms_peptidesummary object. Calling matrix_science::ms_mascotresults::getHit() should also be very fast as this just loads data from the cache file. However, only basic information for each matrix_science::ms_protein object is loaded at this time. The first call for any particular ms_protein object to a function that takes a pepNumber argument (for example, matrix_science::ms_protein::getPeptideIonsScore()) will be slow because this will cause a reload of data from multiple parts of the results file. Subsequent calls to any function for that ms_protein object will be fast, until matrix_science::ms_mascotresults::freeHit() is called.

Some applications just need a list of query/rank values for each protein, so these values are cached separately for each top level protein and family member protein. Therefore, the first call to matrix_science::ms_protein::getPeptideQuery() or matrix_science::ms_protein::getPeptideP() will be reasonably fast compared with calls to the other functions that take a pepNumber.

Performance issues when using non-default arguments

Calls to any function that takes a OneInXprobRnd argument will be slow if 1) caching is in use and 2) the argument is not the same as when the cache file was created. Typically, this argument is 1 / results.getProbabilityThreshold(). The value is compared to what is saved in the cache file (within a certain precision), and if the argument differs, this may trigger an expensive loop over all queries.
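The comparison 'within a certain precision' can be pictured as a tolerance check on two floating point values. The sketch below is illustrative only: the helper matches_cached_threshold and the tolerance of 1e-6 are assumptions, as the precision actually used by Parser is not given here.

```python
import math

def matches_cached_threshold(one_in_x, cached_one_in_x, rel_tol=1e-6):
    """Hypothetical helper: is the argument the same as the cached value?"""
    return math.isclose(one_in_x, cached_one_in_x, rel_tol=rel_tol)

# Cache created with a probability threshold of 0.05, i.e. 1 in 20
cached = 1 / 0.05

print(matches_cached_threshold(1 / 0.05, cached))   # same threshold: fast path
print(matches_cached_threshold(1 / 0.001, cached))  # different: slow path
```

In practice this means that passing 1 / results.getProbabilityThreshold() keeps calls on the fast path, while any other value is likely to fall back to the slow one.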

The affected functions are:

Other potentially slow function calls in cache mode

Parser is backwards compatible with cache files from a number of previous versions. New functionality often necessitates adding more data to new cache files to improve performance. If this data is not present in the current cache file (e.g. the cache file is from one or two versions before the current version), it is read directly from the results file or calculated on the fly.

The enum ms_peptidesummary::BUGFIX_NUM maintains a list of new functionality and performance improvements where this may be the case. If you use cache files created by a previous version of Parser, it is good practice to call ms_peptidesummary::isDataCached() with the relevant bug number to discover whether the data needed by the method you use is indeed present in the cache file. If it isn't, you can avoid the function call, prepare a progress feedback screen or recreate the cache files.

How a script should decide whether to use a cache

There are two lines in the options section of mascot.dat which specify what each script should do:

    ResfileCache master_results.pl,master_results_2.pl,peptide_view.pl...
    ResultsCache master_results.pl,master_results_2.pl,peptide_view.pl...

Each script should see if it is listed in ResfileCache, and if so, it should specify RESFILE_USE_CACHE when creating the ms_mascotresfilebase object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResfileCache(). The procedure is correct for both MSR and dat28 files; ms_mascotresfile_msr simply ignores RESFILE_USE_CACHE if it's set.

Each script should see if it is listed in ResultsCache, and if so, it should specify MSPEPSUM_USE_CACHE when creating the ms_mascotresults object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResultsCache().
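The lookup itself is a simple membership test on a comma-separated list. In the sketch below, the option values are written as literal strings rather than read from mascot.dat, and script_uses_cache is a hypothetical helper; it assumes that the option value is a plain comma-separated list of script names.

```python
import os

def script_uses_cache(option_value, script_path):
    """Hypothetical helper: is this script named in the option value?"""
    listed = {name.strip() for name in option_value.split(",") if name.strip()}
    # Compare on the script's file name only, ignoring its directory
    return os.path.basename(script_path) in listed

# Sample values, standing in for what would be read from mascot.dat
resfile_cache = "master_results.pl,master_results_2.pl,peptide_view.pl"
results_cache = "master_results.pl, master_results_2.pl"

print(script_uses_cache(resfile_cache, "cgi/peptide_view.pl"))
print(script_uses_cache(results_cache, "cgi/peptide_view.pl"))
```

A script would run the first check before creating the results file object and the second before creating the results object, setting RESFILE_USE_CACHE or MSPEPSUM_USE_CACHE respectively when the check succeeds.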

Using the pepsum cache for protein_view.pl

To quickly load details for a single protein (e.g. for protein_view.pl), applications prior to Mascot Parser 2.3 would generally use the singleHit parameter. However, it is faster to use a cache file if it exists.

For ms_proteinsummary, no pepsum cache is available, so continue to follow the instructions in Getting a single hit from the protein summary.

For ms_peptidesummary, the fastest method of loading details for a single protein is to use the cache if it already exists. For protein_view.pl, it is likely that a cache has already been created by master_results.pl or master_results_2.pl. So, as long as the exact same flags and parameters for the ms_peptidesummary constructor are used, access will be fast.

Protocol:

1) Check to see if protein_view.pl (or whatever application/script) is to use a cache file as described above.
2) Create the new ms_peptidesummary without the singleHit parameter and using the same flags and parameters as in master_results.pl or master_results_2.pl to make sure an existing cache file is used. Make sure that MSPEPSUM_USE_CACHE is specified.
3) Use getProtein() to get the single protein hit.