Caching Mascot Results

Mascot Parser 2.3 and later versions support cache files, which speed up access to large results files.

Two types of cache file are supported, and typically both will be created when processing a single results file:

The first one is an index to different sections and raw lines of the results file itself, and only needs to be rebuilt if the file changes.

The second contains processed and grouped protein hits and peptide data. Without an ms_peptidesummary cache, protein grouping needs to be done anew every time an ms_peptidesummary object is created. With caching, the grouped data can be stored in the cache file, and subsequent file access can skip protein grouping entirely.

Specifying cache file directory

In a standard Mascot Server setup, cache files are located in a directory specified by the CacheDirectory value in the options section of mascot.dat. If CacheDirectory specifies a relative directory, then it is relative to the current working directory of the calling program. The function ms_mascotoptions::getCacheDirectory() can be used to retrieve the cache directory from mascot.dat. In all Mascot Server scripts, the return value from this function is passed to the ms_mascotresfile constructor.

Cache files for a specific results file are put into their own directory under the directory specified by CacheDirectory. The name of the directory is constructed by calling getMD5Sum() and passing a string comprising the .dat filename (or filenames if combining multiple results files; see Combining multiple .dat files), file size and last modified date. The function ms_mascotresfile::getCacheDirectory() can be used to retrieve the directory used for cache files for a specific results file.

A number of special 'tokens' can be used to split the potentially large number of files/directories into more convenient subdirectories. The 'tokens' are expanded by passing the last modified date of the .dat file to the strftime function. Whilst the strftime function has a large number of options, the ones most likely to be useful are:

Other tokens accepted by strftime are documented here: http://www.cplusplus.com/reference/clibrary/ctime/strftime/

The default value is "../data/cache/%Y/%m" which specifies a new directory for each month. For example, for a file dated the 3rd February 2010, the files would be in a directory ../data/cache/2010/02.
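The token expansion can be tried out in isolation. The following is a minimal sketch rather than Mascot Parser code: expandCacheDirectory and makeDate are hypothetical helpers, and the sketch assumes the last modified date is interpreted as local time, which is not stated above.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper (not part of Mascot Parser): expands strftime tokens
// in a cache directory template using a file's last modified date.
std::string expandCacheDirectory(const std::string& pattern,
                                 const std::tm& modified) {
    char buffer[1024];
    // strftime returns the number of characters written, or 0 if the
    // expanded string does not fit into the buffer.
    const std::size_t written =
        std::strftime(buffer, sizeof(buffer), pattern.c_str(), &modified);
    return std::string(buffer, written);
}

// Hypothetical helper: builds a std::tm for a calendar date (local time).
std::tm makeDate(int year, int month, int day) {
    std::tm date = {};
    date.tm_year = year - 1900;  // years since 1900
    date.tm_mon = month - 1;     // months are numbered 0-11
    date.tm_mday = day;
    date.tm_isdst = -1;          // let the C library determine DST
    std::mktime(&date);          // fills in the day of week and day of year
    return date;
}
```

Under these assumptions, expandCacheDirectory("../data/cache/%Y/%m", makeDate(2010, 2, 3)) yields ../data/cache/2010/02, matching the example above.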

Mascot Parser will try to create a cache directory if it doesn't exist. If necessary, it will create the whole directory tree.

Using the ms_mascotresfile cache files

By default, the ms_mascotresfile constructor uses the flag RESFILE_NOFLAG, which means that no cache is used when a results file is opened. To use a cache, specify RESFILE_USE_CACHE.

Most errors relating to the cache files are 'soft'. For example, if two applications both try to create a cache file at the same time, then the first will succeed in creating the cache file, and the second one will carry on without using a cache. If an application crashes in the middle of creating a file, then the next application to try to use the cache will re-create it. If an application fails to create the cache for any reason, it will carry on without the cache.

A cache file may be recreated under some conditions, even when it already exists:

The cache files have a maximum size of 4 GB. In general, the ms_mascotresfile cache file is about 10% of the size of the results file.

The static function ms_mascotresfile::willCreateCache() can be called before creating an ms_mascotresfile object to determine if a cache file will be created. This is useful in checking whether creating the object will take a long time.

The function ms_mascotresfile::getCacheFileName() can be called to retrieve the full or relative path to the cache file.

Using the ms_peptidesummary cache

Specify MSPEPSUM_USE_CACHE when creating the ms_peptidesummary object to enable caching.

If the cache file doesn't exist, then it will be created. If it exists, then the ms_peptidesummary constructor will return without extra delay. Calling getHit() and getPeptide() will load data from the cache file on demand. This means that after the cache file has been created once, you will need much less memory to access data in the file, as the whole file does not need to be read into memory.

What is cached?

The following values and data are stored in the cache file:

Assuming MSRES_DECOY is not specified, no actual decoy matches will be saved. The same is true in reverse: when using MSRES_DECOY, protein hits from the standard search are not cached.

If a UniGene index is in use and the UniGene file changes, the cache will be rebuilt.

Performance issues when creating the cache

If you use the cache for an ms_mascotresfile object, and then create a new ms_peptidesummary object, creating the ms_peptidesummary cache will typically take 1.5 to 3 times longer than it would without an ms_mascotresfile cache.

The static function matrix_science::ms_peptidesummary::willCreateCache() can be called before creating an ms_peptidesummary object to determine if there will be a delay while a cache file is created.

Performance issues with ms_protein objects

Once a cache file has been created, it should take less than a second to create a new ms_peptidesummary object. Calling matrix_science::ms_mascotresults::getHit() should also be very fast as this just loads data from the cache file. However, only basic information for each matrix_science::ms_protein object is loaded at this time. The first call for any particular ms_protein object to a function that takes a pepNumber argument (for example: matrix_science::ms_protein::getPeptideIonsScore() ) will be slow because this will cause a reload of data from multiple parts of the results file. Subsequent calls to any function for that ms_protein object will be fast, until matrix_science::ms_mascotresults::freeHit() is called.

Some applications just need a list of query/rank values for each protein, so these values are cached separately for each top level protein and family member protein. Therefore, the first call to matrix_science::ms_protein::getPeptideQuery() or matrix_science::ms_protein::getPeptideP() will be reasonably fast compared with calls to the other functions that take a pepNumber.

Performance issues when using non-default arguments

Calls to any function that takes a OneInXprobRnd argument will be slow if 1) caching is in use and 2) the argument is not the same as when the cache file was created. Typically, this argument is 1 / results.getProbabilityThreshold(). The value is compared to what is saved in the cache file (within a certain precision), and if the argument differs, this may trigger an expensive loop over all queries.
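The comparison described above can be illustrated with a small sketch. This is not the check Mascot Parser performs: oneInXFromThreshold and sameOneInX are hypothetical helpers, and the relative tolerance of 1e-6 is an assumed value, since the actual precision is not stated here.

```cpp
#include <cmath>

// Hypothetical helper: converts a probability threshold (for example 0.05)
// into the corresponding OneInXprobRnd value (for example 20).
double oneInXFromThreshold(double probabilityThreshold) {
    return 1.0 / probabilityThreshold;
}

// Hypothetical helper: reports whether the requested value equals the value
// stored with the cache, within an ASSUMED relative tolerance.
bool sameOneInX(double requested, double cached,
                double relativeTolerance = 1e-6) {
    return std::fabs(requested - cached) <=
           relativeTolerance * std::fabs(cached);
}
```

If a cache was created with a threshold of 0.05, a later call that passes oneInXFromThreshold(0.05) falls within the tolerance, whereas passing oneInXFromThreshold(0.001) does not and would take the slow path described above.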

The affected functions are:

Other potentially slow function calls in cache mode

Parser is backwards compatible with cache files from a number of previous versions. New functionality often necessitates adding more data to new cache files to improve performance. If this data is not present in the current cache file (e.g. the cache file is from one or two versions before the current version), it is read directly from the results file or calculated on the fly.

The enum ms_peptidesummary::BUGFIX_NUM maintains a list of new functionality and performance improvements where this may be the case. If you use cache files created by a previous version of Parser, it is good practice to call ms_peptidesummary::isDataCached() with the relevant bug number to discover whether the data needed by the method you use is indeed present in the cache file. If it isn't, you can avoid the function call, prepare a progress feedback screen or recreate the cache files.

Filenames for the ms_mascotresults/ms_peptidesummary cache files

The cache filename format is

    F01234.dat.[a-z0-9]*.cdb 

where [a-z0-9]* will be replaced by exactly 26 numbers or lower-case letters created by calling getMD5Sum() on a string created using all parameters passed to the ms_peptidesummary constructor, including the UniGene index file path.
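A tool that needs to recognise these cache files, for example when cleaning up a cache directory, could match the format with a regular expression. The following is a minimal sketch: splitSummaryCacheName is a hypothetical helper and not part of Mascot Parser, and it assumes the results file name may be any non-empty string.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Mascot Parser): splits a cache file name
// of the form <results file>.<26 characters>.cdb into its variable parts.
// Returns false, leaving the outputs untouched, if the name does not match.
bool splitSummaryCacheName(const std::string& name,
                           std::string& resultsFile,
                           std::string& hash) {
    // Exactly 26 characters from [a-z0-9], followed by the .cdb extension.
    static const std::regex pattern(R"((.+)\.([a-z0-9]{26})\.cdb)");
    std::smatch match;
    if (!std::regex_match(name, match, pattern)) {
        return false;
    }
    resultsFile = match[1].str();
    hash = match[2].str();
    return true;
}
```

The sketch only checks the shape of the name. It cannot tell which constructor parameters produced the 26 characters.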

If, for example, two reports are created for the same results file, one with a probability threshold of 0.05 and another with a threshold of 0.001, then separate cache files with unique filenames will be created, making it fast to switch between the two reports. Separate cache files are necessary, because changing any of the constructor parameters may change protein and peptide scoring, grouping, cut-off thresholds, etc.

The following flags do not affect the contents of the cache file and therefore are not included in the MD5Sum for the name:

The function ms_peptidesummary::getCacheFileName() can be called to retrieve the full or relative path to the cache file.

How a script should decide whether to use a cache

There are two lines in the options section of mascot.dat which specify what each script should do:

    ResfileCache master_results.pl,master_results_2.pl,peptide_view.pl...
    ResultsCache master_results.pl,master_results_2.pl,peptide_view.pl...

Each script should see if it is listed in ResfileCache, and if so, it should specify RESFILE_USE_CACHE when creating the ms_mascotresfile object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResfileCache().

Each script should see if it is listed in ResultsCache, and if so, it should specify MSPEPSUM_USE_CACHE when creating the ms_mascotresults object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResultsCache().
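The lookup itself is a simple membership test on a comma-separated list. The following is a minimal sketch, assuming the list has already been retrieved as a string and that script names are compared exactly, including case. isScriptListed and trimBlanks are hypothetical helpers, not Mascot Parser functions.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: removes leading and trailing spaces and tabs.
std::string trimBlanks(const std::string& text) {
    const std::string blanks = " \t";
    const std::size_t first = text.find_first_not_of(blanks);
    if (first == std::string::npos) {
        return "";
    }
    const std::size_t last = text.find_last_not_of(blanks);
    return text.substr(first, last - first + 1);
}

// Hypothetical helper: returns true if scriptName appears as a complete
// entry in a comma-separated list such as the ResfileCache value.
bool isScriptListed(const std::string& listValue,
                    const std::string& scriptName) {
    std::istringstream entries(listValue);
    std::string entry;
    while (std::getline(entries, entry, ',')) {
        if (trimBlanks(entry) == scriptName) {
            return true;
        }
    }
    return false;
}
```

A script would pass the list value and its own name, and add the relevant cache flag only when the function returns true. Comparing whole entries means that master_results.pl is not mistaken for master_results_2.pl.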

Using the ms_peptidesummary cache file for protein_view.pl

To quickly load details for a single protein (e.g. for protein_view.pl), applications prior to Mascot Parser 2.3 would generally use the singleHit parameter. However, it is faster to use a cache file if it exists.

For ms_proteinsummary, no peptide summary cache is available, so continue to follow the instructions as in Getting a single hit from the protein summary.

For ms_peptidesummary, the fastest method of loading details for a single protein is to use the cache if it already exists. For protein_view.pl, it is likely that a cache has already been created by master_results.pl or master_results_2.pl. So, as long as the exact same flags and parameters for the ms_peptidesummary constructor are used, access will be fast.

Protocol:


Copyright © 2022 Matrix Science Ltd.  All Rights Reserved. Generated on Thu Mar 31 2022 01:12:30