Mascot Parser 2.3 and later support the use of cache files to speed up access to huge results files.
Two types of cache file are supported, and typically both will be created when processing a single results file: a cache for the results file itself (ms_mascotresfile), i.e. the .dat file, and a cache for the peptide summary (ms_peptidesummary).
The first one is an index to different sections and raw lines of the results file itself, and only needs to be rebuilt if the file changes.
The second contains processed and grouped protein hits and peptide data. Without an
ms_peptidesummary cache, protein grouping needs to be done anew every time an
ms_peptidesummary object is created. With caching, the grouped data can be stored in the cache file, and subsequent file access can skip protein grouping entirely.
In a standard Mascot Server setup, cache files are located in a directory specified by the CacheDirectory value in the options section of
mascot.dat. If CacheDirectory specifies a relative directory, then it is relative to the current working directory of the calling program. The function ms_mascotoptions::getCacheDirectory() can be used to retrieve the cache directory from
mascot.dat. In all Mascot Server scripts, the return value from this function is passed to the ms_mascotresfile constructor.
Cache files for a specific results file are put into their own directory under the directory specified by
CacheDirectory. The name of the directory is constructed by calling getMD5Sum() and passing a string comprising the .dat filename (or filenames if combining multiple results files; see Combining multiple .dat files), file size and last modified date. The function ms_mascotresfile::getCacheDirectory() can be used to retrieve the directory used for cache files for a specific results file.
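As a rough illustration of how a per-file directory name can be derived from a file's identity, the sketch below joins the filename, size and last modified time into one string and hashes it. Everything here is an assumption made for illustration: the '|' separator, the field order, the function names, and the use of 64-bit FNV-1a as a stand-in for the MD5 step. It does not reproduce what getMD5Sum() actually returns.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the MD5 step: 64-bit FNV-1a.
// Parser's real getMD5Sum() is NOT reproduced here.
std::uint64_t fnv1a64(const std::string& text) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : text) {
        hash ^= c;
        hash *= 1099511628211ULL;                  // FNV prime
    }
    return hash;
}

// Render a 64-bit value using only the characters [0-9a-z] (base 36).
std::string toBase36(std::uint64_t value) {
    const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    if (value == 0) return "0";
    std::string out;
    while (value > 0) {
        out.insert(out.begin(), digits[value % 36]);
        value /= 36;
    }
    return out;
}

// Build a directory name from the file's identity.
// The separator and field order are assumptions, not Parser's format.
std::string cacheDirName(const std::string& datFileName,
                         std::uint64_t fileSizeBytes,
                         long long lastModifiedEpoch) {
    const std::string identity = datFileName + "|" +
                                 std::to_string(fileSizeBytes) + "|" +
                                 std::to_string(lastModifiedEpoch);
    return toBase36(fnv1a64(identity));
}
```

Because the size and modification time are part of the hashed string, any change to the results file yields a different directory name, so a stale cache is simply never found rather than having to be invalidated explicitly.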
A number of special 'tokens' can be used in the CacheDirectory value to split the potentially large number of files/directories into more convenient subdirectories. The 'tokens' are expanded by the
strftime function using the last modified date of the .dat file. Whilst the
strftime function has a large number of options, the ones most likely to be useful are %Y (four-digit year), %m (month, 01 to 12) and %d (day of the month, 01 to 31).
Other tokens accepted by strftime are documented here: http://www.cplusplus.com/reference/clibrary/ctime/strftime/
The default value is "../data/cache/%Y/%m" which specifies a new directory for each month. For example, for a file dated the 3rd February 2010, the files would be in a directory under ../data/cache/2010/02.
Mascot Parser will try to create a cache directory if it doesn't exist. If necessary, it will create the whole directory tree.
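The token expansion described above can be reproduced with the standard library alone. The following is a minimal sketch, not Parser's own code: expandCacheDirectory and makeDate are hypothetical helper names, and the std::tm value stands in for the last modified date of the .dat file.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Expand strftime tokens in a CacheDirectory-style pattern using the given
// broken-down time (standing in for the .dat file's last modified date).
std::string expandCacheDirectory(const std::string& pattern, const std::tm& when) {
    char buffer[1024];
    const std::size_t written =
        std::strftime(buffer, sizeof buffer, pattern.c_str(), &when);
    // strftime returns 0 when the result does not fit (or is empty);
    // in that case this sketch falls back to the unexpanded pattern.
    return written > 0 ? std::string(buffer, written) : pattern;
}

// Hypothetical convenience helper: build a std::tm for a calendar date.
std::tm makeDate(int year, int month, int day) {
    std::tm t = std::tm();
    t.tm_year = year - 1900;  // years since 1900
    t.tm_mon = month - 1;     // months since January
    t.tm_mday = day;
    return t;
}
```

For instance, expandCacheDirectory("../data/cache/%Y/%m", makeDate(2010, 2, 3)) yields "../data/cache/2010/02", matching the default value discussed above.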
Most errors relating to the cache files are 'soft'. For example, if two applications both try to create a cache file at the same time, then the first will succeed in creating the cache file, and the second one will carry on without using a cache. If an application crashes in the middle of creating a file, then the next application to try to use the cache will re-create it. If an application fails to create the cache for any reason, it will carry on without the cache.
A cache file may be recreated under some conditions, even when it already exists:
The cache files have a maximum size of 4 GB. In general, the
ms_mascotresfile cache file is about 10% of the size of the results file.
The static function ms_mascotresfile::willCreateCache() can be called before creating an
ms_mascotresfile object to determine if a cache file will be created. This is useful in checking whether creating the object will take a long time.
The function ms_mascotresfile::getCacheFileName() can be called to retrieve the full or relative path to the cache file.
If the cache file doesn't exist, then it will be created. If it exists, then the ms_peptidesummary constructor will return without extra delay. Calling getHit() and getPeptide() will load data from the cache file on demand. This means that after the cache file has been created once, you will need much less memory to access data in the file, as the whole file does not need to be read into memory.
The following values and data are stored in the cache file:
Assuming MSRES_DECOY is not specified, no actual decoy matches will be saved. The same is true in reverse: when using
MSRES_DECOY, protein hits from the standard search are not cached.
If a UniGene index is in use and the UniGene file changes, the cache will be rebuilt.
If you use the cache for an
ms_mascotresfile object, and then create a new ms_peptidesummary object, caching will typically take 1.5 to 3 times longer than without an
The static function matrix_science::ms_peptidesummary::willCreateCache() can be called before creating an
ms_peptidesummary object to determine if there will be a delay while a cache file is created.
Once a cache file has been created, it should take less than a second to create a new ms_peptidesummary object. Calling matrix_science::ms_mascotresults::getHit() should also be very fast as this just loads data from the cache file. However, only basic information for each matrix_science::ms_protein object is loaded at this time. The first call for any particular ms_protein object to a function that takes a pepNumber argument (for example, matrix_science::ms_protein::getPeptideIonsScore()) will be slow because this will cause a reload of data from multiple parts of the results file. Subsequent calls to any function for that ms_protein object will be fast, until matrix_science::ms_mascotresults::freeHit() is called.
Some applications just need a list of query/rank values for each protein so these values are cached separately for each top level protein and family member protein. Therefore, the first call to matrix_science::ms_protein::getPeptideQuery() or matrix_science::ms_protein::getPeptideP() will be reasonably fast compared with calls to the other functions that take a pepNumber.
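The "slow first call, fast afterwards, until freed" behaviour can be pictured with a small model of on-demand loading. This is only a sketch of the general pattern: LazyHitStore and all of its members are hypothetical names, the data is a placeholder, and nothing here is Parser's implementation.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical model of on-demand loading; none of these names are Parser APIs.
class LazyHitStore {
public:
    LazyHitStore() : expensiveLoads_(0) {}

    // Returns per-peptide scores for a hit, loading them on first use only.
    const std::vector<double>& peptideScores(int hit) {
        std::map<int, std::vector<double> >::iterator it = details_.find(hit);
        if (it == details_.end()) {
            ++expensiveLoads_;  // stands in for re-reading the results file
            it = details_.insert(std::make_pair(hit, loadScores(hit))).first;
        }
        return it->second;
    }

    // Drops the loaded details, so the next access pays the cost again.
    void freeHit(int hit) { details_.erase(hit); }

    // Number of times the expensive load path has been taken.
    int expensiveLoads() const { return expensiveLoads_; }

private:
    static std::vector<double> loadScores(int hit) {
        // Placeholder data; a real reader would parse the results file here.
        return std::vector<double>(3, 10.0 * hit);
    }

    std::map<int, std::vector<double> > details_;
    int expensiveLoads_;
};
```

In this model, two consecutive calls to peptideScores(1) trigger one expensive load, and a third call after freeHit(1) triggers a second. The practical consequence for callers is to finish all per-peptide work on a protein before freeing it.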
Calls to any function that takes a
OneInXprobRnd argument will be slow if 1) caching is in use and 2) the argument is not the same as when the cache file was created. Typically, this argument is
1 / results.getProbabilityThreshold(). The value is compared to what is saved in the cache file (within a certain precision), and if the argument differs, this may trigger an expensive loop over all queries.
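To make the "within a certain precision" comparison concrete, here is a minimal sketch of checking a requested one-in-X value against a stored one. The helper names and the relative tolerance of 1e-6 are assumptions chosen for illustration; the precision Parser actually uses is not stated in this document.

```cpp
#include <cassert>
#include <cmath>

// Convert a probability threshold (e.g. 0.05) to its "one in X" form (20).
double oneInX(double probabilityThreshold) {
    return 1.0 / probabilityThreshold;
}

// True when the requested value matches the cached one within a relative
// tolerance. The default tolerance is an assumption, not Parser's value.
bool matchesCachedValue(double requested, double cached,
                        double relativeTolerance = 1e-6) {
    const double scale = std::fmax(std::fabs(requested), std::fabs(cached));
    return std::fabs(requested - cached) <= relativeTolerance * scale;
}
```

With a cache built at a threshold of 0.05, a request for oneInX(0.05) matches the stored value of 20, whereas oneInX(0.001) does not and would take the slow path. Passing the same threshold that was used when the cache was created avoids that cost.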
The affected functions are:
Parser is backwards compatible with cache files from a number of previous versions. New functionality often necessitates adding more data to new cache files to improve performance. If this data is not present in the current cache file (e.g. the cache file is from one or two versions before the current version), it is read directly from the results file or calculated on the fly.
The enum ms_peptidesummary::BUGFIX_NUM maintains a list of new functionality and performance improvements where this may be the case. If you use cache files created by a previous version of Parser, it is good practice to call ms_peptidesummary::isDataCached() with the relevant bug number to discover whether the data needed by the method you use is indeed present in the cache file. If it isn't, you can avoid the function call, prepare a progress feedback screen or recreate the cache files.
The cache filename format is
[a-z0-9] will be replaced by exactly 26 numbers or lower-case letters created by calling getMD5Sum() on a string created using all parameters passed to the
ms_peptidesummary constructor, including the UniGene index file path.
If, for example, two reports are created for the same results file, one with a probability threshold of 0.05 and another with a threshold of 0.001, then separate cache files with unique filenames will be created, making it fast to switch between the two reports. Separate cache files are necessary, because changing any of the constructor parameters may change protein and peptide scoring, grouping, cut-off thresholds, etc.
The following flags do not affect the contents of the cache file and therefore are not included in the MD5Sum for the name:
The function ms_peptidesummary::getCacheFileName() can be called to retrieve the full or relative path to the cache file.
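The idea that every content-affecting constructor parameter contributes to the cache name, while content-neutral flags do not, can be sketched as follows. The struct, its fields, the flag bit and the key format are all hypothetical and chosen only to illustrate the principle; the real name is produced by getMD5Sum() over Parser's own parameter string.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical parameter bundle; field names are illustrative only.
struct SummaryParams {
    unsigned int flags;
    double minProbability;
    int maxHits;
    std::string unigeneIndexFile;
};

// Hypothetical bit for a flag that does not change the cached content.
const unsigned int kContentNeutralFlags = 0x8000u;

// Build the string that would be hashed to name the cache file.
// Content-neutral flag bits are masked out so they cannot change the name.
std::string cacheKey(const SummaryParams& p) {
    std::ostringstream key;
    key << (p.flags & ~kContentNeutralFlags) << '|'
        << p.minProbability << '|'
        << p.maxHits << '|'
        << p.unigeneIndexFile;
    return key.str();
}
```

Two parameter sets that differ only in minProbability (0.05 versus 0.001) produce different keys and therefore separate cache files, while toggling the content-neutral bit leaves the key unchanged so the existing cache is reused.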
There are two lines in the options section of mascot.dat which specify what each script should do:
ResfileCache master_results.pl,master_results_2.pl,peptide_view.pl...
ResultsCache master_results.pl,master_results_2.pl,peptide_view.pl...
Each script should see if it is listed in
ResfileCache, and if so, it should specify RESFILE_USE_CACHE when creating the
ms_mascotresfile object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResfileCache().
Each script should see if it is listed in
ResultsCache, and if so, it should specify MSPEPSUM_USE_CACHE when creating the
ms_mascotresults object. The value can be retrieved from the Options section of mascot.dat by calling ms_mascotoptions::getResultsCache().
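The "see if it is listed" check amounts to a whole-entry lookup in a comma-separated list. Below is a minimal sketch, assuming the option value is available as a plain string of script names separated by commas; isScriptListed is a hypothetical helper, not a Parser function.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if scriptName appears as a whole entry in a comma-separated list.
// Surrounding spaces and tabs around each entry are ignored.
bool isScriptListed(const std::string& commaSeparated,
                    const std::string& scriptName) {
    std::istringstream entries(commaSeparated);
    std::string entry;
    while (std::getline(entries, entry, ',')) {
        const std::size_t first = entry.find_first_not_of(" \t");
        if (first == std::string::npos) continue;  // skip blank entries
        const std::size_t last = entry.find_last_not_of(" \t");
        if (entry.substr(first, last - first + 1) == scriptName) return true;
    }
    return false;
}
```

A script would run this once against the ResfileCache value and once against the ResultsCache value, and add RESFILE_USE_CACHE or MSPEPSUM_USE_CACHE to its flags only when the corresponding check succeeds. Matching whole entries rather than substrings keeps master_results.pl from being treated as listed when only master_results_2.pl is present.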
To quickly load details for a single protein (e.g. for protein_view.pl), applications prior to Mascot Parser 2.3 would generally use the singleHit parameter. However, it is faster to use a cache file if it exists.
For ms_proteinsummary, no peptide summary cache is available, so continue to follow the instructions in Getting a single hit from the protein summary.
For ms_peptidesummary, the fastest method of loading details for a single protein is to use the cache if it already exists. For
protein_view.pl, it is likely that a cache has already been created by
master_results_2.pl. So, as long as the exact same flags and parameters for the
ms_peptidesummary constructor are used, access will be fast.
The recommended approach for protein_view.pl (or whatever application/script) is to use a cache file as described above.
This means omitting the singleHit parameter and using the same flags and parameters as in
master_results_2.pl to make sure an existing cache file is used. Make sure that MSPEPSUM_USE_CACHE is specified.
Copyright © 2016 Matrix Science Ltd. All Rights Reserved. Generated on Fri Jun 2 2017 01:44:51