Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by Richard Jacob (September 14, 2016)

Getting the most out of your Mascot Server hardware

A Mascot Server search consists of a number of separate stages. Once the input file has been uploaded to the server, Mascot starts by sorting the peak list by peptide molecular mass. Unless the peak list is very small, it also divides the peak list into chunks limited in size by either the number of bytes or the number of queries. Next comes the search itself, when the experimental masses in the individual spectra are compared with masses calculated from the database sequences. After all chunks have been processed, the results are consolidated and saved to the results file (the F*.dat file). For decoy searches, if requested, Mascot then runs Percolator, a semi-supervised machine learning algorithm, to improve the discrimination between correct and incorrect spectrum identifications. Finally, Mascot creates some cache files to improve performance when viewing large search results.

The main part of the search, when the mass values are calculated and scored, is embarrassingly parallel. This means that processing can be split into a number of independent tasks that run separately from each other with little or no communication between them. Modern multi-core CPUs, whether in a single machine or distributed across a cluster of computers, allow easy scaling of this part of the search. The result is that doubling the number of cores used approximately halves the time taken by this stage. The other stages (the sorting and splitting of the peak list, writing the result file, and Percolator post-processing) are currently single-threaded. These stages do not get faster with more cores, so they cap the overall improvement that further parallelization can deliver. This limit is described by Amdahl's Law. In particular, the time taken to sort the peak list is a function of its size, and it can account for quite a large portion of the total time when a very large peak list is searched against a small database.
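As a rough illustration of Amdahl's Law, the sketch below computes the overall speedup when only part of the run time scales with the number of cores. The 90% parallel fraction is a hypothetical figure chosen for the example, not a measured property of Mascot; the real fraction depends on the size of the peak list and the database.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Overall speedup when only `parallel_fraction` of the single-core
    run time is spread evenly across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical search: 90% of the single-core time is spent in the
# parallel scoring stage, 10% in the single-threaded stages.
for cores in (1, 4, 8, 16, 64):
    print(f"{cores:>2} cores: {amdahl_speedup(0.9, cores):.2f}x")
```

With a 10% serial share, the speedup can never reach 10x however many cores are added, which is why the single-threaded stages matter most for very large peak lists.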

Performance bottleneck

For the main part of the search, the performance bottleneck is normally the CPU. We have found that the PassMark PerformanceTest CPUBenchmark test is a reasonable indicator of Mascot performance.

We license Mascot Server by the ‘CPU’, where each CPU is good for 4 physical cores. If your processors have hyper-threading, we don’t count these additional logical cores as they only add about a 10% improvement to performance. In Windows, from the System Control Panel, you should be able to find the model and speed of the CPU that your Mascot Server is using. In Linux, you can use the following command line:

cat /proc/cpuinfo
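The output lists one block per logical processor, so hyper-threaded cores appear twice. A minimal sketch for counting physical cores, assuming the `physical id` and `core id` fields are present (they are on typical x86 Linux systems, but may be missing on other architectures or in some virtual machines). The function names and the sample text are illustrative only, and rounding up to whole licensed CPUs is an assumption based on the 4-cores-per-CPU rule above.

```python
import math


def count_physical_cores(cpuinfo_text: str) -> int:
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" makes sure the final block is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends the block for one logical processor.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


def licensed_cpus_needed(physical_cores: int, cores_per_license_cpu: int = 4) -> int:
    """Licensed 'CPUs' needed to cover every physical core (assumes rounding up)."""
    return math.ceil(physical_cores / cores_per_license_cpu)


# Made-up sample: one socket, two physical cores, hyper-threading enabled.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""

print(count_physical_cores(SAMPLE))  # 4 logical processors, 2 physical cores
```

On a real server, pass the contents of `/proc/cpuinfo` to `count_physical_cores` instead of `SAMPLE`.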

When determining the CPUBenchmark for your CPU, remember to divide the score by the number of cores, as Mascot Server licensing is core-based rather than socket-based. With newer CPUs that have more than 8 cores, we sometimes find that the performance per core goes down as the number of cores increases. This is because the CPU clock speed is limited by the amount of heat the CPU can dissipate, so it is not able to run all the cores at a high frequency at the same time.

A benchmark performance score of 2000 or more per core is very high performance, ideal for Mascot Server. Slower CPUs can still be practical choices, either because of their lower cost or because of the convenience of installing a multi-CPU Mascot Server license on a single computer.
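The per-core arithmetic can be sketched as follows. The two benchmark totals are hypothetical numbers chosen to show the calculation, not scores for any real processor; only the 2000-per-core threshold comes from the text.

```python
HIGH_PERFORMANCE_PER_CORE = 2000  # threshold quoted in the text above


def per_core_score(total_benchmark_score: float, physical_cores: int) -> float:
    """Divide a whole-CPU benchmark score by the number of physical cores."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return total_benchmark_score / physical_cores


# Hypothetical CPUs: (total benchmark score, physical cores).
candidates = {
    "16-core example": (24000, 16),
    "4-core example": (10000, 4),
}

for name, (total, cores) in candidates.items():
    score = per_core_score(total, cores)
    verdict = "very high" if score >= HIGH_PERFORMANCE_PER_CORE else "lower"
    print(f"{name}: {score:.0f} per core ({verdict})")
```

In this made-up comparison the 16-core part has the higher total score but the lower score per core, which is the figure that matters under core-based licensing.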

RAM

A Mascot Server needs a minimum of 16GB of RAM on a standalone computer or the head node of a cluster. With less than 16GB, Mascot may not be able to bring very large databases online. For larger Mascot Server licenses, with 3 or more CPUs, we recommend an additional 1GB of RAM per core. If the Mascot Server has insufficient RAM, the computer will start exchanging data between RAM and the disk drives (swapping), which increases the search time dramatically.
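One possible reading of this guideline, sketched below, is a 16GB base plus 1GB for every licensed core once the license reaches 3 CPUs, with 4 cores per licensed CPU. That interpretation is an assumption, and the function name is hypothetical; treat the result as a starting estimate rather than an official sizing rule.

```python
BASE_RAM_GB = 16  # minimum quoted in the text above


def recommended_ram_gb(license_cpus: int, cores_per_license_cpu: int = 4) -> int:
    """Estimate RAM in GB, ASSUMING 'additional 1GB per core' means
    16GB plus 1GB for every licensed core, applied from 3 CPUs upward."""
    if license_cpus < 1:
        raise ValueError("license_cpus must be at least 1")
    if license_cpus < 3:
        return BASE_RAM_GB
    return BASE_RAM_GB + license_cpus * cores_per_license_cpu


for cpus in (1, 2, 3, 4, 8):
    print(f"{cpus} CPU license: {recommended_ram_gb(cpus)} GB")
```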

SSD drives

A solid-state drive (SSD) is much faster than a traditional magnetic drive, so it enables a computer to start up very quickly. The computer will also ‘feel’ fast when doing many tasks. However, such drives are expensive and limited in size. As mentioned above, the performance bottleneck for the main part of the search is the CPU, so putting the sequence database files on an SSD may have limited benefit. The areas where an SSD can boost performance are the initial sorting and splitting of the input file and the writing of the result file at the end of the search. Putting the operating system, the Mascot program files, and the data directory on an SSD, while leaving the sequence files on a magnetic drive, may therefore represent a good compromise.
