Posted by John Cottrell (January 20, 2014)

How long should a search take? – Hardware and configuration

This question comes up quite frequently, sometimes in the form of "How large a licence do I need?" and sometimes as "My searches seem slower since I re-installed Mascot / updated to the new version / moved to new hardware." Resisting the temptation to reply "How long is a piece of string?", let’s review some of the factors that influence search time. This article addresses hardware and configuration aspects, and a future article will cover data and search parameters.

When things are working as they should, for a given processor architecture, search speed varies as the product of processor clock speed and the number of physical cores used for searching. That is, double the number of cores or double the clock speed and the time taken for a given search should drop by approximately a factor of two. It isn’t strictly linear because other constraints, such as memory bandwidth or disk access, can become significant. Also, there are activities at the beginning and end of a search that are not threaded, so that very short searches don’t scale well because a large proportion of the time is spent on reading the peak list, allocating memory, and writing out the result file.
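As a rough illustration of why short searches scale poorly, here is a minimal sketch that treats a search as a fixed, non-threaded overhead plus a threaded part that shrinks with the product of core count and clock speed. The function name, its parameters, and the example timings are all assumptions made for illustration; they are not measurements, and the model ignores memory bandwidth and disk effects.

```python
def estimated_search_time(threaded_s, overhead_s, core_ratio=1.0, clock_ratio=1.0):
    """Estimate search time in seconds after a hardware change.

    threaded_s  -- time spent in the threaded part on the reference machine
    overhead_s  -- non-threaded time (reading the peak list, allocating memory,
                   writing the result file); assumed not to scale at all
    core_ratio  -- new physical cores / old physical cores
    clock_ratio -- new clock speed / old clock speed
    """
    # Only the threaded part benefits from extra cores or a faster clock.
    return overhead_s + threaded_s / (core_ratio * clock_ratio)


# Hypothetical long search: 600 s threaded, 10 s overhead, cores doubled.
long_before = estimated_search_time(600, 10)
long_after = estimated_search_time(600, 10, core_ratio=2)

# Hypothetical short search: 10 s threaded, 10 s overhead, cores doubled.
short_before = estimated_search_time(10, 10)
short_after = estimated_search_time(10, 10, core_ratio=2)

print(f"long search speed-up:  {long_before / long_after:.2f}x")
print(f"short search speed-up: {short_before / short_after:.2f}x")
```

Under these made-up numbers, the long search gets close to the ideal factor of two, while the short search falls well short of it because the overhead dominates.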

When it comes to choosing a processor, there are far too many models for us to benchmark. It isn’t just a matter of comparing clock speed and number of cores; processor design and the type and amount of on-chip cache are also important. Fortunately, we’ve observed that the PassMark CPU benchmark is a pretty good guide to the performance you can expect for Mascot searches. As well as the outright performance league tables, it’s important to look at the price-to-performance lists, because you often pay a premium price, which may not be justified, for the top-end models. The other important listing is for single-thread performance, because Mascot Server licence cost depends on the number of cores you want to use for searches. Even if a particular processor with 16 cores was faster and cheaper than one with 4 cores, you have to factor in that you would need a 4 cpu Mascot licence to utilise all 16 cores, compared with a 1 cpu licence to saturate the 4-core processor.
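The licence arithmetic in that comparison can be sketched as follows. The figure of four cores per licensed cpu is inferred from the two examples above (16 cores needing a 4 cpu licence, 4 cores needing a 1 cpu licence) and is an assumption here; check your own licence terms. The function name is hypothetical.

```python
import math

# Assumption: inferred from the examples in the text, not taken from a licence.
CORES_PER_LICENSED_CPU = 4


def licence_cpus_needed(physical_cores):
    """Return the smallest licence size (in cpus) that covers every core."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return math.ceil(physical_cores / CORES_PER_LICENSED_CPU)


for cores in (4, 6, 16):
    print(f"{cores} cores -> {licence_cpus_needed(cores)} cpu licence")
```

A processor whose core count isn't a multiple of four is rounded up in this sketch, on the basis that a partly used licensed cpu still has to be paid for.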

We started with the caveat "When things are working as they should". How can you tell whether this is the case?

The first thing to check is database configuration. In Mascot Server 2.4, running on a single computer, the number of threads for each database should be set to -1 for ‘Auto’. Searches will use a thread for each logical core, up to the limit set by the licence. For example, a 2 cpu licence will use 8 threads for each search, or 16 if the cores are hyperthreaded. Unless you deliberately want to limit processor usage for a particular database, make sure this setting hasn’t been changed.
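To sanity-check the thread count shown for a database, the expected number under the ‘Auto’ setting can be estimated along these lines. This is only a sketch of the rule as described above, not Mascot's actual logic: the function name is hypothetical, and it assumes four physical cores per licensed cpu and two logical cores per hyperthreaded physical core.

```python
# Assumptions for illustration only; see the lead-in above.
CORES_PER_LICENSED_CPU = 4
LOGICAL_PER_HYPERTHREADED_CORE = 2


def expected_auto_threads(licence_cpus, physical_cores, hyperthreaded=False):
    """Estimate the threads a search would use with threads set to -1 (Auto)."""
    # The licence caps how many physical cores can be used for searching.
    usable_cores = min(physical_cores, licence_cpus * CORES_PER_LICENSED_CPU)
    per_core = LOGICAL_PER_HYPERTHREADED_CORE if hyperthreaded else 1
    return usable_cores * per_core


print(expected_auto_threads(2, 8))                      # 2 cpu licence, 8 cores
print(expected_auto_threads(2, 8, hyperthreaded=True))  # same, hyperthreaded
print(expected_auto_threads(1, 16))                     # licence is the limit
```

If the figure in Database Status is lower than this estimate, the thread setting has probably been changed from -1.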

There isn’t space here to go into cluster configuration or how the number of threads was configured in earlier versions of Mascot, but it’s always worth looking at Database Status to verify that the number of threads for each database is set correctly. For further information, refer to the Installation & Setup manual (linked from your local Mascot home page) and the Processors section of the PC hardware help page.

Searches only achieve full speed once the database being searched has been read into memory. In general, we advise that databases are memory mapped but not memory locked because most systems don’t have enough RAM to hold entire databases in memory while leaving enough free for searches and other processes. If you have huge amounts of RAM, once the database is read into memory, it will stay there whether it is locked or not, because there is no reason for the operating system to swap it out. Hence, the main benefit of memory locking when you have lots of RAM is that the database is read into memory when Mascot Monitor starts up, rather than the first time the database is searched. So, unless you have the database locked, when running a benchmark, run the search multiple times and ignore the timing for the first run, because this may be slower due to having to read the database into memory.
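The benchmarking advice above can be scripted. Here is a minimal sketch that times a search several times and reports the first run separately from the rest. `run_search` is a hypothetical callable that you would supply yourself; it should submit the search and block until it has finished. How you submit a search is outside the scope of this sketch.

```python
import statistics
import time


def benchmark(run_search, repeats=4):
    """Time run_search() repeatedly, treating the first run as a warm-up."""
    if repeats < 2:
        raise ValueError("need at least 2 runs: one warm-up and one timed")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search()
        timings.append(time.perf_counter() - start)
    # The first run may include reading the database into memory, so it is
    # reported on its own and left out of the summary figure.
    return {
        "first_run_s": timings[0],
        "warm_median_s": statistics.median(timings[1:]),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs as-is; replace with a real search.
    result = benchmark(lambda: time.sleep(0.01), repeats=3)
    print(result)
```

The median of the later runs is used rather than the mean so that a single run disturbed by other activity on the server doesn't skew the result.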

With the database in memory, unless there are other processor intensive applications running on the computer, you should see the bulk of the processor time being used for the Mascot search (or searches). These show up in Windows Task Manager or Linux top as nph-mascot.exe. If the combined searches are getting less than 80% of the processor time, this may indicate a problem, such as insufficient RAM. If there isn’t enough RAM for all the active processes, the operating system has to swap pages out to disk, which is very slow compared with reading and writing RAM. Processor usage can drop to very low levels and, in a quiet room, you may even hear the overworked disk ‘thrashing’. For serious searching, a 64-bit operating system and plenty of RAM are just as important as the processor speed and number of cores.
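If you log processor usage while searches are running, the 80% rule of thumb is easy to apply automatically. The sketch below assumes you have already collected (process name, share of total processor time in percent) pairs, for example by reading them off Task Manager; note that Linux top may report usage per core rather than as a share of the whole machine, so normalise first. The function names and the sample figures are made up for illustration.

```python
SEARCH_PROCESS_PREFIX = "nph-mascot"  # matches nph-mascot.exe
MIN_EXPECTED_SHARE = 80.0             # rule of thumb from the text, in percent


def mascot_cpu_share(process_shares):
    """Sum the share of total processor time used by Mascot searches."""
    return sum(
        share
        for name, share in process_shares
        if name.startswith(SEARCH_PROCESS_PREFIX)
    )


def searches_look_starved(process_shares, threshold=MIN_EXPECTED_SHARE):
    """Return True if the combined searches get less than the threshold."""
    return mascot_cpu_share(process_shares) < threshold


# Hypothetical snapshot: two searches running alongside another process.
snapshot = [("nph-mascot.exe", 45.0), ("nph-mascot.exe", 40.0), ("other", 5.0)]
print(mascot_cpu_share(snapshot), searches_look_starved(snapshot))
```

A low figure is only a hint that something is wrong. As noted above, it doesn't apply when other processor intensive applications are legitimately running on the same computer.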

Other factors that can impact performance include slow disks or faulty RAID arrays, using remote storage when the network isn’t up to it, and anti-virus software trying to perform real-time scanning of every changed file. Large solid state disks are becoming more affordable, and can be much faster than traditional hard drives. If you have to be selective about which files go on the SSD, then the files to leave on conventional storage are older result files, which are less likely to be viewed again, and sequence databases, provided you have plenty of RAM. Running Mascot in a virtual machine introduces additional complexity, and can have a negative effect on performance, as discussed in an earlier blog article.

Finally, if comparing speed across different versions of Mascot, remember that new features sometimes speed things up or slow things down. For example, in early versions of Mascot, the unknown residue X was assigned a nominal mass, which meant that peptide sequences containing X were essentially not matchable. One of the changes in version 2.0 was that, if the peptide sequence had a single X, the code would try all 20 residues at that position to see which gave the best match. This clearly has the potential to slow things down according to the abundance of X in the database. An example of a change that speeded things up was the introduction of exclusive modifications in Mascot 2.2. Prior to this, isotopic labels such as SILAC had to be specified as variable modifications.
