Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by John Cottrell (February 1, 2013)

Don’t get stuck in a queue

Mascot Server doesn’t queue searches. This isn’t because we can’t be bothered to write the code. It’s because we don’t believe queuing would improve performance or usability.

The times taken for individual searches cover a very wide range. A PMF or a small MS/MS search with few variable modifications and tryptic specificity might take seconds. A large MS/MS search of a large database with semi-specific or non-specific enzyme and many variable modifications might take days or weeks on the same hardware. Someone planning a large, multi-user system might naturally think that it would be a good idea to segregate searches into queues, according to predicted search time, so that short searches could be given a higher priority or large ones only allowed to run at nights or weekends, when the system was idle.

Could this be done? Well, the major factors that influence MS/MS search time are:

  1. Number of spectra
  2. Size of database in residues or bases after taxonomy filter
  3. Precursor m/z tolerance
  4. Enzyme specificity and missed cleavages
  5. Number and type of variable modifications

If we are happy with an order of magnitude approximation, we can assume search time is proportional to each of the first three. Enzyme specificity determines the number of candidate peptides for each spectrum, and this is easily estimated. The last factor, variable modifications, is trickier. A modification like Gln->pyro-glu has little effect, creating one additional peptide for every 20, to a first approximation. A modification like Phospho (ST) has a more dramatic effect, particularly for large peptides, which have a high probability of containing multiple modifiable residues. The killer is the combinatorial explosion we see when several variable modifications are specified. Again, this is not difficult to estimate. Our predicted search times would not be very accurate, but "good enough for Government work".
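To put rough numbers on that combinatorial explosion, here is a minimal counting sketch. It assumes every modifiable residue is independently either unmodified or carrying one of the applicable modifications, with no cap on the number of modified sites per peptide, and it ignores terminal-only modifications such as Gln->pyro-glu. It is not how Mascot enumerates candidates; the function name and the example peptide are made up for illustration.

```python
def count_modified_forms(peptide, variable_mods):
    """Count the distinct modification states of a single peptide.

    variable_mods maps a modification name to the residues it can occupy,
    e.g. {"Phospho (ST)": "ST"}. Each residue contributes one state for
    "unmodified" plus one state per modification that can sit on it.
    """
    forms = 1
    for residue in peptide:
        applicable = sum(residue in targets for targets in variable_mods.values())
        forms *= 1 + applicable
    return forms


# Hypothetical 10-residue peptide with 3 x S, 2 x T, 2 x M, 1 x N, 1 x Q.
peptide = "SMTSNQTMSK"
mods = {}
for name, targets in [("Phospho (ST)", "ST"),
                      ("Oxidation (M)", "M"),
                      ("Deamidated (NQ)", "NQ")]:
    mods[name] = targets
    print(len(mods), "variable mod(s):", count_modified_forms(peptide, mods), "forms")
# 1 variable mod(s): 32 forms
# 2 variable mod(s): 128 forms
# 3 variable mod(s): 512 forms
```

Under these assumptions, each extra modifiable site doubles the count, which is why long peptides and several variable modifications together are so expensive.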

So, maybe all searches lasting more than 2 hours go into the long queue, only running when the system is idle. My important search, predicted to last 3 hours, goes into the queue behind ten searches from the fool down the hallway, each predicted to last 2 days, and maybe I’ll see my results in 10 weeks. Obviously, I’m not happy about this, so I pester the system administrator to let me jump the queue, or to create finer grained queues, or to take other considerations into account, like seniority in the lab.

Hard to imagine a system that would please everyone. Which is fine, because there really isn’t any good reason to use queues.

By default, all Mascot searches run in parallel and get an equal slice of processor time. For example, imagine 3 searches which, if running by themselves, would take 10 seconds, 10 minutes, and 10 hours. If these searches were submitted so as to start running at exactly the same time, there would be 3 running to begin with, each getting 1/3 of the processor time. The short search would complete after 30s. Then there are 2 searches running, and the medium search will take a further 19m 40s, completing after 20m 10s. The long job now has the system to itself, and completes after a total of 10h 10m 10s.
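The arithmetic in this example can be checked with a few lines of code. This is a sketch of idealised equal sharing only: it assumes all searches start at the same instant, that processor time is divided evenly among whatever is still running, and that there is no overhead. The function names are hypothetical and nothing here reflects Mascot's actual scheduler.

```python
def shared_completion_times(solo_seconds):
    """Completion time of each search under equal processor sharing.

    solo_seconds[i] is how long search i would take with the system to
    itself. All searches are assumed to start together at time zero.
    """
    order = sorted(range(len(solo_seconds)), key=lambda i: solo_seconds[i])
    finish = [0] * len(solo_seconds)
    clock = 0      # wall-clock time elapsed
    served = 0     # processor time each still-running search has received
    running = len(solo_seconds)
    for i in order:
        # The next search to finish needs (solo - served) more processor
        # time, and it only gets 1/running of the machine.
        clock += (solo_seconds[i] - served) * running
        served = solo_seconds[i]
        finish[i] = clock
        running -= 1
    return finish


def hms(seconds):
    """Format a number of seconds as hours, minutes, seconds."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


for t in shared_completion_times([10, 10 * 60, 10 * 3600]):
    print(hms(t))
# 0h 0m 30s
# 0h 20m 10s
# 10h 10m 10s
```

The searches finish in order of size, and the longest one pays only the combined solo time of the shorter ones as a penalty.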

The net result is that all the searches took longer than if they had the system to themselves, but they came out in the right order and the search times were not increased out of all proportion. The short search was hit hardest, but this is a very artificial example. On a real system, there would be a stream of new searches of various sizes arriving at near random intervals. If the average number of searches running was 3 then all searches would take roughly 3 times as long as if they had the system to themselves.

This seems like a reasonably equitable arrangement, and doesn’t require complex planning or administration. If you find you need to tweak it, there are a number of mechanisms for this purpose.

Go to the Mascot database status page, click on a database, and you get a list of the searches running against that database. Click on a search, and you find controls to change the priority of the search, pause it, or kill it. If you have long searches that are not urgent, they can be given a low priority such that they only get significant processor time when the system would be otherwise idle.

If Mascot security is enabled, the default priority, which is also the highest priority, can be set differently for different groups. Users can still decrease the priority for their non-urgent jobs, but they cannot increase it above their limit. Mascot security also allows you to set limits at group level on maximum search time, the number of concurrent searches per user, and all of the factors that influence how long searches might take (except precursor tolerance).

You can also use Mascot Daemon to run low priority jobs at quiet times (Start at). Search priority can be specified in the Daemon task editor. Remember that searches within a Daemon task are run serially while Daemon tasks run in parallel, so it is sometimes worth splitting a large set of files across multiple tasks.

The maximum number of simultaneous searches on a Mascot Server is a global setting in mascot.dat: MaxConcurrentSearches. On installation, this is set to a default of 10, and most systems never hit this limit. But, if you have a large and busy system, with plenty of free memory, and people sometimes have searches refused, you might consider increasing it.


3 comments on “Don’t get stuck in a queue”

  1. doufeia said:

    Is it true that GPU computing is much much faster than CPU computing search? Please correct me if I am wrong. Latest GeForce GTX Titan has 4.5 Tflops power, while Intel Core i7-3770k has only 100Gflops (0.1Tflops) power. Also the video cards can work in 4x mode.

  2. dtrudg said:

    Unfortunately the lack of any kind of queue causes severe headaches when a Mascot Server is shared between a large number of potential users who often have large multi-file datasets, and/or it is used as a search tool within a pipeline of other tools. Priorities only go so far – people are always unlikely to downgrade their own searches, and I don’t think Mascot can automatically do it based on number of searches that person has submitted etc.

    Relatively frequently someone working with huge fractionated datasets will want to search all of the files for PTMs etc. If submitted all at once the searches would swamp the server, drastically slowing down searches from other users. Therefore, the number of concurrent searches per user and total searches must be limited. The user then has to either manually feed the server a new file each time one finishes via the web interface, or use Mascot Daemon (not possible from Mac or Linux).

    If building Mascot into a pipeline, that pipeline must queue searches for submission to Mascot, so that it does not fall foul of the Mascot concurrent search limits.

    With a heavily shared Mascot install there is always a queue somewhere – whether it is a manual queue to submit your 100 files via the web interface, a queue of those files in Daemon, or a queue in a pipeline that sends searches to Mascot. The most natural place to implement a queue is on the Mascot Server, where it can be administered easily and benefits all situations. Users could submit as many files as they want through the web interface or by other means, turn off their PC and go home. In the morning, or next week, it’s all done.

    • John Cottrell said:

      Daemon was developed for exactly the situation you describe, where someone wants to queue up a number of searches on Friday night, then come back on Monday morning to pick up the results. You say “not possible from Mac or Linux”, but is this really such a limitation? The reason we have resisted porting Daemon to other OS is that all MS data systems are Windows based and hence most data import filters are only available for Windows. A version of Daemon that is only good for peak lists hardly seems worth the effort. In an organisation where Windows PCs are scarce or unpopular, Daemon can always be run on the instrument acquisition PC or in a VM under OS X or Linux.

      Similar considerations apply to moving Daemon functionality to the Mascot Server. We could develop a web-browser interface that allowed a user to upload a set of peak lists and associate them with a set of search parameters. But, even if the Mascot Server was Windows based, I don’t think it would be practical to move the data import filters to the server, if only because it means uploading large numbers of raw files, which are at least an order of magnitude larger than peak lists. This, together with an increasing interest in quantitation, means that client-side automation seems the more useful arrangement.
