Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Hardware virtualisation

Overview

One of the most significant trends in IT in recent years has been the shift towards virtual machines (VMs). Hardware virtualisation can offer a host of advantages such as consolidation, elastic provisioning, high availability, disaster recovery, multi-tenant isolation and legacy OS support, at the expense of some performance impact and overall system complexity.

At an elementary level, Mascot Server, Mascot Distiller and Mascot Daemon are trivial to install and run in a virtual machine. They are installed like any other ‘shrink-wrap’ software, and there is nothing in the software that would prevent running it in a virtualised environment. However, getting optimal performance requires careful resource allocation, and this forms the bulk of the complexity.

Hardware virtualisation is distinct from software containers, also known as OS-level virtualisation. Mascot Server and Mascot Distiller do not currently support software containers like Docker. The rest of this article discusses hardware virtualisation.

Mascot Server

The starting point for specifying the VM is the PC specification for Mascot Server. The basic requirements are the same: one or two fast processors, at least 16GB of RAM (preferably 64GB) and plenty of disk space (at least 1TB) located on fast disks. It’s beneficial to put the operating system and Mascot program files on a solid state drive if one is available. It’s best to avoid processors with a very large number of cores, as power management tends to throttle the clock frequency progressively as the workload increases. Beyond that, the three categories – processor, memory, disk – need special consideration when virtualised.

Processor mapping and vCPUs

A database search is typically CPU-bound, meaning the process spends most of its time calculating, as opposed to loading and saving data. A Mascot licence is priced according to the number of processor cores available for database searching. Unless the licence is for many CPUs, the cost of the licence is likely to be greater than the cost of the hardware. This makes it important to choose fast hardware and make maximum use of it.

Mascot licensing is in units of 1 CPU = 4 cores. When the processor is virtualised, the hypervisor typically maps physical cores or hardware threads into vCPUs, so 1 licensed CPU = 4 vCPUs. For licensing purposes, it does not matter how the cores are distributed among virtualised processors or how many virtualised processors are mapped to the VM. A 2-CPU licence is good for 8 vCPUs, which can be configured as 1x 8-core virtual processor, 2x 4-core, 3x 3-core (9 vCPUs in total, leaving one spare vCPU beyond the 8 that are licensed), and so on.
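The licensing arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 1 CPU = 4 vCPUs rule stated in the text; the function names are hypothetical and are not part of any Mascot tool.

```python
CORES_PER_LICENSED_CPU = 4  # from the text: 1 licensed CPU = 4 cores = 4 vCPUs


def licensed_vcpus(licensed_cpus: int) -> int:
    """Number of vCPUs covered by a licence for `licensed_cpus` CPUs."""
    return licensed_cpus * CORES_PER_LICENSED_CPU


def spare_vcpus(licensed_cpus: int, virtual_processors: int, cores_each: int) -> int:
    """vCPUs in the VM beyond the licence (negative if the VM has fewer)."""
    return virtual_processors * cores_each - licensed_vcpus(licensed_cpus)


if __name__ == "__main__":
    # The three layouts mentioned in the text for a 2-CPU licence.
    for processors, cores in [(1, 8), (2, 4), (3, 3)]:
        total = processors * cores
        spare = spare_vcpus(2, processors, cores)
        print(f"{processors}x {cores}-core: {total} vCPUs, {spare} spare")
```

A positive spare count means the VM has vCPUs left over for operating system overhead; a negative count means the licence is not fully used.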

The host hardware should obviously have at least as many cores as the Mascot licence. Additionally, it is very important to pay attention to the vCPU mapping, which is hypervisor specific. Intel processors support Hyper-Threading, which allows the host system to run up to two hardware threads on a single physical core. Recent AMD processors have Simultaneous Multithreading, which is equivalent. The default with many hypervisors is to map the physical core as two “independent” vCPUs: one vCPU for the physical core and one vCPU for the hyperthread. On bare metal hardware, hyperthreading can give up to a 12% performance increase. On virtualised hardware, however, it can be detrimental to Mascot performance.

For example, suppose the host machine has two 4-core processors, each with hyperthreading enabled, so the host has 16 hardware threads. These are typically mapped into vCPUs in linear order:

  Processor 1, core 1, HT 0 = vCPU 0
  Processor 1, core 1, HT 1 = vCPU 1
  Processor 1, core 2, HT 0 = vCPU 2
  Processor 1, core 2, HT 1 = vCPU 3
  Processor 1, core 3, HT 0 = vCPU 4
  ...
  Processor 2, core 4, HT 0 = vCPU 14
  Processor 2, core 4, HT 1 = vCPU 15

Suppose you have a 1-CPU licence. In the first VM configuration, the first four vCPUs (0-3) are mapped to the VM. In the second VM, only the physical cores are mapped (vCPUs 0, 2, 4, 6). The first VM is equivalent to a 2-core processor with hyperthreading enabled, while the second is equivalent to a 4-core processor without hyperthreading. Thus, database searches in the second VM could be 1.5-2 times faster, even though their configuration appears to be the same. For this reason, it is beneficial to ‘pin’ the Mascot Server VM to specific vCPUs that map to physical cores.
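The difference between the two VM configurations can be made concrete with a short Python sketch. It assumes the linear numbering shown in the example above; real hypervisors may number vCPUs differently, so treat the mapping function and its names as hypothetical and check your hypervisor's own topology report before pinning.

```python
def linear_vcpu_map(processors: int, cores_per_processor: int,
                    threads_per_core: int = 2) -> dict:
    """Return {vcpu: (processor, core, hw_thread)} in linear order.

    Assumption: the hypervisor enumerates hardware threads linearly,
    as in the example in the text.
    """
    mapping = {}
    vcpu = 0
    for proc in range(1, processors + 1):
        for core in range(1, cores_per_processor + 1):
            for thread in range(threads_per_core):
                mapping[vcpu] = (proc, core, thread)
                vcpu += 1
    return mapping


def physical_core_vcpus(mapping: dict, count: int) -> list:
    """First `count` vCPUs that sit on hardware thread 0 of a core."""
    return [v for v in sorted(mapping) if mapping[v][2] == 0][:count]


def distinct_cores(mapping: dict, vcpus: list) -> int:
    """Number of distinct physical cores behind a set of vCPUs."""
    return len({mapping[v][:2] for v in vcpus})


if __name__ == "__main__":
    host = linear_vcpu_map(processors=2, cores_per_processor=4)
    first_vm = [0, 1, 2, 3]                  # first four vCPUs
    second_vm = physical_core_vcpus(host, 4)  # physical cores only
    print("First VM :", first_vm, "->", distinct_cores(host, first_vm), "cores")
    print("Second VM:", second_vm, "->", distinct_cores(host, second_vm), "cores")
```

Both VMs see four vCPUs, but the first is backed by two physical cores and the second by four.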

Alternatively, when hyperthreading is enabled on the host, map twice as many vCPUs to the VM as you have licensed cores (vCPUs 0-7 in the above example). The operating system in the VM should then have a chance to schedule processing time appropriately among the physical cores, leaving only a small performance impact. It is generally useful to have more cores than the licence, as the extra cores are used for operating system overhead and generating search reports.

Processor sharing

Virtualisation is an additional layer of software that can cause some reduction in efficiency, but let’s assume that this effect is negligible for the moment. A much more important factor is whether there are other VMs running simultaneously on the same host machine. These may be competing with Mascot for processor time unless isolated appropriately.

For example, if you have a 3-CPU (12 core) Mascot licence and a physical server with 16 cores, the setup should be perfectly fine as long as the other VMs never require more virtual CPUs than the 4 ‘spare’ unlicensed cores.

An extreme example of poor performance is a 6-CPU licence configured in cluster mode with 3 VMs, each set up as a 2-CPU (8 core) search node, on host hardware that only has two 4-core processors. In that case, 24 vCPUs compete for 8 physical cores. Everything would appear to run OK, but the speed of the system would almost certainly be worse than a 2-CPU licence running in a single VM.
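A quick way to sanity-check a planned layout is to compare the total number of vCPUs across all VMs with the number of physical cores on the host. The Python sketch below applies this to the two examples above; the function name is hypothetical, and it ignores hyperthreading and any hypervisor-level reservations.

```python
def oversubscription_ratio(vm_vcpus: list, host_cores: int) -> float:
    """Total vCPUs assigned to all VMs divided by physical host cores.

    A ratio above 1.0 means VMs can end up competing for processor time.
    """
    return sum(vm_vcpus) / host_cores


if __name__ == "__main__":
    # 3-CPU licence (12 vCPUs) plus other VMs limited to 4 vCPUs, 16-core host.
    print(oversubscription_ratio([12, 4], host_cores=16))

    # 6-CPU licence as three 8-vCPU search nodes on two 4-core processors.
    print(oversubscription_ratio([8, 8, 8], host_cores=8))
```

The first layout gives a ratio of 1.0, so every vCPU has a physical core behind it. The second gives 3.0, which is why the cluster would be slower than a single 2-CPU VM.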

Random access memory (RAM)

Search speed depends heavily on the sequence database files being held in memory during the search. This makes it important that the VM is given access to as much physical RAM as practical. Commonly, the default VM configuration assigns a limited amount of memory to each VM on the assumption that it will be one of many. We recommend at least 16GB of RAM for Mascot Server, preferably 64GB. If you’re setting up a cluster of VMs, allocate at least 16GB per cluster node.

Some hypervisors, such as Microsoft Hyper-V, support memory ballooning. This means you set a minimum, initial and maximum amount of RAM, and the hypervisor grows or shrinks the RAM allocated to the VM according to usage, staying within the minimum and maximum. Memory ballooning can have a severe, detrimental impact on Mascot Server performance, so we recommend always allocating a fixed amount of RAM that is not shared with any other VM.

Disk size and speed

Storage arrangements can also impact Mascot performance. If several VMs have their virtual disks on a single physical drive, this creates a potentially significant bottleneck when two or more VMs are doing I/O at the same time. For example, VMware hypervisors measure I/O contention in IOPS (number of I/O operations per second), and there is usually a hard limit to how many IOPS a single hardware device can handle.

You also need to plan for enough future storage capacity for sequence databases (some of which grow in size each month) and your search results and related cache files, which are all under the Mascot ‘data’ directory. A suitable backup regime is highly recommended.

It is beneficial to allocate an independent virtual disk for the Mascot Server VM, and store it physically separate from the other virtualised disks. Keep the OS and Mascot program files on shared storage, and the bulk files on the independent disk.

We have also found that storing VM disks on a RAID50 or RAID10 array provides decent performance for Mascot Server even if several other VMs use the same RAID array. In this case, the RAID controller can be used for creating a separate RAID volume within the array (confusingly also called a virtual disk), which is ideal for storing the independent virtual volume for Mascot.

VM snapshotting

A word of caution: Don’t be tempted to use VM snapshot functionality as a data backup mechanism. Firstly, if the disk on which the snapshots are stored dies, you lose everything. Secondly, using snapshot volumes can incur a performance penalty even during normal system operation. The very large sequence database, taxonomy and ‘data’ directories should be kept on snapshot-independent volumes, as otherwise snapshot operations will end up being incredibly slow and you’re likely to lose search results when switching back to an earlier snapshot. Thirdly, the use of snapshots in a production environment should generally be limited to very short term rollbacks for various important reasons beyond Mascot.

Mascot Distiller

Processor mapping and vCPUs

Mascot Distiller is licensed by instance and will use all available cores for data processing. The optimal number of vCPUs can only be found by experimentation; see the example in Choosing hardware for Mascot Distiller. Start with 8-10 vCPUs. As you add more vCPUs, there is a point of diminishing returns, where the process turns from CPU-bound into disk-bound. This point is specific to the host system, disk speed, interference from other VMs, etc.
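One way to organise the experiment is to process the same dataset with several vCPU counts, record the elapsed time for each, and look for the point where adding vCPUs stops paying off. The Python sketch below shows one possible rule; the 5% threshold and the function name are assumptions, and the timings in the example are placeholder values for illustration only, not measurements.

```python
def diminishing_returns_point(timings: dict, min_gain: float = 0.05) -> int:
    """Return the smallest tested vCPU count after which the next step up
    improves elapsed time by less than `min_gain` (a fraction).

    timings: {vcpu_count: elapsed_seconds} from your own test runs.
    """
    counts = sorted(timings)
    for current, larger in zip(counts, counts[1:]):
        gain = (timings[current] - timings[larger]) / timings[current]
        if gain < min_gain:
            return current
    # Every step still helped: the largest tested count is the best so far.
    return counts[-1]


if __name__ == "__main__":
    # Placeholder numbers, NOT real measurements; substitute your own.
    placeholder_timings = {8: 100.0, 10: 85.0, 12: 75.0, 16: 73.0, 24: 72.5}
    print(diminishing_returns_point(placeholder_timings))
```

If the function returns the largest count you tested, the process is probably still CPU-bound and it is worth testing with more vCPUs.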

If the Distiller VM has more than 64 vCPUs, you will need to configure Windows (within the VM) to split the vCPUs into separate processor groups. The details can be found in Choosing hardware for Mascot Distiller. The advice regarding hyperthreading and vCPU mapping is otherwise the same as with Mascot Server.
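As a rough guide to how many processor groups to expect, the Python sketch below relies on the Windows limit of 64 logical processors per group. It gives a lower bound only: Windows assigns groups based on the (virtual) NUMA topology, so the actual number can be higher. The function name is hypothetical.

```python
import math

MAX_LOGICAL_PROCESSORS_PER_GROUP = 64  # Windows processor group limit


def min_processor_groups(vcpus: int) -> int:
    """Lower bound on the number of Windows processor groups for a VM."""
    return math.ceil(vcpus / MAX_LOGICAL_PROCESSORS_PER_GROUP)


if __name__ == "__main__":
    for vcpus in (32, 64, 65, 128):
        print(vcpus, "vCPUs ->", min_processor_groups(vcpus), "group(s) at least")
```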

Processor sharing

Mascot Server and Mascot Distiller can be installed on the same VM or different VMs. If they are installed on the same VM, the two programs will obviously compete for CPU time. If you typically run peak picking and the database search sequentially (for example, one project at a time), then the two programs will not overlap in time and there is no issue installing them on the same VM.

If Mascot Server and Mascot Distiller are installed in different VMs on the same host, then the advice about processor sharing is the same as with Mascot Server. For example, if your Mascot Server licence is for 3 CPUs (12 cores) and the host machine has 16 cores, then the Distiller VM should be configured with 4 cores.

Random access memory (RAM)

We recommend at least 32GB of RAM for Mascot Distiller. Aim for 64GB if you process very large replicate (label-free quantitation) datasets. The advice regarding Mascot Server and memory ballooning applies.

Disk size and speed

You will need a decent amount of storage, probably 2-3TB to start with. There should be at least enough local storage for about twice the size of a project’s raw files. Distiller can open raw files from network storage, but this usually has a performance impact. The advice is otherwise the same as disk allocation for Mascot Server.

Mascot Daemon

Mascot Daemon does not have special hardware requirements. It can be installed in any virtualised Windows environment. If you are using the Daemon Toolbox with Mascot Distiller, then Distiller and Daemon need to be installed in the same virtual machine, as Daemon will call the Distiller executable to batch process raw files.