Posted by Ville Koskinen (June 17, 2021)

NCBI nr tips

NCBI nr is the comprehensive database of non-identical protein sequences compiled by the National Center for Biotechnology Information. The database is available in Mascot Server as the NCBIprot predefined definition. Because of its huge size, nr should be treated as the database of last resort. The January 2021 version contains 338 million protein sequences in a 158 GB FASTA file, and the database has grown since then. There is usually a better choice of database, such as a UniProt complete proteome for your organism of interest. On the other hand, there are times when nr is the only option. We’ve collected tips and solutions to common problems below.

Disk performance and size

Mascot Server has no built-in limit on database size. The main issues are hardware limitations.

First, check what type of disks your Mascot Server has. NCBIprot is so large that compressing the database tends to be limited by disk throughput and random access speed. The best possible setup is a RAID10 array, where both reads and writes are essentially parallelised between the disks in the array. The cheaper option is a data centre grade NVMe or SSD disk. Last come consumer-grade SSDs and traditional HDDs. We’ve heard some reports that bringing a recent nr online can take well over two weeks on a traditional HDD.

The second important factor is disk space. As a rule of thumb, you will need enough space to store three copies of the sequence database. Using January 2021 numbers, the gzipped FASTA file is about 90 GB, the uncompressed FASTA file 158 GB and the index files created by Mascot about 170 GB, for a total of 418 GB. On top of that, there are taxonomy files (about 20 GB at the time of writing), and if you’re updating from an earlier version of NCBIprot, you also need space to store the previous compressed files and previous FASTA file while the new one is being compressed. We recommend reserving at least 800 GB of disk space for NCBIprot.
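The sum above is easy to redo for a newer release. The following is a minimal shell sketch, not a Mascot tool: the function names are hypothetical, sizes are passed as whole gigabytes, and the free-space check assumes a POSIX df whose fourth column is the available space in kilobytes.

```shell
# Sum component sizes (whole GB) for one NCBIprot release.
# estimate_total_gb is a hypothetical helper, not part of Mascot.
estimate_total_gb() {
    total_gb=0
    for size_gb in "$@"; do
        total_gb=$((total_gb + size_gb))
    done
    echo "$total_gb"
}

# Print the free space, in whole GB, on the filesystem holding directory $1.
# Assumes POSIX "df -Pk" output: one header line, then available KB in column 4.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Succeed if directory $1 has at least $2 GB free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}
```

With the January 2021 figures, estimate_total_gb 90 158 170 gives 418, and adding the 20 GB of taxonomy files gives 438. A check such as has_free_gb /path/to/sequence 800 (the path is a placeholder) compares the disk against the recommended reserve.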

A third factor is virtualisation. Mascot can be installed in a virtual machine, but several considerations affect virtualised disk performance: the virtualisation technology, the underlying host hardware, how the virtual disk is configured, whether several VMs share the same host disk and whether snapshots are in use. If possible, it’s best to dedicate a host disk to NCBIprot, and make sure it’s configured as a snapshot-independent (persistent) volume.

Downloading

The NCBIprot predefined definition shipped with Mascot uses an FTP URL to download the FASTA file. FTP (File Transfer Protocol) is simple but not very robust, and the protocol has no built-in error correction. The larger the file being transferred or the longer the transfer takes, the higher the chance of something going wrong. We’ve heard from a number of customers around the world about download errors with the FASTA file. Sometimes these are caused by unexpected termination of the FTP server connection, sometimes by network glitches or variable download speed.

If you are experiencing FTP connection errors or the download never succeeds through Database Manager, the workaround is to use rsync. Support for rsync is briefly mentioned in the NCBI download FAQ and deserves to be better known. rsync is a well-established tool in the Unix/Linux world, and it handles error correction, resuming a download and even transferring just the changed bytes in big binary files. We’ve also tried file transfers using HTTPS. Error detection is better than with FTP, but error correction is no better, so we recommend giving rsync a try.
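Because rsync can resume a partial transfer, an interrupted download can simply be run again. The sketch below wraps any command in a retry loop. retry_command is a hypothetical helper, and the attempt limit and pause are arbitrary choices rather than values recommended by NCBI or Matrix Science.

```shell
# Run a command repeatedly until it succeeds or the attempt limit is reached.
#   $1 = maximum number of attempts
#   $2 = seconds to wait between attempts
#   remaining arguments = the command to run
# retry_command is a hypothetical helper for illustration.
retry_command() {
    max_attempts=$1
    pause_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Only pause if another attempt will follow
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$pause_seconds"
        fi
    done
    return 1
}
```

Used with the rsync command from the Linux steps, this might look like: retry_command 10 60 rsync -av --progress --partial --partial-dir=rsync.tmp rsync://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz . Each new attempt continues from the partial file kept in rsync.tmp.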

Downloading with rsync on Linux

If you have a Linux PC, it may already have rsync installed, or it can be trivially installed from the distribution repository.

Steps to update taxonomy files:

  1. Update SwissProt through Database Manager. This will download and extract the latest taxdump.tar.gz, which is also used by NCBIprot.
  2. Download prot.av2taxid.gz from https://s3.amazonaws.com/matrixsciencemisc (can be done with a web browser).
  3. Decompress prot.av2taxid.gz and copy it to the mascot/taxonomy directory.

Steps to update the FASTA file:

  1. Enable NCBIprot through Database Manager if not already enabled.
  2. Cancel any download tasks.
  3. Open a shell and change directory to NCBIprot/incoming.
  4. Download the file:
    rsync -av --progress --partial --partial-dir=rsync.tmp rsync://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz .
    The command can be cancelled and resumed if needed.
  5. Decompress the downloaded file: gunzip nr.gz
  6. Rename the file nr to NCBIprot_YYYYMMDD.xyzzy, using today’s date (for example, NCBIprot_20210611.xyzzy).
  7. Move the .xyzzy file to NCBIprot/current.
  8. Rename the .xyzzy file to .fasta.

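The final rename is important. Mascot only picks up files with the .fasta extension, so the file stays invisible to it under the temporary .xyzzy name until the move has finished. If you simply copy a .fasta file to the current directory, Mascot might start compressing the database before the file copying has finished.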
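Steps 5 to 8 can be sketched as a single shell function. stage_fasta and its arguments are hypothetical names used for illustration. The directory paths depend on your installation, and the date stamp would typically come from date +%Y%m%d.

```shell
# Decompress nr.gz and hand it to Mascot only once the file is complete.
#   $1 = incoming directory containing nr.gz
#   $2 = current directory watched by Mascot
#   $3 = date stamp in YYYYMMDD form
# stage_fasta is a hypothetical helper, not part of Mascot.
stage_fasta() {
    incoming_dir=$1
    current_dir=$2
    base_name="NCBIprot_$3"

    # Step 5: decompress; gunzip replaces nr.gz with nr
    gunzip "$incoming_dir/nr.gz" || return 1

    # Step 6: rename under a temporary extension
    mv "$incoming_dir/nr" "$incoming_dir/$base_name.xyzzy" || return 1

    # Step 7: move to the current directory; this may be a slow copy
    # if the two directories are on different filesystems
    mv "$incoming_dir/$base_name.xyzzy" "$current_dir/$base_name.xyzzy" || return 1

    # Step 8: rename in place only after the move has finished
    mv "$current_dir/$base_name.xyzzy" "$current_dir/$base_name.fasta"
}
```

A call such as stage_fasta NCBIprot/incoming NCBIprot/current "$(date +%Y%m%d)" stops at the first failing step, so a .fasta file only appears in the current directory after the complete file has arrived there.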

Downloading with rsync on Windows

Unfortunately, there does not seem to be an official version of rsync for Windows. Here are some alternatives:

  • Use a Linux PC to run rsync, then copy nr.gz to the Windows PC.
  • Install Cygwin. In the package manager, install gzip and rsync. You can now follow the same steps as on Linux.
  • Use a commercial program like Acrosync to transfer nr.gz. Acrosync supports the rsync protocol.
  • Use a free software program like DeltaCopy or Grsync.

Bringing the database online

Once the FASTA file has been downloaded, bringing the database online is an automated process. How long this takes depends on disk performance, as discussed above. Keep in mind that if you stop the Mascot Monitor service or reboot the computer, the process will restart from the beginning.

In Mascot Server 2.7 and earlier, you may get an error that the test search has timed out. The solution is to increase the value of MonitorTestTimeout in the Options section of mascot.dat. For example, if the value is 2400, double it to 4800 and retry database compression. The timeout issue will be removed in the next version of Mascot.
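The doubling can be previewed without touching the live configuration. This sketch assumes the option appears on its own line as the name followed by a whitespace-separated number, for example "MonitorTestTimeout 2400". Check your own mascot.dat before relying on that assumption. double_monitor_timeout is a hypothetical helper that prints a modified copy to standard output and leaves the original file unchanged.

```shell
# Print a copy of the file $1 with the MonitorTestTimeout value doubled.
# Assumption: the option is written as "MonitorTestTimeout <seconds>" on one line.
# double_monitor_timeout is a hypothetical helper, not part of Mascot.
double_monitor_timeout() {
    awk '
        $1 == "MonitorTestTimeout" && $2 ~ /^[0-9]+$/ {
            print $1, $2 * 2    # rewrite the line with the doubled value
            next
        }
        { print }               # pass every other line through unchanged
    ' "$1"
}
```

Review the output, back up the original file, and only then apply the change to mascot.dat.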

Database search

The key to searching NCBIprot is choosing a narrow taxonomy filter. Searching the whole database will not only take a long time, but you’ll also lose sensitivity. Mascot ships with default taxonomy filters in the config/taxonomy file. Instructions for adding new filters are in chapter 9 of the Installation & Setup manual. You’ll need to recompress the database after making changes to the taxonomy configuration.

One comment on “NCBI nr tips”

  1. Ville Koskinen said:

    Since July 2021, NCBIprot has contained over 409 million sequences. Unfortunately, Mascot fails to bring NCBIprot online if it has more than approximately 370 million sequences. This is caused by a 4 GB limit on the taxonomy index .t00 file created during database compression. The bug will be fixed in a future version of Mascot.

    The workaround is to divide the FASTA file into two parts. Instructions are here:

    https://www.matrixscience.com/help/seq_db_setup_nr.html
