Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by Ville Koskinen (October 31, 2025)

Replacements for NCBI nr

NCBI nr is the comprehensive database of non-identical protein sequences compiled by the National Center for Biotechnology Information. The database configuration has long been shipped with Mascot Server as the NCBIprot predefined definition. However, it seems the FASTA file has quietly been retired; the last version appears to be the February 2024 file.

The last-modified time of nr.gz on the NCBI FTP site is 2024-02-07:

README.txt              2024-01-25 15:59  365   
nr.gz                   2024-02-07 10:05  186G  
nr.gz.md5               2024-02-07 11:22   40   

At the time of writing (October 2025), it seems there have been no updates for more than 18 months. The README.txt file links to a blog post at NCBI Insights, which suggests the FASTA file has been discontinued. There don’t seem to be any other official announcements about it. The NCBI documentation has not been updated and still refers to nr.gz as the active FASTA file, as can be seen in the BLAST Help manual (section Sequence files under the “/db/FASTA/” subdirectory).

How to compile the full BLAST nr database

BLAST itself has not been discontinued; only the prebuilt protein FASTA file has been retired. It’s still possible to download and compile the BLAST nr database on your local computer. The announcement points to the BLAST command line instructions. You’ll need to use update_blastdb.pl to download the full nr database. Then you can use blastdbcmd (blastdbcmd.exe on Windows) to extract all, or a subset of, the protein sequences in FASTA format.

I downloaded the blast+ package 2.16.0 from: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/. You will also need to install Perl. On Linux, Perl is usually available as part of the operating system. On Windows, Strawberry Perl is a good choice.

Once you have Perl installed, install the blast+ package. It contains the script update_blastdb.pl for initiating the download. On Windows, for example, the script is extracted into the installation directory C:\Program Files\NCBI\blast-2.16.0+\bin:

  perl "C:\Program Files\NCBI\blast-2.16.0+\bin\update_blastdb.pl" --source ncbi --decompress nr

The command will download the chunked database, like:

  Downloading https://ftp.ncbi.nlm.nih.gov/blast/db/nr.000.tar.gz...
  Downloading https://ftp.ncbi.nlm.nih.gov/blast/db/nr.001.tar.gz...
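
A download of this many chunks can fail partway, so it can be convenient to wrap the command in a retry loop. The following is a minimal Python sketch, not part of the blast+ package. The function name run_with_retries and its parameters are hypothetical, and it assumes update_blastdb.pl signals failure with a non-zero exit status, which is worth checking against the script's documentation for your version.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=5, wait_seconds=60):
    """Run cmd until it exits with status 0 or the attempts run out.

    Returns the number of attempts used on success and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"Attempt {attempt} failed with exit code {result.returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            # Pause before retrying in case the failure was transient.
            time.sleep(wait_seconds)
    raise RuntimeError(f"Command still failing after {max_attempts} attempts")


# Example call, reusing the Windows path and options from the command above:
# run_with_retries([
#     "perl",
#     r"C:\Program Files\NCBI\blast-2.16.0+\bin\update_blastdb.pl",
#     "--source", "ncbi", "--decompress", "nr",
# ])
```

The wait between attempts is arbitrary; a longer pause may be kinder to the server if failures are caused by load.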

Once all the chunks have been downloaded, extract the FASTA formatted database with:

  "C:\Program Files\NCBI\blast-2.16.0+\bin\blastdbcmd.exe" -entry all -db nr -out nr.fasta
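
As a rough sanity check on the extracted file, you could count the FASTA records by counting the lines that start with ">". This is only a sketch: count_fasta_entries is a hypothetical helper, and reading a file of several hundred gigabytes line by line will itself take a while.

```python
def count_fasta_entries(path):
    """Count FASTA records by counting lines that begin with '>'.

    The file is streamed in binary mode so it never has to fit in memory.
    """
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                count += 1
    return count


# Example (file name taken from the command above):
# print(count_fasta_entries("nr.fasta"))
```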

Now, how long will it take? According to the metadata file, in October 2025, the 128 gzip-compressed chunks amount to 306GB. If you have a 1Gbps Internet connection, downloading could theoretically be finished in an hour, but it depends on how fast NCBI’s servers are. With a 100Mbps connection, it will take overnight or longer. In our experience, the chunk downloads sometimes fail so you may need to run the command several times.
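
The estimates above are simple arithmetic, sketched below in Python. The function name download_hours and the efficiency parameter are hypothetical; the calculation uses decimal units (1GB = 8000 megabits) and ignores server-side limits unless you lower efficiency.

```python
def download_hours(size_gb, link_mbps, efficiency=1.0):
    """Estimate the hours needed to transfer size_gb over a link_mbps link.

    efficiency is an assumed fraction (0 to 1] of the nominal link speed
    that is actually achieved; 1.0 gives the theoretical best case.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    seconds = size_gb * 8000 / (link_mbps * efficiency)
    return seconds / 3600


print(f"1 Gbps:   {download_hours(306, 1000):.1f} h")  # 0.7 h
print(f"100 Mbps: {download_hours(306, 100):.1f} h")   # 6.8 h
```

These are best-case figures. At 50% efficiency, for example, the 100Mbps case already stretches to well over half a day.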

In October 2025, the chunks decompress to 590GB. That’s the size of the BLAST database. You will also need disk space for the formatted FASTA file extracted from BLAST, which is today around 450GB. In total:

  • 306GB for the gzipped chunks
  • Plus 590GB after extracting the files
  • Plus 450GB for nr.fasta

So, better have at least 1.5TB of free space.
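
To check a drive before starting, the sum above can be compared against the free space reported by the operating system. This is a minimal Python sketch; the helper names and the 10% safety margin are assumptions, and the sizes are the October 2025 figures quoted above.

```python
import shutil

# Sizes in GB, as quoted above for October 2025.
REQUIRED_GB = {
    "gzipped chunks": 306,
    "decompressed BLAST database": 590,
    "nr.fasta": 450,
}


def total_required_gb(parts=REQUIRED_GB):
    """Sum the space needed for all stages of the download and extraction."""
    return sum(parts.values())


def free_gb(path):
    """Free space, in decimal GB, on the drive that holds path."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb, margin=1.1):
    """True if the drive has required_gb plus an assumed safety margin free."""
    return free_gb(path) >= required_gb * margin


print(total_required_gb())  # 1346
# Example: has_enough_space("D:\\", total_required_gb())
```

With the 10% margin, the requirement comes to about 1480GB, which is where the 1.5TB rule of thumb lands.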

Back in April 2022, we noted that NCBI nr continues to double in size roughly every two years. The trend is still the same, so if you are reading this article in 2027, make sure you have at least 3TB of free disk space!
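
The projection is plain exponential growth. Here is a short Python sketch, assuming the two-year doubling period continues to hold; projected_size is a hypothetical helper name.

```python
def projected_size(size_now, years_ahead, doubling_years=2.0):
    """Project a size forward, assuming it doubles every doubling_years."""
    return size_now * 2 ** (years_ahead / doubling_years)


# Starting from the 1.5TB recommended in late 2025:
print(projected_size(1.5, 2))  # 3.0 (TB, two years later)
print(projected_size(1.5, 4))  # 6.0 (TB, four years later)
```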

Using NCBI nr in Mascot Server

Although nr now contains close to a billion sequences (as predicted in April 2022), you can still use it with Mascot Server, which may be the only database search engine that can handle this gigantic database.

From your local Mascot Server home page, go to Configuration Editor and then Database Manager.

  1. Choose to create a new database.
  2. Select the option “Use predefined definition template”, and select NCBIprot as the template.
  3. Give it a descriptive name like NCBI_nr_from_BLAST.
  4. Click Next and you have a choice where to put the sequence files. Make sure you choose a directory or drive that has sufficient free space. Twice the size of the FASTA file is sufficient.
  5. Click Create.

At this point, the database definition is waiting for the FASTA file. Don’t use the web browser’s file upload functionality! Web browsers have not been designed to upload a 450GB file over the network. Instead, copy nr.fasta directly to the database’s ‘current’ directory on the Mascot Server hard drive. Then, rename nr.fasta to NCBI_nr_from_BLAST_20251030.fasta, using today’s date. The database definition should automatically refresh and display the detected FASTA file. Click Enable to start database compression.
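
If you prefer to script the copy and rename step, it could look something like the Python sketch below. The helper names are hypothetical, the 'current' directory path depends on where you chose to put the sequence files, and the only thing taken from Mascot is the name_YYYYMMDD.fasta naming pattern shown above.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_fasta_name(db_name, day=None):
    """Build a file name such as NCBI_nr_from_BLAST_20251030.fasta."""
    day = day or date.today()
    return f"{db_name}_{day:%Y%m%d}.fasta"


def install_fasta(source, current_dir, db_name, day=None):
    """Copy source into current_dir, then rename it to the dated name."""
    source = Path(source)
    staged = Path(current_dir) / source.name
    if staged.resolve() != source.resolve():
        # Copy under the original name first, as described above.
        shutil.copyfile(source, staged)
    target = Path(current_dir) / dated_fasta_name(db_name, day)
    staged.rename(target)
    return target


# Example (the directory below is a placeholder, not a real default):
# install_fasta("nr.fasta", r"D:\sequence\NCBI_nr_from_BLAST\current",
#               "NCBI_nr_from_BLAST")
```

Copying under the original name and renaming afterwards means the file only appears under its dated name once the copy is complete.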

Compressing, or actually creating the index files, will take some hours or even a day or two, depending on your disk speed and processor speed.

Alternatives?

We’ve always advised that NCBI nr is a database of last resort due to its sheer size, and that’s more true with every doubling of the database. If you have a local copy of BLAST, one useful alternative is to extract a taxonomy subset of protein sequences. The BLAST+ manual says it’s possible but, as there don’t seem to be tutorials available, perhaps we should write one.

Another option is to use a different predefined definition. Mascot Server ships with configurations for a couple of other big databases:

  • UniProtKB/TrEMBL (Trembl) contains unreviewed protein sequences associated with computationally generated annotation and large-scale functional characterization.
  • UniRef100 (UniRef) contains non-identical clustered sets of sequences from UniProtKB (including isoforms).

TrEMBL, in particular, is a good alternative. The upcoming 2026_02 release of UniProt will see a reorganisation, announced in October 2025, which increases the number of reference proteomes and reduces the number of unannotated or poorly annotated sequences. In other words, the database will be of much higher quality, while still providing comprehensive coverage across the Tree of Life. We recommend using TrEMBL in preference to NCBI nr. That said, if you are truly studying unknown or poorly characterised bacterial species (e.g. metaproteomics), then nr is still king.
