Sequence database setup: NCBI nr

IMPORTANT – NCBI have dropped gi numbers
In late August 2016, NCBI removed gi numbers from the title lines of the nr Fasta file. This breaks the existing definition, which was called NCBInr, so we have created a new definition for accession.version identifiers, called NCBIprot.
If you are part way through a major project or have a workflow that absolutely requires the continued use of gi numbers as identifiers, you will need to freeze nr at or before the 21 August 2016 release. That is, you must disable any type of automatic updating. If you need to refer to the old configuration information, see the archived help page.
In all other cases, you should use the new NCBIprot configuration. In Mascot 2.4 and later, NCBIprot is a predefined database, which can be enabled in Database Manager.
If you already had NCBInr enabled, either select it on the top-level Databases page and choose Delete or, if you wish to keep it so that you can load Protein View reports for old search results, ensure that automatic updates are disabled.

Overview

The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF.

The strengths of nr are that it is comprehensive and frequently updated. The downside is its size: as of July 2021, the 190 GB Fasta file contained 409 million entries. A 64-bit version of Mascot on a 64-bit PC is essential. In most cases, there are better choices of database, such as a subset of GenBank for the organism of interest or a UniProt complete proteome.

NCBIprot configuration in Database Manager

IMPORTANT – Mascot (all versions) cannot handle NCBIprot larger than approximately 370 million sequences
Since July 2021, NCBIprot has contained at least 409 million sequences. Mascot Server will fail to bring NCBIprot online if it contains more than approximately 370 million sequences. This is caused by a 4 GB limit on the taxonomy index .t00 file created during database compression. The bug will be fixed in a future version of Mascot.
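To check whether a particular nr download is over the limit before handing it to Mascot, you can count its title lines. The following is a minimal Python sketch, not part of Mascot. The function names and the constant are hypothetical, and the threshold is simply the approximate 370 million figure quoted above, not an exact value.

```python
import io
from typing import BinaryIO

# Approximate entry limit quoted above; an assumption, not an exact Mascot value.
APPROX_ENTRY_LIMIT = 370_000_000


def count_fasta_entries(handle: BinaryIO) -> int:
    """Count Fasta entries by counting lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(b">"))


def exceeds_limit(entry_count: int, limit: int = APPROX_ENTRY_LIMIT) -> bool:
    """Return True if the entry count is above the approximate limit."""
    return entry_count > limit


# Small in-memory demonstration; for a real file use open("nr.fasta", "rb").
demo = io.BytesIO(b">A.1 first\nMKV\n>B.1 second\nMLL\nGGA\n")
demo_count = count_fasta_entries(demo)
print(demo_count, exceeds_limit(demo_count))
```

The file is read line by line in binary mode, so memory use stays small even for a Fasta file of 190 GB.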
The workaround is to split the Fasta file into two parts.
  1. Download divide_fasta_file.zip and extract using unzip or 7zip.
  2. Download nr.gz using either a web browser or rsync (see the rsync example in the NCBI nr tips blog article).
  3. Extract nr.gz using gzip or 7zip and rename to nr.fasta.
  4. Open a command prompt or shell and change directory to where nr.fasta is located. Run the Perl script:
    perl divide_fasta_file.pl nr.fasta
  5. The script creates two files, nr_1.fasta and nr_2.fasta.
  6. If you have Mascot 2.6 or later, you can run the script in step 4 with the Perl bundled with Mascot. The Perl path is C:\inetpub\mascot\perl64\bin\perl (Windows) or /usr/local/mascot/perl64/bin/perl (Linux).
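For readers who want to see what such a split involves, here is a minimal Python sketch. It is not the divide_fasta_file.pl script, whose internals are not described on this page. It assumes the split is balanced by entry count and made on an entry boundary, and the function name is hypothetical. The output names follow the nr_1.fasta and nr_2.fasta pattern from step 5.

```python
from pathlib import Path
from typing import Tuple


def split_fasta_in_two(src: Path) -> Tuple[Path, Path]:
    """Split src into <stem>_1<suffix> and <stem>_2<suffix> at an entry boundary."""
    # First pass: count entries so the halves can be balanced by entry count.
    with src.open("rb") as fh:
        total = sum(1 for line in fh if line.startswith(b">"))
    first_half = (total + 1) // 2  # the first file gets the extra entry if odd

    part1 = src.with_name(f"{src.stem}_1{src.suffix}")
    part2 = src.with_name(f"{src.stem}_2{src.suffix}")

    # Second pass: stream lines, switching output once the first half is written.
    seen = 0
    with src.open("rb") as fh, part1.open("wb") as out1, part2.open("wb") as out2:
        out = out1
        for line in fh:
            if line.startswith(b">"):
                seen += 1
                if seen > first_half:
                    out = out2
            out.write(line)
    return part1, part2
```

Calling split_fasta_in_two(Path("nr.fasta")) would write nr_1.fasta and nr_2.fasta next to the input. The file is streamed twice rather than loaded into memory, which matters for a file of this size.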
Next, update taxonomy files and then set up the two halves of the database:
  1. Configuration Editor; Configuration Options; add NCBIprot_1 and NCBIprot_2 to the list of databases next to IgnoreDupeAccessions and Apply.
  2. Configuration Editor; Database Manager. Update SwissProt. This will download and extract taxdump.tar.gz.
  3. Download prot.av2taxid.gz from https://s3.amazonaws.com/matrixsciencemisc. Decompress prot.av2taxid.gz and copy it to the mascot/taxonomy directory.
  4. Configuration Editor; Database Manager. Create databases NCBIprot_1 and NCBIprot_2 using NCBIprot as template.
  5. Move nr_1.fasta to the NCBIprot_1/current directory and rename it to NCBIprot_1_20210801.fasta.
  6. Move nr_2.fasta to the NCBIprot_2/current directory and rename it to NCBIprot_2_20210801.fasta.
  7. Activate NCBIprot_1. Wait for Mascot to finish.
  8. Activate NCBIprot_2. Wait for Mascot to finish.
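The move and rename steps above can be scripted. The following Python sketch assumes the naming pattern shown in the steps, that is, the database name followed by an underscore and a date in YYYYMMDD form. The function names and the sequence_root argument are hypothetical; pass the location of your own Mascot sequence directory.

```python
import shutil
from datetime import date
from pathlib import Path


def target_path(sequence_root: Path, db_name: str, release: date) -> Path:
    """Build <sequence_root>/<db_name>/current/<db_name>_<YYYYMMDD>.fasta."""
    filename = f"{db_name}_{release:%Y%m%d}.fasta"
    return sequence_root / db_name / "current" / filename


def install_fasta(src: Path, sequence_root: Path, db_name: str, release: date) -> Path:
    """Move src into the database's current directory under its date-stamped name."""
    dest = target_path(sequence_root, db_name, release)
    # The directory is expected to exist already (created when the database
    # definition was set up), so a missing directory is reported, not created.
    if not dest.parent.is_dir():
        raise FileNotFoundError(f"missing directory: {dest.parent}")
    shutil.move(str(src), str(dest))
    return dest
```

For example, install_fasta(Path("nr_1.fasta"), sequence_root, "NCBIprot_1", date(2021, 8, 1)) would place the file at NCBIprot_1/current/NCBIprot_1_20210801.fasta under the given root.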

In Mascot 2.4 and later, NCBIprot is a predefined database, meaning up-to-date configuration information is downloaded automatically by Mascot Database Manager. To enable nr with the new configuration, all you need to do is:

  1. Configuration Editor; Configuration Options; add NCBIprot to the list of databases next to IgnoreDupeAccessions and Apply.
  2. Configuration Editor; Database Manager; choose Enable predefined definition then select NCBIprot.

Download

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz for the current release.

NCBIprot manual configuration in Mascot 2.3

The rest of this page is only relevant for Mascot 2.3 or if you choose to edit mascot.dat rather than use Database Manager. (It isn’t possible to configure taxonomy for nr in Mascot 2.2 and earlier.)

Taxonomy

The following taxonomy files are required:

ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz (the downloaded file must be unpacked using tar as well as decompressed)
prot.av2taxid.gz can be downloaded from http://s3.amazonaws.com/matrixsciencemisc

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory.

Make a backup copy of mascot.dat, then use a text editor to add the following taxonomy definition, changing the taxonomy block number so that it is consecutive with the existing blocks. The file must be saved as plain text. Ensure that the editor does not change the filename to mascot.dat.txt or similar.

#
Taxonomy_17
Identifier NCBIprot (nr post gi numbers)
Enabled 1
FromRefFile 0
ErrorLevel 0
DescriptionLineSep 1
AccFromSpeciesLine "^>*\([^> ,]*\)"
SpeciesFiles ACC2TAXID:prot.av2taxid, NCBI:names.dmp
NodesFiles NCBI:nodes.dmp, NCBI:merged.dmp
DefaultRule ACC2TAXID, CHOP: "^>*\([^> ,]*\)"
end
#

Parse Rules

A typical Fasta title line is:

>WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]

Suitable parse rules are:

Accession from Fasta title: ">\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"

If an entry in nr represents multiple source database entries, the Fasta title lines are concatenated together with CTRL+A as the delimiter.
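To see what these rules extract, the sketch below applies Python equivalents to the example title line. It assumes that the \( and \) in the Mascot rules mark a capture group, which Python writes as plain parentheses. The function names are hypothetical, and the code illustrates the patterns only, not how Mascot itself processes titles.

```python
import re
from typing import List, Optional, Tuple

# Python equivalents of the two parse rules, assuming \( \) denote a group.
ACCESSION_RULE = re.compile(r">([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (accession, description); either is None if its rule fails."""
    acc = ACCESSION_RULE.search(title)
    desc = DESCRIPTION_RULE.search(title)
    return (acc.group(1) if acc else None, desc.group(1) if desc else None)


def split_titles(line: str) -> List[str]:
    """Split a concatenated title line on CTRL+A (character code 1)."""
    return line.split("\x01")


title = ">WP_011638038.1 LysR family transcriptional regulator [Shewanella frigidimarina]"
print(parse_title(title))
```

For the example title, the accession is WP_011638038.1 and the description is everything after the first space.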

Miscellaneous

It is essential for NCBIprot (or whatever name you use for the database) to be listed on the IgnoreDupeAccessions line in the Options section of mascot.dat.

Configuration example for Mascot 2.3

For this example, nr.gz was downloaded to a folder named C:\Inetpub\MASCOT\sequence\NCBIprot\current. The file was decompressed using gzip, and renamed to NCBIprot_20160901.fasta.

Taxonomy files were downloaded to the taxonomy directory, as described above.

Mascot database maintenance utility

There is no downloadable full text file for nr, but full text for individual entries can be retrieved across the web from the NCBI Entrez server. The syntax for the Path field is:

/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#
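The sketch below shows how such a path could be filled in for a single entry. It assumes that the #ACCESSION# placeholder is replaced by the accession of the protein being viewed. The function name is hypothetical, and the code does not contact the Entrez server.

```python
from urllib.parse import quote

# Path field from above, with #ACCESSION# as the placeholder.
PATH_TEMPLATE = (
    "/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein"
    "&tool=mascot&email=support@matrixscience.com&id=#ACCESSION#"
)


def full_text_path(accession: str, template: str = PATH_TEMPLATE) -> str:
    """Substitute a URL-escaped accession for the #ACCESSION# placeholder."""
    return template.replace("#ACCESSION#", quote(accession, safe=""))


print(full_text_path("WP_011638038.1"))
```

Accession.version identifiers such as WP_011638038.1 contain only characters that need no escaping, so the value passes through unchanged.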

If you don’t require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose
— no full text report —
in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.