Posted by John Cottrell (January 14, 2016)

Disruption ahead for NCBI databases

NCBI has announced that it will drop ‘gi number’ unique identifiers on June 15, 2016. Details are given in section 1.4 of the GenBank release notes. This will create difficulties for users of many bioinformatic tools, not just Mascot, particularly in the context of major projects, where analyses are performed over an extended time period and results are consolidated by protein identifier.

Although the change affects all NCBI Fasta files, the primary concern for most people will be the comprehensive protein database, nr. One possibility is to change the configuration of NCBInr in Mascot to use new identifiers. This keeps things simple in some respects, but it means that you can no longer load a protein view report for a search performed before the change. This is because complete protein sequences are not saved in the result files; to do so would make the files enormous. Instead, the sequence is retrieved from the Fasta when the report is loaded, and this will fail because the old gi number identifier cannot be found in the new Fasta file. The fix is to re-run the search, which is not ideal, but how often do you really need to get a protein view report off an old result file?
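To make the failure concrete, here is a minimal sketch of looking up a sequence by identifier. It is a toy illustration, not how Mascot is implemented: the index_fasta function, the rule of taking the first whitespace-delimited token as the identifier, and the Fasta entries themselves are all made up for the example.

```python
def index_fasta(text):
    """Map each entry's identifier to its sequence.

    Assumption for this sketch: the identifier is the first
    whitespace-delimited token of the title line.
    """
    index = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            current = line[1:].split()[0]
            index[current] = ""
        elif current is not None:
            index[current] += line.strip()
    return index


# Made-up entries: the same protein before and after the change.
OLD_FASTA = ">gi|00000001|gb|XXX00000.1| example protein\nMKVLA\nGTW\n"
NEW_FASTA = ">gb|XXX00000.1| example protein\nMKVLA\nGTW\n"

# An old result file holds only the identifier, not the sequence.
stored_id = "gi|00000001|gb|XXX00000.1|"

old_index = index_fasta(OLD_FASTA)
new_index = index_fasta(NEW_FASTA)

print(old_index.get(stored_id))  # found in the old Fasta
print(new_index.get(stored_id))  # missing from the new Fasta
```

The sequence is still present in the new file, but only under its new identifier, so a lookup driven by the identifier stored in an old result file comes back empty.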

Issues with taxonomy

A less obvious but more serious problem concerns the way in which taxonomy is determined for entries in NCBI databases. NCBI maintains two index files, gi_taxid_prot.dmp for proteins and gi_taxid_nucl.dmp for nucleic acids, which are used by Mascot to translate gi numbers to taxonomy IDs. When gi numbers disappear, the remaining identifiers will be in Fasta Sequence ID format. This comprises a two or three letter tag indicating the source database followed by between one and three data fields, delimited by pipes (vertical bars). Here are examples for each of the source databases in the current NCBInr:

sp|Q93TM8.1|PEBB_NOSP7
tpg|DAA00042.1|
dbj|BAB70916.1|
pir||T49736
ref|WP_003131952.1|
tpd|FAA00039.1|
prf||1103184A
pdb|3SIB|A
tpe|CAD29859.1|
emb|CAD71090.1|
gb|AAT54006.1|

Mascot, and most other software, needs to parse out a unique identifier for every entry using some simple rule. Looking at the variation in the identifiers above, you will see that you cannot take only the second field, because this is empty for pir and prf, or the third field, which is often empty. You can’t even take the rightmost non-empty field, because this is not unique, being just a single letter for pdb entries. The only practical option is to take the complete string.
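The reasoning above can be checked with a short sketch that applies each candidate rule to the example identifiers. The function names are invented for illustration, and pdb|1ABC|A is a made-up identifier added only to show the collision; none of this is Mascot's actual parsing code.

```python
EXAMPLES = [
    "sp|Q93TM8.1|PEBB_NOSP7",
    "tpg|DAA00042.1|",
    "dbj|BAB70916.1|",
    "pir||T49736",
    "ref|WP_003131952.1|",
    "tpd|FAA00039.1|",
    "prf||1103184A",
    "pdb|3SIB|A",
    "tpe|CAD29859.1|",
    "emb|CAD71090.1|",
    "gb|AAT54006.1|",
]


def fields(seq_id):
    """Split a Fasta Sequence ID into its database tag and data fields."""
    tag, *data = seq_id.split("|")
    return tag, data


def second_field(seq_id):
    """Rule 1: the field straight after the database tag."""
    return fields(seq_id)[1][0]


def third_field(seq_id):
    """Rule 2: the field after that, or an empty string if absent."""
    data = fields(seq_id)[1]
    return data[1] if len(data) > 1 else ""


def rightmost_non_empty(seq_id):
    """Rule 3: the last data field that is not empty."""
    return [f for f in fields(seq_id)[1] if f][-1]


# Rule 1 fails: empty for pir and prf.
print([s for s in EXAMPLES if not second_field(s)])

# Rule 2 fails: empty for most of the databases.
print([s for s in EXAMPLES if not third_field(s)])

# Rule 3 fails: two different pdb entries both reduce to the chain letter.
# "pdb|1ABC|A" is a made-up identifier for illustration.
print(rightmost_non_empty("pdb|3SIB|A") == rightmost_non_empty("pdb|1ABC|A"))

# The complete string is distinct for every example.
print(len(set(EXAMPLES)) == len(EXAMPLES))
```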

We hope that NCBI will create new indexes that allow a taxonomy ID to be looked up using a Fasta Sequence ID, although our enquiries so far have only been met with ‘we’ll get back to you’. Even if such index files are created and made available simultaneously with the new Fasta files, it is likely that the format will change in a way that cannot be handled without code changes. If so, it won’t be possible to use the new indexes with Mascot Server until we can release a patch. The alternative is to parse taxonomy descriptions from the Fasta title line and use these to look up taxonomy IDs. This is how taxonomy for NCBI was handled before the index files were available, but it is less reliable and much slower than using an index file.
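As a rough sketch of the title line approach, the snippet below pulls a species name out of a title and looks it up in a name table. It assumes the species name is the last square-bracketed text in the title, which is a simplification. The NAME_TO_TAXID table, the taxid_from_title function and the title lines are hypothetical; a real table would have to be built from the NCBI taxonomy names.

```python
import re

# Hypothetical name-to-ID table, for illustration only.
NAME_TO_TAXID = {
    "homo sapiens": 9606,
    "escherichia coli": 562,
}

# Assumption: the species name is the last square-bracketed text in the title.
SPECIES_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def taxid_from_title(title, name_to_taxid=NAME_TO_TAXID):
    """Return a taxonomy ID for a title line, or None if none can be found."""
    match = SPECIES_RE.search(title)
    if match is None:
        return None
    return name_to_taxid.get(match.group(1).strip().lower())


# Made-up title lines.
print(taxid_from_title("gb|XXX00000.1| example protein [Homo sapiens]"))
print(taxid_from_title("gb|XXX00000.1| example protein"))
```

Even this toy version hints at why the method is less reliable than an index: a title with no bracketed name, a name spelled differently from the table, or a name that itself contains brackets all come back with no taxonomy.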

Anticipated changes in Mascot

NCBInr is a pre-defined database in Mascot Database Manager. Our plan is to keep NCBInr as the name for the old format database, configured for gi numbers and gi_taxid_prot.dmp. The only action you will need to take is to turn off scheduled updates for NCBInr as the June deadline approaches. (If you forget to do this, compression of the new Fasta will fail, and the old Fasta will continue to be the active database.) Keeping the old format database online minimises disruption and means that protein view reports will continue to be available for old searches.

NCBInr was never a good name, given that nr is non-identical, not non-redundant, so we will use a new name for the new format files, possibly NCBIni or NCBIprot. The new definition will be made public as soon as we have files from NCBI to test against. All you will need to do in Database Manager is to enable it.

Alternative databases

However, this might be a good time to review whether you really need to have the complete NCBInr on your Mascot Server. The size of the NCBInr Fasta at the end of December 2015 was a brutal 48 GB. This has been causing problems for some time now. Updating the database takes a long time and requires a great deal of free disk space. It is no longer possible to use NCBInr on a 32-bit system because the memory required to create the taxonomy index exceeds the available address space.

If you have a truly unknown sample, which could contain anything, maybe SwissProt is actually a better choice for a quick ‘survey’ search? For in-depth searching, where the target proteins come from one taxonomy, a subset of GenBank entries limited to that taxonomy can be downloaded from Entrez as a Fasta. A template to configure such a file is included in Database Manager (NCBI_AA_template) and the general procedure is described on the relevant help page.

If you decide you still need a comprehensive database, consider switching to UniRef100. This offers very similar coverage, containing 26.0 billion residues compared with 28.7 billion in NCBInr (both numbers as of December 2015).

If you currently have the NCBI EST files on your server, consider switching to the equivalent files from EMBL. The NCBI files are divided into three: human, mouse, and ‘others’, containing all other species. This last file is some 42 GB, creating the same problems as NCBInr. The EMBL files offer the same coverage and are more evenly divided into 10 files.

The only bright side to all of this is that the change is scheduled for after ASMS, and not just before, when everyone is frantically trying to complete work for presentations.

3 comments on “Disruption ahead for NCBI databases”

  1. John Cottrell said:

    Update: Release notes for GenBank 212 state that the dropping of gi numbers has slipped slightly to September 15, 2016.

  2. John Cottrell said:

    Update: New Fasta title line syntax described in NCBI news item: https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/

  3. SamGG said:

    Bad news, but maybe a chance to evaluate alternatives.
    If I remember correctly, there is no taxonomy associated with UniRef, because the representative sequence is picked from among the identical sequences. So a lot of GB without any specificity IMHO.
    A user-built library is interesting, but one has to rebuild it from time to time to keep it up to date.
    While NCBInr is clearly redundant, the automated updates and taxonomy combination offered by Mascot made NCBInr very useful, especially for exhaustive searches. I admit that freezing NCBInr is the best option.
    UniProt has been used in our lab for a few years. While UniProt is less exhaustive (unless computing all variants), it answers most of our needs, is frequently updated and allows combining taxonomies.
