Posted by Ville Koskinen (August 31, 2022)

Downloading UniProt proteomes via new API

The UniProt website received a snazzy facelift in June 2022. Both the browser interface and the REST API were updated. The previous version, now termed the legacy website, remains available until the 2022_04 release under a new URL (https://legacy.uniprot.org/), so there is limited time to compare and admire the improvements! For Mascot Server, the transition to the new API is almost seamless. Unfortunately, uniprot.org dropped an HTTP header that Mascot relies on, so we have prepared a patch release to address the change.

Simple URL change

Uniprot.org provides two types of data download. Whole databases are provided as files, like https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz, and these URLs have not changed. You can also download protein sequences and whole proteomes via the REST API, whose URL used to start with www.uniprot.org/uniprot/?query=.... The query string contains filters like taxonomy ID. The new URL is rest.uniprot.org/uniprotkb/stream?query=... and the parameter names have changed slightly.

We’ve updated the documentation for setting up UniProt proteomes, and we’ve also updated the predefined definitions (databases_1.xml). If you have set up a custom UniProt proteome, changing to the new URL is simple. Edit the FASTA URL and make the following changes:

  • New hostname and path: rest.uniprot.org/uniprotkb/stream?
  • Rename parameter compress=no to compressed=false
  • Rename parameter include=yes to includeIsoform=true
  • The query= parameter syntax is unchanged.

For example, the URL for the human proteome used to be:

  https://www.uniprot.org/uniprot/?query=proteome:UP000005640&format=fasta&compress=no&include=yes

Applying the above changes turns it into the new URL:

  https://rest.uniprot.org/uniprotkb/stream?query=proteome:UP000005640&format=fasta&compressed=false&includeIsoform=true

You can easily re-create the URL in the new format by using the UniProt web interface.
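If you have several custom definitions to convert, the renaming rules listed above can also be applied mechanically. The following is a minimal Python sketch, not part of Mascot; the function name convert_legacy_url and the RENAMES table are made up for illustration, and only the two parameter values mentioned above are rewritten. Anything else in the query string is passed through unchanged.

```python
from urllib.parse import urlsplit

# Hostname and path of the new streaming endpoint, as given above.
NEW_BASE = "https://rest.uniprot.org/uniprotkb/stream"

# (name, value) pairs to rename, taken from the list above.
# Other parameters and values are left exactly as they were.
RENAMES = {
    ("compress", "no"): ("compressed", "false"),
    ("include", "yes"): ("includeIsoform", "true"),
}


def convert_legacy_url(legacy_url: str) -> str:
    """Rewrite a legacy www.uniprot.org/uniprot/ query URL for the new API."""
    query = urlsplit(legacy_url).query
    new_params = []
    for pair in query.split("&"):
        if not pair:
            continue  # skip empty segments such as a trailing '&'
        name, sep, value = pair.partition("=")
        name, value = RENAMES.get((name, value), (name, value))
        # Re-join without decoding, so the original encoding is preserved.
        new_params.append(name + sep + value)
    return NEW_BASE + "?" + "&".join(new_params)


if __name__ == "__main__":
    old = ("https://www.uniprot.org/uniprot/"
           "?query=proteome:UP000005640&format=fasta&compress=no&include=yes")
    print(convert_legacy_url(old))
```

Run against the legacy human proteome URL, this should produce the new URL shown above. It is still worth checking the result in a browser before pasting it into Database Manager.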

Missing Last-Modified header

Database Manager uses standard HTTP headers when communicating with a remote server. If certain HTTP headers are available, it tries hard to do basic integrity checking, as well as avoid downloading if there is no new version available. Database Manager also has to deal with proxy servers and resuming interrupted downloads, which have their own subset of HTTP headers.

The available HTTP headers can vary a great deal between servers and services. For example, when you’re downloading a file, the server typically knows the file size and last-modified time, so it will send these as the Content-Length and Last-Modified headers. When you’re downloading data from a stream that is generated on the fly, the server might or might not know the total amount of data to send, so it might not emit a Content-Length header.

The UniProt REST API is a stream, whose contents are generated (most likely) from a relational database based on the query parameters. The legacy REST API provided a Last-Modified header, which contained the UniProt release date and whose stated purpose was: “will avoid that you download data more than once per release”. Database Manager used the header for exactly this purpose.

The new API no longer has a Last-Modified header. The removal is unfortunate, because Database Manager does not behave correctly when the header is absent. Setting up a UniProt proteome works fine, and Database Manager succeeds in downloading protein sequences using the new API. However, when you click to update the database, Database Manager inspects the HTTP headers, finds that Last-Modified is absent, and decides there is nothing new to download. As a result, you can never actually update the database through Database Manager. The bug affects the last three versions of Mascot, but we’ve prepared a workaround and a patch release (see below).
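To make the failure mode concrete, here is a small Python sketch of this kind of update check. It is not Database Manager's actual code; the function needs_download and its missing_means_unchanged flag are invented for illustration. With the flag set, a missing header is read as "nothing new", which mirrors the behaviour described above. With the flag cleared, a missing header triggers a fresh download, which is one possible safe fallback and not necessarily what the patch does.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def needs_download(
    headers: Mapping[str, str],
    local_time: Optional[datetime],
    missing_means_unchanged: bool = False,
) -> bool:
    """Decide whether a remote file should be fetched again.

    headers     -- HTTP response headers from the remote server
    local_time  -- timezone-aware timestamp of the local copy, or None
    """
    if local_time is None:
        return True  # nothing has been downloaded yet

    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    last_modified = lowered.get("last-modified")

    if last_modified is None:
        # The server gave no version information at all.
        return not missing_means_unchanged

    # Parse the HTTP date, e.g. "Wed, 03 Aug 2022 00:00:00 GMT".
    remote_time = parsedate_to_datetime(last_modified)
    return remote_time > local_time
```

Calling needs_download({}, local_time, missing_means_unchanged=True) returns False for every existing database, however old the local copy is, which is why the update button appears to do nothing.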

The UniProt website update was tweeted, and the new API announced in a subsequent tweet, but there doesn’t seem to be a website news item for either one, or any document that lists the differences between the old and the new. We’ve discussed the header removal with UniProt support, and it seems unlikely the header will be added back.

Mascot Server 2.8

If you have Mascot Server 2.8.0 or 2.8.1, please download and install patch 2.8.2. The patch fixes the HTTP header issue. It also fixes a bug, introduced in patch 2.8.1, where setting up a database using a template (such as UniProt_proteome_template) can cause scheduled database updates to stop working.

Mascot Server 2.7 and 2.6

Enabling a predefined definition of a UniProt proteome, for example UP5640_H_sapiens, works fine in Mascot Server 2.7 and 2.6. Database Manager downloads the FASTA file and Mascot brings the database online. However, trying to update the database through Database Manager does nothing. Database Manager queries the UniProt HTTPS server, which now returns no Last-Modified header, and determines that no new version needs to be downloaded.

The workaround is:

  1. Deactivate the database
  2. Delete the files in its ‘current’ directory
  3. Click to update the database in Database Manager
  4. Activate the database

Adding a UniProt proteome as a custom database or using a template works fine for the initial download. Please follow the above workaround for subsequent database updates.
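If you need to repeat the workaround regularly, step 2 is the only part that happens outside Database Manager, and it can be scripted. The following Python sketch is an illustration only: the function name clear_current_directory is made up, and you must supply the path to the database's ‘current’ directory yourself. It defaults to a dry run that only lists the files it would delete, and it should only be run after the database has been deactivated (step 1).

```python
from pathlib import Path


def clear_current_directory(current_dir: Path, dry_run: bool = True) -> list:
    """Delete the regular files directly inside a 'current' directory.

    Returns the names of the files that were (or, in a dry run, would be)
    deleted. Subdirectories are left untouched.
    """
    removed = []
    for entry in sorted(current_dir.iterdir()):
        if entry.is_file():
            if not dry_run:
                entry.unlink()
            removed.append(entry.name)
    return removed
```

Call it once with the default dry_run=True to review the list, then again with dry_run=False to actually delete the files, before continuing with steps 3 and 4 in Database Manager.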

Mascot Server 2.5 and 2.4

UniProt is only available via HTTPS. Mascot Server 2.5 doesn’t support downloading data from an HTTPS URL with query parameters, and Mascot Server 2.4 doesn’t support HTTPS at all, so please download the FASTA file manually as documented in our help.
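Any HTTPS-capable tool can fetch the file for a manual download, including a web browser. As one option, here is a minimal Python sketch that streams a UniProt REST URL to a local file. The function name download_fasta and the opener parameter are made up for illustration, and the script is independent of the procedure documented in our help; it only produces the FASTA file.

```python
import shutil
import urllib.request


def download_fasta(url: str, dest_path: str, opener=urllib.request.urlopen) -> int:
    """Stream the response body for url into dest_path.

    Returns the number of bytes written. The opener argument exists only so
    that the function can be exercised without network access.
    """
    with opener(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks, so a large proteome is never held in memory at once.
        shutil.copyfileobj(response, out)
        return out.tell()


if __name__ == "__main__":
    # Human proteome URL in the new format; the output file name is arbitrary.
    url = ("https://rest.uniprot.org/uniprotkb/stream"
           "?query=proteome:UP000005640&format=fasta"
           "&compressed=false&includeIsoform=true")
    print(download_fasta(url, "UP000005640.fasta"), "bytes written")
```

Because the new API sends no Last-Modified header, a script like this cannot tell whether a new release is available, so it simply downloads the data every time it runs.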

Mascot Server 2.3 and earlier

Versions 2.3 and earlier don’t have Database Manager, so these versions are unaffected.
