Sequence database setup: UniProt proteomes


A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.

UniProtKB is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource.

First, find the Proteome ID for your proteome of interest. For example, search the UniProt proteomes pages for rice by name or by taxonomy ID. The Proteome ID for Oryza sativa subsp. japonica is UP000059680.

In Database Manager, create a new custom definition, as follows:

  1. Fasta or New database; Create New
  2. Use pre-defined template; UniProt_proteome_template
  3. Create
  4. Download from remote URL; Next
  5. Set up download URL
  6. Paste the following into the FASTA file URL field, where the proteome ID is for your proteome of interest
  7. Save; Start downloading
  8. Activate

The complete configuration for the rice proteome in Database Manager would look similar to this (except for the URL, which is shown in an outdated format):

Mascot database manager

Once configured, you can enable automatic updating by clicking on the database name and then choosing Edit schedule.


  • Locate the proteome for your organism of interest by searching UniProt by name or by taxonomy ID
  • Click on the Proteome ID link
  • Click on the Download button and choose All protein entries, Fasta (Canonical and isoform), compressed


Taxonomy is not required for a single-organism database.

Parse Rules

When a single entry is expanded into entries for multiple isoforms, they all share the same ID, so the accession (AC) must be used as the unique identifier. For example:

>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4

AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
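
To check the two rules outside Mascot, they can be approximated with Python regular expressions. This is only a sketch: it assumes the rules use basic-regex grouping, so \( ... \) becomes ( ... ) and the literal pipe is escaped as \|. The names AC_RULE, DESCRIPTION_RULE, ID_RULE and parse_title are made up for this illustration, and ID_RULE is not a Mascot rule from this page; it is included only to show the ID that all isoforms of the entry would share.

```python
import re

# Approximate Python forms of the two parse rules above.
AC_RULE = re.compile(r">..\|([^|]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")
# Illustration only (not a rule from this page): extracts the entry ID.
ID_RULE = re.compile(r">..\|[^|]*\|([^ ]*)")

# The example title line from this page.
TITLE = (">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA "
         "ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4")


def parse_title(title):
    """Return (accession, entry ID, description) for one Fasta title line."""
    ac = AC_RULE.match(title)
    entry_id = ID_RULE.match(title)
    description = DESCRIPTION_RULE.match(title)
    if not (ac and entry_id and description):
        raise ValueError("title does not match the expected layout: " + title)
    return ac.group(1), entry_id.group(1), description.group(1)


if __name__ == "__main__":
    accession, entry_id, description = parse_title(TITLE)
    print("AC:         ", accession)
    print("ID:         ", entry_id)
    print("Description:", description)
```

The accession carries the isoform suffix (-2) while the ID does not, which is why AC is the safer choice of unique identifier.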

Configuration (Mascot 2.3 and earlier)

A Fasta file containing canonical and isoform sequences for the rice proteome was downloaded to /usr/local/mascot/sequence/rice_proteome/current, and renamed to rice_proteome_20120414.fasta.
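
The download and rename step could be scripted along the following lines. This is a minimal sketch, assuming the file was downloaded in compressed (gzip) form and that the database name followed by a YYYYMMDD date stamp is the desired naming pattern, as in the example above. The function name install_fasta and its parameters are hypothetical, not part of Mascot.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def install_fasta(compressed, target_dir, db_name, stamp):
    """Decompress a gzipped Fasta file into target_dir as <db_name>_<YYYYMMDD>.fasta.

    compressed -- path to the downloaded .gz file (assumed to be gzip)
    target_dir -- directory the database definition points at
    db_name    -- database name used as the file name prefix
    stamp      -- datetime.date used for the date part of the file name
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{db_name}_{stamp:%Y%m%d}.fasta"
    # Stream the decompressed data so large proteomes are not held in memory.
    with gzip.open(compressed, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For the rice example, a call such as install_fasta("download.fasta.gz", "/usr/local/mascot/sequence/rice_proteome/current", "rice_proteome", date(2012, 4, 14)) would produce rice_proteome_20120414.fasta, where download.fasta.gz stands for whatever name the downloaded file actually has.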

Mascot database maintenance utility

Full text for individual entries can be retrieved across the web from UniProt:

Port: 80
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
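
The way these settings combine can be illustrated with a short sketch. It assumes that the text captured by the parse rule is substituted for the #ACCESSION# placeholder in the path; that behaviour is an assumption made for illustration, not something stated on this page. The host name is not part of the settings shown, so it is left out, and full_text_path is a hypothetical helper name.

```python
import re

PATH_TEMPLATE = "/uniprot/#ACCESSION#.txt"
# Python form of the parse rule "\(.*\)": capture the whole accession string.
RULE_23 = re.compile(r"(.*)")


def full_text_path(accession):
    """Build the request path for one accession (assumed substitution behaviour)."""
    captured = RULE_23.match(accession).group(1)
    return PATH_TEMPLATE.replace("#ACCESSION#", captured)


if __name__ == "__main__":
    print(full_text_path("Q67W82-2"))
```

Because the rule captures everything, an isoform accession such as Q67W82-2 is passed through unchanged, suffix included.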

Always test a new definition before applying the changes to mascot.dat.