Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Sequence databases

Information on relevant sequence databases can be found by following the links below. Additionally, the first issue of Nucleic Acids Research each year contains status reports from the curators of the major databases.


dbEST is the division of GenBank that contains "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.


Entries from the DNA Databank of Japan (DDBJ) are wholly incorporated into GenBank.


The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and submitted directly by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Databank of Japan (DDBJ).


The Ensembl project produces genome databases for vertebrates and other eukaryotic species. Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute.


GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. The complete release notes for the current version of GenBank are available by FTP. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.


IPI (International Protein Index) provided a top-level guide to the main databases that described the proteomes of higher eukaryotic organisms. The databases are no longer updated; the last releases were made on 27 September 2011. The suggested replacements are the UniProt proteomes.


MSDB has not been updated since 2006 and should be considered obsolete.


NCBI maintains composite, non-identical protein and nucleic acid databases for their search tools BLAST and Entrez. The entries in the protein database, nr, have been compiled from GenBank CDS translations, PIR, SWISS-PROT, PRF, and PDB. NCBI has made strong efforts to cross-reference the sequences in these databases in order to avoid duplication.
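
To illustrate what "non-identical" means in practice, here is a minimal Python sketch that collapses identical sequences into a single entry while keeping every source identifier. It is only an illustration under stated assumptions, not NCBI's actual procedure: the function name `merge_identical`, and the identifiers and sequences in the usage lines, are hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (identifier, sequence) pairs so each distinct sequence appears once.

    Returns a dict mapping each sequence to the list of identifiers that share it.
    Comparison is case-insensitive; no other normalisation is attempted.
    """
    by_sequence = defaultdict(list)
    for identifier, sequence in records:
        by_sequence[sequence.upper()].append(identifier)
    return dict(by_sequence)


# Hypothetical records: two source databases hold the same sequence.
example_records = [
    ("src1|A", "MKTAYIAKQR"),
    ("src2|B", "mktayiakqr"),
    ("src1|C", "MSLLTEVETY"),
]
merged = merge_identical(example_records)
for sequence, identifiers in merged.items():
    print(sequence, identifiers)
```

Each distinct sequence is stored once, and the list of identifiers plays the role of the cross-references between the source databases.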


OWL has not been updated since May 1999, and should be considered obsolete.


The Brookhaven Protein Data Bank (PDB) is a database of three-dimensional structures. This means that entries are invariably well characterised, with reliable sequence data which can also be found in the other databases. Entries which are unique to PDB tend to be variant proteins, with distorted structures, which were used to refine a structural determination.


The PIR (Protein Information Resource) database was initiated at the NBRF in the early 1960s by the late Margaret O. Dayhoff as a collection of sequences for the study of evolutionary relationships among proteins. The database is now an international collaboration of three data centers: the NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). The three centers cooperate to produce and distribute a single database of 'wild-type' protein sequences.


The Protein Research Foundation of Japan database contains protein sequences abstracted from scientific publications.

UniProt (Swiss-Prot & TrEMBL)

The UniProt protein knowledgebase consists of two sections: Swiss-Prot, which is manually annotated and reviewed, and TrEMBL, which is automatically annotated and is not reviewed. UniProt is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
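
In UniProt FASTA files, the two sections can be told apart by the header prefix: `sp|` for Swiss-Prot entries and `tr|` for TrEMBL entries. The following Python sketch classifies a header line on that basis. The function name `uniprot_section` is an illustrative assumption, and the TrEMBL accession in the usage lines is made up for the example.

```python
def uniprot_section(header: str) -> str:
    """Return 'Swiss-Prot', 'TrEMBL', or 'unknown' for a UniProt FASTA header line."""
    # A header looks like ">sp|ACCESSION|ENTRY_NAME description ..."
    prefix = header.lstrip(">").split("|", 1)[0]
    return {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(prefix, "unknown")


# The first header is a reviewed human entry; the second accession is hypothetical.
reviewed = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
unreviewed = ">tr|X0X0X0|X0X0X0_HUMAN Hypothetical example protein OS=Homo sapiens"

print(uniprot_section(reviewed))
print(uniprot_section(unreviewed))
```

A check like this can be useful when a combined FASTA file is searched and only reviewed entries are wanted.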

UniProt also curates a comprehensive collection of proteomes for species with completely sequenced genomes.