Sequence databases

Mascot Server ships with predefined definitions for several common sequence databases. The databases enabled in the free, public Mascot service are listed in the Mascot search overview. If you have an in-house licence, you can enable further predefined definitions and add custom FASTA files as searchable databases.
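Before adding a custom FASTA file as a database, it can help to check that the file is well formed. The following is a minimal sketch in Python, not part of Mascot: the function name summarize_fasta and the example entries are hypothetical, and it assumes that each entry's identifier is the text between ">" and the first whitespace on the title line.

```python
from collections import Counter
import io


def summarize_fasta(handle):
    """Return (entry_count, duplicate_ids) for FASTA text read from handle."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            # Assumption: the identifier is the text up to the first whitespace.
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    counts = Counter(ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return len(ids), duplicates


# Made-up entries for illustration only.
example = io.StringIO(
    ">prot1 hypothetical protein A\nMKTAYIAKQR\n"
    ">prot2 hypothetical protein B\nGSHMLEDPVA\n"
    ">prot1 duplicate of A\nMKTAYIAKQR\n"
)
count, dups = summarize_fasta(example)
print(count, dups)
```

A file with repeated identifiers, as in this example, is worth cleaning up before use, because entries are normally expected to be uniquely identifiable.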

This page collects information on sequence databases that can be used with Mascot. In addition, the first issue of Nucleic Acids Research each year contains status reports from the curators of the major databases.

dbEST

dbEST is the division of GenBank that contains "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

DDBJ

Entries from the DNA Data Bank of Japan (DDBJ) are wholly incorporated into GenBank.

EMBL

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and submitted directly by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).

Ensembl

The Ensembl project produces genome databases for vertebrates and other eukaryotic species. Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute.

GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. The complete release notes for the current version of GenBank are available by FTP. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

IPI

The International Protein Index (IPI) provided a top-level guide to the main databases that described the proteomes of higher eukaryotic organisms. The last update was in 2011. The databases are no longer available and should be considered obsolete.

MSDB

MSDB has not been updated since 2006 and should be considered obsolete.

NCBI nr

NCBI maintains composite, non-identical protein and nucleic acid databases for their search tools BLAST and Entrez. The entries in the protein database, nr, have been compiled from GenBank CDS translations, PIR, SWISS-PROT, PRF, and PDB. NCBI has made strong efforts to cross-reference the sequences in these databases in order to avoid duplication.

OWL

OWL has not been updated since May 1999, and should be considered obsolete.

PDB

The Brookhaven Protein Data Bank (PDB) is a database of three-dimensional structures. This means that entries are invariably well characterised, with reliable sequence data which can also be found in the other databases. Entries which are unique to PDB tend to be variant proteins, with distorted structures, which were used to refine a structural determination.

PIR

The PIR (Protein Information Resource) database was initiated at the NBRF in the early 1960s by the late Margaret O. Dayhoff as a collection of sequences for the study of evolutionary relationships among proteins. The database is now an international collaboration of three data centers: the NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). The three centers cooperate to produce and distribute a single database of 'wild-type' protein sequences.

PRF

The Protein Research Foundation of Japan database contains protein sequences abstracted from scientific publications.

UniProt (Swiss-Prot & TrEMBL)

The UniProt Protein Knowledgebase consists of two sections: Swiss-Prot, which is manually annotated and reviewed, and TrEMBL, which is automatically annotated and is not reviewed. UniProt is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
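In FASTA files distributed by UniProt, the title line of each entry starts with ">sp|" for Swiss-Prot and ">tr|" for TrEMBL, so the two sections can be told apart from the header alone. The following is a minimal sketch in Python; the function name uniprot_section is hypothetical, and the accessions and entry names in the example are made up for illustration.

```python
def uniprot_section(title_line):
    """Return the UniProt section implied by a FASTA title line, or None."""
    if title_line.startswith(">sp|"):
        return "Swiss-Prot"  # manually annotated and reviewed
    if title_line.startswith(">tr|"):
        return "TrEMBL"  # automatically annotated, not reviewed
    return None  # not a UniProt-style title line


# Made-up title lines for illustration only.
titles = [
    ">sp|X00001|TEST1_HUMAN Example reviewed protein",
    ">tr|X00002|X00002_HUMAN Example unreviewed protein",
    ">prot1 custom entry",
]
sections = [uniprot_section(t) for t in titles]
print(sections)
```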

UniProt also curates a comprehensive collection of proteomes for species with completely sequenced genomes.

Matrix Science