Matrix Science

Sequence Database Setup: TrEMBL

TrEMBL Release 2011_06
A change in the format of speclist.txt has broken taxonomy assignment for TrEMBL release 2011_06 onwards. This issue will be corrected in the next release of Mascot Server (2.4). For Mascot 2.3 and earlier, we have posted a modified version of speclist.txt and will keep it updated with each new TrEMBL release for the foreseeable future.

If you use the database update script (db_update.pl) to perform automatic updates of TrEMBL, change the URL for downloading speclist.txt in the relevant definition block to http://www.matrixscience.com/downloads/speclist.txt

If you have discovered this problem after updating to release 2011_06, the procedure to correct it is as follows:

  1. Windows: stop the Mascot service, Unix: kill ms-monitor.exe
  2. Delete the *.stats file in the database current directory
  3. Download the modified speclist.txt to the taxonomy directory
  4. Windows: start the Mascot service, Unix: execute ms-monitor.exe
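
Steps 2 and 3 of this procedure can be scripted. The following is a minimal Python sketch, not part of Mascot: the function names and the directory arguments are illustrative assumptions, and the Mascot service or ms-monitor.exe must still be stopped and restarted by hand as in steps 1 and 4.

```python
import glob
import os
import urllib.request

# URL of the modified speclist.txt, as given above.
SPECLIST_URL = "http://www.matrixscience.com/downloads/speclist.txt"


def remove_stats_files(current_dir):
    """Step 2: delete every *.stats file in the database 'current' directory."""
    removed = []
    for path in glob.glob(os.path.join(current_dir, "*.stats")):
        os.remove(path)
        removed.append(path)
    return sorted(removed)


def download_speclist(taxonomy_dir, url=SPECLIST_URL):
    """Step 3: download speclist.txt into the taxonomy directory."""
    target = os.path.join(taxonomy_dir, "speclist.txt")
    urllib.request.urlretrieve(url, target)
    return target
```

Call remove_stats_files with the path of the TrEMBL current directory and download_speclist with the path of the taxonomy directory; both paths depend on the local installation.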

Overview

TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.

TrEMBL is developed by the SWISS-PROT groups at SIB and EBI.

Download

Expasy: ftp://ftp.expasy.org/databases/uniprot/knowledgebase
EBI: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase

The EBI site mirrors the Expasy site. The relevant files are:

  • Version info: reldate.txt
  • TrEMBL Fasta file: uniprot_trembl.fasta.gz
  • TrEMBL Dat file: uniprot_trembl.dat.gz

To download TrEMBL updates automatically, the relevant definition block in db_update.pl is Trembl_complete_from_EBI.

Taxonomy

Taxonomy is identical to that for SwissProt, and is predefined in mascot.dat. Even if you have the Trembl Dat file, choose "SwissProt FASTA". Verify that the taxonomy definition in mascot.dat is up to date:
# TAXONOMY FOR SwissProt or Trembl from the fasta file
Taxonomy_3
Identifier SwissProt FASTA
Enabled 1 # 0 to disable it
FromRefFile 0
DescriptionLineSep 0 # ctrl a - hex code '1'. For multiple descriptions per entry
SpeciesFiles NCBI:names.dmp, SWISSPROT:speclist.txt
NodesFiles NCBI:nodes.dmp, NCBI:merged.dmp
DefaultRule SWISSPROT, CHOP: ">[^_]*_\([^ ]*\) " # Anything after _ before space
end
#
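
To see what the DefaultRule pattern captures, it can be tried against a TrEMBL Fasta title line. The sketch below translates the pattern into Python's regular expression syntax, which is an assumption about equivalence rather than Mascot's own parser. It covers only the pattern match, not the subsequent lookup in speclist.txt, and the function name is illustrative.

```python
import re

# Python translation of the DefaultRule pattern ">[^_]*_\([^ ]*\) ".
# Mascot marks the captured group with \( \); Python's re module uses ( ).
SPECIES_RULE = re.compile(r">[^_]*_([^ ]*) ")


def species_code(title_line):
    """Return the text between the first underscore and the next space, or None."""
    match = SPECIES_RULE.match(title_line)
    return match.group(1) if match else None


title = (">tr|A0AQI4|A0AQI4_9ARCH Putative ammonia monooxygenase (Fragment) "
         "OS=uncultured archaeon GN=amoA PE=4 SV=1")
print(species_code(title))  # prints 9ARCH
```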

Note that mascot.dat must be saved as plain text, so be careful if using a word processor, and ensure that the filename is not changed to mascot.dat.txt or similar.

The following taxonomy files are required:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
http://www.matrixscience.com/downloads/speclist.txt

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory. Also, some files need to be unpacked (using tar) as well as uncompressed.
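
Unpacking the NCBI archive can be sketched in Python as follows. This is an illustration under stated assumptions, not a Mascot utility: the function name is hypothetical, and the list of wanted files is taken from the NCBI entries in the taxonomy definition above.

```python
import os
import shutil
import tarfile

# The NCBI files named in the SpeciesFiles and NodesFiles lines.
NCBI_FILES = ("names.dmp", "nodes.dmp", "merged.dmp")


def unpack_taxdump(archive_path, taxonomy_dir, wanted=NCBI_FILES):
    """Uncompress taxdump.tar.gz and copy the wanted members into taxonomy_dir."""
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = os.path.basename(member.name)
            if member.isfile() and name in wanted:
                source = tar.extractfile(member)
                with open(os.path.join(taxonomy_dir, name), "wb") as out:
                    shutil.copyfileobj(source, out)
                extracted.append(name)
    return sorted(extracted)
```

The "r:gz" mode handles both the decompression and the unpacking, so no separate gzip step is needed.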

Parse Rules

A typical Trembl Fasta title line is:

>tr|A0AQI4|A0AQI4_9ARCH Putative ammonia monooxygenase (Fragment) OS=uncultured archaeon GN=amoA PE=4 SV=1

You can use either the ID (A0AQI4_9ARCH) or the AC (A0AQI4) as the identifier.

ID from Fasta title: ">..|[^|]*|\([^ ]*\)"
AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
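
The three rules can be checked against the example title line outside Mascot. The sketch below uses Python translations of the rules, so it assumes the two syntaxes behave alike here: Mascot marks the captured group with \( \) and treats | as a literal character, whereas Python captures with ( ) and needs | escaped.

```python
import re

TITLE = (">tr|A0AQI4|A0AQI4_9ARCH Putative ammonia monooxygenase (Fragment) "
         "OS=uncultured archaeon GN=amoA PE=4 SV=1")

# Python equivalents of the three Fasta title rules.
ID_RULE = re.compile(r">..\|[^|]*\|([^ ]*)")
AC_RULE = re.compile(r">..\|([^|]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")

for label, rule in (("ID", ID_RULE), ("AC", AC_RULE),
                    ("Description", DESCRIPTION_RULE)):
    # group(1) is the part of the title that the rule captures.
    print(label, "->", rule.match(TITLE).group(1))
```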

The corresponding lines in the Dat file are:

ID   A0AQI4_9ARCH              Unreviewed;         206 AA.
AC   A0AQI4;

ID from Ref file: "^ID   \([^ ]*\)"
AC from Ref file: "^AC   \([-A-Z0-9_]*\)"
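
The Ref file rules can be tried in the same way. In this Python sketch, which is a translation and not Mascot's own parser, re.MULTILINE makes ^ match at the start of each line of the entry. The character class in the AC rule excludes the semicolon, so the capture stops before it.

```python
import re

ENTRY = (
    "ID   A0AQI4_9ARCH              Unreviewed;         206 AA.\n"
    "AC   A0AQI4;\n"
)

# Python equivalents of the two Ref file rules.
REF_ID_RULE = re.compile(r"^ID   ([^ ]*)", re.MULTILINE)
REF_AC_RULE = re.compile(r"^AC   ([-A-Z0-9_]*)", re.MULTILINE)

ref_id = REF_ID_RULE.search(ENTRY).group(1)
ref_ac = REF_AC_RULE.search(ENTRY).group(1)
print(ref_id, ref_ac)  # prints A0AQI4_9ARCH A0AQI4
```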

Configuration

For this example, the database files were downloaded to C:\Inetpub\MASCOT\sequence\Trembl\current, decompressed using gzip, and renamed to Trembl_39.0.dat and Trembl_39.0.fasta.

When updating an active database, it is important to rename the Fasta file last, because Mascot will begin database exchange as soon as it sees a new Fasta file that matches the wildcard path for the database.
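
The renaming order can be enforced in a script. The following Python sketch is illustrative only: the function name and the mapping argument are assumptions, and the target names follow the Trembl_39.0 example above.

```python
import os


def publish_release(current_dir, downloaded, version):
    """Rename downloaded files to Trembl_<version>.<ext>, Fasta file last.

    `downloaded` maps an extension ("dat", "fasta") to the name of the
    decompressed file in current_dir.
    """
    # False sorts before True, so "fasta" always ends up at the end.
    order = sorted(downloaded, key=lambda ext: ext == "fasta")
    renamed = []
    for ext in order:
        target = os.path.join(current_dir, "Trembl_%s.%s" % (version, ext))
        os.rename(os.path.join(current_dir, downloaded[ext]), target)
        renamed.append(os.path.basename(target))
    return renamed
```

For example, publish_release(current_dir, {"fasta": "uniprot_trembl.fasta", "dat": "uniprot_trembl.dat"}, "39.0") renames the Dat file before the Fasta file, whatever order the mapping lists them in.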

[Screen shot: Mascot database maintenance utility]

If you decide not to have the reference file locally, full text for individual entries can be retrieved across the web from Uniprot or an SRS server. For Uniprot, the required entries are:

Host: www.uniprot.org
Port: 80
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"

Where #ACCESSION# represents either the AC or ID. For an SRS server, the syntax for the Path field is:

Retrieve by ID: /srsbin/cgi-bin/wgetz?-e+[UNIPROT-id:#ACCESSION#]+-vn+2
Retrieve by AC: /srsbin/cgi-bin/wgetz?-e+[UNIPROT-acc:#ACCESSION#]+-vn+2
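
To check a Host, Port, and Path combination before entering it, the resulting address can be assembled by hand and opened in a browser. The sketch below assumes that #ACCESSION# is replaced literally and that the request is made over plain HTTP. The function name is illustrative and is not part of Mascot.

```python
def full_text_url(host, port, path, accession):
    """Substitute the identifier into the Path template and build an HTTP URL."""
    return "http://%s:%d%s" % (host, port, path.replace("#ACCESSION#", accession))


print(full_text_url("www.uniprot.org", 80, "/uniprot/#ACCESSION#.txt", "A0AQI4"))
# prints http://www.uniprot.org:80/uniprot/A0AQI4.txt
```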

This screen shot illustrates a configuration in which the identifier is AC, there is no local Dat file, and full text is retrieved from Uniprot:

[Screen shot: Mascot database maintenance utility]

If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose "--- no full text report ---" in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.

 
 
Copyright © 2011 Matrix Science Ltd. All Rights Reserved.