Sequence Database Setup: EMBL EST
EMBL EST Format Change
EMBL have dropped the semicolon at the end of the accession string
in the Fasta title line. You will need to update your Accession from Fasta title
parse rule, and also the parse rule in the Taxonomy definition block, to match the examples shown below.
This change first appeared in the Fasta files for release 109 at the beginning of October 2011.
Overview
The EST Fasta files from EMBL
contain "single-pass" cDNA sequences, or Expressed Sequence Tags. The sequences are divided
into 10 divisions:
- ENV: Environmental Samples
- FUN: Fungi
- HUM: Human
- INV: Invertebrates
- MAM: Other Mammals
- MUS: Mus musculus
- PLN: Plants
- PRO: Prokaryotes
- ROD: Rodents
- VRT: Other Vertebrates
Download
Individual Fasta files can be downloaded from the
EBI FTP server.
On this help page, the rodents file is used as an example. To work with other divisions, simply substitute the
three letter code. For example, the compressed Fasta file for rodents is em_rel_est_rod.gz,
while the one for fungi is em_rel_est_fun.gz.
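The naming pattern can be sketched in a few lines of Python. This is only an illustration: the function name est_filename is hypothetical, and it assumes that every division follows the same em_rel_est_<code>.gz pattern as the two file names quoted above.

```python
# Division codes and names as listed in the Overview.
DIVISIONS = {
    "env": "Environmental Samples",
    "fun": "Fungi",
    "hum": "Human",
    "inv": "Invertebrates",
    "mam": "Other Mammals",
    "mus": "Mus musculus",
    "pln": "Plants",
    "pro": "Prokaryotes",
    "rod": "Rodents",
    "vrt": "Other Vertebrates",
}


def est_filename(code):
    """Return the compressed Fasta file name for a three letter division code.

    Assumption: all divisions follow the pattern shown for rodents and fungi.
    """
    code = code.lower()
    if code not in DIVISIONS:
        raise ValueError(f"unknown EST division: {code!r}")
    return f"em_rel_est_{code}.gz"


rodents_file = est_filename("ROD")
fungi_file = est_filename("fun")
```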
Note that versions of wget up to 1.10.x have problems with files larger than 2 GB on 32-bit platforms. The
current stable release, 1.11.4, works correctly.
Windows binaries can be downloaded from SourceForge.
Taxonomy
Taxonomy for EMBL EST files requires Mascot 2.3 or later. For earlier versions of Mascot, configure without
taxonomy. The following taxonomy files are required:
ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/acc_to_taxid.mapping.txt.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Note that the taxonomy files go into the taxonomy directory, not into the sequence database
directory. Also, some files need to be unpacked (using tar) as well as uncompressed.
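If gzip and tar are not to hand, the two taxonomy downloads could be unpacked with a short Python script along the following lines. This is a minimal sketch, not part of Mascot: the function name unpack_taxonomy and both directory arguments are hypothetical, and it assumes the two files have already been downloaded under their original names.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_taxonomy(download_dir, taxonomy_dir):
    """Unpack the two taxonomy downloads into the taxonomy directory."""
    download_dir = Path(download_dir)
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)

    # acc_to_taxid.mapping.txt.gz is only compressed, so uncompress it.
    mapping_gz = download_dir / "acc_to_taxid.mapping.txt.gz"
    mapping_txt = taxonomy_dir / "acc_to_taxid.mapping.txt"
    with gzip.open(mapping_gz, "rb") as src, open(mapping_txt, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # taxdump.tar.gz is a compressed tar archive, so it must be
    # uncompressed and unpacked (names.dmp, nodes.dmp, and so on).
    with tarfile.open(download_dir / "taxdump.tar.gz", "r:gz") as archive:
        archive.extractall(taxonomy_dir)
```

The second argument should point at the Mascot taxonomy directory, not at the sequence database directory.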
The Taxonomy definition block in
mascot.dat should be as follows:
# TAXONOMY FOR EMBL EST
Taxonomy_13
Identifier EMBL EST Fasta
Enabled 1 # 0 to disable it
FromRefFile 0
ErrorLevel 0
SpeciesFiles ACC2TAXID:acc_to_taxid.mapping.txt, NCBI:names.dmp
NodesFiles NCBI:nodes.dmp, NCBI:merged.dmp
DefaultRule ACC2TAXID, CHOP: ">EM_EST:\([A-Z0-9]*\)"
GencodeFiles NCBI:gencode.dmp
MitochondrialTranslation 0
end
UniGene
The NCBI UniGene
indexes are created by automatically partitioning GenBank sequences into non-redundant sets of
gene-oriented clusters. If UniGene indexes are available locally, results from Mascot searches of
EST databases can be grouped and reported by gene family, rather than by raw EST accession numbers.
To enable UniGene indexes, uncomment the following line, near the top of the
db_update.pl script:
# $local_unigene_directory = "$MASCOT/unigene";
This will cause the required UniGene indexes to be downloaded when the EST databases are next updated.
You will also need to uncomment and possibly modify the relevant lines in the UniGene block of mascot.dat. For example:
arabidopsis /usr/local/mascot/sequence/unigene/arabidopsis/current/At.data
Plants_EST arabidopsis barley maize rice wheat
A control to map database accessions to UniGene families will then be added to the format controls
in MS/MS Summary reports for searches of the enabled databases.
Parse Rules
A typical Fasta title line is:
>EM_EST:AA012645 AA012645.1 RPU0101AC Rat myometrium, differential display Rattus ...
Suitable parse rules are:
Accession from Fasta title (all databases except ENV) : ">EM_EST:\([A-Z0-9]*\)"
Accession from Fasta title (ENV) : ">EM_ENV:\([A-Z0-9]*\)"
Description from Fasta title (all databases) : ">[^ ]* \(.*\)"
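To check what these rules extract, they can be tried outside Mascot. The sketch below translates them into Python's re syntax, where the escaped parentheses \( \) of the Mascot rules become plain ( ). It assumes the two syntaxes otherwise behave the same for these simple patterns; the function name parse_title is hypothetical.

```python
import re

# Python equivalents of the parse rules: \( ... \) becomes ( ... ).
ACCESSION_EST = re.compile(r">EM_EST:([A-Z0-9]*)")
ACCESSION_ENV = re.compile(r">EM_ENV:([A-Z0-9]*)")
DESCRIPTION = re.compile(r">[^ ]* (.*)")


def parse_title(title, env=False):
    """Return (accession, description) for a Fasta title line.

    Either element is None if the corresponding rule does not match.
    """
    rule = ACCESSION_ENV if env else ACCESSION_EST
    acc = rule.match(title)
    desc = DESCRIPTION.match(title)
    return (
        acc.group(1) if acc else None,
        desc.group(1) if desc else None,
    )


# The typical title line from this page, on a single line.
title = ">EM_EST:AA012645 AA012645.1 RPU0101AC Rat myometrium, differential display Rattus ..."
accession, description = parse_title(title)
print(accession)    # AA012645
print(description)  # AA012645.1 RPU0101AC Rat myometrium, differential display Rattus ...
```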
Configuration
For this example, em_rel_est_rod.gz was downloaded to a folder named
C:\Inetpub\Mascot\sequence\Rodents_EST\current.
The file was decompressed using gzip,
and renamed to Rodents_EST_109.fasta (because it was from EMBL release 109).
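The decompress and rename steps can also be done in one pass. The following is a minimal sketch, assuming Python is available on the Mascot server; the function name install_fasta is hypothetical, and the output name simply follows the Rodents_EST_109.fasta convention used in this example.

```python
import gzip
import shutil
from pathlib import Path


def install_fasta(gz_path, db_name, release):
    """Decompress a downloaded .gz file to <db_name>_<release>.fasta.

    The Fasta file is written to the same folder as the download,
    and its path is returned.
    """
    gz_path = Path(gz_path)
    target = gz_path.parent / f"{db_name}_{release}.fasta"
    # Stream the data so that multi-gigabyte files are not held in memory.
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For the example on this page, the call would be install_fasta(r"C:\Inetpub\Mascot\sequence\Rodents_EST\current\em_rel_est_rod.gz", "Rodents_EST", 109).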
Full text for individual entries can be retrieved across the web
from the EBI at www.ebi.ac.uk. The syntax for the Path field is:
/cgi-bin/emblfetch?id=#ACCESSION#
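As an illustration of how the Host, Port, and Path fields combine, the sketch below builds the address for one entry. It assumes that #ACCESSION# is replaced by the accession of the entry, and that the request uses plain HTTP on port 80; the function name full_text_url is hypothetical and is not part of Mascot.

```python
def full_text_url(accession,
                  host="www.ebi.ac.uk",
                  port=80,
                  path="/cgi-bin/emblfetch?id=#ACCESSION#"):
    """Build the full text URL for one accession.

    Assumption: the #ACCESSION# placeholder is replaced verbatim by
    the accession string, and the scheme is plain HTTP.
    """
    expanded = path.replace("#ACCESSION#", accession)
    # Only show the port in the URL when it is not the HTTP default.
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{expanded}"


url = full_text_url("AA012645")
print(url)  # http://www.ebi.ac.uk/cgi-bin/emblfetch?id=AA012645
```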
If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank
and choose
--- no full text report ---
in the drop-down list.
Always test a new definition before applying the changes to mascot.dat.