Posted by Ville Koskinen (April 13, 2018)

Keeping genome databases up to date

Database Manager is a great tool for keeping your sequence databases up to date in Mascot. If the database is available as a ready-made FASTA file, all you need to do is enable it as a predefined definition, or set up a definition to download the file from a known URL (see the help for more details). Updating the database is as simple as clicking on a button or scheduling a regular update job.

Sometimes you need an extra processing step between downloading and adding the file to Mascot. For example, if you want to search a typical genome, we recommend splitting it into overlapping segments. Once split, the generated FASTA file can be copied or uploaded to Mascot as usual. When you want to update the database, just repeat these manual steps.
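
To make "overlapping segments" concrete, here is a minimal Python sketch of the idea. It is not splitter.pl: the function name and the segment length and overlap values are illustrative assumptions, and the real script's defaults and output format may well differ.

```python
def split_overlapping(sequence, segment_len, overlap):
    """Return (start, segment) pairs that cover `sequence`.

    Consecutive segments share `overlap` bases, so any stretch of the
    genome no longer than `overlap` lies wholly inside at least one
    segment, even if it straddles a cut point.
    """
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be >= 0 and smaller than segment_len")
    step = segment_len - overlap
    segments = []
    start = 0
    while True:
        segments.append((start, sequence[start:start + segment_len]))
        if start + segment_len >= len(sequence):
            break  # this segment reached the end of the sequence
        start += step
    return segments


if __name__ == "__main__":
    # Toy values only; a real genome would use far longer segments.
    for start, segment in split_overlapping("ACGTACGTAC", segment_len=4, overlap=2):
        print(start, segment)
```

The overlap is what stops a match being lost just because it happens to span the boundary between two segments.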

It’s possible to automate the procedure using two lesser-known features of Database Manager: “file URLs” (since Mascot 2.5) and triggering a database update from the command line (since Mascot 2.4). The obvious benefit is that the update can now be scheduled as a recurring event. It also allows you to “push” sequences to Mascot on demand without having to click a button in the user interface.

To illustrate, we’ll use the Helicobacter pylori HPAG1 genome from the European Nucleotide Archive. The accession number of the complete genome is CP000241, and the EBI provide a simple URL for fetching the sequence. Assume Mascot is installed on a Windows system in C:\inetpub\mascot, and that you have a suitable working space in C:\work. (The steps below work on Linux as well – just change the filepaths.)

Setting up the environment

First, set up your environment and ensure the manual steps work:

  1. Download splitter.pl.gz. Extract it in C:\inetpub\mascot\bin.
  2. Download the FASTA file and save it in C:\work\hpylori.
  3. Use splitter.pl to confirm it creates output files where you expect:

    C> C:
    C> cd \inetpub\mascot\bin
    C> ..\perl64\bin\perl splitter.pl \work\hpylori\CP000241.fasta
    

    You should see the new file C:\work\hpylori\split_CP000241.fasta.

Next, set up the target database definition:

  1. In Database Manager, choose Create New in the left-hand menu.
  2. Type in a name; we’ll call it hpylori_split.
  3. Choose to create from the template simple_NA_template.
  4. When asked where to get the FASTA file from, choose the option “copy from Mascot server hard disk”. Use the “URL” C:\work\hpylori\split_CP000241.fasta, and leave the filename pattern at its default setting.
  5. Click on “Start downloading”.

If all is working correctly, you should see Database Manager pick up split_CP000241.fasta.

Triggering an update in Database Manager

Now it’s time to automate the download and update steps! It’s very easy to ask Database Manager to update a database: simply call the script dbman_add_task.pl with one argument, the database name. The script adds a task to Database Manager’s internal job queue, exactly as if you had clicked the “Get new files” or “Update” button.

You can use any scripting or programming language that is capable of running external programs. Ensure you run the script from the Mascot bin directory:

  cd \inetpub\mascot\bin
  ..\perl64\bin\perl dbman_add_task.pl <name of database>
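
As an illustration of calling the script from another language, here is a rough Python sketch. The function names and the mascot_root parameter are my own, and it assumes dbman_add_task.pl signals failure with a non-zero exit code, which may not be the case.

```python
import subprocess
from pathlib import Path


def add_task_command(mascot_root, db_name):
    """Build the argument list for dbman_add_task.pl.

    Uses an absolute path to Mascot's bundled Perl so that the lookup
    does not depend on how the platform resolves relative executables.
    """
    perl = Path(mascot_root) / "perl64" / "bin" / "perl"
    return [str(perl), "dbman_add_task.pl", db_name]


def schedule_update(mascot_root, db_name):
    """Queue an update task, running the script from the Mascot bin directory."""
    bin_dir = Path(mascot_root) / "bin"
    # check=True raises CalledProcessError on a non-zero exit code
    # (assumption: the script reports failure that way).
    subprocess.run(add_task_command(mascot_root, db_name), cwd=bin_dir, check=True)
```

With the installation used in this article, the call would be schedule_update(r"C:\inetpub\mascot", "hpylori_split").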

Example script

I’ve written a short example script in Perl: update_hpylori_split.pl. The script will work with Mascot 2.6 on a Windows system, but you can easily change the filepaths to work on a Linux system. Run the script thus:

    C> cd \inetpub\mascot\bin
    C> ..\perl64\bin\perl update_hpylori_split.pl

The relevant lines are shown below:

  # Use absolute paths for directories.
  my $work_dir = "C:/work/hpylori";
  my $mascot_bin_dir = "C:/inetpub/mascot/bin";

  my $db_name = "hpylori_split";
  my $accession = "CP000241";
  my $filename = "$accession.fasta";

  # Step 1: download the file.
  chdir $work_dir or die "Cannot change to $work_dir: $!\n";
  download_from_EBI($accession, $filename);

  # Step 2: split it. Stop if splitter.pl reports a failure.
  chdir $mascot_bin_dir or die "Cannot change to $mascot_bin_dir: $!\n";
  system('../perl64/bin/perl', 'splitter.pl', join('/', $work_dir, $filename)) == 0
    or die "splitter.pl failed (exit status $?)\n";
  print "$accession has been split\n";

  # Step 3: schedule an update task.
  chdir $mascot_bin_dir or die "Cannot change to $mascot_bin_dir: $!\n";
  system('../perl64/bin/perl', 'dbman_add_task.pl', $db_name) == 0
    or die "dbman_add_task.pl failed (exit status $?)\n";
  print "Update task scheduled for $db_name\n";

Step 1 (download_from_EBI) downloads a file using the Perl module LWP::UserAgent. The step for acquiring the FASTA file isn’t important in this example and could be just about any procedure.
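
Since the download step could be anything, here is one possible shape for it, sketched in Python rather than Perl. This is not the download_from_EBI routine from the example script. The function name is hypothetical, and the URL is left as a parameter because the exact EBI address isn't spelled out here.

```python
import os
import shutil
import urllib.request


def download_fasta(url, dest_path):
    """Fetch `url` into `dest_path`, refusing anything that is not FASTA.

    The data is written to a temporary ".part" file first and renamed
    at the end, so an interrupted download never leaves a truncated
    file under the final name.
    """
    part_path = dest_path + ".part"
    with urllib.request.urlopen(url) as response:
        first = response.read(1)
        if first != b">":
            # A FASTA file starts with a ">" title line; an error page will not.
            raise ValueError(f"{url} did not return FASTA data")
        with open(part_path, "wb") as out:
            out.write(first)
            shutil.copyfileobj(response, out)
    os.replace(part_path, dest_path)
    return dest_path
```

Checking the first byte is a cheap guard against saving an HTML error page as if it were sequence data.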

In step 2, the script calls splitter.pl using Mascot’s Perl. This is the same as the manual step.

Step 3 asks Database Manager to check for new files for hpylori_split. When copying data from file URLs, Database Manager first checks whether the size and last-modified time of the source file differ from the last time. If they don’t, this fact is logged and the task ends. This is the same behaviour as with FTP and HTTP sources. You can use this to your advantage to avoid unnecessary database updates.
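
One way to take advantage of this, sketched in Python: only overwrite the file that Database Manager watches when its contents have really changed, so that its size and last-modified time stay put otherwise. The function name is hypothetical, and the sketch assumes you produce the new split file in a staging location before moving it into place.

```python
import filecmp
import os


def replace_if_different(new_path, target_path):
    """Move `new_path` over `target_path` only if their contents differ.

    Returns True if the target was replaced, False if it was left
    alone (in which case its timestamp is untouched and the freshly
    generated copy is discarded).
    """
    if os.path.exists(target_path) and filecmp.cmp(new_path, target_path, shallow=False):
        os.remove(new_path)
        return False
    # os.replace requires both paths to be on the same volume.
    os.replace(new_path, target_path)
    return True
```

If the function returns False, there is nothing new to pick up, and you could skip scheduling the update task altogether.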

An alternative to step 3 is dbman_download.pl hpylori_split. This script performs the heavy lifting and prints progress messages to standard output. Normally you should make use of Database Manager’s job queue, as above, to avoid potential conflicts between two update tasks running at the same time. But there may be cases where it’s important to block script execution until the database update has truly finished.
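
If you do need to block until the update has finished, a wrapper along these lines would run the script and relay its progress messages as they arrive. It is a generic Python sketch with a hypothetical function name; what exit codes dbman_download.pl returns is not covered here, so treat any interpretation of the return code as an assumption.

```python
import subprocess


def run_and_stream(command, cwd=None):
    """Run `command`, echo each output line as it arrives, and wait for it to end.

    Returns (exit_code, lines) so the caller can both log the progress
    messages and decide what to do next.
    """
    lines = []
    with subprocess.Popen(command, cwd=cwd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the "with" block waits for the process, so returncode is set.
    return proc.returncode, lines
```

The command would be the Mascot Perl followed by dbman_download.pl and the database name, with cwd set to the Mascot bin directory.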

There are a couple of final details to consider. The process calling dbman_add_task.pl must have permission to write to Mascot’s config\db_manager directory, as this is where the task queue is kept. And if you’re using Mascot 2.5, you should replace the Perl path (..\perl64\bin\perl) with perl.
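
A quick way to test the permission requirement is to try creating a temporary file in that directory, as in this small Python sketch (the function name is hypothetical). Run it under the same account that will run your scheduled script, otherwise the answer tells you nothing.

```python
import tempfile


def can_write_to(directory):
    """Return True if the current process can create a file in `directory`."""
    try:
        # The temporary file is removed automatically when it is closed.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as a permission problem.
        return False
```

For the installation in this article, the directory to check would be C:\inetpub\mascot\config\db_manager.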

Ideas and extensions

The above technique isn’t limited to genomes. Here are some ideas and extensions:

  • Add your script to the Cron section of mascot.dat; it’s an easy way to schedule the job. See chapter 6 of the Installation and Setup manual.
  • Suppose your Mascot server is not connected to the Internet. Write a script that copies new FASTA files to a local directory, then call Database Manager to update the corresponding database from it. File URLs support wildcards, in which case Database Manager will choose the newest file matching the wildcard.
  • Write a program to generate your own FASTA file from, say, a constantly updating MySQL database. Once done, call Database Manager to fetch the new file.
  • Write a program that randomises a spectral library to create a decoy library. Once done, call Database Manager to fetch the new file.
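
To illustrate the wildcard idea from the second bullet, here is a Python sketch that picks the newest file matching a pattern. It only mimics the described behaviour: it assumes "newest" means the latest last-modified time, which may not be exactly how Database Manager decides, and the function name is hypothetical.

```python
import glob
import os


def newest_match(pattern):
    """Return the most recently modified file matching `pattern`, or None."""
    matches = [path for path in glob.glob(pattern) if os.path.isfile(path)]
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

A script could use this to log which file it expects to be picked up before it asks Database Manager to update.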

If you have any questions, just e-mail us at support@matrixscience.com.
