Automated repeating of searches
[Mascot results file module]

A common requirement is to repeat all searches with, for example, a different or updated database. The standard Mascot Server reports provide a "Search Selected" or "Re-search all" button to repeat a search, but doing this manually is tedious for more than a few searches. Mascot Daemon also allows a search to be repeated as a follow-on task, using the original peak lists.

Repeating searches in a batch can easily be achieved using Mascot Parser. It is assumed you have access to a Mascot Server installation.

Overview

To repeat a search, nph-mascot.exe has to be run with command parameter 4 and the search data needs to be taken from a MIME format input file. The format of the input file is described in the Mascot Installation and Setup manual, chapter 8.

MIME Format File

In its simplest form, the MIME format file has the following structure:

    ----12345
    Content-Disposition: form-data; name="QUE"

    MASS=Monoisotopic
    CLE=Trypsin
    ...

    1234.012
    1567.086
    ----12345--

where MASS, CLE, etc. are the search parameters, and the two numbers are simply two masses for a peptide mass fingerprint search. Note the standard MIME format header and terminating line.
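
As a rough illustration of the structure above, the following sketch assembles such a file as a string. The function name build_mime_query and its arguments are hypothetical and not part of Mascot Parser; the boundary "--12345" is chosen only so that the delimiter lines match the listing.

```python
def build_mime_query(params, masses, boundary="--12345"):
    """Return a MIME-format search input as a single string.

    params   -- dict of search parameters, e.g. {"MASS": "Monoisotopic"}
    masses   -- iterable of peptide masses for a peptide mass fingerprint search
    boundary -- MIME boundary (hypothetical default mirroring the listing)
    """
    # A MIME delimiter line is "--" followed by the boundary string.
    lines = ["--" + boundary, 'Content-Disposition: form-data; name="QUE"', ""]
    # One KEY=VALUE line per search parameter.
    lines += ["%s=%s" % (key, value) for key, value in params.items()]
    # A blank line separates the parameters from the mass list.
    lines.append("")
    lines += [str(mass) for mass in masses]
    # The terminating line is the delimiter followed by a trailing "--".
    lines.append("--" + boundary + "--")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_mime_query({"MASS": "Monoisotopic", "CLE": "Trypsin"},
                           [1234.012, 1567.086]), end="")
```

A real input file carries the full set of search parameters; only two are shown here.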

You can access the search parameters easily in Mascot Parser by using the params() method of the ms_mascotresfile object. Another way is to iterate through the keys using enumerateSectionKeys(), which avoids having to type all parameter names directly.

The following example prints all search parameters except DB, which is changed to another database. INTERMEDIATE and RULES are also skipped for reasons explained below.

C++
    ms_mascotresfile resfile = ms_mascotresfile(filename);

    if (!resfile.isValid()) {
        /* Error handling... */
    }

    int i = 1;
    std::string key = resfile.enumerateSectionKeys(ms_mascotresfile::SEC_PARAMETERS, i);

    while (!key.empty()) {
        std::string val = resfile.getSectionValueStr(
            ms_mascotresfile::SEC_PARAMETERS, key.c_str()
        );

        if (!val.empty() && key != "INTERMEDIATE" && key != "RULES" && key != "DB")
            std::cout << key << "=" << val << std::endl;

        key = resfile.enumerateSectionKeys(ms_mascotresfile::SEC_PARAMETERS, ++i);
    }

    std::cout << "DB=My_database" << std::endl;

Perl
    my $resfile = msparser::ms_mascotresfile->new($filename);

    if (not $resfile->isValid()) {
        # Error handling... 
    }

    my $sec_params = $msparser::ms_mascotresfile::SEC_PARAMETERS;
    my $i = 1;
    my $key = $resfile->enumerateSectionKeys($sec_params, $i);

    while ($key ne '') {
        my $val = $resfile->getSectionValueStr($sec_params, $key);

        if ($val ne '' and $key ne "INTERMEDIATE" and $key ne "RULES" and $key ne "DB") {
            print $key, "=", $val, "\n";
        }

        $key = $resfile->enumerateSectionKeys($sec_params, ++$i);
    }

    print "DB=My_database\n";

Java
    ms_mascotresfile resfile = new ms_mascotresfile(filename);

    if (!resfile.isValid()) {
        /* Error handling... */
    }

    int i = 1;
    String key = resfile.enumerateSectionKeys(ms_mascotresfile.SEC_PARAMETERS, i);

    while (!key.isEmpty()) {
        String val = resfile.getSectionValueStr(ms_mascotresfile.SEC_PARAMETERS, key);

        // Compare string contents with equals(); != only compares references.
        if (!val.isEmpty() && !key.equals("INTERMEDIATE") && !key.equals("RULES") && !key.equals("DB"))
            System.out.println(key + "=" + val);

        key = resfile.enumerateSectionKeys(ms_mascotresfile.SEC_PARAMETERS, ++i);
    }

    System.out.println("DB=My_database");

Python
    resfile = msparser.ms_mascotresfile(filename)

    if not resfile.isValid() :
        # Error handling...
        pass

    sec_params = msparser.ms_mascotresfile.SEC_PARAMETERS
    i = 1
    key = resfile.enumerateSectionKeys(sec_params, i)

    while len(key) > 0 :
        val = resfile.getSectionValueStr(sec_params, key)

        if len(val) > 0 and key != "INTERMEDIATE" and key != "RULES" and key != "DB" :
            print("%s=%s" % (key, val))

        i += 1
        key = resfile.enumerateSectionKeys(sec_params, i)

    print("DB=My_database")

C#
    ms_mascotresfile resfile = new ms_mascotresfile(filename);

    if (!resfile.isValid()) {
        /* Error handling... */
    }

    int i = 1;
    string key = resfile.enumerateSectionKeys(ms_mascotresfile.section.SEC_PARAMETERS, i);

    while (key != "") {
        string val = resfile.getSectionValueStr(ms_mascotresfile.section.SEC_PARAMETERS, key);

        if (val != "" && key != "INTERMEDIATE" && key != "RULES" && key != "DB")
            Console.WriteLine("{0}={1}", key, val);

        key = resfile.enumerateSectionKeys(ms_mascotresfile.section.SEC_PARAMETERS, ++i);
    }

    Console.WriteLine("DB=My_database");

For MS-MS data, the complete set of ion peaks could be megabytes or even gigabytes of data, and it may make no sense to copy the data into the repeat search file. nph-mascot.exe supports a "query" statement for this purpose, which is returned by getRepeatSearchString():

C++
    for (int q = 1; q <= resfile.getNumQueries(); q++)
        std::cout << resfile.getRepeatSearchString(q) << std::endl;

Perl
    for my $q (1 .. $resfile->getNumQueries()) {
        print $resfile->getRepeatSearchString($q), "\n";
    }

Java
    for (int q = 1; q <= resfile.getNumQueries(); q++)
        System.out.println(resfile.getRepeatSearchString(q));

Python
    for q in range(1, 1 + resfile.getNumQueries()) :
        print(resfile.getRepeatSearchString(q))

C#
    for (int q = 1; q <= resfile.getNumQueries(); q++)
        Console.WriteLine(resfile.getRepeatSearchString(q));

When nph-mascot.exe reads in a query statement, it loads the original input data from the Mascot results (.dat) file. To this end, you also need to add the INTERMEDIATE parameter that points to the original results file:

C++
    std::cout << "INTERMEDIATE=" << filename << std::endl;

Perl
    print "INTERMEDIATE=", $filename, "\n";

Java
    System.out.println("INTERMEDIATE=" + filename);

Python
    print("INTERMEDIATE=%s" % filename)

C#
    Console.WriteLine("INTERMEDIATE={0}", filename);

Running the repeat search

In the example code, the search is run by calling nph-mascot.exe with two parameters, and the repeat search data is piped into the process's standard input:

    nph-mascot.exe 4 -commandline > tmp.txt

The 4 indicates that this is a repeat search. The -commandline parameter is used to prevent progress reports and HTML being written to standard out.
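
A minimal sketch of this step in Python, using the standard subprocess module, might look as follows. The function name run_repeat_search and the default path "./nph-mascot.exe" are assumptions for illustration; adjust the path and working directory to match your Mascot Server installation.

```python
import subprocess


def run_repeat_search(mime_text, command=("./nph-mascot.exe",), cwd=None):
    """Pipe MIME-format search data into nph-mascot.exe and return its output.

    command -- executable (plus any leading arguments); the default path is
               an assumption and depends on the installation
    cwd     -- working directory to run the search in, if different
    """
    completed = subprocess.run(
        list(command) + ["4", "-commandline"],  # 4 = repeat search
        input=mime_text,                        # fed to standard input
        stdout=subprocess.PIPE,                 # captured instead of tmp.txt
        text=True,
        cwd=cwd,
        check=False,
    )
    return completed.stdout
```

Capturing standard output directly avoids the temporary file; the returned text has the same content that would otherwise be written to tmp.txt.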

For a successful search, the output (e.g. tmp.txt) will be of the form:

    SUCCESS
    ../data/20031007/F001547.dat

where the data file name is output following the text "SUCCESS".

If an error occurs, the output will be of the form:

    FATAL_ERROR: M00027
    Sorry, the database (SwissProt) is not currently available for searching [M00027]
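
The two output forms above can be told apart with a few lines of code. The following is a sketch only: the function name parse_search_output and the shape of the returned dictionary are hypothetical, and it assumes the output contains nothing beyond the lines shown.

```python
def parse_search_output(output):
    """Classify nph-mascot.exe output as success or failure.

    Returns {"ok": True, "dat_file": path} on success, otherwise
    {"ok": False, "code": error_code_or_None, "message": text}.
    """
    # Ignore blank lines and surrounding whitespace.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if len(lines) >= 2 and lines[0] == "SUCCESS":
        # The results file name follows the SUCCESS line.
        return {"ok": True, "dat_file": lines[1]}
    if lines and lines[0].startswith("FATAL_ERROR:"):
        code = lines[0].split(":", 1)[1].strip()
        return {"ok": False, "code": code, "message": " ".join(lines[1:])}
    # Anything else is treated as an unrecognised failure.
    return {"ok": False, "code": None, "message": " ".join(lines)}
```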

Comparing the results

The example code uses a simple approach to compare the old and new results. For searches with PMF data, the top protein hits from both searches are compared, and if their scores differ by more than 10, the difference is reported.

For searches with MS-MS data, peptide matches are compared, and a score difference of more than 10 will be reported.

It is more than likely that these comparison rules will need to be changed for optimum results.

Remember that if the searches were performed with a much older version of Mascot Server, then the scores may have changed a little because minor changes have been made to Mascot to optimize the scoring. However, a score difference of greater than 10 is likely to indicate a new protein sequence in the database.
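
The threshold rule described above can be sketched independently of how the scores are obtained. In this illustration the scores are assumed to have already been extracted from the old and new results files into dictionaries keyed by query number (or by a single key for the top protein hit of a PMF search); the function name report_score_changes is hypothetical and this is not the code used by the example programs.

```python
def report_score_changes(old_scores, new_scores, threshold=10.0):
    """Return (key, old, new) for each shared key whose score changed
    by more than threshold."""
    changes = []
    # Only keys present in both result sets can be compared.
    for key in sorted(old_scores.keys() & new_scores.keys()):
        old, new = old_scores[key], new_scores[key]
        if abs(new - old) > threshold:
            changes.append((key, old, new))
    return changes


if __name__ == "__main__":
    old = {1: 45.0, 2: 30.0, 3: 12.5}
    new = {1: 46.0, 2: 55.5, 3: 1.0}
    for key, before, after in report_score_changes(old, new):
        print("query %s: %.1f -> %.1f" % (key, before, after))
```

Keeping the threshold as a parameter makes it easy to loosen or tighten the rule when comparing results from different Mascot Server versions.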

Example Code

Example code is provided in various programming languages.

The sample program takes a single input file, repeats the search and compares results as above. To run the program on a whole directory of files, use 'find' under Unix, or a FOR loop in a batch/cmd file under Windows. For example, to repeat all searches from the year 2002 under Unix:

    # find ../data/2002???? -name \*.dat | xargs -n 1 repeat_search.pl

Remember that new results will go in today's directory, so be sure not to include that directory with 'find' or the repeated searches will be repeated again and again and again...
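
As an alternative to 'find', the file selection can be sketched in Python so that today's directory is skipped explicitly. This assumes the data directory contains one subdirectory per day named YYYYMMDD, as in the paths shown earlier; the function name find_result_files is hypothetical.

```python
import datetime
from pathlib import Path


def find_result_files(data_root, year, today=None):
    """Return the .dat files for one year, excluding today's directory."""
    today = today or datetime.date.today()
    skip = today.strftime("%Y%m%d")  # new results are written here
    files = []
    # Match day directories such as 20020105 for the requested year.
    for day_dir in sorted(Path(data_root).glob("%d????" % year)):
        if day_dir.is_dir() and day_dir.name != skip:
            files.extend(sorted(day_dir.rglob("*.dat")))
    return files


if __name__ == "__main__":
    for path in find_result_files("../data", 2002):
        print(path)
```

Each returned path can then be passed to the repeat search program in turn.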


Copyright © 2022 Matrix Science Ltd.  All Rights Reserved. Generated on Thu Mar 31 2022 01:12:30