Posted by David Creasy (April 15, 2015)

Integrating Mascot into a proteomics pipeline (Part 2)

Parsing the results from a Mascot search

This is the second part of this blog entry.

When a Mascot search is run, the results are stored in a MIME format text file, normally in the mascot/data/YYYYMMDD directory. If the search was a peptide mass fingerprint or an MS-MS search of just a few spectra, the file will contain protein and peptide matches. For most MS-MS searches, however, protein inference information is not saved in the file.

We strongly recommend that you use Mascot Parser to read and process the results files. Mascot Parser can be used free of charge in most cases, and can be accessed from programs written in Perl, Java, Python or C++. We use it extensively in Mascot Server, Mascot Distiller and Mascot Insight, so you can be confident that it is well tested and that development will continue. Extensive documentation is available, either as a Windows Help file or as HTML pages included with the download package. The latest help for the current version is also available online.

The example below shows how easy it is to open a results file and obtain a list of proteins. The code is written in Perl, but the function calls will be the same whatever language is used.

For a Perl program, we must ‘use’ the msparser module:

use strict;

use msparser;

When we call the script from the command line, we will pass the name of the results file as a parameter:

if (!defined($ARGV[0])) { die "Must specify results filename as parameter"; } 

Next, we need to create a ‘ms_mascotresfile’ object.

my $resfile = new msparser::ms_mascotresfile($ARGV[0]);

Mascot Parser can use cache files to speed up repeated access to the results file. To enable this, specify the appropriate flags and a cache file directory in the constructor. For simplicity, I’ve not done that in this example.
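For readers who do want caching, here is a minimal sketch of what the alternative constructor call might look like. The helper name open_cached_resfile and the cache directory are hypothetical, and the argument order (file name, keep-alive interval, keep-alive text, flags, cache directory) and the RESFILE_USE_CACHE flag reflect my reading of the Parser documentation, so please check them against the help for your version:

```perl
use strict;

# Hypothetical helper: open a results file with caching switched on.
# Assumed constructor argument order (verify in the Parser help):
#   file name, keep-alive interval, keep-alive text, flags, cache directory
sub open_cached_resfile {
    my ($filename, $cache_dir) = @_;
    require msparser;    # does nothing if the module is already loaded
    return msparser::ms_mascotresfile->new(
        $filename,
        0,     # keep-alive interval in seconds; 0 disables keep-alive output
        '',    # keep-alive text; unused when the interval is 0
        $msparser::ms_mascotresfile::RESFILE_USE_CACHE,
        $cache_dir,
    );
}
```

It would be called as my $resfile = open_cached_resfile($ARGV[0], $cacheDir); with $cacheDir pointing at a directory the script can write to, and the isValid check that follows applies unchanged.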

Next, we need to check that there wasn’t a problem with the file:

if (!$resfile->isValid) {
    print STDERR "Cannot process file '$ARGV[0]' : ", $resfile->getLastErrorString(), "\n";
    exit 1;
}

We are going to process PMF searches and MS-MS searches with almost identical code! To do this, we will create an ‘ms_mascotresults’ object, but using either the ‘ms_peptidesummary’ or the ‘ms_proteinsummary’ class as appropriate. Once we have created the results object, the remaining code can normally be almost identical. There are a number of values that need to be passed to these constructors to specify, for example, significance thresholds and the protein inference algorithm. The easiest way to get default values for these parameters is to call the ‘ms_mascotresfile::get_ms_mascotresults_params’ function. This function uses an ‘ms_mascotoptions’ object which would normally be populated from the Options section of your mascot.dat file. In this example, I’ll just use the default parameters:

my $results;
my $options = new msparser::ms_mascotoptions;
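my ($scriptName, $flags, $minProbability, $maxHitsToReport,
    $ignoreIonsScoreBelow, $minPepLenInPepSummary,
    $usePeptideSummary, $flags2) = $resfile->get_ms_mascotresults_params($options);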

We now use the appropriate constructor:

if ($usePeptideSummary) {
  $results = new msparser::ms_peptidesummary($resfile, $flags, $minProbability, $maxHitsToReport, "", 
                                             $ignoreIonsScoreBelow, $minPepLenInPepSummary, "", $flags2);
} else {
  $results = new msparser::ms_proteinsummary($resfile, $flags, $minProbability, $maxHitsToReport);
}

It’s now best to check again for any errors:

if (!$resfile->isValid) {
    print STDERR "Cannot process file '$ARGV[0]' : ", $resfile->getLastErrorString(), "\n";
    exit 1;
}

We are now going to iterate through all the proteins. In this example, the protein matches are just printed to the console, but it would be straightforward to write this data to a SQL database or pass it as input to another piece of software. You may also want to extract a list of matching peptides, but I’ve not added code for this to keep the example simple:

my $hit  = 1;
my $prot = $results->getHit($hit);

while (defined($prot)) {
    my $accession   = $prot->getAccession();
    my $description = $results->getProteinDescription($accession);
    my $score       = int($prot->getScore());
    my $mass        = $results->getProteinMass($accession);
    print $hit . ":" . $accession . " " . $score . " " . $description . "\n";
    $hit++;
    $prot = $results->getHit($hit);
}
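If the hits are destined for another program rather than the console, the print statement in the loop could be swapped for something like the sketch below. The helper name hit_to_tsv and the choice of a tab-separated layout are my own assumptions for illustration, not part of Mascot Parser:

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated line from a list of fields.
# Tabs and line breaks inside a field (protein descriptions can contain
# almost anything) are replaced by a space so that each record stays on a
# single line with a fixed number of columns. Undefined fields become empty.
sub hit_to_tsv {
    my @fields = @_;
    for my $field (@fields) {
        $field = '' unless defined $field;
        $field =~ s/[\t\r\n]+/ /g;
    }
    return join("\t", @fields) . "\n";
}
```

Inside the loop this would be used as print hit_to_tsv($hit, $accession, $score, $mass, $description); and the resulting file can be bulk-loaded by most databases or read by other tools.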

The protein loop above will work with all types of searches, even Error Tolerant searches and cases where Percolator has been used.
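The peptide list was left out of the example to keep it short. As a rough sketch of how it could be collected for each protein hit, the helper below (the name peptides_for_hit and the hash layout are hypothetical) walks the peptides assigned to a protein using getNumPeptides, getPeptideQuery and getPeptideP, then looks each one up with getPeptide on the results object. It assumes it is called from a script that has already created $results and $prot as shown earlier:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the peptide matches assigned to one protein.
#   $results : the ms_peptidesummary or ms_proteinsummary object
#   $prot    : the protein object returned by $results->getHit($hit)
# Returns a list of hash references, one per matched peptide.
sub peptides_for_hit {
    my ($results, $prot) = @_;
    my @rows;
    for my $i (1 .. $prot->getNumPeptides()) {
        my $query = $prot->getPeptideQuery($i);    # query (spectrum) number
        my $rank  = $prot->getPeptideP($i);        # rank of the match in that query
        next if $query == -1 || $rank == -1;       # slot with no peptide assigned
        my $pep = $results->getPeptide($query, $rank);
        next unless defined($pep) && $pep->getAnyMatch();
        push @rows, {
            query    => $query,
            rank     => $rank,
            sequence => $pep->getPeptideStr(),
            score    => $pep->getIonsScore(),
        };
    }
    return @rows;
}
```

Calling my @peptides = peptides_for_hit($results, $prot); inside the protein loop gives one entry per peptide, which can then be printed or stored alongside the protein accession.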
