
Garbage collection problems (advanced reading)

Perl
    #!/usr/local/bin/perl
    use strict;
    use msparser;

    sub get_params {
        my $resfile = msparser::ms_mascotresfile->new($_[0]);
        return $resfile->params;                 # PROBLEM HERE
    }

    my $params = get_params($ARGV[0]);
    print $params->getNumberOfDatabases, "\n";   # CRASH HERE

Java
    import matrix_science.msparser.*;

    public class example {
        static {
            try { 
                System.loadLibrary("msparserj");
            } catch (UnsatisfiedLinkError e) { 
                System.exit(0); 
            }
        }

        private static ms_searchparams get_params(String filename) {
            ms_mascotresfile resfile = new ms_mascotresfile(filename);
            return resfile.params();    // PROBLEM HERE
        }

        public static void main(String argv[]) {
            ms_searchparams params = get_params(argv[0]);
            System.gc();                // See below why these are needed
            System.runFinalization();   // to trigger the crash.
            System.out.println(params.getNumberOfDatabases()); // CRASH HERE
        }
    }

Python
    #!/usr/bin/python
    import msparser
    import sys

    def get_params(filename):
        resfile = msparser.ms_mascotresfile(filename)
        return resfile.params()           # PROBLEM HERE

    params = get_params(sys.argv[1])
    print(params.getNumberOfDatabases())   # CRASH HERE

C#
    using System;
    using matrix_science.msparser;
    class GarbageCollectionExample
    {
        private static ms_peptidesummary loadPeptideSummary(string filename)
        {
            ms_mascotresfile resfile = new ms_mascotresfile(filename);
            ms_datfile datfile = new ms_datfile("../config/mascot.dat");    

            ms_mascotoptions opts = new ms_mascotoptions();

            uint flags, flags2, minpeplen;
            int maxhits;
            double minprob, iisb;
            bool usePepsum;
            resfile.get_ms_mascotresults_params(opts, out flags, out minprob, 
                out maxhits, out iisb, out minpeplen, out usePepsum, out flags2);

            return new ms_peptidesummary(resfile, flags, minprob, 
                maxhits, "", iisb, (int)minpeplen, "", flags2); // PROBLEM HERE            
        }
    
        public static void Main(string[] argv)
        {
            ms_peptidesummary pepsum = loadPeptideSummary(argv[0]);
            for (int i = 1; i <= pepsum.getNumberOfHits(); i++)
            {
                ms_protein hit = pepsum.getHit(i);
                for (int e = 1; e <= hit.getNumPeptides(); e++)
                {
                    int q = hit.getPeptideQuery(e), p = hit.getPeptideP(e);
                    ms_peptide peptide = pepsum.getPeptide(q, p);
                    Console.WriteLine(peptide.getPeptideStr());   // CRASH HERE
                }
            }
        }
        
    }

Why do the example programs crash?

The programs crash because the resfile object is deallocated from memory too early.

The resfile object is lexically scoped to the get_params() function (loadPeptideSummary() in the C# example). When that function returns, resfile becomes unreachable and is eligible for garbage collection. In Perl and Python, which use reference counting, the object is freed as soon as the function returns. In Java and C#, it is freed at some arbitrary later point during execution, although in Java you can request a collection at any time by calling System.gc(); System.runFinalization(); (as the example program does to illustrate the problem).

By itself, this is not a problem at all. The problem becomes clear when you look at the declaration of matrix_science::ms_mascotresfile::params(). Mascot Parser uses SWIG (Simplified Wrapper and Interface Generator, http://swig.org/) to generate the mappings between C++ and the target language. The params() method returns a C++ reference to an ms_searchparams object, not a copy, which the SWIG layer helpfully wraps into a native class object, thus masking the real nature of the return value.

So when the resfile object is freed, all of its internal data is deallocated with it, including the ms_searchparams object. But we still hold a wrapper object that points to that internal ms_searchparams object! It now points to an arbitrary chunk of memory that no longer contains a valid object. If you then call its methods, the program will most likely crash with a segmentation fault. The C# example fails in the same way: the ms_peptidesummary object continues to rely on the resfile that was passed to its constructor.
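The mechanism can be reproduced without Mascot Parser. The following is a minimal sketch in pure Python, not the library's implementation: FakeResfile and FakeParams are hypothetical stand-ins, and the __del__ method only imitates a C++ destructor releasing internal data. Where the real program would read freed memory and crash, the stand-in raises an exception instead.

```python
import gc


class FakeParams:
    """Hypothetical stand-in for a wrapper around memory owned by its parent."""

    def __init__(self):
        self.valid = True

    def getNumberOfDatabases(self):
        if not self.valid:
            # The real library would read freed memory here and likely segfault.
            raise RuntimeError("parent object has already been freed")
        return 1


class FakeResfile:
    """Hypothetical stand-in for a parent object that owns its FakeParams."""

    def __init__(self, filename):
        self.filename = filename
        self._params = FakeParams()

    def params(self):
        # Returns the internal object itself, not a copy.
        return self._params

    def __del__(self):
        # Imitates the C++ destructor releasing the internal data.
        self._params.valid = False


def get_params(filename):
    resfile = FakeResfile(filename)
    return resfile.params()  # resfile becomes unreachable on return


def params_outlived_parent(filename):
    """Return True if the child is unusable after its parent has gone."""
    params = get_params(filename)
    gc.collect()  # not required in CPython, where refcounting frees resfile at once
    try:
        params.getNumberOfDatabases()
    except RuntimeError:
        return True
    return False
```

Calling params_outlived_parent("example.dat") exercises the same sequence as the example programs: the child object is still reachable, but the data it refers to went away with its parent.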

Why is this not handled in SWIG? Because the SWIG layer cannot detect that the ms_searchparams object depends on its parent ms_mascotresfile object; that relationship is hidden inside the C++ implementation. A general solution would need to track every pointer and reference assignment in the C++ code, and then increase and decrease the reference count of each wrapped object in the runtime environment of whichever programming language you are using. This is very difficult to implement correctly.

Note that the problem in Java and C# is even more subtle than in Perl and Python. If you remove the System.gc() and System.runFinalization() lines from the Java example, the program may run correctly hundreds of times. Then, on one run, memory pressure or some similar condition may cause the JVM or .NET CLR garbage collector to run just before params.getNumberOfDatabases() is called, and the program that has worked hundreds of times before crashes for no apparent reason. This is even harder to detect and debug in a large program.

Luckily there is a safe fix: follow the Two rules of thumb when writing programs using Mascot Parser.
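As a hedged illustration of the general idea of keeping a parent object alive for as long as its children are in use (not necessarily the exact wording of the rules referenced above), the sketch below bundles parent and child in one holder object. LoadedResults and load_results are hypothetical names, and FakeResfile and FakeParams are stand-ins so that the block runs without Mascot Parser. With the real library, the assumption is that you would pass in an msparser.ms_mascotresfile object instead.

```python
class FakeParams:
    """Hypothetical stand-in for a child wrapper, valid only while its parent lives."""

    def __init__(self):
        self.valid = True

    def getNumberOfDatabases(self):
        if not self.valid:
            raise RuntimeError("parent object has already been freed")
        return 1


class FakeResfile:
    """Hypothetical stand-in for the parent that owns FakeParams."""

    def __init__(self, filename):
        self.filename = filename
        self._params = FakeParams()

    def params(self):
        return self._params

    def __del__(self):
        # Imitates the C++ destructor invalidating the internal data.
        self._params.valid = False


class LoadedResults:
    """Holds the parent and the child together so they share one lifetime."""

    def __init__(self, resfile):
        self.resfile = resfile          # strong reference keeps the parent alive
        self.params = resfile.params()  # child stays valid while self.resfile exists


def load_results(filename, resfile_factory=FakeResfile):
    # Return the holder, never the bare child object.
    return LoadedResults(resfile_factory(filename))


loaded = load_results("example.dat")
print(loaded.params.getNumberOfDatabases())
```

The holder itself must stay in scope: writing load_results("example.dat").params and discarding the holder would reintroduce the original problem.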

Copyright © 2016 Matrix Science Ltd.  All Rights Reserved. Generated on Fri Jun 2 2017 01:44:51