
Using the toolkit from C++

Memory management conventions

By default, variables are passed by value in C++. An ms_mascotresfile object can be as large as the data file on disk (even gigabytes in size), so it is good practice to pass a reference or a pointer to such an object instead of invoking its copy constructor.

Similarly, if one has an ms_mascotresfile object and wants to view a portion of it, say a protein hit entry with ms_mascotresults::getHit(), it would be possible to return a copy of that subset of the data file to the caller. For performance reasons, however, it is often more desirable to simply return a reference or a pointer to an object internal to the ms_mascotresfile object.

This creates a subtle problem: hidden parent-child relationships between objects that are not obvious from the API. If object P returns a reference or a pointer to object C, and P is deallocated, is it still safe to use C? Usually, in Mascot Parser, the answer is no. The reason is that C often holds an internal reference to the parent object P, and once P is deallocated that reference points to some arbitrary chunk of memory. Calling methods on C, if they happen to reference P, will then typically cause a segmentation fault.
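The hidden relationship can be illustrated with a minimal sketch. The classes Parent and Child below are hypothetical stand-ins written for this example, not Mascot Parser classes, and the back-pointer layout is an assumption about how such a dependency might look rather than a description of Parser internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace lifetime_sketch {

class Parent;

// Hypothetical child object: it stores no data of its own, only a
// back-pointer to the parent that created it.
class Child {
public:
    Child(const Parent* owner, std::size_t index) : owner_(owner), index_(index) {}
    const std::string& name() const;  // reads through owner_
private:
    const Parent* owner_;
    std::size_t index_;
};

// Hypothetical parent object that owns both the data and the children.
class Parent {
public:
    explicit Parent(std::vector<std::string> names) : names_(std::move(names)) {
        for (std::size_t i = 0; i < names_.size(); ++i)
            children_.push_back(Child(this, i));
    }
    Parent(const Parent&) = delete;             // a copy would leave its children
    Parent& operator=(const Parent&) = delete;  // pointing at the original

    // The returned pointer refers to an object owned by Parent: the caller
    // must not delete it and must not use it after the Parent is gone.
    const Child* child(std::size_t i) const {
        assert(i < children_.size());
        return &children_[i];
    }
    const std::string& name_at(std::size_t i) const { return names_.at(i); }
private:
    std::vector<std::string> names_;
    std::vector<Child> children_;
};

inline const std::string& Child::name() const { return owner_->name_at(index_); }

// Safe pattern: use the child, and copy out what is needed, while the
// parent is still alive.
inline std::string first_child_name() {
    std::vector<std::string> names;
    names.push_back("hit 1");
    names.push_back("hit 2");
    Parent p(std::move(names));
    const Child* c = p.child(0);
    return c->name();  // the copy is made before p is destroyed
}

}  // namespace lifetime_sketch
```

In this sketch, returning the Child pointer itself from first_child_name() would hand the caller a pointer into a destroyed Parent, which is the situation described above.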

Two rules of thumb

The rules of thumb elaborated here serve as general guidelines: follow them and you will not introduce subtle bugs that may lead to crashes. In some cases it is possible to break the rule and survive, but this usually requires knowledge of Parser internals. (In other words, don't do it.)

There are two rules of thumb. In both of them, it is assumed both A and B are in the matrix_science namespace, i.e. classes defined in Mascot Parser.

  1. Methods returning a reference or a pointer:

        A* B::func();
        A& B::func();
    

    You must keep both the returned object and the B object it came from in memory as long as you use either. Parser owns the returned object, which means you must not deallocate it.

  2. Methods or constructors taking a reference or a pointer:

        B::B(A *a);
        B::B(A &a);
    
        B::func(A *a);
        B::func(A &a);
    

    Ownership does not change: if you owned a previously, you still do. However, you must keep both objects in memory as long as you use either.

Beware of implicit deallocation when you allocate things on the stack. The following code will segfault when function print_db_name() is called:

    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include "msparser.hpp"

    using namespace matrix_science;

    ms_searchparams& get_parameters(const std::string& filename) {
        ms_mascotresfile resfile(filename.c_str());

        if (!resfile.isValid())
            throw std::runtime_error(resfile.getLastErrorString());

        return resfile.params(); // PROBLEM HERE: reference into a local object
    }

    void print_db_name(const std::string& filename) {
        ms_searchparams& params = get_parameters(filename);
        std::cout << params.getDB() << std::endl; // CRASH HERE: resfile is gone
    }

The resfile object gets deallocated at the end of get_parameters(), as it is allocated on the stack. This means the object referenced by params in print_db_name() will also be deallocated (because Parser owns the ms_searchparams object) and havoc ensues.
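Two possible ways of restructuring such code are sketched below. To keep the sketch self-contained, results_file and search_params are hypothetical stand-ins that only mimic the shape of the calls used in the example (isValid(), params(), getDB()); they are not the Parser classes, and the "DB for ..." value is invented for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

namespace fix_sketch {

// Stand-in for the child object owned by the results file.
class search_params {
public:
    explicit search_params(std::string db) : db_(std::move(db)) {}
    std::string getDB() const { return db_; }
private:
    std::string db_;
};

// Stand-in for the parent object; an empty filename counts as invalid.
class results_file {
public:
    explicit results_file(const std::string& filename)
        : valid_(!filename.empty()), params_("DB for " + filename) {}
    bool isValid() const { return valid_; }
    // Returns a reference to an object owned by results_file.
    search_params& params() { return params_; }
private:
    bool valid_;
    search_params params_;
};

// Option 1: copy the value out before the parent goes out of scope.
inline std::string get_db_name(const std::string& filename) {
    results_file resfile(filename);
    if (!resfile.isValid())
        throw std::runtime_error("cannot open " + filename);
    return resfile.params().getDB();  // std::string returned by value
}

// Option 2: the caller owns the parent, so the returned reference stays
// valid for as long as the caller keeps resfile alive.
inline search_params& get_parameters(results_file& resfile) {
    if (!resfile.isValid())
        throw std::runtime_error("invalid results file");
    return resfile.params();
}

inline void usage_example() {
    results_file resfile("example.dat");              // outlives params below
    search_params& params = get_parameters(resfile);
    assert(params.getDB() == "DB for example.dat");
    assert(get_db_name("example.dat") == params.getDB());
}

}  // namespace fix_sketch
```

Option 1 suits cases where only a few values are needed; option 2 keeps the reference-returning style but makes the lifetime dependency visible in the function signature.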

Examples of the rules of thumb

    ms_mascotresfile resfile(filename);
    ms_searchparams& params = resfile.params();
    ms_inputquery q1(resfile, 1);

Rule of thumb 1: resfile must be kept in memory for as long as you use params.

Rule of thumb 2: resfile must be kept in memory for as long as you use q1.

    ms_mascotresfile resfile(filename);
    ms_peptidesummary pepsum(resfile);
    ms_protein* hit = pepsum.getHit(1);

Rule of thumb 1: pepsum must be kept in memory for as long as you use hit. Do not deallocate hit; it belongs to Parser.

Rule of thumb 2: resfile must be kept in memory for as long as you use pepsum.

    ms_datfile datfile;  // not "datfile()", which would declare a function
    const ms_databases* dbs = datfile.getDatabases();

Rule of thumb 1: datfile must be kept in memory for as long as you use dbs. Do not deallocate dbs; it is owned by Parser.

    ms_quant_configfile qf;  // not "qf()", which would declare a function
    bool success = resfile.getQuantitation(&qf);

Rule of thumb 2: resfile must be kept in memory for as long as you use qf.
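One way to make these lifetime requirements hard to violate is to bundle dependent objects in a single holder. The sketch below relies on the C++ rule that class members are destroyed in reverse order of declaration. The types results_file, peptide_summary and search_session are hypothetical stand-ins for this example, and the destruction log exists only so that the order can be checked.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

namespace order_sketch {

// Records destruction order so that the example can verify it.
inline std::vector<std::string>& destruction_log() {
    static std::vector<std::string> log;
    return log;
}

// Stand-in for the parent object.
struct results_file {
    ~results_file() { destruction_log().push_back("results_file"); }
};

// Stand-in for an object that keeps a pointer to its parent.
struct peptide_summary {
    explicit peptide_summary(results_file& parent) : parent_(&parent) {}
    ~peptide_summary() { destruction_log().push_back("peptide_summary"); }
    results_file* parent_;
};

// Members are initialised in declaration order and destroyed in reverse,
// so the summary (declared last) is destroyed before the results file it
// refers to. Heap allocation keeps the addresses stable if the session
// itself is moved.
class search_session {
public:
    search_session()
        : resfile_(new results_file),
          pepsum_(new peptide_summary(*resfile_)) {}
private:
    std::unique_ptr<results_file> resfile_;
    std::unique_ptr<peptide_summary> pepsum_;
};

inline void usage_example() {
    destruction_log().clear();
    {
        search_session session;  // both objects live exactly as long as session
    }
    assert(destruction_log().size() == 2);
    assert(destruction_log()[0] == "peptide_summary");
    assert(destruction_log()[1] == "results_file");
}

}  // namespace order_sketch
```

Swapping the two member declarations would reverse the destruction order, so the declaration order is part of the design.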

Using Apache Xerces and Parser in the same application

Parser uses the XML parser library Apache Xerces internally to read and write XML configuration files. Xerces symbols are in the xercesc namespace, which is an alias of the version-specific namespace. For example, the Xerces 3.1 namespace is xercesc_3_1.

If your application also links against Xerces, special care may need to be taken depending on the platform (Windows or Linux) and whether you're linking against the dynamic or static version of Parser. If your application does not use Xerces, you can ignore the rest of this section.

Origin of the problem

Xerces mandates that the global initialisation function xercesc::XMLPlatformUtils::Initialize() is called before any other Xerces method call, and that the global termination function xercesc::XMLPlatformUtils::Terminate() is called after the last Xerces method call. Normally these calls are made at the start and end of main(), or in the constructor and destructor of a global application class.

Parser calls both Initialize() and Terminate() internally. The Initialize() function takes as parameters pointers to an error handler and a memory manager, which are assigned to variables shared by Xerces methods inside the library.

If your application links against both Parser and Xerces, it is possible (see below) that the application ends up with only one copy of Xerces. The linker may silently choose one or the other copy of Xerces depending on linking order and how the static or shared Parser was constructed. This means that if you also call xercesc::XMLPlatformUtils::Initialize() (with or without your own error handler), it can corrupt the internal Xerces pointers. You will either receive no error messages or the application will crash while using Xerces or Parser methods.

Xerces and statically linked Parser

The statically linked Parser calls Initialize() and Terminate() as needed. The calls are balanced exactly, so that when a Parser class or method uses Xerces, Initialize() is called right before and Terminate() right after the Xerces calls.

When you link statically against both Parser and Xerces, the linker on both Windows and Linux will include only one copy of Xerces in the final application. This is fine as long as you are very careful about when and how you use Xerces. Only use Xerces strictly before or strictly after any calls to Parser functions, and never interleave the two.

Consider the following single-threaded example:

  1. Program starts.
  2. Call xercesc::XMLPlatformUtils::Initialize() within your application.
  3. Do some XML processing with Xerces.
  4. Call xercesc::XMLPlatformUtils::Terminate() within your application.
  5. Create ms_datfile and read options from mascot.dat.
  6. Create ms_mascotresfile.
  7. Create ms_peptidesummary and iterate over hits.
  8. Delete or free the Parser objects.
  9. Call xercesc::XMLPlatformUtils::Initialize() within your application.
  10. Do some XML processing with Xerces.
  11. Call xercesc::XMLPlatformUtils::Terminate() within your application.
  12. Program ends.

This sequence is safe, because Xerces is deinitialised before steps 5-7 and there are no explicit calls to Xerces between steps 5-7. Parser may call xercesc::XMLPlatformUtils::Initialize() and xercesc::XMLPlatformUtils::Terminate() some number of times in steps 5-7. After step 8, Parser will have deinitialised Xerces; the calls are balanced exactly. This means step 9 is safe, and as long as there are no Parser calls in steps 2-4 and 9-11, the program will work as expected.
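The strict separation in the sequence above can be enforced with a small scope guard. In the sketch below, xml_initialize() and xml_terminate() are hypothetical stand-ins for xercesc::XMLPlatformUtils::Initialize() and xercesc::XMLPlatformUtils::Terminate(); they only count calls so that the balance can be checked, and do_xml_work() and do_parser_work() are placeholders for the application's own code.

```cpp
#include <cassert>

namespace xml_sketch {

// Stand-ins for the Xerces initialise and terminate calls. A real
// application would call the Xerces functions here instead.
inline int& init_depth() {
    static int depth = 0;
    return depth;
}
inline void xml_initialize() { ++init_depth(); }
inline void xml_terminate()  { --init_depth(); }

// Scope guard: initialises on construction and terminates on destruction,
// including when an exception unwinds the enclosing scope.
class xml_session {
public:
    xml_session()  { xml_initialize(); }
    ~xml_session() { xml_terminate(); }
    xml_session(const xml_session&) = delete;
    xml_session& operator=(const xml_session&) = delete;
};

// Placeholder for application XML processing: requires initialisation.
inline void do_xml_work() { assert(init_depth() == 1); }

// Placeholder for Parser usage: the application must not hold the XML
// library initialised while this runs.
inline void do_parser_work() { assert(init_depth() == 0); }

inline void usage_example() {
    {
        xml_session xml;   // steps 2-4
        do_xml_work();
    }
    do_parser_work();      // steps 5-8, no application XML use in between
    {
        xml_session xml;   // steps 9-11
        do_xml_work();
    }
    assert(init_depth() == 0);  // every initialise was matched by a terminate
}

}  // namespace xml_sketch
```

Keeping each guard in its own block makes it visible in the source that no Parser call falls inside a region where the application has initialised the XML library.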

Note that there are additional restrictions concerning Xerces in multithreaded applications; see Multithreading (advanced reading).

Xerces and dynamically linked Parser on Windows

The Windows DLL version of Parser does not suffer from the problem. The shared library contains its own, separate copy of Xerces, which is initialised and deinitialised in DllMain(), right before your application starts and right after it exits. You can link against Xerces and the two libraries will not conflict.

Xerces and dynamically linked Parser on Linux

The Linux shared library version of Parser may export Xerces symbols depending on the compiler and linker. The library initialises Xerces in a library constructor function before main() is run, and deinitialises it in a library destructor function after main() returns. Although Parser does not export the internal Xerces variables, the implementation within Parser refers to them internally. It is possible that the final application contains only the Parser implementation or only the other implementation of Xerces. In this case, using Xerces methods either explicitly in your application or implicitly through Parser will almost certainly cause a crash, whether you call xercesc::XMLPlatformUtils::Initialize() or not.

One way to avoid the problem is to move your copy of Xerces into a different namespace (e.g. my_xercesc). To do this, you need to either change the default namespace of Xerces and recompile it, or create a statically linked wrapper library. This may not be an option if you are linking against Xerces installed as a system library.

Another way to avoid the problem is to use a different version of Xerces. Parser 2.5 uses Xerces 3.1. Xerces namespaces are versioned, so you could use Xerces 3.0 or 3.2 or any other version and the linker should keep the two implementations separate. Future versions of Parser may use a different version of Xerces, though.
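The reason versioned namespaces keep two implementations apart can be shown with a minimal sketch. The library xmllib and its namespaces xmllib_3_1 and xmllib_3_2 are invented for this example; it illustrates only the C++ naming mechanism, not how Xerces or Parser are actually built or linked.

```cpp
#include <cassert>
#include <string>

// Hypothetical library shipped in two versions, each in its own
// version-specific namespace.
namespace xmllib_3_1 {
inline std::string version() { return "3.1"; }
}
namespace xmllib_3_2 {
inline std::string version() { return "3.2"; }
}

namespace ns_sketch {

// Code built against version 3.1 sees it through the short alias.
namespace parser_side {
namespace xmllib = ::xmllib_3_1;
inline std::string xml_version() { return xmllib::version(); }
}

// Code built against version 3.2 uses the same short alias, but the
// fully qualified function name differs, so the symbols do not clash.
namespace app_side {
namespace xmllib = ::xmllib_3_2;
inline std::string xml_version() { return xmllib::version(); }
}

inline void usage_example() {
    assert(parser_side::xml_version() == "3.1");
    assert(app_side::xml_version() == "3.2");
}

}  // namespace ns_sketch
```

Because xmllib_3_1::version() and xmllib_3_2::version() are distinct names, a linker has no reason to merge them, which is the property the paragraph above relies on.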


Copyright © 2022 Matrix Science Ltd.  All Rights Reserved. Generated on Thu Mar 31 2022 01:12:30