PROOF in FairRoot

The changes to FairRoot that allow the use of PROOF were added on 08.03.2012.
To run FairRunAna on a PROOF cluster, edit the macro file and change:

FairRunAna* fRun = new FairRunAna();
to
FairRunAna* fRun = new FairRunAna("proof");

The example macros that show the usage of PROOF are located in macro/global/.

Create several input files with:


karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ for i in 0 1 2 3 4 5 6 7 8 9; do date; root -l -q "sim_BARREL_1000.C($i)" &> p$i.dat; done; date

Analyze the files in PROOF:

karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ root -l -q 'tracks_BARREL_1000.C("proof",10)'

You may also analyze the files locally:

karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ root -l -q 'tracks_BARREL_1000.C("local",10)'

Running on PROOF
The most visible difference when running on PROOF is the dialog window that appears once PROOF starts working. // I would like to have a picture here //.

It shows the progress of the analysis in graphical form.

The screen printout also differs from the one you are used to.

The major differences include:

  • the PROOF welcome message:
    FairRunAna::RunOnProof(0,10000): running FairAnaSelector on proof server: "workers=3" with PAR file name = "$VMCWORKDIR/gconfig/libFairRoot.par".
    +++++++ T P R O O F +++++++++++++++++++++++++++++++++
    creating TProof* proof = TProof::Open("workers=3");
    +++ Starting PROOF-Lite with 3 workers +++
    Opening connections to workers: OK (3 workers)
    Setting up worker servers: OK (3 workers)
    PROOF set to parallel mode (3 workers)
    +++++++ C R E A T E D +++++++++++++++++++++++++++++++

    These messages come from FairRunAna::RunOnProof() and report where and with which options the PROOF cluster was started.
  • the list of the enabled PROOF ARchives:
    drwxr-x--- 3 karabowi had1 22 31. Okt 13:34 libFairRoot
    lrwxrwxrwx 1 karabowi had1 60 9. Mär 11:57 libFairRoot.par -> /misc/karabowi/pandaroot_trunk/trunk/gconfig/libFairRoot.par

    In short, libFairRoot.par tells each worker to execute gconfig/rootlogon.C right after it is created; that macro loads the necessary libraries.
  • some lines informing the user about the number of entries, the input files, the status of the analysis...
    FairRunAna::RunOnProof(): The chain seems to have 10000 entries.
    FairRunAna::RunOnProof(): There are 10 files in the chain.
    FairRunAna::RunOnProof(): Starting inChain->Process("FairAnaSelector","",10000,0)

    Info in <TProofLite::SetQueryRunning>: starting query: 1
    Info in <TProofQueryResult::SetRunning>: nwrks: 3
    Error in <TGSpeedo::TGSpeedo::Build>: speedo.gif not found
    -I- FairAnaSelector::Begin()
    Looking up for exact location of files: OK (10 files)
    Looking up for exact location of files: OK (10 files)
    Info in <TPacketizerAdaptive::TPacketizerAdaptive>: Setting max number of workers per node to 3
    Validating files: OK (10 files)
    Info in <TPacketizerAdaptive::InitStats>: fraction of remote files 1.000000
    -I- FairAnaSelector::Terminate(): fOutput->ls() still sending)
    OBJ: TSelectorList TSelectorList Special TList used in the TSelector : 0
    OBJ: TList MissingFiles Doubly linked list : 0
    OBJ: TStatus PROOF_Status : 0 at: 0x3c2bd30
    OBJ: TOutputListSelectorDataMap PROOF_TOutputListSelectorDataMap_object Converter from output list to TSelector data members : 0 at: 0x2e53450
    -I- FairAnaSelector::Terminate(): -------------
    Lite-0: all output objects have been merged
    FairRunAna::RunOnProof(): inChain->Process DONE

  • the usual standard output (initialization messages, the couts from your tasks) is missing. Because the analysis is executed on several external workers, their printouts about what they actually did do not appear on the screen. They go to dedicated log files, which are located in:

    $HOME/.proof/pandaroot_installation_directory_trunk-macro-yourmacrodir/session-$hostname-$jobID-$pJobID/worker-0.workerNumber.log

    To make it clearer, here is the location of my log files:

    karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ ls ~/.proof/pandaroot_trunk-trunk-macro-global/session-lxi047-1331290627-14341/worker-0.*.log -l
    -rw-r----- 1 karabowi had1 439090 9. Mar 11:57 /misc/karabowi/.proof/pandaroot_trunk-trunk-macro-global/session-lxi047-1331290627-14341/worker-0.0.log
    -rw-r----- 1 karabowi had1 375985 9. Mar 11:57 /misc/karabowi/.proof/pandaroot_trunk-trunk-macro-global/session-lxi047-1331290627-14341/worker-0.1.log
    -rw-r----- 1 karabowi had1 440103 9. Mar 11:57 /misc/karabowi/.proof/pandaroot_trunk-trunk-macro-global/session-lxi047-1331290627-14341/worker-0.2.log

    The log files include all the printouts from the workers.
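
The naming pattern of the log files can be sketched in a few lines of standalone C++. This is only an illustration of the pattern visible in the listing above, not code taken from ROOT or FairRoot: the helper names ProofSandboxName and WorkerLogPath are hypothetical, and the assumption that the working directory is taken relative to the home directory is inferred from this one example.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not part of ROOT or FairRoot): reproduces the directory
// name seen in the listing above, i.e. the working directory relative to the
// home directory with every '/' replaced by '-'.
std::string ProofSandboxName(const std::string& workDir, const std::string& homeDir)
{
  std::string rel = workDir;
  // Strip the home-directory prefix, but only on a path-component boundary.
  const bool hasPrefix = !homeDir.empty()
                         && rel.compare(0, homeDir.size(), homeDir) == 0
                         && (rel.size() == homeDir.size()
                             || rel[homeDir.size()] == '/'
                             || homeDir.back() == '/');
  if (hasPrefix) {
    rel.erase(0, homeDir.size());
  }
  // Drop leading slashes so that the name does not start with '-'.
  rel.erase(0, rel.find_first_not_of('/'));
  std::replace(rel.begin(), rel.end(), '/', '-');
  return rel;
}

// Hypothetical helper: builds the path of one worker log following the
// pattern $HOME/.proof/<sandbox>/<session>/worker-0.<workerNumber>.log.
std::string WorkerLogPath(const std::string& homeDir, const std::string& sandbox,
                          const std::string& session, int workerNumber)
{
  return homeDir + "/.proof/" + sandbox + "/" + session
         + "/worker-0." + std::to_string(workerNumber) + ".log";
}
```

Called with the working directory and home directory from the example above, ProofSandboxName yields pandaroot_trunk-trunk-macro-global, and WorkerLogPath with that name, a session directory and a worker number yields a path of the form shown in the listing.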

Moreover, there are more output files than you might expect:
karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ ls -ltrh tracks_*.root
-rw-r----- 1 karabowi had1 378 9. Mar 12:37 tracks_22Part_n10000.root
-rw-r--r-- 1 karabowi had1 1009K 9. Mar 12:37 tracks_22Part_n10000_worker_0.0.root
-rw-r--r-- 1 karabowi had1 1013K 9. Mar 12:37 tracks_22Part_n10000_worker_0.2.root
-rw-r--r-- 1 karabowi had1 1,3M 9. Mar 12:37 tracks_22Part_n10000_worker_0.1.root

Each worker creates its own output file, and the trees/files from different workers are not merged. The file one would normally expect, tracks_22Part_n10000.root, is also created, but it is empty; this should be changed so that the empty file is not created at all.

Depending on the number of input files, number of workers and the time spent for analysis of one event, the PROOF Packetizer, responsible for sending events to workers, will behave differently, but in general:

  • if there are enough trees, each worker will get a tree to analyze. Any worker that finishes analyzing its tree will get another tree to analyze if there are any left. This is the scenario quoted above. Looking at the sizes of the output files, it is easy to deduce that worker 1 was the quickest in analyzing the trees and therefore got 4 of the input trees, while workers 0 and 2 got only 3 trees each.
  • if there are fewer trees than workers, or if the analysis of one event takes a long time, the events of a single tree will be distributed among different workers.
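
The first scenario can be illustrated with a toy model in standalone C++. It is not the real PROOF packetizer: the function AssignTrees and the per-worker processing times are made up for this sketch, and the only rule it implements is the one stated above, namely that the next tree goes to whichever worker becomes free first.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy model, not the real TPacketizerAdaptive: every tree is handed to the
// worker that becomes free first. secondsPerTree[i] is an assumed processing
// time per tree for worker i. Returns the number of trees each worker got.
std::vector<int> AssignTrees(int nTrees, const std::vector<double>& secondsPerTree)
{
  using Slot = std::pair<double, int>;  // (time the worker becomes free, worker index)
  std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> freeAt;
  for (int w = 0; w < static_cast<int>(secondsPerTree.size()); ++w) {
    freeAt.push(Slot(0.0, w));          // all workers are idle at the start
  }

  std::vector<int> treesPerWorker(secondsPerTree.size(), 0);
  for (int t = 0; t < nTrees && !freeAt.empty(); ++t) {
    const Slot next = freeAt.top();     // earliest free worker (lowest index on ties)
    freeAt.pop();
    ++treesPerWorker[next.second];
    freeAt.push(Slot(next.first + secondsPerTree[next.second], next.second));
  }
  return treesPerWorker;
}
```

With 10 trees and three workers, where worker 1 is assumed to be slightly faster than the other two (for example times of 1.0, 0.9 and 1.0), the model hands 4 trees to worker 1 and 3 trees each to workers 0 and 2, the same split as deduced from the file sizes above.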

Whatever the case, you will find that there is no correspondence in event order between the input files and the output files. One can, however, still match the events between input and output using the fRunId and fMCEntryNo stored in the EventHeader for each event.
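
A minimal sketch of such a matching, written in standalone C++ so that it does not depend on ROOT. The struct EventId and the function MatchEvents are hypothetical; in a real analysis the two values would be read from the EventHeader of each entry, and the integer types used here are an assumption of the sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Simplified stand-in for the two EventHeader values named above.
struct EventId {
  unsigned int runId;      // corresponds to fRunId
  int          mcEntryNo;  // corresponds to fMCEntryNo
};

// For every output event, returns the index of the input event carrying the
// same (runId, mcEntryNo) pair, or -1 if there is no such input event.
std::vector<long> MatchEvents(const std::vector<EventId>& input,
                              const std::vector<EventId>& output)
{
  // Index the input events by their (runId, mcEntryNo) key.
  std::map<std::pair<unsigned int, int>, long> indexOf;
  for (std::size_t i = 0; i < input.size(); ++i) {
    indexOf[std::make_pair(input[i].runId, input[i].mcEntryNo)] = static_cast<long>(i);
  }

  // Look up every output event, independent of the order it was written in.
  std::vector<long> match;
  match.reserve(output.size());
  for (const EventId& ev : output) {
    const auto it = indexOf.find(std::make_pair(ev.runId, ev.mcEntryNo));
    match.push_back(it == indexOf.end() ? -1 : it->second);
  }
  return match;
}
```

The lookup relies only on the pair of values, so it works no matter how the packetizer distributed the events among the workers.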

Restrictions

There are a few restrictions when using PROOF:

  • tasks - in principle any task can be attached to FairRunAna and analyzed using PROOF, but most will probably crash at present, as the members of many tasks are not properly initialized in their constructors. When running locally these uninitialized members do not cause any warnings/errors/segmentation violations, but when running on PROOF the master session will crash to prevent possible errors with uninitialized variables.
  • parameter file - to analyze trees in FairRoot you need to specify a ROOT file with the parameters necessary for reconstruction, which were stored in the simulation phase. To reconstruct a chain of several input trees, the matching parameters have to be stored in ONE file. The implication is that you cannot create several input files concurrently, since doing so would corrupt the parameter file. We are planning to fix this "feature" as soon as possible.

FairRunAna now has two constructors:

  • FairRunAna(); - takes no arguments and keeps the old functionality, running on one CPU
  • FairRunAna(const char* type, const char* proofName=""); - where:
    • "type" can be either "proof" to run on PROOF cluster or "local" to run locally, just like in FairRunAna();
    • "proofName" is the name of your PROOF cluster. One can also specify some options to PROOF here, for example:
      • proofName = "" - the default scenario. The PROOF cluster is your machine with as many workers as the CPUs the machine has
      • proofName = "workers=3" - to specify the number of workers to use
      • proofName = "pod://" - to run on the PROOF cluster created with PoD - not tested for a long time...

The changes to the SVN include:

  • 3 new files:
    • gconfig/libFairRoot.par - PROOF ARchive file that tells PROOF which libraries to load
    • base/FairAnaSelector.{h,cxx} - the general class deriving from TSelector that manages the reconstruction with PROOF.
  • updates to files:
    • pandaroot/trunk/gconfig/basiclibs.C (modified) (1 diff)
    • fairbase/release/base/CMakeLists.txt (modified) (1 diff)
    • fairbase/release/base/FairLinkDef.h (modified) (1 diff)
    • fairbase/release/base/FairRootManager.cxx (modified) (74 diffs)
    • fairbase/release/base/FairRootManager.h (modified) (6 diffs)
    • fairbase/release/base/FairRun.cxx (modified) (3 diffs)
    • fairbase/release/base/FairRun.h (modified) (5 diffs)
    • fairbase/release/base/FairRunAna.cxx (modified) (40 diffs)
    • fairbase/release/base/FairRunAna.h (modified) (8 diffs)
    • fairbase/release/parbase/FairParAsciiFileIo.cxx (modified) (1 diff)
    • fairbase/release/parbase/FairParAsciiFileIo.h (modified) (2 diffs)
    • fairbase/release/parbase/FairParIo.cxx (modified) (1 diff)
    • fairbase/release/parbase/FairParIo.h (modified) (2 diffs)
    • fairbase/release/parbase/FairParRootFileIo.cxx (modified) (1 diff)

To developers

The following notes are for developers who need to make their classes usable in PROOF:

  • 1. Member initialization matters! The most common error I encountered was caused by class members not being initialized in constructors. Do initialize them and many causes of errors will disappear.
  • 2. Some members are more complicated, or cannot simply be initialized to "0" or "NULL". A good way to get rid of the crash in these cases is to mark the member with "//!", which tells the streamer not to stream its value. Be careful, however: a member that is not streamed is not copied from the master to the worker nodes. Why does it matter? The tasks are created on the master, already in the macro, and their values are often changed there with dedicated Set functions. The PROOF cluster starts separate processes, possibly on different machines if you specify so. To execute exactly the task with exactly the parameters you want, the tasks are streamed to the nodes using the TList* fInput of the TSelector. If you set a member value with a Set function in the macro or in the constructor, but switch off streaming for that member, the value will not be set on the worker... Remember:
    Int_t fVerbosity; // verbosity of the task will be streamed and the value passed to the worker nodes
    Int_t fVerbosity; //! verbosity of the task will not be streamed and the value will not be passed to the worker nodes
  • 3. ReInit(). I have found some serious mistakes in ReInit(). When one analyzes a chain consisting of several trees, the ReInit() function of the tasks is called BETWEEN the different trees, when fRunId changes from one event to the next. Because of a bug in FairRunAna (now fixed) and the fact that analyzing a chain of trees was rarely used, ReInit() is often not properly debugged. I have seen the following issues:
    • a/ ReInit() returns kERROR - this means that the task will be omitted for the following trees.
    • b/ ReInit() calls the function responsible for creating detectors - in that case the task saw twice as many detectors in the second tree and 3 sets of detectors in the third tree. The task was a Hit Producer that created hits in a loop over the detectors, and it therefore created the normal amount of hits for the first tree, twice the normal amount for the second tree and so on.
    • c/ sometimes ReInit() goes wrong in such a way that the last entry of the first tree is treated as input to the tasks. In that case all events in the 2nd, 3rd and following trees have the same input and often also the same output...

    For that reason, when checking a task in PROOF I always use more files than workers, so that the workers have to call ReInit(). I also analyze the whole set locally and compare the size of the output files, the number of objects created by the task I debug, and the shape of the distributions. In case a/ you will have fewer objects than expected, in case b/ more, and in case c/ you will see peaks in the distributions.

  • 4. Verbosity. For me increasing it is the fastest way of finding problems.
  • 5. DEBUGGING
    Debugging on PROOF is more difficult than debugging locally, because errors/crashes can occur on the master (printed on the screen) and on the workers (in which case you have to inspect the individual log files). Because of that, I advise you not to add all the tasks at once but to fix them one by one. You will easily spot errors or a crash on the master. However, when the analysis crashes somewhere on the workers, you will see the following printout:

    Error in TPacketizerAdaptive::SplitPerHost: The input list contains no elements
    Info in TPacketizerAdaptive::InitStats: fraction of remote files 1.000000
    Info in TProofLite::MarkBad:
    +++ Message from master at lxi047.gsi.de : marking lxi047:-1 (0.2) as bad
    +++ Reason: undefined message in TProof::CollectInputFrom(...)

    +++ Message from master at lxi047.gsi.de : marking lxi047:-1 (0.2) as bad
    +++ Reason: undefined message in TProof::CollectInputFrom(...)

    +++ Most likely your code crashed
    +++ Please check the session logs for error messages either using
    +++ the 'Show logs' button or executing
    +++
    +++ root [] TProof::Mgr("lxi047.gsi.de")->GetSessionLogs()->Display("*")

    You may try the suggested method, but it will print out only the last 20 lines of each worker log. With a segmentation violation the actual error usually occurs earlier, so you will get many lines that do not contain any information needed for debugging.

    For me the quickest way was to open the log files directly:

    karabowi@lxi047:pandaroot_trunk:~/pandaroot_trunk/trunk/macro/global$ less ~/.proof/pandaroot_trunk-trunk-macro-global/last-lite-session/worker-0.2.log

    Note that PROOF transforms the name of the running directory (pandaroot_trunk/trunk/macro/global) by changing every "/" into "-" (pandaroot_trunk-trunk-macro-global).
    Remember also to take the log from the worker that crashed (in this case worker 0.2).
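
Points 1 to 3 of the notes above can be summarized in a small sketch. The class SketchHitProducer is hypothetical and deliberately written without ROOT so that it compiles on its own; a real task would derive from FairTask, and its ReInit() would return an InitStatus rather than a bool. The "//!" comments are ordinary comments here and only carry their special meaning when ROOT generates the class dictionary.

```cpp
#include <string>
#include <vector>

// Hypothetical hit producer used only to illustrate the notes above.
class SketchHitProducer {
 public:
  // Point 1: every member gets a defined value in the constructor.
  SketchHitProducer() : fVerbosity(0), fNofEvents(0), fDetectors() {}

  void SetVerbosity(int verbosity) { fVerbosity = verbosity; }
  int  GetVerbosity() const { return fVerbosity; }

  // Point 3b: ReInit() may be called once per tree, so the detector list is
  // rebuilt from scratch instead of being appended to.
  bool ReInit()
  {
    fDetectors.clear();
    fDetectors.push_back("barrel");    // placeholder detector names
    fDetectors.push_back("forward");
    return true;                       // point 3a: report success, not an error
  }

  // Toy event loop body: one hit per detector per event.
  int Exec()
  {
    ++fNofEvents;
    return static_cast<int>(fDetectors.size());
  }

 private:
  int fVerbosity;                       // set in the macro, so it has to be streamed
  int fNofEvents;                       //! per-worker counter, no need to stream it
  std::vector<std::string> fDetectors;  //! rebuilt in ReInit() on every worker
};
```

Calling ReInit() and Exec() repeatedly, as happens when a worker moves from one tree to the next, gives the same number of hits per event every time, which is the behaviour that case b/ above violates.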

If you cannot fix your problem, please contact me at r.karabowicz@gsi.de

Known issues

The parameter file - to analyze trees in FairRoot you need to specify a ROOT file with the parameters necessary for reconstruction, which were stored in the simulation phase. To reconstruct a chain of several input trees, the matching parameters have to be stored in ONE file. The implication is that you cannot create several input files concurrently, since doing so would corrupt the parameter file. We are planning to fix this "feature" as soon as possible.

The PROOF ARchive file produces an error on the master. In your macro you load and execute the rootlogon.C macro to make the FairRoot classes known to ROOT's CINT. When you run the analysis on PROOF, the default libFairRoot.par PROOF ARchive also loads and executes the same rootlogon.C macro. An annoying but harmless error is printed on the screen:

+++++++ T P R O O F +++++++++++++++++++++++++++++++++
creating TProof* proof = TProof::Open("");
+++ Starting PROOF-Lite with 8 workers +++
Opening connections to workers: OK (8 workers)
Setting up worker servers: OK (8 workers)
PROOF set to parallel mode (8 workers)
+++++++ C R E A T E D +++++++++++++++++++++++++++++++
EXECUTING libFairRoot.par/SETUP.C without includes
Function SETUP_c7827247() busy. loaded after "/misc/karabowi/pandaroot_trunk/trunk/gconfig/rootlogon.C"
Error: G__unloadfile() Can not unload "/misc/karabowi/pandaroot_trunk/trunk/gconfig/rootlogon.C", file busy /tmp/SETUP_c7827247.C:20:
Note: File "/misc/karabowi/pandaroot_trunk/trunk/gconfig/rootlogon.C" already loaded
*** Interpreter error recovered ***
DONT MIND THE ERRORS HERE, ITS EXECUTED.

Meanwhile I have added the line "DONT MIND THE ERRORS HERE, ITS EXECUTED", and am currently working on proper loading/unloading of the file.