Overview

We need a sensor that checks the configuration of MPI sites and verifies that MPI jobs can be run successfully. Eventually this could be integrated into the core SAM framework and used via the FCR tool.

Technical details

For the moment, a separate sensor is needed because MPI jobs can only be submitted using the MPICH job type. If that restriction is lifted, the test functionality could be added to any other sensor.

The first step was to use the standalone SAM framework (http://wiki.egee-see.org/index.php/SEE-GRID_standalone_SAM). I created a simple test script that tests only MPI, and manually set the job type and node number in the JDL used to submit the standalone SAM job. Some results can be seen at http://www.cs.tcd.ie/Stephen.Childs/mpi-tests

The next step was to integrate the test script into the SAM framework. The easiest approach turned out to be modifying the JDL for the existing CE sensor (/opt/lcg/same/client/sensors/CE/testjob.jdl) to add the NodeNumber and MPICH parameters, and then creating a new test for MPI (CE-sft-job-mpi -- a better name should probably be chosen) within the testjob sensor. See the bottom of the page for the modified files.

Problems were encountered with job submission using this approach. It turns out that the SAM framework generates a JDL like this:

Executable = "/bin/sh";
Arguments = "-c 'tar xzf testjob.tgz ; export SAME_WORK=`pwd`/work ; bin/same-exec -c same.conf --nodetest testjob grid10.lal.in2p3.fr -- CE-sft-1175070028 lxn1183.cern.ch 2>&1'";

and the nested quoting in the Arguments field interacts badly with whatever wrapping the MPI PBS jobwrapper applies, causing errors like this:

Job output:

tar: You must specify one of the `-Acdtrux' options
Try `tar --help' for more information.
/work ; bin/same-exec -c same.conf --nodetest testjob gridgate.cs.tcd.ie -- CE-sft-1175005602 lxn1183.cern.ch 2>&1
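
One plausible reading of the tar error, offered as an assumption rather than something this page confirms, is that the single quotes around the command list are lost somewhere in the wrapping. `sh -c` treats only the next word as the command string, so tar would then run with no options at all. The sketch below uses Python's `shlex` module to show how the argument list changes when the quotes are stripped; the inner command is shortened for the example.

```python
import shlex

# Inner command as it appears in the Arguments field (shortened for the example).
inner = "tar xzf testjob.tgz ; export SAME_WORK=`pwd`/work"
command_line = "/bin/sh -c " + shlex.quote(inner)

# With the quotes honoured, sh receives the whole command list as one argument.
intact = shlex.split(command_line)

# Assumed failure mode: a wrapper drops the quotes and re-splits on whitespace.
# "-c" is then followed by the single word "tar"; the remaining words become
# positional parameters instead of tar options.
stripped = command_line.replace("'", "").split()

print("intact   -c argument:", intact[2])
print("stripped -c argument:", stripped[2], "| leftover words:", stripped[3:5])
```

In the stripped case the shell runs a bare `tar`, which matches the "You must specify one of the options" message above.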

The solution is simple: rather than invoking a shell with arguments, create a shell script containing the contents of the Arguments field and submit that. I did this manually and succeeded in submitting a test job to the LAL CE (results here).

The CE sensor's JDL has now been modified to submit a script testjob.sh containing the arguments needed to run the test.
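
As a rough illustration of the change, the sketch below generates a testjob.sh holding the commands from the old Arguments string, plus a JDL that submits it as an MPICH job. It is not the modified sensor itself: the function and parameter names, the sandbox contents and the output file name testjob.out are assumptions made for the example, and the meaning of the two trailing same-exec fields is not described on this page.

```python
def wrapper_script(target, test_id, host):
    """Return the text of testjob.sh: the old Arguments string, one command per line."""
    return "\n".join([
        "#!/bin/sh",
        "tar xzf testjob.tgz",
        "export SAME_WORK=`pwd`/work",
        # Same invocation as before, but no longer nested inside JDL quotes.
        "bin/same-exec -c same.conf --nodetest testjob %s -- %s %s 2>&1"
        % (target, test_id, host),
        "",
    ])


def mpi_jdl(node_number):
    """Return a JDL that ships and runs testjob.sh as an MPICH job."""
    return "\n".join([
        'JobType = "MPICH";',
        "NodeNumber = %d;" % node_number,
        'Executable = "testjob.sh";',
        # Sandbox and output names below are hypothetical.
        'InputSandbox = {"testjob.sh", "testjob.tgz"};',
        'StdOutput = "testjob.out";',
        'StdError = "testjob.out";',
        'OutputSandbox = {"testjob.out"};',
        "",
    ])


print(wrapper_script("grid10.lal.in2p3.fr", "CE-sft-1175070028", "lxn1183.cern.ch"))
print(mpi_jdl(2))
```

Because the Executable is now a plain script name and there is no Arguments field, no quoting is left for the jobwrapper to mangle.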

Test specification

Do we need separate test jobs for each flavour of MPI? Each job would test the entire process:

  1. Match against site for flavour (on UI)
  2. Check that the environment variables are set up correctly for EGEE MPI and mpi-start (on WN)
    • Check that advertised flavours correspond with installed flavours
    • Check that advertised version numbers correspond with installed versions
  3. Check that the submitted C code compiles correctly (should we also submit equivalent Fortran code?)
  4. Check that the binary executes correctly (compare hostnames returned to machinefile)
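
The hostname comparison in step 4 could be sketched as follows. This assumes a machinefile with one line per slot, the hostname being the first field on each line, and that the MPI binary's output has already been parsed into a list of hostnames; `compare_hosts` is a hypothetical helper name.

```python
from collections import Counter


def compare_hosts(machinefile_text, reported_hosts):
    """Compare machinefile entries with the hostnames reported by the MPI ranks.

    Returns (missing, unexpected): hosts listed in the machinefile that no rank
    reported, and hosts reported by ranks that the machinefile does not cover.
    Repeated hosts are counted, so two slots on one node need two reports.
    """
    expected = Counter(
        line.split()[0] for line in machinefile_text.splitlines() if line.strip()
    )
    actual = Counter(reported_hosts)
    missing = sorted((expected - actual).elements())
    unexpected = sorted((actual - expected).elements())
    return missing, unexpected


machinefile = "wn1\nwn1\nwn2\n"
print(compare_hosts(machinefile, ["wn1", "wn2", "wn1"]))
print(compare_hosts(machinefile, ["wn1", "wn1", "wn1"]))
```

The test passes when both lists are empty; anything else suggests that ranks were not placed as the machinefile requested.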

Alternatively, the test could have a prepare stage that matches against a list of flavours and generates a test script that only tests advertised flavours.
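
A minimal sketch of such a prepare stage is shown below. It assumes that flavours are advertised as software tags of the form FLAVOUR or FLAVOUR-VERSION, and that the worker node exposes MPI_<FLAVOUR>_PATH and MPI_<FLAVOUR>_VERSION environment variables; neither convention is spelled out on this page, and the function names and flavour list are hypothetical.

```python
# Flavour names assumed for the example.
KNOWN_FLAVOURS = ("MPICH", "MPICH2", "OPENMPI", "LAM")


def advertised_flavours(tags):
    """Map each advertised flavour to its advertised version (None if absent)."""
    found = {}
    for tag in tags:
        name, _, version = tag.upper().partition("-")
        if name in KNOWN_FLAVOURS:
            # Keep a version if any tag for this flavour supplies one.
            found[name] = version or found.get(name)
    return found


def generate_test_script(flavours):
    """Generate a shell script that checks only the advertised flavours."""
    lines = ["#!/bin/sh", "status=0"]
    for name in sorted(flavours):
        lines.append(
            '[ -n "$MPI_%s_PATH" ] || { echo "%s: MPI_%s_PATH not set"; status=1; }'
            % (name, name, name)
        )
        if flavours[name]:
            lines.append(
                '[ "$MPI_%s_VERSION" = "%s" ] || { echo "%s: version mismatch"; status=1; }'
                % (name, flavours[name], name)
            )
    lines.append("exit $status")
    return "\n".join(lines) + "\n"


tags = ["MPI-START", "MPICH", "OPENMPI-1.2.5"]
print(generate_test_script(advertised_flavours(tags)))
```

The compile and execute steps would be appended per flavour in the same loop; they are left out here to keep the sketch short.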

mpi: MpiSamSensor (last edited 2011-07-12 14:41:39 by localhost)