This page describes the preferred configuration for EGEE sites that wish to support MPI. These guidelines are implemented in the Quattor Working Group templates and in a YAIM module (described on the page YaimConfig).

The main components that need configuring are the information system and job environment variables.

RPMs

It is up to you which versions of MPI you want to support. RPMs are available at this location: http://quattorsrv.lal.in2p3.fr/packages/mpi/ (or you can install your own preferred version - just make sure you advertise this correctly).

We use the mpi-start package from the int.eu.grid project to hide some of the details of MPI setup from the user. It should be installed on worker nodes. More details on mpi-start can be found in this presentation (http://indico.cern.ch/materialDisplay.py?contribId=s3t4&sessionId=s3&materialId=slides&confId=a063547). You can get the latest version of mpi-start here.

N.B. We hope to provide a meta-RPM soon that will pull in the MPI RPMs and mpi-start.

Distributing MPI binaries for user jobs

The MPI binaries that users want to run need to be accessible on every node involved in an MPI computation (it is a parallel job after all). There are three main approaches, plus a less-tested fourth (mpi-mt):

Shared home/other shared area

By far the best option is to provide user homes hosted on a shared filesystem. This could be either a network filesystem (e.g. NFS) or a cluster filesystem (e.g. GPFS or Lustre). The MPI binary you compile on the first MPI node will then automatically be available on all nodes. This is the normal mode of operation for MPI, and what MPI users will probably expect.

A secondary advantage is that if you share the home directory with the CE as well, you can use the pbs jobmanager (rather than lcgpbs). This is quicker and works better with the RB and WMS.

Passwordless ssh between WNs

If you configure host-based authentication between worker nodes, then mpi-start can automatically replicate your binary to the nodes involved in the computation. However, other files (e.g. data) will not be replicated, so this would have to be done manually (and would be slow for large data sets). It could also allow users to subvert the normal resource management mechanisms by directly executing commands on nodes not allocated to them.

Use mpiexec to distribute files

This option is for sites with neither a shared filesystem nor passwordless ssh between WNs. If you have an mpiexec that can spawn the remote jobs using the LRMS native interface, you can use it to distribute the files. See this page (http://www.osc.edu/~pw/mpiexec/index.php#Cute_mpiexec_hacks) for the basic idea. We hope to implement support for this option in mpi-start in the future. (Note: activation of this feature depends on the SSH and shared home FS variables being set correctly.)

mpi-mt

This option for mpi-start is not heavily tested, but might work. Quoting Sven Stork:

1. Use an MPI implementation that can start without the need for ssh.
2. Install a small tool called mpi-mt on every worker node in a specific place (https://savannah.fzk.de/~autobuild/module-mpi-mt-build.html).
3. mpi-start will then use the selected MPI implementation to start this program on every worker node in the job. mpi-mt is able to perform very basic operations (copy file, delete file, execute shell command).

In this way, the question of whether we can live without ssh/scp reduces to the question of whether the MPI implementation can live without ssh. Depending on your scheduler, you have several different options for the MPI implementation.

To tell mpi-start the location of this utility, you can set the variable I2G_MPI_MT. The default is /opt/i2g/bin/i2g-%{I2G_MPI_TYPE}-mpi-mt. If mpi-start finds the executable, it automatically switches over to the MPI-based approach.
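
The lookup described in the quote can be pictured roughly as follows. This is only a sketch, not mpi-start's actual code: the function name find_mpi_mt is hypothetical, and it assumes that the %{I2G_MPI_TYPE} placeholder in the default path stands for the value of the I2G_MPI_TYPE variable.

```shell
# Hypothetical helper (not part of mpi-start): print the mpi-mt path that
# would be used, or return non-zero if no executable is found there.
# Assumption: %{I2G_MPI_TYPE} in the default path means $I2G_MPI_TYPE.
find_mpi_mt() {
    # Honour an explicit I2G_MPI_MT, else fall back to the documented default.
    candidate="${I2G_MPI_MT:-/opt/i2g/bin/i2g-${I2G_MPI_TYPE}-mpi-mt}"
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
        printf '%s\n' "$candidate"
        return 0
    fi
    return 1
}

if mt=$(find_mpi_mt); then
    echo "mpi-mt found at $mt: MPI-based file distribution is possible"
else
    echo "mpi-mt not found: MPI-based file distribution is unavailable"
fi
```

Running this on a worker node is a quick way to check that the tool is installed where the default path, or a site-specific I2G_MPI_MT, points.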

Information system

Sites may install different implementations (or flavours) of MPI. It is expected that users will converge on OpenMPI in the future, but for the moment, a variety of libraries and tools are in use. It is important therefore that users can use the information system to locate sites with the software they require. MPI flavours will be advertised as GlueHostApplicationSoftwareRunTimeEnvironment variables.

MPI-start support

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START

MPI flavour(s)

<MPI flavour>

This is the most basic variable and one should be advertised for each MPI flavour that has been installed and tested. Currently supported flavours are MPICH, MPICH2, LAM and OPENMPI.

Example:

MPI version(s)

<MPI flavour>-<MPI version>

This should be published to allow users with special requirements to locate specific versions of MPI software.

Examples

MPI compiler(s) -- optional

<MPI flavour>-<MPI version>-<Compiler>

If <Compiler> is not published, the gcc suite is assumed.

Interconnects

MPI-<interconnect>

Interconnects: Ethernet, Infiniband, SCI, Myrinet

Examples

Open question: should sites be required to publish MPI-Ethernet?

Shared homes

If a site has a shared filesystem for home directories it should publish the variable MPI_SHARED_HOME.
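
To make the naming conventions in this section concrete, the sketch below assembles the full set of values a site might publish. Everything about it is illustrative: the function name mpi_glue_tags, its argument convention ("yes"/"no" for shared homes, FLAVOUR:VERSION:COMPILER for each MPI installation), and the version numbers and compiler name in the sample call are all hypothetical placeholders, not values taken from a real site.

```shell
# Hypothetical helper: print the RunTimeEnvironment lines for one site.
# Usage: mpi_glue_tags INTERCONNECT SHARED_HOME FLAVOUR[:VERSION[:COMPILER]]...
#   SHARED_HOME is "yes" or "no" (an assumed convention for this sketch).
mpi_glue_tags() {
    prefix="GlueHostApplicationSoftwareRunTimeEnvironment:"
    interconnect=$1
    shared_home=$2
    shift 2

    echo "$prefix MPI-START"
    for spec in "$@"; do
        # Split FLAVOUR:VERSION:COMPILER; VERSION and COMPILER may be absent.
        flavour=${spec%%:*}
        rest=${spec#"$flavour"}; rest=${rest#:}
        version=${rest%%:*}
        compiler=${rest#"$version"}; compiler=${compiler#:}

        echo "$prefix $flavour"
        if [ -n "$version" ]; then
            echo "$prefix $flavour-$version"
            # The compiler tag is optional; gcc is assumed when it is absent.
            if [ -n "$compiler" ]; then
                echo "$prefix $flavour-$version-$compiler"
            fi
        fi
    done
    echo "$prefix MPI-$interconnect"
    if [ "$shared_home" = "yes" ]; then
        echo "$prefix MPI_SHARED_HOME"
    fi
}

# Placeholder site: MPICH 1.2.7 (gcc assumed) and OPENMPI 1.2.5 built with icc.
mpi_glue_tags Infiniband yes MPICH:1.2.7 OPENMPI:1.2.5:icc
```

The sample call prints one line per value: MPI-START, the flavour and flavour-version tags for each installation, the compiler-qualified tag for the non-gcc build, the interconnect, and the shared-home marker.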

Environment variables

These environment variables should be set for jobs executing on a worker node at an MPI site. This is normally done by adding a script to /etc/profile.d. The environment variables should be a straight mapping from the values published in the information system.

All prefixed with MPI_:

Mandatory

Examples:

mpiexec

Some sites use OSC mpiexec because it uses the scheduler interface directly to execute multi-node jobs, so usage is accounted correctly. Sites that have this installed should set the following environment variable to the top directory of their mpiexec installation, e.g.:

MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80

Support for this format was added to i2g-mpi-start as of version 0.0.46.

Optional

In the case of the optional variables, we need to decide on a method for translating the version numbers (which contain decimal points) into a format compatible with environment variables.
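
One possible translation, shown here only as an illustration of the problem and not as an agreed convention, is to replace every character that is not valid in an environment variable name with an underscore. The function name mpi_version_to_env is hypothetical.

```shell
# Hypothetical convention: map each character that is not a letter, digit or
# underscore to "_", so that the result is safe inside a variable name.
mpi_version_to_env() {
    # printf avoids a trailing newline, which tr would otherwise translate too.
    printf '%s' "$1" | LC_ALL=C tr -c 'A-Za-z0-9_' '_'
}

mpi_version_to_env 1.2.7    # prints 1_2_7
echo
```

A drawback of this scheme is that it is not reversible: versions such as 1.2.7 and 1_2_7 would map to the same string, which is one reason a method still needs to be agreed.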

MPI_INTERCONNECT=<interconnect>

Shared area

MPI_SHARED_HOME

Optional

MPI_SHARED_AREA=<path to shared area>
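
Put together, a profile script using only the variables named on this page might look like the sketch below. The file name, the interconnect value and the shared-area path are placeholders. The page does not state what value MPI_SHARED_HOME should hold, so the "yes" used here is an assumption.

```shell
# /etc/profile.d/mpi.sh (hypothetical file name)
# All values are placeholders for illustration; adjust them to the site.

# Top directory of an OSC mpiexec installation (example value from this page).
export MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80

# Interconnect, matching the MPI-<interconnect> value the site publishes.
export MPI_INTERCONNECT=Infiniband

# Assumption: the required value is not specified on this page; "yes" is
# used here only to mark the variable as set.
export MPI_SHARED_HOME=yes

# Optional shared area outside the home directories (placeholder path).
export MPI_SHARED_AREA=/shared/mpi
```

Because scripts in /etc/profile.d are sourced by login shells, the variables are then visible to jobs without any further action by the user.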

mpirun

All MPI jobs coming through an EGEE resource broker will be wrapped in a call to mpirun (even if they are scripts rather than MPI binaries). For this reason, it is essential that a job arriving at the site finds an mpirun in its path before any MPI setup has been done. We have written a small "dummy" mpirun that can be installed on WNs in a location high up in the user's path. It simply executes the script passed to it. The YAIM and Quattor MPI configurations create this script automatically; it is also attached to this page as mpirun for manual installation.
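
The behaviour of such a dummy can be sketched as follows. This is not the attached script. It assumes that the broker's wrapper calls mpirun as "mpirun -np <N> -machinefile <file> <script> [args...]"; that option layout, and the function name dummy_mpirun, are assumptions made for illustration.

```shell
# Sketch of a "dummy" mpirun. Assumption: it is invoked as
#   mpirun -np <N> -machinefile <file> <script> [args...]
# so the leading options are skipped and the remainder is executed as-is.
dummy_mpirun() {
    while [ $# -gt 0 ]; do
        case $1 in
            -np|-machinefile)
                # Each of these options takes exactly one argument.
                [ $# -ge 2 ] || return 2
                shift 2
                ;;
            *)
                break
                ;;
        esac
    done
    if [ $# -eq 0 ]; then
        echo "dummy mpirun: nothing to run" >&2
        return 2
    fi
    # Run the user's script with its own arguments.
    "$@"
}

dummy_mpirun -np 4 -machinefile /dev/null echo "job script started"
```

Installed as a standalone file rather than a function, the last step would typically be exec "$@", so that the user's script replaces the wrapper process.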

mpi: SiteConfig (last edited 2011-07-12 14:41:40 by localhost)