This page describes the preferred configuration for EGEE sites that wish to support MPI. These guidelines are implemented in the Quattor Working Group templates, and in a YAIM module (described on the page YaimConfig).
The main components that need configuring are the information system and job environment variables.
RPMs
It is up to you which versions of MPI you want to support. RPMs are available at this location: http://quattorsrv.lal.in2p3.fr/packages/mpi/ (or you can install your own preferred version - just make sure you advertise this correctly).
We use the mpi-start package from the int.eu.grid project to hide some of the details of MPI setup from the user. It should be installed on worker nodes. More details on mpi-start can be found in this presentation: http://indico.cern.ch/materialDisplay.py?contribId=s3t4&sessionId=s3&materialId=slides&confId=a063547. You can get the latest version of mpi-start here.
N.B. We hope to provide a meta-RPM soon which will pull in the MPI RPMs and mpi-start.
Distributing MPI binaries for user jobs
The MPI binaries that users want to run will need to be accessible on every node involved in an MPI computation (it is a parallel job after all). There are three main approaches:
Shared home/other shared area
By far the best option is to provide user homes hosted on a shared filesystem. This could either be a network filesystem (e.g. NFS) or a cluster filesystem (e.g. GPFS or Lustre). Then the MPI binary you compile up on the first MPI node will automatically be available on all nodes. This is the normal mode of operation for MPI, and what MPI users will probably expect.
A secondary advantage is that if you share the home directory with the CE as well, you can use the pbs jobmanager (rather than lcgpbs). This is quicker and works better with the RB and WMS.
Passwordless ssh between WNs
If you configure host-based authentication between worker nodes, then mpi-start can automatically replicate your binary to nodes involved in the computation. However, other files (e.g. data) will not be replicated, so this would have to be done manually (and would be slow for large data sets). Also it could open up the potential for users to subvert the normal resource management mechanisms by directly executing commands on nodes not allocated to them.
Use mpiexec to distribute files
This option is for sites with neither a shared filesystem nor passwordless ssh between WNs. If you have an mpiexec that can spawn the remote jobs using the LRMS native interface, you can use it to distribute the files. See this page (http://www.osc.edu/~pw/mpiexec/index.php#Cute_mpiexec_hacks) for the basic idea. We hope to implement support for this option in mpi-start in the future. (Note: activation of this feature depends on the SSH and shared home FS variables being set correctly.)
mpi-mt
This option for mpi-start is not heavily tested, but might work. Quoting Sven Stork:
1. Use an MPI implementation that can start without the need for ssh.
2. Install a small tool called mpi-mt on every worker node in a specific place (https://savannah.fzk.de/~autobuild/module-mpi-mt-build.html).
3. mpi-start will use the selected MPI implementation to start this program on every worker node in the job. mpi-mt is able to perform very basic operations (copy file, delete file, execute shell command).
In this way, the question of whether we can live without ssh/scp reduces to the question of whether the MPI implementation can live without ssh. Depending on your scheduler you have several different options for the MPI implementation.
To tell mpi-start the location of this utility you can set the variable I2G_MPI_MT. The default is /opt/i2g/bin/i2g-%{I2G_MPI_TYPE}-mpi-mt. If mpi-start finds the executable, it automatically switches over to the MPI-based approach.
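The lookup described in the quote can be sketched in shell as follows. This is only an illustration, not mpi-start's own code: the function names `mpi_mt_path` and `mpi_mt_available` are hypothetical, and it assumes that `%{I2G_MPI_TYPE}` in the default path stands for the value of the `I2G_MPI_TYPE` variable.

```shell
# Sketch only: not mpi-start's actual code.
# Assumption: %{I2G_MPI_TYPE} in the default path stands for the value of
# the I2G_MPI_TYPE variable (for example "openmpi").

mpi_mt_path() {
    # An explicit I2G_MPI_MT wins; otherwise fall back to the quoted default.
    printf '%s\n' "${I2G_MPI_MT:-/opt/i2g/bin/i2g-${I2G_MPI_TYPE:-}-mpi-mt}"
}

mpi_mt_available() {
    # True only if the resolved path exists and is executable.
    [ -x "$(mpi_mt_path)" ]
}

I2G_MPI_TYPE=openmpi
if mpi_mt_available; then
    echo "mpi-mt found at $(mpi_mt_path): MPI-based file distribution is possible"
else
    echo "mpi-mt not found at $(mpi_mt_path): another distribution method is needed"
fi
```

A check of this kind can be run on a worker node to confirm that the utility is installed where mpi-start is expected to look for it.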
Information system
Sites may install different implementations (or flavours) of MPI. It is expected that users will converge on OpenMPI in the future, but for the moment, a variety of libraries and tools are in use. It is important therefore that users can use the information system to locate sites with the software they require. MPI flavours will be advertised as GlueHostApplicationSoftwareRunTimeEnvironment variables.
MPI-start support
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
MPI flavour(s)
<MPI flavour>
This is the most basic variable and one should be advertised for each MPI flavour that has been installed and tested. Currently supported flavours are MPICH, MPICH2, LAM and OPENMPI.
Example:
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
MPI version(s)
<MPI flavour>-<MPI version>
This should be published to allow users with special requirements to locate specific versions of MPI software.
Examples
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.0.2
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-G2-1.2.7
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.0.2-ICC
MPI compiler(s) -- optional
<MPI flavour>-<MPI version>-<Compiler>
If <Compiler> is not published, then the gcc suite is assumed.
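The naming scheme for flavour, version and compiler tags can be illustrated with a small shell helper. The function name `mpi_runtime_tag` is hypothetical and is not part of YAIM or Quattor; it only composes the strings in the format described above.

```shell
# Hypothetical helper: build one published tag from its parts.
mpi_runtime_tag() {
    # $1 = flavour, $2 = version (optional), $3 = compiler (optional)
    tag=$1
    if [ -n "${2:-}" ]; then
        tag="$tag-$2"
        # A compiler suffix is only appended after a version.
        if [ -n "${3:-}" ]; then
            tag="$tag-$3"
        fi
    fi
    printf 'GlueHostApplicationSoftwareRunTimeEnvironment: %s\n' "$tag"
}

mpi_runtime_tag OPENMPI             # flavour only
mpi_runtime_tag MPICH 1.2.7         # flavour and version
mpi_runtime_tag OPENMPI 1.0.2 ICC   # flavour, version and compiler
```

The last call prints `GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.0.2-ICC`, matching the example above.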
Interconnects
MPI-<interconnect>
Interconnects: Ethernet, Infiniband, SCI, Myrinet
Examples
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband
Sites have to publish MPI-Ethernet (?)
Shared homes
If a site has a shared filesystem for home directories it should publish the variable MPI_SHARED_HOME.
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
Environment variables
These environment variables should be set for jobs executing on a worker node in an MPI site. This is normally done by adding a script to /etc/profile.d. The environment variables should be a straight mapping from the values published in the information system.
All prefixed with MPI_:
Mandatory
MPI_<flavour>_VERSION
MPI_<flavour>_PATH
Examples:
- MPI_MPICH_VERSION=1.2.6
- MPI_MPICH_PATH=/opt/mpich-1.2.6
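A minimal sketch of such a profile script is shown below. The file name `/etc/profile.d/mpi.sh` is an assumption, and the values are simply the examples given above; a real site would substitute its own installed version and path.

```shell
# /etc/profile.d/mpi.sh (hypothetical file name)
# Values taken from the examples on this page; adjust to the local installation.
export MPI_MPICH_VERSION=1.2.6
export MPI_MPICH_PATH=/opt/mpich-1.2.6
```

Because files in /etc/profile.d are sourced by login shells, jobs that start through a login shell on the worker node will see these variables.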
mpiexec
Some sites use OSC mpiexec because it uses the scheduler interface directly to execute multi-node jobs, so usage is accounted correctly. Sites which have this installed should set the following environment variable to the top directory of their mpiexec installation, e.g.
MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80
Support for this format was added to i2g-mpi-start as of version 0.0.46.
Optional
MPI_<flavour>_COMPILER
MPI_<flavour>_<version>_PATH
MPI_<flavour>_<version>_<compiler>_PATH
- MPI_OPENMPI_COMPILER=/opt/openmpi-1.0.2 (?)
In the case of the optional variables, we need to decide on a method for translating the version numbers (which contain decimal points) into a format compatible with environment variables.
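One possible translation, shown purely as an illustration and not as an agreed convention, is to replace each decimal point with an underscore. The helper names `mpi_version_token` and `mpi_path_var_name` are hypothetical.

```shell
# Illustration of one candidate convention (not an agreed standard):
# replace "." with "_" so the version can appear in a variable name.
mpi_version_token() {
    printf '%s' "$1" | tr '.' '_'
}

mpi_path_var_name() {
    # $1 = flavour, $2 = version, $3 = compiler (optional)
    if [ -n "${3:-}" ]; then
        printf 'MPI_%s_%s_%s_PATH\n' "$1" "$(mpi_version_token "$2")" "$3"
    else
        printf 'MPI_%s_%s_PATH\n' "$1" "$(mpi_version_token "$2")"
    fi
}

mpi_path_var_name OPENMPI 1.0.2        # prints MPI_OPENMPI_1_0_2_PATH
mpi_path_var_name OPENMPI 1.0.2 ICC    # prints MPI_OPENMPI_1_0_2_ICC_PATH
```

Underscores keep the result a valid shell identifier. The mapping is not reversible if a version string already contains an underscore, which any final convention would need to consider.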
MPI_INTERCONNECT=<interconnect>
Shared area
MPI_SHARED_HOME
Optional
MPI_SHARED_AREA=<path to shared area>
- This should be a job-specific area: how should it be set up?
- Who takes care of getting all the data, executable, etc. to the shared area?
mpirun
All MPI jobs coming through an EGEE resource broker will be wrapped in a call to mpirun (even if they are scripts rather than MPI binaries). For this reason, it is essential that a job arriving at the site should find an mpirun in its path before any MPI setup has been done. We have written a small "dummy" mpirun that can be installed on WNs in a location high up in the user's path. It will simply execute the script passed to it. YAIM and Quattor MPI configurations will create this script automatically; for manual use, it is attached here: mpirun.
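The idea behind the dummy mpirun can be sketched as below. This is not the attached script. It assumes that the broker's wrapper places `-np <n>` and `-machinefile <file>` before the user's script, and the function name `dummy_mpirun` is hypothetical.

```shell
# Sketch of a "dummy" mpirun; not the script distributed by YAIM or Quattor.
dummy_mpirun() {
    # Assumption: the wrapper passes "-np <n>" and "-machinefile <file>"
    # before the user's script; drop those options and their values.
    while [ $# -gt 0 ]; do
        case "$1" in
            -np|-machinefile)
                shift
                if [ $# -gt 0 ]; then shift; fi
                ;;
            *)
                break
                ;;
        esac
    done

    if [ $# -eq 0 ]; then
        echo "dummy mpirun: no script given" >&2
        return 1
    fi

    # Run the user's script with its own arguments; the real MPI
    # start-up is left to that script (for example via mpi-start).
    "$@"
}
```

Installed as a standalone file named mpirun, a script of this kind would end with the line `dummy_mpirun "$@"` so that the arguments supplied by the wrapper are passed through.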
TCG working