Some experience in configuring gLite MPI cluster via YAIM
Installation of packages
To install the needed packages via yum, an additional repository has to be added on the CE and WNs in /etc/yum.repos.d/. The file can be named e.g. glite-MPI_utils.repo and it can be downloaded from here. Its content is the following:
[glite-MPI_utils]
name=gLite 3.1 MPI utils
baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-MPI_utils/sl4/$basearch/
enabled=1
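If downloading the file is not convenient, it can also be written from the shell. The following is only a sketch: the helper name write_mpi_repo and its optional directory argument are made up for illustration. On a real node the target directory would be /etc/yum.repos.d and the call has to be made as root; here it is dry-run into a scratch directory.

```shell
#!/bin/bash
# Hypothetical helper: write glite-MPI_utils.repo into the given directory
# (defaults to /etc/yum.repos.d, which requires root).
write_mpi_repo() {
    local dir="${1:-/etc/yum.repos.d}"
    # Quoted heredoc delimiter: $basearch must reach yum literally.
    cat > "$dir/glite-MPI_utils.repo" <<'EOF'
[glite-MPI_utils]
name=gLite 3.1 MPI utils
baseurl=http://linuxsoft.cern.ch/EGEE/gLite/R3.1/glite-MPI_utils/sl4/$basearch/
enabled=1
EOF
}

# Dry run into a scratch directory to inspect the result.
scratch=$(mktemp -d)
write_mpi_repo "$scratch"
cat "$scratch/glite-MPI_utils.repo"
```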
Remove all Java-related packages in order to avoid unresolvable dependency problems:
[CE,WNs]$ yum remove java jdk xml-commons-jaxp
Update installed packages:
[CE,WNs]$ yum update
There are several options mentioned here to install Java. The second one, called "Option 2: Installing SUN's RPM of JDK", assumes the installation of native JDK packages from SUN's web site. Installing the xml-commons-jaxp-1.3-apis package causes a dependency problem whereas xml-commons-jaxp-1.2-apis doesn't. So do:
[CE,WNs]$ yum install xml-commons-jaxp-1.2-apis
[CE,WNs]$ yum localinstall jdk-1_5_0_14-linux-i586.rpm
[CE,WNs]$ yum install java-1.5.0-sun-compat-1.5.0.14
More details about different combinations of java, xml-commons-jaxp and java-1.5.0-sun-compat packages are described in GGUS ticket #49604.
Taking into account that issue and bug #50854, it was necessary to specify the exact versions of some packages for the CE and WNs respectively, as below:
[CE]$ yum install lcg-CE glite-BDII glite-TORQUE_server glite-TORQUE_utils glite-MPI_utils torque-2.1.9-4cri.slc4 maui-client-3.2.6p19_20.snap.1182974819-4.slc4 maui-server-3.2.6p19_20.snap.1182974819-4.slc4 maui-3.2.6p19_20.snap.1182974819-4.slc4 torque-server-2.1.9-4cri.slc4 torque-client-2.1.9-4cri.slc4
[WNs]$ yum install glite-WN glite-TORQUE_client glite-MPI_utils glite-TORQUE_utils torque-2.1.9-4cri.slc4 torque-client-2.1.9-4cri.slc4 maui-client-3.2.6p19_20.snap.1182974819-4.slc4 torque-mom-2.1.9-4cri.slc4 maui-3.2.6p19_20.snap.1182974819-4.slc4
Additional MPI flavors can also be installed. E.g. to enable OpenMPI support on the cluster, an openmpi package needs to be compiled with torque/pbs support (the --with-tm option). To enable Fortran support, the "--enable-mpi-f77" and "--enable-mpi-f90" options need to be specified:
$ wget http://www.open-mpi.org/software/ompi/v1.3/downloads/openmpi-1.3.2-1.src.rpm -P /usr/src/redhat/SRPMS/
$ yum install gcc-c++ gcc make gcc4-gfortran
$ yum install torque-devel-2.1.9-4cri.slc4
$ rpmbuild --rebuild /usr/src/redhat/SRPMS/openmpi-1.3.2-1.src.rpm --define 'configure_options --with-tm=/usr --enable-mpi-f77 --enable-mpi-f90' --define 'install_in_opt 1' --define 'build_all_in_one_rpm 1' --target i386
Install the compiled package on one of the WNs and check that the tm components are there:
$ rpm -Uvh /usr/src/redhat/RPMS/i386/openmpi-1.3.2-1.i386.rpm
$ /opt/openmpi/1.3.2/bin/ompi_info|grep " tm"
MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.2)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.2)
The compiled package needs to be installed on all WNs of the cluster (as well as gcc4-gfortran if needed).
There is a corresponding feature request.
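The tm check can be scripted so that it is easy to repeat on every WN after the package is installed. This is a minimal sketch: the function name has_tm_support is an assumption, and it only looks for the two lines (ras and plm) that appear in the ompi_info output quoted on this page.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both the "ras" and the "plm" tm
# components appear in ompi_info output supplied on stdin.
has_tm_support() {
    local out
    out=$(cat)
    grep -q 'MCA ras: tm ' <<<"$out" && grep -q 'MCA plm: tm ' <<<"$out"
}

# Sample input copied from the ompi_info output quoted on this page.
sample='MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.2)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.2)'

if has_tm_support <<<"$sample"; then
    echo "tm support present"
else
    echo "tm support missing"
fi
```

On a WN the same function would be fed from the real tool, e.g. /opt/openmpi/1.3.2/bin/ompi_info | has_tm_support.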
The i2g-mpi-start package available in the glite-MPI_utils repo is rather old (v0.0.52-1). The latest production version is 0.0.59-1 and it is available here. There is also a corresponding enhancement request.
Links to repositories of i2g packages for different architectures can be found here.
Configuring CE and WNs
As mentioned here there are several approaches to distribute MPI binaries for user jobs and thus there are different gLite MPI cluster setups:
1) shared /home or other directory between WNs,
2) passwordless ssh between WNs,
3) files distribution via mpiexec
The i2g-mpi-start package (or simply mpi-start) can be used for file distribution. Its logic is the following.
The distribution method used depends on the plugin that is found most suitable after iterating over all *.filedist plugins. Technically, every plugin returns some integer. After iterating over all plugins, the plugin which returned the lowest integer (highest suitability) is used.
Suitability depends on:
- the available MPI, and
- other environment settings (most frequently the availability of a shared file system).
The mpiexec-based distribution logic is
(the mpiexec plugin finds out that MPICH or MPICH2 is used AND PBS is the scheduling system) AND (no better file distribution is found).
The ssh-based distribution logic is
(the env. variable MPI_SSH_HOST_BASED_AUTH is set OR Open MPI is found) AND (no better file distribution is found).
At the moment there is no optimal GLOBAL strategy when more than one method is found suitable. For example, the mpiexec-based distribution currently has a higher priority than the ssh-based distribution if both plugins are found to work. That is why, to get ssh file distribution to work, the mpiexec plugin needs to be switched off (removed or renamed); see the "passwordless ssh between WNs (host-based authentication)" part.
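The "lowest integer wins" rule can be illustrated with a few lines of shell. This is not mpi-start's code: the function name pick_filedist_plugin and the scores 10 and 20 are invented for the example, chosen only so that mpiexec ranks above ssh as described.

```shell
#!/bin/bash
# Sketch only: the scores below are made up and not taken from mpi-start.
# Input: lines of "<plugin-name> <score>"; output: the name with the lowest score.
pick_filedist_plugin() {
    sort -k2,2n | head -n 1 | cut -d' ' -f1
}

candidates=$'ssh 20\nmpiexec 10'

# Both plugins report themselves usable: mpiexec wins.
echo "$candidates" | pick_filedist_plugin

# With the mpiexec plugin removed or renamed, ssh is what remains.
echo "$candidates" | grep -v '^mpiexec ' | pick_filedist_plugin
```

The second call shows why disabling the mpiexec plugin is enough to get the ssh-based distribution selected.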
Because of bug #48875 all MPI-related variables have to be defined in the site-info.def file and not in the siteinfo/services/glite-mpi* files.
Edit configuration files (site-info.def, users.conf, groups.conf, wn-list.conf) taking into account info from Yaim Guide, MPI YaimConfig, MpiTools pages and according to your grid site configuration.
shared /home or other directory between WNs
The values of the relevant site-info.def variables below are for a cluster with the "pbs" jobmanager, a /home directory shared among the CE and WNs, and no host-based authentication:
JOB_MANAGER=pbs
CE_BATCH_SYS=pbs
MPI_MPICH_ENABLE="yes"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7p1"
MPI_MPICH_MPIEXEC="/opt/mpiexec-0.82/bin/mpiexec"
MPI_MPICH2_ENABLE="no"
MPI_OPENMPI_ENABLE="yes"
MPI_OPENMPI_PATH="/usr/lib/openmpi/1.2.5-gcc"
MPI_OPENMPI_VERSION="1.2.5"
MPI_OPENMPI_MPIEXEC="/usr/lib/openmpi/1.2.5-gcc/bin/mpiexec"
MPI_LAM_ENABLE="no"
MPI_SHARED_HOME="yes"
MPI_SSH_HOST_BASED_AUTH="no"
MPI_SUBMIT_FILTER="yes"
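Since all MPI-related variables have to live in site-info.def, a quick check that none of them was forgotten can save a yaim run. The sketch below is an assumption-laden helper, not part of yaim: the name check_mpi_vars and the list of variables it checks are chosen for illustration. Note that it sources the file, so it should only be pointed at files you trust.

```shell
#!/bin/bash
# Hypothetical helper: report MPI-related variables that are unset or empty
# in a site-info.def style file. The variable list is a subset chosen for
# illustration. The file is sourced in a subshell, so the caller's
# environment is left untouched.
check_mpi_vars() {
    local file="$1" var
    local vars=(MPI_MPICH_ENABLE MPI_MPICH2_ENABLE MPI_OPENMPI_ENABLE
                MPI_LAM_ENABLE MPI_SHARED_HOME MPI_SSH_HOST_BASED_AUTH
                MPI_SUBMIT_FILTER)
    (
        unset "${vars[@]}"            # ignore values inherited from the caller
        . "$file" || exit 2
        missing=0
        for var in "${vars[@]}"; do
            if [ -z "${!var}" ]; then
                echo "missing: $var"
                missing=1
            fi
        done
        exit "$missing"
    )
}

# Demo on a deliberately incomplete scratch file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
MPI_MPICH_ENABLE="yes"
MPI_MPICH2_ENABLE="no"
MPI_OPENMPI_ENABLE="yes"
MPI_LAM_ENABLE="no"
MPI_SHARED_HOME="yes"
MPI_SUBMIT_FILTER="yes"
EOF
check_mpi_vars "$demo" || echo "site-info.def is incomplete"
```

On a real node the argument would be the actual file, e.g. check_mpi_vars /root/siteinfo/site-info.def.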
Copy CE's hostcert.pem and hostkey.pem as well as /etc/grid-security/certificates into /etc/grid-security/.
It might be useful to verify if config files are correct:
[CE]$ /opt/glite/yaim/bin/yaim -v -s /root/siteinfo/site-info.def -n MPI_CE -n lcg-CE -n TORQUE_server -n TORQUE_utils -n BDII_site
[WNs]$ /opt/glite/yaim/bin/yaim -v -s /root/siteinfo/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
If verification ended without any errors then configuration can be done:
[CE]$ /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n MPI_CE -n lcg-CE -n TORQUE_server -n TORQUE_utils -n BDII_site
[WNs]$ /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
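The verify-then-configure sequence can be wrapped so that the configuration step never runs after a failed verification. A minimal sketch, assuming only the yaim options already used above (-v, -c, -s, -n); the function name and the YAIM override variable are hypothetical.

```shell
#!/bin/bash
# Path taken from the commands above; override YAIM to try the wrapper
# without a real yaim installation.
YAIM="${YAIM:-/opt/glite/yaim/bin/yaim}"

# Hypothetical wrapper: run "yaim -v" first and run "yaim -c" only if it passed.
# Usage: yaim_verify_then_configure <site-info.def> <node-type>...
yaim_verify_then_configure() {
    local siteinfo="$1"; shift
    local args=() node
    for node in "$@"; do
        args+=(-n "$node")
    done
    if "$YAIM" -v -s "$siteinfo" "${args[@]}"; then
        "$YAIM" -c -s "$siteinfo" "${args[@]}"
    else
        echo "verification failed, not configuring" >&2
        return 1
    fi
}

# Example for a WN (commented out, it needs a real yaim):
# yaim_verify_then_configure /root/siteinfo/site-info.def MPI_WN glite-WN TORQUE_client
```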
passwordless ssh between WNs (host-based authentication)
The values of the relevant site-info.def variables below are for a cluster with the "lcgpbs" jobmanager and host-based authentication enabled:
JOB_MANAGER=lcgpbs
CE_BATCH_SYS=torque
MPI_MPICH_ENABLE="yes"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7p1"
MPI_MPICH_MPIEXEC="/opt/mpiexec-0.82/bin/mpiexec"
MPI_MPICH2_ENABLE="no"
MPI_OPENMPI_ENABLE="yes"
MPI_OPENMPI_PATH="/usr/lib/openmpi/1.2.5-gcc"
MPI_OPENMPI_VERSION="1.2.5"
MPI_OPENMPI_MPIEXEC="/usr/lib/openmpi/1.2.5-gcc/bin/mpiexec"
MPI_LAM_ENABLE="no"
MPI_SHARED_HOME="no"
MPI_SSH_HOST_BASED_AUTH="yes"
MPI_SUBMIT_FILTER="yes"
Copy CE's hostcert.pem and hostkey.pem as well as /etc/grid-security/certificates into /etc/grid-security/.
It might be useful to verify if config files are correct:
[CE]$ /opt/glite/yaim/bin/yaim -v -s /root/siteinfo/site-info.def -n MPI_CE -n lcg-CE -n TORQUE_server -n TORQUE_utils -n BDII_site
[WNs]$ /opt/glite/yaim/bin/yaim -v -s /root/siteinfo/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
If verification ended without any errors then configuration can be done:
[CE]$ /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n MPI_CE -n lcg-CE -n TORQUE_server -n TORQUE_utils -n BDII_site
[WNs]$ /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
As mentioned in bug #50524, yaim doesn't configure host-based authentication properly. To enable it, the following needs to be done in addition to yaim's modifications.
Edit /etc/ssh/sshd_config file on WNs like below:
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
IgnoreRhosts yes
and add all relevant FQDNs to /etc/ssh/shosts.equiv, one FQDN per line (the helper script /opt/edg/sbin/edg-pbs-shostsequiv and its config file /opt/edg/etc/edg-pbs-shostsequiv.conf can be used as well):
node1.domain
node2.domain
node3.domain
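If the edg helper script is not at hand, the one-FQDN-per-line list can be derived from a node list such as wn-list.conf. The following is a sketch under that assumption; the function name make_shosts_equiv is hypothetical and the input format (one host per line, optional # comments) is assumed.

```shell
#!/bin/bash
# Hypothetical helper: print one FQDN per line, without blank lines, comments
# or duplicates, from one or more node-list files (e.g. wn-list.conf).
make_shosts_equiv() {
    cat "$@" | sed -e 's/#.*//' -e 's/[[:space:]]//g' | grep -v '^$' | sort -u
}

# Demo on a scratch list with a duplicate, a blank line and a comment.
list=$(mktemp)
printf '%s\n' node2.domain node1.domain '' node2.domain '# spare' node3.domain > "$list"
make_shosts_equiv "$list"
```

On a real node the output would be redirected to /etc/ssh/shosts.equiv, with the CE's FQDN added to the input if the CE has to be included.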
Restart sshd:
[root@WNs]$ /etc/init.d/sshd restart
Stopping sshd: [ OK ]
Starting sshd: [ OK ]
Spread /etc/ssh/sshd_config and /etc/ssh/shosts.equiv files over all WNs and restart sshd service on each of them.
Make sure that yaim created the /etc/ssh/ssh_known_hosts file with the RSA keys of all relevant hosts. If this file is incomplete, the helper script /opt/edg/sbin/edg-pbs-knownhosts together with its configuration file /opt/edg/etc/edg-pbs-knownhosts.conf can be used.
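Whether ssh_known_hosts is complete can be checked by comparing it against shosts.equiv. The sketch below assumes plain (unhashed) host names in the first field of each known-hosts line; the function name missing_known_hosts is hypothetical, and the key material in the demo is a placeholder, not a real key.

```shell
#!/bin/bash
# Hypothetical helper: list hosts from shosts.equiv that have no entry in a
# known_hosts style file. Only plain (unhashed) host names are recognised.
# Usage: missing_known_hosts <shosts.equiv> <ssh_known_hosts>
missing_known_hosts() {
    local equiv="$1" known="$2"
    awk '
        FILENAME == ARGV[1] {               # first file: known hosts
            n = split($1, names, ",")       # field 1 may be "host,alias,ip"
            for (i = 1; i <= n; i++) known[names[i]] = 1
            next
        }
        NF && !($1 in known) { print $1 }   # second file: expected hosts
    ' "$known" "$equiv"
}

# Demo with scratch files; "AAAA..." stands in for real key data.
equiv=$(mktemp); known=$(mktemp)
printf '%s\n' node1.domain node2.domain node3.domain > "$equiv"
printf '%s\n' 'node1.domain,10.0.0.1 ssh-rsa AAAA...' 'node3.domain ssh-rsa AAAA...' > "$known"
missing_known_hosts "$equiv" "$known"    # prints node2.domain
```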
Check if pool account user can perform passwordless ssh to another host and back:
[root@WN01 ~]# su - dteam001
[dteam001@WN01 ~]$ ssh <WN02_hostname>
[dteam001@WN02 ~]$ ssh <WN01_hostname>
[dteam001@WN01 ~]$
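With more than a couple of WNs the manual check becomes tedious, so it can be looped. This is a sketch, not a tested site tool: the function name and the SSH override variable are assumptions. BatchMode=yes and ConnectTimeout are standard OpenSSH client options; BatchMode makes ssh fail instead of prompting for a password, which is exactly the failure to detect here.

```shell
#!/bin/bash
# SSH is overridable so that the loop can be dry-run with a stub.
SSH="${SSH:-ssh}"

# Hypothetical helper: try a non-interactive login to every host given and
# report the ones that fail.
check_passwordless_ssh() {
    local host failed=0
    for host in "$@"; do
        if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example, to be run as a pool account user such as dteam001 (commented out,
# the host names are placeholders):
# check_passwordless_ssh <WN02_hostname> <WN03_hostname>
```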
Passwordless ssh from the CE to the WNs may fail even though it works from the WNs to the CE and from one WN to another. The reason is that during configuration yaim modifies /etc/ssh/sshd_config on the CE but not /etc/ssh/ssh_config. To enable passwordless ssh from the CE to the WNs, copy /etc/ssh/ssh_config from a WN to the CE or just add the following lines to that file on the CE:
RhostsAuthentication yes
RhostsRSAAuthentication yes
EnableSSHKeysign yes
HostbasedAuthentication yes
and restart the sshd service:
[root@CE]$ /etc/init.d/sshd restart
Stopping sshd: [ OK ]
Starting sshd: [ OK ]
To force the i2g-mpi-start tools to distribute files over the WNs via ssh (i.e. to use the /opt/i2g/etc/ssh.filedist script), the file /opt/i2g/etc/mpiexec.filedist needs to be removed or renamed on all WNs. This follows from the mpi-start plugin logic described above.
files distribution via mpiexec
I have not tested this file distribution method yet. It looks like enabling it requires both MPI_SHARED_HOME and MPI_SSH_HOST_BASED_AUTH to be set to "no" in the site-info.def file, so that mpiexec.filedist is selected as the most suitable mpi-start plugin for such a configuration.
Testing
For testing one can use information at the following links:
[1] http://wiki.egee-see.org/index.php/Testing_MPI_support
[2] http://www.grid.ie/mpi/wiki/YaimConfig
[3] http://www.grid.ie/mpi/wiki/JobSubmission
[4] https://twiki.cern.ch/twiki/bin/view/EGEE/MpiTools
[5] http://egee-uig.web.cern.ch/egee-uig/production_pages/MPIJobs.html
As written in [4], in the "Submission of MPI Jobs" part, a wrapper script that sets the environment variables for the user's job is needed in order to invoke MPI-START. This script is generic and should not need significant modifications.
#!/bin/bash
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2
# Convert flavor to lowercase for passing to mpi-start.
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
# Ensure the prefix is correctly set. Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX
# Touch the executable. It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch $MY_EXECUTABLE
# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh
#export MPI_START_SHARED_FS=1
# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1
#export I2G_MPI_START_TRACE=1
echo "Start: $I2G_MPI_START"
# Invoke mpi-start.
$I2G_MPI_START
The "JobType" attribute in the JDL file needs to be set to "Normal" in order to run MPI jobs via the MPI-START scripts. "CPUNumber" needs to correspond to the number of desired nodes. The "Executable" attribute has to point to the wrapper script (mpi-start-wrapper.sh in this case). "Arguments" are the MPI binary and the MPI flavour that it uses. MPI-START allows user-defined extensions via hooks (check the MPI-START Hook CookBook for examples). Here is an example JDL for the submission of the "Hello, world" application using 3 processes:
JobType = "Normal";
CPUNumber = 3;
Executable = "mpi-start-wrapper.sh";
Arguments = "mpi-test MPICH";
StdOutput = "mpi-test.out";
StdError = "mpi-test.err";
InputSandbox = {"mpi-start-wrapper.sh","mpi-hooks.sh","mpi-test.c"};
OutputSandbox = {"mpi-test.err","mpi-test.out"};
mpi-hooks.sh and mpi-test.c files can be taken from [5] or found below.
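When the same test is repeated for several MPI flavours, the JDL can be generated instead of edited by hand. This is only a sketch: the function name make_mpi_jdl is hypothetical, and it assumes the source file is named after the binary with a .c suffix, as in the example JDL.

```shell
#!/bin/bash
# Hypothetical helper: print a JDL like the example one for a given MPI
# binary name, flavour and CPU count. File names follow the example's pattern.
make_mpi_jdl() {
    local app="$1" flavor="$2" cpus="$3"
    cat <<EOF
JobType = "Normal";
CPUNumber = $cpus;
Executable = "mpi-start-wrapper.sh";
Arguments = "$app $flavor";
StdOutput = "$app.out";
StdError = "$app.err";
InputSandbox = {"mpi-start-wrapper.sh","mpi-hooks.sh","$app.c"};
OutputSandbox = {"$app.err","$app.out"};
EOF
}

# Same job as the example, but for the OPENMPI flavour and 4 CPUs.
make_mpi_jdl mpi-test OPENMPI 4
```

The flavour string is what the wrapper uses to build the MPI_<FLAVOUR>_PATH variable name, so it has to match a flavour enabled in site-info.def.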
mpi-hooks.sh:
#
# This function will be called before the MPI executable is started.
# You can, for example, compile the executable itself.
#
pre_run_hook () {
# Compile the program.
echo "Compiling ${I2G_MPI_APPLICATION}"
echo "OPTS=${MPI_MPICC_OPTS}"
echo "PROG=${I2G_MPI_APPLICATION}.c"
# Actually compile the program.
cmd="mpicc ${MPI_MPICC_OPTS} -o ${I2G_MPI_APPLICATION} ${I2G_MPI_APPLICATION}.c"
echo $cmd
$cmd
if [ ! $? -eq 0 ]; then
echo "Error compiling program. Exiting..."
exit 1
fi
# Everything's OK.
echo "Successfully compiled ${I2G_MPI_APPLICATION}"
return 0
}
#
# This function will be called after the MPI executable has finished.
# A typical case for this is to upload the results to a storage element.
#
post_run_hook () {
echo "Executing post hook."
echo "Finished the post hook."
return 0
}
mpi-test.sh has to be executable as well as mpi-start-wrapper.sh.
mpi-test.c:
/* hello.c
*
* Simple "Hello World" program in MPI.
*
*/
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
int numprocs; /* Number of processors */
int procnum; /* Processor number */
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
double startwtime = 0.0, endwtime;
/* Initialize MPI */
MPI_Init(&argc, &argv);
/* Find this processor number */
MPI_Comm_rank(MPI_COMM_WORLD, &procnum);
/* Find the number of processors */
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Get_processor_name(processor_name,&namelen);
/* printf("Process %d on %s\n", procnum, processor_name); */
printf ("Hello world! from processor %d (name=%s ) out of %d\n", procnum, processor_name, numprocs);
startwtime = MPI_Wtime();
endwtime = MPI_Wtime();
printf("wall clock time = %f\n",
endwtime-startwtime);
/* Shut down MPI */
MPI_Finalize();
return 0;
}
For extra debugging info the following variables need to be set in the mpi-start-wrapper.sh file:
export I2G_MPI_START_VERBOSE=1
export I2G_MPI_START_DEBUG=1
export I2G_MPI_START_TRACE=1
If the mpi-start scripts fail to detect the shared directory properly (this may happen if gLite services run on virtual machines and the directory is shared in an irregular way, e.g. it is mounted directly from the host and looks like a local file system inside the VM), one can set the environment variable MPI_START_SHARED_FS to "1" in the wrapper script to skip the detection.
To run an MPI job on a gLite MPI cluster with host-based authentication enabled, the "--lrms pbs" option has to be specified when invoking glite-wms-job-submit, whereas for a normal (i.e. non-parallel) job it can be omitted:
[UI]$ glite-wms-job-submit --lrms pbs -r vps123.jinr.ru:2119/jobmanager-lcgpbs-edu -a mpi-test-mpich.jdl
The job may end with status like below
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://vps103.jinr.ru:9000/ZU_mX3ip6hlRO5y_OTmKHA
Current Status: Done (Exit Code !=0)
Exit code: 127
Status Reason: Warning: job exit code != 0
Destination: vps117.jinr.ru:2119/jobmanager-pbs-edu
Submitted: Wed Apr 1 13:18:00 2009 MSD
*************************************************************
and output files contain something like
test-mpi.out:
Modified mpirun: Executing command: test-mpi.sh test-mpi
test-mpi.err:
/opt/glite/bin/mpirun: line 40: test-mpi.sh: command not found
It happens because /opt/glite/bin/mpirun assumes that '.' is in the $PATH but it's not always true.
One possibility is to manually change the following line at the end of the file /opt/glite/bin/mpirun:
$@
to
`pwd`/$@
There is a corresponding bug #52560.
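The effect of the missing '.' in $PATH, and of prefixing the command with the current directory, can be reproduced in a scratch directory without touching /opt/glite/bin/mpirun. The sketch below only mimics the last line of that script; the function names run_plain and run_prefix are made up for the demonstration.

```shell
#!/bin/bash
# Demonstration in a scratch directory; nothing under /opt/glite is touched.
# The script name mirrors the one in the error message.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '#!/bin/sh\necho "ran with: $*"\n' > test-mpi.sh
chmod +x test-mpi.sh

# Same shape as the original last line: run the arguments as a command.
run_plain()  { "$@"; }
# Variant with the current directory prepended to the command name.
run_prefix() { "$(pwd)/$@"; }

# With a PATH that lacks ".", the plain call cannot find the script ...
( PATH=/usr/bin:/bin; run_plain test-mpi.sh test-mpi ) || echo "plain call failed"
# ... while the prefixed call runs it by absolute path.
( PATH=/usr/bin:/bin; run_prefix test-mpi.sh test-mpi )
```

The first call fails with "command not found", as in test-mpi.err, and the second one runs the script.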
If openmpi job ends with message in mpi-test.err like below
mpi-hooks.sh: line 16: mpicc: command not found
then make sure that mpicc is present in $MPI_OPENMPI_PATH/bin/. If the OpenMPI packages are installed from the sl4-base repository, then openmpi-devel-1.2.5-5.el4 needs to be installed in addition to openmpi-1.2.5-5.el4 so that user jobs can be compiled on the WNs.
To check whether the source compiles successfully on a WN, one possibility is to run the following command on the WN under a pool user account:
[WN]$ /usr/lib/openmpi/1.2.5-gcc/bin/mpicc -o mpi-test mpi-test.c
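Before trying a compilation it can be worth confirming that mpicc is where the hook expects it. A minimal sketch; the function name check_mpicc is hypothetical, and the fallback path is the MPI_OPENMPI_PATH value used in the site-info.def examples on this page.

```shell
#!/bin/bash
# Hypothetical helper: confirm that an executable mpicc exists under the
# bin/ directory of an MPI installation prefix.
check_mpicc() {
    local prefix="${1%/}"             # tolerate a trailing slash
    if [ -x "$prefix/bin/mpicc" ]; then
        echo "found $prefix/bin/mpicc"
    else
        echo "no mpicc under $prefix/bin" >&2
        return 1
    fi
}

# Uses MPI_OPENMPI_PATH if it is set in the environment, otherwise the
# example path from site-info.def.
check_mpicc "${MPI_OPENMPI_PATH:-/usr/lib/openmpi/1.2.5-gcc}" || true
```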
If openmpi job ends with
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host vps126.jinr.ru was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,1]: OpenIB on host vps126.jinr.ru was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,2]: OpenIB on host vps126.jinr.ru was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
in mpi-test.err, then try installing on the WNs openmpi packages compiled for your OS, basearch, etc. without InfiniBand support.