1. Introduction

Over the past decades, High-Performance Computing has grown from infancy to adulthood. Compute power has increased drastically through the use of faster processors, and the cost per floating point operation per second (FLOPS) has dropped tremendously. Initially, supercomputers contained larger amounts of memory and faster processors than normal desktop computers. With the increasing speed of network connections, i.e. the interconnects between computers, a new type of supercomputer saw the light: the Beowulf cluster. Such a cluster consists of a number of regular PCs connected via a fast network. Although the total FLOPS and the total memory size of a cluster are as large as, or even larger than, those of a traditional supercomputer, writing programs for a cluster is quite different from writing them for a traditional supercomputer: the memory and the compute power are distributed over several PCs. Therefore, a mechanism for communication between separate PCs was devised.

The most commonly used communication mechanism today is the Message Passing Interface, or MPI. A fundamental drawback of MPI is that the source code of a program needs to be modified to make use of the parallelism. Furthermore, making programs run in parallel efficiently demands a thorough understanding of the algorithms by the programmer. Still, MPI has gained ground over the past years and even found its way back to the large shared memory systems with multiple CPUs, where it also could be used because of its generality: for MPI it doesn't matter whether the processors with their respective memory chunks are physically located in one machine or are distributed over several machines. Nowadays, it is very common, even for laptop computers, to have more than one CPU, or core. This means that even desktop applications are being modified to make use of multiple CPUs.

Many participating sites in EGEE are offering (part of) their compute power to the High-Energy Physics (HEP) community. The machines at these sites are, almost without exception, Beowulf clusters. Many people from the HEP community use serial programs, since many HEP simulations consist of large numbers of serial runs with slightly different input parameters for each run.

Although EGEE initially focused on the HEP community, other communities joined later, such as Chemistry, Biology and Medical Imaging. In these communities the use of parallel programs is quite common, and therefore the need for MPI on the EGEE Grid increased.

1.1. Summary of goals and achievements of the previous MPI WG

The purpose of the previous MPI WG, chaired by Stephen Childs, was to investigate and improve the support for parallel jobs within the EGEE middleware, with particular reference to the widely-used Message Passing Interface (MPI) standard. While their primary focus was on MPI, their findings should also be relevant to other methods for submitting parallel jobs.

The previous MPI WG recommended that sites supporting MPI should be configured to publish the particular flavors and versions of MPI they support in their information system. These sites should also make environment variables available to jobs to allow MPI libraries and tools to be located. Furthermore, the MPI-start package produced by the int.eu.grid project should be installed to hide some of the complexity of the MPI setup from users.

Besides, the previous MPI WG recommended minor modifications to the middleware to allow multiple CPUs to be requested for normal jobs. The MPICH job type should be deprecated as it hard-codes assumptions about site setup, and does not meet the needs of either site admins or users. Additionally, the previous MPI WG recommended that users should submit MPI jobs using wrapper scripts that set up their desired environment correctly. Templates were to be provided which users could customize as needed.

The MPI WG produced a setup procedure that site admins can follow to set up their site for use with MPI. Via this procedure several MPI flavors are installed and the site information system is updated to advertise the available MPI flavors to the users. The MPI-start package is also part of this procedure. Minor modifications have been made to the middleware: multiple CPUs can now be requested in normal jobs. The MPICH job type is still available, although it is deprecated.

1.2. Scope

Still, MPI-enabled clusters are scarce. To investigate why so few sites have installed MPI, the current MPI WG has set up two questionnaires: one for (potential) Grid users and one for system administrators.

Like the previous WG, the current WG will recommend a method for deploying MPI support that will work for both users and site administrators. The element added in this working group is the question of how to get all allocated cores on the same physical machine, or packed into as few physical machines as possible. It should be understood to what extent this is a preference and to what extent it is a strict requirement.

1.3. Document layout

The document contains three main parts. The next (second) section works out the results of the questionnaires, which give a good view of the reasons why MPI is not widely used or installed. The third section contains the recommended modifications for all services that jobs pass through on the Grid. The last part contains recommendations based upon experiences with the installation of the current MPI implementation, upon e-mails from users and site admins received by the MPI WG during the past few months, and upon discussions within the MPI WG itself.

2. Survey (questionnaire/poll) - on usage/installation

The user survey was distributed via e-mail. Additionally, Kostas Koumantaros and Marios Chatziangelou set up a web form version of this survey, from which a large portion of the results were obtained. For the system administrator survey, a single website at http://examine.vu.nl, set up by Pieter van Beek, was used. The questions contained in both questionnaires can be found in the appendix of this document.

2.1. Users

The results are based on 56 completed user questionnaires. The 56 respondents were 17 physicists, 9 chemists, 6 biologists, 4 academic medical researchers, 3 astronomers, 3 geologists and 14 others, ranging from bioinformaticians to seismologists.

When asked whether their research involves computational problems that lend themselves to parallel processing, only 4 people answered "No". This is in line with expectations, since people who are interested in parallel computing are more likely to take the time to fill in the survey than those who are not. Of the 52 people for whom parallel processing is helpful, or even essential, 14 use the batch system to obtain parallelism (large numbers of independent jobs). The applications of 28 people rely on MPI (versions 1 and 2), 5 on OpenMP and a few on a mix of MPI and OpenMP.

When asked whether MPI should be used within a cluster or between clusters, 13 of these 28 people indicated an interest in MPI between clusters, while for 15 MPI within a cluster suffices.

The Grid is only being used by roughly half the people for parallel jobs. The other half uses the facilities at their home institution. On the question why these people are not using the Grid for their parallel tasks, the large majority answer that they can't port their application because using the Grid is too complicated or that their needs are too special for the Grid.

At the end of the survey, the user was asked for suggestions. This selection contains the following useful suggestions:

2.2. System administrators

The results are based on 86 completed administrator surveys. The first thing that becomes clear from the results is that the 19 system administrators who were not asked by users to install MPI did not do so. Another 14 were asked, but did not install it. Their remarks indicate that they are waiting, either for more requests or for a more generic installation method (via the default repositories). Others remark that they are waiting for the outcome of the MPI WG before installing MPI.

Of the 53 system administrators who installed MPI, 26 found the installation easy or straightforward. The other 27 found it difficult or extremely hard. 21 system administrators made their own modifications to the installation or wrapper scripts, while 32 did not.

Regarding the special MPI SAM test, only 6 system administrators out of the 53 are running it.

Only 6 system administrators indicate that their MPI implementation is not being used, despite their efforts to install it.

Like the users, the system administrators were asked for suggestions. The most useful ones are itemized below:

3. Implications on the middleware to enable the running of MPI jobs on multiple cores within a node

Running MPI jobs on multiple cores requires additional information to be forwarded to the Local Resource Manager of the CE in order to customize the execution environment and the resource allocation accordingly. Propagating this new information requires some changes in current behavior, which would affect some gLite middleware components.

3.1. JDL

A new attribute, say SMPGranularity, should be introduced to allow users to specify how the cores can be distributed for the allocation. The SMPGranularity value determines the number of cores that any host involved in the allocation has to dedicate to the application. The SMPGranularity value should also be inserted in the requirements expression of the submitted JDL, so that the WMS matches only resources with the required number of cores during matchmaking:

other.GlueHostArchitectureSMPSize >= SMPGranularity;

For multi-threaded applications, instead of letting users figure out how to restrict their programs to single cores, it is more convenient to give these users the ability to reserve whole nodes. For this purpose it would be useful to have a boolean attribute in the JDL:

WholeNodes = True;

In this case the WholeNodes value as well as the GlueHostArchitectureSMPSize value of the selected resource would have to be propagated to the LRM in order to properly configure the environment.

Finally, by using both SMPGranularity and WholeNodes the user would be able to claim complete nodes while giving a hint on the minimum number of cores the eligible resource has to supply.

For example, setting

WholeNodes = True; SMPGranularity = 8;

would lead to other.GlueHostArchitectureSMPSize >= SMPGranularity being added to the requirements expression of the JDL (requesting nodes with at least the given number of cores), and to WholeNodes and GlueHostArchitectureSMPSize being propagated to properly configure the environment at the LRMS side.
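
As an illustration, the following shell sketch prints a JDL description that combines the two proposed attributes. SMPGranularity and WholeNodes are the attributes proposed in this section and are not understood by the current middleware; the helper name make_smp_jdl and the executable name my-mpi-app.sh are hypothetical, and the requirements expression is written out by hand here.

```shell
#!/bin/bash
# Sketch only: print a JDL description that uses the attributes proposed above.
# SMPGranularity and WholeNodes are proposals from this section, not existing
# JDL attributes; make_smp_jdl and my-mpi-app.sh are hypothetical names.
make_smp_jdl() {
  local cpus="$1" granularity="$2" wholenodes="$3"
  printf 'JobType        = "Normal";\n'
  printf 'CPUNumber      = %s;\n' "$cpus"
  printf 'SMPGranularity = %s;\n' "$granularity"
  printf 'WholeNodes     = %s;\n' "$wholenodes"
  printf 'Executable     = "my-mpi-app.sh";\n'
  # Only hosts with at least SMPGranularity cores are eligible.
  printf 'Requirements   = other.GlueHostArchitectureSMPSize >= %s;\n' "$granularity"
}

# 16 cores in total, 8 per host, whole nodes reserved.
make_smp_jdl 16 8 True
```

Redirecting the output to a file gives a description that could be submitted once the attributes are supported.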

3.2. WMS

To exploit the SMPGranularity value effectively as an allocation hint for the concentration of cores, some information should be propagated to the Local Resource Manager. As an example, in LSF each MPI queue is by default configured with an empty span resource (span[]), which allows jobs to be scheduled on any machine that has a free processor, based on normal LSF load considerations. Users who wish to specify a spanning option to suit their own parallel batch job may pass -R "span[ptile=<value>]" to the bsub command, where <value> is the number of processors to use on each host.
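
The mapping from a granularity value to bsub options can be sketched in shell as follows. The helper name lsf_span_options is hypothetical; only the -n and -R "span[ptile=...]" options themselves come from LSF.

```shell
#!/bin/bash
# Sketch only: build the bsub options for a job of $1 slots with at most
# $2 slots per host. lsf_span_options is a hypothetical helper name.
lsf_span_options() {
  local slots="$1" ptile="$2"
  if [ -n "$ptile" ]; then
    printf -- '-n %s -R "span[ptile=%s]"\n' "$slots" "$ptile"
  else
    # No granularity given: leave placement to the default LSF policy.
    printf -- '-n %s\n' "$slots"
  fi
}

lsf_span_options 16 8   # prints: -n 16 -R "span[ptile=8]"
```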

The configuration file of the WM could be modified in order to accept a construct describing what should be propagated and how, that is, the name of the variable as seen in the environment at the LRMS side.

PropagateToLRMS= {
  [ name = "smpgranularity"; value = jdl.SMPGranularity ],
  [ name = "wholenodes"; value = jdl.WholeNodes ],
  [ name = "hostsmpsize"; value = ce.GlueHostArchitectureSMPSize ]
};

The value attribute of each entry should be evaluated in a classad context that maps the pseudo attributes jdl. and ce. to the classads representing the JDL and the CE selected for job submission, respectively.

The mechanism required to propagate this information to the LRM for the actual allocation depends on the kind of the computing element considered.

3.2.1. lcgCE

LCG CE represents the native computing resource access service with Globus Gatekeeper.

The WMS sends the job wrapper to the CE using the Globus GRAM component, which uses RSL (Resource Specification Language) to describe the application together with some information for its execution (arguments, environment and so on). Since RSL does not allow additional attributes to be included in the description, the SMPGranularity information could be pushed into the RSL within the arguments attribute as a semicolon-separated list of “name:value” pairs, like:

arguments = "SMPGranularity:<value>;<Attribute>:<value>;...."
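
How a receiver might unpack such a list can be sketched in shell. The sketch assumes the semicolon-separated “name:value” format described above; the function name parse_lrms_hints and the lrms_ variable prefix are hypothetical and not part of the job manager code.

```shell
#!/bin/bash
# Sketch only: split "name:value;name:value" into shell variables named
# lrms_<name> (lower case). parse_lrms_hints and the lrms_ prefix are
# hypothetical; they are not part of the existing job manager scripts.
parse_lrms_hints() {
  local rest="$1" pair name value
  while [ -n "$rest" ]; do
    pair="${rest%%;*}"                       # text before the first ';'
    if [ "$pair" = "$rest" ]; then rest=""; else rest="${rest#*;}"; fi
    case "$pair" in *:*) ;; *) continue ;; esac   # skip entries without ':'
    name="${pair%%:*}"
    value="${pair#*:}"
    # Accept only plain identifiers so the input cannot inject shell code.
    case "$name" in ''|*[!A-Za-z0-9_]*) continue ;; esac
    name="$(printf '%s' "$name" | tr 'A-Z' 'a-z')"
    printf -v "lrms_${name}" '%s' "$value"
  done
}

parse_lrms_hints "SMPGranularity:8;WholeNodes:true"
echo "$lrms_smpgranularity $lrms_wholenodes"   # prints: 8 true
```

Lower-casing the names keeps them in line with the variable names used in the PropagateToLRMS construct.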

This can be achieved by simply modifying the JobAdapter component of the gLite WMS.

With LSF as the underlying resource manager at the CE side, the file lcglsf.pm is responsible for the actual job allocation and should be modified to read the forwarded SMPGranularity value and emit the relevant #BSUB directive (-R "span[ptile=<SMPGranularity>]").

The tentative diff with the modifications required to lcglsf.in, with respect to revision v1.6.1, follows:

@@ -511,7 +511,40 @@
 #
 # We don't currently impliment starting job threads on secondary (LSB_HOSTS) hosts.
 #
-#    $lsf_job_script->print("#BSUB -n " . $description->count() . "\n");
+
+     if (defined $description->arguments()) { 
+
+       $arg=$description->arguments();
+       $arg =~ s/\'//g;
+       foreach (split(/,/, $arg)) {
+         @l=split(/=/, $_);
+         $v{$l[0]}=$l[1];
+       } 
+
+       if (defined $v{wholenodes} && $v{wholenodes}) {
+         
+         if (defined $v{smpgranularity}) {
+
+           $lsf_job_script->print("#BSUB -n " . $v{smpgranularity} . "\n");
+           $lsf_job_script->print("#BSUB -R \"span[ptile=" . $v{smpgranularity} . "]\"\n");
+         }
+         elsif (defined $v{hostsmpsize})  {
+
+           $lsf_job_script->print("#BSUB -n " . $v{hostsmpsize} . "\n");
+           $lsf_job_script->print("#BSUB -R \"span[ptile=" . $v{hostsmpsize} . "]\"\n");
+        }
+       }
+       else {
+         $lsf_job_script->print("#BSUB -n " . $description->count() . "\n");
+       
+         if (defined $v{smpgranularity}) {
+             $lsf_job_script->print("#BSUB -R \"span[ptile=" . $v{smpgranularity} . "]\"\n");
+         }
+       }
+    } 
+    else {
+      $lsf_job_script->print("#BSUB -n " . $description->count() . "\n");
+    }
 
     chomp(my $my_hostname = `hostname -f`);
     mkdir '.lcgjm', 0700;

The tentative diff with the modifications required to lcgpbs.in, with respect to revision v1.4, follows:

@@ -555,18 +555,46 @@
 
     $pbs_job_script->print("#PBS -W stagein=".$gpg_file."@".$my_hostname.":".$cache_export_dir."/".$gpg_file."\n");
 
+    if (defined $description->arguments()) { 
+      $arg=$description->arguments();
+      $arg =~ s/\'//g;
+      foreach (split(/,/, $arg)) {
+        @l=split(/=/, $_);
+        $v{$l[0]}=$l[1];
+      } 
+      if (defined $v{smpgranularity}) {
+        $cpu_per_node = $v{smpgranularity};
+      }
+       elsif (defined $v{wholenodes} && $v{wholenodes} && defined $v{hostsmpsize}) {
+         $cpu_per_node = $v{hostsmpsize};
+       }
+    }
+
+
     if(defined $description->host_count() && $description->host_count() != 0)
     {
+      if($cpu_per_node != 0)
+      {
+        $pbs_job_script->print("#PBS -l nodes=" .
+                               POSIX::ceil($description->host_count() /
+                                          $cpu_per_node) . ":ppn=" . $cpu_per_node .
+                              "\n");
+      }
+      else 
+      {
+
        $pbs_job_script->print("#PBS -l nodes=" .
                               $description->host_count().
                               "\n");
+      }
     }
-    elsif($cluster && $cpu_per_node != 0)
+    elsif(!$cluster && $cpu_per_node != 0)
     {
        $pbs_job_script->print("#PBS -l nodes=" .
-                              POSIX::ceil($description->count /
-                                          $cpu_per_node).
-                               "\n");
+                               (defined $v{wholenodes} && $v{wholenodes} ? 
+                                 1 :(POSIX::ceil($description->count / $cpu_per_node))
+                               ) . ":ppn=" . $cpu_per_node .
+                               "\n");
     }
     else
     {

3.2.2. ice / CREAM

The CREAM (Computing Resource Execution And Management) Service is a simple, lightweight service for job management operation at the CE level.

CREAM accepts job submission requests, which are described with the same JDL language used to describe the jobs submitted to the WMS, and other job management requests (e.g. job cancellation, job monitoring, etc.). CREAM can be used by the WMS; ICE (Interface to CREAM Environment) is the WMS service that deals with the interaction with CREAM-based CEs.

The interface with the underlying Local Resource Manager (LRMS) is implemented via BLAH, which natively supports the handling of generic attributes forwarded within CERequirements. Consequently, in order to support the propagation of the SMPGranularity value and its exploitation at the LRMS side, BLAH should be distributed with customized submit scripts: <jobmanager>_local_submit_attributes.sh.

Here follows the tentative lsf_local_submit_attributes.sh:

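# wholenodes, smpgranularity and hostsmpsize are expected as shell variables.
# The value of wholenodes is compared as a string instead of being executed.
if [ "${wholenodes}" = "true" ] ; then
  # Whole nodes: request one host's worth of slots, all on the same host.
  if [ -n "${smpgranularity}" ] ; then
    echo "#BSUB -n $smpgranularity"
    echo "#BSUB -R \"span[ptile=$smpgranularity]\""
  elif [ -n "${hostsmpsize}" ] ; then
    echo "#BSUB -n $hostsmpsize"
    echo "#BSUB -R \"span[ptile=$hostsmpsize]\""
  fi
elif [ -n "${smpgranularity}" ] ; then
  # Only constrain how many slots are taken on each host.
  echo "#BSUB -R \"span[ptile=$smpgranularity]\""
fi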

Here follows the tentative pbs_local_submit_attributes.sh:

defined() {
  [[ -n "${1}" ]]
}
# Integer ceiling of $1 / $2
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

if [ "${wholenodes}" = "true" ] ; then
  if defined "$smpgranularity" ; then
    echo "#PBS -l nodes=1:ppn=$smpgranularity"
  elif defined "$hostsmpsize" ; then
    echo "#PBS -l nodes=1:ppn=$hostsmpsize"
  fi
elif defined "$bls_opt_mpinodes" && [ "$bls_opt_mpinodes" -gt 0 ] ; then
  if defined "$smpgranularity" ; then
    # Enough nodes of smpgranularity cores each to hold all MPI processes
    echo "#PBS -l nodes=$(ceil_div "$bls_opt_mpinodes" "$smpgranularity"):ppn=$smpgranularity"
  else
    echo "#PBS -l nodes=$bls_opt_mpinodes"
  fi
fi
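
The node count in the PBS case is the number of requested processes divided by the cores per node, rounded up. A small worked sketch of that arithmetic follows; the helper name nodes_needed and the sample values are illustrative only.

```shell
#!/bin/bash
# Sketch only: number of nodes needed for $1 processes at $2 cores per node,
# computed as an integer ceiling division. nodes_needed is a hypothetical name.
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# With 4 cores per node: 8 processes need 2 nodes, 10 need 3, 16 need 4.
for procs in 8 10 16; do
  echo "processes=$procs -> #PBS -l nodes=$(nodes_needed "$procs" 4):ppn=4"
done
```

Note that the allocation can exceed the request: 10 processes at 4 cores per node reserve 12 cores.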

4. Recommendations

In this chapter a number of short term recommendations will be given that should improve the current parallel job support in EGEE, without extensive software development effort. Most recommendations are basically extensions of the recommendations of the previous MPI WG (http://www.grid.ie/mpi/wiki/FrontPage?action=AttachFile&do=get&target=EGEE-II-MPI-WG-TEC.doc).

4.1. General

4.1.1. Jobtypes

The previous report recommends that the MPICH jobtype should be deprecated and that the Normal jobtype should support requesting multiple cores. In the gLite WMS update 3.1.12-0 of 25 February 2009 these changes have been incorporated.

This means that the jobtype Normal can now be used to submit parallel jobs. Special requirements can now be added by hand in the JDL file by the user.

A new problem that appears is how to know that a site supports parallel jobs. For MPI the existing recommendation is that the available implementations should be published in the information system, using GlueHostApplicationSoftwareRunTimeEnvironment. Furthermore a new keyword, e.g. “Parallel”, could be published in the information system.

4.1.2. Information system

The supported and installed MPI implementations have to be published in the information system of the CE. For each supported MPI implementation a corresponding variable in GlueHostApplicationSoftwareRunTimeEnvironment should be published in the information system as <FLAVOUR>-<VERSION>. Also a variable with just <FLAVOUR> should be published for those who don't care about the version.

TODO: make sure we go after all the weird tags currently being published. A task for SA1 maybe in combination with the SAM tests. --DennisVanDok 08.05.2009

Note, however, that non-exact version requirements are hard to implement in the job description.
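
The publishing rule and the matching job requirement can be sketched in shell as follows. The helper names mpi_tags and mpi_requirement and the version number 1.3.2 are illustrative assumptions; the tag format follows the <FLAVOUR>-<VERSION> rule above.

```shell
#!/bin/bash
# Sketch only: derive the two tags to publish for one MPI installation and the
# matching JDL requirement. mpi_tags and mpi_requirement are hypothetical
# helper names; the version number used below is just an example value.
mpi_tags() {
  local flavour="$1" version="$2"
  echo "${flavour}"                 # for users who do not care about the version
  echo "${flavour}-${version}"      # for users who need one exact version
}

mpi_requirement() {
  # Member() is an exact string match, so a range such as "any 1.x" can only
  # be expressed by listing every acceptable tag.
  echo "Requirements = Member(\"$1\", other.GlueHostApplicationSoftwareRunTimeEnvironment);"
}

mpi_tags OPENMPI 1.3.2
mpi_requirement OPENMPI-1.3.2
```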

4.1.3. Environment variables

For each implementation of MPI that is installed on the system environment variables have to be set on the worker nodes. These environment variables point to the root of the MPI library installation, give the library version and optionally the compiler used to build the library.
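
A possible shape for these settings is sketched below. The variable naming scheme MPI_<FLAVOUR>_PATH, MPI_<FLAVOUR>_VERSION and MPI_<FLAVOUR>_COMPILER is an assumption made for this example, as are the helper name mpi_env_exports and the sample version; the installation prefix follows the /opt/mpi/[flavour]-[version] layout recommended in section 4.1.4.

```shell
#!/bin/bash
# Sketch only: print the exports for one MPI installation. The naming scheme
# MPI_<FLAVOUR>_PATH / _VERSION / _COMPILER is an assumption for this example.
# The flavour must be usable inside a variable name (e.g. mpich2, not mpich-2).
mpi_env_exports() {
  local flavour="$1" version="$2" compiler="$3"
  local upper lower
  upper="$(printf '%s' "$flavour" | tr 'a-z' 'A-Z')"
  lower="$(printf '%s' "$flavour" | tr 'A-Z' 'a-z')"
  echo "export MPI_${upper}_PATH=/opt/mpi/${lower}-${version}"
  echo "export MPI_${upper}_VERSION=${version}"
  # The compiler is optional, as described above.
  if [ -n "$compiler" ]; then
    echo "export MPI_${upper}_COMPILER=${compiler}"
  fi
}

mpi_env_exports openmpi 1.3.2 gcc
```

The output could be placed in a profile script on the worker nodes.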

4.1.4. MPI packages

The use of MPI-start from the int.eu.grid project (http://www.hlrs.de/organization/av/amt/research/mpi-start/) has been recommended as a way to start up MPI jobs. At the moment it is unclear to the working group if this package is still maintained.

Currently MPI-start and some other packages are available in the gLite repository. The dependencies are broken, however. The metapackage glite-MPI_utils depends on mpiexec, which in turn depends on an older version of the torque queue manager. Furthermore it was found that MPI-start also depends on mpiexec to be installed. It is used by MPI-start to start up MPICH-1 jobs. Finally, the only available MPI implementation in the repository is the old MPICH-1.

The repository could be greatly improved by fixing the issues mentioned above. It should also provide more MPI implementations. The following changes should be made:

  1. A recompiled version of mpiexec is necessary. If possible, the requirement on a specific torque version should be dropped and changed into a “greater than” version requirement.
  2. Precompiled packages of the most common MPI implementations should be available in the repository. The MPI implementations required are MPICH-1, MPICH-2, OpenMPI and possibly LAM. The latter implementation has been succeeded by OpenMPI, but some users may still require it.
  3. If possible the packages should be compiled with support for Torque, since Torque is the default batch system deployed with gLite. Some installation notes:
    1. All packages should be installed into /opt/mpi/[flavour]-[version]
    2. mpiexec is only available for Torque. It can start up MPICH-1 and MPICH-2 programs.
    3. MPICH-1 and MPICH-2 should be compiled supporting communication through shared memory and TCP/IP.
    4. OpenMPI supports many batch systems and can be compiled with support for Torque directly.
    5. LAM can also be compiled with support for Torque. It does not support the startup routines of other batch systems however.
  4. For the other batch systems a repository may be made as well.
  5. The source rpms of the MPI implementations have to be provided as well, including instructions on how to modify the configuration of the MPI implementations. This allows site administrators to quickly create custom packages for their local environment, supporting for example different compilers, special network interconnects or different batch systems.

4.1.5. SAM tests

In principle the previous paragraphs were just a recapitulation of the recommendations of the previous working group. Still, the question was raised why MPI has not been used and supported more widely.

One of the reasons, as indicated by the results of the survey, is the lack of standard packages, an issue that is resolved by one of the recommendations above. Another issue is that currently there is no operational SAM test for sites that support MPI. Because the quality of MPI installation on sites supporting MPI is not tested this may lead to misconfigured sites. Some things are quite easy to check, however:

  1. Is all the relevant information published in the information system? That is the MPI flavour and version information.
  2. If MPI flavours are published, are the corresponding environment variables available as well?
  3. Do these environment variables point to the correct location? That is, are the expected binaries available at the location pointed to?
  4. Does the published shared home directory really exist?
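
Checks 2 and 3 can be sketched as a small shell function. The MPI_<FLAVOUR>_PATH variable naming and the bin/mpicc location are assumptions made for this example, and check_mpi_flavour is a hypothetical helper, not part of the SAM framework.

```shell
#!/bin/bash
# Sketch only: checks 2 and 3 from the list above for one published flavour.
# The MPI_<FLAVOUR>_PATH naming and the bin/mpicc location are assumptions;
# check_mpi_flavour is a hypothetical helper, not part of the SAM framework.
check_mpi_flavour() {
  local flavour="$1"
  local var="MPI_${flavour}_PATH"
  local prefix="${!var}"     # indirect expansion: value of MPI_<FLAVOUR>_PATH
  if [ -z "$prefix" ]; then
    echo "FAIL: $var is not set"; return 1
  fi
  if [ ! -d "$prefix" ]; then
    echo "FAIL: $var points to missing directory $prefix"; return 1
  fi
  if [ ! -x "$prefix/bin/mpicc" ]; then
    echo "FAIL: no executable mpicc under $prefix/bin"; return 1
  fi
  echo "OK: $flavour found in $prefix"
}

# Example: test the flavour published as OPENMPI on this worker node.
check_mpi_flavour OPENMPI || echo "site check failed"
```

A test job would run such a function once for every flavour the site publishes.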

Of course it is also possible to submit real MPI jobs in the SAM framework. It may be difficult, however, to test all flavours of MPI. Furthermore, the turnaround time for parallel jobs may be too long for effective testing.

4.1.6. Further recommendations

4.1.6.1. Shared file system

A shared file system between WNs is recommended to make program files and data available to all the nodes participating in a parallel job. Furthermore, transferring the output back to the user is much easier when the data is not spread out over a number of nodes.

For programs that do not require a shared file system the use of MPI-start to distribute the program and data should be investigated.

4.1.6.2. Password-less ssh between WNs

For parallel jobs a way to start up tasks on remote WNs is needed. If this cannot be achieved via the batch system password-less ssh is used by most MPI implementations as a way to start up remote tasks. It is therefore currently necessary to allow password-less ssh between WNs in order to support parallel jobs.

For Torque a special PAM module exists that can limit user logins to the nodes where the user has a job running.

4.1.6.3. Scheduling parallel jobs

A new requirement is to allow for other methods of running work in parallel, e.g. using pthreads or OpenMP. For this to work, a job has to be scheduled with all its cores on a single node. Currently a site administrator is free to set the scheduling as he or she sees fit. In principle it is possible to change the scheduling in such a way that all cores for a parallel job are allocated on as few nodes as possible. This has the following advantages:

  1. When the number of cores requested is less than the number of cores in the nodes, shared memory parallelisation can be used as well.
  2. Faster shared memory communication can be used which is beneficial for some jobs.

The main disadvantage is that the scheduling time increases, especially on smaller sites, because it takes longer before enough free cores have become available, when all of these have to be on the same nodes. Furthermore some applications may scale better when using cores on different nodes, because they may also have a memory bandwidth or disk bandwidth bottleneck.

4.1.6.4. CPU time limits

A final problem is that most of the current EGEE sites have CPU time limits that are similar to their wall clock time limits. When parallel jobs are correctly accounted for, the CPU time they use is proportional to the number of cores they use. This means that these jobs will rapidly run into the CPU time limit of the queue. For parallel jobs the wall clock time limit is the only useful limit to enforce. The solution for this problem is not easy, however. Some approaches:

  1. Do not set the CPU time limit. Some applications may have requirements on the CPU time limit, however, and these will not use a site that does not publish a value here.
  2. The limit can be set very high. This may also lead to a problem for jobs that have a requirement on this limit, because the relation with the wall clock time limit has disappeared.
  3. Do not enforce the CPU time limit. This is batch system dependent. For Torque this can be done, but the disadvantage is that CPU time is not accounted for anymore either.

A clear solution has not yet been found.
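
The arithmetic behind this problem can be sketched as follows. The sketch assumes that the batch system sums CPU time over all cores of a job and that every core is fully busy; the helper name wall_hours_until_cpu_limit and the 48 hour limit are illustrative values, not measurements from any site.

```shell
#!/bin/bash
# Sketch only: wall clock hours after which a fully busy parallel job exhausts
# a per-job CPU time limit. Assumes CPU time is summed over all cores of the
# job; wall_hours_until_cpu_limit is a hypothetical helper name.
wall_hours_until_cpu_limit() {
  local cpu_limit_hours="$1" cores="$2"
  echo $(( cpu_limit_hours / cores ))      # integer division, rounds down
}

# A queue with a 48 hour CPU time limit:
for cores in 1 8 32; do
  echo "$cores cores: CPU limit reached after $(wall_hours_until_cpu_limit 48 "$cores") h wall clock time"
done
```

With these example numbers a serial job may run for 48 hours, while an 8-core job is stopped after 6 hours of wall clock time.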

4.1.6.5. YAIM

Once a repository with standard MPI packages is available, creating a standard installation using YAIM is not very difficult. The instructions on the MPI wiki should be tested and updated if necessary.

TODO: a patch for certification in SA3 is expected; to be followed up by DennisVanDok

4.1.6.6. On-demand MPI support

On-demand MPI support means providing facilities that allow users to first build their own MPI environment and then execute the related MPI application, e.g. using special MPI packages. This concept comes from the Cloud Computing paradigm. Since there are many MPI flavours and implementations, this feature widens the range of MPI packages that can be supported. In this scenario, users have to submit the MPI package as well as their MPI application. Users also have to provide scripts for compiling the MPI package and for preparing the environment for launching the MPI start-up processes (e.g. password-less ssh).

4.2. Site admins

4.3. Users

4.4. (Central) MPI Support

Need to establish structural support for MPI beyond the lifespan of the WG. --DennisVanDok 08.05.2009

4.5. Developers

4.6. MPI job accounting

At the suggestion of Alvaro Simon Garcia we should add a recommendation regarding accounting of MPI jobs:

Citing his request:

Hi Jeroen

We have reviewed the document and it is probably necessary to add one more point about accounting. A mpi accounting portal was developed here at cesga (to show number of cpus used by a job, mpi efficiency, etc) for int.eu.grid project and this work could be useful for a future EGEE portal. About apel mpi support I think is already supported (at least for torque and sge batch systems) and not many changes are needed.

Cheers Alvaro

5. Appendix

5.1. Questionnaire (potential) Grid users

1. What is your research area?

a) Physics

b) Astronomy

c) Meteorology

d) Chemistry

e) Biology

f) Academic Medical Research

g) Other

2. Does your research involve computational problems that lend themselves to parallel processing?

a) Yes

b) No

3. If your application is parallel on which model does it depend?

a) Parallel via batch system (large amount of independent tasks)

b) OpenMP

c) MPI-1

d) MPI-2

e) Other

4. If c) or d) on 3., are you using (or interested in using) MPI:

a) within a worker node?

b) within a site?

c) over multiple sites?

5. Can you give an estimate how intense the processing is, in terms of total CPU hours and the period over which these were used?

6. Where have you found resources to run your programs in the past?

a) dedicated cluster at home institute

b) desktop cluster (e.g. using Condor)

c) federation of scientific community

d) commercial provider

e) BOINC or similar

f) other (please specify)

7. Please estimate the size of the previously used resource provider (total CPUs)

8. Are you using the Grid to do multi-core computations, using MPI, OpenMP or similar techniques?

a) Yes

b) No

9. If yes on 8., which Grid?

a) EGEE

b) OSG

c) Naregi

d) Nordugrid

e) DEISA

f) other (please specify)

10. If no on 8., what is the main reason for not using it?

a) there aren't enough resources available for me on the Grid

b) my computational needs are too special for the Grid

c) I can't port my application because it uses legacy code

d) I can't port my application because it's too complicated to use the Grid

e) I've never heard of Grid computing

f) I tried but I gave up

11. If the answer on 10. was a), b), or d), please also answer the following question:

I. did you already ask a system administrator?

a) Yes

b) No

II. Did you ask the Grid sites for support?

a) Yes

b) No

III. Did you seek help in the EGEE community?

a) Yes

b) No

IV. What were the answers you got?

12. If yes on 8., did you have to convince the system operators to install it?

a) Yes

b) No

13. If yes on 8., was it easy to find documentation for running in parallel on the Grid?

a) Yes

b) No

14. If yes on 8., how difficult would you rate doing parallel computations on the Grid, compared to

I. doing parallel computations elsewhere, on a scale from 1 to 10?

II. doing serial processing on the grid, on a scale from 1 to 10?

15. Do you have any suggestions for the MPI Working Group to improve the usability of parallel mechanisms on the Grid?

5.2. Questionnaire System Administrators

1. For which research communities does your site provide services (more than one answer possible)?

a) Physics

b) Astronomy

c) Meteorology

d) Chemistry

e) Biology

f) Academic Medical Research

g) Other (please specify)

2. What is the total number of CPUs in your cluster(s)?

3. What kind of Local Resource Manager are you using?

a) Torque

b) LSF

c) Condor

d) SGE

e) Other (please specify)

4. How many admins run the site?

5. Is your hardware tuned to parallel processing, through low-latency networking and/or special processors?

a) Yes

b) No

6. What type of network is used for inter-node communication?

a) Gigabit

b) Infiniband

c) Myrinet

d) Other (please specify)

7. Are you offering MPI/OpenMP functionality to your users?

a) Yes

b) No

8. If yes on 7., what sort of MPI implementations are you supporting?

a) MPICH-1

b) MPICH-2

c) OpenMPI

d) LAM-MPI

e) MPICH-G2

f) MPIg

g) Others

9. If yes on 7., what fraction of the workload consists of parallel jobs (estimate percentage)?

10. If yes on 7., do you employ special scheduling strategies to accommodate parallel jobs?

a) Yes

b) No

11. If yes on 10., does your site mix parallel and serial work on the same resources?

a) Yes

b) No

12. If yes on 7., what is your opinion on the installation/configuration process? Was it:

a) Easy

b) Straight Forward

c) Difficult

d) Extremely hard

13. If yes on 7., did you use the MPI installation procedure provided by EGEE?

a) Yes

b) No

c) Partly (as guideline)

14. If yes on 7., did you make some modifications, such as wrappers or scripts to make MPI run on your cluster(s)?

a) Yes

b) No

15. If yes on 7., is your current MPI implementation being used?

a) Yes

b) No

16. If yes on 7., are you running (the) special MPI SAM tests?

a) Yes

b) No

17. Did (someone from) the user community ask for MPI or OpenMP support?

a) Yes

b) No

18. Is your site currently participating in cross-site MPI scheduling?

a) Yes

b) No

19. Are there any suggestions you would like to share with the MPI Working Group?

mpi: WorkingGroup/Recommend2009 (last edited 2011-07-12 14:41:40 by localhost)