

Chapter 13. Using LSF MultiCluster


What is LSF MultiCluster?

Within a company or organization, each division, department, or site may have a separately managed LSF cluster. Many organizations want these clusters to cooperate so that they can reap the benefits of global load sharing.

LSF MultiCluster enables a large organization to form multiple cooperating clusters of computers, so that load sharing happens not only within each cluster but also among them. It enables load sharing across large numbers of hosts, allows resource ownership and autonomy to be enforced, supports non-shared user accounts and file systems, and takes communication limitations among the clusters into account in job scheduling.

Getting Remote Cluster Information

The commands lshosts, lsload, and lsmon accept a cluster name as an argument, allowing you to view information about a remote cluster. A list of clusters and their associated information can be viewed with the lsclusters command.

% lsclusters 
CLUSTER_NAME   STATUS   MASTER_HOST               ADMIN    HOSTS  SERVERS 
clus1           ok       hostC                    user1       3        3 
clus2           ok       hostA                    user1       3        3 
% lshosts
HOST_NAME      type    model cpuf ncpus maxmem maxswp server RESOURCES
hostA         NTX86  PENT200 10.0     -      -      -    Yes (NT)
hostF         HPPA     HP735 14.0     1    58M    94M    Yes (hpux cserver)
hostB         SUN41 SPARCSLC  3.0     1    15M    29M    Yes (sparc bsd)
hostD         HPPA     HP735 14.0     1   463M   812M    Yes (hpux cserver)
hostE          SGI      R10K 16.0    16   896M  1692M    Yes (irix cserver)
hostC        SUNSOL SunSparc 12.0     1    56M    75M    Yes (solaris cserver)
% lshosts clus1
HOST_NAME      type    model cpuf ncpus maxmem maxswp server RESOURCES
hostD         HPPA     HP735 14.0     1   463M   812M    Yes (hpux cserver)
hostE          SGI      R10K 16.0    16   896M  1692M    Yes (irix cserver)
hostC        SUNSOL SunSparc 12.0     1    56M    75M    Yes (solaris cserver)
% lshosts clus2
HOST_NAME      type    model cpuf ncpus maxmem maxswp server RESOURCES
hostA         NTX86  PENT200 10.0     -      -      -    Yes (NT)
hostF         HPPA     HP735 14.0     1    58M    94M    Yes (hpux cserver)
hostB         SUN41 SPARCSLC  3.0     1    15M    29M    Yes (sparc bsd)
% lsload clus1 clus2
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
hostD               ok   0.2   0.3   0.4  19%   6.0   6     3  146M  319M  252M
hostC               ok   0.1   0.0   0.1   1%   0.0   3    43   63M   44M   27M
hostA               ok   0.3   0.3   0.4  35%   0.0   3     1   40M   42M   13M
hostB             busy  *1.3   1.1   0.7  68% *57.5   2     4   18M   20M    8M
hostE            lockU   1.2   2.2   2.6  30%   5.2  35     0   10M  693M  399M
hostF           unavail
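
The lsmon command runs as a full-screen display that updates periodically, so its output is not reproduced here. To monitor the load on hosts of a remote cluster, give the cluster name as the argument, as in this hypothetical invocation against clus2:

% lsmon clus2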

Running Batch Jobs across Clusters

A queue may be configured to send LSF Batch jobs to a queue in a remote cluster (see 'LSF Batch Configuration' in the LSF Administrator's Guide). When you submit a job to that local queue, it is automatically sent to the remote cluster.

The bclusters command displays a list of local queues together with their relationships to queues in remote clusters.

% bclusters
LOCAL_QUEUE     JOB_FLOW   REMOTE     CLUSTER    STATUS
testmc          send       testmc      clus2      ok
testmc          recv         -         clus2      ok

The meanings of the displayed fields are:

LOCAL_QUEUE
The name of a local queue that either receives jobs from queues in remote clusters or forwards jobs to queues in remote clusters.
JOB_FLOW
The value can be either send or recv. If the value is send, then this line describes a job flow from the local queue to a queue in a remote cluster. If the value is recv, then this line describes a job flow from a remote cluster to the local queue.
REMOTE
The name of the queue in a remote cluster to which the local queue can send jobs. This field is always '-' if the JOB_FLOW field is recv.
CLUSTER
Remote cluster name.
STATUS
The connection status between the local queue and the remote queue. If the JOB_FLOW field is send, the possible values for the STATUS field are 'ok', 'reject', and 'disc'; otherwise the possible values are 'ok' and 'disc'. A status of 'ok' indicates that both queues agree on the job flow. A status of 'disc' means that communication between the local and remote clusters has not been established yet, either because no jobs have needed to be forwarded to the remote cluster so far, or because the mbatchd daemons of the two clusters have not been able to get in touch with each other. The status is 'reject' if the job flow is send and the queue in the remote cluster is not configured to receive jobs from the local queue.

In the above example, the local queue testmc can forward jobs to the testmc queue of the remote cluster clus2, and can also receive jobs from it.
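
For contrast, here is a hypothetical bclusters display, reusing the queue and cluster names above for illustration only, in which the remote queue is not configured to accept forwarded jobs and the receiving connection has not yet been established:

% bclusters
LOCAL_QUEUE     JOB_FLOW   REMOTE     CLUSTER    STATUS
testmc          send       testmc      clus2      reject
testmc          recv         -         clus2      disc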

If there is no queue in your cluster that is configured for remote clusters, you will see the following:

% bclusters
No local queue sending/receiving jobs from remote clusters

Use the -m option of the bqueues command with a cluster name to display the queues of a remote cluster.

% bqueues -m clus2
QUEUE_NAME     PRIO      STATUS      MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
fair          3300    Open:Active      5    -    -    -     1     1     0     0
interactive   1055    Open:Active      -    -    -    -     1     0     1     0
testmc          55    Open:Active      -    -    -    -     5     2     2     1
priority        43    Open:Active      -    -    -    -     0     0     0     0

Use the bsub command to submit your job to the queue that sends jobs to the remote cluster.

% bsub -q testmc -J mcjob myjob
Job <101> is submitted to queue <testmc>.

The bjobs command displays the cluster name in the FROM_HOST and EXEC_HOST fields. The format of these fields is 'host@cluster', indicating the cluster from which the job originated or to which it was forwarded. To query the jobs running in another cluster, use the -m option and specify a cluster name.

% bjobs
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
101   user7    RUN   testmc     hostC       hostA@clus2   mcjob    Oct 19 19:41
% bjobs -m clus2
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
522   user7    RUN  testmc      hostC@clus2  hostA        mcjob    Oct 19 23:09

Note that the submission time shown by the remote cluster is the time when the job was forwarded to that cluster, not the time of the original submission.

To view the hosts of another cluster, use a cluster name in place of a host name as the argument to the bhosts command.

% bhosts clus2
HOST_NAME          STATUS    JL/U  MAX  NJOBS  RUN  SSUSP USUSP  RSV
hostA              ok          -    10     1     1     0     0     0
hostB              ok          -    10     2     1     0     0     1
hostF              unavail     -     3     1     1     0     0     0

Run the bhist command to see the history of your job, including information about forwarding to another cluster.

% bhist -l 101

Job Id <101>, Job Name <mcjob>, User <user7>, Project <default>, Command 
                     <myjob>
Sat Oct 19 19:41:14: Submitted from host <hostC> to Queue <testmc>, CWD <$HOME>
Sat Oct 19 21:18:40: Parameters are modified to: Project <test>, Queue <testmc>,
                     Job Name <mcjob>;
Sat Oct 19 23:09:26: Forwarded job to cluster clus2;
Sat Oct 19 23:09:26: Dispatched to <hostA>;
Sat Oct 19 23:09:40: Running with execution home </home/user7>, Execution CWD <
                     /home/user7>, Execution Pid <4873>;
Sun Oct 20 07:02:53: Done successfully. The CPU time used is 12981.4 seconds;

Summary of time in seconds spent in various states by Sun Oct 20 07:02:53 1996
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5846      0      28399       0         0       0      34245

Running Interactive Jobs on Remote Clusters

The lsrun command allows you to specify a cluster name instead of a host name. When a cluster name is specified, a host is selected from that cluster. The resource requirement 'type==any' used below relaxes lsrun's default of selecting only hosts of the same type as the local host, so any host type in the remote cluster is eligible. For example:

% lsrun -m clus2 -R type==any hostname
hostA

The -m option to the lslogin command can also be given a cluster name. This allows you to log in to the best host in a remote cluster.

% lslogin -v -m clus2
<<Remote login to hostF>>

The multicluster environment can be configured so that one cluster accepts interactive jobs from the other cluster, but not vice versa. See 'Running Interactive Jobs on Remote Clusters' in the LSF Administrator's Guide. If the remote cluster will not accept jobs from your cluster, you will get an error:

% lsrun -m clus2 -R type==any hostname
ls_placeofhosts: Not enough host(s) currently eligible

User-Level Account Mapping between Clusters

By default, LSF assumes a uniform user name space within a cluster and between clusters. Many organizations do not satisfy this assumption. For the execution of batch jobs, LSF MultiCluster supports non-uniform user name spaces between clusters. The .lsfhosts file used to support account mapping accepts cluster names in place of host names.

For example, suppose you have accounts on two clusters, clus1 and clus2. In clus1 your user name is 'user1', and in clus2 your user name is 'ruser_1'. To run your jobs in either cluster under the appropriate user name, set up your .lsfhosts files as follows:

On machines in cluster clus1:

% cat ~user1/.lsfhosts
clus2 ruser_1

On machines in cluster clus2:

% cat ~ruser_1/.lsfhosts
clus1 user1

For another example, suppose you have the account 'user1' on cluster clus1 and want to use the 'lsfguest' account when sending jobs to be run on cluster clus2. The .lsfhosts files should be set up as follows:

On machines in cluster clus1:

% cat ~user1/.lsfhosts
clus2 lsfguest send

On machines in cluster clus2:

% cat ~lsfguest/.lsfhosts
clus1 user1 recv
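
With these files in place, a job submitted by user1 in clus1 to a queue that forwards jobs to clus2 runs there under the lsfguest account. A hypothetical submission, reusing the testmc queue from the earlier examples:

% bsub -q testmc myjob
Job <102> is submitted to queue <testmc>.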

The other features of the .lsfhosts file also work in the multicluster environment. See 'User Controlled Account Mapping' for further details. Also see 'Account Mapping Between Clusters' in the LSF Administrator's Guide.

