

Chapter 3. Configuring LSF Cluster


This chapter describes how to set up a simple LSF cluster. With the default configuration, your users can use LSF and run jobs right away. You need to read the remaining chapters only if you want to make use of the full range of features and options in the LSF products.

The topics covered in this chapter are:

Initial Configuration
Setting Up LSF Client Hosts
Checking the LSF Configuration
Testing the LSF Cluster
Configuring LSF MultiCluster
Configuring LSF JobScheduler
Providing LSF to Users
Using xlsadmin


Initial Configuration

The initial configuration uses default settings for most parameters. A few parameters are mandatory: they tell the LSF daemons about the structure of your cluster by defining its hosts, host types, and resources.

The lsfsetup command is used to perform the initial configuration.

Step 1.
Log in as root (or as yourself, if you are creating a private cluster) and change directory to the distribution directory. Run ./lsfsetup, choose option 5, 'Configure LSF Cluster', and confirm or enter the location of your lsf.conf file.
Step 2.
Choose option 4, 'View/Add/Delete Currently Defined Host Types'. Modify the list of host types to include the types in your cluster. A host type can be any alphanumeric string up to 29 characters long. When you are finished, enter 'q', 'Done Modifying Host Types'.
Step 3.
Choose option 2, 'View/Add/Delete/Modify Currently Defined Host Models'. Use the menu to add a host model specification and CPU factor for each different processor in your cluster. For more information on CPU factors, see 'Tuning CPU Factors'. When you are finished, enter 'q', 'Done Modifying Host Models'.
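
For reference, the host types and models you define in Steps 2 and 3 are recorded in the LSF_CONFDIR/lsf.shared file. A minimal fragment might look roughly like the following; the type names, model names, and CPU factors are only illustrative, and the exact column headings can differ between LSF versions:

Begin HostType
TYPENAME
hppa
SUNSOL
NTX86
End HostType

Begin HostModel
MODELNAME   CPUFACTOR
HP735       4.0
PENT200     3.0
End HostModel
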
Step 4.
Choose option 1, 'View/Add/Delete/Modify Currently Configured Hosts'. Choose option 2, 'Add Hosts To LSF Configuration'. The lsfsetup command asks for the host name and then runs the vi editor on the LSF configuration file where the host thresholds are configured. Edit the configuration line for the host and set the host type, model, load thresholds, and resources as desired. For the initial cluster configuration, you can use the same thresholds as the examples in the configuration file. These parameters can be changed later without interrupting LSF service.

You can add more than one host at this time. Enter the host information for all hosts in the cluster. The master LIM and mbatchd daemons run on the first available host in this list, so you should list reliable batch server hosts first. For more information see 'Fault Tolerance'.

If your cluster includes client hosts, enter their host names in this step and set the SERVER field for those hosts to 0 (zero), as in the example entry shown after this step.

When you are done adding hosts, enter 'q', 'Done Modifying Host Configuration'.
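
For illustration only, a Host section with two server hosts and one client host might look roughly like the following. The column layout, threshold columns, and resource names come from the examples already in your configuration file, and the values shown here are not recommendations. The SERVER field is 1 for server hosts and 0 for client hosts:

Begin Host
HOSTNAME   model     type    server  r1m   pg   RESOURCES
hostA      HP735     hppa    1       3.5   15   (cserver)
hostB      ORIGIN2K  sgi     1       3.5   15   (cserver)
hostE      PENT200   NTX86   0       ()    ()   ()
End Host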

Step 5.
Choose option 5, 'Configuration File Error Checking'. This option checks the LSF configuration and reports any problems it finds. If there are any problems, refer to 'Error Log Messages'. Next, enter 'q', 'Quit', twice.
Step 6.
Start the LSF daemons by running LSF_SERVERDIR/lsf_daemons start, and use ps to make sure that lim, res, and sbatchd have started.

CAUTION!
lsf_daemons start must be run as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons; start them manually.

Note
If you prefer to start the LSF daemons on all hosts using the lsadmin and badmin commands, run 'lsadmin limstartup', 'lsadmin resstartup', and 'badmin hstartup' (shown below) instead of performing Step 6 by hand on each host.
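
For example, with an illustrative LSF_SERVERDIR of /usr/local/lsf/etc, the two ways of starting the daemons look like this (lsf_daemons must be run as root on each server host; the lsadmin and badmin commands are run by the LSF administrator):

% /usr/local/lsf/etc/lsf_daemons start

or

% lsadmin limstartup
% lsadmin resstartup
% badmin hstartup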

lsfsetup creates a default LSF Batch configuration, including a set of batch queues. You do not need to change any LSF Batch files to use the default configuration. After you have the system running, you may want to reconfigure LSF Batch. See 'Managing LSF Batch' for a discussion of how to do this.

Setting Up LSF Client Hosts

You need to read this section only if you are configuring some hosts as client-only hosts.

LSF client hosts have transparent access to the resources on LSF server hosts, but do not run the LSF daemons. This means that you can run all LSF commands and tools from a client host, but your submitted jobs will only run on server hosts.

On client hosts, the lsf.conf file must contain the names of some server hosts. LSF applications running on a client host contact the servers named in the lsf.conf file to find the master host.

Client hosts need access to the LSF configuration directory to read the lsf.task files. This can be a private read-only copy of the files; you do not need access to a shared copy on a file server host. If you install local copies of the lsf.task files, you must remember to update them when the files are changed on the server hosts. See 'The lsf.task and lsf.task.cluster Files' for a detailed description of these files.

Client hosts must have access to the LSF user commands in the LSF_BINDIR directory and to the nios program in the LSF_SERVERDIR directory. These can be installed directly on the client host or mounted from a file server.

The client hosts must be configured as described in Step 4 of the section 'Initial Configuration'.

Step 1.
Make a copy of the lsf.conf file from an LSF server host. Edit the copy and add a line like:
LSF_SERVER_HOSTS="hostA hostD hostB"

Replace the example host names with the names of some LSF server hosts listed in your lsf.cluster.cluster file. You should choose reliable hosts for the LSF_SERVER_HOSTS list; if all the hosts on this list are unavailable, the client host cannot use the LSF cluster.

Step 2.
On each client host, copy the modified lsf.conf file to LSF_ENVDIR.
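
For example, if LSF_ENVDIR on the client host is /etc (the directory assumed when LSF_ENVDIR is not otherwise defined), copy the edited file into place; writing to /etc normally requires root access:

% cp lsf.conf /etc/lsf.conf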

Checking the LSF Configuration

Before you start any LSF daemons, you should make sure that your cluster configuration is correct. The lsfsetup program includes an option to check the LSF configuration. The default LSF Batch configuration should work as installed if you followed the steps described in 'Installation'.

You should already have checked the configuration in Step 5. If you want to verify it again, you can run the lsadmin and badmin commands.

Log into the first host listed in lsf.cluster.cluster as the LSF administrator and check the LIM configuration:

% lsadmin ckconfig -v
Checking configuration files ...
LSF 3.0, Dec 10, 1996
Copyright 1992-1996 Platform Computing Corporation
Reading configuration from /etc/lsf.conf
Dec 21 21:15:51 13412 /usr/local/lsf/etc/lim -C
Dec 21 21:15:52 13412 initLicense: Trying to get license for LIM from source 
</usr/local/lsf/conf/license.dat>
Dec 21 21:15:52 13412 main: Got 1 licenses
Dec 21 21:15:52 13412 main: Configuration checked. No fatal errors found.
---------------------------------------------------------
No errors found.

The messages shown above are the normal output from lsadmin ckconfig -v. Other messages may indicate problems with the LSF configuration. See 'LSF Base Configuration Reference' and 'Troubleshooting and Error Messages' if any problem is found.

To check the LSF Batch configuration files, LIM must be running on the master host. If the LIM is not running, log in as root and start LSF_SERVERDIR/lim. Wait a minute and then run the lsid program to make sure LIM is available. Then run badmin ckconfig -v:

% badmin ckconfig -v
Checking configuration files ...
Dec 21 21:22:14 13545 mbatchd: LSF_ENVDIR not defined; assuming /etc
Dec 21 21:22:15 13545 minit: Trying to call LIM to get cluster name ...
Dec 21 21:22:17 13545 readHostFile: 3 hosts have been specified in file 
</usr/local/lsf/conf/lsbatch/test_cluster/configdir/lsb.hosts>; only these 
hosts will be used by lsbatch
Dec 21 21:22:17 13545 Checking Done
---------------------------------------------------------
No fatal errors found.

The above messages are normal; other messages may indicate problems with the LSF Batch configuration. See 'LSF Batch Configuration Reference' and 'Troubleshooting and Error Messages' if any problem is found.

Testing the LSF Cluster

After you have started the LSF daemons in your cluster, you should run some simple tests. Wait a minute or two for all the LIMs to get in touch with each other, elect a master, and exchange some setup information.

The testing should be performed as a non-root user. This user's PATH must include the LSF user binaries (LSF_BINDIR as defined in /etc/lsf.conf).

Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster. This section shows suggested tests and examples of correct output. The output you see on your system will reflect your local configuration.

The following steps may be performed from any host in the cluster.

Testing LIM

Step 1.
Check cluster name and master host name:
% lsid
LSF 3.0, Dec 10, 1996
Copyright 1992-1996 Platform Computing Corporation

My cluster name is test_cluster
My master name is hostA

The master name may vary but is usually the first host configured in the Hosts section of the lsf.cluster.cluster file.

If the LIM is not available on the local host, lsid displays the following message:

lsid: ls_getmastername failed: LIM is down; try later

If the LIM was started only recently, it may take a few moments to become available; try running lsid a few more times. If the LIM still does not respond, see 'Troubleshooting and Error Messages'.

The error message:

lsid: ls_getmastername failed: Cannot locate master LIM now, try later

means that the local LIM is running, but the master LIM has not contacted the local LIM yet. Check the LIM on the first host listed in lsf.cluster.cluster. If it is running, wait for 30 seconds and try lsid again. If it is not running, another LIM takes over as master after one or two minutes.

Step 2.
The lsinfo command displays cluster-wide configuration information.
% lsinfo
RESOURCE_NAME   TYPE   ORDER  DESCRIPTION
r15s          Numeric   Inc   15-second CPU run queue length
r1m           Numeric   Inc   1-minute CPU run queue length (alias: cpu)
r15m          Numeric   Inc   15-minute CPU run queue length
ut            Numeric   Inc   1-minute CPU utilization (0.0 to 1.0)
pg            Numeric   Inc   Paging rate (pages/second)
ls            Numeric   Inc   Number of login sessions (alias: login)
it            Numeric   Dec   Idle time (minutes) (alias: idle)
tmp           Numeric   Dec   Disk space in /tmp (Mbytes)
mem           Numeric   Dec   Available memory (Mbytes)
ncpus         Numeric   Dec   Number of CPUs
maxmem        Numeric   Dec   Maximum memory (Mbytes)
maxtmp        Numeric   Dec   Maximum /tmp space (Mbytes)
cpuf          Numeric   Dec   CPU factor
type           String   N/A   Host type
model          String   N/A   Host model
status         String   N/A   Host status
server        Boolean   N/A   LSF server host
cserver       Boolean   N/A   Compute Server
solaris       Boolean   N/A   Sun Solaris operating system
fserver       Boolean   N/A   File Server
NT            Boolean   N/A   Windows NT operating system

TYPE_NAME
hppa
SUNSOL
alpha
sgi
NTX86
rs6000

MODEL_NAME            CPU_FACTOR
HP735                   4.0
ORIGIN2K                8.0
DEC3000                 5.0
PENT200                 3.0

The resource names, host types, and host models should be those configured in LSF_CONFDIR/lsf.shared.

Step 3.
The lshosts command displays configuration information about your hosts:
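The listing below is illustrative; the host names, types, models, and values in your output will reflect your own configuration:

% lshosts
HOST_NAME   type     model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
hostA       hppa     HP735      4.0      1    128M    256M     Yes  (cserver)
hostD       alpha    DEC3000    5.0      1     96M    192M     Yes  (fserver)
hostB       sgi      ORIGIN2K   8.0      4    512M   1024M     Yes  (cserver)
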
Step 4.
Check the current load levels:
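The lsload command reports the current load levels. The listing below is illustrative; the columns reflect the load indices configured at your site, and the values will vary:

% lsload
HOST_NAME  status   r15s   r1m  r15m   ut    pg   ls    it   tmp   mem
hostD          ok    0.1   0.0   0.0   3%   0.0    1    86   90M   45M
hostA          ok    0.3   0.2   0.1  15%   0.5    4     0  122M   28M
hostB          ok    0.7   0.5   0.4  21%   1.2    8     2   31M   52M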

See 'Troubleshooting and Error Messages' if any of these tests do not display the expected results. If all these tests succeed, the LIMs on all hosts are running correctly.

Testing RES

Step 1.
The lsgrun command runs a UNIX command on a group of hosts:
% lsgrun -v -m "hostA hostD hostB" hostname
<<Executing hostname on hostA>>
hostA
<<Executing hostname on hostD>>
hostD
<<Executing hostname on hostB>>
hostB

If remote execution fails on any host, check the RES error log on that host.

Testing LSF Batch

Testing consists of running a number of LSF commands and making sure that correct results are reported for all hosts in the cluster.

Step 1.
The bhosts command lists the batch server hosts in the cluster:
% bhosts
HOST_NAME          STATUS    JL/U  MAX  NJOBS  RUN  SSUSP USUSP  RSV
hostD              ok          -    10     1     1     0     0     0
hostA              ok          -    10     4     2     2     0     0
hostC              unavail     -     3     1     1     0     0     0

The STATUS column shows the status of sbatchd on that host. If the STATUS column contains unavail, that host is not available. Either the sbatchd on that host has not started or it has started but has not yet contacted the mbatchd. If hosts are still listed as unavailable after roughly three minutes, check the error logs on those hosts. See 'Troubleshooting and Error Messages'.

See the bhosts(1) manual page for explanations of the other columns.

Step 2.
Submit a job to the default queue:
% bsub sleep 60
Job <1> is submitted to default queue <normal>

If the job you submitted was the first ever, it should have job ID 1. Otherwise, the number varies.

Step 3.
Check available queues and their configuration parameters:
% bqueues
QUEUE_NAME     PRIO      STATUS      MAX  JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
interactive    400    Open:Active      -    -    -    -     1     1     0     0
fairshare      300    Open:Active      -    -    -    -     2     0     2     0
owners          43    Open:Active      -    -    -    -     0     0     0     0
priority        43    Open:Active      -    -    -    -    29    29     0     0
night           40   Open:Inactive     -    -    -    -     1     1     0     0
short           35    Open:Active      -    -    -    -     0     0     0     0
normal          30    Open:Active      -    -    -    -     0     0     0     0
idle            20    Open:Active      -    -    -    -     0     0     0     0

See the bqueues(1) manual page for an explanation of the output.

Step 4.
Check job status:
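The bjobs command reports the status of your batch jobs. For the job submitted in Step 2, output along the following lines shows the job running in the normal queue; the user name, hosts, and times are illustrative:

% bjobs
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
1      user1  RUN   normal  hostA      hostD      sleep 60  Dec 21 21:35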

Configuring LSF MultiCluster

You do not need to read this section if you are not using the LSF MultiCluster product.

LSF MultiCluster unites multiple LSF clusters so that they can share resources transparently, while still maintaining the resource ownership and autonomy of the individual clusters.

LSF MultiCluster extends the functionality of a single cluster, and its configuration involves a few more steps. First set up a single cluster as described above, then perform some additional steps specific to LSF MultiCluster. See 'Managing LSF MultiCluster' for more details.

Configuring LSF JobScheduler

You do not need to read this section if you are not using the LSF JobScheduler product.

LSF JobScheduler provides reliable production job scheduling according to user-specified calendars and events. It runs user-defined jobs automatically at the right time, under the right conditions, and on the right machines.

Configuring LSF JobScheduler is almost the same as configuring an LSF Batch cluster, except that you may have to define system-level calendars for your cluster and add events to monitor your site. Some LSF Batch configuration options, such as NQS interoperation and fairshare, are less useful for LSF JobScheduler; ignore the features you do not need and use those that suit your LSF JobScheduler cluster. The concepts behind LSF JobScheduler are described in more detail in the LSF JobScheduler User's Guide.

See 'Managing LSF JobScheduler' for details about the additional configuration options for LSF JobScheduler.

Providing LSF to Users

When you have finished installing and testing the LSF cluster, you can let users try it out. LSF users must add LSF_BINDIR to their PATH environment variable to run the LSF utilities.
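
For example, if LSF_BINDIR is /usr/local/lsf/bin (an illustrative path; use the value from your lsf.conf), C shell users can add the following line to their .cshrc file:

setenv PATH ${PATH}:/usr/local/lsf/bin

Bourne shell users can add the equivalent to their .profile:

PATH=$PATH:/usr/local/lsf/bin
export PATH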

If users wish to use lstcsh as their login shell, most systems require that you modify the /etc/shells file. Add a line containing the full path name of lstcsh. Users may then use chsh(1) to choose lstcsh as their login shell. If your site uses NIS for password information, you must change /etc/shells on the NIS master host and update the NIS database. Otherwise, you must change /etc/shells on all hosts.
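
For example, if lstcsh is installed as /usr/local/lsf/bin/lstcsh (an illustrative path), the line to add to /etc/shells is simply:

/usr/local/lsf/bin/lstcsh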

Users also need access to the on-line manual pages, which were installed in LSF_MANDIR (as defined in lsf.conf) by the lsfsetup installation procedure. For most versions of UNIX, users should add the directory LSF_MANDIR to their MANPATH environment variable. If your system has a man command that does not understand MANPATH, you should either install the manual pages in the /usr/man directory or get one of the freely available man programs.
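
For example, with an illustrative LSF_MANDIR of /usr/local/lsf/man, C shell users whose MANPATH is already set can add the following to their .cshrc (Bourne shell users set and export MANPATH in .profile in the same way as PATH above):

setenv MANPATH ${MANPATH}:/usr/local/lsf/man

If MANPATH is not already set, set it to the LSF manual page directory alone.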

Using xlsadmin

You can use the xlsadmin GUI to do most of the cluster configuration and management work that has been described in this chapter. Details about xlsadmin can be found in 'Managing LSF Cluster Using xlsadmin'.

