

Chapter 5. Managing LSF Base


This chapter describes the operation, maintenance, and tuning of the LSF Base system in a cluster. Since all other LSF components depend on LSF Base, its correct operation is essential to the whole LSF system. This chapter should be read by all LSF cluster administrators.

Managing Error Logs

Error Logs contain important information about daemon operations. When you see any abnormal behavior related to any of the LSF daemons, you should check the relevant error logs to find out the cause of the problem.

LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts run by cron(1). If you are using LSF JobScheduler, you can define a calendar-driven job to do the cleanup regularly.
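For example, a line like the following in root's crontab removes daemon log files that have not been written to for 30 days, once a week. This is only a sketch: the log directory path and file-name pattern are illustrative and should match your LSF_LOGDIR and daemon log file names. Because the daemons reopen their log files automatically, removing old files is safe:

# Every Sunday at 2 a.m., remove LSF daemon log files older than 30 days
0 2 * * 0  find /usr/local/lsf/log -name '*.log.*' -mtime +30 -exec rm -f {} \;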

LSF Daemon Error Log

All LSF log files are reopened each time a message is logged, so if you rename or remove a log file of an LSF daemon, the daemons will automatically create a new log file.

The LSF daemons log messages when they detect problems or unusual situations. The daemons can be configured to put these messages into files, or to send them to the system error logs using the syslog facility.

If LSF_LOGDIR is defined in the /etc/lsf.conf file, LSF daemons try to store their messages in files in that directory. Note that LSF_LOGDIR must be writable by root. The error log file names for the LSF Base system daemons, LIM and RES, are lim.log.hostname and res.log.hostname.

The error log file names for LSF Batch daemons are sbatchd.log.hostname, mbatchd.log.hostname, and pim.log.hostname. LSF JobScheduler also has an eeventd.log.hostname file, in addition to all the log files of LSF Batch.

If LSF_LOGDIR is defined but the daemons cannot write to files there, the error log files are created in /tmp.

If LSF_LOGDIR is not defined, then errors are logged to syslog using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the manual pages for syslog and/or syslogd.

LSF daemons log error messages at different levels so that you can choose to log all messages or only those that are critical enough. This is controlled by the parameter LSF_LOG_MASK in the lsf.conf file. Possible values for this parameter are any of the log priority symbols defined in syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.
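For example, the following lines in lsf.conf (the directory path is illustrative) send daemon messages to files under /usr/local/lsf/log and log informational messages in addition to warning-level and more severe messages:

LSF_LOGDIR=/usr/local/lsf/log
LSF_LOG_MASK=LOG_INFO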

If the error log is managed by syslog, it is probably already being automatically cleared.

If LSF daemons cannot find the lsf.conf file when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are most likely in the syslog.

See 'Troubleshooting and Error Messages' for discussion of the more common problems and error log messages.

FLEXlm Log

The FLEXlm license server daemons log messages about the state of the license servers, and when licenses are checked in or out. This log helps to resolve problems with the license servers and to track license use.

The FLEXlm log is configured by the lsflicsetup command as described in 'Installing a New Permanent License'. This log file grows over time. You can remove or rename the existing FLEXlm log file at any time. The lsf_license script, which is used to run the FLEXlm daemons, creates a new log file when necessary.

Note
If you already have a FLEXlm server running for other products and LSF licenses are added to the existing license file, the FLEXlm log messages go to the same place you previously set up for the other products.

Controlling LIM and RES Daemons

The LSF cluster administrator can monitor the status of the hosts in a cluster, start and stop the LSF daemons, and reconfigure the cluster. Many operations are done using the lsadmin command, which performs administrative operations on LSF Base daemons, LIM and RES.

Checking Host Status

The lshosts and lsload commands report the current status and load levels of hosts in an LSF cluster. The lsmon and xlsmon commands provide a running display of the same information. The LSF administrator can find unavailable or overloaded hosts with these tools.

% lsload
HOST_NAME  status  r15s   r1m  r15m   ut    pg   ls   it   tmp   swp   mem
hostD          ok   1.3   1.2   0.9  92%   0.0    2   20    5M  148M   88M
hostB         -ok   0.1   0.3   0.7  0%    0.0    1   67   45M   25M   34M
hostA        busy   8.0  *7.0   4.9  84%   4.6    6   17    1M   81M   27M

When the status of a host is preceded by a '-', RES is not running on that host. In the above example, RES on hostB is down.

Restarting LIM and RES

LIM and RES can be restarted to upgrade software or clear persistent errors. Jobs running on the host are not affected by restarting the daemons. The LIM and RES daemons are restarted using the lsadmin command:

% lsadmin
lsadmin>limrestart hostD

Checking configuration files ...
No errors found.

Restart LIM on <hostD> ...... done
lsadmin>resrestart hostD
Restart RES on <hostD> ...... done
lsadmin>quit

Note
You must log in as the LSF cluster administrator to run the lsadmin command.

The lsadmin command can be applied to all available hosts by using the host name 'all'; for example, lsadmin limrestart all. If a daemon is not responding to network connections, lsadmin displays an error message with the host name. In this case you must kill and restart the daemon by hand.

Remote Startup of LIM and RES

LSF administrators can start up any, or all, LSF daemons, on any, or all, LSF hosts, from any host in the LSF cluster. For this to work, the /etc/lsf.sudoers file has to be set up properly to allow you to start up daemons as root, and you must be able to run rsh across LSF hosts without having to enter a password. See 'The lsf.sudoers File' for configuration details of lsf.sudoers.

The 'limstartup' and 'resstartup' options in lsadmin start up the LIM and RES daemons, respectively. Specifying a host name starts up the daemon on that particular host. For example,

% lsadmin limstartup hostA
Starting up LIM on <hostA> ...... done
% lsadmin resstartup hostA
Starting up RES on <hostA> ...... done

The lsadmin command can start daemons on all available hosts by using the host name 'all'; for example, 'lsadmin limstartup all'. All LSF daemons, including LIM, RES, and sbatchd, can be started on all LSF hosts using the command lsfstartup.

Shutting Down LIM and RES

All LSF daemons can be shut down at any time. If the LIM daemon on the current master host is shut down, another host automatically takes over as master. If the RES daemon is shut down while remote interactive tasks are running on the host, the running tasks continue but no new tasks are accepted. To shut down LIM and RES, use the lsadmin command:

% lsadmin
lsadmin>limshutdown hostD
Shut down LIM on <hostD> ...... done
lsadmin>resshutdown hostD
Shut down RES on <hostD> ...... done
lsadmin>quit

You can run lsadmin reconfig while the LSF system is in use; users may be unable to submit new jobs for a short time, but all current remote executions are unaffected.

Locking and Unlocking Hosts

A LIM can be locked to temporarily prevent any further jobs from being sent to the host. The lock can be set to last either for a specified period of time, or until the host is explicitly unlocked. Only the local host can be locked and unlocked.

% lsadmin limlock
Host is locked
% lsload
HOST_NAME  status  r15s   r1m  r15m   ut    pg   ls   it   tmp   swp   mem
hostD          ok   1.3   1.2   0.9  92%   0.0    2   20    5M  148M   28M
hostA        busy   8.0  *7.0   4.9  84%   0.6    0   17   *1M   31M    7M
hostC       lockU   0.8   1.0   1.1  73%   1.2    3    0    4M   44M   12M
% lsadmin limunlock
Host is unlocked

Only root and the LSF administrator can lock and unlock hosts.

Managing LSF Configuration

Overview of LSF Configuration Files

LSF configuration consists of several levels:

lsf.conf File

This is the generic LSF environment configuration file. It defines general installation parameters so that all LSF executables can find the necessary information. The file is typically installed in the same directory as the LSF server binaries, with a symbolic link made from a convenient directory, defined by the environment variable LSF_ENVDIR, or from the default directory /etc. The file is created by lsfsetup during installation. Note that many of the parameters in this file are machine specific. Detailed contents of this file are described in 'The lsf.conf File'.
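As a sketch, the directory-related entries in a typical lsf.conf file might look like the following; all of the paths shown are illustrative:

# Directory containing the LIM configuration files (lsf.shared, lsf.cluster.cluster)
LSF_CONFDIR=/usr/local/lsf/mnt/conf
# Parent directory of the per-cluster LSF Batch configuration directories
LSB_CONFDIR=/usr/local/lsf/mnt/conf/lsbatch
# Directory containing the LSF daemon binaries (and the elim, if any)
LSF_SERVERDIR=/usr/local/lsf/etc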

LIM Configuration Files

LIM is the kernel of your cluster that provides the single system image to all applications. LIM reads the LIM configuration files and determines your cluster and the cluster master host.

The LIM files are lsf.shared and lsf.cluster.cluster, where cluster is the name of your LSF cluster. These files define the host members, general host attributes, and resource definitions for your cluster. The individual functions of each file are described below.

lsf.shared defines the available resource names, host types, host models, cluster names and external load indices that can be used by all clusters. This file is shared by all clusters.

The lsf.cluster.cluster file is a per-cluster configuration file. It contains two types of configuration information: cluster definition information and LIM policy information. Cluster definition information affects all LSF applications, while LIM policy information affects applications that rely on LIM's policy for job placement.

The cluster definition information defines the cluster administrators, all the hosts that make up the cluster, and the attributes of each individual host, such as host type, host model, and resources, using the names defined in lsf.shared.

LIM policy information defines the load sharing and job placement policies provided by LIM. More details about LIM policies are described in 'Tuning LIM Load Thresholds'.

LIM configuration files are stored in directory LSF_CONFDIR as defined in the lsf.conf file. Details of LIM configuration files are described in 'The lsf.shared File'.

lsf.task File

lsf.task is a system-wide file that maps task names to default resource requirement strings. LSF maintains a task list for each user in the system. The lsf.task file lets the cluster administrator set task-to-resource requirement mappings at the system level. Individual users can customize their own lists using the lsrtasks command (see the lsrtasks(1) manual page for details).

When you run a job with an LSF command such as bsub or lsrun, the command consults your task list to find the default resource requirement string for the job if one is not specified explicitly. If no match is found in your task list, the system assumes a default, which typically means running the job on a host of the same type as the local host.

There is also a per cluster file lsf.task.cluster that applies to the cluster only and overrides the system-wide definition. Individual users can have their own files to override the system-wide and cluster-wide files by using the lsrtasks command.

lsf.task and lsf.task.cluster files are installed in directory LSF_CONFDIR as defined in the lsf.conf file.
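As an illustration, a system-wide task file might map task names to default resource requirements roughly as follows. The task names and resource requirement strings below are examples only, and the exact file syntax is described in the lsf.task(5) manual page; the hpux and cserver resources are those defined in 'Example Configuration Files'.

Begin RemoteTasks
# compilation jobs need an HP-UX host with at least 50 MB of swap space
cc/select[hpux && swp>50]
# simulations should be placed on a compute server
sim/select[cserver]
End RemoteTasks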

LSF Batch Configuration Files

These files define LSF Batch specific configuration such as queues, batch server hosts and batch user controls. These files are only read by mbatchd. The LSF Batch configuration relies on LIM configuration. LSF Batch daemons get the cluster configuration information from the LIM via the LSF API.

LSF Batch configuration files are stored in directory LSB_CONFDIR/cluster, where LSB_CONFDIR is defined in lsf.conf, and cluster is the name of your cluster. Details of LSF Batch configuration files are described in 'Managing LSF Batch'.

Configuration File Formats

All configuration files except lsf.conf use a section-based format. Each file contains a number of sections. Each section starts with a line beginning with the reserved word Begin followed by a section name, and ends with a line beginning with the reserved word End followed by the same section name. Begin, End, section names and keywords are all case insensitive.

Sections can either be vertical or horizontal. A horizontal section contains a number of lines, each having the format: keyword = value, where value is one or more strings. For example:

Begin exampleSection
key1 = string1
key2 = string2 string3
key3 = string4
End exampleSection
Begin exampleSection
key1 = STRING1
key2 = STRING2 STRING3
End exampleSection

In many cases you can define more than one object of the same type by giving more than one horizontal section with the same section name.

A vertical section has a line of keywords as the first line. The lines following the first line are values assigned to the corresponding keywords. Values that contain more than one string must be bracketed with '(' and ')'. The above examples can also be expressed in one vertical section:

Begin exampleSection
key1     key2               key3
string1  (string2 string3)  string4
STRING1  (STRING2 STRING3)  -
End exampleSection

Each line in a vertical section is equivalent to a horizontal section with the same section name.

Some keys in certain sections are optional. For a horizontal section, an optional key does not appear in the section if its value is not defined. For a vertical section, an optional keyword must appear in the keyword line if any line in the section defines a value for that keyword. To specify the default value use '-' or '()' in the corresponding column, as shown for key3 in the example above.

Each line may have multiple columns, separated by either spaces or TAB characters. Lines can be extended by a '\' (back slash) at the end of a line. A '#' (pound sign) indicates the beginning of a comment; characters up to the end of the line are not interpreted. Blank lines are ignored.

Example Configuration Files

Below are some examples of LIM configuration and LSF Batch configuration files. Detailed explanations of the variables are described in 'LSF Base Configuration Reference'.

Example lsf.shared file

Begin Cluster
ClusterName                                # This line is keyword(s)
test_cluster
End Cluster

Begin HostType
TYPENAME                                   # This line is keyword(s)
hppa
SUNSOL
sgi
rs6000
alpha
NTX86
End HostType

Begin HostModel
MODELNAME               CPUFACTOR          # This line is keyword(s)
HP735                   4.0
DEC3000                 5.0
ORIGIN2K                8.0
PENTI120                3.0
End HostModel

Begin Resource
RESOURCENAME            DESCRIPTION        #This line is keyword(s)
hpux                    (HP-UX operating system)
decunix                 (Digital Unix)
solaris                 (Sun Solaris operating system)
NT                      (Windows NT operating system)
fserver                 (File Server)
cserver                 (Compute Server)
End Resource

Example lsf.cluster.test_cluster file:

Begin ClusterManager
Manager = lsf user7
End ClusterManager

Begin Host
HOSTNAME     Model      Type       server     swp    Resources
hostA        HP735      hppa          1       2     (fserver hpux)
hostD        ORIGIN2K   sgi           1       2     (cserver)
hostB        PENTI120   NTX86         1       2     (NT)
End Host

In the above file, the ClusterManager section uses the horizontal format, while the Host section uses the vertical format.

Other LSF Batch configuration files are described in 'Example LSF Batch Configuration Files'.

Changing LIM Configuration

This section gives procedures for some common changes to the LIM configuration. There are three ways to change the LIM configuration: edit the configuration files with a text editor, use the xlsadmin GUI, or run lsfsetup.

The following discussions focus on changing configuration files using an editor so that you can understand the concepts behind the configuration changes. See 'Managing LSF Cluster Using xlsadmin' for the use of xlsadmin in changing configuration files.

Note:
If you run LSF Batch, you must restart mbatchd using the 'badmin reconfig' command each time you change the LIM configuration, even if the LSF Batch configuration files do not change. This is necessary because the LSF Batch configuration depends on the LIM configuration.

Adding a Host to a Cluster

Step 1.
If you are adding a host of a new host type, make sure you do the steps described in 'Installing Each Additional Host Type' first.
Step 2.
If you are adding a host of a type for which you already installed LSF binaries, make sure that the LSF binaries, configuration files, and working directories are NFS mounted on the new host. For each new host you add, do host setup steps as described in 'Additional Steps on Each LSF Server Host'.
Step 3.
If you are adding a new host type to the cluster, modify the HostType section of the lsf.shared file to add the new host type. A host type can be any alphanumeric string up to 29 characters long.
Step 4.
If you are adding a new host model, modify the HostModel section of the lsf.shared file to add in the new model together with its CPU speed factor relative to other models.
Step 5.
For each host you add into the cluster, you should add a line to the Host section of the lsf.cluster.cluster file with host name, host type, and all other attributes defined, as shown in 'Example Configuration Files'.
The master LIM and mbatchd daemons run on the first available host in the Host section of your lsf.cluster.cluster file, so you should list reliable batch server hosts first. For more information see 'Fault Tolerance'.
If you are adding a client host, set the SERVER field for the host to 0 (zero); see the example following these steps.
Step 6.
Reconfigure your LSF cluster so that LIM knows that you have added a new host to the cluster. Follow the instructions in 'Reconfiguring an LSF Cluster'. If you are adding more than one host, do this step after you have done steps 1 to 5 for all added hosts.
Step 7.
If you are adding hosts as LSF Batch server hosts, add these hosts to LSF Batch configuration by following steps described in 'Restarting sbatchd'.
Step 8.
Start the LSF daemons on the newly added host(s) by running LSF_SERVERDIR/lsf_daemons start and use ps to make sure that res, lim, and sbatchd have started.

CAUTION!
The lsf_daemons start command must be run as root. If you are creating a private cluster, do not attempt to use lsf_daemons to start your daemons. Start them manually.
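Continuing the example from 'Example Configuration Files', adding one new server host (hostE) and one new client host (hostF) might change the Host section along these lines; the new host names, models, and types are illustrative:

Begin Host
HOSTNAME     Model      Type       server     swp    Resources
hostA        HP735      hppa          1       2     (fserver hpux)
hostD        ORIGIN2K   sgi           1       2     (cserver)
hostB        PENTI120   NTX86         1       2     (NT)
hostE        HP735      hppa          1       2     (cserver hpux)
hostF        PENTI120   NTX86         0       2     (NT)
End Host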

Removing Hosts From a Cluster

Step 1.
If you are running LSF Batch, make sure you first remove the unwanted hosts from LSF Batch by following the steps described in 'Restarting sbatchd'.
Step 2.
Edit your lsf.cluster.cluster file and remove the unwanted hosts from the Host section.
Step 3.
Log in to any host in the cluster as the LSF administrator. Run:
% lsadmin resshutdown host1 host2 ...

where host1, host2, ... are hosts you want to remove from your cluster.

Step 4.
Follow instructions in 'Reconfiguring an LSF Cluster' to reconfigure your LSF cluster. The LIMs on the removed hosts will quit upon reconfiguration.
Step 5.
Remove the LSF section from the host's system startup files. This undoes what you have done previously to start LSF daemons at boot time. See 'Starting LSF Servers at Boot Time' for details.
Step 6.
If any users use lstcsh as their login shell, change their login shell to tcsh or csh. Remove lstcsh from the /etc/shells file.

Customizing Host Resources

Your cluster is most likely heterogeneous. Even if all of your computers are the same model, the cluster may still be heterogeneous in other ways. For example, some machines are configured as file servers while others are compute servers; some have more memory, others have less; some have four CPUs, others have only one; some have host-locked software licenses installed, others do not.

LSF provides powerful resource selection mechanisms so that hosts with the required resources are chosen to run your jobs. For maximum flexibility, you should characterize your resources in enough detail that users can make meaningful choices. For example, if some of your machines are connected to both Ethernet and FDDI while others are connected only to Ethernet, you probably want to define a resource called fddi and associate it with the machines connected to FDDI. Users can then specify the resource fddi if they want their jobs to run on machines connected to FDDI.

To customize host resources for your cluster, use the following procedure.

Step 1.
Log in to any host in the cluster as the LSF administrator.
Step 2.
Define new resource names by modifying the Resource section of the lsf.shared file. Add a brief description to each added resource name; the descriptions are displayed to users by the lsinfo command. (See the sketch following these steps.)
Step 3.
If you want to associate an added resource name with an application, edit the lsf.task file so that the resource appears in the resource requirements of the application. Alternatively, you can leave this to individual users, who can use the lsrtasks command to customize their own task lists.
Step 4.
Edit the lsf.cluster.cluster file to modify the RESOURCES column of the Host section so that all hosts that have the added resources will now have the added resource names in that column.
Step 5.
Follow instructions in 'Reconfiguring an LSF Cluster' to reconfigure your LSF cluster.
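Using the fddi example above, the changes made in steps 2 and 4 might look like the following sketch; the fddi resource name and its description are illustrative, and the host line is taken from 'Example Configuration Files':

# In lsf.shared, add the new resource name to the Resource section:
Begin Resource
RESOURCENAME            DESCRIPTION        #This line is keyword(s)
fddi                    (Host is connected to an FDDI network)
End Resource

# In lsf.cluster.cluster, add fddi to the RESOURCES column of each host
# that has an FDDI interface, for example:
hostA        HP735      hppa          1       2     (fserver hpux fddi)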

Adding Dedicated Resources

Some hosts are dedicated to running a particular application or class of applications. For example, a software group might have a compute server with very fast local disk and a special compiler license. This compute server is intended to run compilation jobs only.

You can make a host dedicated to a resource to prevent unwanted jobs from being sent to that host. Dedicated resources are defined just like Boolean resources. To make a host dedicated to a resource, precede the resource name with an exclamation mark '!' in the RESOURCES column of the lsf.cluster.cluster file. If a host is dedicated to a resource, LIM only selects that host if the application requires the dedicated resource. For example, add an f77 resource to the cluster, and dedicate a host to that resource.

% lshosts
HOST_NAME type model  cpuf ncpus maxmem maxswp server RESOURCES
hostC     SUN4 SunIPC  2.7     1   24M    48M    Yes  (sparc !f77)
hostE     SUN4 SunIPC  2.7     2   96M   170M    Yes  (sparc f77)

Now when you run an ordinary load shared command, only hostE is eligible. If you ask for the f77 resource in the resource requirement, both hosts are eligible.
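In this example, the corresponding Host section entries of the lsf.cluster.cluster file might look like the following sketch; the f77 resource must also be defined in the Resource section of lsf.shared, and the swp threshold column is illustrative:

Begin Host
HOSTNAME     Model      Type       server     swp    Resources
hostC        SunIPC     SUN4          1       2     (sparc !f77)
hostE        SunIPC     SUN4          1       2     (sparc f77)
End Host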

Note
Dedicated resources are not a secure way of controlling access to a host. Users can specify the dedicated resource on the command line, and they can also force remote execution on a particular host by explicitly specifying the host name. Use LSF Batch policy to enforce access control.

Reconfiguring an LSF Cluster

After changing the LIM configuration files, you must use the lsadmin command to tell LIM to read the new configuration.

Operations can be specified on the command line or entered at a prompt. Run the lsadmin command with no arguments, and type help to see the available operations.

The lsadmin reconfig command checks the LIM configuration files for errors. If no errors are found, the command confirms that you want to restart the LIMs on all hosts, and reconfigures all the LIM daemons:

% lsadmin reconfig
Checking configuration files ...
No errors found.

Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <hostD> ...... done
Restart LIM on <hostA> ...... done
Restart LIM on <hostC> ...... done

In the above example no errors are found. If any non-fatal errors are found, the command asks you to confirm the reconfiguration. If fatal errors are found, the reconfiguration is aborted.

If you want to see details on any errors, run the command lsadmin ckconfig -v. This reports all errors to your terminal.

If you change the configuration file of LIM, you should also reconfigure LSF Batch by running badmin reconfig because LSF Batch depends on LIM configuration. If you change the configuration of LSF Batch, then you only need to run badmin reconfig.

External Load Indices

The LIM can be extended to support additional load information through external load indices. Examples of external load indices are network traffic, free disk space on scratch file systems, and available software licenses.

LSF supports over 100 external load indices, which can be defined for each cluster. The external load indices are calculated by a separate program, called the External LIM or ELIM. By providing your own ELIM you can configure your cluster to keep track of any dynamic information you choose.

Writing an External LIM

The ELIM can be any executable program, either an interpreted script or compiled code. Example code for an ELIM is included in the misc directory in the LSF distribution. The elim.c file is an ELIM written in C. You can customize this example to collect the load indices you want.

The ELIM communicates with the LIM by periodically writing a load update string to its standard output. The load update string contains the number of indices followed by a list of name-value pairs in the following format:

N name1 value1 name2 value2 ... nameN valueN

For example:

3 tmp2 47.5 nio 344.0 licenses 5

This string reports 3 indices: tmp2, nio, and licenses, with values 47.5, 344.0, and 5 respectively. Index names must be defined in the NewIndex section of the lsf.shared file (see 'Configuring External Load Indices'). Index values must be numbers between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.

If the ELIM is implemented as a C program, as part of initialization it should use setbuf(3) to establish unbuffered output to stdout.

The ELIM should ensure that the entire load update string is written successfully to stdout. This can be done by checking the return value of printf(3s) if the ELIM is implemented as a C program or the return code of /bin/echo(1) from a shell script. The ELIM should exit if it fails to write the load information.

Each LIM sends updated load information to the master every 15 seconds. Depending on how quickly your external load indices change, the ELIM should write the load update string at most once every 15 seconds. If the external load indices rarely change, the ELIM can write the new values only when a change is detected. The LIM continues to use the old values until new values are received.

The executable for the ELIM must be in LSF_SERVERDIR and must have the name 'elim'. If any external load indices are defined in the NewIndex section of the lsf.shared file, the LIM invokes the ELIM automatically on startup. The ELIM runs with the same user id and file access permission as the LIM.

The LIM restarts the ELIM if it exits; to prevent problems in case of a fatal error in the ELIM, it is restarted at most once every 90 seconds. When the LIM terminates, it sends a SIGTERM signal to the ELIM. The ELIM must exit upon receiving this signal.
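As an illustration, a minimal ELIM that reports a single external index could be written as a shell script similar to the following sketch. It reports tmp2, the space available in /usr/tmp in megabytes; the df output parsing is an assumption and varies between systems:

#!/bin/sh
# Minimal example ELIM: install as LSF_SERVERDIR/elim and define the
# tmp2 index in the NewIndex section of lsf.shared.

trap 'exit 0' 15                    # exit cleanly when LIM sends SIGTERM

while true
do
    # Assumes the available space (in KB) is the fourth column of the
    # last line of df output; adjust for your system.
    kb=`df /usr/tmp | tail -1 | awk '{print $4}'`
    mb=`expr $kb / 1024`

    # Load update string: <number of indices> <name1> <value1> ...
    /bin/echo "1 tmp2 $mb"
    if [ $? -ne 0 ]; then           # exit if the write to stdout fails
        exit 1
    fi

    sleep 15                        # report at most once every 15 seconds
done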

Configuring External Load Indices

Configure external load indices in the NewIndex section of the lsf.shared file. For each index, provide a name, an update interval, the direction of change with increasing load, and a description. For the ELIM above, the NewIndex section might be as follows.

Begin NewIndex
NAME  INTERVAL  INCREASING  DESCRIPTION
tmp2        30           N  (Disk space in /usr/tmp in MB)
nio         15           Y  (Network I/O in KB/second)
licenses    60           N  (Number of licenses available)
End NewIndex 

Note
The name of the load index must not be one of the resource name aliases cpu, idle, logins, or swap.

The update interval is for user information only. The actual update interval is controlled by your ELIM.

For a complete explanation of the meaning of the keywords, see 'External Load Indices'.

By default, the LIM polls the ELIM every five seconds for information. If the interval for the most frequently updated external index is less than five seconds, then the ELIM_POLL_INTERVAL parameter in the Parameters section of the lsf.cluster.cluster file can be used to specify a shorter sampling interval. Additionally it is necessary to set the EXINTERVAL (minimum exchange interval) parameter in the lsf.cluster.cluster file to the same value as ELIM_POLL_INTERVAL to ensure that the load index values collected from the ELIM are sent to the master LIM promptly.
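For example, if the most frequently updated external index changes every two seconds, the Parameters section of the lsf.cluster.cluster file might contain the following; the interval of two seconds is illustrative:

Begin Parameters
ELIM_POLL_INTERVAL=2
EXINTERVAL=2
End Parameters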

After configuring the external load indices in lsf.shared, run lsadmin reconfig to check the configuration and restart the LIM on all hosts.

Overriding Built-In Load Indices

The ELIM can also return values for the built-in load indices. In this case the value produced by the ELIM overrides the value produced by the LIM. The ELIM must ensure that the semantics of any index it supplies is the same as that of the corresponding index returned by the lsinfo(1) command.

For example, some sites prefer to use /usr/tmp for temporary files. To override the tmp load index, write a program that periodically measures the space in the /usr/tmp file system, and writes the value to standard output. Name this program elim and put it in the LSF_SERVERDIR directory.

Note
The name of an external load index must not be one of the resource name aliases cpu, idle, logins, or swap. To override one of these indices, use its formal name: r1m, it, ls, or swp.

You must configure the external load index even if you are overriding a built-in load index.

LIM Policies

LIM provides critical services to all LSF components. In addition to the timely collection of resource information, LIM also provides host selection and job placement policies. If you are using the LSF MultiCluster product, LIM policies also determine how different clusters exchange load and resource information.

LIM policies are advisory information for applications. Applications can either use the placement decision from the LIM, or make further decisions based on information from the LIM.

Most of the LSF interactive tools, such as lsrun, lsmake, and lstcsh, use LIM policies to place jobs on the network. LSF Batch and LSF JobScheduler use load and resource information from LIM and make their own placement decisions based on other factors in addition to load information.

As described in 'Overview of LSF Configuration Files', the LIM configuration files define load-sharing policies. The LIM configuration parameters that affect LIM policies include the CPU factors and the load thresholds discussed in the following sections.

There are two main goals in adjusting the LIM configuration parameters: improving response time, and reducing interference with interactive use. To improve response time, LSF should be tuned to correctly select the best available host for each job. To reduce interference, LSF should be tuned to avoid overloading any host.

Tuning CPU Factors

CPU factors are used to differentiate the relative speed of different machines. LSF runs jobs on the best possible machines so that the response time is minimized. To achieve this, it is important that you define correct CPU factors for each machine model in your cluster by changing the HostModel section of your lsf.shared file.

CPU factors should be set based on a benchmark that reflects your work load. (If there is no such benchmark, CPU factors can be set based on raw CPU power.) The CPU factor of the slowest hosts should be set to one, and faster hosts should be proportional to the slowest. For example, consider a cluster with two hosts, hostA and hostB, where hostA takes 30 seconds to run your favourite benchmark and hostB takes 15 seconds to run the same test. hostA should have a CPU factor of 1, and hostB (since it is twice as fast) should have a CPU factor of 2.
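For this two-host example, the HostModel section of lsf.shared might contain entries like the following; the model names are hypothetical:

Begin HostModel
MODELNAME               CPUFACTOR          # This line is keyword(s)
MODELA                  1.0                # hostA's model: 30 seconds for the benchmark
MODELB                  2.0                # hostB's model: 15 seconds, twice as fast
End HostModel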

LSF uses a normalized CPU performance rating to decide which host has the most available CPU power. The normalized ratings can be seen by running the lsload -N command. The hosts in your cluster are displayed in order from best to worst. Normalized CPU run queue length values are based on an estimate of the time it would take each host to run one additional unit of work, given that an unloaded host with CPU factor 1 runs one unit of work in one unit of time.

Incorrect CPU factors can reduce performance in two ways. If the CPU factor for a host is too low, that host may not be selected for job placement when a slower host is available. This means that jobs would not always run on the fastest available host. If the CPU factor is too high, jobs are run on the fast host even when they would finish sooner on a slower but lightly loaded host. This causes the faster host to be overused while the slower hosts are underused.

Both of these conditions are somewhat self-correcting. If the CPU factor for a host is too high, jobs are sent to that host until the CPU load threshold is reached. The LIM then marks that host as busy, and no further jobs will be sent there. If the CPU factor is too low, jobs may be sent to slower hosts. This increases the load on the slower hosts, making LSF more likely to schedule future jobs on the faster host.

Tuning LIM Load Thresholds

The Host section of the lsf.cluster.cluster file can contain busy thresholds for load indices. You do not need to specify a threshold for every index; indices that are not listed do not affect the scheduling decision. These thresholds are a major factor influencing LSF performance. This section does not describe all LSF load indices; see 'Resource Requirements' and 'Threshold Fields' for more complete discussions.

The parameters that most often affect performance are:

r15s  15-second average
r1m   1-minute average
r15m  15-minute average
pg    paging rate in pages per second
swp   available swap space

For tuning these parameters you should compare the output of lsload to the thresholds reported by lshosts -l.

The lsload and lsmon commands display an asterisk '*' next to each load index that exceeds its threshold. For example, consider the following output from lshosts -l and lsload:

% lshosts -l
HOST_NAME:  hostD
...
LOAD_THRESHOLDS:
  r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
     -   3.5     -    -    15     -    -    -     -    2M    1M

HOST_NAME:  hostA
...
LOAD_THRESHOLDS:
  r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
     -   3.5     -    -    15     -    -    -     -    2M    1M
% lsload
HOST_NAME  status r15s  r1m  r15m   ut    pg  ls   it   tmp   swp   mem
hostD          ok  0.0  0.0   0.0   0%   0.0   6    0   30M   32M   10M
hostA        busy  1.9  2.1   1.9  47% *69.6  21    0   38M   96M   60M

In this example, hostD is ok. However, hostA is busy; the pg (paging rate) index is 69.6, above the threshold of 15.
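These thresholds come from the Host section of the lsf.cluster.cluster file. A configuration along the following lines would produce the thresholds shown above; the swp and mem thresholds are in megabytes, only the columns of interest are shown, and the other columns follow the format of 'Example Configuration Files':

Begin Host
HOSTNAME     Model      Type       server   r1m    pg    swp   mem   Resources
hostD        ORIGIN2K   sgi           1     3.5    15     2     1    (cserver)
hostA        HP735      hppa          1     3.5    15     2     1    (fserver hpux)
End Host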

Other monitoring tools such as xlsmon also help to show the effects of changes.

If the LIM often reports a host to be busy when the CPU run queue length is low, the most likely cause is the paging rate threshold. Different versions of UNIX assign subtly different meanings to the paging rate statistic, so the threshold needs to be set at different levels for different host types. In particular, HP-UX systems need to be configured with significantly higher pg values; try starting at a value of 50 rather than the default 15.

If the LIM often shows systems busy when the CPU utilization and run queue lengths are relatively low and the system is responding quickly, try raising the pg threshold. There is a point of diminishing returns; as the paging rate rises, eventually the system spends too much time waiting for pages and the CPU utilization decreases. Paging rate is the factor that most directly affects perceived interactive response. If a system is paging heavily, it feels very slow.

The CPU run queue threshold can be reduced if you find that interactive jobs slow down your response too much while the LIM still reports your host as ok. Likewise, it can be increased if hosts become busy at too low a load.

On multiprocessor systems the CPU run queue threshold is compared to the effective run queue length as displayed by the lsload -E command. The run queue threshold should be configured as the load limit for a single processor. Sites with a variety of uniprocessor and multiprocessor machines can use a standard value for r15s, r1m, and r15m in the configuration files, and the multi-processor machines will automatically run more jobs. Note that the normalized run queue length printed by lsload -N is scaled by the number of processors. See the 'Resources' chapter of the LSF User's Guide and lsfintro(1) for the concept of effective and normalized run queue lengths.

Cluster Monitoring with LSF

Because LSF takes a wide variety of measurements on the hosts in your network, it can be a powerful tool for monitoring and capacity planning. The lsmon command gives updated information that can quickly identify problems such as inaccessible hosts or unusual load levels. The lsmon -L option logs the load information to a file for later processing. See the lsmon(1) and lim.acct(5) manual pages for more information.

For example, if the paging rate (pg) on a host is always high, adding memory to the system will give a significant increase in both interactive performance and total throughput. If the pg index is low but the CPU utilization (ut) is usually more than 90 percent, the CPU is the limiting resource. Getting a faster host, or adding another host to the network, would provide the best performance improvement. The external load indices can be used to track other limited resources such as user disk space, network traffic, or software licenses.

The xlsmon program is a Motif graphic interface to the LSF load information. The xlsmon display uses colour to highlight busy and unavailable hosts, and can show both the current levels and scrolling histories of selected load indices.

See the 'Cluster Information' chapter of the LSF User's Guide for more information about xlsmon.

LSF License Management

LSF software is licensed using the FLEXlm license manager from Globetrotter Software, Inc. The LSF license key controls the hosts allowed to run LSF. The procedures for obtaining, installing and upgrading license keys are described in 'Getting License Key Information' and 'Setting Up the License Key'. This section provides background information on FLEXlm.

FLEXlm controls the total number of hosts configured in all your LSF clusters. You can organize your hosts into clusters however you choose. Each server host requires at least one license; multiprocessor hosts require more than one, as a function of the number of processors. Each client host requires 1/5 of a license.

LSF uses two kinds of FLEXlm license: time-limited DEMO licenses and permanent licenses.

The DEMO license allows you to try LSF out on an unlimited number of hosts on any supported host type. The trial period has a fixed expiry date, and the LSF software will not function after that date. DEMO licenses do not require any additional daemons.

Permanent licenses are the most common. A permanent license limits only the total number of hosts that can run the LSF software, and normally has no time limit. You can choose which hosts in your network will run LSF, and how they are arranged into clusters. Permanent licenses are counted by a license daemon running on one host on your network.

For permanent licenses, you need to choose a license server host and send hardware host identification numbers for the license server host to your software vendor. The vendor uses this information to create a permanent license that is keyed to the license server host. Some host types have a built-in hardware host ID; on others, the hardware address of the primary LAN interface is used.

How FLEXlm Works

FLEXlm is used by many UNIX software packages because it provides a simple and flexible method for controlling access to licensed software. A single FLEXlm license server can handle licenses for many software packages, even if those packages come from different vendors. This reduces the systems administration load, since you do not need to install a new license manager every time you get a new package.

The License Server Daemon

FLEXlm uses a daemon called lmgrd to manage permanent licenses. This daemon runs on one host on your network, and handles license requests from all applications. Each license key is associated with a particular software vendor. lmgrd automatically starts a vendor daemon; the LSF version is called lsf_ld and is provided by Platform Computing Corporation. The vendor daemon keeps track of all licenses supported by that vendor. DEMO licenses do not require you to run license daemons.

The license server daemons should be run on a reliable host, since licensed software will not run if it cannot contact the server. The FLEXlm daemons create very little load, so they are usually run on the file server. If you are concerned about availability, you can run lmgrd on a set of three or five hosts. As long as a majority of the license server hosts are available, applications can obtain licenses.

The License File

Software licenses are stored in a text file. The default location for this file is /usr/local/flexlm/licenses/license.dat, but this can be overridden. The license file must be readable on every host that runs licensed software. It is most convenient to place the license file in a shared NFS directory.

The license.dat file normally contains one or more SERVER lines identifying the license server hosts, a DAEMON line giving the location of the lsf_ld vendor daemon, and a FEATURE line for each licensed LSF component.

The FEATURE line contains an encrypted code to prevent tampering. For permanent licenses, the licenses granted by the FEATURE line can be accessed only through license servers listed on the SERVER lines.

For DEMO licenses no FLEXlm daemons are needed, so the license file contains only the FEATURE line.

Here is an example of a DEMO license file. This file contains one line for each separate component (see 'Modifying LSF Components and Licensing'). However, no SERVER or DAEMON information is needed. The license is for LSF version 3.0 and is valid until Jan. 10, 1997.

FEATURE lsf_base lsf_ld 3.000 10-Jan-1997 0 5C51F231E238555BAD7F "Platform" DEMO
FEATURE lsf_batch lsf_ld 3.000 10-Jan-1997 0 6CC1D2C137651068E23C "Platform" DEMO
FEATURE lsf_mc lsf_ld 3.000 10-Jan-1997 0 2CC1F2E132C85B8D1806 "Platform" DEMO

The following is an example of a permanent license file. The license server is configured to run on hostD, using TCP port 1700. This allows 10 hosts to run LSF, with no expiry date.

SERVER hostD 08000962cc47 1700
DAEMON lsf_ld /usr/local/lsf/etc/lsf_ld
FEATURE lsf_base lsf_ld 3.000 01-Jan-0000 10 51F2315CE238555BAD7F "Platform"
FEATURE lsf_batch lsf_ld 3.000 01-Jan-0000 10 C1D2C1376C651068E23C "Platform"
FEATURE lsf_mc lsf_ld 3.000 01-Jan-0000 10 C1F2E1322CC85B8D1806 "Platform"

License management utilities

FLEXlm provides several utility programs for managing software licenses. These utilities and their manual pages are included in the LSF software distribution.

Because these utilities can be used to shut down the FLEXlm license server, and thus prevent licensed software from running, they are installed in the LSF_SERVERDIR directory. The file permissions are set so that only root and members of group 0 can use them.

The utilities included are:

lmcksum
Calculate check sums of the license key information
lmdown
Shut down the FLEXlm server
lmhostid
Display the hardware host ID
lmremove
Remove a feature from the list of checked out features
lmreread
Tell the license daemons to re-read the license file
lmstat
Display the status of the license servers and checked out licenses
lmver
Display the FLEXlm version information for a program or library

For complete details on these commands, see the on-line manual pages.

Updating an LSF License

FLEXlm only accepts one license key for each feature listed in a license key file. If there is more than one FEATURE line for the same feature, only the first FEATURE line is used. To add hosts to your LSF cluster, you must replace the old FEATURE line with a new one listing the new total number of licenses.

The procedure for updating a license key file to include new license keys is described in 'Adding a Permanent License'.

Changing the FLEXlm Server TCP Port

The fourth field on the SERVER line specifies the TCP port number that the FLEXlm server uses. Choose an unused port number. LSF usually uses port numbers in the range 3879 to 3882, so the numbers from 3883 on are good choices. If the lmgrd daemon complains that the license server port is in use, you can choose another port number and restart lmgrd.

For example, if your license file contains the line:

SERVER hostname host-id 1700

and you want your FLEXlm server to use TCP port 3883, change the SERVER line to:

SERVER hostname host-id 3883

Modifying LSF Components and Licensing

LSF Suite V3.0 includes the following components: LSF Base, LSF Batch, LSF JobScheduler (formerly LSF-PJS), and LSF MultiCluster.

The configuration changes to enable a particular component in a cluster are handled during installation by lsfsetup. If at some later time you want to modify the components of your cluster, edit the FEATURES line in the Parameters section of the lsf.cluster.cluster file. You can specify any combination of the strings 'lsf_base', 'lsf_batch', 'lsf_js', and 'lsf_mc' to enable the operation of LSF Base, LSF Batch, LSF JobScheduler, and LSF MultiCluster, respectively. If any of 'lsf_batch', 'lsf_js', or 'lsf_mc' is specified, then 'lsf_base' is automatically enabled as well.

If the lsf.cluster.cluster file is shared, adding a component name to the FEATURES line enables that component for all hosts in the cluster. For example, enable the operation of LSF Base, LSF Batch and LSF MultiCluster:

Begin Parameters
FEATURES=lsf_batch lsf_mc
End Parameters

Enable the operation of LSF Base only:

Begin Parameters
FEATURES=lsf_base
End Parameters

Enable the operation of LSF JobScheduler:

Begin Parameters
FEATURES=lsf_js
End Parameters

Selected Hosts

It is possible to indicate that only certain hosts within a cluster run LSF Batch or LSF JobScheduler. This is done by specifying 'lsf_batch' or 'lsf_js' in the RESOURCES field of the Host section of the lsf.cluster.cluster file. For example, the following enables hosts hostA, hostB, and hostC to run LSF JobScheduler and hosts hostD, hostE, and hostF to run LSF Batch.

Begin Parameters
FEATURES=lsf_batch
End Parameters

Begin Host
HOSTNAME     type     model    server RESOURCES
hostA       SUN41 SPARCSLC    1    (sparc bsd lsf_js)
hostB       HPPA9    HP735    1    (hpux lsf_js)
hostC         SGI SGIINDIG    1    (irix cs lsf_js)
hostD      SUNSOL SunSparc    1    (solaris)
hostE       HP_UX     A900    1    (hpux cs bigmem)
hostF       ALPHA  DEC5000    1    (alpha)
End Host

The license file used to serve the cluster must have the corresponding features. A host shows as unlicensed if the license for the component it is configured to run is unavailable. For example, if a cluster is configured to run LSF JobScheduler on all hosts and the license file does not contain the LSF JobScheduler feature, then the hosts will be unlicensed, even if there are licenses for LSF Base or LSF Batch.




Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.