

Appendix F. New Features in LSF 3.0


This appendix summarizes the new features of direct interest to LSF administrators. For descriptions of those new features in LSF version 3.0 that are of direct interest to end users, see Appendix B of the LSF User's Guide or the LSF JobScheduler User's Guide.

One major difference in LSF 3.0 is that LSF is now a suite of products instead of a single product. The LSF Suite contains several components: LSF Base, LSF Batch, LSF JobScheduler, and LSF MultiCluster.

LSF version 3.0 is backward compatible with LSF 2.x in the sense that you do not need to change your configuration files to upgrade to LSF 3.0. However, you cannot run LSF 3.0 on some hosts while running LSF 2.x on others. LSF 3.0 also requires a different license key.

LSF version 3.0 includes many new features requested by LSF users, in addition to bug fixes to LSF version 2.2.

Windows NT and Additional Unix Platform Support

LSF now runs on Microsoft Windows NT. Machines running Windows NT can now be part of the LSF cluster.

LSF has also been ported to the HP Exemplar system running SPP-UX 4.2.

Interactive Jobs with Batch Scheduling Control

LSF Batch now allows users to run interactive jobs under its resource sharing control. An option is provided to allow users to submit a job to LSF Batch with terminal I/O and signals attached to the job. With this support, all jobs can be placed under the resource sharing control, scheduling policies, and job accounting of LSF Batch. Queue-level parameters control the types of jobs a queue can accept: interactive only, batch only, or mixed.
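
For example, an interactive batch job can be submitted with the bsub -I option, and a queue can be restricted to interactive jobs with the queue-level INTERACTIVE parameter. The following is a minimal sketch; the exact spelling of the parameter should be checked against the lsb.queues reference pages:

    % bsub -I -q interactive myjob    # terminal I/O and signals stay attached to the job

    # In lsb.queues:
    Begin Queue
    QUEUE_NAME  = interactive
    INTERACTIVE = ONLY                # accept interactive jobs only; YES/NO control mixed or batch-only queues
    End Queue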

Job Level Resource Usage

Job-level resource usage is collected through a special process called PIM (Process Information Manager), which is managed internally by LSF. The resource usage information gathered by the PIM is collected while the job is running and can be viewed with the bjobs command.
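
For example, the resource usage of a running job can be displayed with the long output format of bjobs (1234 is a hypothetical job ID):

    % bjobs -l 1234    # long format; includes the job-level resource usage reported by the PIM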

Enhanced Resource Limit Control

If a job consists of multiple processes, the CPULIMIT parameter applies to all processes in the job. If a job dynamically spawns processes, the CPU time used by these processes is accumulated over the life of the job.

Two additional resource limits that apply to the entire job have been added: SWAPLIMIT and PROCESSLIMIT. SWAPLIMIT limits the maximum virtual memory a job can use, and PROCESSLIMIT limits the number of concurrent processes that can be part of a job at any time.
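
The following lsb.queues sketch shows the job-level limits; the values and their units are illustrative and should be checked against the reference pages for your release:

    Begin Queue
    QUEUE_NAME   = normal
    CPULIMIT     = 60        # CPU time, accumulated over all processes of the job
    SWAPLIMIT    = 200000    # total virtual memory of the entire job
    PROCESSLIMIT = 16        # maximum number of concurrent processes in the job
    End Queue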

Resource Reservation

The resource reservation feature allows users to specify that the system should reserve resources after a job starts. The reservation can be specified by the user at job submission time or configured at the queue level. Resource reservation ensures that a job has sufficient resources during its execution and that no other jobs are started to use the reserved resources.
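
Reservation is expressed through the rusage section of a resource requirement string. The example below is a sketch; the resource name (swp) and the duration syntax are assumptions to be verified against the resource requirement documentation:

    % bsub -R "rusage[swp=100:duration=30]" myjob    # reserve 100 MB of swap for 30 minutes

    # or for all jobs in a queue, in lsb.queues:
    RES_REQ = rusage[swp=100]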

Processor Reservation For Parallel Jobs

The scheduling of parallel jobs has been enhanced to support the notion of processor reservation. Parallel jobs requiring a large number of processors often cannot be started if there are many lower priority sequential jobs in the system: there may never be enough resources free at any one instant to satisfy a large parallel job, yet there is usually enough to start another sequential job. The processor reservation feature reduces this starvation of parallel jobs. A queue-level parameter is available to enable processor reservation and to specify the slot hold time.
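
In later releases this is controlled by the SLOT_RESERVE parameter in lsb.queues; the following sketch assumes the same spelling applies here:

    Begin Queue
    QUEUE_NAME   = parallel
    SLOT_RESERVE = MAX_RESERVE_TIME[300]    # reserve job slots for waiting parallel jobs,
                                            # releasing them after the configured hold time
    End Queue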

Flexible Expressions for Queue Scheduling

The queue-level parameters for the dispatch and control of jobs have been enhanced to permit a more flexible specification. Three new parameters, RES_REQ, STOP_COND, and RESUME_COND, have been added; each takes a resource requirement string as its value.

RES_REQ defines a queue-level resource requirement that applies to all jobs in the queue. It can also be used to specify scheduling conditions in a more flexible way than the scheduling load threshold parameters of previous releases.

STOP_COND specifies a load condition for suspending jobs. It is a generalization of the suspending load threshold parameters of previous versions, and the condition is specified using the resource requirement expression syntax.

RESUME_COND specifies the condition under which a suspended job should be resumed. In previous versions, the resume condition was the same as the scheduling condition. With the introduction of the RESUME_COND parameter, load conditions for resuming jobs can differ from those for scheduling jobs.
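
Since all three parameters take resource requirement strings, a queue definition might look like the following sketch; the load indices and thresholds are illustrative only:

    Begin Queue
    QUEUE_NAME  = night
    RES_REQ     = select[mem > 20] order[ut]    # scheduling condition and host ordering
    STOP_COND   = select[ut > 0.80]             # suspend jobs when CPU utilization is high
    RESUME_COND = select[ut < 0.20]             # resume only when the host is mostly idle
    End Queue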

Host Preferences

You can now configure your queues so that some servers are preferred over others for running jobs. Preference levels can be associated with different hosts to give you flexibility. This feature allows better matching between jobs and desirable hosts, leading to improved performance and resource usage. A user can also specify host preferences at job submission time.
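
Preference levels are attached to host names with a "+" in the queue's HOSTS list; a larger number indicates a stronger preference. A sketch:

    # lsb.queues: prefer hostA, then hostB, then any other server host
    HOSTS = hostA+2 hostB+1 others

    # or per job, at submission time:
    % bsub -m "hostA+2 hostB" myjob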

Generalized Checkpointing Support

LSF 3.0 supports a uniform checkpointing interface across different checkpointing mechanisms on all platforms. This interface uses external executables to initiate checkpointing and restart. New ways of checkpointing, or checkpointing on new platforms, can be supported by writing new external executables that follow the standard checkpointing protocol.

The external executables are installed in LSF_SERVERDIR by default. Users can supply their own external executables by defining an environment variable.

Multiple jobs can now share a checkpoint directory.
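
At submission time the checkpoint directory (and an optional period) is given with the bsub -k option; because the directory can now be shared, several jobs may point at the same one. The path and period below are hypothetical:

    % bsub -k "/share/ckpt 30" job1    # checkpoint into /share/ckpt every 30 minutes
    % bsub -k "/share/ckpt 30" job2    # a second job sharing the same checkpoint directory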

Job Starter

A job starter can now be defined at the queue level and is used to start all jobs in the queue. For example, in previous versions users submitted an MPI job by running the mpijob script. Now you can define mpijob as the job starter for your MPI queue, so that users can submit MPI jobs without having to put mpijob before the job command line. A job starter is also very useful for applications that must be started in a special environment, such as Atria ClearCase.
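
A sketch of a queue-level job starter in lsb.queues; mpijob here stands for whatever wrapper script your site uses:

    Begin Queue
    QUEUE_NAME  = mpi
    JOB_STARTER = mpijob    # every job in this queue is started as: mpijob <job command line>
    End Queue

    # users then submit jobs without the wrapper:
    % bsub -q mpi -n 8 myapp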

Configurable Job Control Actions

LSF Batch needs to control jobs dispatched to a host to enforce scheduling policies or in response to user requests such as suspend, resume, and terminate. In previous releases, these actions were hard-coded as signals sent to the jobs. In LSF 3.0, it is possible to override a job control action with a specified signal name or an arbitrary command line.
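
In current releases the override is expressed with the JOB_CONTROLS parameter in lsb.queues; whether 3.0 uses exactly this spelling is an assumption, and the cleanup script named below is hypothetical:

    Begin Queue
    QUEUE_NAME   = license
    JOB_CONTROLS = SUSPEND[SIGTSTP] RESUME[SIGCONT] TERMINATE[cleanup_and_kill.sh]
    End Queue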

Unlimited Number of Load Indices and Resources

LSF 3.0 eliminates the previous limit of 21 external load indices. The limit of 32 configurable resources has also been removed.

Enhanced Preemptive Scheduling

In previous releases, specifying a queue to be PREEMPTIVE would cause it to preempt jobs in any lower priority queue. LSF 3.0 allows the LSF administrator to select which of the lower priority queues a preemptive queue may preempt.
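
A sketch of selective preemption; the PREEMPTION syntax shown is taken from later releases and the queue names are illustrative:

    Begin Queue
    QUEUE_NAME = urgent
    PRIORITY   = 70
    PREEMPTION = PREEMPTIVE[normal idle]    # preempt jobs only in the "normal" and "idle" queues
    End Queue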

Per-Host Job Slot Limit of a Queue

A new parameter, HJOB_LIMIT, allows you to limit the number of jobs dispatched from a queue to a host, regardless of the number of processors the host may have. This may be useful, for example, if the queue dispatches jobs that require a node-locked license.
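
For example, to run at most one job from the queue on any host, regardless of its processor count:

    Begin Queue
    QUEUE_NAME = licenseq
    HJOB_LIMIT = 1    # at most one job from this queue per host
    End Queue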

Remote Startup

LSF administrators can start up any or all LSF daemons on any or all LSF hosts, from any host in the LSF cluster. For this to work, the LSF administrator must be able to run rsh across the LSF hosts without having to enter a password. This feature is provided as additional commands in the lsadmin and badmin tools.
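
Typical usage from any host in the cluster; the subcommand names below follow the lsadmin and badmin reference pages and are assumed to apply here:

    % lsadmin limstartup all      # start the LIM daemon on all LSF hosts
    % lsadmin resstartup hostA    # start the RES daemon on hostA only
    % badmin hstartup all         # start sbatchd on all batch server hosts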

Exclusive Job Requeue

The queue parameter REQUEUE_EXIT_VALUES controls job requeue behavior. It defines a list of exit values; if a job exits with one of those values, the job is requeued. LSF 3.0 introduces a special requeue method called exclusive requeue: if a job fails on a host, it is requeued but is never dispatched to that host again.
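
A sketch of the exclusive form; the EXCLUDE() syntax is taken from later LSF documentation and the exit values are illustrative:

    # lsb.queues: exit value 99 requeues the job normally; exit value 30 requeues it
    # but never redispatches it to the host on which it failed
    REQUEUE_EXIT_VALUES = 99 EXCLUDE(30)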

LSF MultiCluster

LSF MultiCluster is a new product component of LSF 3.0.

File Status Events

The LSF JobScheduler (formerly PJS) product now supports file status events. These events allow you to define production jobs that depend on file arrival or status.

External Events

LSF JobScheduler now provides a generic mechanism for handling site-specific events, in addition to time events and file events. Examples of such events are exceptional conditions and tape silo status. External events are collected by a customizable external event daemon.

System Calendars

LSF JobScheduler now supports the concept of system calendars, in addition to the user calendars supported in the previous version. System calendars are defined in a configuration file by the LSF cluster administrator and are read-only for all users.


Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.