Chapter 8. Advanced Features

Resources

A computer network may be thought of as a collection of resources used to execute programs. Different applications often require different resources. For example, a market simulation program may take a lot of CPU power, whereas a large database application may need a lot of memory to run well. If the machines in the network are of different types, certain applications may run on some of them, but not on others.

Resource requirements let you run applications quickly and correctly by matching them to suitable hosts. A resource requirement is a string that contains resource names and operators. There are several types of resources. Load indices measure dynamic resource availability, such as a server's CPU load or available swap space. Static resources represent unchanging information, such as the number of CPUs a host has, the host type, and the maximum available swap space.

Resources can have numeric, string, or boolean (logical) values. Memory, swap space and CPU load are examples of resources with numeric values. Host type and model are string resources. Boolean resources are static resources assigned by the JobScheduler administrator to represent features available on particular hosts such as the availability of host-locked software licences or special hardware.

The lsinfo command lists the resources available in your cluster. See the example listing in 'Cluster Resource Information'. Resource names are case sensitive.

Resource Requirements

A resource requirement string is divided into three sections: a selection section, an ordering section, and a resource usage section. The selection section specifies the characteristics a host must have to be considered eligible. The ordering section indicates how the hosts that meet the selection criteria should be sorted. The resource usage section specifies the expected resource consumption of the task. The syntax of a resource requirement string is:

select[selectstring] order[orderstring] rusage[usagestring]

where 'select', 'order' and 'rusage' are the section names. The selection string is a logical expression built from a set of resource names.

Note
The 'select' keyword may be omitted if the 'selectstring' appears as the first string in the resource requirement.

The order string allows the selected hosts to be sorted according to the value(s) of resource(s). The syntax of the order string is:

"[-]res:[-]res:...[-]res"

where each 'res' is a load index resource name, as returned by the lsload command. The order string is used as input to a multi-level sorting algorithm, where each sorting phase orders the hosts according to one particular load index and discards some hosts. The remaining hosts are passed on to the next phase. The first phase begins with the last index, and the phases proceed from right to left.
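As a rough illustration of the multi-level sort, the Python sketch below parses an order string and runs the phases from the last index to the first. This chapter does not say how many hosts each phase discards or which direction counts as "better" for a given index, so the keep fraction and the rule "smallest value first unless the name is prefixed with -" are assumptions made for the example. The names parse_order and multi_level_sort are hypothetical helpers, not JobScheduler interfaces.

```python
import math


def parse_order(orderstring):
    """Return [(index_name, reverse_flag), ...] in the order written."""
    keys = []
    for item in orderstring.split(":"):
        item = item.strip()
        keys.append((item.lstrip("-"), item.startswith("-")))
    return keys


def multi_level_sort(hosts, orderstring, keep=0.5):
    """Sort hosts in phases, one load index per phase, right to left.

    hosts maps a host name to a dict of load index values.
    ASSUMPTIONS (not from the manual): ascending order puts the preferred
    host first, a leading "-" reverses that, and each phase keeps the
    best `keep` fraction of the candidates (at least one host).
    """
    candidates = list(hosts)
    for name, reverse in reversed(parse_order(orderstring)):
        candidates.sort(key=lambda h: hosts[h][name], reverse=reverse)
        cutoff = max(1, math.ceil(len(candidates) * keep))
        candidates = candidates[:cutoff]
    return candidates


# Illustrative load values only; these are not real lsload output.
hosts = {
    "hostA": {"r15s": 0.2, "pg": 5.0},
    "hostB": {"r15s": 1.5, "pg": 0.1},
    "hostC": {"r15s": 0.4, "pg": 0.3},
    "hostD": {"r15s": 0.9, "pg": 9.0},
}
print(multi_level_sort(hosts, "r15s:pg"))
```

With the order string r15s:pg, the pg phase runs first and the r15s phase runs last, so the leftmost index has the final say among the hosts that survive.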

The usagestring specifies the expected resource usage of the job. JobScheduler uses it for resource reservation: it reserves the specified resources for the job so that they are not overcommitted to other jobs. See 'Resource Reservation' for details.

If you need to explicitly specify resource requirements for your job, use the -R option to the bsub command.

% bsub -R "mem >= 5 && solaris order[cpu]" command

The above example requests that your job run on a Solaris host that has a lightly loaded CPU and at least 5 megabytes of memory available.
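To make the three-section layout concrete, here is a minimal Python sketch that splits a resource requirement string into its sections, treating any text before the first named section as the selection string. It is only an illustration of the syntax described above: parse_resreq is a hypothetical helper, and the simple pattern assumes that section bodies contain no nested square brackets.

```python
import re

# Matches select[...], order[...] or rusage[...] (no nested brackets).
SECTION_RE = re.compile(r"\b(select|order|rusage)\[([^\]]*)\]")


def parse_resreq(resreq):
    """Split a resource requirement string into its three sections."""
    sections = {"select": "", "order": "", "rusage": ""}
    first = SECTION_RE.search(resreq)
    # Text before the first named section is an implicit selection string.
    leading = resreq[: first.start()] if first else resreq
    if leading.strip():
        sections["select"] = leading.strip()
    for name, body in SECTION_RE.findall(resreq):
        sections[name] = body.strip()
    return sections


print(parse_resreq("mem >= 5 && solaris order[cpu]"))
```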

You do not have to specify resource requirements every time you submit a job. JobScheduler maintains a task resource list that is consulted automatically to get the resource requirements of a job. The task list is initially configured by your cluster administrator. To view the task list, use the lsrtasks command:

% lsrtasks
lotus123/type==solaris 
eval/type==any && swp>40
weekly_report/hname==hostD
backup/fserver

Each line of the output associates a resource requirement expression with a task name. When you submit a job to JobScheduler, the first string from your command line (the command name) is used as the task name to consult the task list to find the resource requirements of the task.

If a resource requirement string is not given for a task, the default resource requirement string "type==$host_type order[r15s:pg]" is assumed, where host_type refers to the host type of the job submission host.
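The lookup described above can be sketched in Python as follows. The sketch parses a listing in the task/requirement form shown by lsrtasks and falls back to the default requirement string when the task is not listed. The helper names are hypothetical, and the assumption that the task name is simply the first word of the command line (with no path handling) is a simplification.

```python
DEFAULT_TEMPLATE = "type=={host_type} order[r15s:pg]"


def parse_task_list(listing):
    """Map each task name to its resource requirement string."""
    tasks = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything after the first "/" is the requirement string.
        name, _, resreq = line.partition("/")
        tasks[name] = resreq
    return tasks


def resreq_for(command_line, tasks, host_type):
    """Return the requirement string for the command being submitted."""
    task_name = command_line.split()[0]
    resreq = tasks.get(task_name)
    if not resreq:
        return DEFAULT_TEMPLATE.format(host_type=host_type)
    return resreq


listing = """\
lotus123/type==solaris
eval/type==any && swp>40
weekly_report/hname==hostD
backup/fserver
"""
tasks = parse_task_list(listing)
print(resreq_for("eval model.dat", tasks, "solaris"))
print(resreq_for("make all", tasks, "solaris"))
```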

Resource Limits

Most UNIX systems allow you to limit some of the resources consumed by a job. Jobs that consume more than the specified amount of a resource are signalled or have their priority lowered. Not all limits are supported on all systems, and the exact behaviour is system specific. See the setrlimit(2) manual page for details on how each system implements resource limits.

Queues can also specify resource limits for jobs in the queue. If both the queue and the bsub command specify a limit, the job is given the most restrictive limit.

The following resource limits can be specified to bsub:

-c cpu_limit[/host_spec]: Limit this job to using only cpu_limit of time on the execution host. You specify the limit with the syntax [hour:]minute (minute can be greater than 59). If you supply a host_spec, the limit is scaled by the appropriate CPU scaling factor. This option is useful for preventing an erroneous job from running for an excessive amount of time.

-W run_limit[/host_spec]: Limit this job to using only run_limit of wall clock time. If it exceeds the limit, it is given a warning and then terminated 10 minutes later.

-F file_limit: If this job attempts to write to a file such that the file size would increase beyond file_limit kilobytes, it is sent a signal that normally causes the process to terminate.

-D data_limit: Limit the size of the data segment of a job to data_limit kilobytes. If the job attempts to acquire more memory, it receives an error. This option would not normally be of interest to JobScheduler users.

-S stack_limit: Limit the stack segment size for this job. An attempt to extend the stack beyond stack_limit kilobytes causes the job to be terminated. This option would not normally be of interest to JobScheduler users.

-C core_limit: Limit this job to a core file size of core_limit kilobytes. This option would not normally be of interest to JobScheduler users.

-M mem_limit: Limit the process resident set size to mem_limit kilobytes for this batch job. This option would not normally be of interest to JobScheduler users.

Note
Not all operating systems support all the resource limits listed above. Consult your operating system manual to find out what resource limits are available on your system.

The following examples limit your job to 3.5 hours of CPU time. The first example uses the syntax hour:minute, the second uses minute only.

% bsub -c 3:30 -J 3.5Hours command
% bsub -c 210 -J 3.5Hours command
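The equivalence of the two forms follows from the [hour:]minute syntax. A small Python sketch of the conversion (limit_to_minutes is a hypothetical helper, shown only to make the arithmetic explicit):

```python
def limit_to_minutes(spec):
    """Convert a limit written as [hour:]minute into minutes."""
    hours, sep, minutes = spec.rpartition(":")
    # Without a colon the whole string is minutes, which may exceed 59.
    return (int(hours) if sep else 0) * 60 + int(minutes)


print(limit_to_minutes("3:30"))
print(limit_to_minutes("210"))
```

Both calls yield 210 minutes, that is, 3.5 hours.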

The following example specifies that your job can use 10 minutes of CPU time on a DEC3100 host, or the corresponding time on any other host.

% bsub -c 10/DEC3100 -J TenMinJob command
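One way to picture "the corresponding time on any other host" is as a rescaling by relative CPU speed. The Python sketch below assumes that each host has a CPU scaling factor in which a larger number means a faster host, and that the limit on the execution host is the given limit multiplied by the factor of the host_spec host and divided by the factor of the execution host. Both the formula and the factor values are assumptions for illustration; the real factors are configured by your cluster administrator.

```python
# Hypothetical CPU scaling factors; real values come from the cluster
# configuration and will differ.
CPU_FACTOR = {"DEC3100": 1.0, "hostFast": 2.0, "hostSlow": 0.5}


def scaled_cpu_limit(limit_minutes, host_spec, exec_host):
    """Scale a CPU limit given for host_spec to the execution host.

    ASSUMPTION: a host twice as fast gets half the CPU time.
    """
    return limit_minutes * CPU_FACTOR[host_spec] / CPU_FACTOR[exec_host]


print(scaled_cpu_limit(10, "DEC3100", "hostFast"))
print(scaled_cpu_limit(10, "DEC3100", "hostSlow"))
```

Under these assumed factors, the 10 minute limit becomes 5 minutes on the faster host and 20 minutes on the slower one.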

In the following example, your job can use 4 hours of wall clock time. If it is still running 4 hours later, it will be warned and then terminated after another 10 minutes.

% bsub -W 4:00 -J FourHours command

File Transfer

JobScheduler is normally used in networks with shared file space. When shared file space is not available, JobScheduler can copy needed files to the execution host before running the job, then copy resultant files back to the submission host after the job completes.

The bsub command has the -f "[lfile op [rfile]]" option, which copies a file between the submission host and the execution host.

lfile (local file) is the file name on the submission host, and rfile (remote file) is the name on the execution host. lfile and rfile can be specified with absolute or relative path names. If you do not specify one of the files, bsub uses the filename of the other. At least one must be given.

op is the operation to perform on the file. op must be surrounded by white space. op is invalid without at least one of lfile or rfile. The possible values for op are:

>: lfile on the submission host is copied to rfile on the execution host before job execution. rfile is overwritten if it exists.

<: rfile on the execution host is copied to lfile on the submission host after the job completes. lfile is overwritten if it exists.

<<: rfile is appended to lfile after the job completes. lfile is created if it does not exist.

><, <>: lfile is copied to rfile before the job executes, then rfile is copied back (replacing the previous lfile) after the job completes (<> is the same as ><).

To run the job update, which updates the data file in place, you need to copy the file to the execution host before the job runs and copy it back after the job completes.

% bsub -f "data <>" update data

The -f option may be repeated to specify multiple files.

% bsub -f "data1 >" -f "out1 <" command data1

In the above example, the file data1 will be copied to the execution host before running the job. The resultant file, out1, will be copied back to the submission host after the job completes.
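The rules for the -f argument can be summarized in a short Python sketch that parses one "lfile op rfile" specification, fills in a missing file name from the other, and reports what is copied before and after the job. This is an illustration of the rules in this section, not of how bsub is implemented; parse_transfer is a hypothetical helper.

```python
OPS = {">", "<", "<<", "><", "<>"}


def parse_transfer(spec):
    """Parse '[lfile op [rfile]]' into (lfile, rfile, copy_in, copy_out).

    copy_in  is True when lfile is copied to rfile before the job runs.
    copy_out is "copy", "append" or None for the step after the job.
    """
    tokens = spec.split()  # op must be surrounded by white space
    ops = [i for i, tok in enumerate(tokens) if tok in OPS]
    if len(ops) != 1 or not 2 <= len(tokens) <= 3:
        raise ValueError("expected 'lfile op rfile' with one operator")
    i = ops[0]
    if len(tokens) == 3 and i != 1:
        raise ValueError("the operator must sit between the file names")
    lfile = tokens[i - 1] if i > 0 else None
    rfile = tokens[i + 1] if i + 1 < len(tokens) else None
    # A missing name defaults to the name of the other file.
    lfile = lfile or rfile
    rfile = rfile or lfile
    op = tokens[i]
    copy_in = op in (">", "><", "<>")
    copy_out = {"<": "copy", "<<": "append", "><": "copy", "<>": "copy"}.get(op)
    return lfile, rfile, copy_in, copy_out


print(parse_transfer("data <>"))
print(parse_transfer("out1 <"))
```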

Note
The files being copied to and from the execution host may also need to appear on the command line of the job.

If you specified an input file with the -i option (see 'Input and Output') and it is not found on the execution host, the file is copied from the submission host using the JobScheduler remote file access facility. It is removed from the execution host after the job finishes.

If you specified output files with the -o (standard output) and -e (standard error) options, these files are created on the execution host. They are not copied back to the submission host by default. You must explicitly copy these files back to the submission host. The following command stores the job output in the job_out file and copies the file back to the submission host.

% bsub -o job_out -f 'job_out <' command

JobScheduler tries to change directories to the same path name as the directory where you ran the bsub command. If this directory does not exist, the job is run in the temporary directory on the execution host.

If the submission and execution hosts have different directory structures, you must ensure that the directory where rfile will be placed exists. You should always specify rfile with a relative path name, preferably as a file name without any path. This places rfile in the current working directory of the job. The job then works correctly even if the directory where the bsub command is run does not exist on the execution host.

In the following example, you submit a job with input taken from the file /data/data3 and the output copied back to /data/out3.

% bsub -f "/data/data3 > data3" -f "/data/out3 < out3" command data3 out3

Start and Termination Time

You can control the times when your job is dispatched by associating it with a calendar. You can also specify a start and termination time using the -b and -t options to the bsub command.

% bsub -b 5:00 command

The submitted job remains pending until after 5 a.m. local time on the JobScheduler master host.

% bsub -b 11:12:5:40 -t 11:12:20:30 command

The above command submits your job to start after November 12 05:40:00. If the job is still running on November 12 at 20:30:00, it is killed.
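The two examples use time specifications with two and four colon-separated fields. The Python sketch below reads such a specification from right to left as minute, hour, day and month. The four-field and two-field forms are taken from the examples above; accepting a three-field day:hour:minute form is an assumption, and parse_time_spec is a hypothetical helper.

```python
def parse_time_spec(spec):
    """Parse [[month:]day:]hour:minute into its components.

    Fields that are not given are returned as None.
    """
    fields = [int(f) for f in spec.split(":")]
    if not 2 <= len(fields) <= 4:
        raise ValueError("expected between two and four fields")
    # Pad on the left so the last two fields are always hour and minute.
    month, day, hour, minute = [None] * (4 - len(fields)) + fields
    return {"month": month, "day": day, "hour": hour, "minute": minute}


print(parse_time_spec("5:00"))
print(parse_time_spec("11:12:5:40"))
```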

Reinitializing the Job Environment

By default JobScheduler copies the environment of the job from the submission host when the job is submitted. The environment is recreated on the execution host when the job is started. This is convenient in many cases because the job runs as if it had been run interactively on the submission host.

There are cases where you want to use a platform or host specific environment to run the job, rather than the same environment as on the submission host. For example, you may want to set up different search paths on the execution host. The -L login_shell option causes bsub to emulate a login on the execution host before starting the user job. This makes sure that the startup files (.profile for /bin/sh, or .cshrc and .login for /bin/csh) are sourced before the job is started. The login_shell argument specifies the login shell to use to reinitialize the environment.

% bsub -L /usr/local/shell command

Note
This is not the shell under which the job will be executed. When a login shell is specified with the -L option, that shell is only used to set the environment. The job is run using /bin/sh, unless the user specifies otherwise (see 'Running a Job Under a Particular Shell').

Exclusive Jobs

You can request that a job run exclusively on a host. Use the -x option to the bsub command.

The job is started on a host that has no other JobScheduler jobs running on it. The host is locked while this job is running, so that no other jobs are sent to it until the exclusive job finishes.

% bsub -x -J Exclusive command

Job Scripts

If bsub is run without giving a command to submit, it reads job command lines from the standard input. If the standard input is a controlling terminal, then you are prompted with bsub> for each line.

% bsub
bsub> cd /home/username/data
bsub> command arg1 arg2 ...
bsub> rm job.log
bsub> ^D
Job <1234> submitted to queue <default>.

In this case, the three command lines are submitted to JobScheduler and run as a /bin/sh script. Type CTRL-D to end the command and submit the job.

You can also redirect commands into bsub.

% bsub < command_file
Job <1237> submitted to queue <default>.

command_file must contain valid /bin/sh command lines.

Note
On Windows NT, the command shell cmd.exe is invoked to run the commands.

Embedded Submission Options

You can specify job submission options in the script read from the standard input by the bsub command using lines starting with '#BSUB':

% bsub -q priority
bsub> #BSUB -q test
bsub> #BSUB -o outfile -R "mem > 10"
bsub> command arg1 arg2
bsub> #BSUB -J EmbeddedSub
bsub> ^D
Job <1238> submitted to queue <priority>.

There are a few things to note:

Command line options override embedded options. Therefore, the job is submitted to the priority queue rather than the test queue.
Submission options can be specified anywhere in the standard input. In the above example, the -J option is specified after the command to be run.
More than one option can be specified on one line.
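The precedence rules in these notes can be illustrated with a small Python sketch that collects the options from '#BSUB' lines anywhere in a script and then lets command line options override them. The helper names are hypothetical, and the sketch assumes that every option takes exactly one argument, which holds for the options in the example (-q, -o, -R, -J) but not for every bsub option.

```python
import shlex


def embedded_options(script):
    """Collect options from '#BSUB' lines anywhere in the script."""
    opts = {}
    for line in script.splitlines():
        if line.startswith("#BSUB"):
            # shlex keeps a quoted argument such as "mem > 10" together.
            tokens = shlex.split(line[len("#BSUB"):])
            # Sketch only: each option is assumed to take one argument.
            for flag, value in zip(tokens[0::2], tokens[1::2]):
                opts[flag] = value
    return opts


def effective_options(command_line_opts, script):
    """Command line options override embedded options."""
    merged = embedded_options(script)
    merged.update(command_line_opts)
    return merged


script = """\
#BSUB -q test
#BSUB -o outfile -R "mem > 10"
command arg1 arg2
#BSUB -J EmbeddedSub
"""
print(effective_options({"-q": "priority"}, script))
```

The merged result keeps -q priority from the command line and takes -o, -R and -J from the script, matching the submission shown above.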

You can type the above commands into a script and redirect it to the standard input of the bsub command:

% bsub < shell_script
Job <1239> submitted to queue <test>.

The shell_script file contains job submission options (lines starting with '#BSUB') as well as command lines to execute. When the bsub command reads a script from its standard input, the script file is actually spooled by JobScheduler. Therefore, you can modify the script after bsub returns without affecting the current submission.

When the script is specified on the bsub command line, the script is not spooled:

% bsub shell_script
Job <1240> submitted to default queue <default>.

In this case the command line (shell_script) is spooled by JobScheduler, not the contents of the shell_script file. If you subsequently change the script, the behaviour of your job may also change.

Note
The bsub command interprets embedded options only if the script is supplied to its stdin. If the script is specified on the bsub command line, as is the case with the latter example above, the embedded options in the script file are ignored.

Running a Job Under a Particular Shell

By default, JobScheduler runs job scripts using the /bin/sh shell. However, you can specify the shell under which the job is run. This is done by specifying an interpreter in the first line of the script.

% bsub
bsub> #!/bin/csh -f
bsub> set coredump=`ls |grep core`
bsub> if ( "$coredump" != "") then
bsub> mv core core.`date | cut -d" " -f1`
bsub> endif
bsub> CoreJob
bsub> ^D
Job <1241> is submitted to default queue <default>.

The bsub command must read the job script from the standard input to set the execution shell.

If you do not specify a shell in the script, the script is run using /bin/sh. If the first line of the script starts with a "#" but the second character is not a "!", then /bin/csh is used to run the job.

% bsub
bsub> # This is a comment line. This tells the system to use /bin/csh to
bsub> # interpret the script.
bsub>
bsub> setenv DAY `date | cut -d" " -f1`
bsub> DateJob
bsub> ^D
Job <1242> is submitted to default queue <default>.
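The shell selection rule described in this section depends only on the first line of the script, and can be written as a short Python sketch (choose_shell is a hypothetical helper used only to restate the rule):

```python
def choose_shell(script):
    """Return the interpreter implied by the first line of a job script."""
    lines = script.splitlines()
    first = lines[0] if lines else ""
    if first.startswith("#!"):
        # Explicit interpreter, including any arguments such as "-f".
        return first[2:].strip()
    if first.startswith("#"):
        # A plain comment on the first line selects /bin/csh.
        return "/bin/csh"
    return "/bin/sh"


print(choose_shell("#!/bin/csh -f\nset coredump=`ls |grep core`\n"))
print(choose_shell("# This is a comment line.\nsetenv DAY Mon\n"))
print(choose_shell("cd /home/username/data\ncommand arg1\n"))
```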

Resource Reservation

By default, job scheduling is based on current load conditions. In this case, it is assumed that the resources a job consumes will be reflected in the load information. However, many jobs do not consume all of the resources they require when they first start. For example, a job requiring 100MB of swap is dispatched to a host having 150MB of available swap. The job starts off initially allocating 5MB and gradually increases the amount consumed to 100MB over a period of 30 minutes. During this period, another job requiring more than 50MB of swap should not be started on the same host to avoid overcommitting swap.

Resources can be reserved as specified by a job's resource requirements or configured with queues. If resources are reserved, no other job will use the resources, even if the job for which the resources have been reserved has not consumed them. The syntax for resource reservation in the rusage section of resource requirement string is:

res=value[:res=value]...[:res=value][:duration=value][:decay=value]

Here 'res' can be any load index and 'value' is the initial reserved amount. If 'res' or 'value' is not given, the default is to not reserve that resource. 'duration' is the time period within which the specified resources should be reserved. If 'duration' is not specified then the default is to reserve the total amount for the lifetime of the job.

The 'decay' parameter indicates how the reserved amount should decrease over the duration. A value of 1 for 'decay' (for example, 'decay=1') indicates that the system should linearly decrease the amount reserved over the duration. The default decay value is 0, which causes the total amount to be reserved for the entire duration. The decay parameter is ignored if the duration is not specified. Values other than 0 or 1 for this parameter are unsupported.
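The rusage syntax and its defaults can be restated as a minimal Python parser. It is a sketch of the rules given above, not of the scheduler itself: parse_rusage is a hypothetical helper, and reading every value as an integer is an assumption.

```python
def parse_rusage(usagestring):
    """Parse 'res=value[:res=value]...[:duration=value][:decay=value]'.

    Returns (reservations, duration, decay).
    """
    reservations, duration, decay = {}, None, 0
    for item in usagestring.split(":"):
        name, sep, value = item.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or not name or not value:
            continue  # no resource or no value: nothing is reserved
        if name == "duration":
            duration = int(value)
        elif name == "decay":
            decay = int(value)
            if decay not in (0, 1):
                raise ValueError("decay values other than 0 or 1 are unsupported")
        else:
            reservations[name] = int(value)
    if duration is None:
        decay = 0  # decay is ignored when no duration is given
    return reservations, duration, decay


print(parse_rusage("tmp=30:duration=30:decay=1"))
print(parse_rusage("swap=50"))
```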

When deciding whether to schedule a job on a host, the JobScheduler system considers the reserved resources of all the jobs that have been started on that host. For each load index, the amount reserved by all jobs on that host is summed up and deducted from the current value of the resources as reported by lsload(1) to get the amount available for scheduling new jobs:

available amount = current value - reserved amount for all jobs
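As a worked instance of this formula, the Python sketch below subtracts the reservations of the jobs already started on a host from the current value of one load index. The numbers are made up for illustration and available_amount is a hypothetical helper.

```python
def available_amount(current_value, reserved_amounts):
    """available amount = current value - reserved amount for all jobs."""
    return current_value - sum(reserved_amounts)


# Illustrative values: the load index currently reports 200 units free,
# and two running jobs hold reservations of 30 and 50 units.
print(available_amount(200, [30, 50]))
```

With these values, 120 units remain available for scheduling new jobs on that host.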

Reservation of the resources 'mem' and 'swap' is handled as a special case. For these resources, the run time usage is used to determine the amount to reserve. The reserved amount is the specified amount minus the run time usage. The 'duration' and 'decay' parameters are ignored for these resources.

For example:

% bsub -R "rusage[swap=50]" my_job

will reserve 50 megabytes of swap for the job.

% bsub -R "rusage[tmp=30:duration=30:decay=1]" my_job

will reserve 30 megabytes of /tmp space for the job. As the job runs, the amount reserved will decrease at approximately 1 megabyte/minute such that the reserved amount is 0 after 30 minutes.
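The linear decay in this example can be checked with a few lines of Python. The sketch assumes that the reservation drops to zero once the duration has elapsed and that elapsed time and duration are measured in minutes, as in the example; reserved_at is a hypothetical helper.

```python
def reserved_at(value, elapsed, duration=None, decay=0):
    """Amount still reserved `elapsed` minutes after the job starts."""
    if duration is None:
        return value  # no duration: reserved for the lifetime of the job
    if elapsed >= duration:
        return 0  # the reservation period is over
    if decay == 1:
        # Linear decrease from `value` at the start to 0 at `duration`.
        return value * (duration - elapsed) / duration
    return value  # decay=0: full amount for the whole duration


for minute in (0, 10, 15, 30):
    print(minute, reserved_at(30, minute, duration=30, decay=1))
```

For rusage[tmp=30:duration=30:decay=1] this gives 30 megabytes at the start, 20 after 10 minutes, 15 after 15 minutes, and 0 after 30 minutes, which is the rate of about 1 megabyte per minute described above.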
