Chapter 7. Tracking Batch Jobs


This chapter describes the commands that report and change the status of your jobs.

Displaying Job Status

The bjobs command reports the status of LSF Batch jobs.

% bjobs 
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
3926  user1    RUN   priority   hostF        hostC      verilog    Oct 22 13:51
605   user1    SSUSP idle       hostQ        hostC      Test4      Oct 17 18:07
1480  user1    PEND  priority   hostD                   generator  Oct 19 18:13
7678  user1    PEND  priority   hostD                   verilog    Oct 28 13:08
7679  user1    PEND  priority   hostA                   coreHunter Oct 28 13:12
7680  user1    PEND  priority   hostB                   myjob      Oct 28 13:17

The -a option displays jobs that completed or exited in the recent past, along with pending and running jobs.

The -r option displays only running jobs.

The -u username option displays jobs submitted by other users. The special user name all displays jobs submitted by all users.

For example, to find out who is running jobs on which hosts, enter:

% bjobs -u all

You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names. See the bjobs(1) manual page for more information.
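If you want to post-process a bjobs listing in a script, the following Python sketch shows one way to split the default output into fields. It is only an illustration: the function names are hypothetical, and it assumes that the EXEC_HOST column is blank exactly when the job is in the PEND or PSUSP state and that the submit time always occupies the last three fields.

```python
from collections import Counter

# Sample taken from the bjobs listing shown earlier in this chapter.
SAMPLE = """\
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
3926  user1    RUN   priority   hostF        hostC      verilog    Oct 22 13:51
605   user1    SSUSP idle       hostQ        hostC      Test4      Oct 17 18:07
1480  user1    PEND  priority   hostD                   generator  Oct 19 18:13
"""

# Assumption: jobs in these states have not been dispatched, so EXEC_HOST is blank.
NOT_DISPATCHED = {"PEND", "PSUSP"}


def parse_bjobs(text):
    """Turn default bjobs output into a list of dictionaries, one per job."""
    jobs = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 9:
            continue  # blank or unexpected line
        jobid, user, stat, queue, from_host = fields[:5]
        submit_time = " ".join(fields[-3:])  # e.g. "Oct 22 13:51"
        middle = fields[5:-3]  # EXEC_HOST (if any) followed by JOB_NAME
        if stat in NOT_DISPATCHED:
            exec_host, job_name = "", " ".join(middle)
        else:
            exec_host, job_name = middle[0], " ".join(middle[1:])
        jobs.append({
            "jobid": int(jobid), "user": user, "stat": stat, "queue": queue,
            "from_host": from_host, "exec_host": exec_host,
            "job_name": job_name, "submit_time": submit_time,
        })
    return jobs


def count_by_status(jobs):
    """Count how many jobs are in each state."""
    return Counter(job["stat"] for job in jobs)
```

Because job names may contain spaces (for example, sleep 500), the sketch treats everything between the host columns and the submit time as the job name.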

Finding Pending or Suspension Reasons

When you submit a job to LSF Batch, it may be held in the queue before it starts running, and it may be suspended while running. The -p option to the bjobs command displays the reasons a job is pending. Because a job can be pending or suspended for more than one reason, every reason that contributed is reported. For example:

% bjobs -p
7678  user1    PEND  priority    hostD         verilog           Oct 28 13:08
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;

The pending reasons also give the number of hosts affected by each condition. To get the specific host names along with the pending reasons, use the -p and -l options of the bjobs command together. For example:

% bjobs -lp
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>
                     , Command <verilog>
Mon Oct 28 13:08:11:Submitted from host <hostD>,CWD <$HOME>,Requested 
Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
 Not enough job slots: hostF;
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
 loadSched   -    0.7   1.0    -      4.0    -    -     -     -      -      -
 loadStop    -    1.5   2.5    -      8.0    -    -     -     -      -      -

Note
In a cluster with many hosts (100-200 hosts), listing the host names with every pending reason may be too verbose or simply unnecessary. In that case, use the bjobs command with the -p option alone.
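The pending reason lines in both forms of output follow the same "reason: detail;" pattern, so a script can summarize them. The Python sketch below is a hypothetical helper, not part of LSF; it assumes each reason sits on one line ending in a semicolon, as in the examples above, and reduces both forms to a host count per reason.

```python
import re


def parse_pending_reasons(lines):
    """Map each pending reason to the number of hosts it applies to."""
    reasons = {}
    for line in lines:
        line = line.strip()
        if not line.endswith(";") or ":" not in line:
            continue  # not a "reason: detail;" line
        reason, _, detail = line[:-1].rpartition(":")
        detail = detail.strip()
        match = re.fullmatch(r"(\d+) hosts?", detail)
        if match:
            # Summary form from bjobs -p, e.g. "3 hosts"
            count = int(match.group(1))
        else:
            # Detailed form from bjobs -lp, e.g. "hostb, hostk, hostv"
            count = len([h for h in detail.split(",") if h.strip()])
        reasons[reason.strip()] = count
    return reasons
```

For the job shown above, both the bjobs -p lines and the bjobs -lp lines reduce to the same counts: three hosts for the resource requirements, one unreachable host, and one host without free job slots.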

The -l option to the bjobs command displays detailed information about job status and parameters, such as the job's current working directory, parameters specified when the job was submitted, and the time when the job started running.

% bjobs -l 7678
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>
                     , Command <verilog>
Mon Oct 28 13:08:11:Submitted from host <hostD>,CWD <$HOME>, Requested 
Resources <type==any && swp>35>;

PENDING REASONS:
Queue's resource requirements not satisfied:3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
 loadSched   -    0.7   1.0    -      4.0    -    -     -     -      -      -
 loadStop    -    1.5   2.5    -      8.0    -    -     -     -      -      -

The loadSched and loadStop thresholds displayed are those that apply to this job. If the job is pending, the thresholds are taken from the queue. If the job has been dispatched, each threshold is the more restrictive of the queue and execution host thresholds for that load index.
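As a rough illustration of "the more restrictive of the queue and execution host thresholds", the Python sketch below combines two values for one load index. It is a simplification under a stated assumption: for the index in question a larger value means a heavier load, so the smaller threshold is the more restrictive one. The function name is hypothetical, and None stands for an unset threshold (shown as - in the output).

```python
def effective_threshold(queue_value, host_value):
    """Return the more restrictive of two thresholds for one load index.

    Assumption: a larger value of this index means a heavier load, so the
    smaller threshold is the more restrictive one. None means "not set".
    """
    values = [v for v in (queue_value, host_value) if v is not None]
    return min(values) if values else None
```

With the queue value r1m = 0.7 from the example above and a hypothetical host threshold of 1.0, the effective threshold is 0.7. If only one of the two is set, that value applies.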

Scheduling is also affected by other queue constraints such as RES_REQ, STOP_COND, RESUME_COND, fairshare policy, and others.

The -s option displays the reasons a batch job was suspended. Because the load conditions are constantly changing, the reasons for suspension may be out of date. Once the job is suspended it does not resume execution until its scheduling conditions are met.

% bjobs -s
JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
605   user1    SSUSP idle       hostQ         hostC     Test4      Oct 17 18:07
The host load exceeded the following threshold(s):
Paging rate: pg;
Idle time: it;

In the example above, the job was suspended because the paging rate and interactive idle time on the execution host went above the suspending threshold. Even though the paging rate may have dropped back below the scheduling threshold, the job may remain suspended because of another threshold. The job does not resume until all load indices are within their scheduling thresholds.
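The rule that a job resumes only when every load index is back within its scheduling threshold can be sketched in Python as follows. This is an illustration rather than the scheduler's actual logic: the function names are hypothetical, and it assumes that for every index passed in, a larger value means a heavier load.

```python
def blocking_indices(load, load_sched):
    """Return the load indices that still exceed their scheduling threshold.

    load       -- current load values, e.g. {"r1m": 0.9, "pg": 3.0}
    load_sched -- loadSched thresholds; None means the threshold is not set
    Assumption: for every index here, a larger value means a heavier load.
    """
    return sorted(
        name
        for name, limit in load_sched.items()
        if limit is not None and load.get(name, 0.0) > limit
    )


def can_resume(load, load_sched):
    """A suspended job may resume only when no index blocks it."""
    return not blocking_indices(load, load_sched)
```

For example, with loadSched values of r1m = 0.7 and pg = 4.0, a host whose paging rate has dropped to 3.0 but whose r1m is still 0.9 keeps the job suspended, and blocking_indices reports r1m as the remaining cause.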

Monitoring Resource Consumption of Jobs

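Jobs submitted through the LSF Batch system have the resources they consume monitored while they are running. The -l option of the bjobs command displays the current resource usage of the job. As the example below shows, this job-level information includes the CPU time used, the memory (MEM) and swap (SWAP) usage, and the process group ID (PGID) and process IDs (PIDs) of the job.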

The job-level resource usage information is updated at a maximum frequency of every SBD_SLEEP_TIME seconds (see 'The lsb.params File' of the LSF Administrator's Guide for the value of SBD_SLEEP_TIME). The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update or if a new process or process group has been created.

% bjobs -l 1531
Job Id <1531>, User <user1>, Project <default>, Status <RUN>, Queue <priority> 
                     Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA>, CWD <$HOME>, Specified Hosts 
                     <hostD>;
Fri Dec 27 13:04:19: Started on <hostD>, Execution Home </home/user1>, Executio
                     n CWD </home/user1>;
Fri Dec 27 13:05:00: Resource usage collected.
                     The CPU time used is 2 seconds.
                     MEM: 147 Kbytes;  SWAP: 201 Kbytes
                     PGID: 8920;  PIDs: 8920 8921 8922 
 SCHEDULING PARAMETERS:
              r15s   r1m   r15m   ut    pg    io    ls    it    tmp   swp   mem
 loadSched    -      -     -      -     -     -     -     -     -     -     -
 loadStop     -      -     -      -     -     -     -     -     -     -     -
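The 10 percent update rule described above can be expressed as a small Python sketch. It is one reading of that rule, not LSF source code: the function and key names are hypothetical, and the change is measured relative to the previously reported value.

```python
def changed_by_more_than(previous, current, fraction=0.10):
    """True if current differs from previous by more than the given fraction."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / previous > fraction


def should_report_usage(previous, current, new_process=False):
    """Decide whether a new resource usage record would be reported.

    previous, current -- dicts with "cpu_time", "mem" and "swap" values
    new_process       -- True if a new process or process group appeared
    """
    if new_process:
        return True
    return any(
        changed_by_more_than(previous[key], current[key])
        for key in ("cpu_time", "mem", "swap")
    )
```

Starting from the values in the example (2 seconds of CPU time, 147 Kbytes of memory, 201 Kbytes of swap), a rise in memory to 150 Kbytes alone would not trigger an update under this reading, while a rise in swap to 230 Kbytes would.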

Displaying Job History

Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended and running time of batch jobs. The -l option of the bhist command prints the time information and a complete history of the scheduling events for each job.

% bhist -l 1531
Job Id <1531>, User <user1>, Project <default>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$HOM
                     E>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD <
                     /home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended:  Waiting for re-scheduling after being resumed 
                     by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.

Summary of time in seconds spent in various states by Fri Dec 27 13:07:52 1996
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  5        0        205      7        1        0        218
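The summary at the end of the bhist -l output is easy to read into a script. The Python sketch below, with hypothetical function names, pairs the two summary lines and checks that the per-state times add up to TOTAL.

```python
def parse_bhist_summary(header_line, value_line):
    """Pair the state names with the seconds spent in each state."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    return dict(zip(names, values))


def states_add_up(summary):
    """Check that the per-state times sum to the TOTAL column."""
    states_total = sum(v for name, v in summary.items() if name != "TOTAL")
    return states_total == summary["TOTAL"]
```

For job 1531 above, the times are 5 + 0 + 205 + 7 + 1 + 0 = 218 seconds, which matches the TOTAL column.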

The -J job_name option of the bhist command displays the history of all LSF Batch jobs with the specified job name. Job names are assigned with the -J job_name option of the bsub command.

LSF keeps job history information after the job exits, so you can look at the history of jobs that completed in the past. The length of the history depends on how often the LSF administrator cleans up the log files.

By default, bhist only displays job history from the current event log file. The -n option to the bhist command allows users to display the history of jobs that completed a long time ago, and are no longer listed in the active event log.

The LSF Batch system periodically backs up and prunes the job history log. The -n num_logfiles option tells the bhist command to search through the specified number of log files instead of only searching the current log file. Log files are searched in reverse time order; for example, the command bhist -n 3 searches the current event log file and then the two most recent backup files.
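The search order for -n can be pictured with a short Python sketch. The file names are an assumption made for illustration (a current log named lsb.events, with backups numbered from 1 for the most recent); check with your LSF administrator for the names used at your site.

```python
def event_logs_to_search(num_logfiles, current="lsb.events"):
    """List the log files searched for bhist -n num_logfiles, newest first.

    Assumption: backups are named <current>.1, <current>.2, ... with .1
    being the most recent backup.
    """
    if num_logfiles < 1:
        raise ValueError("num_logfiles must be at least 1")
    return [current] + [f"{current}.{i}" for i in range(1, num_logfiles)]
```

Under that naming assumption, bhist -n 3 would look at lsb.events, then lsb.events.1, then lsb.events.2.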

Checking Partial Job Output

The output from an LSF Batch job is normally not available until the job is finished. However, LSF Batch provides the bpeek command for you to look at the output the job has produced so far. By default, bpeek shows the output from the most recently submitted job; you can also select the job by queue or execution host, or specify the job ID or job name on the command line.

% bpeek 1234
<< output from stdout >>
Starting phase 1
Phase 1 done
Calculating new parameters
...

Only the job owner can use bpeek to see job output. The bpeek command will not work on a job running under a different user account.

You can use this command to check if your job is behaving as you expected and kill the job if it is running away or producing unusable results. This could save you time.

Displaying Queue and Host Status

The bqueues and bhosts commands display the number of jobs in a queue or dispatched to a host. For more information on these commands see 'Batch Queues' and 'Batch Hosts'.

Killing Jobs

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, bkill sends the SIGKILL signal to running jobs. For example, to kill job 3421 enter:

% bkill 3421
Job <3421> is being terminated

LSF sends SIGINT and SIGTERM 20 seconds before it sends SIGKILL, to give the job a chance to catch the signals and clean up. The signals are forwarded from the mbatchd to the sbatchd. The sbatchd waits for the job to exit before reporting the status. Because of this, bjobs may still report that the job is running for a few seconds.
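A job can use the interval before SIGKILL to clean up after itself. The following minimal sketch assumes the job is a Python program; the handler and function names are hypothetical, and the handler only records the signal where a real job would flush output or remove scratch files.

```python
import signal

# Names of the termination signals seen so far (for illustration only).
received = []


def handle_termination(signum, frame):
    """Record the signal; a real job would do its cleanup here."""
    received.append(signal.Signals(signum).name)


def install_cleanup_handlers():
    """Catch SIGINT and SIGTERM, the signals sent ahead of SIGKILL."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handle_termination)
```

SIGKILL itself cannot be caught, so any cleanup has to finish within the interval that follows SIGINT and SIGTERM. Note that Python only allows signal handlers to be installed from the main thread.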

To send an arbitrary signal to your job, use the -s option of the bkill command. You can specify either the signal name or the signal number. On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) manual page.

% bkill -s TSTP 3421
Job <3421> is being signaled

This example sends the TSTP signal to job 3421.

Note
Signal numbers are translated across different platforms because different operating systems may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued. For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on an HP-UX host, where SIGTSTP is defined as signal number 25, signal 25 is sent to the job.

Only the owner of a batch job or an LSF administrator can send signals to a job.
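The signal number translation described in the note above amounts to a lookup by name. The Python sketch below models it with a small table containing only the two values given in the note; the table and function name are hypothetical and are not an LSF interface.

```python
# Signal numbers per platform, limited to the values quoted in the note.
SIGNAL_NUMBERS = {
    "SunOS 4.x": {"SIGTSTP": 18},
    "HP-UX": {"SIGTSTP": 25},
}


def translate_signal(number, from_platform, to_platform):
    """Map a signal number on the submitting host to the execution host.

    The number is first resolved to a name on the platform where bkill
    runs, then the name is looked up on the platform where the job runs.
    """
    names = {num: name for name, num in SIGNAL_NUMBERS[from_platform].items()}
    name = names[number]
    return name, SIGNAL_NUMBERS[to_platform][name]
```

Here translate_signal(18, "SunOS 4.x", "HP-UX") resolves 18 to SIGTSTP and returns signal number 25 for the HP-UX host. Using the signal name with bkill -s avoids having to think about the numbers at all.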

Suspending and Resuming Jobs

The bstop and bresume commands are convenient aliases for bkill -s, sending the SIGSTOP/SIGTSTP and SIGCONT signals respectively.

Note
You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF Batch does allow you to kill, suspend and resume pending jobs.

To suspend job 3421, enter:

% bstop 3421
Job <3421> is being stopped

bstop sends the SIGSTOP signal to sequential jobs and SIGTSTP to parallel jobs. SIGTSTP is sent to a parallel job so the master process can trap the signal and pass it to all the slave processes running on other hosts.

To resume the same job, enter:

% bresume 3421
Job <3421> is being resumed

Suspending a job causes your job to go into USUSP state if the job is already started, or to go into PSUSP state if your job is pending. Resuming a user suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
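The state changes described above can be summarized as a small transition table. The Python sketch below uses hypothetical function names and covers only the cases named in this section; the PSUSP to PEND transition on resume is an assumption, since the text spells out only the case of a job that was running before it was suspended.

```python
def after_bstop(state):
    """State of a job after bstop, for the cases described in this section."""
    if state == "RUN":
        return "USUSP"  # job had already started
    if state == "PEND":
        return "PSUSP"  # job was still pending
    raise ValueError(f"bstop not modelled for state {state!r}")


def after_bresume(state):
    """State of a job immediately after bresume."""
    if state == "USUSP":
        # Not RUN: the job waits in SSUSP until load conditions allow it to run.
        return "SSUSP"
    if state == "PSUSP":
        return "PEND"  # assumption: the job returns to the pending state
    raise ValueError(f"bresume not modelled for state {state!r}")
```

This matches the bhist output for job 1531 shown earlier, where the resumed job is reported as suspended and waiting for re-scheduling before it runs again.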

Moving Jobs Within and Between Queues

The btop and bbot commands move pending jobs within a queue. btop moves jobs toward the top of the queue, so that they are dispatched before other pending jobs. bbot moves jobs toward the bottom of the queue so that they are dispatched later. The default behaviour is to move the job as close to the top or bottom of the queue as possible. By specifying a position on the command line, you can move a job to an arbitrary position relative to the top or bottom of the queue.

The btop and bbot commands do not allow users to move their own jobs ahead of those submitted by other users; only the dispatch order of the user's own jobs is changed. Only an LSF administrator can move one user's job ahead of another.

Note
The btop and bbot commands have no effect on the job dispatch order when fairshare policies are used.

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5309  user2 PEND  night    hostA                 sleep 200  Oct 23 11:04
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17

% btop 5311
Job <5311> has been moved to position 1 from top.

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45
5309  user2 PEND  night    hostA                 sleep 200  Oct 23 11:04

Note that user1's job is still in the same position in the queue. User2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up the queue, the rest of that user's jobs move down.
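One way to picture this behaviour is that btop reorders only the positions already held by the same user's pending jobs. The Python sketch below models the example above under that reading; it is not LSF's actual algorithm, and the function name and data layout are hypothetical.

```python
def btop(pending, job_id):
    """Move job_id ahead of the same user's other pending jobs.

    pending -- list of (job_id, user) tuples in dispatch order
    Only the slots already held by that user are reordered, so jobs
    belonging to other users keep their positions.
    """
    owner = next(user for jid, user in pending if jid == job_id)
    slots = [i for i, (_, user) in enumerate(pending) if user == owner]
    own_jobs = [pending[i] for i in slots]
    target = next(job for job in own_jobs if job[0] == job_id)
    reordered = [target] + [job for job in own_jobs if job[0] != job_id]
    result = list(pending)
    for slot, job in zip(slots, reordered):
        result[slot] = job
    return result
```

Applied to the pending jobs 5309 (user2), 5310 (user1) and 5311 (user2), moving 5311 gives the order 5311, 5310, 5309, which is the order shown by the second bjobs listing.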

The bswitch command switches pending and running jobs from queue to queue. This is useful if you submit a job to the wrong queue, or if the job is suspended because of the queue thresholds or run windows and you would like to resume the job.

% bswitch priority 5309
Job <5309> is switched to queue <priority>

% bjobs -u all
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME
5308  user2 RUN   normal   hostA      hostD      sleep 500  Oct 23 10:16
5309  user2 RUN   priority hostA      hostB      sleep 200  Oct 23 11:04
5311  user2 PEND  night    hostA                 sleep 700  Oct 23 18:17
5310  user1 PEND  night    hostB                 myjob      Oct 23 13:45

Job Parameter Modification

The parameters associated with a job can be modified after the job has been submitted by using the bmodify command. The parameters of a job can be modified only while the job is in pending status.

The bmodify command takes the same options as the bsub command together with a job ID. (See 'Submitting Batch Jobs'.) The given options replace the existing options of the specified job.

To reset an option to its default value, append the n character to the option name. No option value should be specified when resetting an option. For example:

% bmodify -bn 123

Job 123 will be dispatched as soon as possible, ignoring any previously specified start time. The following example shows how bmodify is used to change the start time to 2 A.M.:

% bmodify -b 2:00 123

Note
The job command line itself and the environment variables present at submission time cannot be modified. In versions of LSF prior to V3.0, the shell option set with the -L argument also could not be modified.

Job Tracking and Manipulation Using xlsbatch

Most of the operations discussed in this chapter can also be performed using the xlsbatch GUI. The main window of xlsbatch is shown in 'Figure 4. xlsbatch Main Window'.

You can view job details by first selecting a job and then clicking the 'Detail' button. The resulting popup window is shown in Figure 11. This gives you the same information as the bjobs -l command.

Figure 11. Detailed Job Information Popup Window

The 'History' button opens a popup window with the job history that you would otherwise get from the bhist command.

To perform control actions on jobs, such as killing a job or suspending/resuming a job, simply select the job and then click on an action button.

You can also invoke the xbsub window from inside xlsbatch to submit new jobs. If you want to modify a job parameter, select the job and click the 'Modify' button to get the job modification popup window. Note that this window can also be invoked by running xbmodify from the command line. Figure 12 shows the xbmodify window. This window is similar to the xbsub window except that the command line is read-only.

Figure 12. xbmodify Window


doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.