

Chapter 11. Checkpointing and Migration


Checkpointing allows you to take a snapshot of the state of a running batch job, so that the job can be restarted later. There are two main reasons for checkpointing a job: fault tolerance and load balancing.

A batch job can be checkpointed periodically during its run. If the execution host goes down for any reason, the job can resume execution from the last checkpoint when the host comes back up, rather than having to start from the beginning. The job can also be restarted on a different host if the original execution host remains unavailable.

Sometimes one host is overloaded while the others are idle or lightly loaded. LSF Batch can checkpoint one or more jobs on the overloaded host and restart the jobs on other idle or lightly loaded hosts. Job migration can also be used to move intensive jobs away from hosts with interactive users such as desktop workstations, so that the intensive jobs do not interfere with the users.


Approaches to Checkpointing

Checkpointing can be implemented at three levels. These levels are called kernel-level, user-level and application-level.

LSF Batch uses the kernel-level support available on ConvexOS, Cray (UNICOS), and SPP-UX systems to implement checkpointing on these machines, and provides user-level support on most other platforms (HP-UX 9.x, DEC Alpha, SGI IRIX 5 and 6, Solaris 2.x, and SunOS 4). It also supports application-level checkpointing.

Kernel-level Checkpointing

In kernel-level checkpointing, the operating system supports checkpointing and restarting processes. The checkpointing is transparent to applications: they do not have to be modified or linked with any special library to support checkpointing.

User-level Checkpointing

Application programs to be checkpointed are linked with a checkpoint user library. To take a checkpoint, a checkpoint triggering signal is sent to the process. The functions in the checkpoint library respond to the signal and save the information necessary to restart the process. On restart, the functions in the checkpoint library restore the execution environment for the process. Checkpointing is transparent to applications, but unlike kernel-level support, applications must be relinked to allow checkpointing.

Application-level Checkpointing

Applications can be coded in a way to checkpoint themselves either periodically or in response to signals sent by other processes. When restarted, the application must look for the checkpoint files and restore its state.
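
As a minimal sketch of the periodic form of application-level checkpointing, a program might save its own state at intervals and look for that state when it starts. The file name state.json, the function names, and the counting workload are all hypothetical; none of them is defined by LSF.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name, not defined by LSF


def load_state(chkpnt_dir):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    path = os.path.join(chkpnt_dir, STATE_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(chkpnt_dir, state):
    """Write to a temporary file, then rename it into place, so a crash
    during the write never leaves a truncated checkpoint behind."""
    os.makedirs(chkpnt_dir, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=chkpnt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(chkpnt_dir, STATE_FILE))


def run(chkpnt_dir, steps, every=10, stop_after=None):
    """Sum 0..steps-1, checkpointing every `every` steps.

    `stop_after` simulates a host failure after that many steps.
    """
    state = load_state(chkpnt_dir)  # resume from the last checkpoint, if any
    done = 0
    while state["next_step"] < steps:
        if stop_after is not None and done == stop_after:
            return None  # simulated crash: the in-memory state is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done += 1
        if state["next_step"] % every == 0:
            save_state(chkpnt_dir, state)
    save_state(chkpnt_dir, state)
    return state["total"]
```

Calling run() a second time with the same directory after a simulated failure repeats only the steps taken since the last saved checkpoint.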

Checkpoint Directory

The checkpoint directory stores the checkpoint files needed to recreate a checkpointed job. LSF allows more than one job to be checkpointed into the same directory. Each job is saved in a different subdirectory named after its job ID.

LSF Batch automatically renames chkpntdir/jobId to chkpntdir/newjobId after a job is restarted. A conflict may occur when a directory named chkpntdir/newjobId already exists, that is, when the job ID of the restarted job is the same as the name of an existing subdirectory. In this case, LSF Batch first renames the existing directory to chkpntdir/newjobId.bak.

If the conflict cannot be resolved, sbatchd reports it to mbatchd and the job is aborted.
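
The renaming rule can be illustrated with a short sketch. This is not LSF code: the function name is hypothetical, and treating an already existing .bak directory as the unresolvable case is an assumption made for the example.

```python
import os


def rename_for_restart(chkpnt_dir, old_job_id, new_job_id):
    """Move chkpnt_dir/old_job_id to chkpnt_dir/new_job_id.

    If the target directory already exists, it is first moved aside to
    chkpnt_dir/new_job_id.bak.
    """
    src = os.path.join(chkpnt_dir, str(old_job_id))
    dst = os.path.join(chkpnt_dir, str(new_job_id))
    if os.path.exists(dst):
        backup = dst + ".bak"
        if os.path.exists(backup):
            # Assumption: this is what "cannot be resolved" looks like here.
            raise FileExistsError(backup)
        os.rename(dst, backup)
    os.rename(src, dst)
    return dst
```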

Uniform Checkpointing Interface

All interaction between LSF and the platform-dependent or application-dependent checkpoint facility goes through a common interface provided by two executables: echkpnt and erestart (see note 1 at the end of this chapter). echkpnt is invoked when LSF needs to checkpoint a job. erestart is invoked when LSF needs to restart a previously checkpointed job.

For systems supporting kernel-level checkpointing, echkpnt and erestart are just wrapper scripts for the vendor's checkpointing commands. User-level checkpointing on the supported platforms uses versions of echkpnt and erestart that interact with the supplied checkpoint library.

You can follow the format given below to extend or replace echkpnt and erestart for other checkpoint facilities, including application-level checkpointing.

The echkpnt Command

The echkpnt command must support the following arguments and take appropriate actions to perform the checkpointing:

echkpnt [-c] [-f] [-k | -s] [-d chkpnt_dir] process-group-id
-c
Copy all regular files in use by the checkpointed process to the checkpoint directory.
-f
Checkpoint the job even if non-checkpointable conditions exist (non-checkpointable conditions are specific to the type of checkpoint facility being used). This may create checkpoint files that will not restart properly.
-k
Kill the job if the checkpoint operation is successful. Default is that the job continues execution after being checkpointed. If this option is specified and the checkpoint fails for any reason, the job continues normal execution.
-s
Stop the job if the checkpoint operation is successful. Default is that the job continues execution after being checkpointed. If this option is specified and the checkpoint fails for any reason, the job continues normal execution.
-d chkpnt_dir
The directory containing the checkpoint files. Default is the current directory.
process-group-id
The process group ID of the job to be checkpointed.
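
A replacement echkpnt has to accept exactly this argument list. The sketch below parses it with Python's argparse module; the function name is hypothetical, and the code only reads the arguments without performing any checkpoint.

```python
import argparse


def parse_echkpnt_args(argv):
    """Parse the echkpnt argument list described above."""
    parser = argparse.ArgumentParser(prog="echkpnt")
    parser.add_argument("-c", action="store_true",
                        help="copy regular files in use to the checkpoint directory")
    parser.add_argument("-f", action="store_true",
                        help="checkpoint even if non-checkpointable conditions exist")
    # -k and -s are alternatives, so at most one of them may be given.
    after = parser.add_mutually_exclusive_group()
    after.add_argument("-k", action="store_true",
                       help="kill the job if the checkpoint succeeds")
    after.add_argument("-s", action="store_true",
                       help="stop the job if the checkpoint succeeds")
    parser.add_argument("-d", dest="chkpnt_dir", default=".",
                        help="checkpoint directory (default: current directory)")
    parser.add_argument("process_group_id", type=int)
    return parser.parse_args(argv)
```

Giving both -k and -s makes argparse print a usage error and exit, which matches the either-or form of the synopsis.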

The erestart Command

The erestart command must support the following arguments and take appropriate actions to perform the restart:

erestart [-c] [-f] chkpnt_dir
-c
Copy data files from the checkpoint directory to the original pathname. If any of the process's data files are copied into the checkpoint directory at the time of checkpoint, this option will cause restart to replace the contents of the original files with the copies. This option is currently only supported on ConvexOS.
-f
Force a restart of the process even if non-restartable conditions exist.
chkpnt_dir
Checkpoint directory.

When LSF Batch calls erestart, it first passes the process ID and process group ID of the original job in the environment variables LSB_RESTART_PID and LSB_RESTART_PGID, respectively. It then waits for a message on the standard error of erestart before proceeding. On success, erestart sends back the message "pid=NEW_PID pgid=NEW_PGID" if the process ID or process group ID changed on restart, or a null message (by closing the standard error) if neither changed. On failure, LSF Batch expects erestart to write its error messages to the standard error.
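
The exchange can be sketched as three small helpers. The function names are hypothetical, and details such as trailing newlines and the treatment of any other text as an error report are assumptions; only the environment variable names and the "pid=... pgid=..." form come from the description above.

```python
import os
import re


def original_ids(environ=os.environ):
    """Read the IDs of the original job as passed in by LSF Batch."""
    return int(environ["LSB_RESTART_PID"]), int(environ["LSB_RESTART_PGID"])


def restart_message(old, new):
    """Return the text erestart writes to standard error on success.

    `old` and `new` are (pid, pgid) pairs. An empty string stands for the
    null message, sent by closing standard error without writing anything.
    """
    if new == old:
        return ""
    return "pid=%d pgid=%d" % new


def parse_restart_message(text, old):
    """Interpret the message as the caller would and return (pid, pgid)."""
    text = text.strip()
    if text == "":
        return old  # null message: the IDs did not change
    match = re.fullmatch(r"pid=(\d+) pgid=(\d+)", text)
    if match is None:
        # Assumption: anything else is an error report from erestart.
        raise ValueError("erestart reported an error: " + text)
    return int(match.group(1)), int(match.group(2))
```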

Submitting Checkpointable Jobs

Checkpointable jobs are submitted using the -k "checkpoint_dir[period]" option of the bsub command. The checkpoint_dir parameter specifies the directory where the checkpoint files are created. The period parameter specifies an optional time interval in minutes for automatic and periodic checkpointing. If the period parameter is specified, the checkpoint directory and period must be enclosed in quotes.

If the checkpoint period is not specified, the job is considered checkpointable but it is not automatically checkpointed. A checkpoint can be created for any checkpointable batch job using the bchkpnt command. If the checkpoint period is specified, LSF Batch automatically checkpoints the job at the specified time interval.

% bsub -k "io.chkdir 10" io_job
Job <3426> is submitted to default queue <normal>.
% bjobs -l

Job Id <3426>, User <user2>, Status <RUN>, Queue <normal>, Command <io_job>
Thu Oct 24 16:50:27: Submitted from host <hostB>, CWD <$HOME/tmp>,
                     Checkpoint period 10 min., Checkpoint directory
                     <io.chkdir/3426>;
Thu Oct 24 16:50:28: Started on <hostB>;

           r15s   r1m  r15m   ut     pg   io   ls    it    tmp    swp    mem
 loadSched   -    0.7   1.0    -    4.0    -   12     -     0M     -      -
 loadStop    -    1.5   2.5    -    8.0    -   15     -     -      -      -

This example submits the checkpointable batch job io_job. Checkpoint files will be stored in the directory io.chkdir/3426 (relative to the current directory), and checkpoints will automatically be created every 10 minutes, each overwriting the previous one.

Note
It is your responsibility to clean up the checkpoint directory when it is no longer needed.

If the checkpoint directory does not exist, LSF Batch creates it. If it does exist, it must be writable by the user. To restart a job on a different host, the checkpoint files must be stored in a directory accessible to both hosts.
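
When bsub is started from a program rather than typed at a shell, the quoting rule means that the directory and period must travel as a single argument. A minimal sketch, using a hypothetical helper name, that builds such an argument list without running anything:

```python
import shlex


def bsub_checkpoint_args(command, chkpnt_dir, period=None):
    """Build an argument list for submitting a checkpointable job.

    `period` is the optional automatic checkpoint interval in minutes.
    The directory and period form ONE argument after -k, which is what the
    quotes achieve on the command line.
    """
    k_value = chkpnt_dir if period is None else "%s %d" % (chkpnt_dir, period)
    return ["bsub", "-k", k_value] + shlex.split(command)
```

The resulting list could be handed to a process-spawning call such as subprocess.run on a host where bsub is installed.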

Checkpointing a Job

In addition to automatic checkpointing, users can checkpoint jobs explicitly with the bchkpnt command. A job is checkpointable only if it was submitted with the -k option to bsub.

The -p period option to bchkpnt allows you to set or change the checkpointing period for a job. period is specified in minutes. If period is 0 (zero), periodic checkpointing is turned off. Otherwise, periodic checkpointing is turned on and the checkpoint period is set to the value given.

Some jobs cannot be checkpointed and restarted correctly because it is not possible to recreate the running state of the job. See 'Limitations' and the chkpnt(1) manual page on ConvexOS and Cray systems for a list of conditions that can cause checkpointing to fail.

The -f option to bchkpnt forces the job to be checkpointed, even if some condition exists that would normally make the job non-checkpointable. When a non-checkpointable job is checkpointed using the -f flag, the job may not be restarted correctly.

The -k option to bchkpnt checkpoints and kills the batch job as an atomic action. This guarantees that the job does not do any processing or I/O after the checkpoint, so that the restarted job does not repeat any operations already performed by the original job.

% bchkpnt -f -p 15
Job <3426> is being checkpointed
% bjobs -l
Job Id <3426>, User <user2>, Status <RUN>, Queue <normal>, Command <io_job>
Thu Oct 24 16:50:27: Submitted from host <hostB>, CWD <$HOME/tmp>,
                     Checkpoint period 15 min., Checkpoint directory
                     <io.chkdir/3426>;
Thu Oct 24 16:50:28: Started on <hostB>;

           r15s   r1m  r15m   ut     pg   io   ls    it    tmp    swp    mem
 loadSched   -    0.7   1.0    -    4.0    -   12     -     0M     -      -
 loadStop    -    1.5   2.5    -    8.0    -   15     -     -      -      -
% bhist -l
Job Id <3426>, User <user2>, Command <io_job>
Thu Oct 24 16:50:27: Submitted from host <hostB> to Queue <normal>, CWD
                     <$HOME/tmp>, Checkpoint period 10 min., Checkpoint
                     directory <io.chkdir/3426>;
Thu Oct 24 16:50:28: Started on <hostB>, Pid <4705>;
Thu Oct 24 16:51:21: Checkpoint initiated (actpid 4717); checkpoint period
                     is 15 min.;
Thu Oct 24 16:51:29: Checkpoint succeeded (actpid 4717).

Summary of time in seconds spent in various states by Thu Oct 24 16:51:36 1996
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        68       0        0        0        69

Restarting a Checkpointed Job

A checkpointed batch job can be restarted on a host of the same type as, and running the same operating system version as, the original execution host. The executable being run and all open files must be located at the same absolute path name.

In addition to automatic job restart, specified with the -r option to bsub, LSF Batch provides the brestart command that restarts a checkpointed batch job from the information stored in the checkpoint directory. The checkpoint directory must have been created by a previously submitted job that was checkpointed successfully.

When you restart a job with the brestart command, specify the checkpoint directory and, if required, the job ID of the checkpointed job. The command has the form:

brestart chkpntDir [lastJobId]

where lastJobId must be specified if there is more than one jobId directory under chkpntDir.
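
The rule for when lastJobId is needed can be sketched as a directory lookup. The function name is hypothetical, and counting only all-digit subdirectory names as job directories is an assumption of the example.

```python
import os


def pick_job_dir(chkpnt_dir, last_job_id=None):
    """Return the job subdirectory a restart would read from."""
    # Assumption: job directories are the subdirectories with numeric names.
    job_ids = sorted(
        name for name in os.listdir(chkpnt_dir)
        if name.isdigit() and os.path.isdir(os.path.join(chkpnt_dir, name))
    )
    if last_job_id is not None:
        if str(last_job_id) not in job_ids:
            raise FileNotFoundError("no checkpoint for job %s" % last_job_id)
        return os.path.join(chkpnt_dir, str(last_job_id))
    if len(job_ids) != 1:
        # With zero or several candidates the job ID cannot be inferred.
        raise ValueError("lastJobId is required: found %d job directories"
                         % len(job_ids))
    return os.path.join(chkpnt_dir, job_ids[0])
```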

brestart creates a new batch job that uses the checkpoint information. The restarted job is assigned a new job ID number, and is placed at the end of the specified queue. The job begins executing when a suitable host is available, like any other batch job.

The brestart command takes many of the same options as the bsub command to specify the conditions under which the restarted job runs. The restarted job has the same output file and file transfer specifications, job name, run window signal value, checkpoint directory and period, and rerunability as the original job.

% brestart io.chkdir 3426
Job <3427> is submitted to default queue <normal>.
% bjobs -l
Job Id <3427>, User <user2>, Status <RUN>, Queue <normal>, Command <io_job>
Thu Oct 24 17:05:01: Submitted from host <hostB>, CWD <$HOME/tmp>,
                     Checkpoint directory <io.chkdir/3427>;
Thu Oct 24 17:05:01: Started on <hostB>;

           r15s   r1m  r15m   ut     pg   io   ls    it    tmp    swp    mem
 loadSched   -    0.7   1.0    -    4.0    -   12     -     0M     -      -
 loadStop    -    1.5   2.5    -    8.0    -   15     -     -      -      -

Job Migration

Checkpointable jobs and rerunable jobs (jobs that are submitted with the bsub -r option) can be migrated to another host for execution if the current host is too busy or the host is going to be shut down. Such jobs can be moved from one host to another, as long as both hosts are binary compatible and run the same version of the operating system.

The job's owner or the LSF administrator can use the bmig command to migrate jobs. If the job is checkpointable, the bmig command first checkpoints it. Then LSF kills the running or suspended job, and restarts or reruns the job on another suitable host if one is available. If LSF is unable to rerun or restart the job immediately, the job reverts to PEND status and is requeued with a higher priority than any submitted job, so it is rerun or restarted before other queued jobs are dispatched.

% bmig 3426
Job <3426> is being migrated
% bhist -l 3426

Job Id <3426>, User <user2>, Command <io_job>
Thu Oct 24 16:50:27: Submitted from host <hostB> to Queue <normal>, CWD
                     <$HOME/tmp>, Checkpoint period 10 min., Checkpoint
                     directory <io.chkdir/3426>;
Thu Oct 24 16:50:28: Started on <hostB>, Pid <4705>;
Thu Oct 24 16:51:21: Checkpoint initiated (actpid 4717); checkpoint period
                     is 15 min.;
Thu Oct 24 16:51:29: Checkpoint succeeded (actpid 4717);
Thu Oct 24 16:53:42: Migration requested;
Thu Oct 24 16:54:03: Migration checkpoint initiated (actpid 4746);
Thu Oct 24 16:54:15: Migration checkpoint succeeded (actpid 4746);
Thu Oct 24 16:54:15: Pending: Migrating job is waiting for reschedule;
Thu Oct 24 16:55:16: Started on <lemon>, Pid <10354>.

Summary of time in seconds spent in various states by Thu Oct 24 16:57:26 1996
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  62       0        357      0        0        0        419

Queues and Hosts for Automatic Job Migration

LSF Batch will not automatically migrate a job unless the job has a migration threshold. A job has a migration threshold if it is dispatched from a queue or to a host that has a migration threshold defined. The bqueues -l and bhosts -l commands display the migration threshold if it is defined.

Automatically Rerunning and Restarting Jobs

Batch jobs submitted with the -r option to the bsub command are automatically rerun or restarted if the execution host becomes unavailable. (A host is considered unavailable if both the LIM and sbatchd daemons are unreachable.) If the job has not been checkpointed, the job is rerun from the beginning. If the job has been checkpointed, either automatically or by the bchkpnt command, it is restarted from the last checkpoint.

When a job is rerun or restarted, it is returned to the batch queue in which it was executing, with the same options as the original batch job but a higher priority. The job keeps the same job ID number. It is executed when a suitable host is available, and an email message is sent to the job submitter informing the user of the restart.
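
The choice between rerunning and restarting can be written as a small decision function. This is a sketch of the rule as described in this section, with a hypothetical function name and hypothetical return strings; it is not how LSF Batch is implemented.

```python
def recovery_action(submitted_with_r, lim_reachable, sbatchd_reachable,
                    has_checkpoint):
    """Return "restart", "rerun", or "none" for a job on a given host."""
    # A host counts as unavailable only when BOTH daemons are unreachable.
    host_unavailable = not lim_reachable and not sbatchd_reachable
    if not submitted_with_r or not host_unavailable:
        return "none"
    # A checkpointed job resumes from its last checkpoint; any other
    # rerunable job starts again from the beginning.
    return "restart" if has_checkpoint else "rerun"
```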

Submitting a Job for Automatic Migration

If you want to submit a rerunable job and have the system automatically migrate it to another host when the job is suspended due to load, you must submit the job to a queue or a host configured for automatic job migration.

If a job with a migration threshold has been suspended for more than the specified number of minutes, LSF Batch attempts to migrate the job to another host. Only checkpointable or rerunable jobs are considered for migration. To submit a rerunable job, you must use the -r option to the bsub command.

If you want a job to be rerunable but you do not want the system to migrate it automatically, submit the job to a queue or host that does not have a migration threshold defined. You can still migrate the job manually with the bmig command.
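
The conditions for automatic migration can be summarized in a sketch. The function name is hypothetical, and using the lower of the two values when both the queue and the host define a threshold is an assumption; this chapter does not say which one applies.

```python
def should_migrate(suspended_minutes, checkpointable, rerunable,
                   queue_threshold=None, host_threshold=None):
    """Decide whether a suspended job is a candidate for migration.

    Thresholds are in minutes; None means no threshold is defined.
    """
    thresholds = [t for t in (queue_threshold, host_threshold) if t is not None]
    if not thresholds:
        return False  # no migration threshold: never migrated automatically
    if not (checkpointable or rerunable):
        return False  # only these two kinds of job are considered
    # Assumption: when both thresholds are defined, the lower one applies.
    return suspended_minutes > min(thresholds)
```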

Building Checkpointable Jobs

The LSF checkpoint library provides user-level checkpointing facilities on operating systems that do not provide kernel checkpointing. It consists of three parts: the checkpoint library, the checkpoint startup routine, and the checkpoint linkers.

Programs to be checkpointed must be linked with the checkpoint startup routine and library. The checkpoint linkers are shell scripts that call the standard linkers on your operating system with the correct options to link a checkpointable program.

The LSF user-level checkpoint library is based on the Condor system from the University of Wisconsin.

The Checkpoint Library

The checkpoint library consists of a set of stubs that intercept file system calls for file operations such as open(), close(), and dup(). It also contains a checkpoint signal handler and routines used internally to implement checkpointing.

The Checkpoint Startup Routine

The startup routine sets the checkpoint signal handler and initializes some data structures. The data structures are used to record file accesses of the process and are used during the restart to re-open files opened before the checkpoint and to restore the current access positions in those files. The signal handler settings of the process are also saved at the checkpoint time and restored when the process is restarted. All these operations are transparent to applications.
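
The bookkeeping described above, recording file accesses and restoring file positions on restart, can be illustrated at a much higher level in Python. The class and method names are hypothetical, and the sketch is not the library's implementation: it ignores close(), dup(), and signal handler settings.

```python
class FileTable:
    """Records files opened through it so they can be reopened on restart."""

    def __init__(self):
        self._files = {}  # path -> (file object, mode)

    def open(self, path, mode="r"):
        handle = open(path, mode)  # the built-in open, not this method
        self._files[path] = (handle, mode)
        return handle

    def get(self, path):
        return self._files[path][0]

    def snapshot(self):
        """Describe every tracked file as plain data: path, mode, position."""
        return [{"path": path, "mode": mode, "offset": handle.tell()}
                for path, (handle, mode) in self._files.items()]

    @classmethod
    def restore(cls, records):
        """Reopen each recorded file and seek to its saved position."""
        table = cls()
        for record in records:
            mode = record["mode"]
            if "w" in mode:
                # Reopening with "w" would truncate the data written before
                # the checkpoint, so reopen for update instead.
                mode = "r+b" if "b" in mode else "r+"
            table.open(record["path"], mode).seek(record["offset"])
        return table
```

The output of snapshot() is plain data, so it could be stored with the rest of a checkpoint and passed to restore() in the restarted process.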

Linking

Except on those systems with kernel checkpointing support, checkpointable programs must be linked with the special checkpoint library and startup routine instead of the standard ones. LSF comes with replacement linkers for programs to be checkpointed. The ckpt_ld and ckpt_ld_f commands are shell scripts that take the same parameters as the system linker, ld. The checkpoint linkers call ld with the correct flags to link the user's program with the checkpoint library libckpt.a and the special startup routine ckpt_crt0.o. ckpt_ld is for linking C programs and ckpt_ld_f is for linking FORTRAN programs.

In most cases, users run the compiler for the language used, and the compiler then invokes the linker on behalf of the user. Because of this, few users call the linker directly. This section describes the steps involved in building a checkpointable application.

Suppose you have a C program called prog.c, and you normally create an executable, prog, as follows:

% cc -o prog prog.c

To build a checkpointable program, however, you need to build the object file, prog.o, as follows:

% cc -c prog.c

Note
On SGI systems running IRIX version 5 or 6, use the -nonshared option.

Use the LSF linker script to build the executable:

% ckpt_ld -o prog prog.o

Note
ckpt_ld requires the static library libc.a on all platforms. In addition, on AIX version 3.2 it requires libbsd.a; on Solaris version 2 it requires libmalloc.a, libsocket.a, libnsl.a, and libintl.a.

Like C programs, FORTRAN source files must be compiled to object files and then linked with the ckpt_ld_f command. To build a checkpointable version of prog1 from the prog1.f source file, proceed as follows:

% f77 -c prog1.f

Note
On SGI systems running IRIX version 5 or 6, use the -nonshared option.

% ckpt_ld_f -o prog1 prog1.o 

Note
ckpt_ld_f requires the static library libc.a on all platforms. It requires additional static libraries on each platform, as shown in the table below.

Table 5. Static Libraries for ckpt_ld_f

In addition to the checkpoint library and startup routine, ckpt_ld and ckpt_ld_f may also link in a few other system-provided object files that are platform-dependent. Normally, these files are installed in a standard directory by the operating system. However, you may get an error if your system administrator set things up differently. In this case, see the ckpt_ld(1) manual page for more information.

Limitations

There are restrictions to the use of the current implementation of the checkpoint library for user level checkpointing. These are:


1. echkpnt and erestart are by default located in the LSF_SERVERDIR directory (as defined in lsf.conf). You can specify another directory with the LSF_ECHKPNTDIR environment variable.



doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.