

Appendix A. Troubleshooting and Error Messages


This appendix describes common problems with LSF and LSF Batch operations, answers frequently asked questions, and gives instructions for solving problems.

Error Log Messages

When something goes wrong, the daemons almost always log an error message. The first step is to find the appropriate log and see whether there are any messages.

Specific error log messages are listed in 'Error Messages'.

Finding the Error Logs

Error messages from LSF servers are logged either through syslog(3) or to files, depending on the LSF_LOGDIR definition in the lsf.conf file. For complete instructions on finding the LSF server logs, see 'Managing Error Logs'.
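
The lookup can be scripted. The following is a minimal sketch, not part of LSF: the function name lsf_conf_get is hypothetical, and it assumes that lsf.conf holds one NAME=value definition per line. The location of lsf.conf depends on the installation, so it is passed as an argument.

```shell
# Hypothetical helper: print the value of one variable from an lsf.conf-style
# file. Assumes one NAME=value definition per line; lines that start with '#'
# never match because the name must be the first thing on the line.
lsf_conf_get() {
    conf_file=$1
    name=$2
    # Keep the last matching definition and strip surrounding double quotes.
    sed -n "s/^[[:space:]]*${name}=//p" "$conf_file" | tail -n 1 | sed 's/^"\(.*\)"$/\1/'
}

# Usage (the path to lsf.conf is site-specific):
#   logdir=$(lsf_conf_get /path/to/lsf.conf LSF_LOGDIR)
#   if [ -n "$logdir" ]; then
#       echo "look for daemon log files in $logdir"
#   else
#       echo "LSF_LOGDIR is not defined; check the syslog configuration"
#   fi
```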

If you configure LSF to log daemon messages using syslog, the destination file is determined by the syslog configuration. On most systems, you can find out which file the LSF messages are logged in with the command:

grep daemon /etc/syslog.conf

Once you have found the syslog file, you can select the LSF error messages with the command:

egrep 'lim|res|batchd' syslog_file

Look at the /etc/syslog.conf file and the manual page for syslog or syslogd for help in finding the system logs.
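
The two commands above can be combined. The following sketch is an illustration only: both function names are hypothetical, and it assumes a classic syslog.conf in which each line holds a selector and an action separated by white space. It uses grep -E, which is equivalent to egrep.

```shell
# Hypothetical helper: list the files that receive "daemon" facility messages.
# Only actions that are absolute file names are printed; comment lines are skipped.
syslog_daemon_files() {
    awk '!/^[ \t]*#/ && $1 ~ /daemon/ && $2 ~ /^\// { print $2 }' "${1:-/etc/syslog.conf}"
}

# Hypothetical helper: print the LSF daemon messages found in each of those
# files, using the same pattern as the egrep command above.
lsf_syslog_messages() {
    for f in $(syslog_daemon_files "$1"); do
        [ -r "$f" ] && grep -E 'lim|res|batchd' "$f"
    done
    return 0
}

# Usage:
#   lsf_syslog_messages /etc/syslog.conf
```

The pattern is deliberately broad, so it can also match unrelated messages that happen to contain "lim" or "res".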

When searching for log messages from LSF servers, you are more likely to find them on the remote machine where LSF put the task than on your local machine where the command was given.

LIM problems are usually logged on the master host. Run lsid to find the master host, and check syslog or the lim.log.hostname file on the master host. The res.log.hostname file contains messages about RES problems, execution problems and setup problems for LSF. Most problems with interactive applications are logged in the remote machine's log files.

Errors from LSF Batch are logged either in the mbatchd.log.hostname file on the master host, or the sbatchd.log.hostname file on the execution host. The bjobs or bhist command tells you the execution host for a specific job.
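
When the daemons log to files, the file to read follows from the daemon name and the host name. The sketch below shows the naming; the function name lsf_log_file is hypothetical, and it assumes the files sit directly under the LSF_LOGDIR directory.

```shell
# Hypothetical helper: build the log file path for one daemon on one host,
# following the daemon.log.hostname naming described above.
# Assumes the files live directly under the directory given as the first argument.
lsf_log_file() {
    logdir=$1
    daemon=$2
    host=$3
    case $daemon in
        lim|res|mbatchd|sbatchd)
            printf '%s/%s.log.%s\n' "$logdir" "$daemon" "$host" ;;
        *)
            echo "unknown daemon: $daemon" >&2
            return 1 ;;
    esac
}

# Usage (directory and host names are examples only):
#   tail "$(lsf_log_file /var/log/lsf sbatchd hostA)"
```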

Most LSF log messages include the name of an internal LSF function to help the developers locate problems. Many error messages can be generated in more than one place, so it is important to report the entire error message when you ask for technical support.

Shared File Access

A frequent problem with LSF is that files cannot be accessed because the file name space is not uniform across hosts. If a task runs on a remote host where a file it requires cannot be accessed under the same name, an error results. Almost all interactive LSF commands fail if the user's current working directory cannot be found on the remote host.

If you are running NFS, rearranging the NFS mount table may solve the problem. If your system is running the automount server, LSF tries to map the file names, and in most cases it succeeds. If shared mounts are used, the mapping may break for those files. In such cases, specific measures need to be taken to get around it.

The automount maps must be managed through NIS. When LSF tries to map file names, it assumes that automounted file systems are mounted under the /tmp_mnt directory.

Common LSF Problems

This section lists some other common problems with the LIM, the RES and interactive applications.

LIM Dies Quietly

Run lsadmin ckconfig -v to check for errors in the LIM configuration files. This displays most configuration errors. If this does not report any errors, check in the LIM error log.

LIM Unavailable

Sometimes the LIM is up but lsload prints the error message 'Communication time out'. If the LIM has just been started, this is normal: the LIM needs time to initialize by reading its configuration files and contacting other LIMs.

If the LIM does not become available within one or two minutes, check the LIM error log on the local host.

When the local LIM is running but there is no master LIM in the cluster, LSF applications display the message "Cannot locate master LIM now, try later". Check the LIM error logs on the first few hosts listed in the Host section of the lsf.cluster.cluster file.
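
To see which hosts to check first, the host names can be pulled out of the cluster file. This is a sketch under stated assumptions: the function name first_cluster_hosts is hypothetical, the Host section is assumed to be delimited by "Begin Host" and "End Host" lines, and its keyword line is assumed to begin with HOSTNAME.

```shell
# Hypothetical helper: print the first N host names (default 3) from the Host
# section of an lsf.cluster.cluster file.
# Assumed layout: "Begin Host", a keyword line starting with HOSTNAME,
# one line per host with the host name in the first column, "End Host".
first_cluster_hosts() {
    awk -v max="${2:-3}" '
        /^[ \t]*#/ || NF == 0 { next }                     # skip comments and blank lines
        tolower($1) == "begin" && tolower($2) == "host" { in_host = 1; next }
        tolower($1) == "end"   && tolower($2) == "host" { in_host = 0; next }
        in_host && $1 == "HOSTNAME" { next }               # skip the keyword line
        in_host && count < max { print $1; count++ }
    ' "$1"
}

# Usage (the file location is site-specific):
#   first_cluster_hosts /path/to/lsf.cluster.cluster 3
```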

RES Does Not Start

Check the RES error log. If the RES is unable to read the lsf.conf file and does not know where to write error messages, it logs errors into syslog(3).

User Permission Denied

If remote execution fails with error message 'User permission denied', the remote host could not securely determine the user ID of the user requesting remote execution. Check the RES error log on the remote host; this usually contains a more detailed error message.

If you are not using an identification daemon (LSF_AUTH is not defined in lsf.conf), then all applications that do remote executions must be owned by root with setuid bit set (as done by 'chmod 4755 filename'). If the binaries are on an NFS mounted file system, make sure that the file system is not mounted with the nosuid flag.
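
The ownership and mode requirement can be checked with standard tools. The function name is_setuid_root below is hypothetical; the sketch only inspects the file and does not check the mount options of the file system.

```shell
# Hypothetical helper: succeed only if the file has the setuid bit set and is
# owned by uid 0 (root), which is the state 'chmod 4755 filename' produces
# on a root-owned file.
is_setuid_root() {
    [ -u "$1" ] || return 1                       # setuid bit must be set
    owner=$(ls -ldn "$1" | awk '{ print $3 }')    # numeric uid of the owner
    [ "$owner" = "0" ]
}

# Usage (the path is an example only):
#   is_setuid_root /path/to/lsf/bin/lsrun || echo "not setuid root"
```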

If you are using an identification daemon (defined in lsf.conf by the variable LSF_AUTH), inetd must be configured to run the daemon. The identification daemon must not be run directly.

If LSF_USE_HOSTEQUIV is defined in the lsf.conf file, check whether /etc/hosts.equiv or $HOME/.rhosts on the destination host contains the client host name. Host names that are inconsistent between the name server, /etc/hosts, and /etc/hosts.equiv can also cause this problem.
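
A quick way to check for the client host name is sketched below. The function name host_in_equiv is hypothetical, and the check is deliberately simple: it compares the first field of each line literally and does not interpret "+", "-", or netgroup entries.

```shell
# Hypothetical helper: succeed if the given host name is the first field of
# some line in a hosts.equiv-style file.
host_in_equiv() {
    awk -v h="$1" '$1 == h { found = 1 } END { exit !found }' "$2"
}

# Usage, run on the destination host (clienthost is an example name):
#   host_in_equiv clienthost /etc/hosts.equiv || echo "clienthost is not listed"
```

Because the comparison is literal, a host listed under its fully qualified name does not match its short name, which is one form of the host name inconsistency described above.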

On SGI hosts running a name server, you can try:

% setenv HOSTRESORDER local,nis,bind

to tell the host name lookup code to search the /etc/hosts file before calling the name server.

Non-uniform File Name Space

A command may fail with the error message 'chdir(...) failed: no such file or directory', due to a non-uniform file name space. You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.

If your current working directory does not exist on a remote host, you should not execute commands remotely on that host.

If the directory exists, but is mapped to a different name on the remote host, you have to create symbolic links to make them consistent.
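
The sketch below shows one way to confirm that two names refer to the same directory once a link is in place. The function name same_directory and all paths are hypothetical examples.

```shell
# Hypothetical helper: succeed if two path names resolve to the same physical
# directory on this host, for example after a symbolic link has been created.
same_directory() {
    a=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    b=$(cd "$2" 2>/dev/null && pwd -P) || return 1
    [ "$a" = "$b" ]
}

# Example with hypothetical paths: the directory is /export/home/user on the
# remote host but is known as /home/user on the local host.
# On the remote host:
#   ln -s /export/home/user /home/user
#   same_directory /export/home/user /home/user && echo "names are consistent"
```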

LSF can resolve most, but not all, problems using automount. The automount maps must be managed through NIS. Follow the instructions in your Release Notes for obtaining technical support if you are running automount and LSF is not able to locate directories on remote hosts.

Common LSF Batch Problems

This section lists some common problems with LSF Batch. Most problems are due to incorrect installation or configuration. Check the mbatchd and sbatchd error log files; often the log messages point directly to the problem.

Batch Daemons Die Quietly

First, check the sbatchd and mbatchd error logs. Try running badmin ckconfig to check the configuration. This reports most errors. You should also check if there is any electronic mail from LSF Batch in the LSF administrator's mail box. If the mbatchd is running but the sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts. See 'Host Not Used By LSF Batch'.

sbatchd Starts But mbatchd Does Not

Check whether LIM is running. You can test this by running the lsid command. If LIM is not running properly, follow the suggestions in this chapter to fix the LIM first. You should make sure that LSF and LSF Batch are using the same lsf.conf file. Note that it is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown.

sbatchd: unknown service

Check whether services are registered properly. See 'Registering LSF Service Ports'.
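
Where services are registered in a local file, a check along the following lines can help. It is a sketch only: the function name service_registered is hypothetical, it reads a services(5)-style file rather than NIS, and it matches the official service name but not aliases.

```shell
# Hypothetical helper: succeed if a service name is registered for a protocol
# in a services(5)-style file (name, then port/protocol, then optional aliases).
service_registered() {
    awk -v name="$1" -v proto="$2" '
        { sub(/#.*/, "") }                    # drop trailing comments
        NF >= 2 && $1 == name {
            split($2, pp, "/")                # pp[1] = port, pp[2] = protocol
            if (pp[2] == proto) found = 1
        }
        END { exit !found }
    ' "${3:-/etc/services}"
}

# Usage, with service names that appear in the daemon error messages:
#   service_registered lim udp || echo "lim/udp is not registered"
#   service_registered res tcp || echo "res/tcp is not registered"
```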

Host Not Used By LSF Batch

If you configure a list of server hosts in the Host section of the lsb.hosts file, mbatchd allows sbatchd to run only on the hosts listed. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the message:

If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the message:

and exits. Both of these errors are most often caused by forgetting to run lsadmin reconfig and then badmin reconfig after adding a host to the configuration. You must run both of these before starting the daemons on the new host.
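
The order of the two commands can be captured in a small wrapper. This is a sketch, not an LSF tool: the function name lsf_reconfig_all and the DRY_RUN variable are hypothetical. With DRY_RUN set, the commands are only printed.

```shell
# Hypothetical wrapper: run the two reconfiguration commands in the order
# described above, stopping if the first one fails.
# With DRY_RUN set to a non-empty value, the commands are printed, not run.
lsf_reconfig_all() {
    for cmd in "lsadmin reconfig" "badmin reconfig"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$cmd"
        else
            $cmd || { echo "$cmd failed; not continuing" >&2; return 1; }
        fi
    done
}

# Usage:
#   DRY_RUN=1 lsf_reconfig_all    # show what would be run
#   lsf_reconfig_all              # run both, then start the daemons on the new host
```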

Error Messages

The following error messages are logged by the LSF daemons, or displayed by the lsadmin ckconfig and badmin ckconfig commands. LSF daemon message logs are described in 'Managing Error Logs'.

General Errors

The messages listed in this section may be generated by any LSF daemon.

can't open file: error
file(line): malloc failed
auth_user: getservbyname(ident/tcp) failed: error; ident must be registered in services
auth_user: operation(<host>/<port>) failed: error
auth_user: Authentication data format error (rbuf=<data>) from <host>/<port>
auth_user: Authentication port mismatch (...) from <host>/<port>
userok: Request from bad port (<portno>), denied
userok: Forged username suspected from <host>/<port>: <claimed user>/<actual user>
userok: ruserok(<host>,<uid>) failed
init_AcceptSock: RES service(res) not registered, exiting
init_AcceptSock: res/tcp: unknown service, exiting
initSock: LIM service not registered. See LSF Guide for help
initSock: Service lim/udp is unknown. Read LSF Guide for help
get_ports: <serv> service not registered
init_AcceptSock: Can't bind daemon socket to port <port>: error, exiting
init_ServSock: Could not bind socket to port <port>: error

Configuration Errors

The messages listed in this section are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.

file(line): Section name expected after Begin; ignoring section
file(line): Invalid section name name; ignoring section
file(line): section section: Premature EOF
file(line): keyword line format error for section section; Ignore this section
file(line): values do not match keys for section section; Ignoring line
file: HostModel section missing or invalid
file: Resource section missing or invalid
file: HostType section missing or invalid
file(line): Name name reserved or previously defined. Ignoring index
file(line): Duplicate clustername name in section cluster. Ignoring current line
file(line): Bad cpuFactor for host model model. Ignoring line
file(line): Too many host models, ignoring model name
file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line
file(line): Resource name name reserved or previously defined. Ignoring line
file(line): illegal character in resource name: name, section resource. Line ignored

LIM Messages

The following messages are logged by the LIM:

main: LIM cannot run without licenses, exiting
main: Received request from unlicensed host <host>/<port>
initLicense: Trying to get license for LIM from source <LSF_CONFDIR/license.dat>
getLicense: Can't get software license for LIM from license file <LSF_CONFDIR/license.dat>: feature not yet available.
findHostbyAddr/<proc>: Host <host>/<port> is unknown by <myhostname>
function: Gethostbyaddr_(<host>/<port>) failed: error
main: Request from unknown host <host>/<port>: error
function: Received request from non-LSF host <host>/<port>
rcvLoadVector: Sender (<host>/<port>) may have different config?
MasterRegister: Sender (host) may have different config?
rcvLoadVector: Got load from client-only host <host>/<port>. Kill LIM on <host>/<port>
saveIndx: Unknown index name <name> from ELIM
saveIndx: ELIM over-riding value of index <name>
getusr: Protocol error numIndx not read (cc=num): error
getusr: Protocol error on index number (cc=num): error

RES Messages

These messages are logged by the RES.

doacceptconn: getpwnam(<username>@<host>/<port>) failed: error
doacceptconn: User <username> has uid <uid1> on client host <host>/<port>, uid <uid2> on RES host; assume bad user
authRequest: username/uid <userName>/<uid>@<host>/<port> does not exist
authRequest: Submitter's name <clname>@<clhost> is different from name <lname> on this host
doacceptconn: root remote execution permission denied
authRequest: root job submission rejected
resControl: operation permission denied, uid = <uid>
resControl: access(respath, X_OK): error

The RES received a reboot request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.

LSF Batch Messages

The following messages are logged by the mbatchd and sbatchd daemons:

renewJob: Job <jobId>: rename(<from>,<to>) failed: error
logJobInfo_: fopen(<logdir/info/jobfile>) failed: error
logJobInfo_: write <logdir/info/jobfile> <data> failed: error
logJobInfo_: seek <logdir/info/jobfile> failed: error
logJobInfo_: write <logdir/info/jobfile> xdrpos <pos> failed: error
logJobInfo_: write <logdir/info/jobfile> xdr buf len <len> failed: error
logJobInfo_: close(<logdir/info/jobfile>) failed: error
rmLogJobInfo: Job <jobId>: can't unlink(<logdir/info/jobfile>): error
rmLogJobInfo_: Job <jobId>: can't stat(<logdir/info/jobfile>): error
readLogJobInfo: Job <jobId> can't open(<logdir/info/jobfile>): error
start_job: Job <jobId>: readLogJobInfo failed: error
readLogJobInfo: Job <jobId>: can't read(<logdir/info/jobfile>) size size: error
initLog: mkdir(<logdir/info>) failed: error
<fname>: fopen(<logdir/file> failed: error
getElogLock: Can't open existing lock file <logdir/file>: error
getElogLock: Error in opening lock file <logdir/file>: error
releaseElogLock: unlink(<logdir/lockfile>) failed: error
touchElogLock: Failed to open lock file <logdir/file>: error
touchElogLock: close <logdir/file> failed: error
replay_newjob: File <logfile> at line <line>: Queue <queue> not found, saving to queue <lost_and_found>
replay_switchjob: File <logfile> at line <line>: Destination queue <queue> not found, switching to queue <lost_and_found>
replay_startjob: JobId <jobId>: exec host <host> not found, saving to host <lost_and_found>
do_restartReq: Failed to get hData of host <hostname>/<hostaddr>



doc@platform.com

Copyright © 1994-1997 Platform Computing Corporation.
All rights reserved.