
Portable Batch System (PBS)


Sphinx Group User's Guide

Introduction

We use a queue submission manager named "PBS" on the "cartoon network" machines. A queue manager allows us to control how jobs are executed. One can create dependencies between jobs, or have them executed at particular times. The queue manager keeps track of which machines are being used and by whom, and launches jobs as machines become available. The order in which jobs are executed is not necessarily the order in which they were submitted. If the jobs do not have dependencies between them, the order of execution depends on factors such as who launches the jobs (the more jobs you run, the lower the priority of your jobs, so that everyone has a chance to run jobs) and which queue the job is launched into.

The queue server is "scrappy", and the queues we currently have are:

Before you get started

To use the queue, you will need:

To test your setup, copy the script below to a file named pbstest.sh on a local disk (e.g., /net/bunsen/usr0/robust/). This has to be a local directory; it cannot be on AFS (e.g., it cannot be your home directory). Make sure that your current directory starts with /net/<machine name>, so that the path to your current working directory is the same from every machine.

#!/bin/bash

echo "This job was submitted by user: $PBS_O_LOGNAME"
echo "This job was submitted to host: $PBS_O_HOST"
echo "This job was submitted to queue: $PBS_O_QUEUE"
echo "PBS working directory: $PBS_O_WORKDIR"
echo "PBS job id: $PBS_JOBID"
echo "PBS job name: $PBS_JOBNAME"
echo "PBS environment: $PBS_ENVIRONMENT"
echo " "
echo "This script is running on `hostname` "

Execute the following commands:

% chmod +x pbstest.sh
% qsub pbstest.sh
% qstat -n
# after the execution is complete, look for files named "pbstest.sh.o*"
# and "pbstest.sh.e*" in your current directory. "*" is the job number
% cat pbstest.sh.o*
% cat pbstest.sh.e*
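
The requirement that the working directory start with /net/<machine name> can be checked before submitting. The following is a minimal bash sketch of such a check. The function name is_net_path is hypothetical and not part of PBS, and the check only looks at the form of the path, not at whether the disk is really local.

```shell
#!/bin/bash
# Hypothetical helper (not part of PBS): succeeds only if the given
# directory (default: the current one) starts with /net/<machine name>.
is_net_path() {
    local dir="${1:-$PWD}"
    case "$dir" in
        /net/?*) return 0 ;;
        *)       return 1 ;;
    esac
}
```

A possible use, right before calling qsub: is_net_path || echo "Please cd to a directory under /net/<machine name> first"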

Main commands

The version currently installed is TORQUE. The list provided here is not exhaustive and does not present all options for all commands. Rather, it lists the most common commands with their most commonly used options. You are encouraged to check the man pages for more details. In the list below, we follow the convention:

qsub [-q <queue name>] [-l <resource list>] [-N <job name>] [-W <additional attributes>] [-o <output file>] [-j oe | -e <error file>] <script>

Submits a script for execution. The script cannot take any arguments, but it can be removed immediately after submission. If your job needs arguments, a good solution is to create a temporary script for job submission and remove it immediately after submitting it. The script below, although very simple, illustrates how this can be done.
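#!/usr/local/bin/tcsh

# Write the command line given as arguments to this script into a
# temporary script, submit it, and remove it right after submission
set tmpscript = /tmp/tmp$$
echo "#!/usr/local/bin/tcsh" >! $tmpscript
echo "$*" >> $tmpscript
chmod +x $tmpscript
qsub $tmpscript
/bin/rm $tmpscript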
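
For bash users, the same temporary-script idea could be sketched as follows. This is only an illustration under stated assumptions: the function names make_job_script and submit_with_args are hypothetical, mktemp is used to pick a unique file name, and printf %q is used so that arguments containing spaces survive in the generated script.

```shell
#!/bin/bash
# Hypothetical helper: writes a temporary job script that runs the given
# command with its arguments, and prints the path of that script.
make_job_script() {
    local tmpscript
    tmpscript=$(mktemp /tmp/pbsjob.XXXXXX) || return 1
    {
        echo '#!/bin/bash'
        # %q quotes each argument so that spaces survive in the job script
        printf '%q ' "$@"
        echo
    } > "$tmpscript"
    chmod +x "$tmpscript"
    echo "$tmpscript"
}

# Hypothetical helper: submits the given command line to the queue,
# then removes the temporary script, as described in the text.
submit_with_args() {
    local tmpscript status
    tmpscript=$(make_job_script "$@") || return 1
    qsub "$tmpscript"
    status=$?
    rm -f "$tmpscript"
    return "$status"
}
```

A call would then look like: submit_with_args ./myprogram -flag "some value", where myprogram and its arguments are placeholders.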

# Example using several options:
#   -q calo       launch the job in queue 'calo'
#   -j oe         join standard error to the output file
#   -o <file>     save the output in <file>
#   -N QueueTest  run the job with name 'QueueTest'
#   -W depend=afterok:2001:2010
#                 run after successful completion of jobs 2001 and 2010
qsub -q calo \
  -j oe \
  -o /net/george/usr0/egouvea/test.out \
  -N QueueTest \
  -W depend=afterok:2001:2010 \
  test.csh
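
Since qsub prints the id of the submitted job on standard output, a dependency between two jobs can be set up without typing job ids by hand. The sketch below assumes bash. The function names depend_arg and submit_chain, and the script names first.csh and second.csh, are hypothetical placeholders.

```shell
#!/bin/bash
# Hypothetical helper: builds the value for the -W option of qsub from a
# dependency kind (e.g. afterok) and a list of job ids.
depend_arg() {
    local kind="$1"
    shift
    local IFS=:
    # "$*" joins the remaining arguments with the first character of IFS
    echo "depend=${kind}:$*"
}

# Hypothetical helper: submits first.csh, then submits second.csh so that
# it runs only after first.csh completes successfully.
submit_chain() {
    local first_id
    first_id=$(qsub first.csh) || return 1
    qsub -W "$(depend_arg afterok "$first_id")" second.csh
}
```

For instance, depend_arg afterok 2001 2010 produces the same depend=afterok:2001:2010 string used in the example above.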

qdel [-W force] <job id>

Deletes job with given id number.
qdel 1234 # deletes job 1234

qstat <-n [-u <user>] | -Q>

Lists jobs or queues.
qstat -n -u robust # displays jobs submitted by user 
                   # 'robust' with executing host information

qrerun [-W force] <job id>

Reruns the job with given job id.
qrerun -Wforce 1234 # rerun job 1234

pbshosts

Lists all machines in a table, displaying the state of each machine and the jobs running on it. This is a Perl script written internally, and is not part of the PBS distribution.

pbsnodes [-a]

Lists details about machines, such as jobs in execution, amount of memory, and state.

tracejob [-n <days>] <job id>

Job history. Use this command to find information about a job, even if the job has already finished. It displays information about when the job was submitted, when execution started, and which execution host was used (search for "exec_host"). By default, it looks for information in the log files for "today" only. To make it look at previous days as well, use the option "-n". Additionally, since this program parses log files, it may provide more information if you run it on the machine where the job was executed, since the execution host may contain details not available to the PBS server.
tracejob -n 2 1234 # displays info about events related 
                   # to job 1234 that happened today or yesterday
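
To pick out the execution host from the job history, the output of tracejob can be filtered for "exec_host", as suggested above. A minimal bash sketch follows. The function name exec_host_lines is hypothetical, and no assumption is made about the exact layout of the tracejob output beyond the presence of that string.

```shell
#!/bin/bash
# Hypothetical helper: keeps only the lines that mention exec_host.
# It reads standard input, so it can be fed directly by tracejob.
exec_host_lines() {
    grep 'exec_host'
}
```

It could be used as: tracejob -n 2 1234 | exec_host_lines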

How to

Troubleshooting

Job in "E" state.

"E" state means "exit" state. After a job finishes, it goes into "E" state, where PBS copies standard output and standard error to the appropriate files (given by the "-o" and "-e" options of qsub, respectively) and performs housekeeping tasks.

It is normal for a job to go into this state. If it stays in this state for long (say, more than a few seconds), this may indicate problems exiting. The execution may or may not have been successful (most likely, it was), but the copying of files over the network failed. When this occurs, PBS copies the standard output and error files into files with extension ".OU" and ".ER". The file name starts with the job ID.

So, if a job fails to exit cleanly, find out where the job was executed, and look in /var/spool/torque/undelivered on the execution host for files whose names start with the job ID of the failed job. Most likely, you will receive an email from sphinx stating where the job was executed, so it is easy to track it down.
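
The search for undelivered output files can be scripted. The bash sketch below is meant to be run on the execution host. The function name undelivered_files is hypothetical, and the optional second argument exists only so that a different directory can be searched. Note that the match is a plain prefix match, so a short id such as 123 would also match files of job 1234.

```shell
#!/bin/bash
# Hypothetical helper: lists the undelivered output files of a job.
#   $1: job id (or its leading part)
#   $2: directory to search (default: the spool directory named above)
undelivered_files() {
    local jobid="$1"
    local dir="${2:-/var/spool/torque/undelivered}"
    local f
    for f in "$dir/$jobid"*; do
        # skip the unexpanded pattern when nothing matches
        [ -e "$f" ] && echo "$f"
    done
    return 0
}
```

For example, undelivered_files 1234 would print the ".OU" and ".ER" files left behind by job 1234, if any.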

Job seems to be running, but is stuck.

A process may get stuck. A process (what you see when you run ps) is different from a job (what you see when you run qstat). Most commonly, a stuck process is in the "D" state, the so-called uninterruptible I/O state.

You can verify whether a process is in "D" state by typing ps x and looking at the letter under the column named STAT. Normally, this column shows a letter like "S" or "R", or "D" for a few seconds at most. If the process is always in "D", you will need root access, or the help of someone with root access, to get it out of that state. You do not need to ask facilities to do this; just ask someone around you.

Ask your favorite system administrator to take a look at the Administrator's manual's troubleshooting section for details on how to get rid of the process.
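
Looking for the "D" state by eye can be tedious when many processes are listed. The bash sketch below filters the output of ps x, keeping the header and the lines whose STAT column starts with "D". The function name d_state_only is hypothetical. The position of the STAT column is read from the header line rather than assumed.

```shell
#!/bin/bash
# Hypothetical helper: reads "ps x" output on standard input and prints
# the header plus the processes whose STAT field starts with "D".
d_state_only() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STAT") col = i
                   print; next }
         col && $col ~ /^D/'
}
```

It could be used as: ps x | d_state_only. Running it a few times, some seconds apart, shows whether a process stays in "D".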

Script not found

If you get an email from sphinx saying that the script was not found, the most likely cause is that PBS starts jobs from your home directory on the execution host, no matter which directory you were in when the job was launched. You can change this by explicitly changing directory to $PBS_O_WORKDIR in your shell initialization script (typically, .login for tcsh and .profile for bash).

Alternatively, you can design your wrapper script so that it changes directory to your working directory. The wrapper script is the script that you submit to the queue, i.e., the script that you pass as the argument to the qsub command. This script can explicitly change to the directory from which it was launched. The robust tutorial does this. This way, execution starts in the home directory and then moves to the correct working directory.
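
A wrapper script that changes to the submission directory might start as in the bash sketch below. The function name enter_workdir is hypothetical. PBS_O_WORKDIR is the variable set by PBS, as shown by the test script earlier in this guide. Outside PBS the variable is unset and the function does nothing.

```shell
#!/bin/bash
# Hypothetical helper: change to the directory from which qsub was run.
enter_workdir() {
    if [ -n "${PBS_O_WORKDIR:-}" ]; then
        cd "$PBS_O_WORKDIR" || return 1
    fi
}

# Stop right away if the working directory cannot be reached
enter_workdir || exit 1
# ... the actual commands of the job would follow here ...
```

Exiting when the cd fails avoids running the rest of the job from the home directory by accident.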

Bad UID for job execution

If you get an email from sphinx mentioning a bad UID, most likely PBS tried to execute the job on a machine where you do not have an account. Ask your favorite system administrator to create an account for you there. PBS does not know whether a user has an account on a machine or not. It attempts to launch the job on the first machine available. This is why you need accounts on all machines in the PBS cluster.

kinit: password incorrect

The job finished too fast. Nothing happened, but you got a message about kinit in the error or output file. When you use the default CS environment (i.e., your shell is tcsh or csh and you use the default ~/.login and ~/.cshrc files), the initialization scripts will attempt to authenticate you to Kerberos upon login. Kerberos is the mythical monster that allows you to save files on AFS. When using the PBS queue, you will not be able to type in your Kerberos password. The solution is to disable the automatic prompting by adding these lines to your .login file:

if ($?PBS_ENVIRONMENT) then
  set _no_kinit = 'If set, dont kinit if unauthenticated'
  cd $PBS_O_WORKDIR
endif


Page created by Evandro B. Gouvêa on 5 June 2004

Page maintained by Evandro B. Gouvêa

Last modified: Fri Aug 18 15:18:10 Eastern Daylight Time 2006