We use a queue submission manager named "PBS" on the "cartoon network" machines. A queue manager allows us to control how jobs are executed. One can create dependencies between jobs, or have them executed at particular times. The queue manager keeps track of which machines are being used, and who uses them, and launches jobs as machines become available. The order in which jobs are executed is not necessarily the order in which they were submitted. If the jobs do not have dependencies between them, the order of execution will depend on who launches the jobs (the more jobs you run, the lower the priority of your jobs, so everyone has a chance to run jobs), which queue the job is launched into, etc.
The queue server is "scrappy", and the queues we currently have are:
workq - the default queue.
s4 - queue used for regression tests. Jobs submitted here will only get executed on the machine george.
calo - queue dedicated to the CALO project.

... /usr*), to avoid the need for kerberos authentication on afs.
Make sure you have /usr/bin in your path (if it is not already there, add it to _append_path if you are using the default .login file).
Add the following lines to your .login file (if your login shell is tcsh):
    if ($?PBS_ENVIRONMENT) then
        set _no_kinit = 'If set, dont kinit if unauthenticated'
        cd $PBS_O_WORKDIR
    endif
To test your setup, copy the script below to a file named pbstest.sh on a local disk
(e.g., /net/bunsen/usr0/robust/). This has to be a
local directory; it cannot be on afs (e.g., it cannot be your home
directory). Make sure that your current directory starts with
/net/<machine name>. This way, the path to
your current working directory is the same from every machine.
    #!/bin/bash
    echo "This job was submitted by user: $PBS_O_LOGNAME"
    echo "This job was submitted to host: $PBS_O_HOST"
    echo "This job was submitted to queue: $PBS_O_QUEUE"
    echo "PBS working directory: $PBS_O_WORKDIR"
    echo "PBS job id: $PBS_JOBID"
    echo "PBS job name: $PBS_JOBNAME"
    echo "PBS environment: $PBS_ENVIRONMENT"
    echo " "
    echo "This script is running on `hostname` "
Execute the following commands:
    % chmod +x pbstest.sh
    % qsub pbstest.sh
    % qstat -n
    # after the execution is complete, look for files named "pbstest.o*"
    # and "pbstest.e*" in your current directory. "*" could be any number
    % cat pbstest.o*
    % cat pbstest.e*
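The requirement that the submission directory start with /net/<machine name> can be checked before calling qsub. The following is only a sketch: on_net_path is a hypothetical helper name, not a PBS command, and it tests nothing more than the form of the path.

```shell
#!/bin/bash
# Hypothetical helper (not part of PBS): succeeds only if the given
# directory starts with /net/<machine name>, so that the same path is
# valid from every machine. It checks the form of the path only.
on_net_path() {
    case "$1" in
        /net/?*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Intended use, from the directory that holds pbstest.sh:
#   on_net_path "$PWD" && qsub pbstest.sh
```

A home directory on afs fails the check, so the job is never submitted from a path that the execution host cannot reach.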
The version currently installed is TORQUE. The list provided here is not exhaustive, and does not present all options for all commands. Rather, it provides the most common commands, with their most commonly used options. You are encouraged to check the man pages for more details. In the list below, we follow the convention:
qsub [-q <queue name>] [-l <resource list>] [-N <job name>] [-W <additional attributes>] [-o <output file>] [-j oe | -e <error file>] <script>
    #!/usr/local/bin/tcsh
    set tmpscript = /tmp/tmp$$
    echo "#!/usr/local/bin/tcsh" >! $tmpscript
    echo "$_" >> $tmpscript
    chmod +x $tmpscript
    qsub $tmpscript
    /bin/rm $tmpscript
See the pbs_resources_linux man page for more details.
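    # Launch test.csh in queue 'calo', joining standard error to the
    # output file, saving the output in /net/george/usr0/egouvea/test.out,
    # naming the job 'QueueTest', and running it only after the successful
    # completion of jobs 2001 and 2010. (A comment cannot follow the
    # backslash on a continued line, so the comments are grouped here.)
    qsub -q calo \
         -j oe \
         -o /net/george/usr0/egouvea/test.out \
         -N QueueTest \
         -W depend=afterok:2001:2010 \
         test.csh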
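When the job ids are not known in advance, the dependency attribute can be built from the ids that qsub prints at submission time. The sketch below assumes bash; depend_afterok is a hypothetical helper name, and prepare.csh and train.csh in the comments are placeholder script names.

```shell
#!/bin/bash
# Hypothetical helper (not part of PBS): joins one or more job ids into
# the attribute expected by "qsub -W", e.g. depend=afterok:2001:2010.
depend_afterok() {
    if [ "$#" -eq 0 ]; then
        echo "usage: depend_afterok <job id>..." >&2
        return 1
    fi
    local IFS=:                      # "$*" joins arguments with ':'
    printf 'depend=afterok:%s\n' "$*"
}

# Intended use on a machine where qsub is available:
#   first=$(qsub prepare.csh)       # qsub prints the id of the new job
#   second=$(qsub train.csh)
#   qsub -W "$(depend_afterok "$first" "$second")" test.csh
```

With the ids from the example above, depend_afterok 2001 2010 prints depend=afterok:2001:2010.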
qdel [-W force] <job id>
-W force : forces a job to be deleted, especially useful when the host running the job does not respond.
qdel 1234 # deletes job 1234
qstat <-n [-u <user>] | -Q>
-n : for each running job (state "R"), displays the host where the job is running.
-u <user> : list only jobs submitted by the given user.
-Q : displays information about the queues, such as number of jobs running or holding in each list.
qstat -n -u robust # displays jobs submitted by user
                   # 'robust' with executing host information

qrerun [-W force] <job id>
-W force : forces a job to be rerun, especially useful when the host running the job does not respond.
qrerun -Wforce 1234 # rerun job 1234
pbshosts
pbsnodes [-a]
-a : displays information for all machines.

tracejob [-n <days>] <job id>
-n <days> : looks for information in the log files of the last <days> days. It defaults to 1 day (today).
tracejob -n 2 1234 # displays info about events related
                   # to job 1234 that happened today or yesterday
qstat -n
tracejob <job id>
qstat -Q
qsub -l host=<machine name> <script>
qsub -q <queue> <script>
pbshosts
qdel -W force <job id> # delete job
qrerun -W force <job id> # rerun job
"E" state means "exit" state. After a job finishes, it goes into "E" state and does things like copying standard output and standard error to the appropriate files (given by the "-o" and "-e" options of qsub, respectively), and housekeeping tasks.
It is normal for a job to go into this state. If it stays in this state for long (say, more than a few seconds), this may indicate problems exiting. The execution may or may not have been successful (most likely, it was), but the copying of files over the network failed. When this occurs, PBS copies the standard output and error files into files with extension ".OU" and ".ER". The file name starts with the job ID.
So, if a job fails, find out where the job was executed, and look for
files whose name starts with the job ID of the failed job at
/var/spool/torque/undelivered in the execution host. Most
likely, you will receive an email from sphinx stating where the
job was executed, so it is easy to track it down.
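The search described above can be sketched as a small bash function, to be run on the execution host. undelivered_files is a hypothetical helper name; the default directory is the one given in the text, and the ".OU" and ".ER" extensions are the ones PBS uses for undelivered files.

```shell
#!/bin/bash
# Hypothetical helper (not part of PBS): prints the undelivered output
# (.OU) and error (.ER) files of a job. The optional second argument
# overrides the spool directory named in the text.
undelivered_files() {
    local jobid=$1
    local dir=${2:-/var/spool/torque/undelivered}
    local f found=1
    # Match "<id>.OU" and "<id>.<server>.OU" (same for .ER), but not a
    # longer id that merely shares the same leading digits.
    for f in "$dir/$jobid".OU "$dir/$jobid".*.OU \
             "$dir/$jobid".ER "$dir/$jobid".*.ER; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"                  # non-zero when nothing was found
}

# Intended use on the execution host:
#   undelivered_files 1234
```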
Occasionally, a process may get stuck. A process (what you
get when you do a ps) is different from a job (what you
get when you do a qstat). Most commonly, if a process gets
stuck, it gets stuck in the "D" state, the so-called
uninterruptible I/O state.
You can verify whether a process is in "D" state by typing
ps x and looking at the letter under the column named
STAT. Normally, this column will show a letter like "S"
or "R", or "D" for a few seconds. If the process is always in "D", you
will need root access, or the help of someone with root access, to get
rid of it. You do not need to ask facilities to do this, just ask someone
around you.
Ask your favorite system administrator to take a look at the Administrator's manual's troubleshooting section for details on how to get rid of the process.
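Before asking for help, the check for the "D" state can be narrowed down to the relevant lines. This is a sketch only: in_d_state is a hypothetical helper name, and it assumes a ps that accepts repeated -o options, so that STAT is the second column.

```shell
#!/bin/bash
# Hypothetical helper: reads "PID STAT COMMAND..." lines on standard
# input and prints only those whose STAT starts with "D".
in_d_state() {
    awk '$2 ~ /^D/'
}

# A single sample proves little, because "D" is normal for a few
# seconds. Intended use: take two samples some time apart and see
# whether the same PID shows up in both.
#   ps x -o pid= -o stat= -o args= | in_d_state
#   sleep 30
#   ps x -o pid= -o stat= -o args= | in_d_state
```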
If you get an email from sphinx saying that the script was not found,
most likely it was caused by the fact that PBS starts jobs from your
home directory in the execution host, no matter which directory you
were at when the job was launched. You can change this by specifically
changing directory to $PBS_O_WORKDIR in your shell
initialization script (typically, .login for tcsh and
.profile for bash).
Alternately, you can design your wrapper script so that it changes
directory to your working directory. The wrapper script is the script
that you submit to the queue, i.e., the script that you use as
argument to the qsub command. This script can explicitly
change to the directory from where the script was launched. The robust
tutorial does this. This way, execution starts in the home directory,
and is cd-ed to the correct working directory.
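A wrapper of this kind could look like the bash sketch below. enter_workdir is a hypothetical helper name and run_experiment.sh is a placeholder; the only PBS-specific piece is the PBS_O_WORKDIR variable, which PBS sets to the directory from where qsub was called.

```shell
#!/bin/bash
# Hypothetical helper for a wrapper script: change to the directory the
# job was submitted from. Outside PBS, where PBS_O_WORKDIR is not set,
# the current directory is kept.
enter_workdir() {
    cd "${PBS_O_WORKDIR:-$PWD}" || return 1
}

# In the script given to qsub, call it before anything that relies on
# relative paths:
#   enter_workdir || exit 1
#   ./run_experiment.sh
```

Stopping when the cd fails keeps the job from running in the home directory by mistake.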
If you get an email from sphinx talking about bad UID, most likely it is because PBS tried to execute the job in a machine where you do not have an account. Ask your favorite system administrator to create an account for you there. PBS does not know if a user has an account in a machine or not. It attempts to launch the job in the first machine available. This is why you need accounts in all machines in the PBS cluster.
The job finished too fast. Nothing happened, but you got a message
about kinit in the error or output file. When you use the default CS
environment (i.e. your shell is tcsh or csh
and you use the default ~/.login and
~/.cshrc files), upon login, the initialization scripts
will attempt to authenticate you to kerberos. Kerberos is the mythical
monster that allows you to save files on AFS. When using the PBS
queue, you will not be able to type in your kerberos password. The
solution is to disable the automatic prompting by adding these lines
to your .login file:
    if ($?PBS_ENVIRONMENT) then
        set _no_kinit = 'If set, dont kinit if unauthenticated'
        cd $PBS_O_WORKDIR
    endif
Page created by Evandro B. Gouvêa on 5 June 2004
Page maintained by Evandro B. Gouvêa
Last modified: Fri Aug 18 15:18:10 Eastern Daylight Time 2006