Batch system, resource management system
SUMS (aka, STAR Scheduler)
SUMS, the product of the STAR Scheduler project, stands for Star Unified Meta-Scheduler. This tool is currently documented on its own pages. SUMS provides a uniform user interface to submitting jobs on "a" farm that is, regardless of the batch system used, the language it provides (in XML) is identical. The scheduling is controlled by policies handling all the details on fitting your jobs in the proper queue, requesting proper resource allocation and so on. In other words, it isolates users from the infrastructure details.
You would benefit from starting with the following documents:
- Manual for users
- Description of existing policies and dispatchers
- The SUMS' FAQ
LSF
This help page was initially made from Emails gathered from Lee Barnby and Jérôme Lauret.Quick start ...
- Getting help on LSF
- Quick tutorial, useful commands and tips
- Writing scripts ready for batch
- Avoiding useless network traffic and improving IO
- Available STAR queues at BNL
- Other scheduling rules
- LSF error codes
Getting help on LSF
There are many help pages on LSF and the command you may use. The most useful documents are
- The University Of Colorado has some on-line help in HTML and a nice Quick User Reference Guide
- The RCF has put together a Quick Start Guide page ...
- and the full documentation as PS or PDF.
A tutorial of our own
Of course, you can always access the help by reading the manpages. For example, the mysterious above command should give display some help about the bsub command ...
% man bsub
However we realise that may be a bit too detailed and that the array of options is bewildering so we will provide some examples as yet another startup guide. The first command used is like this (all on one line):
% bsub -q star_cas_short -L `which tcsh` -J 12kittf -o 12ktest_ittf.log -e 12ktest_ittf.log
root4star -b -q 'runMakerFromMuDst.C("/path/12ktest_ittf.MuDst.root","12ktest_ittf.root",20000)'
And this next wonder is unraveled by explaining each options ...
- -q QueueName is the queue to which you need to submit. valid queues are star_cas and star_cas_short . Their respective limitations and descriptions is given above in a summary table.
- -L `which tcsh` (those are reversed quotes which causes that command to be executed and the output to be inserted on the command line at that point). The -L option specifies a shell. It is not strictly required and useful only whenever you specify what to run on the command line (rather than submitting a script ). Depending on whether or no the default shell defined in the LSF setup match your default shell, you may need to specify it specifically.
- -J 12kittf specifies a name for the job (12kittf in this case). This is optional but extremely useful as it allows you to track the progress of your job. If you don't provide one, the job name defaults to the command you are running (i.e. root4star -b -q ...) which is hard to keep track of.
- -o 12ktest_ittf.log sends all the standard output, which means things which normally appear on screen if you run the program, to a file of your choice. This is extremely useful, almost essential, because if you do not use this option the output is mailed to you. It is not a good idea to have huge files mailed to you especially when running multiple jobs.
- -e 12ktest_ittf.log sends all the error messages to a file of your choice. In this case we chose the same file as for standard output but it does not have to be.
Finally after all the options to bsub you have the command you want to run (along with its options and arguments). Most people run root4star. After that you need the name of your macro. The above is a relatively straight foward example which takes a microdst file as input and writes out a root format file containing the histograms that are filled from within that macro. The essential points are that the input and output files can be specified. The singles quotes ('...') around the macro are required. If you can specify the input file and the output file (or the output file name is automatically formed from the input file name) then you can run several jobs simultaneously looking at different chunks of data which is much quicker.
Presumably everyone already has some kind of macro that they run. An example of a non-root4star command would be for example submitting staf (for running GEANT). Here, you need something like:
% bsub ... staf -w 0 -b run.kumac
after the bsub command and its options, where run.kumac is your kumac.
Once your jobs are running I found the following commands useful:
% bjobs
to shows a summary with the status of each job (or bjobs -w which ensures the name fits on the screen)
% bqueues
to shows you the status of each queues (in case you wonder why none of your jobs started after some time). Don't panic !! There is a fair-share system so that you don't have to wait for all the other jobs in the queue before it starts yours. If you realize something is wrong and you need to kill your job, use
% bkill JobID
where JobID is the job number (obtained from bjobs) . If the job do not seem to disappear from the queue (please, wait at least 1 minute), you can then try to useful
% bkill -r JobID
the above forces the removal of JobID from the queue. Well, and if LSF is being stuborn and your job still remains in the queue, you can then use the ultimate and (almost) final trick
% bkill -s KILL JobID
% bkill -s QUIT JobID
The -s is used to send a specific signal to the job (instead of letting LSF handle it). We promised a last trick ... there is none and you will have to submit a trouble ticket ... As a last note, if you want to kill all submitted jobs, useful
% bkill 0
...
% bpeek -f
shows you the last few lines of your output (which will eventually appear in the file you specified) and updates in a loop until you press control-c (together). Good for checking the progress assuming you are writing useful messages.
% lsload -R "swp<200" | grep rcas6
a more esoteric command, shows how many nodes have a swap space resource contention. A very useful command to see how things are working 'at a glance' . Other monitoring commands include things like
% bjobs -s -u star_gr
% bjobs -p -u star_gr
which shows the jobs in suspend state and pending state respectively. The output also gives you the reason why.
A useful utility for adding together the histograms you created in different files is $ROOTSYS/root/tutorials/hadd_old.C or hadd.C in the same directory (sometimes, one or the other does not work). Copy it to your directory, be sure to name it hadd.C and edit only the part at the top, replacing the example input and output filenames with the ones you create.
As for what you should run in batch mode I would say anything that takes longer than a few minutes and does not require manual intervention.
Writing scripts you can submit to a batch
Instead of submitting jobs via a long command line, you may prefer to submit a script which contains all information about your job. Actually, this method is far less ambiguous since you are in control of every command necessary before/after/for the main execution of your program.
Contrary to a general and almost religious belief, there are no difference between tcsh and csh shells in batch modes. tcsh adds features generally used for prompt manipulation (like tab completion) and unless you have a site-specific compiled tcsh , you should use the lighter (and existing under ALL Unix flavor at the same location) /bin/csh . You can do this ONLY if you do not use a .tcshrc but a .cshrc in your $HOME directory. Note as well that while the .cshrc file is used by both csh and tcsh, the .tcshrc is used exclusively by tcsh.
Now, the above example can be re-written in a script as follow (we use the Unix cat command to show you the content of the file)
% cat myscript.csh
#!/bin/csh
cd /star/u/userxxx/WhereIworkAndMyMacrosAre
root4star -b -q 'runMakerFromMuDst.C("/path/12ktest_ittf.MuDst.root","12ktest_ittf.root",20000)'
%
Note that the script myscript.csh must be executable to be used as a command you can submit to LSF. You can use
% chmod +x myscript.csh
the first time you create such script and would submit it using
% bsub ... myscript.csh
OK. So, what is more powerful here ??? After all, we are doing the same thing at the end. May be ... But since you are in a script, this means you can use any commands / checks, environment changes you may need to execute before starting executing root4star . As an example, a typical complaints we hear (about once a month) are users claiming that they cannot run root4star in batch mode but it works perfectly in interactive . This is of course a side effect of what we call a pilot error . Usually, a user starts with one of the STAR level (let's say pro for the sake of argument), changes it to a new environment, dev for example using the stardev command, compiles his/her code under this environment and try to send a batch by specifying everything on the command line after bsub . This has a high chance to fail if the libraries used have been drastically modified between the pro and the dev level. This could have been avoided by the poor and miserable user fighting to understand what happens if he had submitted something like the above
% cat myscript.csh
#!/bin/csh
stardev
cd /star/u/userxxx/WhereIworkAndMyMacrosAre
root4star -b -q 'runMakerFromMuDst.C("/path/12ktest_ittf.MuDst.root","12ktest_ittf.root",20000)'
%
Note that we also had there a change directory (cd) command we forgot to mention. This is necessary since a shell script, when activated, will not know about the directory where you were when you submitting the job. This is because each new shell reload the Unix environment (and executes your .cshrc and .login files).
Now, imagine you would like to submit a job to the queue but you are not sure when that job will start. Maybe it will start while a disk is down or unmounted, a problem has occurred on the cluster or any other events you would like to handle. There is an infinite amount of sophistication one can add to a shell script (and this depends on your shell scripting abilities ). We will only show a few commonly used tricks.
The first one is a quick wrapper before your command checking for the existence of a file as a necessary condition for starting your job. The script would be written as follow
#!/bin/csh
set count=0
AGAIN:
if( ! -f SomeFileIknowShouldbeThere ) then
echo "Did not find SomeFileIknowShouldbeThere" | /bin/Mail -s "`hostname` insane"
$USER@rcf.rhic.bnl.gov
if ( $count < 10 ) then
@ count++
sleep 600
goto AGAIN
endif
endif
root4star ...
Note that the infinite loop without the count trick would block a queue slot and prevent others from using the queue at some stage. So, be sure of what you are doing before going ahead. Within the same spirit, you may want to resubmit your batch by doing something like
if( $count < 10)
then
....
else
# Tried 10 times and did not succeed
echo "$0" >>$HOME/resubmit
exit
endif
The file $HOME/resubmit will contain the batch you have not successfully run because 'SomeFileIknowShouldbeThere' was not found. In shell script, the variable $0 is replaced by the executing script (it can be used only in a script context as it has little meaning when used from the prompt).
Avoiding useless network traffic and improving IO
There is a simple way to avoid network traffic while running a job in batch mode which is ... to read and/or write to local disk. This method however has to be carefully thought of : you will gain tremendous performance only if your application is IO bound. Also, while in the above example, copying a remote file to a local storage seems at minimum useless (at maximum dumb ) it is NOT. Usually, your application fragments the IO in small buffer chunks which renders improvements and optimization by the OS impossible. A copy over NFS of an entire file will however take advantage of the system buffer size setup.
In any case, in the above example, we copy several files to local disk, run over it, produce an output locally and move it to it's final destination. Each LSF job uses the environment variable LSB_JOBID (for PBS, it is PBS_JOBID etc ...) and we use this as a unique identifier for keeping the output in separate directories.
#!/bin/csh
set ID=$LSB_JOBID
set WDIR=/tmp/$USER/$ID # a temporary LOCAL and unique directory
set TARGET=/star/data01/xxx/output/ # the directory where all job output will go
# Create directory, go there
mkdir -p $WDIR || exit 1
cd $WDIR
# Copy input files locally
cp /star/data01/xxx/input/*.root .
# Run my program now
root4star -b -q 'runMakerFromMuDst.C("*.root","myHistograms.root",20000)'
if ( -e "myHistograms.root" ) then
# Move file and in addition, copy it with a unique name as well
mv -f myHistograms.root $TARGET/myHistograms$ID.root
else
echo "Action did not produce the expected file"
# eventually, use the same trick than above for keeping a record of what has failed
# ...
endif
# Do not forget to do some cleanup
rm -fr $WDIR
exit
The above example may be an overkill and not necessarily doable (too many input files, don't know a-priori the number of input file to use etc ...). But this method can always be used for the output and at least for writing (which is usually the most intensive operation), this trick may be used.
Available STAR queues at BNL
The above queues and their characteristics are available
| Queue Name | Note | Prio | Slot | CPU Limit (minutes) |
Res Mem Min (MB) |
Swap Min (MB) |
Res Mem Max (MB) |
Swap Max (MB) |
Job limit |
|---|---|---|---|---|---|---|---|---|---|
| star_cas_big | 40 | 12 | 14400 | 100 | 400 | 0 | 0 | 1 per node 110 per user |
|
| star_cas_short | 50 | 16 | 180 | 100 | 200 | 900 | 0 | 220 per user | |
| star_cas_mem | 56 | 28 | 90 | 100 | 200 | 400 | 700 | none | |
| star_cas_prod | Reserved | 100 | N/A | 14400 | 100 | 200 | 900 | 0 | 1 per node |
Some hidden implied restrictions are applied as follow
- /tmp must be of at least 1 GB for scheduling to occur
- There is a RunTime limit on ALL queues at BNL which is equal to 4 times the CPULimit.
- A running LSF job is limited to 50 processes (including children)
- Resident memory and swap limits minimum are used for scheduling purposes.
- Every job is limited by the Max values on Resident Memory and swap. Jobs exceeding this memory consumption will be killed. The Max cannot be exceeded. Default limits are 450/950 MB and are large enough to have the system self-protected.
- As a side note, our nodes supports ~ 2 GB of swap space so this limit was added as a safeguard.
- LSF was configure with UNLIM=stack, otherwise, a 2 MB memory limit is the default setting (which is low)
In addition, several global IO resources have been set one can use to throttle their job and get a better response (clock time/CPU time) from running jobs. This static IO resource MUST be specifically requested by the user submitting the job and require a minimal amount of education and discipline from the user. If not requested, LSF will NOT be able to optimize your job submission. What this resource does in reality is to decrement the global resource by the pre-allocated amount. Whenever the global resource reaches 0, job scheduling leaves sub-sequent jobs in pending mode. Although it appears at first glance that this mechanism decreases the number of jobs one has running, in reality, it improves drastically the job efficiency since running jobs are not slowed down by other jobs running in parrallele . In fact. When too many jobs are running simultaneously, they compete with each other for resource such as IO bandwidth. This particular mechanism has been set to address this problem.
The static resource can be requested by using the following command, subsequently explained
% bsub -R "rusage[sd1=50]" -q ...
The above command is a request to LSF to pre-allocate, from a static Pool of 1000 IO, 50 IO credits to this job. The credit is counted against the resource sd1 which is a resource specific to the partition /star/data01 . For a data disk /star/dataXX, the resource name sdXX . The resource sdu was also added for the /star/u/ disk partition. This is the most practical way to access this resource as disks may be moved from servers to servers. However, if you would like to use this resource on a per-server basis, the syntax is
% bsub -R "rusage[r601=50]" -q ...
where r601 indicates to LSF your intent to pre-allocate 50 credits against rmine601 global IO resource. Ultimately, STAR will move away from NFS storage accessible from the entire farm ... Note that multiple resource may be requested using the following syntax
% bsub -R "rusage[sd1=50:sdu=10]" -q ...
Other scheduling rules
Please, note that the following rules applies :
- Usually, LSF considers the following
- LSF considers FIRST the priority of the queue unless other rules applies.
- Within this queue, LSF looks at the users who have submitted jobs and sends according to the fare-share rule and the basic scheduling rules (if any).
- In 2004, we enabled cross-queue fair-share at BNL.
- Cross queue fair-share will make the system consider the highest priority queue “as usual”
- But since fair-sharing is enabled, this means that lower priority queue jobs (such as long running jobs) will influence scheduling on higher priority queues and vice-versa.
- Ideally, this will prevent a user from taking all queues at a time, the sum of all CPU used is whatmatters.
- There are other rules for scheduling such as queue limits and resource limitations.
For the star_cas_short queue, you are limited to 100 jobs per user. This is done ON PURPOSE to prevent the star_cas_short queue to be filled up with lots of jobs, rendering the star_cas queue, of lower priority, obsolete (i.e. unusable). This means that you may get in a situation where no jobs are in the star_cas queue but you cannot submit more than 100 jobs in star_cas_short . We consider this as a small (easy to get around) scenario which is not relevant in large (and diverse) number of batch environment. This would not be the case in an environment where the work load is low (queue length short). - After the upper priority level queue is taken care off, the lower priority queues are inspected. The fare-share quota is decremented as jobs are submitted. The fare-share is by queue (in LSF 5 and above, this is not necessarily the case as the fare-share operates across queues).
- If a given queue has other scheduling policies, those policies are inspected and LSF tries to find a match before going to a lower priority queue. For example, star_cas_big has extraneous memory restrictions which may render impossible to submit a job at a time. However, being of higher priority than other queues, as soon as the resource is available, the job will be submitted. Note that submitting to star_cas_big ensures proper scheduling in addition of automatically pre-allocating memory resources.
- This approach has been demonstrated as being useful to allow short and long jobs without blocking all resource by one user. The limitations above may be reshaped as the needs change. Especially, the number of jobs per queue or per node may be reshaped.
LSF error codes
The error code reported by LSF consists of two parts:
- the LSF error code
- the User job error code
The LSF error is in the lower nibble, the user job error in the higher nibble(i.e. offset by 128). The interpretation is as follows.
- ERROR <= 128 represents an error in the LSF environment and has nothing to do with the user job.
- ERROR > 128 represents a fault in the user's job. You should subtract 128 to get the 'real' exit code returned by your program. An error of 255 is a general (complete) failure of the user's job.
| Signal Name | Signal Number | LSF Usage |
| SIGHUP | 1 | |
| SIGINT | 2 | bkill (2nd attempt) memlimit (1st attempt) job_starter failed to execute |
| SIGQUIT | 3 | |
| SIGILL | 4 | |
| SIGTRAP | 5 | |
| SIGABRT | 6 | |
| SIGIOT | 6 | |
| SIGBUS | 7 | |
| SIGFPE | 8 | |
| SIGKILL | 9 | bkill (3rd attempt) memlimit (3rd attempt) |
| SIGUSR1 | 10 | |
| SIGSEGV | 11 | Segmentation Fault (User) |
| SIGUSR2 | 12 | RUNtime limit reached |
| SIGPIPE | 13 | |
| SIGALRM | 14 | |
| SIGTERM | 15 | bkill (1st attempt) memlimit (2nd attempt) |
| SIGSTKFLT | 16 | |
| SIGCHLD | 17 | |
| SIGCONT | 18 | |
| SIGSTOP | 19 | |
| SIGTSTP | 20 | |
| SIGTTIN | 21 | |
| SIGTTOU | 22 | |
| SIGURG | 23 | |
| SIGXCPU | 24 | CPUtime limit reached |
| SIGXFSZ | 25 | FILEsize limit reached |
| SIGVTALRM | 26 | |
| SIGPROF | 27 | |
| SIGWINCH | 28 | |
| SIGIO | 29 | Directory Access Error - No AFS token - dir does not exist) |
| SIGPOLL | SIGIO | |
| SIGPWR | 30 | |
| SIGSYS | 31 | |
| 143 |
We may extend this help with several new examples as user come up with questions and needs.
Condor
Quick start ...
Condor Pools at BNL
The condor pools are segmented into four pools:
- production
- users
- OpenScienceGrid (OSG)
- and general.
The four pools (like queues) span all STAR machines (CAS and CRS). On all STAR machines priority goes as follows
- production = 2
- users = 1
- OSG = 1
- general = 0
- Production jobs cannot be evicted on CRS machines ...
- Users and OSG jobs can be evicted
- Eviction happens after those have 3 hours of runtime from the time they start.
- For example, if a production job wants a node being used by a user job that has been running for two hours then that user job has one hour left before it gets kicked out ...
- Eviction happens after those have 3 hours of runtime from the time they start.
- This time limit comes into effect when a higher priority job wants the slot (i.e. production vs. user or production versus OSG)
- general queue jobs are evicted IMMEDIATELY when the slot is wanted by ANY STAR job (production, user, or OSG).
This provides the general structure of the Condor policy in place for STAR. The other policy options in place goes as follows:
- The following options apply to all machines but only come into play on machines that also execute LSF jobs or have interactive jobs allowing a rise of the load
- production jobs cannot start unless the non-Condor 1min load <= 1.4.
- Non-production jobs cannot start unless the non-Condor 5min load <= 1.4
- On the CAS machines, general jobs are evicted when the non-Condor 1min load <= 1.4. (this is so that LSF jobs can evict general queue jobs).
- General queue jobs will not start on any node unless 1min < 1.4, swap > 200M, memory > 100M.
- CAS nodes only run one Condor job at a time (for LSF reasons)
- CRS nodes can run three production jobs at a time, two jobs otherwise.
- User fairshare is in place. The time limit is the exact same as described above (3 hours from t0)
- Not really something I control but I should mention that production normally does not run on the rcas machines (it keeps them blocked)
Some condor commands
This is not meant to be an exhaustive set of commands nor a tutorial. You are invited to read to the manpages for condor_submit, condor_rm, condor_q, condor_status. Those will be most of what you will need to use on a daily basis. Help for version 6.9 is available online.- Query and information
- condor_q - submitter $USER
List jobs of specific submitter $USER from all the queues in the pool
- condor_q -analyze $JOBID
Perform an approximate analysis to determine how many resources are available to run the requested jobs.
- condor_status -submitters
shows the numbers of running/idle/held jobs for each user on all machines - condor_status -claimed
Summarize jobs by servers as claimed - condor_status -avail
Summarize resources which are available
- condor_q - submitter $USER
- Removing jobs, controlling them
- condor_rm $USER
removes all of your jobs submitted from this machine
- condor_rm -forcex $JOBID
Forces the immediate local removal of jobs in undefined state (only affects jobs already being removed). This is needed if condor_q -submitter shows your job but condor_q -analyze $JOBID does not (indicating an out of sync information at Condor level).
- condor_release $USER
releases all of your held jobs back into the pending pool for $USER
- condor_rm $USER
- More advanced
- condor_status -constraint 'RemoteUser == "$USER@bnl.gov"'
lists the machines on which your jobs are currently running
- condor_q -submitter username -format "%d" ClusterId -format " %d" JobStatus -format " %s\n" Cmd
shows the job id, status, and command for all of your jobs. 1==Idle, 2==Running for Status. I use something like this because the default output of condor_q truncates the command at 80 characters and prevents you from seeing the actually scheduler job ID associated with the Condor job. I'll work on improving this command, but this is what I've got for now.
- condor_status -constraint 'RemoteUser == "$USER@bnl.gov"'
