Scheduler FAQ
First you have to prepare your XML job description. When you have prepared your file, you can type:
star-submit jobDescription.xml
where jobDescription.xml is the name of the file.
You can use one of the "cut and paste" examples. Of course, you still have to change the command line and the input files. I am sorry I couldn't prepare the exact file for you... :-)
Yes, this is normal. For every process a script and a list of files are created. It's something that will be fixed in the final version. You can delete them easily, since they all start with script* and fileList. Remember, though, to delete them _AFTER_ the job has finished.
Well, you shouldn't panic like that! You can send an e-mail to the scheduling mailing list, and somebody will help you.
In the comment at the beginning of each script there is the bsub command used to submit the job. You can copy and paste it to the command line and execute it. Be sure you are in the same directory as the script.
You can use the name attribute for job like this:
<job ... name="My name" ... >
In the command section you can put more than one command. You can actually put a csh script. So you can write:
<command>
starver SL02i
root4star mymacro.C
</command>
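For instance, a command block containing a small csh script might look like the following sketch (the STAR version, the macro name and the final ls are only illustrative):
<command>
starver SL02i
echo "starting job $JOBID on `hostname`"
root4star -b -q mymacro.C
ls -l
</command>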
The file catalog is actually a separate tool from the scheduler. When you write a query, the get_file_list.pl command is used. So, the full documentation for the query is available in the file catalog manual. You will be interested in the -cond parameter, which is the one you are going to specify in the scheduler.
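For example, you can test a query condition on the command line before putting it into your job description (the keys and conditions here are only illustrative):
get_file_list.pl -keys 'path,filename' -cond 'filetype=daq_reco_MuDst,storage=local' -limit 10
The string given to -cond is essentially what you will specify for the catalog query in the scheduler.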
If you are asking this question, it's because you have been trying to submit something like:
<command>root4star -b -q doEvents.C\(9999,\"$FILELIST\"\)</command>
This won't work because doEvents.C interprets $FILELIST as a single input file and not as a file list. But if you put @ before the filename, doEvents.C (and bfc.C, ...) will interpret it correctly. So you should have something like:
<command>root4star -b -q doEvents.C\(9999,\"@$FILELIST\"\)</command>
Before version 1.8.6 the job would start in the default location for the particular batch system.
If you are using LSF, jobs execute in the directory from which you submitted them, which is the same directory where the scripts are created and the same directory you should be in when resubmitting jobs.
In version 1.8.6 and above the default directory in which the job starts is defined by the environment variable $SCRATCH. This will most likely be a directory local to the worker node. The base directory path will be different at every site. The directory and all its contents will be deleted as soon as the job has finished, so do not ever attempt to change this directory; any files that need to be saved must be copied back to some other location.
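For example, a tag like the following (the destination path is illustrative) copies all root files produced in $SCRATCH back to a safe location when the job ends:
<output fromScratch="*.root" toURL="file:/star/u/username/myoutput/" />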
You don't, but you can tell SUMS (a.k.a. the STAR Scheduler) about your job, so that SUMS can choose the correct queue or modify the job to fit into an existing queue.
Remember: SUMS has to work at different sites, where the queue names, the number of queues and their limitations all differ. SUMS knows these values through a configuration file maintained by facility or team experts who know those limitations. Also, note that if the queue characteristics are modified, or new queues open up, these changes can be made transparently to you, while still allowing your jobs to take advantage of them immediately.
If you do not give SUMS any hint, it will be forced to make assumptions based on the provided properties of your job and hence choose a queue based on those assumptions. Queues typically have limitations on run time (CPU or wall clock), memory and storage space. If your job exceeds any of these limitations it may be killed by the batch system.
SUMS uses the filesPerHour attribute, which can be defined in the job tag, to figure out how long a job will run. Here is an example of how you can define this:
<job filesPerHour="25">
This is a way of specifying that your job will be able to process on average
25 files in one hour. So a job with 50 files is assumed to run for two hours
and a job with one file is assumed to run for 2.4 minutes. If a smaller number
is used like filesPerHour="0.25" this means a job with one file will
run for four hours and a job with 2 files will run for eight hours.
If the distribution of events per job is wide, making the run times of the jobs very different (e.g. some jobs run for one hour and others for over ten hours), it may be more desirable to switch over to the events keyword group, which gives each job a more even number of events. So instead of filesPerHour, minFilesPerProcess and maxFilesPerProcess you would use eventsPerHour, minEvents and maxEvents. The events keywords will drop files to try to meet the requirement. Since most users want as many statistics as possible, even at the cost of more variation in run time, it is recommended that the softLimits keyword be set to true: SUMS will then try to meet the requirements, but if it can't it will get each job as near the ideal minimum and maximum as possible. For example, if you set the maximum events to 1,000 and a single file has 5,000 events in it, with softLimits not set or set to false the file with 5,000 events will be dropped. With softLimits set to true the file will be given its own job, because that is the best that can be done to get it as close to the user's requested limit of 1,000 as possible.
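A minimal sketch of a job tag using the events keyword group (the values are illustrative, and it is assumed here that all four keywords are set as attributes of the job tag; check the SUMS manual for the exact syntax of your version):
<job eventsPerHour="20000" minEvents="50000" maxEvents="100000" softLimits="true">
    ...
</job>
With these numbers each job would hold roughly 50,000 to 100,000 events and would be assumed to run for about 2.5 to 5 hours.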
SUMS has a preferred queue. This is usually the shortest, highest-priority queue. So as the value of filesPerHour gets smaller, SUMS will try to group fewer files into one job to get it under the time limit of its preferred queue; this results in a greater number of jobs. If for some reason you want to force SUMS to send your jobs to a queue with a longer time limit, you can do this by setting limits on the number of files inside each job with minFilesPerProcess and maxFilesPerProcess. In particular, a combination of minFilesPerProcess and filesPerHour clearly indicates the minimum time for a job. If the job time exceeds the limit of the preferred queue, SUMS will try the second most preferred queue and so on until it finds a queue that meets the user-defined requirements. If no such queue exists, for example if the longest queue has a time limit of a week and your job requests 12 days, the job will not be submitted and SUMS will print a warning. It is recommended to allow a little bit of wiggle room between minFilesPerProcess and maxFilesPerProcess, because SUMS will complain if, for example, you ask for N files in each job and the number of files returned by the catalog is not divisible by N.
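As an illustration of that combination (the numbers are made up): a job tag such as
<job filesPerHour="10" minFilesPerProcess="20" maxFilesPerProcess="40">
tells SUMS that every job holds at least 20 files and therefore runs for at least two hours (20 files at 10 files per hour), so any queue with a shorter time limit than that will be skipped.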
There will always be some variation in run time because of differences in worker node hardware, load, variation in event complexity, file system load and so on. So it is recommended to leave a little slack in filesPerHour or eventsPerHour (in other words, make the number a little lower). If you guess that your job can process 100 files per hour when in reality it can only process 10, and you ask for a maximum of 100 files per job, your job will really run for 10 hours while the scheduler assumes it will run for at most one hour. With that assumption the scheduler could submit the job to a queue with a run-time limit of 2, 3, 4 or 5 hours, depending on what your site offers. Once the job runs over the hard run-time limit of the queue it will be forcibly stopped even if it has not finished. This is why it is important to get this number close to reality, or at least not off by huge factors. Presumably, before running your script over hundreds or thousands of files you have tested it with one or two files; you should use that to estimate filesPerHour or eventsPerHour. It is not an exact science, but close is good enough.
Now, if you are having a hard time guessing what the scheduler thinks your run time is, why it sent your job to a particular queue, or even which queue it sent it to, you can look at the *.report file. This is a text file with a table of your jobs and some key parameters. You have to stretch the window really wide to see the table properly, otherwise the text will wrap around and distort. There is a smaller table below the job table which shows the queues at your site, in your scheduler's policy, that the scheduler had to select from. If you want to view this before submitting your jobs you can use the simulateSubmission keyword.
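For example, assuming simulateSubmission is set as an attribute of the job tag (check the SUMS manual for your version), something like
<job simulateSubmission="true" filesPerHour="10">
lets you inspect the *.report file and the queue selection without actually submitting anything to the batch system.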
Please refer to the SUMS manual for defining other properties of your job such
as memory and storage space.
You can look at this example.
Check your .cshrc file. First of all, as Jerome says:
***********************************************
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
***********************************************
Furthermore, you may want to have a look at this page
You can use get_file_list.pl to find out which values a particular keyword can have. For example:
[rcas6023] ~/> get_file_list.pl -keys filetype
daq_reco_dst
daq_reco_emcEvent
daq_reco_event
daq_reco_hist
daq_reco_MuDst
daq_reco_runco
daq_reco_tags
dummy_root
...
You can do the same for any keyword.
You might want to have a look at the star-submit-template. Here is an example.
This is an XML problem: '<', '&' and '>' are reserved characters in XML, so you can't use them directly, but you can use something else in their place. Use the following table for the conversion:
< | &lt; |
> | &gt; |
& | &amp; |
So, for example, if you need to put the following in a query:
runnumber<>2280003&&2280004
you would actually have to write it as:
runnumber&lt;&gt;2280003&amp;&amp;2280004
Yes, it doesn't look so nice... :-( There is unfortunately not much I can do...
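For instance, if such a condition were to appear inside an input tag's catalog query, the escaped form might look something like this (the other keys and the nFiles value are only illustrative):
<input URL="catalog:star.bnl.gov?runnumber&lt;&gt;2280003&amp;&amp;2280004,filetype=daq_reco_MuDst,storage=local" nFiles="100" />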
I suggest you have a look at this site: it has a lot of tutorials about XML, HTML and other web technologies. It is at an entry level and has a lot of examples.
You can find a list of the installed policies here.
Use $FILEBASENAME as the stdout file name. This only works in version 1.6.2 and up, and only if there is only one output file or one file per process.
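For example, with the stdout and stderr tags used elsewhere in this FAQ (the exact paths are illustrative):
<stdout URL="file:./$FILEBASENAME.out" />
<stderr URL="file:./$FILEBASENAME.err" />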
Consult the manual if you need an example.
Emphasis is always placed on generating detailed and meaningful feedback at the prompt when errors occur. However, more information can be found in the user's log file. If the scheduler did not crash altogether, it will append to the user's log file, where the most detailed data about its internal workings can be found.
Every user that has ever used the scheduler, even just once, has a log file. It holds information about what the scheduler did with your job. To find out more about reading your log file click here.
Note: This only refers to version 1.7.5 and above, as resubmission was only built-in at this version. If you submitted via an older scheduler version the resubmit syntax will not work and you will not have a session file.
Note: When a job is resubmitted the file query is not redone. Scheduler uses the exact same files as the original job. Be careful when resubmitting old jobs as the path to the file may have changed.
When a request is submitted by the scheduler, it produces a file named [jobID].session.xml. This is a semi-human-readable file that contains everything you need to regenerate the .csh and .list files and to resubmit all or part of your job.
If you wish to resubmit the whole job, the command is (replace [jobID] with your job Id): star-submit -r all [jobID].session.xml
Example: star-submit -r all 08077748F46A7168F5AF92EC3A6E560A.session.xml
If you wish to resubmit a particular job number, the command is (where n is the job number): star-submit -r n [jobID].session.xml
To resubmit all of the failed jobs, the command is: star-submit -f all [jobID].session.xml
Type star-submit -h for more options and for help. There are a lot more options available; this is a very short overview of the resubmission options.
Note: This only refers to version 1.7.5 and above, as resubmission was only built in at this version. If you submitted via an older scheduler version the resubmit syntax will not work and you will not have a session file. It is also recommended that you read 20. How to resubmit jobs, as there is more information about the session file there.
The command to kill all the jobs in the submission is (replace [jobID] with your job Id): star-submit -k all [jobID].session.xml
If you wish only to kill some of the jobs, replace the word all with a single job number (for job 08077748F46A7168F5AF92EC3A6E560A_4 the number is 4). A comma-delimited list may also be used (example: star-submit -k 5,6,10,22 [jobID].session.xml) or a range (example: star-submit -k 4-23 [jobID].session.xml).
Note: This only refers to version 1.7.5 and above.
This information is stored in the [jobID].report file in a nice neat table (I recommend you turn off word wrap to view this file). We are trying to put more and more information for users into this file with every new version. The file also stores information about queue selection, so it will probably answer questions such as "Why did my job have to go into the long queue as opposed to the short queue?".
In a non-grid context, when you ask for data to be moved back from $SCRATCH using a tag like this one:
<output fromScratch="*.root" toURL="file:/star/u/lbhajdu/temp/" />
it is translated into a cp command like the one you see below:
/bin/cp -r $SCRATCH/*.root /star/u/lbhajdu/temp/
If the cp command returns "/bin/cp: no match", it means it did not match anything, because no files were generated by your macro for it to copy.
This can be verified by adding an "ls $SCRATCH/*" to the command block of your job, right after your macro finishes, to list what files it has produced in $SCRATCH.
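A sketch of what that could look like in the command block (the macro name is a placeholder):
<command>
root4star -q -b myMacro.C
ls -l $SCRATCH/
</command>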
Examine your macro carefully to see what it's doing. It could be writing your
files to a different directory, or not writing any files at all or crashing
before it gets a chance to write anything.
In scheduler version 1.10.0c and above there is an option to have the scheduler write the files somewhere other than your current working directory. A basic example of this capability would be to create a subdirectory off your current working directory and have the scheduler copy everything there. To do this, first create the directory [mkdir ./myFiles], then add these tags inside the body of your XML file's job tag (that is, at the same level as the command tag):
<Generator><Location>./myFiles</Location></Generator>
These tags have far more fine-grained options, fully documented in section 5.1 of the manual.
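As a sketch, assuming the same job description style used elsewhere in this FAQ, the block sits at the same level as the command tag:
<job maxFilesPerProcess="50">
    <command> ... </command>
    <Generator>
        <Location>./myFiles</Location>
    </Generator>
    ...
</job>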
In order to expand the scheduler functionality further and implement many of the changes people have been asking for, we will need to expand the XML job description. This means turning some of the XML attributes into <elements>, so that sub-elements can be added later. We will try to be gentle; in the 1.10.0 version the scheduler tells you how to upgrade your XML and make it functionally equivalent.
Use the nProcesses option to set the number of jobs you want to submit, and use filesPerHour to set the run time of the jobs. Note that because there are no input files, filesPerHour gives the number of hours for which the job will run.
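For instance (the values are illustrative):
<job nProcesses="100" filesPerHour="5">
would submit 100 identical jobs, with filesPerHour telling SUMS their expected run time as described above.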
For this, the STAR Scheduler (SUMS) provides a variable defined in your jobs called $SCRATCH. There are benefits to using this variable, which points to local storage.
The number one reason users cd back to their home directories is that they don't know how to bundle up their work and move it with their jobs. This section aims to instruct/remind you how to do that.
1) First, write and debug your macro (obviously this will be different for everyone).
For the purpose of this explanation, let us assume you have written a macro
named findTheQGP.C
2) Start preparing your scheduler XML. Now, instead of moving back to your home directory, write your command block as if everything you need is available right from where your job starts up (this area will be $SCRATCH).
<command>
stardev
root4star -q -b findTheQGP.C\(234,\"b\",\"$JOBID\",\"$FILELIST\"\)
</command>
3) Now think about what files your job needs to run. We are going to add a new block called <SandBox>, which will bring our macro and any additional files it depends on with us to $SCRATCH. Here is an example:
<SandBox installer="ZIP">
    <Package name="myQGPMacroFile">
        <File>file:./findTheQGP.C</File>
        <File>file:./StarDb/</File>
        <File>file:./.sl53_gcc432/</File>
    </Package>
</SandBox>
This block tells our job that we need to take with us the macro findTheQGP.C and the folder StarDb, as well as the whole tree ./.sl53_gcc432/ (which would likely contain local STAR code you may need and have compiled).
We did not use the full path, although we could have. The ./ means these files are in the directory we are submitting from (the submission description is relative to where you will submit your jobs). The installer="ZIP" tells the scheduler to zip up all these files and unzip them before our job starts. You can add as many files this way as you want. Don't forget to add any hidden files your job depends on; you can check your working directory for these with the command ls -a.
When submitting your jobs you will notice the message:
Zip Sandbox package myQGPMacroFile.zip created
where the name comes from the name="myQGPMacroFile" attribute.
You will also notice the zip file in your submitting directory:
[rcas6019] ~/temp/> ls -l myQGPMacroFile.zip
-rw-r--r-- 1 lbhajdu rhstar 3342 May 9 11:57 myQGPMacroFile.zip
It should be noted that after this point any change made to the original macro, code in your tree, etc., will NOT affect the running jobs. In other words, the package will not be zipped again for the set of jobs you have just submitted (think of it as a container preserving your work). So if you made an error, you should delete the zip file before trying another submission. That zip file is created in the directory from which you submit your jobs.
You can put an ls command in your command block to verify that all of your files have been moved. You will see, inside the job's log, the scheduler wrapper placing your files, and an ls command will show you the files, just in case you're not sure of the final configuration:
Archive:  myQGPMacroFile.zip
  inflating: /tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0/numberOfEventsList.C
   creating: /tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0/StarDb/
   creating: /tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0/StarDb/Calibrations/
   creating: … /tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0/StarDb/Calibrations/tpc/tpcGridLeak.20060509.120001.C
--------------------------------
STAR Unified Meta Scheduler 1.10.8
we are starting in $SCRATCH : /home/tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0
--------------------------------
total 8
-rw-r--r-- 1 lbhajdu rhstar 2689 Jul 6 2006 numberOfEventsList.C
drwxr-xr-x 3 lbhajdu rhstar 4096 May 9 11:57 StarDb
/tmp/lbhajdu/FC770BA9B391C27970BDE1312D8BA50D_0
total 8
4) There is one final item to take care of: getting our output files back from $SCRATCH. But don't worry, there is a block for that as well. Let us assume we only want root files back (picoDST, histograms ... whatever will be created by the job); then we could add something like this to our XML, and of course you can add as many of these as you like.
<output fromScratch="*.root" toURL="file:/star/data05/scratch/inewton/temp/" />
5) Besides the stream to the output files, there is another stream to handle: the log files. Redirecting this to the local disk provides the same benefits as writing the output files locally. 99% of the output comes from root4star, or whatever program is actually doing the work, an event generator for example. Simple Linux redirection is all that is needed here.
<command>
ls
stardev
root4star -q -b findTheQGP.C\(234,\"b\",\"$JOBID\",\"$FILELIST\"\) >& ${JOBID}.log
</command>
Note the redirection at the end of the root4star command. This takes everything from root4star's output and error streams and redirects it to the file ${JOBID}.log, which is located in the current working directory, in this case $SCRATCH. So this also has to be copied back once the job is done, and we can do that the same way as before:
<output fromScratch="*.log" toURL="file:/star/data05/scratch/inewton/" />
So let's put this all together to see what the final XML might look like:
<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="50">
    <command>
        ls
        stardev
        root4star -q -b findTheQGP.C\(234,\"b\",\"$JOBID\",\"$FILELIST\"\) >& ${JOBID}.log
    </command>
    <SandBox installer="ZIP">
        <Package name="myQGPMacroFile">
            <File>file:./findTheQGP.C</File>
            <File>file:./StarDb/</File>
        </Package>
    </SandBox>
    <input URL="catalog:star.bnl.gov?filetype=daq_reco_MuDst,storage=local" nFiles="400"/>
    <stderr URL="file:./shed$JOBID.error.out" />
    <stdout URL="file:./shed$JOBID.out" />
    <output fromScratch="*.MuDst.root" toURL="file:/star/data05/scratch/inewton/"/>
    <output fromScratch="*.log" toURL="file:/star/data05/scratch/inewton/"/>
</job>