Scheduler FAQ
First, you have to prepare your XML job description. Once your files are ready, you can type:
star-submit jobDescription.xml
where jobDescription.xml is the name of the file.
You can use one of the "cut and paste" examples. Of course, you still have to change the command line and the input files. I am sorry I couldn't prepare the exact file for you... :-)
Yes, this is normal. For every process a script and a list of files are created. It's something that will be fixed in the final version. You can delete them easily, since they all start with script* and fileList. Remember, though, to delete them _AFTER_ the job has finished.
Well, you shouldn't panic like that! You can send an e-mail to the scheduling mailing list, and somebody will help you.
In the comment at the beginning of each script there is the bsub command used to submit the job. You can copy and paste it onto the command line and execute it. Be sure you are in the same directory as the script.
You can use the name attribute of the job tag like this:
<job ... name="My name" ... >
In the command section you can put more than one command. You can actually put a csh script. So you can write:
<command>
starver SL02i
root4star mymacro.C
</command>
The file catalog is actually a separate tool from the scheduler. When you write a query, the get_file_list.pl command is used. So, the full documentation for the query is available in the file catalog manual. You will be interested in the -cond parameter, which is the one you are going to specify in the scheduler.
If you are asking this question, it's because you have been trying to submit something like:
<command>root4star -b -q doEvents.C\(9999,\"$FILELIST\"\)</command>
This won't work because doEvents interprets $FILELIST as an input file and not as a file list. But if you put @ before the filename, doEvents (and bfc.C, ...) will interpret the filename correctly. So you should have something like:
<command>root4star -b -q doEvents.C\(9999,\"@$FILELIST\"\)</command>
Before version 1.8.6, the job starts in the default location for the particular batch system.
If you are using LSF, jobs will execute in the directory from which you submit the job. This is the same directory where the scripts are created, and also the directory you should be in when resubmitting the jobs.
In version 1.8.6 and above, the default directory in which the job starts is defined by the environment variable $SCRATCH. This will most likely be a directory local to the worker node. The base directory path will be different for every site. The directory and all its contents will be deleted as soon as the job is finished. For this reason, never attempt to change this directory. All files that need to be saved must be copied back to some other directory.
You don't, but you can tell SUMS (a.k.a. the STAR Scheduler) about your job,
so that SUMS can choose the correct queue or modify the job to fit into an existing
queue.
Remember: SUMS has to work at different sites, where the queue names,
the number of queues and their limitations will differ. SUMS
knows these values through a configuration file maintained by facility or team
experts who know these limitations. Also, note that if the queue characteristics
are modified or new queues open up, these changes can be made transparently to
you while still allowing your jobs to take advantage of them immediately.
If you do not give SUMS any hint, it will be forced to make assumptions based on the provided properties of your job and choose a queue based on those assumptions. Queues typically have limitations on run time (CPU or wall clock), memory and storage space. If your job exceeds any of these limitations, the job may be killed by the batch system.
SUMS uses the filesPerHour attribute, which can be defined in the job tag, to figure out
how long a job will run. Here is an example of how you can define this:
<job filesPerHour="25">
This is a way of specifying that your job will be able to process on average
25 files in one hour. So a job with 50 files is assumed to run for two hours
and a job with one file is assumed to run for 2.4 minutes. If a smaller number
is used, like filesPerHour="0.25", this means a job with one file will
run for four hours and a job with two files will run for eight hours.
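The arithmetic above can be reproduced with a small Python sketch. It assumes the estimate is simply the number of files divided by filesPerHour, which is what the examples imply; the function name estimated_hours is made up for illustration and is not part of SUMS.

```python
def estimated_hours(n_files: int, files_per_hour: float) -> float:
    """Estimated run time in hours for a job with n_files input files.

    Assumption: run time = number of files / filesPerHour.
    """
    if files_per_hour <= 0:
        raise ValueError("files_per_hour must be positive")
    return n_files / files_per_hour


# The four cases mentioned in the text above.
for n_files, rate in [(50, 25), (1, 25), (1, 0.25), (2, 0.25)]:
    hours = estimated_hours(n_files, rate)
    print(f"{n_files} file(s) at filesPerHour={rate}: "
          f"{hours:g} h ({hours * 60:g} min)")
```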
SUMS has a preferred queue. This is usually the shortest, highest priority queue. So as the value of filesPerHour gets smaller, SUMS will try to group fewer files in one job to get it under the time limit of its preferred queue. This will result in a greater number of jobs.
If for some reason you want to force SUMS to send your jobs to a queue with a longer time limit, you can do this by limiting the number of files inside your job with minFilesPerProcess and maxFilesPerProcess. In particular, a combination of minFilesPerProcess and filesPerHour clearly indicates the minimum run time of a job. If this time exceeds the limit of the preferred queue, SUMS will try the second most preferred queue and so on until it finds a queue that meets the user-defined requirements. If no such queue exists, for example if the longest queue has a time limit of a week and your job requests 12 days, the job will not be submitted and SUMS will print a warning.
I recommend allowing a little bit of wiggle room between minFilesPerProcess and maxFilesPerProcess, because SUMS will complain if, for example, you ask for N files in each job and the number of files returned by the catalog is not divisible by N.
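The queue fallback described above can be pictured with the following Python sketch. It is only an illustration, not the actual SUMS logic: the queue names and time limits are invented, and the real values come from the site configuration file.

```python
from typing import List, Optional, Tuple

# Hypothetical queues in order of preference: (name, time limit in hours).
# Real names and limits are site specific and are NOT taken from SUMS.
QUEUES: List[Tuple[str, float]] = [
    ("short", 2.0),
    ("medium", 24.0),
    ("long", 168.0),  # one week
]


def pick_queue(min_files: int, files_per_hour: float,
               queues: List[Tuple[str, float]] = QUEUES) -> Optional[str]:
    """Return the first queue whose time limit covers the minimum job time.

    The minimum job time is assumed to be min_files / files_per_hour.
    Returns None when no queue is long enough (SUMS would refuse to submit).
    """
    needed_hours = min_files / files_per_hour
    for name, limit_hours in queues:
        if needed_hours <= limit_hours:
            return name
    return None


print(pick_queue(10, 25))     # 0.4 hours needed
print(pick_queue(100, 25))    # 4 hours needed
print(pick_queue(72, 0.25))   # 288 hours (12 days) needed
```

With these invented limits, the last call finds no queue, which corresponds to the case where a job requests 12 days and the longest queue allows one week.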
Please refer to the SUMS manual for defining other properties of your job such
as memory and storage space.
You can look at this example.
Check your .cshrc file. First of all, as Jerome says:
***********************************************
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
**** DO NOT SET LSF ENVIRONMENTS YOURSELF *****
***********************************************
Furthermore, you may want to have a look at this page
You can use get_file_list.pl to find out which values a particular keyword can have. For example:
[rcas6023] ~/> get_file_list.pl -keys filetype
daq_reco_dst
daq_reco_emcEvent
daq_reco_event
daq_reco_hist
daq_reco_MuDst
daq_reco_runco
daq_reco_tags
dummy_root
...
You can do the same for any keyword.
You might want to have a look at the star-submit-template. Here is an example.
This is an XML problem: '<', '&' and '>' are XML reserved, so you can't use them, but you can use 'something else' in their place. Use the following table for the conversion:
| < | &lt; |
| > | &gt; |
| & | &amp; |
So, for example, if you need to put the following in a query:
runnumber<>2280003&&2280004
you have to write:
runnumber&lt;&gt;2280003&amp;&amp;2280004
Yes, it doesn't look so nice... :-( There is unfortunately not much I can do...
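If you would rather not do the conversion by hand, a small script can produce the escaped form. The sketch below uses the escape function from Python's standard library, which replaces exactly these three reserved characters. It is only a convenience for preparing the text and has nothing to do with the scheduler itself.

```python
from xml.sax.saxutils import escape

# The query as you would naturally write it.
query = "runnumber<>2280003&&2280004"

# escape() replaces &, < and > with their XML entities.
escaped = escape(query)
print(escaped)  # runnumber&lt;&gt;2280003&amp;&amp;2280004
```

The printed string is what goes into the XML job description.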
I suggest you have a look at this site: it has a lot of tutorials about XML, HTML and other web technologies. It's at an entry level and it has a lot of examples.
You can find a list of the installed policies here.
Use $FILEBASENAME as the stdout file name. This only works in version 1.6.2 and up, and only if there is one output file or one file per process.
Consult the manual if you need an example.
Emphasis is always placed on generating detailed and meaningful user feedback at the prompt when errors occur. However, more information can be found in the user's log file. If the scheduler did not crash altogether, it will append to the user's log file, where the most detailed data about its internal workings can be found.
Every user who has ever used the scheduler, even just once, has a log file. It holds information about what the scheduler did with your job. To find out more about reading your log file click here.
Note: This only applies to version 1.7.5 and above, as resubmission was first built in at this version. If you submitted via an older scheduler version, the resubmit syntax will not work and you will not have a session file.
Note: When a job is resubmitted, the file query is not redone. The scheduler uses exactly the same files as the original job. Be careful when resubmitting old jobs, as the paths to the files may have changed.
When a request is submitted by the scheduler, it produces a file named [jobID].session.xml. This is a semi-human-readable file that contains everything you need to regenerate the .csh and .list files and to resubmit all or part of your job.
If you wish to resubmit the whole job, the command is (replace [jobID] with your job ID): star-submit -r all [jobID].session.xml
Example: star-submit -r all 08077748F46A7168F5AF92EC3A6E560A.session.xml
If you wish to resubmit a particular job number, the command is (where n is the job number): star-submit -r n [jobID].session.xml
To resubmit all of the failed jobs, the command is: star-submit -f all [jobID].session.xml
Type star-submit -h for more options and for help. This is only a very short overview of the resubmission options; many more are available.
Note: This only applies to version 1.7.5 and above, as resubmission was first built in at this version. If you submitted via an older scheduler version, the resubmit syntax will not work and you will not have a session file. It is also recommended that you read question 20, "How to resubmit jobs", as it has more information about the session file.
The command to kill all the jobs in the submission is (replace [jobID] with your job ID): star-submit -k all [jobID].session.xml
If you only wish to kill some of the jobs, replace the word all with a single job number (for job 08077748F46A7168F5AF92EC3A6E560A_4 the number is 4). A comma-delimited list may also be used (example: star-submit -k 5,6,10,22 [jobID].session.xml) or a range (example: star-submit -k 4-23 [jobID].session.xml).
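The resubmit and kill commands all have the same shape: a flag, a job selection and the session file. The Python sketch below only assembles the command line from those three parts, using the -r, -f and -k flags mentioned in this FAQ. The helper name star_submit_args is made up for illustration, and the sketch does not check which selections a given flag actually accepts (see star-submit -h for that).

```python
from typing import List


def star_submit_args(flag: str, selection: str, job_id: str) -> List[str]:
    """Build the argument list for a resubmit (-r), failed-job
    resubmit (-f) or kill (-k) request.

    selection is passed through unchanged, e.g. "all", "4",
    "5,6,10,22" or "4-23" as in the examples of this FAQ.
    """
    if flag not in ("-r", "-f", "-k"):
        raise ValueError(f"unexpected flag: {flag}")
    return ["star-submit", flag, selection, f"{job_id}.session.xml"]


job_id = "08077748F46A7168F5AF92EC3A6E560A"
for flag, selection in [("-r", "all"), ("-f", "all"), ("-k", "4-23")]:
    # Print the command instead of running it, so nothing is submitted or killed.
    print(" ".join(star_submit_args(flag, selection, job_id)))
```

The sketch prints the commands rather than executing them, so you can check them before pasting them at the prompt.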
Note: This only applies to version 1.7.5 and above.
This information is stored in the [jobID].report file in a nice neat table (I recommend you turn off word wrap to view this file). We are trying to put more and more information for users in this file with every new version. The file also stores information about queue selection. So it will probably answer such questions as "Why did my job have to go into the long queue as opposed to the short queue".
In a non-grid context, when you ask for data to be moved back from $SCRATCH
using a tag like this one:
<output fromScratch="*.root" toURL="file:/star/u/lbhajdu/temp/" />
it is translated into a cp command like the one below:
/bin/cp -r $SCRATCH/*.root /star/u/lbhajdu/temp/
If the cp command returns "/bin/cp: no match" it means it did not match
anything. This is because no files were generated by your macro for it to copy.
This can be verified by adding an "ls $SCRATCH/*" to the command block of your
job, right after your macro finishes, to list which files it has produced in
$SCRATCH.
Examine your macro carefully to see what it’s doing. It could be writing your
files to a different directory, or not writing any files at all or crashing
before it gets a chance to write anything.
In scheduler version 1.10.0c and above there is an option to have the scheduler
write the files somewhere other than your current working directory. A basic
example of this capability would be to create a subdirectory of your current
working directory and have the scheduler write everything there. To do this, first
create the directory [mkdir ./myFiles]. Then add these tags inside the body
of your XML file's job tag (that is, at the same level as the command tag):
<Generator><Location>./myFiles</Location></Generator>
These tags have many more fine-grained options, fully documented in section 5.1
of the manual.
In order to expand the scheduler functionality further and implement many of the changes people have been asking for, we will need to expand the XML job description. This means turning some of the XML attributes into <elements> so that sub-elements can be added later. We will try to be gentle: in version 1.10.0 the scheduler tells you how to upgrade your XML and make it functionally equivalent.
Use the nProcesses option to set the number of jobs you want to submit. Use filesPerHour to set the run time of the jobs. Note that because there are no input files, filesPerHour is interpreted as the number of hours for which the job will run.
Levente Hajdu - last modified this page