Current short-term plan
This plan will focus on these aspects:
- Provide a tool usable for year 2002-2003.
- Freeze the user interface for job submission for the years to come.
- Define a software architecture that allows migration to other GRID tools.
- Provide a user interface for scheduler policy management.
Provide something usable for year 2002-2003
The main focus of this work is to provide a simple tool that can be used
right away. Work at Wayne State University (WSU) has produced a scheduler
that builds on some of the functionality of LSF, the system currently used
for job submission.
The model for year 2001-2002 has been the following:
- The user would consult a catalog to get the file names of
the data set on which he should run his job. The data itself resides on NFS
(that is, a disk space that is mounted on all the machines of the farm).
- The user would submit a job to LSF, described by a command line. The
command line refers to a program that resides on NFS/AFS. The user typically
assumes that NFS is always available under the same path, and might even have
absolute pathnames in his code/scripts.
- No assistance is provided for managing job submission on multiple files,
and every user has developed his own way of doing that.
The new scheduler developed at WSU does the following:
- Allows the user to submit a job on a set of files that might be on NFS as
well as on the local disk of any node of the farm.
- Automatically splits the job so that it can be executed on different machines.
- Relies on LSF to actually dispatch the job.
- Assumes the submitted program resides on NFS/AFS, and that results will be put
on NFS.
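The splitting step listed above can be sketched as follows. This is a minimal illustration in Java, not the WSU implementation: the class names `InputFile`, `SubJob`, and `JobSplitter`, the `SHARED` marker for NFS files, and the `maxFiles` limit are all assumptions made for the example. Files are grouped by the node that holds them, and each group is cut into sub-jobs of at most `maxFiles` files, so that a sub-job can be sent to the node where its data already resides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical input file; host is JobSplitter.SHARED when the file lives on NFS. */
final class InputFile {
    final String host;
    final String path;

    InputFile(String host, String path) {
        this.host = host;
        this.path = path;
    }
}

/** Hypothetical sub-job: files to be processed together on one host. */
final class SubJob {
    final String host;
    final List<String> paths;

    SubJob(String host, List<String> paths) {
        this.host = host;
        this.paths = paths;
    }
}

final class JobSplitter {
    /** Marker (an assumption of this sketch) for files reachable from every node. */
    static final String SHARED = "NFS";

    /** Groups files by host, then cuts each group into chunks of at most maxFiles. */
    static List<SubJob> split(List<InputFile> files, int maxFiles) {
        if (maxFiles < 1) {
            throw new IllegalArgumentException("maxFiles must be at least 1");
        }
        // LinkedHashMap keeps hosts in the order in which they first appear.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (InputFile f : files) {
            byHost.computeIfAbsent(f.host, h -> new ArrayList<>()).add(f.path);
        }
        List<SubJob> subJobs = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byHost.entrySet()) {
            List<String> paths = e.getValue();
            for (int i = 0; i < paths.size(); i += maxFiles) {
                int end = Math.min(i + maxFiles, paths.size());
                subJobs.add(new SubJob(e.getKey(), new ArrayList<>(paths.subList(i, end))));
            }
        }
        return subJobs;
    }
}
```

With three files on one node, one file on NFS, and `maxFiles` set to 2, `JobSplitter.split` yields three sub-jobs: two for the node and one for the shared file. Each sub-job would then be handed to LSF for dispatching.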
Therefore the main task is to provide assistance in managing job submission on
multiple files, and to allow the files to reside on the local disks of the nodes
of the farm. This work has been carried out at Wayne State University by Vishist Mandapaka,
a student, under the supervision of Prof. Claude Pruneau. The
scheduler has the following strong points:
- It allows better management of disk resources.
- It uses the existing infrastructure, allowing the farm managers to draw on
their past years of experience.
Freeze the user interface for job submission for the years to come
The scheduler at WSU has a user interface different from that of pure LSF. We
have been reviewing this interface so that we can leave it
unchanged even if we change the underlying scheduler (e.g. if we move to
Condor or other tools).
Note that changing both the underlying scheduler and the user interface at the
same time would be traumatic for STAR. STAR users are already submitting jobs, and we cannot
expect them to change their scripts and habits overnight. A transition period has to be
allowed. The following has to be taken into consideration:
- During the transition period, neither users nor farm managers would have
experience with the new configuration.
- Users should be encouraged to migrate to the new interface, but they would
likely get worse support on the new platform, leading to frustration
and rejection of the new platform.
- A different user interface will most likely change usage patterns. Farm
managers would have to adapt to these new patterns on a platform they do not
yet know well, which adds to the difficulty.
By changing only the user interface and leaving the underlying scheduling
system unchanged (the Wayne scheduler works on top of LSF, which users
currently use directly), farm managers can help users modify their scripts and
tune the system according to the change in usage patterns. Afterwards the
actual scheduler can be changed while the interface is kept. This has the
following advantages:
- There are two separate transition periods: one for the users and a later one for
the farm managers.
- While the user interface is changing, farm managers will be able to help
users through the change. The level of support will be optimal in the new system,
and the old system will keep working in the same way. A higher priority will be given
to jobs using the new interface, to encourage people to migrate their
code. Farm managers can study how the usage patterns are changing and act
accordingly.
- When the underlying scheduling system changes, users will be less
likely to change their behavior, and farm managers can focus on the transition
with fewer unexpected problems.
Job description is done through an XML file, and the detailed description can
be found here.
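To give an idea of what reading such a job description involves, here is a minimal Java sketch. The XML layout it expects (a `job` element containing one `command` element and any number of `input` elements with a `URL` attribute) is an assumption made purely for illustration, as is the `JobDescription` class; the actual format is the one given in the detailed description. The parsing itself uses only the standard `javax.xml.parsers` and `org.w3c.dom` APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical in-memory form of a job description. */
final class JobDescription {
    final String command;
    final List<String> inputs;

    JobDescription(String command, List<String> inputs) {
        this.command = command;
        this.inputs = inputs;
    }

    /** Parses the assumed layout: one command element, zero or more input elements. */
    static JobDescription parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("cannot parse job description", e);
        }
        Element root = doc.getDocumentElement();
        NodeList commands = root.getElementsByTagName("command");
        if (commands.getLength() != 1) {
            throw new IllegalArgumentException("expected exactly one <command> element");
        }
        String command = commands.item(0).getTextContent().trim();

        // Collect the URL attribute of every input element, in document order.
        List<String> inputs = new ArrayList<>();
        NodeList inputNodes = root.getElementsByTagName("input");
        for (int i = 0; i < inputNodes.getLength(); i++) {
            inputs.add(((Element) inputNodes.item(i)).getAttribute("URL"));
        }
        return new JobDescription(command, inputs);
    }
}
```

Because users only ever write the XML, a reader like this is the single place that would need to change if the file format evolved, while the scheduler underneath could be replaced without touching user scripts.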
Define an architecture that allows migration to other GRID tools
The Wayne scheduler is designed as a stand-alone project, and not with the
GRID in mind. Since, as we said, this is a temporary solution, we need a
software architecture that allows us to integrate other components (e.g. the file
catalog, the database catalog, ...) and to change the underlying implementation
(e.g. Condor instead of LSF).
In order to do that, we created a modular architecture in which different
parts of the scheduler are decoupled, so that those functions can be assigned to
different GRID tools.
The scheduling process is divided into the following parts:
- Job validation: validates the job request. Parses the XML of the job
request and determines whether it is valid (e.g. program exists, file exists, user has
an account, ...)
- Policy: decides how to process the job. Determines whether the job has to be
cut into smaller jobs, and on which machines they will be executed. If a file catalog
is used, it may trigger copies directly or, better, use DAGMan to create a
job tree that includes file-moving jobs.
- Dispatcher: dispatches the job to the underlying scheduling method.
The proposed architecture is described in more detail here.
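The three-part decomposition can be sketched in Java as three small interfaces and a class that wires them together. This is only an illustration of the decoupling, under assumed names: `JobRequest`, `JobValidator`, `Policy`, `Dispatcher`, `Scheduler`, and their method signatures are hypothetical and are not taken from the proposed architecture itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory form of a job request, after the XML has been read. */
final class JobRequest {
    final String command;
    final List<String> inputFiles;

    JobRequest(String command, List<String> inputFiles) {
        this.command = command;
        this.inputFiles = Collections.unmodifiableList(new ArrayList<>(inputFiles));
    }
}

/** Part 1: rejects requests that cannot run by throwing IllegalArgumentException. */
interface JobValidator {
    void validate(JobRequest request);
}

/** Part 2: decides how a request is cut into sub-jobs. */
interface Policy {
    List<JobRequest> plan(JobRequest request);
}

/** Part 3: hands one sub-job to the underlying system (LSF, Condor, ...). */
interface Dispatcher {
    void dispatch(JobRequest subJob);
}

/** Wires the three parts together; it knows nothing about LSF or Condor. */
final class Scheduler {
    private final JobValidator validator;
    private final Policy policy;
    private final Dispatcher dispatcher;

    Scheduler(JobValidator validator, Policy policy, Dispatcher dispatcher) {
        this.validator = validator;
        this.policy = policy;
        this.dispatcher = dispatcher;
    }

    /** Validates the request, lets the policy split it, then dispatches each sub-job. */
    void submit(JobRequest request) {
        validator.validate(request);
        for (JobRequest subJob : policy.plan(request)) {
            dispatcher.dispatch(subJob);
        }
    }
}
```

In a layout like this, moving from LSF to Condor would mean supplying a different `Dispatcher`, and adding file-catalog queries would mean supplying a different `Policy`; `Scheduler.submit` and the user-facing job description would stay the same.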
Provide a user interface for scheduler policy management
One thing the farm managers will want to be able to change is the policy
according to which a job is associated with its input files, divided into
sub-jobs, and dispatched to the machines. In the future, the policy manager will be able to:
- Change the sub-job execution length by assigning more or fewer data files
to a sub-job (a trade-off between execution overhead and the CPU time lost if
the sub-job hangs)
- Integrate scheduling decisions with queries to the file catalog.
- Change the way a job is assigned a local data file (on a local node) or a
network data file.
- Trigger automatic file movement connected with the file catalog (get a
file from HPSS, move a file to a local machine...)
To provide this, the proposed architecture isolates the policy behavior behind a
Java interface. Because Java can load a class by its name at run time, the
policy can be changed without having to redeploy. This allows the policy manager
to test a new policy, introduce it gradually, and go back to the previous policy
without disruption.
The details on the Policy class can be found here.
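Loading a policy by class name can be sketched with standard Java reflection (`Class.forName`, `asSubclass`, `getDeclaredConstructor().newInstance()`). The `Policy` interface below is a stand-in reduced to a single method, and `OneFilePerJobPolicy` and `PolicyLoader` are hypothetical names; the real Policy class is the one described in the details referenced above.

```java
/** Stand-in for the scheduler's policy interface, reduced to one method. */
interface Policy {
    String name();
}

/** Example policy; the class name is hypothetical. */
final class OneFilePerJobPolicy implements Policy {
    @Override
    public String name() {
        return "one file per sub-job";
    }
}

final class PolicyLoader {
    /**
     * Instantiates the policy whose class name is given, for example a name
     * read from a configuration file. The class needs a no-argument constructor.
     */
    static Policy load(String className) {
        try {
            Class<? extends Policy> type =
                    Class.forName(className).asSubclass(Policy.class);
            return type.getDeclaredConstructor().newInstance();
        } catch (ClassCastException e) {
            throw new IllegalArgumentException(className + " does not implement Policy", e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load policy " + className, e);
        }
    }
}
```

A call such as `PolicyLoader.load(OneFilePerJobPolicy.class.getName())` returns a new policy instance. Switching policies, or going back to the previous one, then amounts to changing the configured class name, provided the class is on the classpath.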
Gabriele Carcassi - page was last modified