File Catalog integration

File catalog integration with the scheduler has three goals:

  1. allow users to specify a file catalog query as an input file
  2. allow the scheduler to decide which copy of a particular file to use
  3. allow the scheduler to trigger file replication, for example from HPSS to a target machine

As of August 2002, we have started working on the first two points and expect to complete them by October 2002. The third point will be approached only after the work on the first point has finished and user usage patterns are understood.

Other topics that are relevant are:

  1. experiment with existing GRID middleware that does file catalog
  2. define an interface that allows the file catalog implementation to be changed

August 2002: what is present now

At present, there is already a file catalog. It consists of a MySQL database containing file description information (metadata - e.g. number of events in the file, run conditions), file information (e.g. file name, size) and file location information (e.g. site, node name, file path).

The DB is basically divided into two parts: the first holds all the information about the file, and the second holds all the information about the replicas of the file (there can be more than one). The placement of some attributes in one category or the other seems critical for future integration with other GRID tools. The database schema can be found here. These are the points that I consider most notable:

On top of this, a Perl script is provided that accesses the tables and executes queries. This is the user interface of the catalog. The manual can be found here. All the keywords are defined in the script and are not passed directly to the database. The database structure is therefore well insulated, meaning that some modification of the database schema is possible. Notice, however, that fileData has 1.3 million entries and fileLocation has 1.9 million entries, so a careful migration path has to be developed for each modification.
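The two-part layout described above (one record per file, one record per replica) can be pictured with a minimal sketch. The class and field names below are assumptions chosen to mirror the examples given in the text (file name, size, number of events, site, node name, file path); they are not the actual column names of fileData and fileLocation, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileLocation:
    """One replica of a file. Field names are hypothetical."""
    site: str
    node: str
    path: str


@dataclass
class FileData:
    """Replica-independent information about a file. Field names are hypothetical."""
    name: str
    size_bytes: int
    n_events: int
    # A file can have more than one replica.
    locations: List[FileLocation] = field(default_factory=list)


# Made-up example: one logical file with two replicas.
example = FileData(name="example.root", size_bytes=1024, n_events=500)
example.locations.append(FileLocation(site="siteA", node="node01", path="/data/example.root"))
example.locations.append(FileLocation(site="siteA", node="hpss", path="/archive/example.root"))
```

Keeping the replica list separate from the file record is what lets a scheduler pick one copy without touching the file-level metadata.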

User interface extension for the job description

The XML job description has to be extended to allow for catalog queries. The interface should map the keywords and the syntax of the Perl command as closely as possible. Moreover, executing the query should give exactly the same result as the Perl script. This way the user can test the query before submitting the job, to check that the files it returns are the correct ones.

Given the generic command line

% get_file_list.pl [-all] [-distinct] -keys keyword[,keyword,...]
     [-cond keyword=value[,keyword=value,...]] [-start ] [-limit ] [-delim ] [-onefile]

some of the options have no meaning within a job description and can be dropped.

Given these considerations, we can introduce a catalog query as a different description for an input entity. For example:

<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [start = value] [limit = value] />
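As an illustration of how such a URL could be interpreted, here is a minimal Python sketch that splits it into the catalog host and the keyword=value conditions, then builds the equivalent get_file_list.pl argument list so the query can be tested by hand. The function names are hypothetical, the keywords in the usage lines are placeholders, and only options that appear in the generic command line above (-keys, -cond, -start, -limit) are used. This is not the scheduler's actual implementation.

```python
from typing import Dict, List, Optional, Tuple


def parse_catalog_url(url: str) -> Tuple[str, Dict[str, str]]:
    """Split 'catalog:host?key=value,key=value' into (host, conditions)."""
    scheme, sep, rest = url.partition(":")
    if scheme != "catalog" or not sep:
        raise ValueError("not a catalog URL: %r" % url)
    host, _, query = rest.partition("?")
    conditions: Dict[str, str] = {}
    for pair in query.split(","):
        if not pair:
            continue  # tolerate an empty query or a trailing comma
        key, eq, value = pair.partition("=")
        if not eq or not key.strip():
            raise ValueError("malformed condition: %r" % pair)
        conditions[key.strip()] = value.strip()
    return host, conditions


def equivalent_command(conditions: Dict[str, str], keys: List[str],
                       start: Optional[int] = None,
                       limit: Optional[int] = None) -> List[str]:
    """Build the argument list of the matching get_file_list.pl call."""
    cmd = ["get_file_list.pl", "-keys", ",".join(keys)]
    if conditions:
        cmd += ["-cond", ",".join("%s=%s" % (k, v) for k, v in conditions.items())]
    if start is not None:
        cmd += ["-start", str(start)]
    if limit is not None:
        cmd += ["-limit", str(limit)]
    return cmd


# Placeholder keywords, not real catalog keywords.
host, cond = parse_catalog_url("catalog:star.bnl.gov?keywordA=value1,keywordB=value2")
command = equivalent_command(cond, ["keywordC"], start=0, limit=10)
```

Building the command from the same parsed conditions is one way to keep the job description and the command-line test consistent with each other.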

More properties can be defined to enrich the possibilities. For example:

<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [nEvents = value] />
<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [nFiles = value] />
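The text does not define the exact meaning of nEvents and nFiles. The sketch below assumes one plausible reading: nFiles caps the number of files taken from the query result, and nEvents stops adding files once the running event total reaches the requested value. The function name and the (file name, event count) tuple layout are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def apply_caps(files: Iterable[Tuple[str, int]],
               n_files: Optional[int] = None,
               n_events: Optional[int] = None) -> List[Tuple[str, int]]:
    """Select files in order until an assumed nFiles or nEvents cap is hit.

    Each item is (file name, number of events in the file).
    """
    selected: List[Tuple[str, int]] = []
    total_events = 0
    for name, events in files:
        if n_files is not None and len(selected) >= n_files:
            break  # enough files
        if n_events is not None and total_events >= n_events:
            break  # enough events already covered
        selected.append((name, events))
        total_events += events
    return selected


# Made-up query result.
result = [("a", 100), ("b", 200), ("c", 300)]
by_files = apply_caps(result, n_files=2)
by_events = apply_caps(result, n_events=250)
```

Under this reading the event cap is a lower bound: the last file selected may push the total past the requested number of events.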

Query implementation in the scheduler

The crucial part is to decide how to execute the query. There are two different possibilities:

For a first implementation, it is probably better to use the second solution. In version 1.0beta2, the catalog will be integrated using the syntax that allows start and limit.


Gabriele Carcassi - page was last modified