File Catalog integration

File catalog integration with the scheduler is:

allow the users to specify a file catalog query as input file
allow the scheduler to decide which copy of a particular file to use
allow the scheduler to trigger file replication, for example from HPSS to a target machine

As of August 2002, we start working on the first two point and we expect to conclude them by October 2002. The third point will be approached only after the work on the first point has finished, and user usage pattern are understood.

August 2002: what is present now

At present, there is already a file catalog. It consists of a MySQL database containing file description information (metadata - ex. number of events in the file, run conditions), file information (ex. file name, size) and file location information (ex. site, node name, file path).

The DB is basically divided into two parts: the first includes all the information about the file and the second all the information about the replica of the file (which could be more than one). The placement of some attributes in a category or the other seems critical for future integration with other GRID tools. The database schema can be found here. These are the point that I think to be most notable:

Metadata and file information are in the same table, fileData, which describes the properties of a file. Metadata is STAR dependent, and no common GRID tool will be able to offer such functionalities. File information, such as file name and file size, is STAR independent.
The fact that the fileData contains filenames and file sizes, means that a file can't change name on different locations and can't change size. The impact on the file size means that the same encoding has to be used (for example, a text file can't change from UNIX to Windows: one has to be chosen)
The file name is not right now unique. More files can have the same name.
The file path can change from location to location, since it is in the fileLocation table, which describes the information associated to a specific copy of a file.

On top of this, there is a Perl script that accesses the table and executes queries is provided. This is the user interface of the catalog. The manual can be found here. All the keywords are defined in the script, and are not passed directly to the database. The database structure is well insulated, meaning that some modification to the database schema is possible. However, notice that fileData has 1.3 million entries, and fileLocation has 1.9 million entries. Therefore for each modification a careful migration path has to be developed.

User interface extension for the job description

The XML job description has to be extended to allow for catalog queries. The interface should map as close as possible the keywords and the syntax of the perl command. Moreover, the query execution should give the same exact result that the perl script gives. This way the user can test the query before submitting the job, to check whether the files it returned are the correct ones.

Given the generic command line

% get_file_list.pl [-all] [-distinct] -keys keyword[,keyword,...]
     [-cond keyword=value[,keyword=value,...]] [-start ] [-limit ] [-delim ] [-onefile]

some of the options don't have meaning, and can be taken away.

-keys is useless. It's the scheduler that will ask for the result of the query.
-all and -distinct don't have meaning here. The scheduler will always be interested in the file location list. There can't be duplicates.
-cond is the essential part of the query, and has to be kept AS IT IS.
-start and -limit might be useful, though they should be integrated with other keywords. Typically, the user may want to get some data (10000 events, 2 datafiles...) with some characteristic. The scheduler, though, should be more clever and not pick at random (ex. get files that are located on a node that is idle)
-delim is useless. Again the result of the query goes to the scheduler
-onefile is useless. It's the internal policy of the scheduler that should decide which copy to get

Given these consideration, we can introduce a catalog query as a different description for an input entity. For example:

More properties can be defined to enrich the possibilities. For example:

Query implementation in the scheduler

The crucial part is to decide how to execute the query. There are two different possibilities:

Implement the query in the scheduler codebase. Advantages: uniform codebase for the scheduler, flexibility for the scheduler to make any desired query (the scheduler might want to make more sophisticated queries), easier to migrate to tools that might not be available in perl. Disadvantages: re-implement all the keywords (they don't map directly the DB), whenever a change is made to the perl script, a change has to be made to the scheduler also, assure consistency between the two interfaces.
Use the perl script as an external process. Advantages: perfect consistency between catalog and scheduler, much faster. Disadvantages: scheduler might want to make some queries that the perl script doesn't allow.

For a first implementation, it is probably better to use the second solution. In the 1.0beta2 the catalog will be integrated with the syntax that allows start and limit.

Gabriele Carcassi - page was last modified