File Catalog integration
File catalog integration with the scheduler consists of the following points:
As of August 2002, we have started working on the first two points and expect to conclude them by October 2002. The third point will be approached only after the work on the first point has finished and user usage patterns are understood.
Other topics that are relevant are:
At present, there is already a file catalog. It consists of a MySQL database containing file description information (metadata, e.g. number of events in the file, run conditions), file information (e.g. file name, size) and file location information (e.g. site, node name, file path).
The DB is basically divided into two parts: the first holds all the information about the file itself, and the second all the information about its replicas (of which there can be more than one). The placement of some attributes in one category or the other seems critical for future integration with other GRID tools. The database schema can be found here. These are the points that I think are most notable:
On top of this, a Perl script that accesses the tables and executes queries is provided. This is the user interface of the catalog. The manual can be found here. All the keywords are defined in the script and are not passed directly to the database. The database structure is therefore well insulated, meaning that some modification of the database schema is possible. However, notice that fileData has 1.3 million entries and fileLocation has 1.9 million entries, so a careful migration path has to be developed for each modification.
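To make the two-part layout concrete, here is a rough sketch of the kind of query the Perl interface ends up running, using DBI. The database name, host, credentials and all table and column names beyond fileData and fileLocation (fileDataID, runNumber, fileName, site, filePath) are placeholders for illustration, not the actual schema.

#!/usr/bin/env perl
# Rough sketch only: database name, host, credentials and the column names
# (fileDataID, runNumber, fileName, site, filePath) are placeholders, not the
# real catalog schema.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    "DBI:mysql:database=FileCatalog;host=db.star.bnl.gov",   # placeholder host
    "reader", "password",
    { RaiseError => 1 },
);

# The file description part (fileData) is joined to the replica part
# (fileLocation); a keyword=value condition from the user interface becomes a
# WHERE clause built inside the script, never raw SQL typed by the user.
my $sth = $dbh->prepare(q{
    SELECT d.fileName, l.site, l.filePath
    FROM   fileData d
    JOIN   fileLocation l ON l.fileDataID = d.fileDataID
    WHERE  d.runNumber = ?
    LIMIT  10
});
$sth->execute(1234567);

while ( my ($name, $site, $path) = $sth->fetchrow_array ) {
    print "$path/$name ($site)\n";
}
$dbh->disconnect;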
The XML job description has to be extended to allow for catalog queries. The interface should map the keywords and syntax of the Perl command as closely as possible. Moreover, the query execution should give exactly the same result as the Perl script, so that the user can test the query before submitting the job and check whether the files it returns are the correct ones.
Given the generic command line
% get_file_list.pl [-all] [-distinct] -keys keyword[,keyword,...] [-cond keyword=value[,keyword=value,...]] [-start] [-limit ] [-delim ] [-onefile]
some of the options have no meaning in this context and can be removed.
Given these considerations, we can introduce a catalog query as a different description for an input entity. For example:
<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [start = value] [limit = value] />
More properties can be defined to enrich the possibilities. For example:
<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [nEvents = value] />
<input URL="catalog:star.bnl.gov?keyword=value[,keyword=value]" [nFiles = value] />
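To make the requirement of identical results concrete, here is a rough sketch of one possible translation, assuming the scheduler simply forwards the query to the existing get_file_list.pl interface. The keyword names used in the example (path, filename, production, filetype) are placeholders, not the real keyword set.

#!/usr/bin/env perl
# Rough sketch only: catalog_input_to_command() and the keyword names below
# are illustrative; the real keywords are whatever get_file_list.pl defines.
use strict;
use warnings;

# Turn the URL of <input URL="catalog:..."/> plus the optional start/limit
# attributes into an equivalent get_file_list.pl invocation, so the job gets
# exactly the list the user sees when testing the query by hand.
sub catalog_input_to_command {
    my (%input) = @_;    # URL, start, limit taken from the XML element
    my ($query) = $input{URL} =~ /^catalog:[^?]*\?(.*)$/
        or die "not a catalog URL: $input{URL}\n";

    my @cmd = ( "get_file_list.pl",
                "-keys", "path,filename",    # assumed output keys
                "-cond", $query );
    push @cmd, "-start", $input{start} if defined $input{start};
    push @cmd, "-limit", $input{limit} if defined $input{limit};
    return @cmd;
}

# Prints: get_file_list.pl -keys path,filename -cond production=P02gd,filetype=MuDst -limit 100
my @cmd = catalog_input_to_command(
    URL   => "catalog:star.bnl.gov?production=P02gd,filetype=MuDst",
    limit => 100,
);
print "@cmd\n";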
The crucial part is to decide how to execute the query. There are two different possibilities:
For a first implementation, it is probably better to use the second solution. In 1.0beta2, the catalog will be integrated with the syntax that allows start and limit.
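As an illustration of what the start and limit syntax buys us, here is a rough sketch of how a large query result could be split into fixed-size file lists, one per job. The fake catalog and the chunk size are placeholders, and whether the scheduler will split jobs exactly this way is still open.

#!/usr/bin/env perl
# Rough sketch only: the fake catalog stands in for a real catalog query, and
# the chunk size is arbitrary.
use strict;
use warnings;

my @fake_catalog = map { "file_$_.root" } 1 .. 1234;   # stand-in query result
my $limit = 500;                                       # assumed files per job
my $start = 0;
my $job   = 0;

# Same ordered query, different window each time: start advances by limit, so
# the union of all jobs covers the full file list exactly once.
while ( $start < @fake_catalog ) {
    my $end   = $start + $limit - 1;
    $end      = $#fake_catalog if $end > $#fake_catalog;
    my @files = @fake_catalog[ $start .. $end ];
    $job++;
    printf "job %d gets %d files (start=%d, limit=%d)\n",
           $job, scalar @files, $start, $limit;
    $start += $limit;
}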
Gabriele Carcassi