Here I condense all the considerations on how to make the current file catalog evolve. I'll start with what we need to do, then move on to the implementation details of the integration with MAGDA.
The first assumption is that the catalog will keep its information in a MySQL database. We need to understand how well MySQL will hold up. Provided that replication functions properly (more details on this later), we can disregard the effect of reads on the master server. We first take a simple approach: a single master at BNL. The critical question is then how many writes the master can handle. So we propose:
first step: test how many writes per second a server like the one we will have in production can handle before failing. To do this, we can set up a database with the same structure as our current file catalog and have multiple processes, started from different machines, write to it concurrently. This tells us how many writes/sec are needed to bring down the server. We can then calculate the expected update frequency of the catalog and compare. If there is a factor of 10 between the worst-case scenario and the rate the server can sustain, we can consider ourselves safe.
If this is not the case, there are two possibilities: either proceed to design a more complex architecture, or use the solution anyway and lower the requirements. Of course, the latter is viable only if the gap is small.
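As a sketch of how each client in such a write stress test could be instrumented, here is a minimal rate-measurement loop. It uses Python's sqlite3 purely as a stand-in for MySQL, and the `files` table layout is invented for illustration; the real test would run many of these clients from different machines against the production-like MySQL server.

```python
import sqlite3
import time

def measure_insert_rate(conn, n_inserts=1000):
    """Time n_inserts single-row INSERTs and return the achieved writes/sec.
    The table loosely mimics a file-catalog entry (path, size, events);
    these names are illustrative, not the real catalog schema."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS files "
                "(id INTEGER PRIMARY KEY, path TEXT, size INTEGER, events INTEGER)")
    start = time.perf_counter()
    for i in range(n_inserts):
        cur.execute("INSERT INTO files (path, size, events) VALUES (?, ?, ?)",
                    ("/star/data/file%06d.root" % i, 1024, 500))
        conn.commit()  # commit per insert, mimicking independent client writes
    elapsed = time.perf_counter() - start
    return n_inserts / elapsed

conn = sqlite3.connect(":memory:")
rate = measure_insert_rate(conn, 1000)
print("sustained rate: %.0f writes/sec" % rate)
```

Running many such clients concurrently and raising the count until the server degrades gives the breaking point to compare against the expected catalog update frequency.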
I had two separate estimates, from Jerome and Jeff. Jeff assumed the population of the local disks to be static and focused on the online system: he took the DAQ rate and calculated the number of files per second produced by the DAQ and the farm, arriving at 2 files/second. Jerome, instead, had run into a concrete problem on this issue: when restoring files with the data carousel, it would send 25 queries almost simultaneously, and that once took down the server (the problem was solved by using the delay mechanism). We should therefore also test how many concurrent threads can write to the db. The worst case here is the file mover copying a file to every node: 250 files could be moved at the same time, and since the copies start together they might also finish together, causing 250 requests in less than one second.
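Putting these estimates next to the factor-of-10 criterion from the first step gives a quick back-of-the-envelope target for the stress test (all numbers are the ones quoted above):

```python
# Back-of-the-envelope comparison of the two estimates in the text.
STEADY_RATE = 2       # files/sec from DAQ + farm (Jeff's estimate)
BURST_REQUESTS = 250  # file mover: one copy finishing on every node at once
BURST_WINDOW = 1.0    # seconds ("250 requests in less than one second")
SAFETY_FACTOR = 10    # proposed margin between worst case and breaking point

burst_rate = BURST_REQUESTS / BURST_WINDOW      # worst-case writes/sec
required_capacity = burst_rate * SAFETY_FACTOR  # rate the server should survive

print("steady rate: %d writes/sec" % STEADY_RATE)
print("worst-case burst: %d writes/sec" % burst_rate)
print("server should tolerate: %d writes/sec" % required_capacity)
```

So the burst case, not the steady DAQ rate, dominates: the server would need to survive on the order of 2500 writes/sec to meet the factor-of-10 margin.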
second step: make a similar stress test on replication. The fact that we have been using MySQL replication reliably in a moderately static database environment doesn't mean it will be safe in a faster-evolving environment such as the file catalog. We should create a test-bed of one master and some slaves (the number being representative of how many slaves we will need in production) and stress the master as before with the worst-case expected load. Again, if the results are satisfactory, we can proceed. If not, we either change solution or lower the requirements.
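One way to quantify this test is to write timestamped heartbeat rows on the master under load and record when each row becomes visible on a slave. A minimal sketch of the lag bookkeeping, with simulated timestamps since no live servers are assumed (in the real test-bed each timestamp would come from SELECT-ing a heartbeat table on the respective server):

```python
import time

def replication_lag(master_ts, slave_ts):
    """Seconds between a heartbeat row being written on the master and the
    same row appearing on a slave."""
    return slave_ts - master_ts

def worst_lag(samples):
    """Maximum lag over a list of (master_ts, slave_ts) pairs collected
    during the stress run."""
    return max(replication_lag(m, s) for m, s in samples)

# Simulated run: three heartbeats, each seen on the slave 0.2-0.8 s later.
now = time.time()
samples = [(now, now + 0.2), (now + 1.0, now + 1.5), (now + 2.0, now + 2.8)]
print("worst lag: %.1f s" % worst_lag(samples))
```

If the worst lag stays bounded under the worst-case write load, replication can be considered safe for the catalog; a lag that grows without bound during the run is a failure of the test.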
Before making these two tests, there is little point in starting work on extending the catalog: MySQL might prove suitable only as a temporary solution. One should keep in mind, though, that MySQL can handle multiple inserts, updates and deletes in a single query, and that the delay mechanism can provide better performance. One should also try these features before declaring the test failed.
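To illustrate the multi-row feature: MySQL accepts many rows in one INSERT statement (`INSERT INTO files VALUES (...), (...), ...`), which saves one round trip and one statement parse per row compared to single-row inserts. A sketch using sqlite3's `executemany` as a stand-in, with an invented `files` table:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, size INTEGER)")

rows = [("/star/data/file%06d.root" % i, 1024) for i in range(5000)]

# One batched statement for many rows, committed once, instead of
# 5000 separate INSERTs each carrying its own overhead.
start = time.perf_counter()
conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
conn.commit()
batched = time.perf_counter() - start

print("batched insert of %d rows: %.3f s" % (len(rows), batched))
```

The same batching idea applies to the file mover's 250-node burst: grouping those updates into a few multi-row statements could keep the server well under its breaking point.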
There are some implementation details that must be addressed before the integration.
First of all, it is clear that there are two parts to a file catalog: the file description, with all the information that describes what a file contains (i.e. production, content type, number of events, ...), and the file location, which describes where all the copies of each file reside. While the file location information will be basically similar within STAR and ATLAS (and maybe many other GRID projects), the file description, or file metadata, will most probably differ. This information can't be separated into two different databases, because STAR (and maybe other experiments too) requires queries that depend on both the file description and the file location (e.g. select all the files that belong to production a and are resident on HPSS at site b).
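To make the requirement concrete, here is what such a combined query looks like when both parts live in one database. The schema is purely illustrative (table and column names are invented, sqlite3 stands in for MySQL); the point is that the example query resolves in a single join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_desc (
    file_id    INTEGER PRIMARY KEY,
    production TEXT,
    events     INTEGER
);
CREATE TABLE file_location (
    file_id  INTEGER REFERENCES file_desc(file_id),
    site     TEXT,
    storage  TEXT,   -- e.g. 'hpss', 'local_disk'
    path     TEXT
);
""")
conn.execute("INSERT INTO file_desc VALUES (1, 'P02gd', 500)")
conn.execute("INSERT INTO file_location VALUES (1, 'bnl', 'hpss', '/hpss/star/f1')")
conn.execute("INSERT INTO file_location VALUES (1, 'lbl', 'local_disk', '/data/f1')")

# "all files of production a resident on HPSS at site b" in one query:
hits = conn.execute("""
    SELECT d.file_id, l.path
    FROM file_desc d JOIN file_location l ON d.file_id = l.file_id
    WHERE d.production = 'P02gd' AND l.site = 'bnl' AND l.storage = 'hpss'
""").fetchall()
print(hits)  # -> [(1, '/hpss/star/f1')]
```

If description and location sat in separate databases, answering this would mean querying one side, shipping the ID list to the other, and filtering there, which is the query explosion discussed next.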
There is a tension here: should the file description be part of MAGDA or not? From a performance standpoint, it has to be: solving a query over information split between file description and file location would require many queries if the system were divided, and I doubt we can afford that. Another solution might be to add some extra fields to MAGDA, so that it can store all the possible information that ATLAS or STAR might want. This, however, means restricting the level of complexity one can give to the metadata: one might want to provide collections in the file descriptions (i.e. DataSets), user information about who published the files, and so on. This approach is likely not to work.
The clean way out of this, it seems to me, is to have the catalog provide two interfaces. One is used for data replication and movement: all the tools that work on a basic Logical File Name to Physical File Name mapping would use this interface. This is basically MAGDA as it is, and it is unaware of the file description part of the database.
The second, more complete, interface will provide access to both the file metadata and the file location parts of the database. This has to be, at some level, experiment specific. There can be commonalities across experiments, so some of the code and SQL scripts can be shared, but the complete database structure and full API will vary between experiments. Let's call the STAR interface created for this STARMAGDA.
There are a few concerns regarding the MAGDA implementation that we want to be more clear on.
The Logical File Name is what the prime table of MAGDA uses to pass from the file to the file locations. That is, the Logical File Name would be used to join with a file description table of STARMAGDA. The most logical candidate for a Logical File Name in the STAR catalog would be path + filename. This can be as long as 120 characters, so joining on it is not feasible. The FK used must be an ID number, otherwise performance problems will arise.
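A common way to get this (sketched below with invented table names, sqlite3 standing in for MySQL) is to store the long Logical File Name once in a lookup table and join everywhere else on a compact integer surrogate key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lfn_map (
    file_id INTEGER PRIMARY KEY,  -- compact surrogate key used for joins
    lfn     TEXT UNIQUE           -- path + filename, up to ~120 characters
);
CREATE TABLE file_location (
    file_id INTEGER REFERENCES lfn_map(file_id),  -- integer FK, not the LFN
    site    TEXT,
    path    TEXT
);
""")
lfn = "/star/data08/reco/P02gd/2001/st_physics_2269001_raw_0001.root"
conn.execute("INSERT INTO lfn_map (lfn) VALUES (?)", (lfn,))
fid = conn.execute("SELECT file_id FROM lfn_map WHERE lfn = ?", (lfn,)).fetchone()[0]
conn.execute("INSERT INTO file_location VALUES (?, 'bnl', '/hpss/star')", (fid,))

# Joins now compare small integers instead of 120-character strings.
row = conn.execute("""
    SELECT m.lfn, l.site FROM lfn_map m
    JOIN file_location l ON m.file_id = l.file_id
""").fetchone()
print(row)
```

The LFN is resolved once at the boundary of a query; every internal join and index then works on the integer key.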
The site in MAGDA is a mix of location and protocol. We need this information separated, since we have to answer queries such as: which files are stored on local disk on this machine? To do this efficiently, we shouldn't be forced to join on the site table over multiple entries, because that would force MySQL to read tables multiple times.
There is quite some information in MAGDA that is redundant. Just in the prime table you have the URI, and then, duplicated, the site, the path, the filename and the location (which can also be found in the site table). Reducing the amount of replicated data leads to smaller tables, which makes cache management in MySQL more efficient.
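A de-duplicated layout addressing both points above might look like the following sketch (table and column names invented, sqlite3 as a stand-in for MySQL): the site table splits location from protocol, the prime table keeps only foreign keys plus path and filename, and the URI is reconstructed at query time instead of being stored redundantly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    site_id  INTEGER PRIMARY KEY,
    host     TEXT,   -- location only
    protocol TEXT    -- kept separate, so 'local disk on host X' is one predicate
);
CREATE TABLE prime (
    file_id  INTEGER,
    site_id  INTEGER REFERENCES site(site_id),
    path     TEXT,
    filename TEXT
);
""")
conn.execute("INSERT INTO site VALUES (1, 'rcas6023.rcf.bnl.gov', 'file')")
conn.execute("INSERT INTO prime VALUES (42, 1, '/data/star', 'evts.root')")

# The URI is derived on demand, not duplicated in the prime table:
uri, = conn.execute("""
    SELECT s.protocol || '://' || s.host || p.path || '/' || p.filename
    FROM prime p JOIN site s ON p.site_id = s.site_id
""").fetchone()
print(uri)  # -> file://rcas6023.rcf.bnl.gov/data/star/evts.root
```

Each fact is stored exactly once; the prime table shrinks to a few fixed-size columns plus the path, which is what keeps it cache-friendly.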