The FileCatalog database contains information about all files used
in production for the STAR experiment. The database itself and the
Perl module used to manipulate it are described later
in this document. This section describes the command
line utility used to access data in this database.
The utility is
called get_file_list.pl. If issued without any arguments, it
prints the following usage message:
% get_file_list.pl [-all] [-distinct] -keys keyword[,keyword,...] [-cond keyword=value[,keyword=value,...]] [-start #] [-limit #] [-delim $St] [-onefile] [-o outputfile]
The command line options are described below:
| -all | Use all entries regardless of the availability flag. |
| -distinct | Return only distinct rows. The default is to allow repeated values. Note that a selection such as node,path,filename,storage returns a set of values with no duplicates anyway; other queries may return duplicate entries. |
| -keys | Specify what data you want to get from the database. A comma-separated list of valid keywords should follow this parameter. See the examples for clarification, and the description of aggregate functions for more sophisticated queries. |
| -cond | Specify the conditions limiting the returned dataset. A comma-separated list of valid expressions (each consisting of a valid keyword, a valid operator and a value) should follow this parameter. Since some of the operators are special characters for the shell, the list of expressions should always be enclosed in single quotes. Again, see the examples. |
| -onefile | A special mode of operation: returns a list of files, but gives only one location (the one with the highest persistence) for each file, even if the database has many. |
| -start # | Start from record number # (default: start with the first record). Together with -limit, this can be used to get the data in chunks. |
| -limit # | Limit the number of records returned (default: 100). |
| -delim $St | Specify the characters that separate the fields in the output (default: "::"). |
The following operators can be used in -cond expressions. Where given, the last column is the type of value the operator applies to.
| <= | Not greater than | |
| < | Less than | |
| >= | Not less than | |
| > | Greater than | |
| <> | Not equal to | |
| != | Not equal to | |
| = | Equal to | |
| !~ | Not containing (i.e. does not match) | strings |
| ~ | Containing (i.e. approximately matching) | strings |
| [] | In range | |
| ][ | Outside the range | |
| % | Modulo | integer |
| %% | Not modulo | integer |
The following logical operators can also be used in a query. They apply within a -cond expression, in the form keyword=Value1 LogicalOperator Value2 {LogicalOperator Value3 ...}
| || | Logical OR | Strings or numbers |
| && | Logical AND | Strings or numbers |
Note that the logical AND
operator will return an empty selection in most cases (for example, a
runnumber cannot be equal to Value1 and Value2 at the same time). It was
added for later extensions of the database: selections based on meta
data such as triggered events (many triggers in a file) would be a case
where this operator is useful.
The following special aggregate functions can be used in conjunction with any keyword that describes some data. Note that most of them only make sense for numerical values. See the examples for how to use them.
| sum | The sum of the values |
| avg | The average of the values |
| min | The minimum of the values |
| max | The maximum of the values |
| orda | Sort the output in ascending order by this keyword |
| ordd | Sort the output in descending order by this keyword |
| count | The count for a given selection |
| grp | Group the output: put all the records with the same value for a given keyword together. This is required whenever an aggregate function is used together with other keywords. |
Below are a few examples of how to use the get_file_list.pl command. Each example starts from a common question a user may have and shows how to get the answer.
The database contains files from many productions. What are they, and what is their description?
% get_file_list.pl -keys 'production,prodcomment'
The output should be:
simulation::
raw::
MDC4::
MDC4::
MDC4_test::
MDC4::
P00hd::
P00he::
P00hg::
P00hi::
P00hk::
P00hi::
P00hm::
P01he::
P01hf::
P01hg::
P01hi::
P01gk::
P02gd::
P02ge::
P02gc::
P01gl::
P02ga::
P02gb::
P02gf::
P02gg::
P02gh2::
P02gx::
P02gi1::
P02gi2::
n/a::
P03ii::
P03gi::
P03ia::
P03ib::
P03ib::
Using this kind of simple query, one can get the list of valid values
for most keywords (e.g. storage,
site, runtype, library). In our examples, we enclose the
keyword and condition strings in single quotes. Although most queries
are accepted without quotes, some of the syntax uses characters
that the shell treats as special. Single quotes around each string
prevent the shell from interpreting those characters.
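The effect of the quotes can be checked without querying the catalog. The sketch below is written for a POSIX shell (sh or bash) and uses a hypothetical helper, show_args, which only prints the arguments it receives; it stands in for get_file_list.pl and is not part of the FileCatalog tools. The condition runnumber>=2242040 is an illustration built from the runnumber keyword and the >= operator, not a query taken from this document.

```shell
# show_args is a hypothetical stand-in for get_file_list.pl: it prints
# each argument it receives on its own line, in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Inside single quotes, '>' reaches the program as part of the condition.
# Without the quotes, the shell would treat '>' as an output redirection.
show_args -keys 'path,filename' -cond 'runnumber>=2242040,filetype=daq_reco_dst'
```

The last line printed is [runnumber>=2242040,filetype=daq_reco_dst], i.e. the whole condition list arrives as a single, unmodified argument.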
The
production,prodcomment query above returns the value MDC4 several times
for production. This is because the value is found multiple
times in the catalog, but under different conditions. To eliminate
the confusion, consider the following query instead:
% get_file_list.pl -keys 'production,library,prodcomment'
The output should be:
simulation::sim::
raw::raw::
MDC4::trs2pp::
MDC4::trsY2::
MDC4_test::trs2y::
MDC4::trsY2v::
P00hd::SL00d::
P00he::SL00e::
P00hg::SL00g::
P00hi::SL00i::
P00hk::SL00k::
P00hi::trs1i::
P00hm::SL00m::
P01he::SL01e::
P01hf::SL01f::
P01hg::SL01g::
P01hi::SL01i::
P01gk::SL01k::
P02gd::SL02d::
P02ge::SL02e::
P02gc::SL02c::
P01gl::SL01l::
P02ga::SL02a::
P02gb::SL02b::
P02gf::SL02f::
P02gg::SL02g::
P02gh2::SL02h::
P02gx::SL02x::
P02gi1::SL02i::
P02gi2::SL02i::
n/a::n/a::
P03ii::SL02i::
P03gi::SL02i::
P03ia::SL03ia::
P03ib::SL02i::
In the above, we see that the key pair production,library is unique while production alone is not. This is because the two keywords are intimately related, and the library keyword was used to differentiate two production run conditions in this older production version. This is a typical case where the -distinct option comes to the rescue.
% get_file_list.pl -keys 'production,prodcomment' -distinct
The output should be:
simulation::
raw::
MDC4::
MDC4_test::
P00hd::
P00he::
P00hg::
P00hi::
P00hk::
P00hm::
P01he::
P01hf::
P01hg::
P01hi::
P01gk::
P02gd::
P02ge::
P02gc::
P01gl::
P02ga::
P02gb::
P02gf::
P02gg::
P02gh2::
P02gx::
P02gi1::
P02gi2::
n/a::
P03ii::
P03gi::
P03ia::
P03ib::
This time, each value appears only once.
Building on the example above, how can the list be ordered?
% get_file_list.pl -keys 'orda(production),prodcomment' -distinct
The output should be:
MDC4::
MDC4_test::
n/a::
P00hd::
P00he::
P00hg::
P00hi::
P00hk::
P00hm::
P01gk::
P01gl::
P01he::
P01hf::
P01hg::
P01hi::
P02ga::
P02gb::
P02gc::
P02gd::
P02ge::
P02gf::
P02gg::
P02gh2::
P02gi1::
P02gi2::
P02gx::
P03gi::
P03ia::
P03ib::
P03ii::
raw::
simulation::
What are the file types coming from a given production?
% get_file_list.pl -distinct -keys 'production,filetype' -cond 'library=SL01g'
The output should be:
P01hg::daq_reco_tags
P01hg::daq_reco_runco
P01hg::daq_reco_hist
P01hg::daq_reco_event
P01hg::daq_reco_dst
This lists the valid file types for a given library. Note how the -cond parameter is used. We also used -distinct in this query to ensure that unique key pairs are returned.
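Because each record is a list of fields joined by the delimiter, the output is easy to post-process with standard tools. The following is a minimal sketch for a POSIX shell, assuming the default "::" delimiter; second_field is a hypothetical helper name, and the sample input consists of three records copied from the output above rather than a live query.

```shell
# Print the second "::"-separated field of each record read from stdin
# (for a 'production,filetype' query, this is the file type).
second_field() {
    awk -F'::' '{ print $2 }'
}

# Three records copied from the output above, used as sample input.
printf '%s\n' \
    'P01hg::daq_reco_tags' \
    'P01hg::daq_reco_runco' \
    'P01hg::daq_reco_hist' |
    second_field
```

In real use, the printf command would be replaced by the get_file_list.pl invocation itself, piped into second_field.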
Where are the files of a specific type coming from a given production?
% get_file_list.pl -keys 'path,production,filetype' -cond 'library=SL01g,filetype=daq_reco_dst' -distinct
The output should be:
/home/starreco/reco/P01hg/2001/242::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/238::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/244::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/239::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/240::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/249::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/236::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/251::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/237::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/252::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/254::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/257::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/258::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/259::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/260::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/261::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/262::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/263::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/267::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/268::P01hg::daq_reco_dst
/home/starreco/reco/P01hg/2001/270::P01hg::daq_reco_dst
Note that there may be several conditions, but only one per keyword. Also note that you should NOT assume that the output is ordered (relying on this would be a rather fatal mistake at the Perl module interface level).
Give me a "ready to use" list of files of a given type from a specific production.
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=daq_reco_dst' -delim '/'
The output should be:
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0001.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0002.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0003.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0004.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0005.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0006.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0007.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0008.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0009.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0010.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0011.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0012.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0013.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0014.dst.root
/home/starreco/reco/P01hg/2001/242/st_physics_2242040_raw_0015.dst.root
...
Note that by using "/" as the delimiter between path and filename, you get a "ready to use" list of files with their paths. A few remarks: the list is long, but by default the interface returns only 100 records. To get the full list (not advisable to do all the time; there are more than 2 million records in our catalog), use the command line option -limit 0. Any value <= 0 returns the full list, while a positive number makes the interface return that number of records.
Give me a "ready to use" list of files of a given type from a specific production, but only one location for each file.
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=daq_reco_dst' -delim "/" -onefile
The output should be:
The field list is path,filename
The conditions list is library=SL01g,filetype=daq_reco_dst
/home/starreco/reco/P01hg/2001/238/st_physics_2238001_raw_0105.dst.root
/home/starreco/reco/P01hg/2001/238/st_physics_2238006_raw_0092.dst.root
/home/starreco/reco/P01hg/2001/238/st_physics_2238009_raw_0085.dst.root
/home/starreco/reco/P01hg/2001/238/st_physics_2238009_raw_0228.dst.root
/home/starreco/reco/P01hg/2001/238/st_physics_2238013_raw_0020.dst.root
/home/starreco/reco/P01hg/2001/239/st_physics_2239035_raw_0001.dst.root
/home/starreco/reco/P01hg/2001/239/st_physics_2239041_raw_0068.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240003_raw_0030.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240003_raw_0160.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240003_raw_0283.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240007_raw_0100.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240009_raw_0102.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240013_raw_0045.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240013_raw_0175.dst.root
/home/starreco/reco/P01hg/2001/240/st_physics_2240013_raw_0300.dst.root
/home/starreco/reco/P01hg/2001/249/st_physics_2249008_raw_0006.dst.root
/home/starreco/reco/P01hg/2001/249/st_physics_2249009_raw_0027.dst.root
/home/starreco/reco/P01hg/2001/249/st_physics_2249009_raw_0154.dst.root
/home/starreco/reco/P01hg/2001/236/st_physics_2236019_raw_0027.dst.root
...
The -onefile option may be useful when you have several
copies of each file but want the location of only one of
them for each file (e.g. you need to process every MuDST, but each
of them only once). The FileCatalog
has been designed to account for multiple instances, or replicas, of
each file, and by default it may return multiple instances of a
file.
Since a file may exist in several locations, the
-onefile command line
option was introduced.
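To see whether a "ready to use" list still contains several locations for the same file, one can compare the file names (the last path component) on the client side. This is a minimal sketch for a POSIX shell, not a replacement for -onefile: it only reports duplicates and knows nothing about persistence. The helper name duplicate_names and the /hypothetical/copy path are made up for the illustration; the other two paths are copied from the output above.

```shell
# Print every file name (last path component) that occurs more than once
# in the list of full paths read from stdin.
duplicate_names() {
    awk -F/ '{ print $NF }' | sort | uniq -d
}

# Two paths from the output above plus one hypothetical second copy.
printf '%s\n' \
    /home/starreco/reco/P01hg/2001/238/st_physics_2238001_raw_0105.dst.root \
    /home/starreco/reco/P01hg/2001/238/st_physics_2238006_raw_0092.dst.root \
    /hypothetical/copy/st_physics_2238001_raw_0105.dst.root |
    duplicate_names
```

For this sample input, the only name reported is st_physics_2238001_raw_0105.dst.root. A list produced with -onefile should yield no output at all.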
Count number of events in files in a given directory
% get_file_list.pl -keys 'grp(path),sum(events)' -cond 'production=P02gc,filetype=daq_reco_event,storage=NFS'
The output should be something like:
/star/data06/reco/productionCentral600/FullField/P02gc/2001/313::54976
/star/data06/reco/productionCentral600/FullField/P02gc/2001/318::65275
/star/data06/reco/productionCentral600/FullField/P02gc/2001/319::67357
/star/data06/reco/productionCentral600/FullField/P02gc/2001/320::91600
/star/data06/reco/productionCentral600/FullField/P02gc/2001/321::42974
/star/data06/reco/productionCentral600/ReversedFullField/P02gc/2001/324::90226
/star/data06/reco/productionCentral600/ReversedFullField/P02gc/2001/325::63604
/star/data06/reco/productionCentral600/ReversedFullField/P02gc/2001/326::19403
/star/data06/reco/productionCentral600/ReversedFullField/P02gc/2001/327::48130
/star/data24/reco/productionCentral1200/FullField/P02gc/2001/318::2079
/star/data24/reco/productionCentral1200/FullField/P02gc/2001/319::3513
/star/data24/reco/productionCentral1200/FullField/P02gc/2001/320::20709
/star/data24/reco/productionCentral1200/FullField/P02gc/2001/321::35001
/star/data24/reco/productionCentral1200/FullField/P02gc/2001/322::1006
/star/data24/reco/productionCentral1200/ReversedFullField/P02gc/2001/324::8985
/star/data24/reco/productionCentral1200/ReversedFullField/P02gc/2001/326::7646
/star/data24/reco/productionCentral1200/ReversedFullField/P02gc/2001/327::7390
/star/data24/reco/productionCentral1200/ReversedFullField/P02gc/2001/328::3116
/star/data24/reco/productionCentral600/FullField/P02gc/2001/322::1244
/star/data24/reco/productionCentral600/FullField/P02gc/2001/323::4724
/star/data25/reco/ProductionMinBias/FullField/P02gc/2001/269::102989
/star/data25/reco/ProductionMinBias/FullField/P02gc/2001/270::296570
/star/data25/reco/ProductionMinBias/FullField/P02gc/2001/313::53934
/star/data25/reco/ProductionMinBias/ReversedFullField/P02gc/2001/269::143083
/star/data25/reco/ProductionMinBias/ReversedFullField/P02gc/2001/310::37807
/star/data25/reco/ProductionMinBias/ReversedFullField/P02gc/2001/311::20774
/star/data26/reco/Central/FullField/P02gc/2001/288::28445
/star/data26/reco/productionCentral/FullField/P02gc/2001/311::50089
/star/data26/reco/productionCentral/FullField/P02gc/2001/312::41030
/star/data26/reco/productionCentral/FullField/P02gc/2001/313::97691
/star/data26/reco/productionCentral/ReversedFullField/P02gc/2001/310::42260
/star/data26/reco/productionCentral/ReversedFullField/P02gc/2001/311::75828
The above output is only an example. Our query used
storage=NFS, and what we have
on disk may change with time.
The use of the aggregate function
sum() requires, in this case,
the use of the aggregate qualifier grp().
This is because sum(events) alone
does not fully specify what you want (the sum by
directory? By trigger setup? ...). In this specific example, the
data is grouped according to the directory name, and for each
group the sum of the number of events is calculated.
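The per-directory sums can be combined further on the client side, for example into a grand total over a set of directories. The sketch below is for a POSIX shell and assumes the default "::" delimiter with the event sum in the second field; total_events is a hypothetical helper name, and the sample input consists of the first three records of the output above.

```shell
# Add up the second "::"-separated field (the per-directory event sum)
# of every record read from stdin and print the total.
total_events() {
    awk -F'::' '{ total += $2 } END { print total + 0 }'
}

# The first three records of the output above, used as sample input.
printf '%s\n' \
    '/star/data06/reco/productionCentral600/FullField/P02gc/2001/313::54976' \
    '/star/data06/reco/productionCentral600/FullField/P02gc/2001/318::65275' \
    '/star/data06/reco/productionCentral600/FullField/P02gc/2001/319::67357' |
    total_events
```

For these three records the total is 54976 + 65275 + 67357 = 187608.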
Give me a "ready to use" list of files of a given type from a specific production, in batches of 1000
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=MC_reco_dst' -delim "/" -start 0 -limit 1000
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=MC_reco_dst' -delim "/" -start 1000 -limit 1000
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=MC_reco_dst' -delim "/" -start 2000 -limit 1000
% get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=MC_reco_dst' -delim "/" -start 3000 -limit 1000
...
Note the combination of the -start and -limit options, used to divide the output into batches.
There are many more examples we could give, but getting familiar with the keywords, operators and command line options and understanding the basics described above is a good starting point for trying queries of your own.
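The sequence of batch commands can be generated rather than typed by hand. The following is a dry-run sketch for a POSIX shell: batch_commands is a hypothetical helper that only prints the command lines, and the total of 3000 records is an assumed figure, since the real number of matching records is not known in advance.

```shell
# Print (without running) one get_file_list.pl command line per batch.
# $1 = assumed total number of records, $2 = batch size.
batch_commands() {
    total=$1
    limit=$2
    start=0
    while [ "$start" -lt "$total" ]; do
        printf "get_file_list.pl -keys 'path,filename' -cond 'library=SL01g,filetype=MC_reco_dst' -delim '/' -start %d -limit %d\n" "$start" "$limit"
        start=$((start + limit))
    done
}

batch_commands 3000 1000
```

This prints three command lines, with -start 0, 1000 and 2000. A script that actually runs the queries could instead keep increasing -start until a batch comes back with fewer lines than the limit.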
******* GREEN SECTIONS BELOW ARE UN-CHECKED PARAGRAPHS *******
This document briefly describes the FileCatalog database and the Perl module "FileCatalog", which can be used to access and manipulate data in this database.
The following diagram shows the structure of the database. It contains a number of "dictionary" tables: the ones that do not reference any other tables and usually hold only one useful field. They are:
FileTypes
StorageTypes
StorageSites
ProductionConditions
TriggerSetups - contains the online trigger setup name (which is itself a reference to a collection of triggers). The details of the trigger composition are held by the TriggerCompositions table.
TriggerCompositions - holds data about the number of events collected with a specific TriggerWord that are included in a given file.
TriggerWords - holds collections of trigger words.
RunTypes
There are also special tables:
EventGenerators - holds the data about various event generators and their simulation parameters.
SimulationParams - parameters pertaining to a specific simulation run.
CollisionTypes - the type of particles colliding and their energy. Also holds special values like "cosmic".
DetectorConfiguration - holds the detector configuration used to collect specific data. For simulation, it holds the geometry version used.
These are the three main tables holding data about the files and run parameters, and binding all the other tables together:
RunParams - parameters of a single physics run. Each run is identified by its run number.
FileData - data about a specific file, without any information about its physical location. Each file is identified by a combination of a file name, file type, file sequence, production conditions and the run number it is connected to. A FileData record can be connected to the run number it comes from.
FileLocations - data describing a specific file location and its physical storage: the file path, the owner, the protection etc.
In general, the database is supposed to be hidden from the user. It is to be modified only through the Perl module and the corresponding functions. Manual changes to the database should not, in principle, be necessary.
The FileCatalog
Perl module is intended to provide access to the database, both for
data querying and retrieval and for data insertion and
modification.
It is based on the concept of setting keywords
within a persistent context. The user first sets a context using
a set of keywords with the desired values (or conditions), and then
uses special commands to get/insert/delete/modify the data in the
database. All subsequent operations are made according to the
context. In other words, the context consists of keywords, e.g.
"filename", "path",
"storage", that may
have values assigned to them. By calling the method
set_context("storage =
HPSS"), we say that from now on we will only
consider operations on files whose storage is HPSS. Consecutive calls
to this method will only refine the context.
The following subroutines are available
in the FileCatalog module:
new() : create a new FileCatalog object, on which all the following operations are carried out. It is necessary to issue this command before any other operation with the module.
connect([$user,$passwd]) : connect to the FileCatalog database. If the user and password are unspecified, the connection is made with read-only access. You MUST specify the management user and password if you want to insert records.
connect_as($id) : id being a string identifier, connect to the database using the scope 'id'. Information about accessing the database using the scope 'id' must be described in the XML connection interface of the FileCatalog.
get_connection($id) : returns the connection information within the scope 'id'. The output of this routine can be used
destroy() : destroy the object and disconnect from the FileCatalog database
set_context("$kwd = $value"[,"$kwd = $value"...]) : set context keywords to the given operator and value
get_context("$kwd") : get the context value connected to a given keyword $kwd
clear_context() : clears the context (that is, forgets about everything)
debug_off() : turns all kinds of debugging OFF
debug_on([mode]) : turns debugging ON (the default is OFF). mode is optional and may have the following values:
Default is plain text debugging syntax
html outputs the debugging messages in HTML format
cmth outputs the debugging messages as HTML comments
set_silent($mode) : sets silent mode, i.e. tells the module NOT to display any informational messages. Messages related to errors are still displayed. You should not use this while you debug your code.
set_delimeter($string) : sets the keyword/field delimiter to $string. By default, the delimiter is "::".
get_delimeter() : returns the current delimiter string.
set_delayed() : sets the delay mode for any action you plan to perform on the catalog. Actions include inserts of new records, updates, deletions etc.
print_delayed([$flag]) : prints to the screen the SQL commands that would otherwise have been executed internally. The set_delayed() method must be called prior to using it. It also resets the delay mode state, which means that you have to call set_delayed() again after you use this method. $flag, a logical value, if set to TRUE, will print an additional message on the screen (the number of commands and the time) before printing the SQL commands.
flush_delayed([$flag]) : flushes all commands held by the set_delayed() method, i.e. executes them and resets the delay state. As for the print_delayed() method, you must call set_delayed() again after this method if you want to continue to use the delayed mode. $flag, a logical value, if set to TRUE, will print an additional message on the screen (the number of commands and the time) before executing the SQL commands.
The following methods require a connection to the database and are meant to be used outside the module
First the methods of everyday use:
get_keyword_list() : returns as an array the list of available keywords
insert_file_location() : after the whole context has been set, this method inserts the file location record. If it does not find the corresponding file data or run parameters, it inserts the necessary information into the database. It also ensures data integrity and refuses to insert a record if some critical data was not specified in the context. For a new entry to appear, the combination of storage, path, node and site needs to be different.
run_query() : gets the data from the database. As parameters to this procedure, the user gives a list of keywords. The returned data is either an undefined value (if the query fails) or an array of strings. Each string corresponds to one record in the database and is the concatenation, with "::", of all the requested keywords. Example:
The following code:
$fC->run_query("extension","trgsetupname","runtype","size","fileseq","owner","createtime");
will give an output similar to:
fzd::central::simulation::1042891200::1::starsink::20010913000000
fzd::central::simulation::998146800::2::starsink::20010913000000
fzd::central::simulation::1050894000::1::starsink::20010913000000
fzd::central::simulation::1040850000::2::starsink::20010913000000
fzd::central::simulation::1040590800::3::starsink::20010913000000
fzd::central::simulation::1041076800::4::starsink::20010913000000
fzd::central::simulation::950259600::5::starsink::20010913000000
fzd::central::simulation::1049792400::1::starsink::20010913000000
fzd::central::simulation::1046293200::2::starsink::20010913000000
run_query_st() : same as run_query() but the returned value is a string, each possible value '\n' separated, rather than an array. This function returns undef if the query fails.
These subroutines are database specific, and should be used rarely and only if necessary.
check_ID_for_params($kwd) : returns the database row ID from the dictionary table connected to the keyword $kwd
insert_dictionary_value($kwd) : inserts the value for keyword $kwd taken from the context into the dictionary table
insert_detector_configuration() : inserts the detector configuration from current context. This method does not require any arguments.
insert_run_param_info() : insert the run param record taking data from the current context
insert_simulation_params() : insert the simulation parameters taking data from the current context
get_current_run_param() : get the ID of a run params corresponding to the current context
get_current_detector_configuration() : gets the ID of a detector configuration described by the current context
get_current_file_data() : gets the ID of a file data corresponding to the current context
get_file_location() : based on the current Catalog context, returns the FileLocations information as an array. Each element of the array is an association keyword=value, so the returned array can be used as-is in a set_context() statement.
get_file_data() : based on the current Catalog context, returns the FileData information as an array. Each element of the array is an association keyword=value, so the returned array can be used as-is in a set_context() statement.
get_current_simulation_params() : gets the ID of the simulation params corresponding to the current context
delete_records({$doit}) : deletes the current file location. If it finds that the current file data has no file locations left, it deletes it too. The delete_records() method is based on the current context and works within a global scope (i.e. whatever has been selected with set_context() will ALL be deleted). Therefore, this function is EXTREMELY dangerous. Note that it is based on run_query(); you may therefore want to check what would be deleted by using that method first, to avoid catastrophes. The optional argument $doit is used to confirm the deletion. The default is NOT to delete. This method returns an array of deleted records (the partial information is flid, fdid, path, filename).
clone_location() : based on the most recent query, creates an instance for FileData and a copy of the FileLocations information, and returns the success status. WARNING: You should NOT call the update methods if cloning has failed; the update would work but enter inconsistent values. A later version will protect against this problem. The suggested usage is:
$fC->set_context(....); # whatever is needed
$fC->run_query("size"); # as an example
if ( $fC->clone_location() ){
    $fC->set_context("storage=NFS"); # change the storage
    ...
    $fC->insert_file_location(); # insert the new entry
}
The usage is as shown above: create a clone (i.e. a copy of the file meta-data), modify a few characteristics and create a new entry (file replication). This is handy, for example, whenever you copy a file from one place to another. Note: YOU MUST NOT try to change the FileData information; only modify keywords associated with the FileLocations table.
update_record($kwd,$newvalue{,$doit}) : modifies the data in the database tables or dictionaries. This method returns TRUE/FALSE on success/failure.
This method modifies a field set in the current context and corresponding to the given keyword $kwd to the value passed as argument $newvalue. The last, optional argument, $doit, is used to confirm ($doit=1) the update or just debug the operation ($doit=0); the default is 1.
The example below sets the storageSiteName field associated with the keyword site from its original value test to the new value test2:
$fC->set_context("site=test");
$fC->update_record("site","test2",1);
If your intent is to update not a dictionary but one of the main tables (the meta-data or the FileLocations table), you should use as specific a context as possible; otherwise, the entire range of records matching the most recent context will be updated. For example, trying to update the number of events based on a broad set_context() will probably update far more than you really want.
You should use the update_location() method for changing a file's basic characteristics (much safer).
update_location($kwd,$newvalue{,$doit}) : this method returns TRUE/FALSE on success/failure. It is used to update characteristics associated with the FileLocations table AND ONLY THAT TABLE. The keyword you are attempting to change does not need to be part of a preceding context (unlike with the update_record() method). Usage example:
$fC->set_context("site=test",
                 "storage=NFS",
                 "path=/star/data11/reco/ProductionMinBias/FullField/P02gx/2001/274",
                 "filename=st_physics_2274024_raw_0095.dummy.root");
$fC->update_location("size",200,1);
bootstrap($kwd,$delete) : database maintenance procedure. Looks at all the database entries and makes a sanity check on keyword $kwd. Possible values for the keyword $kwd are searched for in its dictionary table and compared to the entries of its parent table. If none are found, you can delete the zombie keyword value by specifying the $delete argument (0|1).
The following functions are now implemented and exported but will be merged with the above routine in a later version.
Currently implemented are bootstrap operations associated with any dictionary table, and with the following tables: FileData, FileLocations, TriggerCompositions and TriggerWords.
The following example is the equivalent of example 4.
use lib "/afs/rhic/star/packages/scripts";
use FileCatalog;
my($fC);
$fC = new FileCatalog;
$fC->connect();
$fC->set_context("library=SL01g","filetype=daq_reco_dst");
@all = $fC->run_query("path","production","filetype");
foreach $line (@all){
print "$line\n";
}
$fC->destroy(); # Delete the instance
The list of examples would be long ... Instead, we include a few script examples which we hope will guide the novice through the details of how to use the FileCatalog interface.
Example on how to update the createtime field in the database
This next example is more complete: it updates the creation time, the ownership and the protection, and also marks the file unavailable if it is not found. It does so only for local files (i.e. local disk, or NFS if the file system is attached to the node where you run this script).
The next example is a spider script that scans disks and automatically updates the database with a new entry. It relies on the fact that an entry exists for storage=HPSS and is a perfect example of the use of the clone_location() method.
FCPathSwap is a complete script handling file movement and re-registration from one location to another. Note that instead of marking the old location with available=0, in this case we delete the entry. This requires Admin bypass ...
add-file.pl is a test suite and "tutorial" for administrators using the API.
Here are the keywords that can be used in the context. There is a
color scheme to those keywords, as follows:
Keywords
in blue are currently supported by the database schema but
unused by the production scripts, and therefore are not filled (or
ill-filled).
Keywords in aqua are
automatically updated (there is no need to reset them).
Keywords
in magenta are filled, but an update may be needed (do not
rely strongly on their value).
|
keyword |
Notes |
Meaning |
|
site |
|
The site where the data is stored, eg. BNL, LBL |
|
sitecmt |
|
The site comment string |
|
siteloc |
|
A full string describing the site location in the world |
|
storage |
The storage medium, eg. HPSS, NFS, local disk. Note that the local disk storage does not allow for a unique file location. One must also select on node |
|
|
node |
The name of the node where data is stored (necessary to locate local disk storage) |
|
|
path |
The path to a specific copy of the file |
|
|
filename |
The name of the data file |
|
|
filetype |
The type of the file, e.g. "daq_reco_dst", "MC_fzd" etc ... |
|
|
extension |
The extension of the file - directly connected to type (each file type has an associated extension) |
|
|
events |
Number of events or entries in the file |
|
|
size |
The size of the data file |
|
|
fileseq |
The file sequence as determined during data taking by DAQ. Arbitrary for simulation and processed files. |
|
|
stream |
The file stream if applicable (default is 0) |
|
|
md5sum |
Early stage db fill did not update this field. It may return 0. |
The file's md5 checksum |
|
production |
The production tag with which a given file was produced. Can also be "raw" or "simulation" |
|
|
library |
The library version this file was produced with |
|
|
trgsetupname |
Used to encode the path in production |
The name of the online trigger setup |
|
trgname |
|
The name of one trigger in a collection of triggers associated to a runnumber. |
|
trgcount |
|
The event count having the associated trgname for a given runnumber |
|
trgword |
This is available for Year4 data and beyond for DAQ files |
The trigger word associated to one trigger in a collection |
|
trgversion |
The trigger word version associated to a trgname |
|
|
trgdefinition |
The trigger definition of one trigger in a collection |
|
|
runtype |
The type of the run, e.g. "physics", "laser", "pulser", "pedestal", "test", but also "simulation" for simulated datasets |
|
|
configuration |
The detector configuration name. A detector configuration is a combination of detectors that were present during data taking in a given run. Note that the combination configuration/geometry is unique (but neither of the two alone is) |
|
|
geometry |
The geometry definition for a given simulation set. |
|
|
runnumber |
The number of the run. Arbitrary for simulations. |
|
|
runcomments |
The comments for a given run. |
|
|
collision |
The collision type. Specified in the form of <first particle><second particle><collision energy>, e.g. "AuAu200" |
|
|
datetaken |
The format was corrupted during the conversion from the old to the new Catalog. It can be (and will be) recovered. |
The date the data was taken. Arbitrary for simulation. |
|
magscale |
The name of the magnetic field scale, e.g. FullField |
|
|
magvalue |
The actual magnetic field value |
|
|
filecomment |
The comment to the file. |
|
|
owner |
The owner of the file. |
|
|
protection |
Subject to changes |
The protection or read/write permissions, given in a format similar to UNIX 'ls -l' |
|
available |
Is the file available? |
|
|
persistent |
Is the file persistent? |
|
|
createtime |
Only HPSS files have a createtime which is not subject to changes |
The time a file was created. Format is YYYYmmddHHMMSS |
|
inserttime |
The time the file data was inserted into the database. |
|
|
simcomment |
The comments for the simulation |
|
|
generator |
The event generator name |
|
|
genversion |
Event generator version |
|
|
gencomment |
Event generator comments |
|
|
genparams |
Event generator params |
|
|
tpc |
was the TPC in the data stream when specific data was taken? |
|
|
svt |
was the SVT in the data stream when specific data was taken? |
|
|
tof |
was the TOF in the data stream when specific data was taken? |
|
|
emc |
was the B-EMC in the data stream when specific data was taken? |
|
|
eemc |
was the E-EMC in the data stream when specific data was taken? |
|
|
fpd |
was the FPD in the data stream when specific data was taken? |
|
|
ftpc |
was the FTPC in the data stream when specific data was taken? |
|
|
pmd |
was the PMD in the data stream when specific data was taken? |
|
|
rich |
was the RICH in the data stream when specific data was taken? |
|
|
ssd |
was the SSD in the data stream when specific data was taken? |
|
|
bbc |
was the BBC in the data stream when specific data was taken? |
|
|
bsmd |
was the Barrel EMC SMD in the data stream when specific data was taken? |
|
|
esmd |
was the End-Cap SMD in the data stream when specific data was taken? |
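Two of the keywords above, collision and createtime, have a documented string layout that scripts may want to decode. The sketch below is self-contained and independent of the FileCatalog module. The helper names `parse_collision` and `format_createtime` are hypothetical, and the collision parser rests on an assumption not stated in this document: each particle is either a capitalised symbol (such as Au) or a single lower-case letter (such as p or d).

```perl
use strict;
use warnings;

# Hypothetical helper: split a collision string such as "AuAu200" into
# (first particle, second particle, collision energy).
# Assumption: a particle is a capitalised symbol (Au, Cu, U) or a single
# lower-case letter (p, d); the energy may carry a decimal part.
sub parse_collision {
    my ($collision) = @_;
    my $species = qr/[A-Z][a-z]?|[a-z]/;
    if ($collision =~ /^($species)($species)(\d+(?:\.\d+)?)$/) {
        return ($1, $2, $3);
    }
    return;    # empty list when the string does not fit the assumed form
}

# Hypothetical helper: reformat a YYYYmmddHHMMSS createtime for display.
# Returns undef when the value does not consist of exactly 14 digits.
sub format_createtime {
    my ($stamp) = @_;
    my @parts = $stamp =~ /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/
        or return;
    return sprintf "%s-%s-%s %s:%s:%s", @parts;
}

my ($first, $second, $energy) = parse_collision("AuAu200");
print "$first + $second at $energy\n";            # prints "Au + Au at 200"
print format_createtime("20010715083005"), "\n";  # prints "2001-07-15 08:30:05"
```

Both helpers reject input they do not recognise instead of guessing, which matters because some fields (datetaken, for instance) are flagged in the Notes column as not yet reliable.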
The following keywords are for either internal use or specific management purposes. They have no meaning to users (but are unique).
|
flid |
Access the FileLocation ID of the FileLocation table |
|
fdid |
Access the FileData ID of the FileData table |
|
rfdid |
Access the FileData ID of the FileLocation table |
|
pcid |
Access the ProductionCondition ID of the ProductionConditions table |
|
rpcid |
Access the ProductionCondition ID of the FileData table |
|
rpid |
Access the runParam ID of the runParams table |
|
rrpid |
Access the runParam ID of the FileData table |
|
ftid |
Access the FileType ID of the FileTypes table |
|
rftid |
Access the FileType ID of the FileData table |
|
stid |
Access the storageType ID of the StorageTypes table |
|
rtid |
Access the storageType ID of the FileLocations table |
|
ssid |
Access the storageSite ID of the StorageSites table |
|
rssid |
Access the storageSite ID of the FileLocations table |
|
tcfdid |
Access the FileData ID of the TriggerCompositions table |
|
tctwid |
Access the TriggerWords ID of the TriggerCompositions table |
|
twid |
Access the TriggerWords ID of the TriggerWords table |
|
dcid |
Access the detectorConfiguration ID of the DetectorConfigurations table |
|
rdcid |
Access the detectorConfiguration ID of the RunParams table |
|
|
|
|
lgnm |
An aggregate keyword returning an equivalence to the logical name |
|
lgpth |
An aggregate keyword returning a logical path (a string which uniquely characterizes the file's location) |
|
fulld |
An aggregate keyword returning a string completely defining all meta-data for real data |
|
fulls |
An aggregate keyword returning a string completely defining all meta-data for simulation data |
Here are the keywords not connected to a specific field in the database. They change the behaviour of the module itself.
|
keyword |
Notes |
Meaning |
|
simulation |
Is the data a simulation? |
|
|
nounique |
In script mode, this keyword is set to 0 (i.e. unique fields), which may slow down your scripts tremendously. In the user interface get_file_list.pl, however, it is set to 1 by default (unique fields are not ensured). |
Should the module return all fields, instead of only unique selected fields? |
|
noround |
Turns off rounding of magfield and collision energy. |
|
|
startrecord |
The PERL module will skip the first startrecord records and start returning data beginning from the next one. |
|
|
limit |
The PERL module will return at most limit records. |
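The interplay of startrecord and limit described above can be illustrated on a plain Perl array. This is only a sketch of the documented behaviour (skip the first startrecord records, then return at most limit of them), not the module's actual implementation; the `page` helper and the record names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the paging behaviour: skip $start records,
# then return at most $limit of the remaining ones.
sub page {
    my ($records, $start, $limit) = @_;
    my $last = $#{$records};
    return () if $start > $last;          # nothing left to return
    my $end = $start + $limit - 1;
    $end = $last if $end > $last;         # do not run past the final record
    return @{$records}[$start .. $end];
}

my @records = map { "file_$_" } 1 .. 7;   # made-up record names

# Walk through the data in chunks of three records
for (my $start = 0; $start < @records; $start += 3) {
    my @chunk = page(\@records, $start, 3);
    print "start=$start: @chunk\n";
}
```

Advancing the start value by the limit on each pass is the same pattern as retrieving data in chunks with the -start and -limit options of get_file_list.pl.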