Table of Contents

1 Introduction
2 PPDG Year 3 Plans
2.1 Action Items from '03 Questionnaires
2.2 File Cataloguing, Replication and Registration Projects, Storage Management
2.2.1 Deliverables
2.2.2 Milestones
2.2.3 Participants
2.2.4 Dependencies
2.3 VO Registration, Membership Handling, Grid Infrastructure
2.3.1 Deliverables
2.3.2 Milestones
2.3.3 Participants
2.3.4 Dependencies
2.4 Job Scheduling and Job Description
2.4.1 Deliverables
2.4.2 Milestones
2.4.3 Participants
2.4.4 Dependencies
2.4.5 Issues and Concerns
2.5 Database Services Integration to the Grid
2.5.1 Deliverables
2.5.2 Milestones
2.5.3 Participants
2.5.4 Issues and Concerns
3 General Concerns
4 PPDG Year 4 and 5
5 Background
5.1 Summary Table from PPDG Proposal
5.2 Evaluation and Continuing Support
The STAR experiment is one of the four Relativistic Heavy Ion Collider (RHIC) experiments at Brookhaven National Laboratory (BNL). As a Nuclear Physics experiment, STAR nonetheless records a massive amount of data: up to a PByte (10^15 bytes) is accumulated each year, representing to date an integrated total of 3 PB spanning over 3 million files. The projection for the next RHIC run (also called the Year4 run) shows an increase by a factor of five in the number of collected events. Not only managing but also processing such a vast amount of data will become more and more problematic as our Physics program evolves toward the search for rare probes.
The STAR-PPDG team therefore has an important role: to prepare and design the computing infrastructure necessary for coping with this reality. In that regard, our past years' activities have been clearly focused on the realization of a production Grid infrastructure without disrupting our main RHIC program, that is, our primary objective and mission to the DOE: delivering science. Our strategy was therefore oriented toward introducing additional (low risk) tools or providing users with stable and robust APIs that shield our scientists from the fabric details.
Over the past years, we have therefore collaborated with computer scientists, mainly the SDM group, to first deliver production-level data transfer for our two main facilities, BNL and NERSC. In full production since early 2002, file transfer alone allows us to de-localize our still-standard analysis approach over two sites. Over 20% of our data is reliably shipped between the two sites, and we are now in the consolidation phase for the tools and software from the SDM group. The use of this tool removes the mindless task that data management is, freeing manpower for the next phase of Grid tools and needs. The second year focused on the development and release of several projects whose intent was to prepare our users for running on the Grid. Among those, the legacy File Catalog is being phased out and re-designed to accommodate a replica and meta-data catalog. This made possible the emergence of new activities relying on dataset or logical collection queries rather than direct access to the physical location of the files; two projects were born of or enhanced by this work.
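To give a flavor of such dataset-level access, a query can now be expressed purely in terms of meta-data conditions rather than physical paths. The sketch below uses the catalog's command-line front-end; the keys and condition values shown are examples only:

    # Ask the catalog for a dataset defined by meta-data conditions,
    # without referencing any physical file location (values illustrative):
    get_file_list.pl -keys 'path,filename' \
        -cond 'production=P03ia,filetype=MuDst,collision=AuAu200' \
        -limit 100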
Additionally, several activities were started in collaboration with our local computer scientists from the Information and Technology Division (ITD) at BNL. We were pleased by the addition of the site-facility personnel and could start, in early 2002, a series of activities of general interest to the community.
Considering the work done and the objectives required of a running experiment, the PPDG-supported STAR team nonetheless amounts to only ~1 FTE in total, namely Gabriele Carcassi (BNL) and Eric Hjort (LBL); the core of the manpower currently arises from collaboration with computer scientists from within or outside the scope of PPDG, from additional internal manpower, and from external funding.
As outlined in the introduction, a real challenge lies ahead of the STAR collaboration for the experiment's Year4 run. The long Gold-on-Gold running period will lead to an unprecedented amount of data, which will require a drastic change in the computing model, since waiting years for the data to be analyzed is not an option. In that regard, it will be a test of the strength of our choices in adopting Grid technology. Although foreseen, our general goals include:
- Consolidation and finalization of our file or data-collection replication mechanism. To this end, we must
- Finalize our monitoring strategy and find/deploy/adopt a common set of tools suitable for most (if not all) experiments at our respective shared facilities.
- Finalize the convergence toward a site registration authority agent. Work with facility personnel to define and deploy a STAR "VO" and evaluate existing tools for VO membership handling and account mapping.
- Full migration of some workload to the Grid, two pieces of which are within our reach:
While our focus must remain on the above objectives, the recent GT3 release will ultimately lead to increased usage in the community: its attempt to generalize services through the OGSA and Web Services approach, using a potentially interoperable way to specify services (WSDL), makes it an attractive and awaited stable product we cannot ignore. While a full deployment of GT3 is seen as part of our Grid Year 4 activities and hardening, we will actively work on migrating our existing tools toward a Web Service oriented approach (WSDL/gSOAP).
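As an illustration of the direction, the fragment below sketches the abstract portion of a WSDL description for a hypothetical catalog query operation; it omits the binding and service sections, and none of the names correspond to an existing STAR interface:

    <?xml version="1.0"?>
    <definitions name="FileCatalogService"
        targetNamespace="urn:star-catalog"
        xmlns="http://schemas.xmlsoap.org/wsdl/"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="urn:star-catalog">
      <!-- Request: a meta-data condition string; response: matching files -->
      <message name="QueryRequest">
        <part name="condition" type="xsd:string"/>
      </message>
      <message name="QueryResponse">
        <part name="files" type="xsd:string"/>
      </message>
      <!-- The abstract interface a gSOAP (or other) stub is generated from -->
      <portType name="CatalogPortType">
        <operation name="queryCatalog">
          <input message="tns:QueryRequest"/>
          <output message="tns:QueryResponse"/>
        </operation>
      </portType>
    </definitions>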
We go into more detail in the following sections.
Considering past activities and our objectives, it is clear that collaboration with computer science is essential. We hope to pursue our good collaboration with the members of the SDM group, continue the work we have started on many fronts, and bring to the community the next wave of robust, well defined and polished tools. We were particularly pleased by the attention and effort this group devoted to our real-world experience and problems, and we are comfortable with the current convergence of vision and mutual benefits.
We are also seeking a more direct and intense collaboration with the Condor team, as our job submission scheme depends on it. We need immediate technical advice and must evaluate the robustness of submission through Condor-G, considering the initial target of an average of 5k jobs a day with peak activity of up to 20k jobs a day. Later, we would like to take advantage of the matchmaking mechanism Condor provides. At the last PPDG collaboration meeting, we understood this collaboration to be a granted upcoming activity, and if this is the case we surely hope to further increase the use of Condor and Condor-G throughout our facility.
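For scale, each of those submissions corresponds to one Condor-G submit description of roughly the following shape (GT2-era syntax; the gatekeeper contact, executable and file names are placeholders, not our actual configuration):

    # Condor-G submit description file (illustrative values)
    universe        = globus
    # Gatekeeper contact: host plus the jobmanager for the local batch system
    globusscheduler = stargrid.bnl.gov/jobmanager-lsf
    executable      = /star/u/user/bin/reco.csh
    arguments       = run4048004.daq
    output          = reco.$(Cluster).out
    error           = reco.$(Cluster).err
    log             = reco.log
    queue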
We are planning to intensify our collaboration with the Globus team and especially to test and provide feedback on the GT3 release, as requested. We see this activity as mutually beneficial: our feedback will hopefully lead to a faster convergence toward a stable release, and our involvement at an early stage will allow us to be ready for our planned migration from GT2 to GT3.
On a different front, with our European colleagues soon joining our grid efforts, we hope to fully start the evaluation of OGSA-DAI and the Grid Data Service (GDS). Our goals are to evaluate the existing software and the DAIS specifications from the UK e-Science programme, document the interface and service description wherever necessary, study its applicability and usability for the general US community, work on the definition and consolidation of schemas, and converge toward a fully functional Web Service oriented access to the commonly used databases: Oracle, MySQL, Postgres, etc., to name only those.
Our current file and replica catalog is fully operational at BNL but only partly operational at NERSC. At the local level, every replica made at BNL is currently registered instantaneously in our File Catalog. This is not yet the case when files are transferred over the Grid using the HRM tool. Work was started to converge toward a Replica Registration Service to address this discrepancy. This effort will further evolve toward a fully functional and robust replica management system.
We would greatly benefit from closely following the development of a Web Service for SRM and will commit to its deployment when available. In fact, such an interoperable approach would make the joining of our European collaborators easier and open new avenues.
Finally, we intend to continue to test, support and propagate the use of SRM in the community.
Through a continuous effort on the Replication and Registration Service (RRS), we hope to define and document a common specification for the replication of physical files, datasets or collections and their registration in "a" catalog. We describe below the steps necessary for both the full use of our Catalog and the finalization of the RRS. We also hope to later replace our local BNL replica registration approach with a fully deployed and integrated DRM/HRM approach, completing the work on file replication.
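As a minimal sketch of the registration step itself, the self-contained example below uses an in-memory stand-in for the real MySQL-backed File Catalog; all class and method names are ours for illustration and do not represent the RRS interface:

    import java.util.*;

    /**
     * Illustrative stand-in for the replica registration step: after an
     * HRM-managed transfer completes, the new physical copy is recorded
     * under its logical file name (LFN) so that dataset queries resolve
     * to every known copy.
     */
    public class ReplicaRegistrationSketch {
        // logical file name -> known physical copies (PFNs)
        private final Map<String, Set<String>> replicas =
            new HashMap<String, Set<String>>();

        /** Register one physical copy of a logical file; idempotent, so
         *  re-registration after a retried transfer is harmless. */
        public void register(String lfn, String pfn) {
            Set<String> copies = replicas.get(lfn);
            if (copies == null) {
                copies = new HashSet<String>();
                replicas.put(lfn, copies);
            }
            copies.add(pfn);
        }

        /** All known locations of a logical file. */
        public Set<String> locate(String lfn) {
            Set<String> copies = replicas.get(lfn);
            return (copies == null) ? Collections.<String>emptySet() : copies;
        }

        public static void main(String[] args) {
            ReplicaRegistrationSketch catalog = new ReplicaRegistrationSketch();
            String lfn = "st_physics_4048004_raw_0010.MuDst.root";
            // original BNL copy, then the copy produced by an HRM transfer
            catalog.register(lfn, "hpss://hpss.bnl.gov/star/reco/" + lfn);
            catalog.register(lfn, "hpss://hpss.nersc.gov/star/reco/" + lfn);
            System.out.println(catalog.locate(lfn));
        }
    }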
Early participation in the Web Service for SRM activity would allow us to further extend the use of this tool across borders.
The main participants will be Eric Hjort, Alex Sim (SDM group) and Jerome Lauret. The total PPDG-funded FTE-equivalent is projected to be
Further internal manpower for deployment at another facility will be provided through external funding.
This work is self-contained, requiring only cooperation from the SDM team for the RRS/HRM interaction. The job scheduling program relies heavily on an accurate cataloguing of our data collections, and therefore on this project's success.
Currently, the STAR grid infrastructure is handled by having users request certificates from the DOE Science Grid.
The Grid activities at BNL are currently not well centralized. In particular, the infrastructure does not include a common strategy for error reporting, diagnostics, help desk, or consistent software deployment. Furthermore, the Grid activities do not connect well with the internal security policies in effect at BNL. As an example, DOE cyber-security policies require a centralization of user account information, which is completely disconnected from any VO and certificate information.
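To make the disconnect concrete: at the Globus layer, the certificate-to-account mapping is a flat grid-mapfile, maintained independently of the laboratory's central account database. The entry below shows the standard format, with a made-up DN and account name:

    # /etc/grid-security/grid-mapfile -- one entry per authorized subject
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456"  starusr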
Since our team also includes members of the ITD personnel, we hope to drive an effort toward the consolidation of the local infrastructure, including the delocalization of the registration authority agent for BNL, a stronger participation of the ITD security personnel in Grid activities, the deployment of a common set of tools for bug tracking, problem reports, mailing lists and help desks, and continuous support for Grid tutorials and lectures to the community at large.
We commit the same emphasis and effort to converging on the deployment of a common set of tools for monitoring the fabric.
We identify three distinct projects, which we will tag in our milestones as
[P22] will be staffed from within the STAR-specific available manpower in conjunction with the PDSF personnel. A total of 3 FTE-months will be required for documentation of the VO, evaluation of user management tools, deployment and consolidation.
The two other projects will involve the local information & technology teams, with guidance from the experiments (STAR, Phenix and Atlas).
Some of the activities will be part of the DOESG VO Management project.
Note: An agreement with experiments on this project has not been made to date and this section will have to be reshaped as the situation clarifies.
The STAR team will focus on the consolidation and further development of the job scheduling user interface in the coming year. We are planning to proceed in steps, with clear milestones.
Our main objective is to provide a fully functional tool and wrapper for Grid job submission by further developing the STAR Scheduler [W-GJS], defining and implementing a logger utility at all levels of the infrastructure [L-GJS], and documenting the requirements for, and developing, a high-level user job description language schema (U-JDL) [U-JDL]. We would also transition our current job scheduling tool toward a Web-Service based approach and deploy the service to one or more site facilities.
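For context, the current Scheduler already consumes an XML job description of roughly the following shape, which is the natural starting point for the U-JDL work; the attribute values here are illustrative only:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- One user job; the scheduler splits the input over processes and
         points $FILELIST at each process's generated file list. -->
    <job maxFilesPerProcess="500">
      <command>root4star -q -b doEvents.C\(\"$FILELIST\"\)</command>
      <stdout URL="file:/star/u/user/logs/job.out"/>
      <input URL="catalog:star.bnl.gov?production=P03ia,filetype=MuDst"
             nFiles="all"/>
    </job>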
The U-JDL definition will require 3-5 FTE-months (depending on participation from other teams). This work will be done in collaboration with the JLab team, identified as one of the main collaborating teams.
The first part, consolidation of the current Scheduler functionalities, will require 1 FTE-month. The Phenix collaboration has expressed interest in evaluating our current tool and investigating the integration of their replica catalog; since the Scheduler is organized as Java classes, we do not expect this to be a problem. If this comes to realization, we will provide technical assistance and guidance through our conceptual design.
An (optimistic) 3 FTE-months will be needed for the deployment, stress and regression testing of the local Condor-G implementation and for reaching MC-level testing.
The migration to a Web-Service based client, planned for early 2004, will take 50% of a total of 6 FTE-months. Consolidation of the local submission through Condor-G, assistance with the MC production, and further development of the user-level job submission will account for the other half of the time.
The logging implementation will require either a collaborative effort or internal manpower. We believe this task can be accomplished with a 4 FTE-month equivalent.
There is a dire need for increased support from the Condor team; we will also need a better communication mechanism for reporting problems (if any) to the Globus team.
Our stated objective of 20k job submissions a day may stress the current Globus infrastructure in ways not observed before. We can only repeat and stress one essential point: support and collaboration from the Condor and/or Globus teams is essential and may very well be the key to our success.
This activity relies heavily on the convergence and rapid hardening of tools and infrastructure beyond our control. Here again, good communication with the Globus team is essential to our success.
This activity alone will require an 18 FTE-month equivalent.
Document the specifications and requirements for the gridification of a database solution. Implement within MySQL as an example and bring the GSI-enabled MySQL work back to the community at large.
Evaluate the usability of OGSA-DAI release 2 (or later) and the DAIS specification, and work on the definition and/or consolidation of schemas, converging toward a Web Service oriented access to databases.
We will seek funding for this project and provide internal manpower and expertise, as well as a test bed for applying it to Catalog connections and conditions database access. 6 to 10 FTE-months equivalent will be necessary to bring this project to completion.
This work is needed prior to running real-life data mining on the Grid. The timescale being short, we will require assistance from the PIs and executive committee members to shape such a proposal. The schedule, voluntarily fuzzy and volatile, will depend on funding and/or manpower availability.
Currently helped by PPDG funding at the level of 1 FTE-year, our plan requires 47 FTE-months equivalent. Only 12 months are covered by direct funding; the rest is planned to be covered by collaborative efforts with other teams (SDM, Condor, JLab), internal manpower drawn at great cost from within the experiment, and external funding. The risk of failure is high, while the need for a transition to a Grid computing model is within a year. This situation will ultimately lead to a dilemma.
For all projects, we would like PPDG to concentrate its efforts on a strategy for end-to-end diagnostics, resource management policies at all levels, and error recovery. These areas cannot be addressed on a short time scale but are nevertheless necessary for the consolidation of our production Grid infrastructures.
The STAR team will concentrate its efforts on logging utilities.
This is the summary table from the PPDG proposal. The full proposal is available at http://lbnl2.ppdg.net/docs/scidac01_ppdg_public.doc
Project Activity | Experiments | Yr1 | Yr2 | Yr3
-----------------------------------------------------------------------------
CS-1 Job Description Language – definition of job processing requirements and policies, file placement & replication in distributed system | | | |
P1-1 Job Description Formal Language | D0, CMS | X | |
P1-2 Deployment of Job and Production Computing Control | CMS | X | |
P1-3 Deployment of Job and Production Computing Control | ATLAS, BaBar, STAR | | X |
P1-4 Extensions to support object collections, event level access etc. | All | | | X
CS-2 Job Scheduling and Management – job processing, data placement, resource discovery and optimization over the Grid | | | |
P2-1 Pre-production work on distributed job management and job placement optimization techniques | BaBar, CMS, D0 | X | |
P2-2 Remote job submission and management of production computing activities | ATLAS, CMS, STAR, JLab | | X |
P2-3 Production tests of network resource discovery and scheduling | BaBar | | X |
P2-4 Distributed data management and enhanced resource discovery and optimization | ATLAS, BaBar | | | X
P2-5 Support for object collections and event level data access; enhanced data re-clustering and re-streaming services | CMS, D0 | | | X
CS-3 Monitoring and Status Reporting | | | |
P3-1 Monitoring and status reporting for initial production deployment | ATLAS | X | |
P3-2 Monitoring and status reporting – including resource availability, quotas, priorities, cost estimation etc. | CMS, D0, JLab | X | X |
P3-3 Fully integrated monitoring and availability of information to job control and management | All | | X | X
CS-4 Storage resource management | | | |
P4-1 HRM extensions and integration for local storage system | ATLAS, JLab, STAR | X | |
P4-2 HRM integration with HPSS, Enstore, Castor using GDMP | CMS | X | |
P4-2 Storage resource discovery and scheduling | BaBar, CMS | | X |
P4-3 Enhanced resource discovery and scheduling | All | | | X
CS-5 Reliable replica management services | | | |
P5-1 Deploy Globus Replica Catalog services in production | BaBar, JLab | X | |
P5-2 Distributed file and replica catalogs between a few sites | ATLAS, CMS, STAR, JLab | X | |
P5-3 Enhanced replication services including cache management | ATLAS, CMS | | X |
CS-6 File transfer services | | | |
P6-1 Reliable file transfer | ATLAS, BaBar, CMS, STAR, JLab | X | |
P6-2 Enhanced data transfer and replication services | ATLAS, BaBar, CMS, STAR, JLab | | X |
CS-7 Collect and document current experiment practices and potential generalizations | All | X | X | X
Also from the proposal. Could you please comment on these items from your team's perspective:
“During the last phase of PPDG we will: