Background and Approach
Dear Colleagues,

Last week, John Harris asked you to serve on the STAR Computing Requirements Task Force (henceforth known as the Force) to reassess the computing needs of STAR, on about a three month time scale. Some of you have responded positively, none of you has responded negatively, and I conclude that the remainder are either on vacation or wish they were, but in any case are willing to be with the Force. John asked me to chair this effort and I agreed to do so. Some of you have already inquired how we will proceed, and though all the pieces are not yet in place, I thought it would be a good idea to tell you how I think we should approach this problem, so that we can discuss it, arrive at an overall strategy, and get started.

----------------------------------------------------------------------

Background
==========

The need for a new look at this issue arose from the DOE review of the RHIC Computing Facility (RCF) at BNL in July (I was also a member of that committee). In early 1996 a RHIC-wide committee of experimentalists, chaired by Shiva Kumar and with several members of STAR participating, produced the ROCOCO-2 report on the overall requirements for RHIC computing. This report has until now served very usefully as the basis for the design of the RCF. However, the July review committee noted that ROCOCO-2 is not a detailed enough basis for further design of the RCF. The committee recommended that "the experimental computing requirements need more study, especially those requirements that have profound effects on the RCF system architecture and on the size and scope of the remote facilities". I do not want to retype all of the recommendations of the committee here (the final report will be made available to you shortly). Generally speaking, however, two of the most important design issues that need to be addressed very soon are:

(i) Central Analysis Server (CAS): what is the mix of the physics analysis to be performed? Primarily I/O intensive ("data mining") or primarily cpu intensive? This will drive the choice between a farm of commodity (i.e. cheap) Intel-based processors (cpu-intensive tasks) and an array of SMP machines (Symmetric MultiProcessor: more expensive, high I/O). What bandwidth is needed between the CAS and the Mass Data Store (MDS)? What volume of data is needed for physics analysis on disk or on tape in the MDS, as opposed to on the shelf? Note that there are actually two questions here: the general character of the analysis, which drives the design (which is scalable), and the data volumes and rates, which drive the actual capacity.

(ii) Offsite needs (I will address only the STAR issue here): we have stated that our computing needs will not be met by the baseline RCF and have requested additional funds from the DOE to establish a STAR simulations facility at NERSC. Note that there are again two questions here: will the RCF be adequate for STAR to do all its physics, and if not, where is the best place to site the additional computing capacity? We cannot answer the second question, but it is not relevant until the first has been fully addressed and documented, in detail that goes significantly beyond ROCOCO-2.

Note also that the committee felt that the Central Reconstruction Server (CRS) is well designed for its task, namely the production of DSTs from raw data. While the committee did recommend that the CRS design (and its network to the MDS) be scalable, so that it can expand to process the raw data more than once, it did not see any essential open design issues that depend upon input from the experiments right now. In other words, how much cpu time it takes to produce a STAR DST-level event from raw data, which was a main focus of the STAR estimates in ROCOCO-2 and which is crucial to our overall throughput, is not the most important design issue for the RCF right now.
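Returning for a moment to the CAS question in (i): one crude way to frame "data mining vs. cpu intensive" is to compare, for a given analysis task, the time spent fetching an event with the time spent computing on it. A toy sketch of that comparison, with entirely hypothetical numbers (none of these are measured STAR or RCF figures):

```python
# Toy characterization of an analysis task as I/O-limited or CPU-limited.
# All figures are hypothetical illustrations, not measured STAR numbers.

def character(mb_per_event, cpu_sec_per_event, io_mb_per_sec):
    """Compare the time to read one event with the time to process it."""
    io_time = mb_per_event / io_mb_per_sec  # seconds to fetch one event
    if io_time > cpu_sec_per_event:
        return "I/O-limited (data mining)"
    return "CPU-limited"

# A data-mining pass: reads a large DST event, does little computation.
print(character(mb_per_event=1.0, cpu_sec_per_event=0.01, io_mb_per_sec=10.0))

# A compute-heavy task: tiny input, long per-event processing.
print(character(mb_per_event=0.01, cpu_sec_per_event=10.0, io_mb_per_sec=10.0))
```

The point is only that the answer flips depending on per-event size, per-event cpu time, and the CAS-MDS bandwidth, which is why all three need to be estimated per analysis project.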
Rather, we must now focus in some detail on how we will actually do our physics analysis, going from DST to muDST to final physics results. (We must of course say something about the cpu and network needs for the CRS, but I think that experience with the current STAR chain, folded together with experience from currently running experiments, gives us sufficient data on this point. I do not propose any targeted new programming efforts specifically for the purpose of the current round of estimates (i.e. before Nov. 1), but opinions are welcome on this point, as on all others.)

Charge to the Task Force
========================

John's charge to us is quite comprehensive: develop computing requirements (cpu; storage on disk, tape and shelf; network bandwidths) for STAR physics for year 1 and future years, covering the full spectrum of the STAR program. This includes event reconstruction, DST and muDST analysis, and the needed levels of simulation, with special attention paid to those tasks that may require unusually high resources of one type or another. It is clear that we cannot give the ultimate answers to these questions in the coming months, and that this will be an ongoing process for at least the next few years. But we should aim to give the best possible quantitative estimates, concentrating on the areas where they are urgently needed now for RCF design decisions.

Proposed Approach
=================

Given the above background and John's charge to the Force (this metaphor is satisfyingly physical: charges and forces. It has a lot of potential...), which results directly from the recommendations of the July review committee, I propose the following procedure:

(i) Assemble a committee of STAR physicists having practical expertise in each of the general STAR physics areas. That's you. I wanted a committee of modest size, so some experts were necessarily left out. Before we are done, everyone will have a chance to give significant input, and I do not intend to exclude anyone.
Once we have a first draft we will circulate it to a larger circle of wise women and men, but if you know of anyone who wants to be, or should be, in at this level, please tell me quietly.

(ii) Clearly separate the issues of the character of analysis projects from the overall scale of cpu needs, data volume and network bandwidth. The RCF design itself separates these issues by making many elements scalable, so that if future physics requirements demand more of a certain resource, the cost to supply it is only incremental. For the purpose of our present estimates, I propose that each topical expert choose a very small number of specific analysis projects from her/his area (preferably two projects, three or four at most): one or two "garden variety" topics and one or two that are much more difficult and demanding of resources. Analyse these in great detail: number of events needed, major cpu-consuming tasks, number of expected analysis iterations of each type, data volume, all simulations needed, etc. You should use your years of accumulated experience and some WAGs (the converse of experience) to make these estimates, but each person has an as yet unspecified, but certainly limited, number of WAGs at her/his disposal.

(iii) Assumptions about running conditions: there is a bewildering variety of possibilities (Au-Au at 200 GeV or lower energy, lighter symmetric nuclear systems at top or lower energy, pA, pp), and we must choose a subset to give ourselves some focus. There is also the question of how these are mixed in an actual calendar year. For the moment I propose that we bypass this issue and discuss physics signals only for the following limited subset: (a) Au-Au at sqrt(s)=200 GeV, (b) p-Au at sqrt(s)=200 GeV, and (c) polarized pp at sqrt(s)=450 GeV. Let's take the unit of discussion to be a full year's data at one running condition, i.e. 10**7 events of a given collision system (some topics don't need this much, of course).
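To make the scale of this unit concrete, here is the kind of back-of-envelope arithmetic each topical group will need to do, sketched in Python. Every per-event size, per-event time and iteration count below is a hypothetical placeholder, not a STAR estimate; supplying the real numbers is precisely the exercise:

```python
# Back-of-envelope resource estimate for one running condition.
# All per-event figures are hypothetical placeholders, not STAR estimates.

EVENTS = 10**7  # one "unit": a full year's data at one running condition

dst_kb_per_event = 100.0   # hypothetical DST size (kB/event)
mudst_kb_per_event = 10.0  # hypothetical muDST size (kB/event)
cpu_sec_per_pass = 1.0     # hypothetical cpu time per event per analysis pass
passes = 5                 # hypothetical number of analysis iterations

dst_tb = EVENTS * dst_kb_per_event / 1e9     # kB -> TB (10^9 kB per TB)
mudst_tb = EVENTS * mudst_kb_per_event / 1e9
cpu_years = EVENTS * cpu_sec_per_pass * passes / (3600 * 24 * 365)

print(f"DST volume:   {dst_tb:.1f} TB")
print(f"muDST volume: {mudst_tb:.2f} TB")
print(f"Analysis cpu: {cpu_years:.1f} cpu-years on one reference processor")
```

Even with made-up inputs, this shows how quickly 10**7 events turns into terabytes in the MDS and cpu-years on the CAS, and why the iteration count matters as much as the per-event time.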
We can scale appropriately later, based upon input from the runtime committee, but I think there is virtue in simply getting started and writing this stuff down. What are these events (i.e. what triggers)? You tell us: assume what you need to assume, and be explicit (centrality based upon E_T, or all single jets above p_T = 10 GeV in |eta| < 0.5, or all photons above 3 GeV, or whatever). I'm not sure this specific procedure is a good one, and it's certainly something we need to decide upon soon.

(iv) Using (ii) under the assumptions of (iii), each topical group tries to work out how to scale from the specific individual topics in (ii) to the full STAR collaboration, in order to estimate the magnitude of the resources needed.

(v) Each physics topical group of one or two people begins to work now, with no further discussion among us of a common method for making an estimate of resource needs, because a method that works for hyperon physics may be inappropriate for event-by-event or dilepton analysis. The first draft from each topical group is due to me by September 15.

(vi) I try to make some order out of the mess, and then we have a free-for-all for a couple of weeks, trying to poke holes in each other's estimates. We bring it all together again, and this is our preliminary report to John on Oct. 1.

(vii) The preliminary report goes out to a larger group of wise men and women within STAR, and possibly outside of STAR, or maybe to the whole STAR collaboration, with comments due back by Oct. 15.

(viii) Two more weeks to assimilate comments, and we produce our final report by Nov. 1.

Comments
========

We will have no meetings that require travel or picking up the phone. I will try to do everything via email and the web, allowing all of us to work asynchronously. There is no alternative for me personally, since I will be at CERN from next week until the middle of November. I have asked Torre to set up a web page devoted to this task force.
It should contain links to relevant documents as well as an email archive (and there should be an email listserver), and places for drafts (holes for drafts?). Everything should be open to the collaboration. Important documents which you should look at very soon are ROCOCO-2, something on the current RCF design (transparencies from the RCF review?), the DOE RCF Review Committee Final Report, and other resource documents and revisions to ROCOCO-2 (Lanny wrote a revision to it a couple of months ago in preparation for the DOE review). Lanny and others should add to this list - please excuse my limited knowledge of the resources available. I have also asked Torre to locate, if possible, documents from e.g. BaBar to illustrate how they approached this problem in general terms (I don't know and am eager to learn). Torre should send around mail when this is ready (or tell me that I need to do something to set it up!).

Interaction with the RCF: Torre is the main contact person on this matter, and I would like him to write to us his view of the prioritized list of questions that STAR needs to answer for the RCF, and how PHENIX is going about the same process.

Units: I personally prefer GFlop-sec to SPECint95, but for no good reason other than familiarity. I propose that, especially for raw->DST production and for simulation production of a specified flavour of simulation, you simply leave this for now as a number of events. I think that Lanny is the main resource person for what is known in terms of processing times for various tasks. But as has been discussed a number of times, some of this needs to be revisited too, in light of experience with current experiments. I gave my view on the simulations issue at the collab. meeting, so I will now leave it to someone else to speak up on this matter.

This process will be "politics-free", i.e. there will be no requirement to respect previous estimates or to try to match STAR's requests for resources to those of PHENIX.
In this world of limited resources, it is imperative that we make a good estimate of what we really need to do physics.

Summary
=======

Please reply with your comments on this proposal to the entire list of recipients of this mail. Think about how to go about making estimates for your physics topic, and figure out whether the general approach makes sense (I am especially worried about event-by-event physics here). Please comment on item (iii) (assumptions about running conditions), if nothing else. In the meantime, the web page should be set up and the various reference documents made available. If I don't hear from you within a couple of weeks I will contact you personally, and if there is no large change to the above proposal I expect the first drafts by September 15.

May the Force be with you.

Regards,
Peter

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Peter Jacobs, MS 50A-1148,
Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Tel. (510)486-5413   Fax (510)486-4818
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^