This document was initially produced based on a version of a tutorial
written for PHENIX by S. Mioduszewski and T. Chujo.
It was last modified on
IMPORTANT Note for RCF Monitoring:
After diagnosing the problem, the shift person must decide whether an expert needs to be called during the night. If the problem does not interfere with data-taking, a CTS ticket should be filed. However, if the problem does affect data-taking, then the appropriate expert should be called.
(1) RCF Monitoring
During normal business hours (8 AM - 4 PM), an RCF staff member is always on shift. The next shift is the evening shift, 4 PM - 12 AM, which is the responsibility of an RCF operator. The owl shift is 12 AM - 8 AM and belongs to a PHENIX or STAR shift person (alternating weeks). On BNL lab holidays, PHENIX/STAR is responsible for shifts during the entire day. In the case of STAR, the online QA shift person is responsible for the RCF monitoring. Here is the schedule of RCF shifts. When you start an offline shift, first check the RCF shift schedule assigned to STAR. A hard copy of the shift schedule is also available at the counting house. Two things must be monitored: 1) the RCF network and 2) the HPSS status. The following sections explain how to monitor the RCF and what to do when you find an RCF-related problem. The collection of links provided by RCF staff is found here.
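The shift boundaries above can be summarized as a small lookup. This is only an illustrative sketch: the function name `shift_owner` and the `is_lab_holiday` flag are hypothetical, and weekends are not treated specially because the text does not say how they are covered. The posted RCF shift schedule remains the authoritative source.

```python
from datetime import datetime


def shift_owner(when: datetime, is_lab_holiday: bool = False) -> str:
    """Return who covers RCF monitoring at the given local (BNL) time.

    Assumption: weekends are handled like ordinary days, since the
    text only singles out BNL lab holidays.
    """
    if is_lab_holiday:
        # On lab holidays the experiments cover the entire day.
        return "PHENIX/STAR shift person"
    hour = when.hour
    if 8 <= hour < 16:
        return "RCF staff member"          # day shift, 8 AM - 4 PM
    if hour >= 16:
        return "RCF operator"              # evening shift, 4 PM - 12 AM
    return "PHENIX/STAR shift person"      # owl shift, 12 AM - 8 AM


print(shift_owner(datetime(2024, 3, 5, 2, 30)))
```

Which of PHENIX or STAR covers a given owl shift alternates weekly, so the sketch deliberately does not try to decide between them.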
The RCF monitoring shift person is required to carry a pager which can be found together with the RCF phone-list above onllinux2 in the STAR counting house. If anyone from the 4 experiments experiences a problem with RCF, he/she should call the appropriate RCF responsible person or page the person on shift. The shift person is not required (or qualified) to fix the problem. The main responsibility of the shift person is to diagnose the problem and decide whether to call the appropriate RCF expert or to file a CTS ticket. The shift person is also required to monitor the HPSS from the counting house. The counting house should always have a display of the HPSS monitoring screen.
1-1. Network Monitoring
The status of monitored machines can be checked from here. Figure 1 shows a snapshot of the status display. You can check the current status of the ssh gateway servers, FTP servers, NFS servers, DNS/NIS servers and mail servers. In this status monitor, green indicates OK, yellow is a warning and red is a failure. A failure (red) on one of the RMINE (NFS server) machines does not warrant a call to the expert in the middle of the night; a CTS ticket should be filed instead. However, the RNIS (NIS server) machines and the ssh gateway machines (RSSH) are critical and DO warrant a late-night call during data-taking. Call the appropriate RCF responsible person (see below). A hard copy of the list of RCF staff home phone numbers should be available in the control room. If you cannot find this list (already provided 3 times since the beginning of the run), please send me (J.Lauret) a note ASAP and I will provide it again.
- On-call RCF persons for RNIS and SSHGW problems
- Maurice Askinazi x2159, pager: 631-441-2595, home: (see the phone list at the counting house).
- Shigeki Misawa x2635, pager: NA, home: (see the phone list at the counting house).
Figure 1. Status of monitored machines display.
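The call-or-ticket rule for the status display can be written down as a short decision function. This is a sketch under stated assumptions, not an official procedure: the function name `triage`, the example machine names, and the idea of matching machines by name prefix are all hypothetical. The text also does not say what to do for a yellow warning or for a critical failure outside data-taking, so the return values for those cases are assumptions.

```python
# Machine families the text calls critical during data-taking.
CRITICAL_PREFIXES = ("RNIS", "RSSH")


def triage(machine: str, status: str, data_taking: bool) -> str:
    """Suggest an action for one entry on the status display."""
    status = status.lower()
    if status == "green":
        return "no action"
    if status == "yellow":
        # Assumption: the text defines yellow as a warning but
        # prescribes no action for it.
        return "keep watching"
    # Anything else is treated as red (failure).
    if machine.upper().startswith(CRITICAL_PREFIXES) and data_taking:
        return "call expert"
    # RMINE failures, and (by assumption) critical failures outside
    # data-taking, go to the ticket system.
    return "file CTS ticket"


print(triage("rnis01", "red", data_taking=True))
```

The final decision still rests with the shift person, who must judge whether the problem actually interferes with data-taking.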
Check the following two links to see the PING status of switches.
If any of the SWITCHES (SW1 through SW9) is reported unreachable for 5 CONSECUTIVE minutes, this is a reportable network condition. Call the RCF networking responsible person.
- On-call person for network problems
- Terry Healy x4199, home: (see the phone list at the counting house).
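The "5 consecutive minutes" rule can be checked mechanically if the ping results are available as one sample per minute. The sketch below assumes exactly that; the function name `reportable_switches` and the input layout (a dict mapping a switch name to a list of booleans, oldest first, `True` meaning reachable) are hypothetical and do not describe the actual RCF ping pages.

```python
def reportable_switches(ping_log, threshold=5):
    """Return the switches unreachable for `threshold` consecutive samples.

    ping_log maps a switch name to per-minute reachability samples,
    oldest first (True = reachable). One sample per minute is an
    assumption about how the data was collected.
    """
    reportable = []
    for name, samples in ping_log.items():
        run = 0  # length of the current unreachable streak
        for reachable in samples:
            run = 0 if reachable else run + 1
            if run >= threshold:
                reportable.append(name)
                break
    return reportable


log = {
    "SW1": [True, False, False, False, False, False],
    "SW2": [False, False, False, False, True, False, False, False, False],
}
print(reportable_switches(log))
```

In this example only SW1 qualifies: SW2 is unreachable for eight minutes in total, but never for five in a row.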
The following links are useful to monitor the network and diagnose problems.
1-2. HPSS Monitoring
Figure 2. HPSS Health and Status monitor GUI.
1-3. Paging RCF Personnel
Instructions for CAS and CRS node monitor
Status of the CAS machines (the analysis nodes) can be found here. The link under "CAS Operational Status" lists any CAS machines experiencing problems. Failure on one of the RCAS machines does not warrant a call to the expert in the middle of the night. A CTS ticket should be filed.
Status of the CRS machines (the reconstruction farm) can be found here. Failure on one of the CRS machines does not warrant a call to the expert in the middle of the night. A CTS ticket should be filed. A list of the reconstruction farm's latest staging failures is also available.
can be found here (CRS) and here (CAS)
Summary of RCF on-call persons (for home phone numbers, see the phone list above onllinux2 in the STAR counting house)
|General||RCF Operator||x5480 (4:00pm - 00:00am)||NA|
|RNIS and RSSH||Maurice Askinazi||x2159||631-441-2595|
Useful Links for the online QA shifter