STAR RCF Monitoring Shifts

This document was initially produced based on a version of a tutorial written for PHENIX by S. Mioduszewski and T. Chujo.
It was last modified on


IMPORTANT Note for RCF Monitoring: 

After diagnosing the problem, the shift person must decide whether an expert needs to be called during the night. If the problem does not interfere with data-taking, a CTS ticket should be filed. However, if the problem does affect data-taking, then the appropriate expert should be called.


(1) RCF Monitoring

During normal business hours (8 AM - 4 PM), an RCF staff member is always on shift. The next shift is the evening shift, 4 PM - 12 AM, which is the responsibility of an RCF operator. The owl shift, 12 AM - 8 AM, belongs to a PHENIX or STAR shift person (alternating weeks). On BNL lab holidays, PHENIX/STAR is responsible for shifts during the entire day. In the case of STAR, the online QA shift person is responsible for the RCF monitoring.

Here is the schedule of RCF shifts. When you start an offline shift, first check which RCF shifts are assigned to STAR. A hard copy of the shift schedule is also available at the counting house.

There are two things to be monitored: 1) the RCF network and 2) the HPSS status. The following sections describe how to monitor the RCF and what to do when you find RCF-related trouble. The collection of links provided by RCF staff is found here.
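As a quick illustration of the shift coverage described above, here is a minimal Python sketch. The function name and the returned labels are hypothetical. Treating each shift as a half-open interval (so 4 PM sharp belongs to the evening shift) is an assumption, and the sketch does not distinguish weekends or say which experiment holds the owl shift in a given week; the RCF shift schedule remains the authority.

```python
from datetime import time


def rcf_shift_owner(t: time, is_lab_holiday: bool = False) -> str:
    """Return who covers RCF monitoring at local time t (illustrative only)."""
    if is_lab_holiday:
        # On BNL lab holidays PHENIX/STAR covers the entire day.
        return "PHENIX/STAR shift person"
    if time(8, 0) <= t < time(16, 0):
        return "RCF staff member"          # business hours, 8 AM - 4 PM
    if t >= time(16, 0):
        return "RCF operator"              # evening shift, 4 PM - 12 AM
    return "PHENIX/STAR shift person"      # owl shift, 12 AM - 8 AM


if __name__ == "__main__":
    for hour in (3, 9, 18):
        print(hour, rcf_shift_owner(time(hour, 0)))
```

The alternating PHENIX/STAR weeks are deliberately left out, since the text does not say which week belongs to which experiment.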

The RCF monitoring shift person is required to carry a pager, which can be found together with the RCF phone list above onllinux2 in the STAR counting house. If anyone from the four experiments experiences a problem with RCF, he or she should call the appropriate RCF responsible person or page the person on shift. The shift person is not required (or qualified) to fix the problem. The main responsibility of the shift person is to diagnose the problem and decide whether to call the appropriate RCF expert or to file a CTS ticket. The shift person is also required to monitor the HPSS from the counting house. The counting house should always have a display of the HPSS monitoring screen.

1-1. Network Monitoring

The status of monitored machines can be checked from here. Figure 1 shows a snapshot of the status of monitored machines. You can check the current status of the ssh gateway servers, FTP servers, NFS servers, DNS/NIS servers and mail servers. In this status monitor, green indicates OK, yellow is a warning and red is a failure. A failure (red) on one of the RMINE (NFS server) machines does not warrant a call to the expert in the middle of the night; a CTS ticket should be filed instead. However, the RNIS (NIS server) machines and the ssh gateway machines (RSSH) are critical and DO warrant a late-night call during data-taking. Call the appropriate RCF responsible person (see below). A hard copy of the list of RCF staff home phone numbers should be available in the control room. If you cannot find this list (already provided 3 times since the beginning of the run), please send me (J.Lauret) a note ASAP and I will provide it again.

  1. Maurice Askinazi  x2159, pager: 631-441-2595, home: (see the phone list at the counting house).
  2. Shigeki Misawa     x2635, pager: NA,                  home: (see the phone list at the counting house).

  Figure 1. Status of monitored machines display.
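The triage rule for the status display can be written down as a small Python sketch. Everything named here is hypothetical: the `Status` enum, the `triage` function, and the idea of recognising a machine class by a name prefix such as "rnis01". The text only specifies the action for a red RMINE, RNIS or RSSH machine; treating green and yellow as "keep watching" and sending every other red machine to a CTS ticket are assumptions based on the general note at the top of this document.

```python
from enum import Enum


class Status(Enum):
    GREEN = "ok"
    YELLOW = "warning"
    RED = "failure"


# Machine classes that warrant a late-night call during data-taking.
# Matching them by name prefix is an assumption made for this sketch.
CRITICAL_PREFIXES = ("RNIS", "RSSH")


def triage(machine: str, status: Status, data_taking: bool) -> str:
    """Suggest an action for one entry of the status display."""
    if status is not Status.RED:
        # Assumption: green and yellow do not need to be reported.
        return "keep watching"
    if machine.upper().startswith(CRITICAL_PREFIXES) and data_taking:
        return "call RCF expert"
    # RMINE (NFS) failures, and by assumption any other red machine.
    return "file CTS ticket"


if __name__ == "__main__":
    print(triage("rnis01", Status.RED, data_taking=True))
    print(triage("rmine02", Status.RED, data_taking=True))
```

The sketch is only a memory aid; the shift person still decides whether a failure actually affects data-taking.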

Check the following two links to see the PING status of switches. 

  1. rolly.rhic.bnl.gov (on .80 subnet, as viewed from OUTSIDE the RCF firewall)
  2. ru10.rcf.bnl.gov (on .6 subnet, as viewed from INSIDE the RCF firewall)

If any of the SWITCHES (SW1 through SW9) is reported unreachable for 5 CONSECUTIVE minutes, this is a reportable network condition. Call the RCF networking responsible person.

 Terry Healy   x4199,  home : (see the phone list at the counting house).
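To make the "5 consecutive minutes" rule concrete, here is a minimal Python sketch. It assumes you have noted, for one switch, a time-ordered list of (timestamp in seconds, reachable) observations read off the PING status pages; that input format and the function name are hypothetical and are not something the RCF pages provide directly.

```python
from typing import Iterable, Tuple


def switch_is_reportable(samples: Iterable[Tuple[float, bool]],
                         window_s: float = 300.0) -> bool:
    """Return True if the switch stayed unreachable for window_s seconds.

    samples: (timestamp_seconds, reachable) pairs, sorted by timestamp.
    """
    down_since = None
    for timestamp, reachable in samples:
        if reachable:
            down_since = None              # any success resets the outage
            continue
        if down_since is None:
            down_since = timestamp         # first failure of a new outage
        if timestamp - down_since >= window_s:
            return True                    # unreachable for the full window
    return False


if __name__ == "__main__":
    outage = [(60 * minute, False) for minute in range(6)]   # 0..5 minutes
    print(switch_is_reportable(outage))
```

A single successful ping inside the window resets the count, which matches the requirement that the minutes be consecutive.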

The following links are useful to monitor the network and diagnose problems. 

1-2. HPSS Monitoring

  1. Check that the clock in the HPSS GUI is actually running. If not, HPSS is dead. Call the RCF HPSS experts immediately.
  2. Check that the "Servers" status is Normal (green). If you see "Major Problem", call the RCF HPSS experts. (Yellow is "Suspect" or "Minor"; you do not have to call.)
  3. Check that the "Space Thresholds" status is Normal (green). If you see "Major Problem" for more than 30 minutes, call the RCF HPSS experts. Otherwise, disks will fill up!
  4. Check the "Alarms and Events" status by clicking the "Monitor" menu at the top of the GUI. Watch this screen, but red does not necessarily mean that someone needs to be called. An expert should be called only if there are continual red lines.

 

  Figure 2. HPSS Health and Status monitor GUI.

  1. Rasvan Popescu   x5806, pager: 631-554-8747, home phone : (see the phone list at the counting house).
  2. John Riordan       x7201, pager: 631-554-9011,  home phone : (see the phone list at the counting house).
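The four HPSS checks can be summarised as a small decision function. This is only a sketch in Python: the `HpssSnapshot` record and its field names are hypothetical and stand for what the shift person reads off the GUI by eye; nothing here talks to HPSS. What counts as "continual" red lines is left to the shift person, so it is passed in as a plain yes/no.

```python
from dataclasses import dataclass


@dataclass
class HpssSnapshot:
    """What the shifter reads off the HPSS GUI (hypothetical record)."""
    clock_running: bool
    servers: str                  # "Normal", "Suspect", "Minor" or "Major Problem"
    space_thresholds: str         # same set of states
    space_major_minutes: float    # how long "Major Problem" has been shown
    continual_red_alarms: bool    # shifter's judgement of "Alarms and Events"


def hpss_action(s: HpssSnapshot) -> str:
    """Apply the four checks in order and suggest an action."""
    if not s.clock_running:
        return "call HPSS experts immediately"    # HPSS is dead
    if s.servers == "Major Problem":
        return "call HPSS experts"
    if s.space_thresholds == "Major Problem" and s.space_major_minutes > 30:
        return "call HPSS experts"                # disks will fill up
    if s.continual_red_alarms:
        return "call HPSS experts"
    return "keep watching"


if __name__ == "__main__":
    ok = HpssSnapshot(True, "Normal", "Normal", 0, False)
    print(hpss_action(ok))
```

The checks are ordered so that the most urgent condition, a stopped clock, is reported first.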

 

  1. Log on to onllinux2 in the STAR control room as starcrew. See the instructions taped to the monitor.
  2. Log in as "STAR_op" on "hpss.rcf.bnl.gov" using ssh by typing "ssh hpss.rcf.bnl.gov -l STAR_op" and then entering the password. (Ask J. PORTER <porter@bnl.gov> or Jérôme LAURET <jlauret@bnl.gov> for the password.)
  3. The procedure automatically starts the HPSS control. Accept the default values for user and display. When a graphical window prompts you for username and password, use user "STAR_op" and the same password as for the hpss account. TYPE the username in the first field, THEN SWITCH TO THE SECOND FIELD (using a tab or the mouse), type the password and then press "Enter". The main control will show up! Use the menus to access details of the state of the system. Check the "Monitor" menu. It is important that the clock is running in the main control GUI.
  4. To log off (IMPORTANT!!):
    - Select "Session" --> "Quit Sammi" and confirm.
    - Don't use "Log off Session".

    The terminal window that initiated the session will not close, since its process handles the X encrypted channel. Minimize it and ignore it. It will close automatically when the HPSS session is terminated.

 

1-3. Paging RCF Personnel


Instructions for CAS and CRS node monitor


Summary of RCF on-call persons (see the phone list at the STAR counting house, above onllinux2, for home phone numbers)

Subsystem       Name               Office Phone                Pager
General         RCF Operator       x5480 (4:00pm - 00:00am)    NA
RNIS and RSSH   Maurice Askinazi   x2159                       631-441-2595
                Shigeki Misawa     x2635                       631-554-3591
Network         Terry Healy        x4199                       NA
HPSS            Rasvan Popescu     x5806                       631-554-8747
                John Riordan       x7201                       631-554-9011

Useful Links for the online QA shifter