From: Frank Geurts (geurts_at_rice.edu)
Date: Mon Feb 24 2003 - 16:53:33 EST
I'm cc'ing this email to Tonko and Jeff too; maybe they can comment on this ...
In any case, while flying from Houston to Long Island I looked in more
detail at the DAQ log files from yesterday. Here are a couple more
symptoms of our recent problem:
--- The log files are filled w/ a mixture of time-outs, in which L2s are supposedly lost, and L2 accepts, in which an L2 is received w/o a prior associated L0. Closer inspection actually shows that we did receive the L0 but had already timed out on it, and had therefore discarded it. See a piece of last night's (Sunday Feb 23) log file below:
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out for token 816, lost a L2 Accept/Abort
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out for token 818, lost a L2 Accept/Abort
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out for token 820, lost a L2 Accept/Abort
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 822, L2 816, LAMG 0, buffer 68
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out for token 822, lost a L2 Accept/Abort
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 817, LAMG 0, buffer 70
[...]
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 818, LAMG 0, buffer 71
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 819, LAMG 0, buffer 72
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 820, LAMG 0, buffer 73
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 821, LAMG 0, buffer 74
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 822, LAMG 0, buffer 75
[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2 Accept with L0 0, L2 823, LAMG 0, buffer 76
You can clearly see that getdata times out on L0s 816, 818, 820 and then gets confused about the L2-accept with token 816, which arrives after we timed out on 3(!) L0 tokens. Next, nothing happens, i.e. we don't receive an L0 anymore (L0 is set to 0 until we've got a 'real' token), but we do again receive L2-accepts for 817 (never received the L0), 818, 819 (never received the L0), 820, 821 (never received the L0), 822, and 823 (never received the L0).
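The pattern described above can be checked mechanically. The following is a minimal Python sketch, not part of getdata itself, that scans log lines of the two shapes shown in the excerpt and separates L2-accepts whose token had already timed out from L2-accepts for which no time-out was logged. The function name `classify_accepts` and the regular expressions are assumptions made for illustration; the sample lines are copied from the excerpt.

```python
import re

# Patterns follow the two message shapes seen in the log excerpt.
TIMEOUT_RE = re.compile(r"time-out for token (\d+), lost a L2 Accept/Abort")
ACCEPT_RE = re.compile(r"L2 Accept with L0 (\d+), L2 (\d+), LAMG (\d+), buffer (\d+)")


def classify_accepts(lines):
    """Return (late, no_l0): L2 tokens that arrived after their time-out,
    and L2 tokens for which no time-out had been logged."""
    timed_out = set()
    late, no_l0 = [], []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timed_out.add(int(m.group(1)))
            continue
        m = ACCEPT_RE.search(line)
        if m:
            l2_token = int(m.group(2))
            (late if l2_token in timed_out else no_l0).append(l2_token)
    return late, no_l0


PREFIX = "[tof02 23:27:14] (getdata): WARNING: getdata.cxx "
SAMPLE = [
    PREFIX + "[line 925]: time-out for token 816, lost a L2 Accept/Abort",
    PREFIX + "[line 925]: time-out for token 818, lost a L2 Accept/Abort",
    PREFIX + "[line 925]: time-out for token 820, lost a L2 Accept/Abort",
    PREFIX + "[line 1537]: L2 Accept with L0 822, L2 816, LAMG 0, buffer 68",
    PREFIX + "[line 925]: time-out for token 822, lost a L2 Accept/Abort",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 817, LAMG 0, buffer 70",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 818, LAMG 0, buffer 71",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 819, LAMG 0, buffer 72",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 820, LAMG 0, buffer 73",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 821, LAMG 0, buffer 74",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 822, LAMG 0, buffer 75",
    PREFIX + "[line 1537]: L2 Accept with L0 0, L2 823, LAMG 0, buffer 76",
]

if __name__ == "__main__":
    late, no_l0 = classify_accepts(SAMPLE)
    print("L2 after time-out:", late)
    print("L2 without logged L0:", no_l0)
```

On these lines the even-numbered tokens end up in the "after time-out" list and the odd-numbered ones in the "without logged L0" list, which matches the reading of the log given above.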
Now, if this were a pedestal run, it would explain to some extent the fact that we lose (in this case) the odd-numbered tokens. That 'feature' is what keeps us requiring the system to be run w/ a slow detector like the TPC.
But it isn't ... it is a physics run. At least that's how it identified itself to our system. Again, we can't run in a fast free-running mode, whether that is physics or pedestal. RTS operators ought to know that, and so far tofpdaq has been operated accordingly.
A very likely scenario is that we have started to time out too fast. Remember that I, on Tonko's request, changed the daq time-out from 2s to 0.1s. This is consistent with the fact that the L2-accept comes in after more than 0.1s but within 1s (the log file's time resolution is 1s). Is the L2 algorithm too slow? Or is some other subsystem pausing the trigger/daq, making tofpdaq time out a little too quickly?
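To make the time-out argument concrete, here is a small Python sketch of the comparison between the old 2s and the new 0.1s setting. It is only a toy model and not getdata's actual logic: the function `lost_tokens` is hypothetical, and the latencies are invented values chosen to lie between 0.1s and 1s, since the log only tells us the L2s arrived within the same second.

```python
def lost_tokens(l2_latency_by_token, timeout_s):
    """Return the tokens whose L2 arrives later than timeout_s after the L0,
    i.e. the tokens a DAQ with this time-out would already have given up on."""
    return sorted(
        token
        for token, latency_s in l2_latency_by_token.items()
        if latency_s > timeout_s
    )


# Hypothetical L2 latencies in seconds (NOT measured values).
LATENCIES = {816: 0.30, 818: 0.25, 820: 0.15}

if __name__ == "__main__":
    for timeout_s in (2.0, 0.1):
        print(f"time-out {timeout_s}s -> lost tokens: {lost_tokens(LATENCIES, timeout_s)}")
```

Under these assumed latencies nothing is lost with the 2s time-out, while every token is lost with 0.1s, which is the behaviour the log would show if the L2 decision simply takes longer than 0.1s.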
I don't think we can exclude hardware failures yet since the online-QA plots seem to miss entries for scalers (c-style) 9, 10, and 11 for EVENT, PULSE and FASTCLEAR respectively. ---
Frank Geurts wrote:
> it's hard to find anything in the daq software other than lots of
> time-outs (somehow it seems to be far more than usual) ... the online QA
> plots do show something that really is weird: despite all histograms to
> look bad, the scalers histogram looks perfectly fine! Like it collected
> a lot of EVENTS only creating one or two real events.
>
> more later ....
>
> Frank Geurts wrote:
>
>> online QA looks weird ... take e.g. run 4054033 ... only 2 in the ADC
>> and TDC hit distributions? Not to mention the corrupted TOFp temp.,
>> threshold and Ramp distribution. This run lasted for 46k events ...
>> not good.
>>
>> *** why is this not picked up by the RTS shift crew member ??? Isn't
>> it his/hers _explicit_ task to check the consistency of the online QA
>> plots ???? ***
>>
>> i'll see what i can do ... people in the control room seem to be
>> having trigger problems right now.
>>
>> -f.
>>
>> W.J. Llope wrote:
>>
>>> o.k. - at the moment i can do a live comparison.
>>>
>>> star IS running. we ARE listed as "included".
>>>
>>> localmon.dat is NOT getting getting incremented w/ new evts
>>> (strobe or physics). file just sits there. it should be
>>> HUGE by now.
>>>
>>> _________________________________________________________
>>> W.J. Llope, Ph.D. Res. Assoc. Professor
>>> http://wjllope.rice.edu/default.html
>>> llope_at_physics.rice.edu
>>> T.W. Bonner Nuclear Lab. Rice University, MS-315 6100 S.
>>> Main phone: 713-348-4741 Houston, TX
>>> 77005-1892 fax: 713-348-5215