Re: problem


From: Frank Geurts (geurts_at_rice.edu)
Date: Tue Feb 25 2003 - 02:15:17 EST


Jeff -- it is kind of hard to tell from the log file whether I see this
behaviour change after 10,000 events ... do you want me to count the
tokens :)

Clearly Tonko's observation
(http://www.star.bnl.gov/HyperNews-star/get/startrig/2088.html) affected
the TOF detector big time since it times out after 0.1s.

Also, could those people responsible for this *please* communicate their
actions at least to the shiftcrew ... i called them yesterday a couple
of times asking about this and they were not aware of anything. (Not to
mention the fact that we spent quite some time debugging our system
trying to find the problem)

-frank

Jeff Landgraf wrote:
> This was near the start of the run. We think they are due to the l2
> write, which sometimes slows down l2. Do you ever see this after 10,000
> events into the run? (about 5 minutes into the run?). The slow accepts
> also play havoc with the other detectors and with DAQ.
>
> -jeff
>
> On Mon, 24 Feb 2003, Frank Geurts wrote:
>
>
>>I 'cc' this email to Tonko and Jeff too; maybe they can comment on this ...
>>
>>In any case, while flying from Houston to Long Island I looked in more
>>detail at yesterday's DAQ log files. Here are a couple more
>>symptoms of our recent problem:
>>
>>---
>>The log files are filled w/ a mixture of time-outs in which L2s are
>>supposedly lost and L2 accepts in which an L2 is received w/o a prior
>>associated L0. Closer inspection actually does show that we received the
>>L0 but had already timed out on it, therefore discarding it. See a piece
>>of last night's (Sunday Feb 23) log file below:
>>
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out
>>for token 816, lost a L2 Accept/Abort
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out
>>for token 818, lost a L2 Accept/Abort
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out
>>for token 820, lost a L2 Accept/Abort
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 822, L2 816, LAMG 0, buffer 68
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 925]: time-out
>>for token 822, lost a L2 Accept/Abort
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 817, LAMG 0, buffer 70
>>
>>[...]
>>
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 818, LAMG 0, buffer 71
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 819, LAMG 0, buffer 72
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 820, LAMG 0, buffer 73
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 821, LAMG 0, buffer 74
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 822, LAMG 0, buffer 75
>>[tof02 23:27:14] (getdata): WARNING: getdata.cxx [line 1537]: L2
>>Accept with L0 0, L2 823, LAMG 0, buffer 76
>>
>>You can clearly see that getdata times out on L0s 816, 818, 820 and then
>>gets confused about the L2-accept with token 816, which arrives after we
>>timed out on 3(!) L0 tokens. Next nothing happens, i.e. we don't receive
>>an L0 anymore (L0 is set to 0 until we've got a 'real' token) but again
>>do receive L2-accepts for 817 (never received the L0), 818, 819 (never
>>received the L0), 820, 821 (never received the L0), 822, and 823 (never
>>received the L0).
>>
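The bookkeeping described above can be checked mechanically. Below is a minimal Python sketch, assuming the wrapped log entries have been rejoined onto single lines and the quote markers stripped. The function name `classify_accepts` and the regular expression are illustrative only and are not part of getdata.cxx. It sorts every L2 accept into "we had already timed out on that token" or "no time-out logged for that token" (the latter being the tokens read above as never having received an L0).

```python
import re

# Matches either a time-out warning or an L2-accept warning, in log order.
EVENT_RE = re.compile(
    r"time-out\s+for\s+token\s+(?P<timeout>\d+)"
    r"|L2\s+Accept\s+with\s+L0\s+(?P<l0>\d+),\s+L2\s+(?P<l2>\d+)"
)

def classify_accepts(log_text):
    """Return (late, no_timeout): L2 tokens accepted after a time-out on the
    same token, and L2 tokens accepted with no earlier time-out logged."""
    timed_out = set()
    late, no_timeout = [], []
    for m in EVENT_RE.finditer(log_text):
        if m.group("timeout") is not None:
            timed_out.add(int(m.group("timeout")))
        else:
            l2 = int(m.group("l2"))
            (late if l2 in timed_out else no_timeout).append(l2)
    return late, no_timeout

# The excerpt quoted above, with each entry rejoined onto one line.
P = "[tof02 23:27:14] (getdata): WARNING: getdata.cxx "
SAMPLE = "\n".join([
    P + "[line 925]: time-out for token 816, lost a L2 Accept/Abort",
    P + "[line 925]: time-out for token 818, lost a L2 Accept/Abort",
    P + "[line 925]: time-out for token 820, lost a L2 Accept/Abort",
    P + "[line 1537]: L2 Accept with L0 822, L2 816, LAMG 0, buffer 68",
    P + "[line 925]: time-out for token 822, lost a L2 Accept/Abort",
    P + "[line 1537]: L2 Accept with L0 0, L2 817, LAMG 0, buffer 70",
    P + "[line 1537]: L2 Accept with L0 0, L2 818, LAMG 0, buffer 71",
    P + "[line 1537]: L2 Accept with L0 0, L2 819, LAMG 0, buffer 72",
    P + "[line 1537]: L2 Accept with L0 0, L2 820, LAMG 0, buffer 73",
    P + "[line 1537]: L2 Accept with L0 0, L2 821, LAMG 0, buffer 74",
    P + "[line 1537]: L2 Accept with L0 0, L2 822, LAMG 0, buffer 75",
    P + "[line 1537]: L2 Accept with L0 0, L2 823, LAMG 0, buffer 76",
])

late, no_timeout = classify_accepts(SAMPLE)
print("accepted after time-out:", late)
print("no time-out logged:     ", no_timeout)
```

On this excerpt the even tokens (816 to 822) land in the first list and the odd ones (817 to 823) in the second, which is the same split as in the reading above. The lines elided with [...] in the excerpt are not included in the sample.
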
>>Now, if this was a pedestal run it would explain to some extent the fact
>>that we lose (in this case) the odd-numbered tokens. That 'feature' is
>>what keeps us requiring the system to be run w/ a slow detector like the
>>TPC.
>>
>>But it isn't ... it is a physics run. At least that's how it identified
>>itself to our system. Again, we can't run in a fast free running mode
>>whether that is physics or pedestal. RTS operators ought to know that
>>and so far tofpdaq has been operated accordingly.
>>
>>A very likely scenario might be that we have started to time out too
>>fast. Remember that I, on Tonko's request, changed the daq time-out from
>>2s to 0.1s. The log shows that the L2-accept comes in after more than
>>0.1s but within 1s (the log files' time resolution is 1s). Is the L2
>>algorithm too slow? Or is some other subsystem pausing the trigger/daq,
>>making tofpdaq time out a little too quickly?
>>
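To make the time-out argument concrete, here is a toy Python sketch of the race between the readout time-out and the L2 decision. It assumes a deliberately simplified model in which an accept counts as late whenever the L2 latency exceeds the time-out. The helper `late_accepts` and the latency values are made up for illustration; they are not measured numbers and not tofpdaq code.

```python
def late_accepts(l2_latencies_s, timeout_s):
    """Return the L2 latencies that exceed the readout time-out, i.e. the
    accepts that would arrive after the L0 had already been discarded."""
    return [t for t in l2_latencies_s if t > timeout_s]

# Hypothetical L2 decision latencies in seconds (illustrative, not measured).
latencies = [0.02, 0.05, 0.30, 0.80]

# Compare the old 2s time-out with the new 0.1s one.
for timeout in (2.0, 0.1):
    late = late_accepts(latencies, timeout)
    print(f"timeout {timeout:>4}s: {len(late)} of {len(latencies)} accepts arrive late")
```

In this model any L2 latency between 0.1s and 2s is harmless with the old setting and turns into a time-out followed by an unmatched L2 accept with the new one, which is the pattern in the log excerpt.
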
>>I don't think we can exclude hardware failures yet since the online-QA
>>plots seem to miss entries for scalers (c-style) 9, 10, and 11 for
>>EVENT, PULSE and FASTCLEAR respectively.
>>---
>>
>>
>>
>>
>>
>>Frank Geurts wrote:
>>
>>>it's hard to find anything in the daq software other than lots of
>>>time-outs (somehow it seems to be far more than usual) ... the online QA
>>>plots do show something that really is weird: despite all the other
>>>histograms looking bad, the scalers histogram looks perfectly fine! Like
>>>it collected a lot of EVENTS while only creating one or two real events.
>>>
>>>more later ....
>>>
>>>
>>>
>>>
>>>
>>>Frank Geurts wrote:
>>>
>>>
>>>>online QA looks weird ... take e.g. run 4054033 ... only 2 in the ADC
>>>>and TDC hit distributions? Not to mention the corrupted TOFp temp.,
>>>>threshold and Ramp distribution. This run lasted for 46k events ...
>>>>not good.
>>>>
>>>> *** why is this not picked up by the RTS shift crew member ??? Isn't
>>>>it his/her _explicit_ task to check the consistency of the online QA
>>>>plots ???? ***
>>>>
>>>>
>>>>i'll see what i can do ... people in the control room seem to be
>>>>having trigger problems right now.
>>>>
>>>>-f.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>W.J. Llope wrote:
>>>>
>>>>
>>>>>o.k. - at the moment i can do a live comparison.
>>>>>
>>>>>star IS running. we ARE listed as "included".
>>>>>
>>>>>localmon.dat is NOT getting incremented w/ new evts
>>>>>(strobe or physics). file just sits there. it should be
>>>>>HUGE by now.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>_________________________________________________________
>>>>>W.J. Llope, Ph.D.          Res. Assoc. Professor
>>>>>http://wjllope.rice.edu/default.html
>>>>>llope_at_physics.rice.edu
>>>>>T.W. Bonner Nuclear Lab.   Rice University, MS-315
>>>>>6100 S. Main               phone: 713-348-4741
>>>>>Houston, TX 77005-1892     fax: 713-348-5215
>>>>>
>>>>>
>>>>
>>
>



This archive was generated by hypermail 2.1.4 : Thu Jul 24 2003 - 00:39:36 EDT