I've spent a lot of time lately improving, and trying to understand, the
Myrinet drivers. Here I show round trip performance between vxworks nodes.
First of all, I see significantly
better performance between the node "bdb" (used to be gb) and other
nodes. The myrinet card on this node is the newest board that does NOT have the
Myrinet 2 functionality. (lanai 4 is the very old sector broker boards; lanai 7
is bdb and some of the l3 nodes; lanai 9 is the new boards used by the trigger and
some of the newest detectors.) The newer nodes are running in a "backward
compatibility mode". Also, the bdb node itself is different. I still don't know if
the performance difference is due to the drivers, the electronics, or the processor itself.
Round Trip Messages
Here I show the round trip times (I don't divide by two). I don't see any latencies > 400 usec, although there is a tail: about 15 of 50000 messages
take longer than 200 usec. This tail is node dependent, and I don't see it
on the trigger nodes. I don't know its source.
Most messages take about 60-70 usec round trip, but there
is a striking peak (on a log plot) at ~150 usec. I believe this is
due to the way the myrinet control program on the NIC works. The
physical layer of myrinet is NOT "reliable" (although it has low error rates).
The way that they guarantee reliability is to send ack signals back to the
sender. However, the ack signals themselves are not reliable, so they also
put in a timeout. The timeout is handled in a strange way:
every millisecond a timer goes off, all messages that have not been
acknowledged are assumed to be bad, and they are queued for re-sending.
Because the timer signals are out of sync with the messages, it is
"normal" for messages to be interrupted this way. If you assume a
message takes 20 usec to send, and that during round trips you are
sending messages half of the time, you would expect about 1 in 100 trips to
be interrupted (20/1000 x 1/2 = 1/100). This is about the ratio of messages in the second peak to the first.
Timing while sending unidirectional messages is a bit more confusing. The distribution
of times depends very much on which node sends to which node. The tails on the
distributions are much longer than for round-trip messages. The reasons for
these observations are tied up with the internal buffering of the messages.
This is not an issue with ping-pong messages, because there is never more than
one message in the system at a time.
Before describing what is happening, it's useful to know the rates for
unidirectional sends. The sustained average time per message depends almost completely on the sending node:
l2ana01 (linux lanai9) = 9.3 usec
bdb (vxworks lanai7) = 14 usec (user task high priority)
bdb (vxworks lanai7) = 17 usec (user task low priority)
l1/ctb (vxworks lanai9) = 20 usec
pmd (vxworks lanai9 slower cpu?) = 25 usec
You also need to know that to use myriLib you have at least two tasks.
The first is the user task, which performs the message sends. The second is
the receive task, which receives all information coming from the network
card. This information consists of the arriving messages, and also the
confirmations that outgoing messages have been sent.
There is some competition for resources between these two tasks, so
I do observe some effects from inverting their priorities. I assume,
of course, that the user task blocks when not in use; if you are polling,
you don't have a choice.
Sending unidirectional messages I get better performance keeping the user
task at high priority. This ensures that the card always has messages to send.
In this plot, you see spray messages sent from the slowest node (pmd) to a
faster one (bdb). In the case where the user task has a high priority (uh), you see a
nice peak at about 25usec (the sustained rate), and no additional structure.
What happens is that the user's buffers fill up and one message is sent
at a time.
However, when the priority of the user task is lower than the receive task's,
several messages get put into the queue while the NIC is
processing the first one. Then, as messages finish getting processed, the
receive task starves the user task while it processes the "event sent"
notifications. The result is that the time distribution has a sharp
peak at ~10 usec followed by another broad peak at 40-100 usec.
I hacked up the driver to have a counter of the number of messages currently
in the system. Then I take the difference between successive message
sends to see how many messages finished being processed during each send. Here
is the result for a low priority user task. (If the user task has high priority,
the number is always 1 after the first 29 sends fill up the buffers).
This feature is even more pronounced with a faster NIC doing the sending.
Here, for the low priority user task, you see exactly the same situation, although the distribution is shifted to slower values because more events can be
processed at a time.
In addition, you see a new feature with the HIGH priority sender. In this case,
the user buffers are always completely full; however, I believe the same queueing effect occurs
with the buffers in the destination (receiving) node. Unfortunately,
I don't have any way to put hooks in to demonstrate that this leads to
the peak in the high priority messages.
Here I show the rates corrected by the number of sends that were processed during the period of the send (t / nserved). Note that the rates are slightly inflated,
because the time spent on sends during which 0 messages completed is not included in this plot.
However, you do see the point of buffering the messages in the first place -- the rate per message drops as the number of sends increases.
Finally, I show the timing distributions for the spray messages on the trigger nodes. Sending from l1 to ctb (same speed), I don't see much structure. Sending from l2 I see strong evidence of the queueing effect.
Spray messages with empty buffers
The situation in L1 and on the DSM clients is different. Here, every event, you need to do
some processing that is independent of the myrinet card, while the
myrinet card does its own work in parallel.
I have mimicked this situation (poorly) by sending 5 messages at a time and then waiting for 10 ms (the shortest wait in vxworks). This gives a good
indication of how much of the latency is due to the host CPU.
We see that the times are much more deterministic, and very much shorter.
Here we can also see the difference between the single message version
of myriMsgSend and the multiple message version. The multiple message version
saves about 10% (8 usec/msg compared to 9 usec/msg). The reason for this is just that the multiple message version saves several semaphore operations.
Simplified trigger protocol
The final test I made was a much simplified version of the L2 protocol.
Here I sent two messages from L1, first a message to CTB and then to L2.
CTB forwards the message to L2. Finally L2 returns the message to L1.
I show the round trip times.
The result is quite similar to the results from ping-pong messages, except the
times are somewhat longer (more messages) and the distribution is a bit broader, as would be expected.
I do, unfortunately, see some very long events. They don't show up in every run, because my test only runs 50000 cycles. My last plot is the same plot,
but for a run that had a long delay in it. It shows that the delayed event does
not seem to be correlated with other delayed events. It also shows that the delayed events are not frequent enough to adversely affect the overall rate.
The rate of cycles taking > 2 ms is approximately 1 in 100,000 to 200,000 cycles.
Last modified: Tue Aug 19 18:25:12 EDT 2003