Custom Query (28 matches)


Results (4 - 6 of 28)

Ticket Owner Reporter Resolution Summary
#10 kazutomo kazutomo fixed Collective performance issue

There is a performance issue with collective operations on Linux, although point-to-point communication (using torus DMA) seems OK. With our MPI microbenchmark, I got only 477 MB/s on Linux versus 780 MB/s on CNK (SMP mode with I/O thread). It does not look like a noise-related issue, because the difference between CNK and Linux is constant (I tested up to 1024 nodes).

To narrow this down, I wrote an SPI broadcast test program, which basically does the following:

Sender:
Write a packet to the collective device (memory copy)
Query the status to see whether the packet has been sent (memory read)

Receiver:
Query the status until a packet arrives (memory read)
Read the packet from the device (memory copy)

I found that Linux spends more time receiving data. Specifically, the status query loop spins more on Linux. On CNK, the number of iterations is constant at around 20. Linux shows some variation, iterating between about 20 and 50 times.
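The spin counts quoted above can be collected by counting iterations of the status loop. Here is a minimal sketch in C, assuming a hypothetical mock device (`struct mock_fifo`, `mock_fifo_status`) in place of the real collective hardware. None of these names come from the SPI; they only illustrate the counting.

```c
#include <stdint.h>

/* Hypothetical stand-in for the collective device: a packet becomes
 * visible after `polls_until_ready` status reads. Not the real SPI. */
struct mock_fifo {
    unsigned polls_until_ready;
    unsigned polls_seen;
};

/* Mimics a status query: writes the number of pending receive headers
 * into *rechcount (0 until the packet "arrives"). */
static void mock_fifo_status(struct mock_fifo *f, unsigned *rechcount)
{
    f->polls_seen++;
    *rechcount = (f->polls_seen >= f->polls_until_ready) ? 1u : 0u;
}

/* Spins on the status query until a packet is reported, giving up after
 * max_spins. Returns the number of status reads performed. */
static unsigned spin_until_arrival(struct mock_fifo *f, unsigned max_spins)
{
    unsigned rechcount = 0;
    unsigned spins = 0;
    do {
        mock_fifo_status(f, &rechcount);
        spins++;
    } while (rechcount == 0 && spins < max_spins);
    return spins;
}
```

The `max_spins` bound is a design choice for the sketch so that a lost packet cannot hang the test; the loop described in the ticket has no such bound.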

To get more detail, I added some instrumentation code to the FIFO query loop in the benchmark code.

start[i] = rdtsc();
do {
    CollectiveFifoStatus(0, &rechcount, &recdcount, &injhcount, &injdcount);
} while (rechcount == 0 && recdcount == 0);
elapsed[i] = rdtsc() - start[i];
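The per-iteration tick counts recorded this way can be reduced to a few summary numbers before plotting. The following is a sketch only: `ticks_to_usec`, `summarize_ticks`, and `struct tick_stats` are hypothetical helpers, and the tick rate is an input you must supply for your machine, since it is not stated in the ticket.

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of a set of timing samples, in timer ticks. */
struct tick_stats {
    uint64_t min;
    uint64_t max;
    double mean;
};

/* Converts timer ticks to microseconds for a given tick rate in Hz.
 * The tick rate is an assumption supplied by the caller. */
static double ticks_to_usec(uint64_t ticks, double tick_hz)
{
    return (double)ticks * 1e6 / tick_hz;
}

/* Computes min, max, and mean over n samples; n must be at least 1. */
static struct tick_stats summarize_ticks(const uint64_t *elapsed, size_t n)
{
    struct tick_stats s = { elapsed[0], elapsed[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (elapsed[i] < s.min) s.min = elapsed[i];
        if (elapsed[i] > s.max) s.max = elapsed[i];
        sum += (double)elapsed[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

For example, if the counter ticked at 850 MHz (an assumed value), `ticks_to_usec(s.max, 850e6)` would give the longest wait in microseconds, which makes spikes easy to compare against the timer period.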

The result is interesting (please refer to the graph I attached). It looks like the CollectiveFifoStatus() delay is decreasing. BTW, the 30 usec spike is the Linux timer interrupt; it happens every 10 ms. I didn't see any other noise events.
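A back-of-the-envelope check is consistent with the timer tick not being the cause: a 30 usec interruption every 10 ms takes about 0.3% of the time, while 477 MB/s is only about 61% of 780 MB/s. The arithmetic can be sketched as below; both function names are hypothetical.

```c
/* Fraction of wall-clock time consumed by a periodic interruption that
 * lasts spike_usec and recurs every period_usec. */
static double periodic_noise_fraction(double spike_usec, double period_usec)
{
    return spike_usec / period_usec;
}

/* Ratio of a measured throughput to a reference throughput
 * (both in the same unit, e.g. MB/s). */
static double throughput_ratio(double measured, double reference)
{
    return measured / reference;
}
```

With the numbers from this ticket, `periodic_noise_fraction(30.0, 10000.0)` is 0.003 and `throughput_ratio(477.0, 780.0)` is about 0.61, so the timer interrupt alone is far too small to account for the gap.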

It looks to me like the hardware is simply slower under Linux. Tree device training might be a suspect, but the training code is in CNS.

NOTE: for some reason, NPB does not show any performance difference between Linux and CNK.

#9 kazutomo kazutomo fixed CommThread support including IPI

I/O threads are not supported at present. We need to implement some kind of user-level interrupt handler in Linux.

#6 [email protected] kazutomo fixed DMA inj counter underflow

RAS events are often generated from zcb. DMA unit injection counter 0 has underflowed at torus location (0,0,0). This is likely a software error.
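As a generic illustration of what an underflow means for an unsigned byte counter, here is a small sketch in C. It assumes the counter holds a byte count that is decremented as data is injected; that is an assumption, not something the ticket states, and the functions below are hypothetical and are not the DMA SPI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if subtracting `bytes` from an unsigned byte counter would
 * wrap below zero. Generic illustration; not the DMA SPI. */
static bool counter_would_underflow(uint32_t counter, uint32_t bytes)
{
    return bytes > counter;
}

/* Decrements *counter by bytes only when that is safe.
 * Returns false (leaving *counter unchanged) on a would-be underflow. */
static bool counter_consume(uint32_t *counter, uint32_t bytes)
{
    if (counter_would_underflow(*counter, bytes))
        return false;
    *counter -= bytes;
    return true;
}
```

Under that assumption, an underflow would typically point at software that initialized the counter with too small a value or injected more data than it accounted for, which matches the "likely a software error" wording of the RAS message.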
