Ticket #10 (closed defect: fixed)

Opened 14 years ago

Last modified 13 years ago

Collective performance issue

Reported by: kazutomo Owned by: kazutomo
Priority: minor Milestone:
Component: ZeptoOS Version:
Keywords: Cc:

Description

Collective operations perform poorly on Linux, although point-to-point communication (using torus DMA) seems OK. With our MPI microbenchmark, I got only 477 MB/s on Linux versus 780 MB/s on CNK (SMP mode with I/O thread). It does not look like a noise-related issue because the difference between CNK and Linux is constant (I tested up to 1024 nodes).

To narrow it down, I wrote an SPI broadcast test program, which basically does the following:

Sender:

  • Write the packet to the collective device (memory copy)
  • Query the status to see if the packet has been sent (memory read)

Receiver:

  • Query the status until the packet arrives (memory read)
  • Read the packet from the device (memory copy)
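The sender and receiver steps above can be sketched as a single-process simulation. This is only an illustration, not the actual SPI test program: `struct fake_fifo`, `fifo_status()`, `fifo_send()`, `fifo_receive()`, the 256-byte packet size and the poll-count latency model are all hypothetical names and assumptions, standing in for the memory-mapped collective device.

```c
#include <assert.h>
#include <string.h>

#define PACKET_BYTES 256   /* assumed packet size for this sketch */

enum fifo_state { FIFO_IDLE, FIFO_INJECTED, FIFO_IN_FLIGHT, FIFO_ARRIVED };

/* Hypothetical stand-in for the memory-mapped collective device. */
struct fake_fifo {
    unsigned char payload[PACKET_BYTES];
    enum fifo_state state;
    int polls_left;     /* status reads left before the next state change */
    int send_polls;     /* modelled reads until the packet leaves the sender */
    int arrive_polls;   /* modelled reads until the packet reaches the receiver */
};

/* Status query (a memory read on the real device). Each call models one poll. */
static enum fifo_state fifo_status(struct fake_fifo *f)
{
    if (f->state == FIFO_INJECTED || f->state == FIFO_IN_FLIGHT) {
        if (f->polls_left > 0)
            f->polls_left--;
        if (f->polls_left == 0) {
            if (f->state == FIFO_INJECTED) {
                f->state = FIFO_IN_FLIGHT;
                f->polls_left = f->arrive_polls;
            } else {
                f->state = FIFO_ARRIVED;
            }
        }
    }
    return f->state;
}

/* Sender: copy the packet into the device, then poll until it has been sent.
 * Returns the number of status queries issued. */
static long fifo_send(struct fake_fifo *f, const void *pkt)
{
    long polls = 0;
    memcpy(f->payload, pkt, PACKET_BYTES);
    f->state = FIFO_INJECTED;
    f->polls_left = f->send_polls;
    do {
        polls++;
    } while (fifo_status(f) == FIFO_INJECTED);
    return polls;
}

/* Receiver: poll until the packet has arrived, then copy it out.
 * Returns the number of status queries issued. */
static long fifo_receive(struct fake_fifo *f, void *out)
{
    long polls = 0;
    /* Without a packet on its way the loop below would never terminate. */
    assert(f->state == FIFO_IN_FLIGHT || f->state == FIFO_ARRIVED);
    do {
        polls++;
    } while (fifo_status(f) != FIFO_ARRIVED);
    memcpy(out, f->payload, PACKET_BYTES);
    f->state = FIFO_IDLE;
    return polls;
}
```

In this model the value returned by `fifo_receive()` plays the role of the iteration count of the status query loop; on real hardware that count depends on the device, not on a preset field.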

I found that Linux spends more time receiving data. Specifically, the status query loop spins more on Linux. On CNK, the number of iterations is constant at around 20. Linux shows some variation, iterating roughly between 20 and 50 times.

To get more detail, I added some instrumentation code to the FIFO querying loop in the benchmark code.

start[i] = rdtsc();
do {
    CollectiveFifoStatus(0, &rechcount, &recdcount, &injhcount, &injdcount);
} while (rechcount == 0 && recdcount == 0);
elapsed[i] = rdtsc() - start[i];
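For readers without access to the device, the same measurement pattern can be sketched in portable C. This is an assumption-laden illustration rather than the benchmark code: it swaps the cycle counter for POSIX `clock_gettime(CLOCK_MONOTONIC)`, records the iteration count alongside the elapsed time, and uses a hypothetical `countdown_status()` callback in place of the real status query. `status_fn`, `poll_sample` and `timed_poll()` are names invented for this sketch.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical status callback: returns nonzero once a packet is waiting. */
typedef int (*status_fn)(void *ctx);

struct poll_sample {
    uint64_t elapsed_ns;        /* wall-clock time spent in the polling loop */
    unsigned long iterations;   /* number of status queries issued */
};

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin on the status callback until it reports a packet, timing the loop. */
static struct poll_sample timed_poll(status_fn status, void *ctx)
{
    struct poll_sample s = { 0, 0 };
    uint64_t start;

    assert(status != NULL);
    start = now_ns();
    do {
        s.iterations++;
    } while (status(ctx) == 0);
    s.elapsed_ns = now_ns() - start;
    return s;
}

/* Stand-in for the device: reports a packet after *ctx status reads. */
static int countdown_status(void *ctx)
{
    int *left = ctx;
    if (*left > 0)
        (*left)--;
    return *left == 0;
}
```

Recording the iteration count next to the elapsed time makes it possible to tell a loop that spun more often from one whose individual queries took longer.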

The result is interesting (please refer to the graph I attached). It looks like the CollectiveFifoStatus() delay is decreasing. BTW, the 30 usec spike is the Linux timer interrupt; it happens every 10 ms. I didn't see any other noise events.
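One possible way to separate such spikes from the baseline in the elapsed[] samples is to compare each sample against a multiple of the median. The sketch below is only an illustration of that idea, not the analysis used for the attached graph; `median_u64()`, `count_spikes()` and the choice of multiplier are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for uint64_t values, ascending. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (upper median for even n).
 * Returns 0 if n == 0 or if the scratch buffer cannot be allocated. */
static uint64_t median_u64(const uint64_t *v, size_t n)
{
    uint64_t *tmp;
    uint64_t m;

    assert(v != NULL || n == 0);
    if (n == 0)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return 0;
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u64);
    m = tmp[n / 2];
    free(tmp);
    return m;
}

/* Count samples larger than factor times the median.
 * The factor is an arbitrary tuning knob chosen by the caller. */
static size_t count_spikes(const uint64_t *v, size_t n, uint64_t factor)
{
    uint64_t limit = median_u64(v, n) * factor;
    size_t spikes = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (v[i] > limit)
            spikes++;
    return spikes;
}
```

If the only outliers are timer interrupts, the spike count over a run should be close to the run length divided by the 10 ms tick period; a noticeably larger count would point to other noise sources.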

It looks to me like the hardware is just slow on Linux. Tree device training might be a suspect, but the training code is in CNS.

NOTE: for some reason, NPB does not show any performance difference between Linux and CNK.

Change History

comment:1 Changed 14 years ago by kazutomo

  • Owner changed from [email protected] to kazutomo
  • Status changed from new to assigned

Still investigating.

comment:2 Changed 14 years ago by kazutomo

  • Priority changed from major to minor
  • Milestone set to Release before SC08

comment:3 Changed 13 years ago by anonymous

  • Milestone 0 V1R3 release deleted


comment:4 Changed 13 years ago by kazutomo

  • Status changed from assigned to closed
  • Resolution set to fixed

One of the reasons was that we had not implemented the I/O thread. We have now implemented it. Collective performance is comparable to CNK with V1R2 DCMF. V1R3 DCMF has optimizations for broadcast, etc. that use the torus. I'll open a new ticket for V1R3 or later DCMF support.
