Ticket #10 (closed defect: fixed)
Collective performance issue
Reported by: | kazutomo | Owned by: | kazutomo |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | ZeptoOS | Version: | |
Keywords: | Cc: |
Description
There is a performance issue with collective operations on Linux, although point-to-point communication (using torus DMA) seems OK. With our MPI microbenchmark, I got only 477 MB/s on Linux versus 780 MB/s on CNK (SMP mode with I/O thread). It does not look like a noise-related issue because the difference between CNK and Linux is constant (I tested up to 1024 nodes).
To narrow it down, I wrote an SPI broadcast test program, which basically does the following:
Sender:
- Write the packet to the collective device (memory copy)
- Query the status to see if the packet has been sent (memory read)
Receiver:
- Query the status until the packet arrives (memory read)
- Read the packet from the device (memory copy)
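The sender and receiver steps above can be sketched against a simulated device. This is only an illustration of the copy-then-poll pattern, not the real SPI: `mock_device`, `mock_send`, `mock_rx_ready`, `mock_recv`, and `PACKET_BYTES` are all hypothetical names, and the "latency" is modeled as a fixed number of status reads rather than anything measured on hardware.

```c
#include <assert.h>
#include <string.h>

#define PACKET_BYTES 256  /* assumed payload size for this sketch */

/* Hypothetical stand-in for the collective device; not the real SPI. */
typedef struct {
    unsigned char fifo[PACKET_BYTES];
    int polls_until_visible;  /* simulated latency, counted in status reads */
    int pending;              /* 1 once a packet has been injected */
} mock_device;

/* Sender: copy the packet into the device (memory copy). */
void mock_send(mock_device *dev, const void *pkt, int latency_polls) {
    memcpy(dev->fifo, pkt, PACKET_BYTES);
    dev->polls_until_visible = latency_polls;
    dev->pending = 1;
}

/* Status read: returns 1 once the packet is visible to the receiver. */
int mock_rx_ready(mock_device *dev) {
    if (!dev->pending)
        return 0;
    if (dev->polls_until_visible > 0) {
        dev->polls_until_visible--;
        return 0;
    }
    return 1;
}

/* Receiver: spin on the status, then copy the packet out.
 * Returns the number of status reads the spin loop performed. */
long mock_recv(mock_device *dev, void *out) {
    long iters = 0;
    assert(dev->pending);  /* without a packet in flight the loop never ends */
    do {
        iters++;
    } while (!mock_rx_ready(dev));
    memcpy(out, dev->fifo, PACKET_BYTES);
    dev->pending = 0;
    return iters;
}
```

Calling `mock_send(&dev, pkt, 20)` followed by `mock_recv(&dev, out)` returns the spin count for that packet, which is the quantity compared between CNK and Linux in this ticket.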
I found that Linux spends more time receiving data. Specifically, the status query loop spins more on Linux. On CNK, the number of iterations is constant at around 20. Linux shows some variation, iterating between 20 and 50 times.
To get more detail, I added some instrumentation code to the FIFO querying loop in the benchmark code.
    start[i] = rdtsc();
    do {
        CollectiveFifoStatus(0, &rechcount, &recdcount, &injhcount, &injdcount);
    } while (rechcount == 0 && recdcount == 0);
    elapsed[i] = rdtsc() - start[i];
The result is interesting (please refer to the graph I attached). It looks like the CollectiveFifoStatus() delay is decreasing. BTW, the 30 usec spike is the Linux timer interrupt; it happens every 10 ms. I didn't see any other noise events.
It looks to me like the hardware is just slow on Linux. Tree device training might be a suspect, but the training code is in CNS.
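One way to separate the timer-interrupt spikes from the baseline polling delay in the recorded `elapsed[]` samples is sketched below. It is a minimal post-processing helper under stated assumptions, not part of the original benchmark: `poll_stats`, `summarize_polls`, and the `spike_ticks` threshold are hypothetical, and the threshold would need to be chosen from the tick rate of whatever counter `rdtsc()` reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one run of the instrumented polling loop. */
typedef struct {
    uint64_t min;     /* fastest sample, in ticks */
    uint64_t max;     /* slowest sample, in ticks */
    double   mean;    /* mean of the samples at or below the threshold */
    size_t   spikes;  /* number of samples above the threshold */
} poll_stats;

/* Summarize n samples. Samples above spike_ticks are counted as spikes
 * (e.g. timer interrupts) and excluded from the mean. */
poll_stats summarize_polls(const uint64_t *elapsed, size_t n,
                           uint64_t spike_ticks) {
    assert(n > 0);
    poll_stats s = { elapsed[0], elapsed[0], 0.0, 0 };
    uint64_t sum = 0;
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (elapsed[i] < s.min) s.min = elapsed[i];
        if (elapsed[i] > s.max) s.max = elapsed[i];
        if (elapsed[i] > spike_ticks) {
            s.spikes++;
        } else {
            sum += elapsed[i];
            kept++;
        }
    }
    if (kept > 0)
        s.mean = (double)sum / (double)kept;
    return s;
}
```

Comparing the spike-free mean between CNK and Linux runs would show whether the baseline status-query delay differs, independent of the periodic interrupt.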
NOTE: for some reason, NPB does not show any performance difference between Linux and CNK.
Change History
comment:1 Changed 14 years ago by kazutomo
- Owner changed from [email protected]… to kazutomo
- Status changed from new to assigned
comment:2 Changed 14 years ago by kazutomo
- Priority changed from major to minor
- Milestone set to Release before SC08
comment:3 Changed 14 years ago by anonymous
- Milestone 0 V1R3 release deleted
comment:4 Changed 14 years ago by kazutomo
- Status changed from assigned to closed
- Resolution set to fixed
One of the reasons was that we hadn't implemented the I/O thread. We have now implemented it. Collective performance is comparable to CNK with V1R2 DCMF. V1R3 DCMF has optimizations for broadcast, etc. using the torus. I'll open a new ticket for V1R3 or later DCMF support.
still investigating.