Version 2 (modified by hartono, 15 years ago)


Experimental Results on Empirical Performance Optimizations


All results were obtained on a quad-core Intel Core 2 Quad Q6600 CPU clocked at 2.4 GHz with a 32 KB L1 data cache, 8 MB of L2 cache (4 MB shared per core pair), and 2 GB of DDR2-667 RAM, running Linux kernel version 2.6.22 (x86-64). The compiler used was ICC 10.1.

LU Decomposition

Original code:

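for (k=0; k<=N-1; k++) {
    /* scale row k to the right of the diagonal by the pivot A[k][k] */
    for (j=k+1; j<=N-1; j++)
      A[k][j] = A[k][j]/A[k][k];
    /* update the trailing submatrix */
    for (i=k+1; i<=N-1; i++)
      for (j=k+1; j<=N-1; j++)
        A[i][j] = A[i][j]-A[i][k]*A[k][j];
}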
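
For readers who want to try the kernel in isolation, the loop nest above can be wrapped in a function as in the minimal sketch below. The function name lu_inplace, the flat row-major storage, and the zero-pivot check are assumptions made for illustration and are not part of the benchmarked code. Like the original, the sketch does no pivoting. Because row k is divided by the pivot, the result holds L (including its diagonal) in the lower triangle and the strictly upper part of a unit-diagonal U above it.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU decomposition without pivoting (illustrative sketch).
 * A is an n x n matrix stored row-major in a flat array (an assumption;
 * the original kernel indexes a 2D array A[i][j]).
 * Returns 0 on success, -1 if a zero pivot is encountered. */
int lu_inplace(size_t n, double *A)
{
    assert(A != NULL);
    for (size_t k = 0; k < n; k++) {
        double pivot = A[k * n + k];
        if (pivot == 0.0)
            return -1;              /* the original code has no such check */

        /* scale row k to the right of the diagonal by the pivot */
        for (size_t j = k + 1; j < n; j++)
            A[k * n + j] /= pivot;

        /* update the trailing submatrix */
        for (size_t i = k + 1; i < n; i++)
            for (size_t j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
    }
    return 0;
}
```

For example, applying it to the 2 x 2 matrix {4, 2, 6, 5} leaves {4, 0.5, 6, 2} in the array, that is L = {{4, 0}, {6, 2}} and U = {{1, 0.5}, {0, 1}}, whose product is the input matrix.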

Sequential -- Single Core: [figure "lu.png" not available]

Parallel -- Multi-core: