Changes between Version 6 and Version 7 of experiments

10/01/08 09:58:24 (10 years ago)

more emails


More details, which I include in their raw format:

First, an updated summary of WD_def scaling runs with the
same code as used previously (2007) on Seaborg and Franklin.
This is the same table I sent earlier, except for
 - some reordering,
 - elimination of some failed runs,
 - addition of the test with 154419 blocks.
Note that the mem/proc column shows whether a test was done
in virtual mode (0.5 GiB/proc) or dual mode (1.0 GiB/proc).

Nblocks(ini)  Nprocs   t_evo    tref  t/1mort  max#blk/proc  mem/proc[GiB]

        6067     256   567.9    47.9     0.45  29*            .5            *27 up to last step
        6067     240   598.6    51.6     0.43  30*            .5            *28 up to last step

       12083     480   636.0    57.2     1.87  29             .5

       54515    2176   613.0    66.7     4.36  29             .5

      110387    3800                           32            1.0            **failed
      110387    4096                                          .5            **failed
      110387    4096   647.1    61.9     8.61  30            1.0
      110387    4416   609.8    59.5     8.88  29             .5
      110387    8192   396.9    42.2    11.25  17             .5

      154419    6144   663.6    91.2    12.31  29             .5

      203443    8136   700.9   106.7    16.86  28(leaf:24)    .5
      203443    8192   699.9   107.7    17.25  29(leaf:23)    .5

Times are from FLASH timers, given in seconds:
t_evo   = total evolution time for 10 steps
tref    = total evolution time spent in Grid_updateRefinement
t/1mort = time spent per invocation of amr_morton_process

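To make the trend in the table easier to see, here is a minimal Python sketch that computes the share of t_evo spent in Grid_updateRefinement (tref / t_evo) for the 0.5 GiB/proc runs at roughly 25 blocks per proc. The numbers are copied from the table; the names RUNS, refinement_fraction and summarize are mine and purely illustrative.

```python
# Rows copied from the table: (Nblocks, Nprocs, t_evo [s], tref [s]).
# Only the 0.5 GiB/proc runs with about 25 blocks per proc are included.
RUNS = [
    (6067, 256, 567.9, 47.9),
    (12083, 480, 636.0, 57.2),
    (54515, 2176, 613.0, 66.7),
    (110387, 4416, 609.8, 59.5),
    (154419, 6144, 663.6, 91.2),
    (203443, 8192, 699.9, 107.7),
]


def refinement_fraction(t_evo, tref):
    """Fraction of the total evolution time spent in Grid_updateRefinement."""
    return tref / t_evo


def summarize(runs):
    """Return (Nblocks, Nprocs, blocks per proc, tref / t_evo) for each run."""
    return [
        (nblocks, nprocs, nblocks / nprocs, refinement_fraction(t_evo, tref))
        for nblocks, nprocs, t_evo, tref in runs
    ]


if __name__ == "__main__":
    for nblocks, nprocs, blk_per_proc, frac in summarize(RUNS):
        print(f"{nblocks:7d} blocks on {nprocs:5d} procs: "
              f"{blk_per_proc:5.1f} blk/proc, tref/t_evo = {frac:6.1%}")
```

With these inputs the refinement share is below 11% for all runs up to ~4k procs and rises to roughly 14% and 15% for the ~6k and ~8k runs.
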
As I noted before, these tests show relatively poor scaling
when going from ~4k to ~8k procs.  (The new ~6k test doesn't
add anything new; it fits the trend.)  Again summarizing
what I could figure out from looking at the FLASH timers:
The increase in t_evo is due mostly to increased time spent
in Grid_updateRefinement and, to a lesser degree, increased
time spent in Hydro and sourceTerms (and within those, the
time increase is mostly due to time spent in amr_guardcell

Increased time required in Grid_updateRefinement is something
we saw before on Franklin, in runs with ~800,000 blocks on
~8k procs and larger.  So on both machines, Intrepid and
Franklin, we have evidence of degraded scaling in
Grid_updateRefinement with an onset at ~8k procs.
It used to be assumed that on Franklin, this was due to
that architecture's intrinsically poor scaling of global
operations like MPI_Alltoall and MPI_Allreduce.  So it
was unexpected to find the same kind of behavior on BGP.

The timer info from Franklin and now BGP also shows that
the part of Grid_updateRefinement where scaling breaks down
is within the PARAMESH routine amr_refine_derefine,
and within that in the routine amr_morton_process.
And if it's not architecture-specific quirks in the
behavior of global MPI calls that degrade performance here,
then the algorithms used here by PARAMESH may just not
be scaling well.

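One way to look at the amr_morton_process numbers is to divide the time per invocation (t/1mort) by the total number of blocks. The sketch below does this for the 0.5 GiB/proc runs from the first table. It only restates the measured timings; it assumes nothing about how PARAMESH works internally, and the names MORTON and us_per_block are illustrative.

```python
# (Nblocks, Nprocs, t/1mort [s]) copied from the first table,
# 0.5 GiB/proc runs at roughly 25 blocks per proc.
MORTON = [
    (6067, 256, 0.45),
    (12083, 480, 1.87),
    (54515, 2176, 4.36),
    (110387, 4416, 8.88),
    (154419, 6144, 12.31),
    (203443, 8192, 17.25),
]


def us_per_block(nblocks, t_mort_s):
    """Time per amr_morton_process call divided by the global block count,
    in microseconds per block."""
    return 1.0e6 * t_mort_s / nblocks


if __name__ == "__main__":
    for nblocks, nprocs, t_mort in MORTON:
        print(f"{nblocks:7d} blocks on {nprocs:5d} procs: "
              f"{us_per_block(nblocks, t_mort):6.1f} us per block per call")
```

With these numbers the ratio stays between roughly 74 and 85 microseconds per block, except for the 12083-block run at about 155. So the time per call grows about in proportion to the total block count, even though the number of blocks per proc is held nearly fixed.
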
Kevin Olson has basically rewritten the offending parts of
PARAMESH code for PARAMESH 4.1.  We have had the trunk FLASH
code working with that version of PARAMESH (called Paramesh4dev
within the source tree) for a while, but haven't systematically
tested performance.  I decided to test whether weak scaling in
WD_def was improved by using this newer code.  And of course,
the newer code should not only scale better, but also be at
least as efficient as the previously tested version.

I made the necessary changes in several steps.  The following is
mostly narrative, but I'll show total evolution and refinement times
(the same measures t_evo and tref as above) for 203443 blks on 8192 procs
after each step, and also for 110387 blks on 4416 procs (or smaller)
where available.

1. Updated the code for testing to current trunk level.
A (surprisingly painless) Subversion merge.

Nblocks(ini)  Nprocs   t_evo    tref  t/1mort  max#blk/proc  mem/proc[GiB]

       54515    2176   760.6    44.1     4.36  29             .5
      203443    8192   834.7    83.7    17.27  29             .5

It can be seen that the code has become significantly slower.  However,
this is not really a surprise.  We already knew that the trunk code
had become less efficient (especially for WD?) at some point.

2. Removed unnecessary EOS calls from Hydro code.
Examination of timer info showed that the slowdown was
probably due to a change in the way EOS is called on guard
cells before each Hydro sweep.  After reverting the logic
back to a previous code version:

      110387    4416   567.7    35.8     8.85  29             .5
      203443    8192   650.5    83.4    17.23  29             .5

It can be seen that the test now actually runs faster than the
originally tested version.  (Not sure which changes exactly
are responsible for this improvement.  Probably careful removal
of unnecessary EOS calls in various places plays a large role.)

3. Compiled FLASH with "Paramesh4dev" instead of "Paramesh3".
(The trunk code is Paramesh4dev-ready, so adding the argument
+pm4dev to the setup command line is all it takes.)  Results:

      110387    4416   573.8    32.5        ?  29             .5
      203443    8192   633.9    55.2        ?  29             .5
(The Paramesh4dev code was not instrumented with additional
timers, thus the time per call to PM4 amr_morton_process is

Note that Grid_updateRefinement actually changes the grid only
1 time in the 110387-block simulation, but 3 times in the 203443-block
simulation. This may account for the remaining difference in tref.

A. The newer FLASH code (with the mentioned modification to Hydro)
is more efficient overall than the previously tested code; thus
presumably more efficient than the code derived from the wd_def
repository branch used in WD_def simulations up to now.

B. The newer FLASH code scales about the same as before when
using Paramesh3:  (weak scaling, comparing time increase
for 110387-block -> 203443-block tests)

    699.9 / 609.8 = 1.15   (before)
    650.5 / 567.7 = 1.15   (new)

The newer FLASH code scales better when using Paramesh4dev:

    633.9 / 573.8 = 1.10

This is only a moderate improvement; the scaling may look
even better when comparing tests with the same number of
grid-change events. (TO DO)

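The three ratios under B can be reproduced with a few lines of Python. This is only the arithmetic from the text (t_evo of the 203443-block run divided by t_evo of the 110387-block run); the names CASES and weak_scaling_ratio are illustrative.

```python
# t_evo [s] for (203443 blocks on 8192 procs, 110387 blocks on 4416 procs),
# copied from the tables in the text.
CASES = {
    "Paramesh3, before": (699.9, 609.8),
    "Paramesh3, new": (650.5, 567.7),
    "Paramesh4dev, new": (633.9, 573.8),
}


def weak_scaling_ratio(t_large, t_small):
    """Time increase when going from the smaller to the larger weak-scaling
    test; 1.0 would be ideal weak scaling."""
    return t_large / t_small


if __name__ == "__main__":
    for label, (t_large, t_small) in CASES.items():
        print(f"{label:20s} {weak_scaling_ratio(t_large, t_small):.2f}")
```
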
C. I suggest making the newer code with Paramesh4dev
the base of the scaling tests to be shown.  I will submit
runs to fill out the series.

D. The newer code - basically trunk - should become
the version used in WD simulations.

E. I suggest making the newer code with Paramesh3 the
base of estimates for CPU requirements for future WD
runs on BGP.  This suggestion is based on the assumption
that WD runs on BGP will likely be run on up to ~4k procs
but not significantly more.
In particular, use 567.7 s / 10 = ~56.8 s per time step
for the "grind time" estimate. If desired, this number can easily
be adjusted upward for the fact that in real simulations, the grid
usually changes 5 times per 10 steps (rather than 1 time as
in the 110387-block test).
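
The grind-time estimate and its upward adjustment can be sketched as follows. The cost of a single grid change is not given in the text, so it is a free parameter here (cost_per_change_s, a hypothetical name). As a crude upper bound the example attributes all of tref (35.8 s) to the one grid change in the 110387-block test. That is my assumption and likely an overestimate, since Grid_updateRefinement also spends time in calls that leave the grid unchanged.

```python
# Values from the 110387-block, 4416-proc Paramesh3 run in the text.
T_EVO_S = 567.7        # total evolution time for 10 steps [s]
TREF_S = 35.8          # time spent in Grid_updateRefinement [s]
N_STEPS = 10
CHANGES_IN_TEST = 1    # grid changes during the test
CHANGES_TYPICAL = 5    # grid changes per 10 steps in real simulations


def grind_time(t_evo_s=T_EVO_S, n_steps=N_STEPS):
    """Unadjusted wall-clock time per time step."""
    return t_evo_s / n_steps


def adjusted_grind_time(cost_per_change_s,
                        t_evo_s=T_EVO_S,
                        n_steps=N_STEPS,
                        changes_in_test=CHANGES_IN_TEST,
                        changes_typical=CHANGES_TYPICAL):
    """Time per step after adding the cost of the extra grid changes.

    cost_per_change_s is an assumed input, not a measured quantity.
    """
    extra_s = (changes_typical - changes_in_test) * cost_per_change_s
    return (t_evo_s + extra_s) / n_steps


if __name__ == "__main__":
    print(f"unadjusted:  {grind_time():.2f} s per step")
    # Upper-bound assumption: one grid change costs all of tref.
    print(f"upper bound: {adjusted_grind_time(TREF_S):.2f} s per step")
```

Under that upper-bound assumption the estimate moves from about 56.8 s to about 71.1 s per step; a measured per-change cost would give a tighter number.
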