WRF 4.1.2 hanging

Chris Thomas · Oct 21, 2019

I am running an ensemble with various choices of physics options. I find that for a lot of combinations wrf.exe hangs after some time. I am using 504 processors, and when the hang occurs a traceback using padb shows that 503 processors are waiting in either rsl_lite_exch_x_ or rsl_lite_exch_y_ while one processor is in module_sf_sfclayrev_mp_zolri2_. This is true every time the hang occurs. The problem is reproducible and, for a fixed set of physics combinations, always occurs at the same time step. In one particular case I have produced a restart file at a time step shortly before the issue occurs, and I can reproduce the issue from a restart at this time step. I did this so that I could track the problem down with a slow executable compiled with the -d option; however, the hang does not occur with that executable! This probably says something about the nature of the problem. I have attached namelist.input so that you know what physics options were used in this case, but the problem occurs for a number of different combinations. I could supply you with the wrfrst and other files if you wish. The restart files (d01, d02) are about 1.3G each.

kwerner · Oct 21, 2019

Hi,
Just as a test, can you try to run this particular configuration with fewer processors (maybe something around 320) and let me know if you see the same problem? You can start from the restart time.
If so, can you also attach a namelist for a physics set-up that doesn't have the problem. It would be best if the set-up was fairly similar to this one, if possible.
Thanks!

Chris Thomas · Oct 28, 2019

Hi,
At your suggestion I ran it with 336 processors from the restart time. It did not hang at the same time step, or at any other point in the simulation (2 hrs model time). I would point out though that some of the ensemble members (differing physics options) have run successfully with 1260 processors.
Are you recommending that I use fewer processors or is this test a way of tracking down the problem? With 336 processors the simulations are considerably slower. My problem is that I have walltime limits on the cluster where I am running these simulations, and the simulations are likely to be too slow using 336 processors. I am also setting up for multi-decadal runs and need to be able to use a larger number of processors.
Also a help-desk person at the cluster where I am performing these runs has found that a change in the optimization flags when compiling will also allow the run to proceed past the timestep where it has been hanging (as does compiling with the -d flag). However this reduced optimization is slower and I have the same walltime problem as outlined above.
Thanks,
Chris

kwerner · Oct 29, 2019

Chris,
I'm suggesting it because each case has a range of processor counts that will work reasonably. This is most strongly related to the size of the domain. Take a look at this FAQ that describes a rough "rule-of-thumb" for this:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082
While running with too few processors will make the runs slow, running with too many can create unreasonable results, or can simply make the model stop. The outcome can vary, depending on various settings within the namelist, especially varying physics settings. If you are getting reasonable results with the simulations using 1260 processors, then I suppose that is okay to use for those. Unfortunately you may have to use significantly fewer for some set-ups. If you aren't able to get through full runs, I would suggest creating restart files more often and starting the simulations from the restart times to complete them.
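
The rule of thumb itself is not quoted in this thread. As a rough sketch, assuming the commonly cited form of the WRF FAQ guidance (patches no larger than about 100 grid cells per side and no smaller than about 25 per side), the range can be estimated as below. The divisors 100 and 25, and the function name processor_range, are assumptions for illustration, not values confirmed in this thread. The example uses the outer domain size that is quoted later in the thread (e_we = 540, e_sn = 363).

```python
def processor_range(e_we, e_sn, max_cells_per_side=100, min_cells_per_side=25):
    """Estimate (fewest, most) MPI ranks for one domain.

    Assumes the rule of thumb is expressed as limits on the patch side
    length; both limits are assumptions, not WRF-enforced values.
    """
    fewest = (e_we / max_cells_per_side) * (e_sn / max_cells_per_side)
    most = (e_we / min_cells_per_side) * (e_sn / min_cells_per_side)
    # Round the lower bound to the nearest whole rank, truncate the upper bound.
    return max(1, round(fewest)), int(most)


fewest, most = processor_range(540, 363)
print(f"suggested range: {fewest} to {most} ranks")
```

Under these assumptions the upper bound for a 540 x 363 domain is a little over 300 ranks, which is in line with the "around 320" suggested earlier in the thread.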

Chris Thomas · Oct 29, 2019

Okay, thanks for the feedback.

* Can you elaborate a little on what you mean by 'unreasonable results'? Typically, what effects do you see when the runs complete but with too many processors? Should I be looking for artifacts in the output at a grid-scale level, or waves? Should I be comparing results to those obtained with a smaller number of processors well within the guidelines that you point to?

* My domain sizes are: e_we = 540, 616 and e_sn = 363, 501. Using 1260 processors may have been pushing things a bit too far and I had to specify nproc_x = 42 and nproc_y = 30 to avoid contravening the hard-coded limit. But with 504 processors, the automatic decomposition gives 21 processors in x and 24 in y, giving at least 25 grid cells per processor in x and 15 in y. Perhaps I should have specified the decomposition in the namelist in this case as well: 24 in x and 21 in y would have been better. But I am surprised that this can actually cause the executable to hang? Particularly since the simulations will complete with less optimized code.

* It seems odd to me that all simulations hung while in module_sf_sfclayrev_mp_zolri2_. What is the mechanism that causes the runs to hang with too few grid cells per processor? Intuitively I had thought that problems due to optimization would not be related to those caused by using too few grid cells.

davegill · Oct 31, 2019

Chris,
First, thanks for the detailed information. It allows us to eliminate problems right away.

Second, nothing that you have provided looks like it should cause any troubles. This seems like a vanilla run.
1. The nest ratio is 5:1 - fine
2. The FG domain is well inside the parent
3. The choice of physics options seems acceptable
4. If anything, the time step is overly conservative
5. Reasonable values are used for the vertical coordinate definition
6. The model lid and the number of vertical levels are consistent
7. Your minimum domain size 540x363 easily handles the 504=21x24 MPI ranks => 25x15 resultant patch
8. Other than cumulus, all of the physics is identical for d01 and d02

However, the fact that a different core count had a reproducibly different result suggests that we at least try a simple palliative approach. The different core counts, besides having different memory sizes, have different domain decompositions. The fact that you were able to use fewer cores and get reasonable results indicates that insufficient memory is not likely your problem.

Let's stay with 504 cores for your timing, but split them up a bit differently so that we end up with a different domain decomposition. The following generates individual patches that have a larger overall minimum grid cell dimension size (again, just to see if there is some unusual edge case that you keep hitting, we tweak things a bit).

Code:

&domains
 nproc_x                             = 28,
 nproc_y                             = 18,
/
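
To see why 28 x 18 gives a larger minimum patch dimension than the default 21 x 24, here is a minimal sketch of the arithmetic using the domain sizes quoted earlier in the thread. It assumes a patch side is simply the domain dimension divided by the number of ranks in that direction, rounded down. It ignores grid staggering and how WRF distributes any remainder, so the numbers are approximate. The function name min_patch_side is hypothetical.

```python
def min_patch_side(domains, nproc_x, nproc_y):
    """Smallest patch side (in grid cells) over all domains.

    domains is a list of (e_we, e_sn) pairs. Floor division is an
    approximation of how the model splits each dimension.
    """
    return min(
        min(e_we // nproc_x, e_sn // nproc_y)
        for e_we, e_sn in domains
    )


# (e_we, e_sn) for d01 and d02, as given earlier in the thread.
domains = [(540, 363), (616, 501)]

for nproc_x, nproc_y in [(21, 24), (28, 18)]:
    assert nproc_x * nproc_y == 504  # both layouts use the same 504 ranks
    side = min_patch_side(domains, nproc_x, nproc_y)
    print(f"{nproc_x} x {nproc_y}: smallest patch side is about {side} cells")
```

With these inputs the default layout has a smallest side of about 15 cells (363 split 24 ways), while 28 x 18 raises it to about 19 cells (540 split 28 ways).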

With your new version of the WRF model, to simplify your namelist, get rid of the CAM related radiation flags. Just remove all of these lines. If the WRF model needs these options because of a requested radiation scheme, the code automatically sets these for you.

Code:

&physics
 cam_abs_freq_s                      = 10800
 levsiz                              = 59
 paerlev                             = 29
 cam_abs_dim1                        = 4
 cam_abs_dim2                        = 45
/

Unlikely related to this issue ... When you do get the model runs to complete, you might want to experiment with bumping up the model time step. You are almost at half of the semi-recommended value. Is this related to stability troubles that you were seeing? If you reduced the time step to attempt to overcome the "hang", go ahead and return to your original time step.

Chris Thomas · Nov 3, 2019

Thanks for your very helpful reply.

I implemented your suggestion of setting
nproc_x = 28
nproc_y = 18
and indeed my test case does run successfully to conclusion.

Up to now, based on the documentation, my methodology has been to choose a number of cpus that factors "well", i.e. as nproc_x * nproc_y, where nproc_x ~ nproc_y. This appears to be what WRF does when it chooses a default decomposition (i.e. when nproc_x = nproc_y = -1 in the namelist). Hence up to now I would choose 24 x 21 rather than 28 x 18. It looks as though you are suggesting that instead I arrange nproc_x and nproc_y so that the resulting tile dimensions (expressed as grid cells) are not too different from one another. For non-square domains this is quite different, but intuitively makes a lot of sense. Is this a reasonable interpretation of what you have suggested, and how I should proceed in the future?
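
The heuristic described in this post (pick the factorization whose patches are closest to square in grid cells) can be sketched as a small search over all factor pairs of the rank count. This is only an illustration of the idea as stated here, not a confirmed WRF recommendation and not how WRF chooses its default decomposition. Patch sides are approximated by floor division, and the names factor_pairs and best_decomposition are hypothetical.

```python
def factor_pairs(n):
    """All (nproc_x, nproc_y) pairs with nproc_x * nproc_y == n."""
    return [(x, n // x) for x in range(1, n + 1) if n % x == 0]


def best_decomposition(total_ranks, domains):
    """Pick the pair that maximizes the smallest patch side over all domains.

    domains is a list of (e_we, e_sn) pairs. Ties, if any, resolve to the
    pair with the smaller nproc_x, because max() keeps the first maximum.
    """
    def smallest_side(pair):
        nproc_x, nproc_y = pair
        return min(
            min(e_we // nproc_x, e_sn // nproc_y)
            for e_we, e_sn in domains
        )

    return max(factor_pairs(total_ranks), key=smallest_side)


# d01 and d02 sizes from earlier in the thread.
domains = [(540, 363), (616, 501)]
print(best_decomposition(504, domains))
```

For 504 ranks and these two domains the search selects 28 x 18, the same layout suggested above, whereas 24 x 21 and 21 x 24 give smaller minimum patch sides.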

davegill · Nov 4, 2019

Chris,
As mentioned in the previous email there does not appear to be anything that your model set-up is doing incorrectly, including the default decomposition of the MPI ranks.

Because you were able to get a completed model run with a different MPI rank count, that suggested a possible sensitivity to exploit. Since you were able to successfully run with 336 cores, it was not likely that the issue was insufficient memory when running with 504 cores (using more MPI ranks implies less required memory per MPI rank). You mentioned that the code was reproducibly hanging on a communication, also pointing to a possible problem in some infrastructure code. Because of this info, it was a hunch to try a different domain decomposition (for example, as opposed to a physically based modification such as a different time step).

This is certainly only a temporary fix. Trying to figure out the underlying cause is a significant investment in time trying to determine the interactions of episodic case, domain size, nest, physics and dynamics options, compiler version, MPI flavor, optimizations, machine architecture, domain decomposition, etc. With our limited support available, we need to try to do the most that we can for the largest portion of the user community. If other user cases similar to hanging in the communications after the surface layer occur, then we will likely need to revisit this problem.
