Dear forum,
I have been struggling for a while to run a simulation with WRFV3.7.1 on quite a large domain at high resolution (4 km) over South America; see the attached namelists and configure.wrf.
View attachment configure.wrf
View attachment namelist.wps
View attachment namelist.input
You will see it is not a fully standard WRF namelist, because the model is coupled to the ORCHIDEE land surface model; this is why it uses
Code:
sf_surface_physics = 9
But this is not the issue here: I tried running standard WRF and got the same problems.
It has been a nightmare to get this domain to run. I only managed to complete 1 month by setting up what I think should be the most stable possible configuration:
Code:
time_step = 1,
(...)
&dynamics
w_damping = 1,
diff_opt = 2,
km_opt = 4,
diff_6th_opt = 2,
diff_6th_factor = 0.12,
base_temp = 290.,
damp_opt = 3,
zdamp = 5000.,
dampcoef = 0.2,
khdif = 0,
kvdif = 0,
non_hydrostatic = .true.,
moist_adv_opt = 1,
scalar_adv_opt = 1,
epssm = 0.3
/
Notice that I have 4 km horizontal resolution, so I should be able to run with a time step of at least 20 s (less than 6 x dx in km, i.e. 24 s). During my trials I was getting CFL errors or NaN values everywhere most of the time.
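Just to make that rule of thumb explicit, here is a minimal standalone sketch (my own illustration, not WRF code; the program and variable names are just for this example) of the usual guideline time_step [s] <= 6 * dx [km]:
Code:
! Minimal sketch of the usual rule of thumb for the WRF large time step:
! time_step [s] <= 6 * dx [km].  Values below are just this case (dx = 4 km).
PROGRAM dt_rule_of_thumb
  IMPLICIT NONE
  REAL, PARAMETER :: dx_km   = 4.0        ! horizontal grid spacing [km]
  REAL, PARAMETER :: dt_used = 20.0       ! time step I would like to use [s]
  REAL            :: dt_max

  dt_max = 6.0 * dx_km                    ! recommended upper bound, here 24 s
  PRINT *, 'dx =', dx_km, ' km  ->  time_step <= ', dt_max, ' s'
  PRINT *, '20 s is below that bound? ', (dt_used <= dt_max)
END PROGRAM dt_rule_of_thumb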
It would be nice if WRF could incorporate a NaN check and stop when that happens. I have such a check because I am using the WRF-CORDEX module (https://www.geosci-model-dev.net/12/1029/2019/gmd-12-1029-2019.html). It could be done as follows (in dyn_em/solve_em.F):
Code:
! Checking for NaNs: a NaN is the only value that compares unequal to itself.
! im2/jm2 sample a single point near the middle of the memory patch
! (they must be declared as INTEGER in the declaration section of solve_em).
im2 = ims + (ime - ims) / 2
jm2 = jms + (jme - jms) / 2
IF (grid%psfc(im2,jm2) /= grid%psfc(im2,jm2)) THEN
  PRINT *,'ERROR -- error -- ERROR -- error'
  WRITE(wrf_err_message,*)'solve_em: wrong PSFC value=', &
    grid%psfc(im2,jm2),' at: ', im2 ,', ', jm2, ' !!!'
#ifdef DM_PARALLEL
  CALL wrf_error_fatal(TRIM(wrf_err_message))
#else
  PRINT *,TRIM(wrf_err_message)
  STOP
#endif
END IF
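For reference, here is a minimal standalone sketch (my own illustration, not part of WRF or of the WRF-CORDEX module; the program and variable names are hypothetical) of the same idiom, together with the standard ieee_is_nan test, which is safer when aggressive floating-point optimisations might remove the self-comparison:
Code:
! Standalone demo of NaN detection: a NaN is the only value unequal to itself.
! ieee_is_nan (Fortran 2003, ieee_arithmetic module) is the more robust option
! when flags such as -ffast-math could optimise the x /= x comparison away.
PROGRAM nan_check_demo
  USE, INTRINSIC :: ieee_arithmetic, ONLY: ieee_is_nan, ieee_value, ieee_quiet_nan
  IMPLICIT NONE
  REAL :: psfc_ok, psfc_bad

  psfc_ok  = 101325.0                              ! a sane surface pressure [Pa]
  psfc_bad = ieee_value(psfc_bad, ieee_quiet_nan)  ! deliberately create a NaN

  PRINT *, 'self-comparison test: ok=', (psfc_ok /= psfc_ok), ' bad=', (psfc_bad /= psfc_bad)
  PRINT *, 'ieee_is_nan test:     ok=', ieee_is_nan(psfc_ok), ' bad=', ieee_is_nan(psfc_bad)
END PROGRAM nan_check_demo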
This is my first question: what would be the most robust possible WRF namelist configuration? Is there any way to set up a namelist (the &dynamics section) that would be stable under any configuration and domain? I can understand that at the beginning of the simulation WRF is highly unstable due to spin-up issues. Thus, I was thinking of starting with a configuration free of CFL problems and then, as spin-up progressed, easing the constraints little by little. But it did not work. In the end I ran an entire month almost entirely using
Code:
time_step=1
The second issue I noticed is that the model crashes at different times depending on the number of CPUs being used. Here are some examples, performed consecutively:
# time_step = 1, ncpus = 2920, radt = 10: 1998-01-01_00:26:27
# time_step = 3: 1998-01-01_00:10:00
# time_step = 3, radt = 4, ncpus = 2840: 1998-01-02_04:08:00
# ncpus = 2760: 1998-01-01_09:27:57
# ncpus = 2800: 1998-01-05_04:24:18
# time_step = 5: 1998-01-01_11:00:00
# time_step = 3, ncpus = 3200: 1998-01-01_00:02:50
As you can see, there is no clear improvement from increasing or decreasing time_step, or from increasing or decreasing the number of CPUs.
I understand that all these issues are highly machine dependent and also depend on the state of the nodes where the model runs (nodes are exclusively allocated when in use). But I can't understand why the simulated time reached before the model crashes shows a sensitivity to the number of CPUs being used.
I have CPU time available to run whatever tests you might suggest.
Many thanks in advance