Dear forum,
I have been struggling for a while to run a simulation with WRFV3.7.1 on quite a large domain at high resolution (4 km) over South America; see the attached namelists and configure.wrf.
View attachment configure.wrf
View attachment namelist.wps
View attachment namelist.input
You will see it is not a fully standard WRF namelist, because the model is coupled to the ORCHIDEE land surface model; this is why it uses
Code:
sf_surface_physics = 9
But this is not the issue here: I tried running standard WRF and got the same problems.
It has been a nightmare to get this domain to run. I only managed to complete 1 month by setting up what I think should be the most stable possible configuration:
Code:
time_step = 1,
(...)
&dynamics
w_damping = 1,
diff_opt = 2,
km_opt = 4,
diff_6th_opt = 2,
diff_6th_factor = 0.12,
base_temp = 290.,
damp_opt = 3,
zdamp = 5000.,
dampcoef = 0.2,
khdif = 0,
kvdif = 0,
non_hydrostatic = .true.,
moist_adv_opt = 1,
scalar_adv_opt = 1,
epssm = 0.3
/
Notice that I have 4 km horizontal resolution, so I should be able to run with a time step of at least 20 s (less than 6 x dx in km, i.e. 24 s). During my trials I was getting CFL errors or NaN values everywhere most of the time.
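Just to make that rule of thumb explicit, here is a minimal standalone sketch (my own illustration, not WRF code; the program and variable names are just for this example) of the usual guideline time_step [s] <= 6 * dx [km]:
Code:
! Minimal sketch of the usual rule of thumb for the WRF large time step:
! time_step [s] <= 6 * dx [km].  Values below are just this case (dx = 4 km).
PROGRAM dt_rule_of_thumb
  IMPLICIT NONE
  REAL, PARAMETER :: dx_km   = 4.0        ! horizontal grid spacing [km]
  REAL, PARAMETER :: dt_used = 20.0       ! time step I would like to use [s]
  REAL            :: dt_max

  dt_max = 6.0 * dx_km                    ! recommended upper bound, here 24 s
  PRINT *, 'dx =', dx_km, ' km  ->  time_step <= ', dt_max, ' s'
  PRINT *, '20 s is below that bound? ', (dt_used <= dt_max)
END PROGRAM dt_rule_of_thumb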
It would be nice if WRF could incorporate a NaN check and stop when that happens. I have such a check because I am using the WRF-CORDEX module (https://www.geosci-model-dev.net/12/1029/2019/gmd-12-1029-2019.html). It could be done as follows (in dyn_em/solve_em.F):
Code:
! Checking for NaNs: a NaN is the only value that compares unequal to itself.
! im2/jm2 sample a single point near the middle of the memory patch
! (they must be declared as INTEGER in the declaration section of solve_em).
im2 = ims + (ime - ims) / 2
jm2 = jms + (jme - jms) / 2
IF (grid%psfc(im2,jm2) /= grid%psfc(im2,jm2)) THEN
  PRINT *,'ERROR -- error -- ERROR -- error'
  WRITE(wrf_err_message,*)'solve_em: wrong PSFC value=', &
    grid%psfc(im2,jm2),' at: ', im2 ,', ', jm2, ' !!!'
#ifdef DM_PARALLEL
  CALL wrf_error_fatal(TRIM(wrf_err_message))
#else
  PRINT *,TRIM(wrf_err_message)
  STOP
#endif
END IF
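For reference, here is a minimal standalone sketch (my own illustration, not part of WRF or of the WRF-CORDEX module; the program and variable names are hypothetical) of the same idiom, together with the standard ieee_is_nan test, which is safer when aggressive floating-point optimisations might remove the self-comparison:
Code:
! Standalone demo of NaN detection: a NaN is the only value unequal to itself.
! ieee_is_nan (Fortran 2003, ieee_arithmetic module) is the more robust option
! when flags such as -ffast-math could optimise the x /= x comparison away.
PROGRAM nan_check_demo
  USE, INTRINSIC :: ieee_arithmetic, ONLY: ieee_is_nan, ieee_value, ieee_quiet_nan
  IMPLICIT NONE
  REAL :: psfc_ok, psfc_bad

  psfc_ok  = 101325.0                              ! a sane surface pressure [Pa]
  psfc_bad = ieee_value(psfc_bad, ieee_quiet_nan)  ! deliberately create a NaN

  PRINT *, 'self-comparison test: ok=', (psfc_ok /= psfc_ok), ' bad=', (psfc_bad /= psfc_bad)
  PRINT *, 'ieee_is_nan test:     ok=', ieee_is_nan(psfc_ok), ' bad=', ieee_is_nan(psfc_bad)
END PROGRAM nan_check_demo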
This is my first question: what would be the most robust possible WRF namelist configuration? Is there any way to set up a namelist (the &dynamics section) that would be stable under any configuration and domain? I can understand that at the beginning of the simulation WRF is highly unstable due to spin-up issues. Thus, I was thinking of starting with a configuration free of CFL problems and then, as spin-up progressed, easing the constraints little by little. But it did not work. In the end I ran an entire month almost entirely using
Code:
time_step=1
The second issue I noticed is that the model crashes at different times depending on the number of CPUs being used. Here are some examples, performed consecutively:
# time_step = 1, ncpus = 2920, radt = 10: 1998-01-01_00:26:27
# time_step = 3: 1998-01-01_00:10:00
# time_step = 3, radt = 4, ncpus = 2840: 1998-01-02_04:08:00
# ncpus = 2760: 1998-01-01_09:27:57
# ncpus = 2800: 1998-01-05_04:24:18
# time_step = 5: 1998-01-01_11:00:00
# time_step = 3, ncpus = 3200: 1998-01-01_00:02:50
As you can see, there is no clear improvement from increasing or decreasing time_step, or from increasing or decreasing the number of CPUs.
I understand that all these issues are highly machine dependent and also depend on the state of the nodes where the model runs (nodes are exclusively allocated when in use). But I can't understand why the simulated time reached before the model crashes shows a sensitivity to the number of CPUs being used.
I have CPU time available to run whatever tests you might suggest.
Many thanks in advance