
WRF crashes at first time_step for nested run


Chris Thomas

New member
I'm using WRF 4.3 with one nested domain; parent_grid_ratio and parent_time_step_ratio are both 5. WRF crashes consistently at the first time step:
For example, with time_step = 60 the tail of rsl.out.0000 is:
Code:
Timing for main: time 1982-01-01_00:00:36 on domain   2:    0.17169 elapsed seconds
Timing for main: time 1982-01-01_00:00:48 on domain   2:    0.17240 elapsed seconds
Timing for main: time 1982-01-01_00:01:00 on domain   2:    0.17161 elapsed seconds
Timing for main: time 1982-01-01_00:01:00 on domain   1:   63.93356 elapsed seconds
taskid: 0 hostname: gadi-cpu-clx-1261.gadi.nci.org.au
If time_step = 30:
Code:
Timing for main: time 1982-01-01_00:00:18 on domain   2:    0.17194 elapsed seconds
Timing for main: time 1982-01-01_00:00:24 on domain   2:    0.17238 elapsed seconds
Timing for main: time 1982-01-01_00:00:30 on domain   2:    0.17284 elapsed seconds
Timing for main: time 1982-01-01_00:00:30 on domain   1:   61.38421 elapsed seconds
taskid: 0 hostname: gadi-cpu-clx-1153.gadi.nci.org.au
With debug_level = 100 the tail of rsl.out.0000 looks like:
Code:
d01 1982-01-01_00:00:30 calling inc/PERIOD_BDY_EM_B_inline.inc
d01 1982-01-01_00:00:30 calling inc/PERIOD_BDY_EM_B_inline.inc
d01 1982-01-01_00:00:30 calling inc/HALO_EM_C_inline.inc
d01 1982-01-01_00:00:30 calling inc/HALO_EM_C2_inline.inc
taskid: 0 hostname: gadi-cpu-clx-2053.gadi.nci.org.au
I've attached namelist.input, rsl.out.0000 and rsl.error.0000 for the debug_level = 100 run.
Any ideas?
 

Attachments

  • files.zip
    16.2 KB · Views: 21
Hi,

Can anyone help me with this? If I run precisely the same configuration over the same geographical area but at a lower resolution, i.e. with

dx, dy = 19567.24, 3913.447 (d01, d02) instead of
dx, dy = 12229.522, 2445.904 (d01, d02)

it runs fine. At the higher resolution we have:

e_we = 864, 991
e_sn = 581, 806

Pretty big domains, admittedly, but I am using 1152 CPUs and requesting 4000 GB of memory. That should be plenty (the successful lower-resolution run uses 305 GB).
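As a rough back-of-the-envelope sketch (assuming memory scales roughly with the number of horizontal grid points, since both runs cover the same area):
Code:
# Rough scaling estimate: same area, so grid-point count scales as (dx_low / dx_high)^2.
dx_low, dx_high = 19567.24, 12229.522   # parent-domain dx (m) for the two runs
mem_low_gb = 305.0                      # memory used by the working low-resolution run
scale = (dx_low / dx_high) ** 2         # ~2.6x more grid points
print(f"estimated memory: {mem_low_gb * scale:.0f} GB")   # ~780 GB, well under 4000 GB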
 
Hi Chris,
I looked at your namelist.input. All options look fine, except that you may want to set time_step = 60.
Given the large grid numbers, I suspect this is a memory issue. Usually, if the model crashes immediately, it is caused either by insufficient memory or by wrong input data.
Can you try to run with just a single domain (max_dom = 1)? If it works, at least we will know the input data are fine.
If the single-domain case works, then please increase the number of processors, which will give you more memory to run this case.
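For reference, the usual guideline is that time_step (in seconds) should be no larger than about 6 x dx (in km); a quick arithmetic sketch using the dx values quoted above:
Code:
# Rule-of-thumb check: time_step (s) <= ~6 x dx (km).
dx_d01_m = 12229.522                  # parent-domain grid spacing (m)
max_step = 6 * dx_d01_m / 1000.0      # ~73 s, so time_step = 60 is safely below it
nest_step = 60 / 5                    # with a time-step ratio of 5 -> 12 s on d02
print(max_step, nest_step)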
 
If I set max_dom = 1 the run still crashes. If this indicates a problem with the wrfinput files, can you suggest how to diagnose it? Also, I have tried increasing the number of processors to 1440 and requesting 5700 GB of memory, but it still crashes. That is a huge amount, so I don't think this is a memory issue.
 
Hi Chris,
We usually examine the wrfinput file to make sure all the variables are physically reasonable. We also pay attention to inconsistencies between variables, for example missing soil moisture and soil temperature data at a land point. Such a situation rarely occurs, but we have seen it in certain cases.
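A minimal sketch of such a check, assuming Python with the netCDF4 package (SMOIS, TSLB, and LANDMASK are the standard wrfinput variable names):
Code:
# Flag land points whose top-layer soil moisture or soil temperature looks unphysical.
from netCDF4 import Dataset

with Dataset("wrfinput_d01") as nc:
    landmask = nc.variables["LANDMASK"][0]   # 1 = land, 0 = water
    smois = nc.variables["SMOIS"][0, 0]      # top-layer soil moisture (m3/m3)
    tslb = nc.variables["TSLB"][0, 0]        # top-layer soil temperature (K)

    bad = (landmask == 1) & ((smois <= 0.0) | (tslb < 170.0) | (tslb > 350.0))
    print("suspicious land points:", int(bad.sum()))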

I guess you are running the standard WRF 4.3 and didn't modify the code. Please let me know if I am wrong.

Can you send me your namelist.wps and tell me what data you ungribbed to provide initial and boundary conditions for this case? I would like to repeat the run and hope to figure out what is wrong.
 