WRF terminated unexpectedly when using multiple processors

hhhyd

Hi,

I am currently running WRF driven by ERA5 data for January 2025 on an HPC system with MPI.

I am very confused because my run terminates on January 21, 2025, when I set the time range from January 15 to January 25, yet it completed successfully from January 5 to January 15. I haven't made any changes to the namelist except for the dates. A single-day test from 2025-01-21_00:00:00 to 2025-01-22_00:00:00 also failed.
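For reference, the only entries I change between runs are the dates in &time_control; for the failing single-day test they look roughly like this (a sketch only; everything else, including the four-domain columns, follows the attached namelist.input):

&time_control
 run_days    = 1,
 run_hours   = 0,
 start_year  = 2025, 2025, 2025, 2025,
 start_month = 01,   01,   01,   01,
 start_day   = 21,   21,   21,   21,
 start_hour  = 00,   00,   00,   00,
 end_year    = 2025, 2025, 2025, 2025,
 end_month   = 01,   01,   01,   01,
 end_day     = 22,   22,   22,   22,
 end_hour    = 00,   00,   00,   00,
/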

I used 9 nodes, each with 16 cores, totaling 144 cores. I suspected the issue might be due to using too many nodes, but I also encountered failures when using just 4 nodes.

I have attached my namelist.input, rsl.error.0000, and rsl.out.0000 here for your review.
rsl.error.0000 and rsl.out.0000 are shared via a Google Drive link.
Thank you so much!

ERROR: [c1:342629:0:342629] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe0b5e6760)
==== backtrace (tid: 342629) ====
0 0x0000000000053519 ucs_debug_print_backtrace() ???:0
1 0x000000000004e5b0 killpg() ???:0
2 0x00000000022608d0 __module_ra_rrtm_MOD_taugb3() ???:0
3 0x000000000226244b __module_ra_rrtm_MOD_gasabs() ???:0
4 0x000000000227475b __module_ra_rrtm_MOD_rrtm() ???:0
5 0x00000000022766b1 __module_ra_rrtm_MOD_rrtmlwrad() ???:0
6 0x0000000001942253 __module_radiation_driver_MOD_radiation_driver() ???:0
7 0x0000000001a44902 __module_first_rk_step_part1_MOD_first_rk_step_part1() ???:0
8 0x000000000132af84 solve_em_() ???:0
9 0x0000000001138496 solve_interface_() ???:0
10 0x000000000047486e __module_integrate_MOD_integrate() ???:0
11 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
12 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
13 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
14 0x0000000000406562 __module_wrf_top_MOD_wrf_run() ???:0
15 0x0000000000405b4d main() ???:0
16 0x000000000003a7e5 __libc_start_main() ???:0
17 0x0000000000405b8e _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x153f437945af in ???
#1 0x22608d0 in ???
#2 0x226244a in ???
#3 0x227475a in ???
#4 0x22766b0 in ???
#5 0x1942252 in ???
#6 0x1a44901 in ???
#7 0x132af83 in ???
#8 0x1138495 in ???
#9 0x47486d in ???
 


The successful run of the first 10-day simulation from January 5 to January 15 cannot guarantee the same success over another 10-day period, because the forcing and the physics/dynamics may evolve differently during different periods.

For your case, I am concerned about the physics over the 4 domains, whose resolutions range from around 9 km down to 330 m. The assumptions in PBL schemes are invalid when the grid spacing is below 1 km, which falls in the typical grey zone for PBL schemes. In this situation the model may or may not work, which is what you have seen in your case.
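If you want to keep the sub-kilometer domains, a common approach is to turn the PBL scheme off on those domains and run them in LES mode with a 3D turbulence closure instead. A rough sketch, assuming d01/d02 are your coarser grids and d03/d04 are the ones below 1 km (keep whatever PBL option you already use on the outer domains; option 1 is shown only as a placeholder):

&physics
 bl_pbl_physics = 1, 1, 0, 0,
/
&dynamics
 diff_opt = 1, 1, 2, 2,
 km_opt   = 4, 4, 2, 2,
/

Here diff_opt = 2 with km_opt = 2 gives full 3D diffusion with a 1.5-order TKE closure on the PBL-free domains, while the outer domains keep the conventional PBL treatment.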
 
Hi Ming,
Thank you for your reply. I will attempt to decrease the resolution for those four domains. Additionally, do you have any suggestions for achieving high resolution with WRF? I believe high resolution is possible under WRF-LES and WRF-URBAN. Would selecting mp_physics = 10, ra_lw_physics = 5, ra_sw_physics = 5, and sf_surface_physics = 2 help me achieve this high resolution? Thank you!
 
You are right that WRF can run in LES mode for high-resolution simulations. mp_physics = 10 is a good option, and the RRTMG scheme (i.e., ra_lw_physics = 4, ra_sw_physics = 4) might be better for radiation. The Noah LSM (sf_surface_physics = 2) is also widely used, but its snow package is less ideal. If your study period is during the cold season with snow accumulation, then NoahMP (sf_surface_physics = 4) could be better than Noah.
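Put together, the &physics entries corresponding to these choices would look roughly like this (a sketch only, assuming the same schemes on all four domains; keep the rest of your namelist as it is):

&physics
 mp_physics         = 10, 10, 10, 10,
 ra_lw_physics      =  4,  4,  4,  4,
 ra_sw_physics      =  4,  4,  4,  4,
 sf_surface_physics =  4,  4,  4,  4,
/

Note that sf_surface_physics has to be the same on every domain.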
 