
WRF terminated unexpectedly when using multiple processors

hhhyd

New member
Hi,

I am currently running WRF driven by ERA5 data for January 2025 on an HPC system with MPI.

I am very confused because my run terminates on January 21, 2025, when I set the time range from January 15 to January 25, yet it completed successfully from January 5 to January 15. I haven't made any changes to the namelist except for the dates. I also tried a single day, from 2025-01-21_00:00:00 to 2025-01-22_00:00:00, and it failed again.

I used 9 nodes, each with 16 cores, totaling 144 cores. I suspected the issue might be due to using too many nodes, but I also encountered failures when using just 4 nodes.
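
For reference, with 144 MPI tasks WRF chooses the domain decomposition automatically unless it is pinned in the namelist. Below is a minimal sketch of how the split could be fixed explicitly; the 12 x 12 layout is only an illustration consistent with 144 tasks, not a value from my attached namelist.

  &domains
   nproc_x = 12,   ! 12 x 12 = 144 patches, matching 9 nodes x 16 cores (illustrative only)
   nproc_y = 12,
  /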

I have attached my namelist.input here for your review. rsl.error.0000 and rsl.out.0000 are on Google Drive: Google Drive: Sign-in
Thank you so much!

ERROR: c1:342629:0:342629] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe0b5e6760)
==== backtrace (tid: 342629) ====
0 0x0000000000053519 ucs_debug_print_backtrace() ???:0
1 0x000000000004e5b0 killpg() ???:0
2 0x00000000022608d0 __module_ra_rrtm_MOD_taugb3() ???:0
3 0x000000000226244b __module_ra_rrtm_MOD_gasabs() ???:0
4 0x000000000227475b __module_ra_rrtm_MOD_rrtm() ???:0
5 0x00000000022766b1 __module_ra_rrtm_MOD_rrtmlwrad() ???:0
6 0x0000000001942253 __module_radiation_driver_MOD_radiation_driver() ???:0
7 0x0000000001a44902 __module_first_rk_step_part1_MOD_first_rk_step_part1() ???:0
8 0x000000000132af84 solve_em_() ???:0
9 0x0000000001138496 solve_interface_() ???:0
10 0x000000000047486e __module_integrate_MOD_integrate() ???:0
11 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
12 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
13 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
14 0x0000000000406562 __module_wrf_top_MOD_wrf_run() ???:0
15 0x0000000000405b4d main() ???:0
16 0x000000000003a7e5 __libc_start_main() ???:0
17 0x0000000000405b8e _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x153f437945af in ???
#1 0x22608d0 in ???
#2 0x226244a in ???
#3 0x227475a in ???
#4 0x22766b0 in ???
#5 0x1942252 in ???
#6 0x1a44901 in ???
#7 0x132af83 in ???
#8 0x1138495 in ???
#9 0x47486d in ???
 

Attachments

  • namelist.input
The successful completion of the first 10-day simulation, from January 5 to January 15, cannot guarantee the same success over another 10-day period, because the forcing and the physics/dynamics may evolve differently during different periods.

For your case, I am concerned about the physics over the four domains, whose resolutions range from roughly 9 km down to 330 m. Assumptions in the PBL scheme become invalid when the resolution is below 1 km, which is the typical grey zone for PBL schemes. In this situation the model may or may not work, just as you have seen in your case.
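
To illustrate the kind of per-domain settings this implies, here is a minimal sketch that switches the PBL scheme off on the sub-kilometre nests and uses a 3-D TKE closure there instead. It assumes a recent WRF version in which diff_opt and km_opt can be set per domain, and a 9 km / 3 km / 1 km / 330 m layout; the values are illustrative, not taken from the attached namelist.

  &physics
   bl_pbl_physics    = 1,  1,  0,  0,   ! PBL scheme on the coarser domains, off (LES) on the sub-km nests
   sf_sfclay_physics = 1,  1,  1,  1,   ! surface-layer scheme kept on for all domains
  /
  &dynamics
   diff_opt = 2,  2,  2,  2,            ! full 3-D deformation diffusion
   km_opt   = 4,  4,  2,  2,            ! horizontal Smagorinsky outside, 3-D TKE (LES) on the inner nests
  /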
 
Hi Ming,
Thank you for your reply. I will attempt to decrease the resolution for those four domains. Additionally, do you have any suggestions for achieving high resolution with WRF? I believe high resolution is possible under WRF-LES and WRF-URBAN. Would selecting mp_physics = 10, ra_lw_physics = 5, ra_sw_physics = 5, and sf_surface_physics = 2 help me achieve this high resolution? Thank you!
 
You are right that WRF can run in LES mode for high-resolution simulations. mp_physics = 10 is a good option, and the RRTMG scheme (i.e., ra_lw_physics = 4, ra_sw_physics = 4) might be better for radiation. The Noah LSM (sf_surface_physics = 2) is also widely used, but its snow package is less ideal. If your study period covers the cold season with snow accumulation, then Noah-MP (sf_surface_physics = 4) could be better than Noah.
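
Putting those options together, a &physics block along these lines would be one way to set it up. This is a sketch only; radt and anything not mentioned above is illustrative rather than a specific recommendation.

  &physics
   mp_physics         = 10, 10, 10, 10,   ! Morrison 2-moment microphysics
   ra_lw_physics      = 4,  4,  4,  4,    ! RRTMG longwave
   ra_sw_physics      = 4,  4,  4,  4,    ! RRTMG shortwave
   radt               = 9,  9,  9,  9,    ! radiation interval (minutes); often ~1 min per km of the coarsest grid spacing
   sf_surface_physics = 4,  4,  4,  4,    ! Noah-MP land surface model
  /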
 