Hi,
I am currently running WRF with ERA5 data for January 2025 on an HPC cluster with MPI.
I am very confused because the run terminates at January 21, 2025 when I set the time range to January 15 through January 25. However, the run from January 5 to January 15 completed successfully. I haven't changed anything in the namelist except the dates. I also tried a single day, from 2025-01-21-00:00:00 to 2025-01-22-00:00:00, and it failed again.
I used 9 nodes, each with 16 cores, totaling 144 cores. I suspected the issue might be due to using too many nodes, but I also encountered failures when using just 4 nodes.
I have attached my namelist.input, rsl.error.0000, and rsl.out.0000 here for your review.
rsl.error.0000 and rsl.out.0000 are on Google Drive.
Thank you so much!
ERROR: c1:342629:0:342629] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe0b5e6760)
==== backtrace (tid: 342629) ====
0 0x0000000000053519 ucs_debug_print_backtrace() ???:0
1 0x000000000004e5b0 killpg() ???:0
2 0x00000000022608d0 __module_ra_rrtm_MOD_taugb3() ???:0
3 0x000000000226244b __module_ra_rrtm_MOD_gasabs() ???:0
4 0x000000000227475b __module_ra_rrtm_MOD_rrtm() ???:0
5 0x00000000022766b1 __module_ra_rrtm_MOD_rrtmlwrad() ???:0
6 0x0000000001942253 __module_radiation_driver_MOD_radiation_driver() ???:0
7 0x0000000001a44902 __module_first_rk_step_part1_MOD_first_rk_step_part1() ???:0
8 0x000000000132af84 solve_em_() ???:0
9 0x0000000001138496 solve_interface_() ???:0
10 0x000000000047486e __module_integrate_MOD_integrate() ???:0
11 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
12 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
13 0x0000000000474ece __module_integrate_MOD_integrate() ???:0
14 0x0000000000406562 __module_wrf_top_MOD_wrf_run() ???:0
15 0x0000000000405b4d main() ???:0
16 0x000000000003a7e5 __libc_start_main() ???:0
17 0x0000000000405b8e _start() ???:0
=================================
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x153f437945af in ???
#1 0x22608d0 in ???
#2 0x226244a in ???
#3 0x227475a in ???
#4 0x22766b0 in ???
#5 0x1942252 in ???
#6 0x1a44901 in ???
#7 0x132af83 in ???
#8 0x1138495 in ???
#9 0x47486d in ???
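
To make the backtrace above easier to read, here is a minimal sketch that decodes the gfortran-mangled frame names (`__<module>_MOD_<procedure>`) into module and procedure pairs. The helper name `decode_frames` and the `sample` text are only illustrative; the sample lines are copied from the backtrace in this post. It only reformats what the log already shows and does not diagnose the cause.

```python
import re

# gfortran mangles a module procedure as __<module>_MOD_<procedure>.
# Frames without _MOD_ (e.g. solve_em_) are not module procedures and are skipped.
FRAME_RE = re.compile(r"__(?P<module>\w+?)_MOD_(?P<proc>\w+)\(\)")


def decode_frames(backtrace_text):
    """Return (module, procedure) pairs in the order they appear (innermost first)."""
    return [(m.group("module"), m.group("proc"))
            for m in FRAME_RE.finditer(backtrace_text)]


# A few lines copied from the backtrace in this post.
sample = """\
2 0x00000000022608d0 __module_ra_rrtm_MOD_taugb3() ???:0
3 0x000000000226244b __module_ra_rrtm_MOD_gasabs() ???:0
8 0x000000000132af84 solve_em_() ???:0
"""

frames = decode_frames(sample)
for module, proc in frames:
    print(f"{module}: {proc}")
```

Run on the full backtrace, the first decoded frame is the routine where the segmentation fault was raised, here `taugb3` in `module_ra_rrtm`, reached through `rrtmlwrad` from the radiation driver.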