6 months simulation stopped after completing 4 months runs

Ifeanyi · Aug 23, 2023

Hi,
I am running several 6-month (April - September) simulations with irrigation and no-irrigation setups for various years on Cheyenne. Several years have completed with both setups, but unfortunately the irrigation runs for 2011 and 2012 stopped after about 4 to 4.5 months, while about three other years completed both setups successfully with no error.
The first restart file is wrfrst_d01_2012-04-11_00:00:00 and the model stops after restarting from wrfrst_d02_2012-08-09_00:00:00.
Year 2011 had the same error after restarting from wrfrst_d02_2011-08-19_00:00:00.
However, three other years with the same setup were completed with no error and I could not figure out what is causing the error.

This is the path to the problematic run on Cheyenne.

/glade/scratch/achugbu/WRF_IRRI_2012/run

MPT: 0x00002b01b81897da in waitpid ()
MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-100.27.3.x86_64
MPT: (gdb) #0 0x00002b01b81897da in waitpid ()
MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: #1 0x00002b01b84cec66 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7ffe7b9cd390 "MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).\n\tProcess ID: 4110, Host: r13i4n11, Program: /glade/scratch/achugbu/WRF_IRRI_2012/main/wrf.exe\n\tMPT Version: HPE MPT 2.25 08/14/21 03:05:20\n") at sig.c:340
MPT: #3 0x00002b01b84cee66 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2b01c26e0080) at sig.c:489
MPT: #4 0x00002b01b84cf0f3 in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x0000000002ba31c1 in module_sf_sfclayrev_mp_psim_stable_ ()
MPT: #7 0x0000000002b9e734 in module_sf_sfclayrev_mp_sfclayrev1d_ ()
MPT: #8 0x0000000002b9c333 in module_sf_sfclayrev_mp_sfclayrev_ ()
MPT: #9 0x0000000002469603 in module_surface_driver_mp_surface_driver_ ()
MPT: #10 0x0000000001d58015 in module_first_rk_step_part1_mp_first_rk_step_part1_
MPT: ()
MPT: #11 0x0000000001500067 in solve_em_ ()
MPT: #12 0x00000000013153fc in solve_interface_ ()
MPT: #13 0x000000000056431b in module_integrate_mp_integrate_ ()
MPT: #14 0x0000000000564932 in module_integrate_mp_integrate_ ()
MPT: #15 0x0000000000406291 in module_wrf_top_mp_wrf_run_ ()
MPT: #16 0x000000000040624f in MAIN__ ()
MPT: #17 0x00000000004061e2 in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 4110] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/4110/exe, process 4110
MPT: [Inferior 1 (process 4110) detached]
MPT: -----stack traceback ends-----
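
To read a traceback like the one above, it helps to skip the MPT signal-handling frames and look at the first frames after `<signal handler called>`. The following is a minimal sketch of that idea, not part of WRF or MPT. The function names are hypothetical, and the symbol splitting assumes the Intel Fortran naming convention (`<module>_mp_<procedure>_`), which matches the symbols shown here but is an assumption about how this executable was built.

```python
import re

# Matches gdb-style frames as printed by MPT, for example:
#   "MPT: #6 0x0000000002ba31c1 in module_sf_sfclayrev_mp_psim_stable_ ()"
# The address part is optional because some frames omit it.
FRAME_RE = re.compile(r"#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)")


def frames_after_signal(lines):
    """Return (frame_number, symbol) pairs that follow '<signal handler called>'."""
    frames = []
    seen_handler = False
    for line in lines:
        if "<signal handler called>" in line:
            seen_handler = True
            continue
        if not seen_handler:
            continue  # still inside the MPT/MPI signal-handling frames
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def split_ifort_symbol(symbol):
    """Split '<module>_mp_<proc>_' into (module, proc).

    Assumes Intel Fortran name mangling; returns (None, name) when the
    symbol does not contain the '_mp_' separator.
    """
    name = symbol[:-1] if symbol.endswith("_") else symbol
    module, sep, proc = name.partition("_mp_")
    return (module, proc) if sep else (None, name)


if __name__ == "__main__":
    sample = [
        "MPT: #5 <signal handler called>",
        "MPT: #6 0x0000000002ba31c1 in module_sf_sfclayrev_mp_psim_stable_ ()",
        "MPT: #7 0x0000000002b9e734 in module_sf_sfclayrev_mp_sfclayrev1d_ ()",
    ]
    for number, symbol in frames_after_signal(sample):
        print(number, *split_ifort_symbol(symbol))
```

Applied to this traceback, the first frame after the signal handler is frame #6, `psim_stable` in `module_sf_sfclayrev`, called from the surface driver, so the segmentation fault occurs inside the revised surface-layer scheme.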

kwerner · Aug 28, 2023

Hi,
I ran two tests on Cheyenne using your input and namelist. With WRF version 4.2.2, I got the same result as you, with the model stopping at the same time. With version 4.5.1 (the latest version), it runs to completion. I'm not sure which code update corrected whatever was causing your problem, but it seems to work in the newer version. Can you try running this with v4.5.1 to see if it works for you as well?

Ifeanyi · Aug 29, 2023

Thanks kwerner.
I have completed 10 other 6-month setups using version 4.2.2. Is there no way to get the model to run to completion with that version?
Using a different version would make the comparison with the other completed runs hard to justify.

Ifeanyi.

kwerner · Sep 1, 2023

Hi Ifeanyi,
I'm trying to track down the specific changes in the code that fix the issue. If I can figure that out, then you will only need to modify those specific files, and not change the entire version. I'll keep you posted.

Ifeanyi · Sep 1, 2023

Thanks kwerner, I will be waiting for your update.

kwerner · Sep 5, 2023

Thank you for your patience. I tracked down the code commit that allowed this to work. I applied the modifications from that commit to the corresponding files in V4.2.2, recompiled, and ran your case, and it worked for me. I am attaching the files here. Place the module_sf* files in the phys/ directory. Rename the MPTABLE.TBL.txt file to just MPTABLE.TBL and place it in the run/ directory (you may want to save the original versions of those files somewhere else, or under a different name, just to hold onto them). You will then need to recompile WRF, but you DO NOT have to clean the code or reconfigure. Simply recompile, and it should be much quicker than a full compile. After that, please run the test again to see whether you're able to get further, and let me know. Thanks!
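
The file placement described above can be scripted so that the originals are backed up before they are overwritten. This is only a sketch under stated assumptions: the function name, the `.orig` backup suffix, and the example paths are hypothetical, and it assumes the attachments were downloaded into a single directory with the names given in this post.

```python
import shutil
from pathlib import Path


def install_patched_files(download_dir, wrf_dir, backup_suffix=".orig"):
    """Copy the attached files into a WRF tree, backing up the originals first.

    download_dir: directory holding the downloaded attachments (assumed layout).
    wrf_dir: top-level WRF directory containing phys/ and run/.
    Returns the list of destination paths that were written.
    """
    download_dir = Path(download_dir)
    wrf_dir = Path(wrf_dir)

    # module_sf* files go to phys/ under their existing names.
    targets = {
        src: wrf_dir / "phys" / src.name
        for src in sorted(download_dir.glob("module_sf*"))
    }
    # MPTABLE.TBL.txt is renamed to MPTABLE.TBL and goes to run/.
    table = download_dir / "MPTABLE.TBL.txt"
    if table.exists():
        targets[table] = wrf_dir / "run" / "MPTABLE.TBL"

    installed = []
    for src, dest in targets.items():
        backup = dest.with_name(dest.name + backup_suffix)
        # Keep only the first backup so a re-run never overwrites the true original.
        if dest.exists() and not backup.exists():
            shutil.copy2(dest, backup)
        shutil.copy2(src, dest)
        installed.append(dest)
    return installed


if __name__ == "__main__":
    # Hypothetical paths; replace with your own download and WRF directories.
    for path in install_patched_files("downloads", "WRF"):
        print("installed", path)
```

After the files are in place, WRF is recompiled from the top-level directory without cleaning or reconfiguring (for a real-data case this is typically `./compile em_real`, which is an assumption about this particular build).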

Ifeanyi · Sep 8, 2023

Thank you so much, kwerner.
The simulation completed successfully after implementing the modifications. Thank you so much for your effort.

Ifeanyi.

kwerner · Sep 8, 2023

That's great news! Thank you for the update.
