Help solving received signal SIGSEGV(11)

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

aztec2021

Hi all.
I was wondering if someone here can help me find a solution for a "received signal SIGSEGV(11)" error message when running WRF. I know that these types of errors can be difficult to solve, so any advice will be appreciated. I am running a 2-domain simulation centered over South America with the configuration below.
i_parent_start = 1, 26,
j_parent_start = 1, 20,
e_we = 305, 148,
e_sn = 108, 136,
The real.exe program runs fine, but wrf.exe stops at t=0 with the message below.

I have performed several tests with parent domains of smaller W-E extent (listed below) and they all work fine, so this does not appear to be a namelist or wrf.cmd problem.
Domain 1: 185x108 runs fine
Domain 2: 205x108 runs fine
Domain 3: 225x108 runs fine
Domain 4: 305x108 crashes
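
To put these sizes side by side, here is a minimal Python sketch that counts horizontal grid points for each tested parent domain and for the nest. It uses the e_we and e_sn values exactly as given above. The function name is just for illustration, and the count says nothing about vertical levels or actual memory use.

```python
# Horizontal sizes (e_we, e_sn) copied from the tests listed above.
parent_tests = {
    "Domain 1": (185, 108),
    "Domain 2": (205, 108),
    "Domain 3": (225, 108),
    "Domain 4": (305, 108),
}
nest = (148, 136)  # e_we, e_sn of the nested domain


def horizontal_points(e_we, e_sn):
    """Number of horizontal grid points implied by e_we x e_sn."""
    return e_we * e_sn


baseline = horizontal_points(*parent_tests["Domain 1"])
for name, (e_we, e_sn) in parent_tests.items():
    n = horizontal_points(e_we, e_sn)
    print(f"{name}: {e_we}x{e_sn} = {n} points ({n / baseline:.2f}x Domain 1)")

print(f"Nest: {nest[0]}x{nest[1]} = {horizontal_points(*nest)} points")
```

This only compares relative sizes; it is not a memory estimate.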

In all of these domains, the number of land points is about the same; the only difference is the number of ocean points. In other words, domain 4 has more ocean points than domain 3, and so on.

Lastly, if I turn off the nested domain, the model runs fine for domain 4, which indicates to me that the issue is with the nested domain. This seems to be a memory availability problem, in which the 305x108 configuration requires more memory than the HPC can handle.

I have also tested different land surface schemes (NOAH, SSIB), SST update on and off, and different time steps. By the way, I am running this on NCAR Cheyenne.

Any help will be appreciated. The error message is below.
Thank you

--- ERROR MESSAGE----
Code:
MPT ERROR: Rank 6(g:6) received signal SIGSEGV(11).
        Process ID: 64274, Host: r12i0n25, Program: /glade/scratch/fsales/Amazon/WRFV3.9.1.1/main/wrf.exe
        MPT Version: HPE MPT 2.22 03/31/20 15:59:10
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/64274/exe, process 64274
MPT: Try: zypper install -C "debuginfo(build-id)=4e96cf37d52b9c2f3648e691878b682da5abfa42"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/glade/u/apps/ch/os/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=93c4deac1088eb84fbd01cf2a2c54399f516e9a7"
MPT: (no debugging symbols found)...done.
MPT: 0x00002b788b27f6da in waitpid () from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-49.16.x86_64
MPT: (gdb) #0 0x00002b788b27f6da in waitpid ()
MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
MPT: #1 0x00002b788b5c2306 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header entry=0x7ffea4caad10 "MPT ERROR: Rank 6(g:6) received signal SIGSEGV(11).\n\tProcess ID: 64274, Host: r12i0n25, Program: /glade/scratch/fsales/Amazon/WRFV3.9.1.1/main/wrf.exe\n\tMPT Version: HPE MPT 2.22 03/31/20 15:59:10\n") at sig.c:340
MPT: #3 0x00002b788b5c24ff in first_arriver_handler (signo=signo entry=11,
MPT: stack_trace_sem=stack_trace_sem entry=0x2b78959e0080) at sig.c:489
MPT: #4 0x00002b788b5c2793 in slave_sig_handler (signo=11,
MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:565
MPT: #5 <signal handler called>
MPT: #6 0x000000000276c133 in module_sf_sfclayrev_mp_sfclayrev1d_ ()
MPT: #7 0x000000000276a2d3 in module_sf_sfclayrev_mp_sfclayrev_ ()
MPT: #8 0x000000000211fc6a in module_surface_driver_mp_sfclayrev_seaice_wrapper_ ()
MPT: #9 0x00000000020f17df in module_surface_driver_mp_surface_driver_ ()
MPT: #10 0x0000000001aa0642 in module_first_rk_step_part1_mp_first_rk_step_part1_ ()
MPT: #11 0x00000000012d06ec in solve_em_ ()
MPT: #12 0x00000000011692f4 in solve_interface_ ()
MPT: #13 0x0000000000541657 in module_integrate_mp_integrate_ ()
MPT: #14 0x0000000000541c6e in module_integrate_mp_integrate_ ()
MPT: #15 0x0000000000405e51 in module_wrf_top_mp_wrf_run_ ()
MPT: #16 0x0000000000405e0f in MAIN__ ()
MPT: #17 0x0000000000405da2 in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 64274] will be detached.

MPT: -----stack traceback ends-----
MPT: On host r12i0n25, Program /glade/scratch/fsales/Amazon/WRFV3.9.1.1/main/wrf.exe, Rank 6, Process 64274: Dumping core on signal SIGSEGV(11) into directory /glade/scratch/fsales/Amazon/WRFV3.9.1.1/run_size5
 
Hi,
I suspect this is a memory issue. Since the grid dimensions are large for both domains, you definitely need sufficient memory to run this case. Can you increase the number of processors? This often gives you more memory. Otherwise, please ask your system administrator how to obtain more memory.
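
As a rough illustration of why more processors can help, the sketch below estimates the size of the patch each MPI rank would hold for the two grids in the original post. It assumes a near-square factorization of the rank count into a 2D processor grid, which is an assumption for illustration and not necessarily the layout WRF picks. The rank counts are hypothetical, and halo cells and vertical levels are ignored.

```python
import math


def near_square_factors(nranks):
    """Return (nx, ny) with nx * ny == nranks, nx <= ny, as close to square as possible."""
    nx = math.isqrt(nranks)
    while nranks % nx != 0:
        nx -= 1
    return nx, nranks // nx


def patch_size(e_we, e_sn, nranks):
    """Approximate largest per-rank patch (west-east, south-north), ignoring halos."""
    nx, ny = near_square_factors(nranks)
    return math.ceil(e_we / nx), math.ceil(e_sn / ny)


# Grid sizes from the original post; the rank counts are hypothetical examples.
grids = {"parent": (305, 108), "nest": (148, 136)}
for nranks in (36, 72, 144):
    for label, (e_we, e_sn) in grids.items():
        we, sn = patch_size(e_we, e_sn, nranks)
        print(f"{nranks} ranks, {label}: about {we}x{sn} points per rank")
```

Under these assumptions, each rank holds a smaller piece of the grid as the rank count grows, so the memory needed per rank drops.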
 
Hi. Thanks, Ming, for your response. I was able to run using the ndown.exe approach. Apparently, CISL Cheyenne did not like the nested-domain configuration. Running the nest after the parent domain run was complete did the trick. Cheers.
 