Sig. 11, memory issues?

Jialiwang
Posts: 12
Joined: Wed May 22, 2019 8:20 pm

Post by Jialiwang » Wed May 22, 2019 8:42 pm

Dear all,

I have recently been testing a WRF simulation with a large problem size on an IBM machine at Argonne (Mira). Strangely, the same model setup works with WRFV3.5 but not with WRFV3.9.1.1 when using exactly the same number of nodes and cores per node. The error is always sig. 11, and I have never been able to figure out what causes it. For WRFV3.5, I use >= 1 GB of memory per node, while for WRFV3.9.1.1 I tried 1 GB, 2 GB, 4 GB, and 8 GB per node, but with no luck. I use 512 nodes, and I also tried 256, 128, and 64 nodes. While I am contacting the machine's support staff, I wonder whether you have any clue about what is going on?

I may need to provide a lot more information for you to give some insight into this. Please let me know what information I could provide.

Thanks!

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Post by kwerner » Wed May 22, 2019 9:49 pm

Hi,
Can you provide the namelist.input file you are using, as well as your rsl.error.* files (all of them packaged into one *.tar file)? You can attach those to your reply. Please also let me know whether you made any modifications to the code, or whether it is pristine (out-of-the-box) code.

Thanks!
NCAR/MMM
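
Packaging the rsl.error.* files as requested above can be done in one step. This is a minimal sketch: the function name `pack_rsl_errors` and the archive name are illustrative assumptions, not anything prescribed in the thread.

```shell
# Hypothetical helper: bundle every rsl.error.* file in a run directory
# into a single tar archive.
#   $1 = run directory containing the rsl.error.* files
#   $2 = path of the tar file to create (an absolute path is safest,
#        because a relative one is resolved inside the run directory)
pack_rsl_errors() {
    ( cd "$1" && tar -cf "$2" rsl.error.* )
}
```

For example, `pack_rsl_errors /path/to/run /tmp/rsl_errors.tar` followed by `tar -tf /tmp/rsl_errors.tar` lists what was packed.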

Jialiwang
Posts: 12
Joined: Wed May 22, 2019 8:20 pm

Post by Jialiwang » Tue May 28, 2019 3:21 pm

Thanks for your reply @kwerner.

I am attaching the namelist.input file and several rsl files with their corresponding submission scripts. The scripts differ mostly in the number of nodes/processors they use. The error message is the same in every case (sig. 11).

When we compiled WRF on Mira, we made some modifications to configure.wrf and to the makefile under external/io_netcdf, mostly to the libraries (to handle netCDF-4). I am also attaching those modifications. If the original files are needed, I can upload them as well.

I should mention that this problem size and this version of WRF work on several other machines I have tested; only Mira fails. However, the same namelist.input and the same problem do work with WRFV3.5 on Mira. I am going to test v3.7.1 and v3.8.1.

Meanwhile, if you find anything that may cause the problem, please let me know! v3.9.1.1 is the version we would like to use.

Thanks again.
Attachments: wrfhelp.tar (820 KiB), modification.txt (1016 Bytes)

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Post by kwerner » Tue May 28, 2019 9:26 pm

Hi,
It certainly is odd that this works for one version but not another, with the exact same set-up on the same machine. Unfortunately, since this is only happening on one machine, it does seem to be specific to your environment, and it is something we likely cannot reproduce here. I would suggest trying a few different things (if you haven't already):

1) If you have multiple rsl.error.* files, do a full listing of them so that you can see their sizes. If any (besides the *.0000 file) are significantly larger than the others, check the bottom of those files for more useful information that may indicate the error.

2) Try a very small, simple case (e.g. the default namelist), using V3.9.1.1 and the same environment configuration that is causing the problems. If this runs, then the problem seems to be specific to your case, and likely something in that case is causing problems with the newer code.

3) If you can get 2) to work, try adding a few new options at a time to see if you can make it fail. Continue this process, having your simulation migrate closer to the set-up you're having trouble with. This will help you to narrow down the problem.

4) You can try a debugger. For this you will need to clean, reconfigure, and recompile the code. When you configure, use

./configure -d
and then go into your configure.wrf file and look for the following line, which should look something like:

FCDEBUG         =       # -g $(FCNOOPT) # -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
Modify it so that additional checks will be included, so that it now reads:

FCDEBUG         =        -g $(FCNOOPT) -fcheck=all # -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
and then recompile. When you run wrf.exe, you should see the error messages print out in your rsl.out.0000 file.
NCAR/MMM
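
Suggestions 1) and 4) above can be scripted. The sketch below is only an illustration: the function names are hypothetical, and the `sed` pattern assumes the FCDEBUG line looks like the one quoted in the post. The flags on that line are GNU Fortran options, so a configure.wrf generated for a different compiler will likely look different and need a different edit.

```shell
# Suggestion 1: list the rsl.error.* files in a run directory,
# largest first, so an unusually big file stands out.
#   $1 = run directory
list_rsl_by_size() {
    ( cd "$1" && ls -S rsl.error.* )
}

# Suggestion 4: uncomment the debug flags on the FCDEBUG line of
# configure.wrf and add -fcheck=all, as described in the post.
# Assumes the line has the form
#   FCDEBUG = # -g $(FCNOOPT) # ...
# and leaves the file unchanged if it does not.
#   $1 = path to configure.wrf
enable_fcdebug_checks() {
    sed -E 's/^(FCDEBUG[[:space:]]*=[[:space:]]*)#[[:space:]]*-g \$\(FCNOOPT\)[[:space:]]*#/\1-g $(FCNOOPT) -fcheck=all #/' \
        "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

A typical sequence in the WRF top-level directory would be `./clean -a`, then `./configure -d`, then `enable_fcdebug_checks configure.wrf`, then the usual compile command (for a real-data case, `./compile em_real`; the case name here is an assumption). After the run, `tail` on the first few files reported by `list_rsl_by_size` shows the end of the largest logs.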

Jialiwang
Posts: 12
Joined: Wed May 22, 2019 8:20 pm

Post by Jialiwang » Wed Jun 05, 2019 7:18 pm

Hi,

To follow up on your suggestion, we tried the debug mode and found that the surface layer scheme was the problem. We changed the following two schemes from 1 to 2, and the model now runs fine in a short test. We will try a longer run and let you know if we see the problem again.
sf_sfclay_physics = 2
bl_pbl_physics = 2

Thanks for your help!
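
The namelist change described above can be applied from the command line. This is a rough sketch, not the poster's actual procedure: `set_namelist_option` is a hypothetical helper, and it assumes each option sits on a single line of namelist.input in the form `name = v1, v2, ...` with integer values (one per domain).

```shell
# Hypothetical helper: set every integer value of one namelist option
# to a new value, leaving all other lines untouched.
#   $1 = namelist file, $2 = option name, $3 = new integer value
set_namelist_option() {
    awk -v name="$2" -v val="$3" '
        {
            eq = index($0, "=")
            split($0, parts, "=")
            key = parts[1]
            gsub(/[ \t]/, "", key)           # strip padding around the name
            if (eq > 0 && key == name) {
                rhs = substr($0, eq + 1)     # everything after "="
                gsub(/[0-9]+/, val, rhs)     # replace each domain value
                print substr($0, 1, eq) rhs
            } else {
                print
            }
        }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Calling `set_namelist_option namelist.input sf_sfclay_physics 2` and then `set_namelist_option namelist.input bl_pbl_physics 2` reproduces the two settings quoted in the post. The two options are changed together because, per the WRF documentation, the PBL option 2 is meant to be used with surface layer option 2.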
