Sig. 11, memory issues?

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

Jialiwang

Dear all,

I have recently been testing a large WRF simulation on an IBM machine at Argonne (Mira). Strangely, the same model setup works with WRFV3.5 but not with WRFV3.9.1.1, using exactly the same number of nodes and cores per node. The error is always sig. 11, and I have never figured out what causes it. For WRFV3.5 I use >=1 GB of memory per node, while for WRFV3.9.1.1 I tried 1 GB, 2 GB, 4 GB, and 8 GB per node, with no luck. I use 512 nodes, and have also tried 256, 128, and 64 nodes. While I am contacting the machine's support team, I wonder whether you have any clue about what is going on?

I may need to provide a lot more information for you to give some insight into this. Please let me know what information I could provide.

Thanks!
 
Hi,
Can you provide the namelist.input file you are using, as well as your rsl.error* files (all of them packaged into one *.tar file)? You can attach those to your response post. Please also let me know whether you made any modifications to the code, or whether it is pristine (out-of-the-box), unmodified code.
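One way to package the files is sketched below in Python, using the standard tarfile module. The function name `package_rsl_files` and the archive name `rsl_errors.tar` are placeholders chosen for this example, not anything WRF provides; point it at the directory that holds the rsl.error.* files.

```python
import glob
import os
import tarfile

def package_rsl_files(run_dir=".", tar_name="rsl_errors.tar"):
    """Bundle every rsl.error.* file in run_dir into one uncompressed tar.

    Returns the path of the archive and the number of files added.
    """
    paths = sorted(glob.glob(os.path.join(run_dir, "rsl.error.*")))
    tar_path = os.path.join(run_dir, tar_name)
    with tarfile.open(tar_path, "w") as tar:
        for path in paths:
            # Store only the file name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))
    return tar_path, len(paths)
```

Calling `package_rsl_files("/path/to/run")` (a hypothetical run directory) would leave `rsl_errors.tar` next to the rsl files, ready to attach.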

Thanks!
 
Thanks for your reply @kwerner.

I am attaching the namelist.input file and several rsl files with their corresponding submission scripts. The scripts differ mostly in the number of nodes/processors used. The error is the same in every case (sig. 11).

When we compiled WRF on Mira, we made some modifications to configure.wrf and to the makefile under external/io_netcdf, mostly to the library (to handle netCDF4) and to the compiler so that the code would compile. I am also attaching the modifications. If the original files are needed, I can upload them as well.

I should mention that this problem size and this version of WRF work on several other machines I have tested; only Mira fails. However, the same namelist.input and the same problem do work with WRFv3.5 on Mira. I am going to test v3.7.1 and v3.8.1.

Meanwhile, if you find anything that may cause the problem, please let me know! v3.9.1.1 is the version we would like to use.

Thanks again.
 

Attachments

  • wrfhelp.tar
    820 KB · Views: 0
  • modification.txt
    1.7 KB · Views: 0
Hi,
It certainly is odd that this works for 1 version, but not another, with the exact same set-up on the same machine. Unfortunately since this is only happening on 1 machine, it does seem that it's specific to your environment, and something we likely cannot reproduce here. I would suggest trying a few different things (if you haven't already):

1) If you have multiple rsl.error.* files, do a full listing on those so that you can see their sizes. If any of them (besides the *.0000 file) are significantly larger than the others, check whether there is more useful information at the bottom that may indicate the error.

2) Try a very small, simple case (e.g. the default namelist), using V3.9.1.1 and the same environment configuration that is causing the problems. If this runs, then the problem seems to be specific to your case, and likely something in that case is causing problems with the newer code.

3) If you can get 2) to work, try adding a few new options at a time to see if you can make it fail. Continue this process, having your simulation migrate closer to the set-up you're having trouble with. This will help you to narrow down the problem.

4) You can try a debugger. For this you will need to clean, reconfigure, and recompile the code. When you configure, use
Code:
./configure -d
and then go into your configure.wrf file and look for the following line, that should look something like:
Code:
FCDEBUG         =       # -g $(FCNOOPT) # -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
Modify it so that additional checks will be included, so that it now reads:
Code:
FCDEBUG         =        -g $(FCNOOPT) -fcheck=all # -fbacktrace -ggdb -fcheck=bounds,do,mem,pointer -ffpe-trap=invalid,zero,overflow
and then recompile. When you run wrf.exe, you should see the error messages print out in your rsl.out.0000 file.
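The rsl.error.* size check suggested in the list above can also be scripted. The following is a minimal Python sketch, not part of WRF: the function names `find_large_rsl_files` and `tail` and the `factor` threshold are assumptions made for illustration. It skips rsl.error.0000, flags files larger than `factor` times the median size, and returns the last lines of a file for inspection.

```python
import glob
import os
import statistics

def find_large_rsl_files(run_dir=".", factor=2.0):
    """Return (path, size) pairs for rsl.error.* files well above the median size.

    rsl.error.0000 is skipped because it normally carries extra output.
    `factor` is an arbitrary threshold chosen for this sketch.
    """
    sizes = {
        path: os.path.getsize(path)
        for path in glob.glob(os.path.join(run_dir, "rsl.error.*"))
        if not path.endswith(".0000")
    }
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    large = [(p, s) for p, s in sizes.items() if s > factor * median]
    # Largest files first, since they are the most likely to hold extra messages.
    return sorted(large, key=lambda item: item[1], reverse=True)

def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-n:]
```

A typical use would be to loop over `find_large_rsl_files()` and print `tail(path)` for each hit, then read those lines for a traceback or error message.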
 
Hi,

To follow up on your suggestion: we tried the debug mode and found that the surface scheme was the problem. We changed these two schemes from 1 to 2, and it now works fine in a short test run. We will try a longer run and let you know if we see the problem again.
sf_sfclay_physics = 2
bl_pbl_physics = 2
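When narrowing a failure down to individual options like the two above, it can help to list exactly which namelist entries differ between a working and a failing run. Below is a minimal Python sketch under stated assumptions: `read_namelist` and `diff_namelists` are hypothetical helper names, and the parser only handles the one `key = value` per line layout of a typical namelist.input, not full Fortran namelist syntax.

```python
def read_namelist(path):
    """Parse a namelist file into {(group, key): value_string}.

    Simplified on purpose: assumes one `key = value` entry per line
    and treats everything after `!` as a comment.
    """
    settings = {}
    group = None
    with open(path) as handle:
        for raw in handle:
            line = raw.split("!", 1)[0].strip()  # drop comments
            if line.startswith("&"):
                group = line[1:].strip().lower()
            elif line == "/":
                group = None
            elif group is not None and "=" in line:
                key, value = line.split("=", 1)
                # Normalise spacing and drop the trailing comma.
                value = ",".join(v.strip() for v in value.split(",") if v.strip())
                settings[(group, key.strip().lower())] = value
    return settings

def diff_namelists(working, failing):
    """Return {(group, key): (working_value, failing_value)} for differing entries."""
    a, b = read_namelist(working), read_namelist(failing)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }
```

For example, `diff_namelists("namelist.input.ok", "namelist.input.bad")` (hypothetical file names) would report entries such as sf_sfclay_physics and bl_pbl_physics if those were the only settings changed.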

Thanks for your help!
 