
Segmentation fault for unknown reason

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

apattant

New member
I ran a number of cases some months ago and realized that I had not been using the right LSM option. I went to rerun the simulation with the correct option (2 instead of 1), and now the model will not run to completion. It stops at around the same point in every run. I have included all the rsl files and my namelist for review. The error message does not reveal much about why it stopped, and my namelist works with other data and namelist settings.
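
For context, the change amounts to a couple of &physics entries in namelist.input. A minimal sketch (the domain count and values here are illustrative, not the attached namelist):

    &physics
     sf_surface_physics = 2, 2,    ! was 1 (thermal diffusion); 2 = Noah, 4 = Noah-MP
     num_soil_layers    = 4,       ! Noah expects 4 soil layers, while option 1 uses 5
    /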

I had been running with 8 nodes (32 procs each), but for this example I ran with 4 nodes, and the only difference was that it ran two minutes longer than before.

I see no reason why it would fail after 7+ hours of simulation time if there were a namelist issue, but I must be missing something.

Thank you.
 

Attachments

  • output.tar
    7.5 MB
This is a restart run and the case failed immediately after the restart. How long did the model integrate before the restart time?
Have you looked at the wrfrst file to make sure everything is fine?
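
For example, a quick look at the restart file header (the file name here is a placeholder):

    ncdump -h wrfrst_d01_YYYY-MM-DD_HH:00:00 | head -40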

You say your namelist works with other data and settings. Do you mean that the same case worked when WRF was driven by other input data? Please clarify.

I found a segmentation fault in rsl.error.0043.wrf, which sometimes indicates a memory issue. Can you try with more nodes (and thus more memory) to see whether it works?
 
Hi Ming Chen,

Thank you for your insight.

The model ran for 6 hours before the restart, so it failed around 7 hours and 39-41 minutes into the simulation.

I just ran with 8 nodes (256 procs) and it failed in about the same place. I have also tried with 4 nodes, with the same result. The machine has plenty of memory, though, so I am not convinced that is the issue.

I run with the same met data (GFS) but different land surface data, changing only the LSM option. It runs with option 1 (thermal diffusion) and option 4 (Noah-MP) but fails with option 2 (Noah). I have run another time period as well, and it runs all the way through.
 
By saying "I have run another time period as well and it runs all the way through", do you mean that you ran the case with the same namelist options (i.e., with the Noah LSM) and GFS input data, and the case completed successfully?

By "different land surface data ", do you mean you choose different static data or you make some other changes?

Which version of WRF are you running? Can you post your namelist.wps and namelist.input for me to take a look?
 
Hi Ming Chen,

I ran simulations for two different time periods, May 10-12 and May 14-16, using WRF v3.8.1. May 10-12 finished with no problems. May 14-16 ran successfully with a different LSM option (4).

In both cases mentioned above, I ran experiments with the default CORINE landuse dataset and then with a manually edited CORINE landuse dataset. As above, all experiments for May 10-12 finished successfully, and the experiments for May 14-16 finished successfully only with LSM option 4.

When I mentioned to a colleague that memory may be an issue, he suggested I run with fewer processors on the same number of nodes to free up more memory per task. This had no impact; in fact, when I went to 4 nodes with only 8 processors each (instead of 32), the job crashed much sooner.
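
For illustration, with Open MPI the "fewer tasks per node" experiment would look something like this (the task counts and binary path are assumptions, not the exact command used):

    # 4 nodes, 8 MPI tasks per node instead of 32
    mpirun -n 32 --map-by ppr:8:node ./wrf.exe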

I posted my namelist.input in the tar file attached to my first post. When I run diff on the namelist.input files between experiments, the only differences are the LSM options and the start/end days, as expected.
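
For example (the directory names and line numbers here are placeholders):

    $ diff noah_run/namelist.input thermal_run/namelist.input
    14c14
    <  sf_surface_physics = 2, 2,
    ---
    >  sf_surface_physics = 1, 1,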

Andre
 
Andre,
One option is to compile WRF in debug mode (./configure -d) and then rerun the failed case. This will show exactly when and where something goes wrong, which is helpful for debugging the problem.
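
A minimal sketch of that rebuild (assuming the standard em_real target; the log file name is arbitrary):

    ./clean -a                              # remove the previous build
    ./configure -d                          # same compiler choice, debug flags on
    ./compile em_real >& compile_debug.log  # rebuild wrf.exe and real.exe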
Due to limited human power here at NCAR, please understand that we are no longer able to help debug individual users' cases.
 
Unfortunately, the rsl file you uploaded doesn't help. It says SIGTERM, which is "kill -15". That is a consequence of the SIGSEGV (segmentation fault): in an MPI job, when the program crashes on one processor, the wrapper task that runs the MPI job sends SIGTERM to all the processors.

Find one that has the SIGSEGV in it and post that.
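
For example, a quick way to locate those files (assuming the default rsl naming):

    grep -l SIGSEGV rsl.error.* rsl.out.*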
 
Hi Kevin,

There was no SIGSEGV in any rsl file for that simulation, but several other simulations do contain a SIGSEGV. The problem is that those were not run in debug mode. When I run in debug mode, the job does not seg fault until it hits the end of its walltime, which is odd. I am starting to think the issue may be that the job integrates too rapidly on the cluster and runs out of memory when using the Noah LSM; the Noah-MP LSM takes more compute time and does not experience this issue, and in debug mode the integration is much slower as well.

Andre
 