
wrf.exe crashes in the YSU PBL at the very beginning of the run

jianglizhi

New member
Hello, I tried to run wrf.exe, but the program crashed right at the beginning with the error "d01 2023-07-27_12:00:00 in YSU PBL forrtl: severe (174): SIGSEGV, segmentation fault occurred".
I have already increased the number of cores (up to 512!) and decreased the time step, and I set "ulimit -s unlimited" in the submit script (a sketch of the submission follows the environment details below). However, the program still crashes at the same point, at the call to the YSU PBL. Can anyone give me some advice? Thank you in advance!
environment:
WRF v4.6.0, compiled with the dmpar option (15)
max domains: 1
grid spacing: 3 km
source dataset: GFS 0.25-degree hourly, downloaded from AWS https://noaa-gfs-bdp-pds.s3.amazonaws.com/index.html#gfs.20230727/00/atmos/
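For reference, a minimal sketch of the kind of submit script used here (the scheduler shown, Slurm, and the job settings are assumptions, not the actual script; only the ulimit setting and the core counts come from the description above):

  #!/bin/bash
  #SBATCH --job-name=wrf_d01
  #SBATCH --nodes=1                 # multi-node layouts (e.g. 2 nodes x 32 tasks) were also tried
  #SBATCH --ntasks-per-node=48      # up to 512 total tasks were tried

  ulimit -s unlimited               # lift the shell stack-size limit, as mentioned above

  mpirun ./wrf.exe                  # run from the WRF run directory containing the input files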
 

Attachments

  • rsl.error.0031.txt (13.6 KB)
  • namelist.input.txt (13.7 KB)
  • namelist.wps.txt (1 KB)
Update:
I have just tested running wrf.exe with only 1 core, and the program now runs normally. It seems the memory was not enough to run such a big domain. I will test with other numbers of cores.

Update 2:
After several tests in which I added nodes and cores while keeping the same memory per core as in the single-core run, I confirmed that wrf.exe crashes only when it runs on more than one node. This may be a problem with MPI, perhaps with MPI communication. I will switch DM_CC to the Intel MPI wrapper to check whether that is the cause.
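For context, DM_FC and DM_CC are the MPI compiler wrappers set in configure.wrf after running ./configure. A minimal sketch of what pointing them at the Intel MPI wrappers could look like (illustrative only; the exact lines depend on the configure option chosen):

  # configure.wrf (generated by ./configure)
  DM_FC = mpiifort      # Intel MPI Fortran wrapper
  DM_CC = mpiicc        # Intel MPI C wrapper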
 
How many processors do you have on one node?

This is a big case with a grid of 1100 x 938 points, and I expect that it requires a lot of memory.

When you run this case with other PBL options, can it run successfully?

With the YSU PBL, does it crash immediately?
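To put the size in perspective, some rough arithmetic (assuming about 50 vertical levels; the actual e_vert is in the attached namelist.input):

  1100 x 938       ≈ 1.03 million horizontal grid columns
  1100 x 938 x 50  ≈ 52 million 3D grid points per variable

so each single-precision 3D field alone is on the order of a couple of hundred megabytes, and the full model state runs to many gigabytes.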
 
There are 64 cores on each node.
I can run the case with up to 48 cores on one node, and it consumes around 110 GB of memory.
As in the posts above, the problem occurs when the job runs on more than one node, even if I request 2 nodes with 32 cores each.
Other PBL options have not been tested.
Yes, as in my previous posts, with the YSU PBL the run crashes immediately at the beginning.
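For reference, trying a different PBL scheme is a one-line change in namelist.input, plus a matching surface-layer option. A minimal sketch with illustrative values (not taken from the attached namelist):

  &physics
   bl_pbl_physics    = 1,    ! 1 = YSU, the scheme that crashes here; e.g. 2 = MYJ could be tried instead
   sf_sfclay_physics = 1,    ! surface-layer option must match the PBL choice (1 pairs with YSU, 2 with MYJ)
  /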
 
Update 3:
Regarding the crash when running on more than one node, I have identified the problem by moving to a newer toolchain.
Initially, with the older compiler toolchain (2019), the program failed. After switching to a newer toolchain (impi/2021.2.0, intel-compilers/2021.2.0, and iimpi/2021a) to compile WRF-4.6.0, wrf.exe could run in parallel across multiple nodes.
It appears the problem was primarily caused by the older MPI libraries, which did not run correctly across multiple nodes.
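For completeness, rebuilding WRF against the newer toolchain would look roughly like this. The module names are the ones listed above; the configure choice and paths are assumptions, since the exact build steps were not posted:

  module purge
  module load iimpi/2021a        # provides intel-compilers/2021.2.0 and impi/2021.2.0

  cd WRF-4.6.0
  ./clean -a                     # start from a clean build
  ./configure                    # select the Intel (dmpar) option again, e.g. 15
  ./compile em_real >& compile.log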
 
Hi,
Thanks for the update. Yes, more cores give you more total memory, which explains why the case can run. This is definitely a memory issue.
 
Thanks for the update.
 