RE: WRF model on HPC - Segmentation fault (core dumped)

Hi,

I am running WRF v3.9.1.1 on an HPC system for wind resource mapping over a tropical island group in the SW Pacific. The setup uses three two-way nested domains with grid spacings of 20 km, 4 km, and 1 km.

When I run the simulation for the whole group of islands, covering an area of 401 km x 401 km at 1 km grid resolution, it runs for an hour or so and then crashes with a segmentation fault (core dumped). I am getting the following errors:

WRF rsl.error.0000:
"
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2AAAAB10C6F7
#1 0x2AAAAB10CD3E
#2 0x2AAAAB74C26F
#3 0x1B66897 in taugb3.5950 at module_ra_rrtmg_lw.f90:?
#4 0x1B877C9 in __rrtmg_lw_taumol_MOD_taumol
#5 0x1B9F51B in __rrtmg_lw_rad_MOD_rrtmg_lw
#6 0x1BB2E7C in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7 0x16C5E49 in __module_radiation_driver_MOD_radiation_driver
#8 0x17B087C in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9 0x11B1B79 in solve_em_
#10 0x1088EAA in solve_interface_
#11 0x47289A in __module_integrate_MOD_integrate
#12 0x4081C3 in __module_wrf_top_MOD_wrf_run
"
Slurm-9892195.out:
"
#!/bin/bash -e
#SBATCH --job-name=WRF5MPIJob # job name (shows up in the queue)
#SBATCH --account=uoa02450 # Nesi project
#SBATCH --time=72:00:00 # Walltime (HH:MM:SS)
#SBATCH --mem-per-cpu=6300 # memory/cpu in MB (half the actual required memory)
#SBATCH --partition=bigmem
#SBATCH --ntasks=72 # number of tasks (e.g. MPI)
#SBATCH --nodes=2
#SBATCH --hint=nomultithread # please try also without hyperthreading
#SBATCH --profile=all
cat $0

srun /scale_wlg_persistent/filesets/project/uoa02450/Build_WRF5/WRFV3/WRFV3/run/wrf.exe
### EOF
starting wrf task 40 of 72
starting wrf task 43 of 72
starting wrf task 47 of 72
starting wrf task 53 of 72
starting wrf task 54 of 72
starting wrf task 55 of 72
starting wrf task 59 of 72
starting wrf task 36 of 72
starting wrf task 38 of 72
starting wrf task 41 of 72
starting wrf task 42 of 72
starting wrf task 45 of 72
starting wrf task 46 of 72
starting wrf task 51 of 72
starting wrf task 56 of 72
starting wrf task 57 of 72
starting wrf task 60 of 72
starting wrf task 61 of 72
starting wrf task 63 of 72
starting wrf task 64 of 72
starting wrf task 65 of 72
starting wrf task 66 of 72
starting wrf task 67 of 72
starting wrf task 69 of 72
starting wrf task 70 of 72
starting wrf task 37 of 72
starting wrf task 39 of 72
starting wrf task 49 of 72
starting wrf task 50 of 72
starting wrf task 58 of 72
starting wrf task 68 of 72
starting wrf task 71 of 72
starting wrf task 52 of 72
starting wrf task 44 of 72
starting wrf task 48 of 72
starting wrf task 62 of 72
starting wrf task 6 of 72
starting wrf task 32 of 72
starting wrf task 7 of 72
starting wrf task 26 of 72
starting wrf task 16 of 72
starting wrf task 11 of 72
starting wrf task 20 of 72
starting wrf task 19 of 72
starting wrf task 29 of 72
starting wrf task 10 of 72
starting wrf task 21 of 72
starting wrf task 4 of 72
starting wrf task 23 of 72
starting wrf task 2 of 72
starting wrf task 17 of 72
starting wrf task 3 of 72
starting wrf task 18 of 72
starting wrf task 1 of 72
starting wrf task 27 of 72
starting wrf task 5 of 72
starting wrf task 33 of 72
starting wrf task 8 of 72
starting wrf task 24 of 72
starting wrf task 12 of 72
starting wrf task 25 of 72
starting wrf task 30 of 72
starting wrf task 31 of 72
starting wrf task 9 of 72
starting wrf task 22 of 72
starting wrf task 34 of 72
starting wrf task 0 of 72
starting wrf task 13 of 72
starting wrf task 15 of 72
starting wrf task 28 of 72
starting wrf task 14 of 72
starting wrf task 35 of 72
srun: error: wbl008: tasks 36-38,40-41,43-65,67-71: Segmentation fault (core dumped)
srun: error: wbl004: tasks 0-35: Segmentation fault (core dumped)
srun: error: wbl008: task 66: Segmentation fault (core dumped)
srun: error: wbl008: tasks 39,42: Segmentation fault (core dumped)
"
I have tried increasing the memory per CPU from 1500 MB to 2000 MB, 3000 MB, and 6300 MB, but the problem remains.

Could this be related to parent_time_step_ratio in namelist.input? I have time steps of 120 s, 40 s, and 8 s for the three domains (ratios 1:3:5).
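
For reference, here is a rough sketch of how those nest time steps follow from the ratios, together with a check against the commonly quoted WRF rule of thumb that the time step in seconds should not exceed about 6 x dx (dx in km). The function names are made up for this illustration and are not part of WRF. The factor of 6 is only a guideline, so treat the result as a hint rather than a diagnosis.

```python
def nest_time_steps(parent_step_s, ratios):
    """Return the time step (s) of each domain from the parent step and
    the parent_time_step_ratio values (first entry is 1 for domain 1)."""
    steps = []
    step = float(parent_step_s)
    for ratio in ratios:
        step = step / ratio  # each nest divides its parent's time step
        steps.append(step)
    return steps


def check_domains(dx_km, ratios, parent_step_s, factor=6.0):
    """Compare each domain's time step with factor * dx (rule of thumb)."""
    steps = nest_time_steps(parent_step_s, ratios)
    report = []
    for number, (dx, dt) in enumerate(zip(dx_km, steps), start=1):
        limit = factor * dx
        report.append({"domain": number, "dx_km": dx, "dt_s": dt,
                       "limit_s": limit, "ok": dt <= limit})
    return report


if __name__ == "__main__":
    # Values from this post: 20/4/1 km grids, 120 s parent step, ratios 1:3:5
    for row in check_domains([20, 4, 1], [1, 3, 5], 120):
        print(row)
```

With the values above, domain 1 sits right at the guideline, while domains 2 and 3 come out above it, because the time step ratios (3 and 5) do not match the grid ratios (5 and 5).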

Initially, I ran simulations for a single island covering 201 km x 201 km at 1 km grid resolution, and those worked fine.

Appreciate your kind assistance and advice.

Regards
Kunal
 
Hi,
I see that you posted another topic here:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=48&t=8796
which was posted several days after this one. Does that mean you were able to get past this particular seg-fault problem?

If not, can you please attach the namelist.input file you are using, along with all of the rsl.error.* files? You can package all the rsl files together as one *.TAR file and attach it. Thanks!
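
If it helps, the packaging step could be sketched in Python as below, using only the standard tarfile module. The function name, the run directory, and the archive name are placeholders chosen for this example; adjust them to your own run directory.

```python
import glob
import os
import tarfile


def pack_rsl_files(run_dir, archive_path, pattern="rsl.error.*"):
    """Bundle all files matching `pattern` in `run_dir` into an
    uncompressed .tar archive and return the names that were added."""
    paths = sorted(glob.glob(os.path.join(run_dir, pattern)))
    with tarfile.open(archive_path, "w") as tar:  # "w" = plain tar, no compression
        for path in paths:
            # Store only the file name so the archive unpacks flat
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in paths]


if __name__ == "__main__":
    # Hypothetical usage: run from inside the WRF run directory
    added = pack_rsl_files(".", "rsl_files.tar")
    print(f"packed {len(added)} files")
```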
 
Hi,

The post you mention is for a smaller domain 3 (201 x 201 at 1 km and 241 x 241 at 0.8 km) covering just one of the major islands.

The current one, where I am getting the segmentation fault (core dumped), is for a bigger domain 3 (401 x 401 at 1 km) which covers all the bigger islands.

Please note that I do not have the exact error files for the set-up I mentioned earlier, as I have since modified the domains to 25 km, 5 km, and 1 km and run another simulation, which gave the same error message.

Please find attached the namelist.input file and the rsl.error.* files.

Appreciate your assistance and advice.

Thanks and regards
Kunal
 

Attachments

  • namelist.txt (5.1 KB)
  • tar.7z (366.5 KB)
Kunal,
Do you mind sending the files in *.TAR format? Unfortunately, we don't have the software to unpack a .7z file on our systems. Thanks!
 
Kunal,
In the rsl* files I see several CFL errors, meaning the model has become unstable. Take a look at this FAQ that discusses segmentation faults, and toward the end in particular, CFL errors and ways to overcome them:
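
A quick way to see which rsl files contain CFL messages is to scan them for the string "cfl". The sketch below is only an illustration: the function name is made up, and it matches any line containing "cfl" (case-insensitive) rather than relying on the exact wording of the WRF message.

```python
import glob
import os


def count_cfl_lines(run_dir, pattern="rsl.error.*"):
    """Return {file name: number of lines mentioning 'cfl'} for every
    matching file in `run_dir` that has at least one such line."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(run_dir, pattern))):
        # errors="replace" avoids crashing on stray non-UTF-8 bytes in logs
        with open(path, errors="replace") as fh:
            hits = sum(1 for line in fh if "cfl" in line.lower())
        if hits:
            counts[os.path.basename(path)] = hits
    return counts


if __name__ == "__main__":
    # Hypothetical usage: run from inside the WRF run directory
    for name, hits in count_cfl_lines(".").items():
        print(f"{name}: {hits} line(s) mentioning CFL")
```

Files with many hits, and the domain and time reported on those lines, show where and when the instability starts.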
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=133
 