RE: WRF model on HPC - Segmentation fault (core dumped)

kunaldayal
Posts: 68
Joined: Tue Oct 16, 2018 9:59 pm

Post by kunaldayal » Tue Jan 14, 2020 3:33 am

Hi,

I am running WRF v3.9.1.1 on an HPC system for wind resource mapping over a tropical island in the SW Pacific, using three two-way nested domains with grid spacings of 20 km, 4 km, and 1 km.

When I run the simulation for the whole group of islands, over an area of 401 km x 401 km at 1 km grid resolution, it runs for an hour or so and then crashes with a segmentation fault (core dumped). I am getting the following errors:
    WRF rsl.error.0000:
    "
    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

    Backtrace for this error:
    #0 0x2AAAAB10C6F7
    #1 0x2AAAAB10CD3E
    #2 0x2AAAAB74C26F
    #3 0x1B66897 in taugb3.5950 at module_ra_rrtmg_lw.f90:?
    #4 0x1B877C9 in __rrtmg_lw_taumol_MOD_taumol
    #5 0x1B9F51B in __rrtmg_lw_rad_MOD_rrtmg_lw
    #6 0x1BB2E7C in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
    #7 0x16C5E49 in __module_radiation_driver_MOD_radiation_driver
    #8 0x17B087C in __module_first_rk_step_part1_MOD_first_rk_step_part1
    #9 0x11B1B79 in solve_em_
    #10 0x1088EAA in solve_interface_
    #11 0x47289A in __module_integrate_MOD_integrate
    #12 0x4081C3 in __module_wrf_top_MOD_wrf_run
    "
      Slurm-9892195.out:
      "
      #!/bin/bash -e
      #SBATCH --job-name=WRF5MPIJob # job name (shows up in the queue)
      #SBATCH --account=uoa02450 # Nesi project
      #SBATCH --time=72:00:00 # Walltime (HH:MM:SS)
      #SBATCH --mem-per-cpu=6300 # memory/cpu in MB (half the actual required memory)
      #SBATCH --partition=bigmem
      #SBATCH --ntasks=72 # number of tasks (e.g. MPI)
      #SBATCH --nodes=2
      #SBATCH --hint=nomultithread # please try also without hyperthreading
      #SBATCH --profile=all
      cat $0

      srun /scale_wlg_persistent/filesets/project/uoa02450/Build_WRF5/WRFV3/WRFV3/run/wrf.exe
      ### EOF
      starting wrf task 40 of 72
      starting wrf task 43 of 72
      starting wrf task 47 of 72
      starting wrf task 53 of 72
      starting wrf task 54 of 72
      starting wrf task 55 of 72
      starting wrf task 59 of 72
      starting wrf task 36 of 72
      starting wrf task 38 of 72
      starting wrf task 41 of 72
      starting wrf task 42 of 72
      starting wrf task 45 of 72
      starting wrf task 46 of 72
      starting wrf task 51 of 72
      starting wrf task 56 of 72
      starting wrf task 57 of 72
      starting wrf task 60 of 72
      starting wrf task 61 of 72
      starting wrf task 63 of 72
      starting wrf task 64 of 72
      starting wrf task 65 of 72
      starting wrf task 66 of 72
      starting wrf task 67 of 72
      starting wrf task 69 of 72
      starting wrf task 70 of 72
      starting wrf task 37 of 72
      starting wrf task 39 of 72
      starting wrf task 49 of 72
      starting wrf task 50 of 72
      starting wrf task 58 of 72
      starting wrf task 68 of 72
      starting wrf task 71 of 72
      starting wrf task 52 of 72
      starting wrf task 44 of 72
      starting wrf task 48 of 72
      starting wrf task 62 of 72
      starting wrf task 6 of 72
      starting wrf task 32 of 72
      starting wrf task 7 of 72
      starting wrf task 26 of 72
      starting wrf task 16 of 72
      starting wrf task 11 of 72
      starting wrf task 20 of 72
      starting wrf task 19 of 72
      starting wrf task 29 of 72
      starting wrf task 10 of 72
      starting wrf task 21 of 72
      starting wrf task 4 of 72
      starting wrf task 23 of 72
      starting wrf task 2 of 72
      starting wrf task 17 of 72
      starting wrf task 3 of 72
      starting wrf task 18 of 72
      starting wrf task 1 of 72
      starting wrf task 27 of 72
      starting wrf task 5 of 72
      starting wrf task 33 of 72
      starting wrf task 8 of 72
      starting wrf task 24 of 72
      starting wrf task 12 of 72
      starting wrf task 25 of 72
      starting wrf task 30 of 72
      starting wrf task 31 of 72
      starting wrf task 9 of 72
      starting wrf task 22 of 72
      starting wrf task 34 of 72
      starting wrf task 0 of 72
      starting wrf task 13 of 72
      starting wrf task 15 of 72
      starting wrf task 28 of 72
      starting wrf task 14 of 72
      starting wrf task 35 of 72
      srun: error: wbl008: tasks 36-38,40-41,43-65,67-71: Segmentation fault (core dumped)
      srun: error: wbl004: tasks 0-35: Segmentation fault (core dumped)
      srun: error: wbl008: task 66: Segmentation fault (core dumped)
      srun: error: wbl008: tasks 39,42: Segmentation fault (core dumped)
      "
I have tried increasing the memory per CPU from 1500 MB to 2000 MB, 3000 MB, and 6300 MB, but the problem remains.

Could this be related to parent_time_step_ratio in namelist.input? My time steps are 120 s, 40 s, and 8 s (ratios 1:3:5).

Initially, I ran simulations for a single island covering 201 km x 201 km at 1 km grid resolution, and those worked fine.

Appreciate your kind assistance and advice.

Regards
Kunal
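
The time steps quoted above follow from dividing each parent's step by its parent_time_step_ratio. The sketch below reproduces that arithmetic and compares each result with the common WRF rule of thumb of roughly 6*dx seconds (dx in km). The 6*dx factor, the linear d01 to d02 to d03 nest chain, and all function names are assumptions made for illustration; the guideline is a starting point, not a hard limit, and this is not a diagnosis of the crash.

```python
# Values quoted in the post above; the 6*dx guideline is an added assumption.
DX_KM = [20.0, 4.0, 1.0]              # grid spacing per domain
PARENT_TIME_STEP_RATIO = [1, 3, 5]    # ratio of each domain to its parent
TIME_STEP_S = 120.0                   # time step of the outermost domain


def nest_time_steps(time_step, ratios):
    """Divide the parent's time step by each domain's ratio, walking down the nest chain."""
    steps = []
    current = time_step
    for ratio in ratios:
        current = current / ratio
        steps.append(current)
    return steps


def guideline_limits(dx_km, factor=6.0):
    """Rule-of-thumb upper bound on the time step: factor * dx (dx in km, result in s)."""
    return [factor * dx for dx in dx_km]


steps = nest_time_steps(TIME_STEP_S, PARENT_TIME_STEP_RATIO)
limits = guideline_limits(DX_KM)
for domain, (dt, limit) in enumerate(zip(steps, limits), start=1):
    status = "within" if dt <= limit else "above"
    print(f"d{domain:02d}: dt = {dt:g} s, 6*dx = {limit:g} s ({status} guideline)")
```

With these numbers the outer domain sits exactly at 6*dx (120 s), while the 4 km and 1 km nests come out above it (40 s against 24 s, and 8 s against 6 s), because the grid ratio is 5:5 but the time-step ratio is 3:5.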

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: RE: WRF model on HPC - Segmentation fault (core dumped)

Post by kwerner » Tue Jan 21, 2020 7:02 pm

Hi,
I see that you posted another topic several days after this one:
https://forum.mmm.ucar.edu/phpBB3/viewt ... =48&t=8796
Does that mean you were able to get past this particular seg-fault problem?

If not, can you please attach the namelist.input file you are using, along with all of the rsl.error.* files? You can package all the rsl files together as a single *.tar file and attach it. Thanks!
NCAR/MMM
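
One possible way to bundle the rsl.error.* files into a single uncompressed tar archive is sketched below using Python's standard tarfile module. The function name pack_rsl_files, the archive name, and the assumption that the files sit directly in the run directory are all hypothetical; a plain tar command on the cluster would do the same job.

```python
import glob
import os
import tarfile


def pack_rsl_files(run_dir, archive_path, pattern="rsl.error.*"):
    """Add every file in run_dir matching pattern to an uncompressed tar archive.

    Returns the list of file names that were added, in sorted order.
    """
    paths = sorted(glob.glob(os.path.join(run_dir, pattern)))
    with tarfile.open(archive_path, "w") as tar:
        for path in paths:
            # Store only the file name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in paths]


# Example call (hypothetical paths):
# pack_rsl_files(".", "rsl_files.tar")
```

Mode "w" writes a plain .tar with no compression, which matches the request for a format that needs no extra software to unpack.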

Post by kunaldayal » Thu Jan 23, 2020 3:59 am

Hi,

The post you mention is for a smaller domain 3 (201 x 201 at 1 km and 241 x 241 at 0.8 km) covering just one of the major islands.

The current case, where I am getting the segmentation fault (core dumped), is for a bigger domain 3 (401 x 401 at 1 km) that covers all the bigger islands.

Please note that I do not have the exact error files for the set-up I described earlier, because I have since changed the domains to 25 km, 5 km, and 1 km and run another simulation, which gave the same error message.

Please find attached the namelist.input file and the rsl.error.* files.

Appreciate your assistance and advice.

Thanks and regards
Kunal
Attachments
tar.7z (366.55 KiB)
namelist.txt (5.06 KiB)

Post by kwerner » Fri Jan 24, 2020 10:37 pm

Kunal,
Would you mind sending the files as a *.tar archive? Unfortunately, we don't have the software to unpack a .7z file on our systems. Thanks!
NCAR/MMM

Post by kunaldayal » Mon Jan 27, 2020 8:35 pm

Hi,

Please find attached.

Regards
Kunal
Attachments
tar.tar (8.03 MiB)

Post by kwerner » Mon Jan 27, 2020 10:19 pm

Kunal,
In the rsl* files I see several CFL errors, meaning the model has become unstable. Take a look at this FAQ, which discusses segmentation faults and, toward the end in particular, CFL errors and ways to overcome them:
https://forum.mmm.ucar.edu/phpBB3/viewt ... f=73&t=133
NCAR/MMM
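
For readers who want to check their own run for the CFL messages mentioned above, a minimal sketch follows. It simply counts lines containing the string "cfl" (case-insensitive) in each rsl.error.* file. The exact wording of WRF's CFL messages is not shown in this thread, so the substring match is an assumption, and the function name count_cfl_lines is hypothetical.

```python
import glob
import os


def count_cfl_lines(run_dir, pattern="rsl.error.*"):
    """Return {file name: number of lines mentioning 'cfl'} for files with at least one hit."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(run_dir, pattern))):
        # errors="replace" keeps the scan going if a log contains stray bytes.
        with open(path, "r", errors="replace") as fh:
            count = sum(1 for line in fh if "cfl" in line.lower())
        if count:
            hits[os.path.basename(path)] = count
    return hits


# Example call (hypothetical path):
# print(count_cfl_lines("."))
```

An empty result would suggest the instability is not being reported as a CFL violation in those files, in which case the backtrace in rsl.error.0000 is the better lead.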

Post by kunaldayal » Mon Jan 27, 2020 10:54 pm

Hi,

Noted. I will have a look at the suggested solutions.

Thanks and regards
Kunal
