wrf.exe does not run on a cluster


sekluzia

Dear Colleagues,

I successfully compiled the WRF model on the GADI cluster in dmpar mode (details are in the attached configure.wrf file). real.exe runs fine when I submit the job to the cluster with qsub. wrf.exe, however, stops with errors right at the start of the run and creates only the first wrfout file. The namelist file is attached. The rsl* files, the wrfinput_d01 and wrfbdy_d01 files, and the log file from the submitted job (run_mpi.o6159534) are in the attached input.tar.gz file.

Please note that in my job submission script I load the openmpi module (together with the others listed below) before running wrf.exe with mpirun; a sketch of the run step follows the module list.

module purge
module load pbs
module load dot
module load intel-compiler/2019.3.199
module load openmpi/4.0.2
module load hdf5/1.10.5
module load netcdf/4.7.1

export JASPERINC=/usr/include
export JASPERLIB=/usr/lib64
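
The run step itself is roughly as follows (the core count is a placeholder here, not my exact PBS request):

cd $PBS_O_WORKDIR                 # run directory containing wrf.exe and the input files
mpirun -np 192 ./wrf.exe          # placeholder core count; matches the ncpus requested in the #PBS directives

The script is then submitted with qsub (the job name run_mpi in run_mpi.o6159534 comes from that script).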



Kind regards,
Artur
 

Attachments

  • input.tar.gz (186.6 MB)
  • configure.wrf.txt (20.7 KB)
  • namelist.input (6.3 KB)
Please provide more information about this case:
(1) Which version of WRF did you run?
(2) What forcing data did you use for this case?
(3) Please look at your rsl files and find the error messages in them. Note that the errors may not be in rsl.error.0000; they can appear in any of the rsl files (see the example below).

If the model crashes immediately, before integration begins, it often indicates either that the memory is insufficient for running the case or that the input data are wrong.
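
For example, something along these lines will scan all of the rsl files at once (adjust the search patterns as needed):

grep -i -E "error|fatal|sigsegv|cfl" rsl.out.* rsl.error.*    # search every rsl file, not just rsl.error.0000
tail -n 20 rsl.error.*                                        # the last lines usually show where each task stopped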
 
Hi Ming,

Thanks for your reply!
I compiled WRF V4.1.4 with the dm+sm option. I am using ECMWF pressure-level initial and lateral boundary conditions in this case (I also tried GFS data, with the same result).
Using the link below you can download the archive file containing all the rsl* files, my namelist files, configure.wrf, Vtable.ECMWF, tslist and the run_mpi.o6245106 file for this run:

wget --no-check-certificate https://bashupload.com/NWeCr/0tn14.gz

then please untar the file
tar xvfz 0tn14.gz

and you should find the input_wrf directory containing the mentioned files.

As I said previously, only the first wrfout files (at the simulation start time) are created. As you can see in my run_mpi.o6245106 file, there should be no memory issue, since Memory Requested: 80.0GB and Memory Used: 30.73GB. There should also be no problem with the input data. You can download wrfinput_d01, wrfinput_d02 and wrfbdy_d01 using the following links:

wget --no-check-certificate https://bashupload.com/ouLT0/wrfinput_d01
wget --no-check-certificate https://bashupload.com/LNt-_/wrfinput_d02
wget --no-check-certificate https://bashupload.com/34rPJ/wrfbdy_d01


Kind regards,
Artur
 
Dear Colleagues,

I still have problems running the WRF model; wrf.exe crashes with the following segmentation fault:

[gadi-cpu-clx-0416:58191:0:58191] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xffffffe005f9ffe0)
==== backtrace (tid: 58191) ====
0 0x0000000000012dc0 .annobin_sigaction.c() sigaction.c:0
1 0x000000000247fba3 rrtmg_lw_taumoltaumol_mp_taugb3_() module_ra_rrtmg_lw.f90:0
2 0x000000000245fca9 rrtmg_lw_taumol_mp_taumol_() ???:0
3 0x0000000002457e20 rrtmg_lw_rad_mp_rrtmg_lw_() ???:0
4 0x0000000002444534 module_ra_rrtmg_lw_mp_rrtmg_lwrad_() ???:0
5 0x0000000001af99b7 module_radiation_driver_mp_radiation_driver_() ???:0
6 0x0000000001e59f46 module_first_rk_step_part1_mp_first_rk_step_part1_() ???:0
7 0x000000000154c2d6 solve_em_() ???:0
8 0x0000000001327f4c solve_interface_() ???:0
9 0x00000000005722ef module_integrate_mp_integrate_() ???:0
10 0x0000000000415a21 module_wrf_top_mp_wrf_run_() ???:0
11 0x00000000004159d9 MAIN__() ???:0
12 0x0000000000415962 main() ???:0
13 0x0000000000023873 __libc_start_main() ???:0
14 0x000000000041586e _start() ???:0
=================================



Please note that wrf.exe works fine when it is compiled with the debugging option (-d). Compiling without debugging, i.e. with code optimization, on the cluster produces the segmentation fault shown above. However, running the simulations with the debug build is far too slow.
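
As a middle ground I am considering (not yet tested) rebuilding with a lower optimization level instead of the full debug build, roughly as follows; the flag value below is only an example:

./clean -a                        # full clean (this also removes configure.wrf)
./configure                       # select the same Intel dm+sm option as before
# before compiling, edit configure.wrf and lower the Fortran optimization, e.g.:
#   FCOPTIM = -O2 -fp-model precise
./compile em_real >& compile.log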

What can you suggest?

Kind regards,
Artur
 
Hi,

I was able to find what causes wrf.exe to crash: the model stops running because of the Grell-Freitas ensemble cumulus scheme (cu_physics=3). I tested other cu_physics options and the model runs with them. Do you know why the Grell-Freitas ensemble cumulus scheme can cause the model to stop running?
As a reminder, this occurs only when I compile the model with code optimization, which I need in order to run wrf fast enough on the GADI cluster.
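
For reference, the test amounts to changing just the cumulus entry in the &physics block of namelist.input (everything else unchanged); the alternative value shown here is only one of the options I tried:

&physics
 cu_physics = 3, 3,     ! Grell-Freitas ensemble: crashes with the optimized build
!cu_physics = 1, 1,     ! e.g. Kain-Fritsch: runs without problems
/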

Kind regards,
Artur
 