WRF MPI Run problem [Unrecognized physics suite]

This post was from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

lslrsgis

Member
Hi everyone, I have run into an mpirun problem when running wrf.exe.

I compiled wrf.exe and ran it on 20 cores as follows:

./compile -j 20 wrf
mpirun -np 20 ./wrf.exe


The job exited immediately, and files rsl.error.0000-0019 and rsl.out.0000-0019 were generated. In rsl.out.0001-0019 there is an "Unrecognized physics suite" message.
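
For reference, the same fatal message can be pulled out of every rank's log with something like the following (assuming a POSIX shell with GNU grep, in the run directory):

grep -A2 "FATAL CALLED" rsl.out.* rsl.error.*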

Can anyone point out the cause?

Thanks.


P.S. (1)
The code was compiled:
(1) with the dmpar (option 34) and basic nesting (option 1) configuration,
(2) with the GNU compiler,
(3) on an x86_64 Red Hat Linux server with 80 Intel cores,
(4) with MPICH for parallelism.


P.S. (2)
In rsl.out.0000, it reads as follows (normal):
---------------------------------------------------------------------------------------------
Configuring physics suite 'conus'

mp_physics: 8 8
cu_physics: 6 6
ra_lw_physics: 4 4
ra_sw_physics: 4 4
bl_pbl_physics: 2 2
sf_sfclay_physics: 2 2
sf_surface_physics: 2 2
*************************************
WRF V4.0 MODEL
*************************************
Parent domain
ids,ide,jds,jde 1 92 1 63
ims,ime,jms,jme -4 30 -4 20
ips,ipe,jps,jpe 1 23 1 13
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate


P.S. (3)
In rsl.out.0001-0019, they read as follows (abnormal):
---------------------------------------------------------------------------------------------
taskid: 19 hostname: manager
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 5
*************************************
Configuring physics suite '^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 1852
Unrecognized physics suite

-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 19

taskid: 0 hostname: manager
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 5
*************************************

P.S. (4)
namelist.input reads:
------------------------------------------------------------------------------------------------
&time_control
run_days = 30,
run_hours = 0,
run_minutes = 0,
run_seconds = 0,
start_year = 2017, 2017,
start_month = 01, 01,
start_day = 01, 01,
start_hour = 12, 12,
end_year = 2017, 2017,
end_month = 12, 12,
end_day = 31, 31,
end_hour = 12, 12,
interval_seconds = 21600
input_from_file = .true.,.true.,
history_interval = 180, 180,
frames_per_outfile = 1000, 1000,
restart = .false.,
restart_interval = 7200,
io_form_history = 2
io_form_restart = 2
io_form_input = 2
io_form_boundary = 2
/

&domains
time_step = 180,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 2,
max_ts_locs = 4,
e_we = 92, 133,
e_sn = 63, 97,
e_vert = 33, 33,
p_top_requested = 5000,
num_metgrid_levels = 32,
num_metgrid_soil_levels = 4,
dx = 30000, 10000,
dy = 30000, 10000,
grid_id = 1, 2,
parent_id = 0, 1,
i_parent_start = 1, 18,
j_parent_start = 1, 24,
parent_grid_ratio = 1, 3,
parent_time_step_ratio = 1, 3,
feedback = 1,
smooth_option = 0
/

&physics
physics_suite = 'CONUS'
mp_physics = -1, -1,
cu_physics = -1, -1,
ra_lw_physics = -1, -1,
ra_sw_physics = -1, -1,
bl_pbl_physics = -1, -1,
sf_sfclay_physics = -1, -1,
sf_surface_physics = -1, -1,
radt = 30, 30,
bldt = 0, 0,
cudt = 5, 5,
icloud = 1,
num_land_cat = 21,
sf_urban_physics = 0, 0,
/

&fdda
/

&dynamics
hybrid_opt = 2,
w_damping = 0,
diff_opt = 1, 1,
km_opt = 4, 4,
diff_6th_opt = 0, 0,
diff_6th_factor = 0.12, 0.12,
base_temp = 290.
damp_opt = 3,
zdamp = 5000., 5000.,
dampcoef = 0.2, 0.2,
khdif = 0, 0,
kvdif = 0, 0,
non_hydrostatic = .true., .true.,
moist_adv_opt = 1, 1,
scalar_adv_opt = 1, 1,
gwd_opt = 1,
/

&bdy_control
spec_bdy_width = 5,
specified = .true.
/

&grib2
/

&namelist_quilt
nio_tasks_per_group = 0,
nio_groups = 1,
/
------------------------------------------------------------------------------------------------
 
I am not sure whether the problem is caused by compiling with GNU.
Please try rebuilding the code following the standard compilation procedure, and see whether it works.
 
Hi,
This shouldn't be a problem with a GNU compile. We use gfortran all the time, so that should be fine. I doubt this is the problem, but can you try changing physics_suite = 'CONUS' to physics_suite = 'conus' (lower-case) to see if perhaps your system is sensitive to case? The code is looking for the lower-case version, but I really don't think that should cause a problem. If that doesn't help at all, can you attach one of the rsl files with the fatal error message, and your namelist.output file (this should be in the running directory after you run)?

One more question - did you make any modifications to the code, or is this the pristine 'out-of-the-box' WRFV4.0 code?

Thanks,
Kelly
 
Thank you very much for the kind reply.

-I have tried recompiling the code following the standard method, but still with the GNU compiler; PGI/Intel compilers will be tested. smpar works, but dmpar does not.

-The WRF version 4.0 code was downloaded from http://www2.mmm.ucar.edu/wrf/users/downloads.html. No changes have been made to the source code.

-In namelist.input, two changes have been tested. However, the problem still exists, and rsl.error.0001 through rsl.error.0019 all read the same.

Change (1): replace “CONUS” with the equivalent individual physics options, as follows:

&physics
mp_physics = 8, 8,
cu_physics = 6, 6,
ra_lw_physics = 4, 4,
ra_sw_physics = 4, 4,
bl_pbl_physics = 2, 2,
sf_sfclay_physics = 2, 2,
sf_surface_physics = 2, 2,
radt = 30, 30,
bldt = 0, 0,
cudt = 5, 5,
icloud = 1,
num_land_cat = 21,
sf_urban_physics = 0, 0,
/

Change (2): replace “CONUS” with “conus”:

&physics
physics_suite = 'conus'
mp_physics = -1, -1,
cu_physics = -1, -1,
ra_lw_physics = -1, -1,
ra_sw_physics = -1, -1,
bl_pbl_physics = -1, -1,
sf_sfclay_physics = -1, -1,
sf_surface_physics = -1, -1,
radt = 30, 30,
bldt = 0, 0,
cudt = 5, 5,
icloud = 1,
num_land_cat = 21,
sf_urban_physics = 0, 0,
/

The namelist.input, namelist.output, rsl.out*, and rsl.error* files are attached.

Siliang
 

Attachments

  • rsl.out.0001.txt (611 bytes)
  • rsl.out.0000.txt (828 bytes)
  • namelist.output.txt (83.2 KB)
  • namelist.input.txt (3.6 KB)
  • rsl.out.0003.txt (611 bytes)

I am running WRF on a single PC with 88 cores. I have found that:
(1) I can always run WRF successfully with the command mpirun -np 20 ./wrf.exe when wrf.exe was built with the smpar configuration;
(2) I fail every time with the command mpirun -np 20 ./wrf.exe when wrf.exe was built with the dmpar configuration, whether compiled with ./compile wrf or with ./compile -j 20 wrf. This has been tested with two different namelist.input files; both failed (a quick check of which build each wrf.exe came from is sketched below).
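
To confirm which parallel build a given wrf.exe actually is, something like the following can be used (a rough check, assuming dynamic linking and that the build's configure.wrf is still in the source directory):

ldd ./wrf.exe | grep -iE "mpi|gomp"     # a dmpar build links an MPI library, an smpar build links the OpenMP runtime
grep -E "^(DM_FC|OMP)" configure.wrf    # shows the distributed-memory and OpenMP settings chosen at build time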


When (2) fails, the corresponding rsl.error.00** and rsl.out.00** files are identical. They read:

[sliu@manager rsl_dmpar_j0]$ cat rsl.error.0000
taskid: 0 hostname: manager
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 5
WRF V4.0 MODEL
*************************************
Parent domain
ids,ide,jds,jde 1 92 1 63
ims,ime,jms,jme -4 30 -4 20
ips,ipe,jps,jpe 1 23 1 13
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate

[sliu@manager rsl_dmpar_j0]$ cat rsl.error.0001
taskid: 1 hostname: manager
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 5
*************************************
Configuring physics suite ''

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 1852
Unrecognized physics suite
-------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

[sliu@manager rsl_dmpar_j0]$ cat rsl.error.0002
taskid: 2 hostname: manager
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 5
*************************************
Configuring physics suite ''

-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE: <stdin> LINE: 1852
Unrecognized physics suite
-------------------------------------------

The rest of the rsl.error.00** files are the same as rsl.error.0001 and rsl.error.0002.


CPU info
--------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
--------------------------------------------------------------------------------------------



Questions:
(1) Should I use smpar or dmpar?
Since I am working on a single PC, I am not sure whether “dmpar” can be used. From the WRF user's guide, it seems that “dmpar” is meant for distributed-memory parallelism. Perhaps that means a cluster, not a single PC?

(2) If smpar is used, what is the difference between ./wrf.exe and mpirun -np 20 ./wrf.exe? How many cores does the command ./wrf.exe use?
 

Attachments

  • namelist.input.01.txt (4.3 KB)
  • namelist.input.02.txt (3.6 KB)

You can certainly compile with smpar or dmpar on a single computer, as long as you have the correct type of parallel library installed. dmpar is for distributed memory parallel computing and smpar is for shared-memory parallel computing. If you are wanting to use shared memory, then it sounds like it is working fine for you. At this point, I am not sure why you are getting the physics error with a dmpar run, though. That part makes no sense.

If you would like to use shared memory processing, then you should just continue to run with the smpar option, especially since that is working for you. However, you shouldn't be using the command 'mpirun -np X ./wrf.exe.' For a shared memory run, you first can set the number of threads with a command like (csh e.g.):
setenv OMP_NUM_THREADS 16
and then to run wrf, simply use this command:
./wrf.exe >& wrf.log
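
In a bash-type shell, the equivalent of those two steps would be roughly:
export OMP_NUM_THREADS=16
./wrf.exe > wrf.log 2>&1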

If you'd like to use the distributed memory option (dmpar), you need to make sure you compile with an MPI library (such as MPICH2 or OPENMPI). If so, then let's disregard all of the smpar runs and start over, and just look at the dmpar case so that we don't get overwhelmed with all the different files/problems we are interested in. If this is the one you want to run, please try again to remove the physics_suite option from the namelist and manually enter the physics options as you did for one of your tests you mention above. Then attach your configure.wrf file (for the dmpar compile), your compile log (dmpar), your new namelist.input file and send all of your rsl* files (you can package those together into one *.tar file).
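
As a rough sketch of that clean dmpar rebuild and re-run (the configure menu numbers vary by system, so treat these steps as illustrative rather than exact):

./clean -a                                   # wipe the previous build, including configure.wrf
./configure                                  # choose a GNU dmpar option, then basic nesting
./compile wrf >& compile.log                 # keep the compile log to attach later
mpirun -np 20 ./wrf.exe                      # with physics_suite removed and individual physics options set
tar -cf rsl_dmpar.tar rsl.out.* rsl.error.*  # package the rsl files to attach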

Thanks,
Kelly
 
Thank you very much for the detailed instructions.

-I am using mpich-3.3 for parallelism.
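
(As a side check, and assuming the MPICH wrappers are on the PATH, the following confirms that the mpirun used at run time matches the compiler wrappers WRF was built with:
which mpirun mpif90
mpirun --version
mpif90 -show          # prints the underlying compiler command that MPICH wraps
)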

-When using dmpar, I found it fails with two different namelist.input files (attached). Hence, I guess it is not namelist.input that causes the exit.

-The namelist.input, namelist.output, rsl.out.*, and rsl.error.* files are packed into rsl_dmpar_j20.tar.gz and rsl_dmpar_j0.tar.gz (attached).

(1) rsl_dmpar_j20 corresponds to mpirun -np 20 ./wrf.exe, where wrf.exe was compiled using ./compile -j 20 wrf.
(2) rsl_dmpar_j0 corresponds to mpirun -np 20 ./wrf.exe, where wrf.exe was compiled using ./compile wrf.

Thanks.

Siliang
 

Attachments

  • rsl_dmpar_j20.tar.gz (12.6 KB)
  • rsl_dmpar_j0.tar.gz (12.3 KB)
  • namelist.input.01.txt (4.3 KB)
  • namelist.input.02.txt (3.6 KB)

I apologize for the delay. I've been busy the past couple of weeks preparing for and running 2 different tutorials. Thank you so much for your patience. This problem is very odd.

1) I assume you decided you DO want to use dmpar (and not smpar), correct?
2) If 1) is "yes" then can you download a brand new version of WRF and try to build that from scratch? You can get that from this page:
https://github.com/wrf-model/WRF/releases
If you specifically want V4.0, then scroll down for that one, but otherwise, you can grab the latest (V4.0.3).
Place this WRF/ directory somewhere new and then reconfigure and recompile. There is no need to build it with more than about 4 processors. Make sure to choose the 'dmpar' option and then when you are ready to run WRF, first try this with the namelist.input.01 that you attached in the most recent message, so that you are not setting the parameter physics_suite at all. You are just setting each individual physics option (this is the one I'm most interested in seeing the error for).
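
A minimal sketch of that fresh build (the directory name below is illustrative and depends on how the release unpacks; configure menu numbers vary by system):

cd /path/to/new/WRF                # the freshly unpacked V4.0.3 source tree
./configure                        # pick a GNU dmpar option and basic nesting
./compile -j 4 wrf >& compile.log
ls -l main/wrf.exe                 # confirm the executable was built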

If this fails again, don't run any additional tests yet. Just attach the following files:
1) configure.wrf
2) compile log
3) The command you are using to run wrf (if you are using a batch script, then send that).
4) The rsl files, packaged together
5) namelist.input, namelist.output
6) your WRF/share/module_check_a_mundo.F file

Thanks,
Kelly
 
Hi,

Thank you for the detailed explanation. I have downloaded the WRF version 4.0.3 code from https://github.com/wrf-model/WRF/releases, as you indicated.

Now mpirun works on the server after either ./compile wrf or ./compile -j 20 wrf, by typing:
mpirun -np 20 ./wrf.exe

During configure, I chose the dmpar + basic nesting option, and no manual addition of “-lgomp” was needed.

The problem has been solved. Thanks again.

Siliang
 

Attachments

  • rsl.out.tar (250 KB)
  • rsl.error.tar (340 KB)
  • namelist.input (4.3 KB)
  • namelist.output.txt (83.8 KB)