Hello,
I am running WRF v4.5.2 compiled with OpenMPI 4.1.8. When I run real.exe in parallel (e.g., mpirun -np 12 ./real.exe), it completes successfully. However, when I run wrf.exe in parallel (e.g., nohup mpirun -np 12 ./wrf.exe > wrf_12.log 2>&1 &), I see the following strange behavior:
- Rank 0000 starts time integration normally.
- Other ranks (e.g., rsl.error.0002, rsl.error.0005) stop after printing initialization messages such as:
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS ...
WRF NUMBER OF TILES = 1
and they don’t proceed further.
So effectively only rank 0000 is running, while the other MPI tasks are stuck. After some time the run crashes.
Additional notes:
- If I run ./wrf.exe without MPI (serial mode), the model runs smoothly.
- real.exe runs fine with 12 MPI processes.
- I have already added ulimit -s unlimited before running.
- I checked that the MPI library is linked correctly (ldd ./wrf.exe | grep mpi). The exact commands I run are collected right after this list.
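For completeness, this is roughly the sequence I use before and during the launch (the run-directory path is just a placeholder for my actual setup):

cd /path/to/WRF/run                                  # placeholder for my run directory
ulimit -s unlimited                                  # raise the stack limit in this shell
ldd ./wrf.exe | grep mpi                             # confirm wrf.exe is linked against OpenMPI 4.1.8
mpirun -np 12 ./real.exe                             # completes successfully
nohup mpirun -np 12 ./wrf.exe > wrf_12.log 2>&1 &    # only rank 0000 advances, the other ranks hang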
I am wondering whether this is related to the domain decomposition versus the number of processors, or to some I/O blocking problem, since the rsl files from the non-zero ranks always stop right after the Noah LSM initialization and tile-decomposition messages.
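(For what it's worth, nproc_x * nproc_y = 3 * 4 = 12 in the namelist below, which matches the 12 MPI tasks I pass to mpirun, so the explicit decomposition is at least consistent with -np 12.)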
Here is my namelist:
&domains
time_step = 45,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 3,
e_we = 144, 190, 154,
e_sn = 178, 226, 190,
e_vert = 45, 45, 45,
p_top_requested = 5000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
dx = 9000,
dy = 9000,
grid_id = 1, 2, 3,
parent_id = 1, 1, 2,
i_parent_start = 1, 55, 95,
j_parent_start = 1, 44, 71,
parent_grid_ratio = 1, 3, 3,
parent_time_step_ratio = 1, 3, 3,
feedback = 1,
smooth_option = 0,
nproc_x = 3,
nproc_y = 4,
/
Does anyone have suggestions on why only rank 0000 proceeds while the other MPI ranks are stuck after tile initialization?
Thanks!