
WRF v4.5.2: Only rank 0000 runs while other MPI ranks hang (serial ./wrf.exe runs smoothly)

nanyin

Hello,


I am running WRF v4.5.2 compiled with OpenMPI 4.1.8. When I run real.exe in parallel (e.g., mpirun -np 12 ./real.exe) it completes successfully. However, when I try to run wrf.exe in parallel (e.g., nohup mpirun -np 12 ./wrf.exe > wrf_12.log 2>&1 &), I see the following strange behavior:


  • Rank 0000 starts time integration normally.
  • Other ranks (e.g., rsl.error.0002, rsl.error.0005) stop after printing initialization messages such as the following, and do not proceed further:

    Tile Strategy is not specified. Assuming 1D-Y
    WRF TILE 1 IS ...
    WRF NUMBER OF TILES = 1

So effectively only rank 0000 is running, while the other MPI tasks are stuck. After some time the run crashes.
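
To make the comparison between ranks easier to reproduce, here is a minimal Python sketch that groups the rsl.error.* files by their last non-empty line, so it is quick to see which ranks stopped on which message. The helper names (last_nonempty_line, group_by_last_line) are hypothetical, and the sketch only assumes the files sit in the run directory; it says nothing about why a rank stopped.

```python
from collections import defaultdict
from pathlib import Path


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if there is none."""
    text = Path(path).read_text(errors="replace")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def group_by_last_line(run_dir=".", pattern="rsl.error.*"):
    """Map each distinct final message to the sorted list of files ending with it."""
    groups = defaultdict(list)
    for path in sorted(Path(run_dir).glob(pattern)):
        groups[last_nonempty_line(path)].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # Run from the WRF run directory (assumed to hold the rsl.error.* files).
    for message, files in group_by_last_line().items():
        print(f"{len(files):3d} file(s) end with: {message!r}")
        print("    " + ", ".join(files))
```

If every non-zero rank ends on the same message, that suggests they are all waiting at the same point rather than failing individually.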


Additional notes:


  • If I run ./wrf.exe without MPI (serial mode), the model runs smoothly.
  • real.exe runs fine with 12 MPI processes.
  • I have already added ulimit -s unlimited before running.
  • I checked and the MPI library is linked correctly (via ldd ./wrf.exe | grep mpi).

I am wondering if this is related to domain decomposition vs. number of processors, or some I/O blocking problem, since the rsl files from non-zero ranks always stop after the Noah LSM initialization and tile decomposition messages.


Here is my namelist:



&domains
time_step = 45,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 3,
e_we = 144, 190, 154,
e_sn = 178, 226, 190,
e_vert = 45, 45, 45,
p_top_requested = 5000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
dx = 9000,
dy = 9000,
grid_id = 1, 2, 3,
parent_id = 1, 1, 2,
i_parent_start = 1, 55, 95,
j_parent_start = 1, 44, 71,
parent_grid_ratio = 1, 3, 3,
parent_time_step_ratio = 1, 3, 3,
feedback = 1,
smooth_option = 0,
nproc_x = 3,
nproc_y = 4,
/
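
As a rough sanity check on the decomposition question, the following Python sketch estimates how many grid points each MPI patch would get per domain for nproc_x = 3 and nproc_y = 4. It assumes a simple even split, which may not match WRF's internal partitioning exactly, and it uses a minimum of 10 points per direction, which is a commonly quoted rule of thumb treated here as an assumption rather than a limit enforced by WRF. The function names are hypothetical.

```python
def patch_sizes(n_points, n_procs):
    """Split n_points as evenly as possible across n_procs (illustrative only)."""
    base, extra = divmod(n_points, n_procs)
    return [base + 1 if i < extra else base for i in range(n_procs)]


def check_decomposition(e_we, e_sn, nproc_x, nproc_y, min_points=10):
    """Return (domain, smallest x patch, smallest y patch, ok) for each domain.

    min_points is an assumed rule-of-thumb threshold, not a WRF constant.
    """
    report = []
    for dom, (we, sn) in enumerate(zip(e_we, e_sn), start=1):
        nx = min(patch_sizes(we, nproc_x))
        ny = min(patch_sizes(sn, nproc_y))
        report.append((dom, nx, ny, nx >= min_points and ny >= min_points))
    return report


# Values copied from the &domains namelist above.
e_we = [144, 190, 154]
e_sn = [178, 226, 190]
nproc_x, nproc_y, n_tasks = 3, 4, 12

# nproc_x * nproc_y has to match the number of MPI tasks given to mpirun -np.
assert nproc_x * nproc_y == n_tasks

report = check_decomposition(e_we, e_sn, nproc_x, nproc_y)
for dom, nx, ny, ok in report:
    print(f"d{dom:02d}: smallest patch about {nx} x {ny} points, ok={ok}")
```

Under these assumptions the smallest patch is about 48 x 44 points on d01, so by this estimate the patches are not too small for 12 tasks.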


Does anyone have suggestions on why only rank 0000 proceeds while the other MPI ranks are stuck after tile initialization?


Thanks!
 