Hello,
I am running WRF v4.5.2 compiled with OpenMPI 4.1.8. When I run real.exe in parallel (e.g., mpirun -np 12 ./real.exe), it completes successfully. However, when I run wrf.exe in parallel (e.g., nohup mpirun -np 12 ./wrf.exe > wrf_12.log 2>&1 &), I see the following strange behavior:
- Rank 0000 starts time integration normally.
- Other ranks (e.g., rsl.error.0002, rsl.error.0005) stop after printing initialization messages such as:
Tile Strategy is not specified. Assuming 1D-Y
WRF TILE 1 IS ...
WRF NUMBER OF TILES = 1
and they don’t proceed further.
So effectively only rank 0000 is running, while the other MPI tasks are stuck. After some time the run crashes.
Additional notes:
- If I run ./wrf.exe without MPI (serial mode), the model runs smoothly.
- real.exe runs fine with 12 MPI processes.
- I have already added ulimit -s unlimited before running.
- I checked that the MPI library is linked correctly (ldd ./wrf.exe | grep mpi). The exact commands I run are collected right after this list.
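For completeness, this is roughly the sequence I use before and during the launch (the run-directory path is just a placeholder for my actual setup):

cd /path/to/WRF/run                                  # placeholder for my run directory
ulimit -s unlimited                                  # raise the stack limit in this shell
ldd ./wrf.exe | grep mpi                             # confirm wrf.exe is linked against OpenMPI 4.1.8
mpirun -np 12 ./real.exe                             # completes successfully
nohup mpirun -np 12 ./wrf.exe > wrf_12.log 2>&1 &    # only rank 0000 advances, the other ranks hang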
I am wondering whether this is related to the domain decomposition versus the number of processors, or to some I/O blocking problem, since the rsl files from the non-zero ranks always stop right after the Noah LSM initialization and tile-decomposition messages.
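(For what it's worth, nproc_x * nproc_y = 3 * 4 = 12 in the namelist below, which matches the 12 MPI tasks I pass to mpirun, so the explicit decomposition is at least consistent with -np 12.)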
Here is my namelist:
&domains
time_step = 45,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 3,
e_we = 144, 190, 154,
e_sn = 178, 226, 190,
e_vert = 45, 45, 45,
p_top_requested = 5000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
dx = 9000,
dy = 9000,
grid_id = 1, 2, 3,
parent_id = 1, 1, 2,
i_parent_start = 1, 55, 95,
j_parent_start = 1, 44, 71,
parent_grid_ratio = 1, 3, 3,
parent_time_step_ratio = 1, 3, 3,
feedback = 1,
smooth_option = 0,
nproc_x = 3,
nproc_y = 4,
/
Does anyone have suggestions on why only rank 0000 proceeds while the other MPI ranks are stuck after tile initialization?
Thanks!