Hi all,
I am trying to run a WRF simulation with 3 nests on Cheyenne. It runs successfully for 12 hours when I set time_step under &domains in namelist.input to 15s. However, when I make the time step longer (30s, 90s, 120s), it fails after a few minutes with the following message:
MPT ERROR: MPI_COMM_WORLD rank 413 [or another random number] has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 9
Checking the rsl files, I don't see an error on 0000 or 0413 (in this example) or the other corresponding ones. I have played around with trying a different number of cores to no avail. It seems only to work when time_step is 15s.
I am a little confused about how to interpret time_step and how it relates to the nested domains. In other namelists and in examples online, I have seen parent_time_step_ratio for 3 domains set to 1, 3, 3 (the same as my parent_grid_ratio). Would this mean the outer domain runs at 15s, the middle at 5s, and the inner at 5/3 seconds? I would like the inner domain to run at 15s, which I think would correspond to 45s for the middle and 135s for the outer. But when I run WRF with a time_step other than 15s, I get the abort error above. So perhaps I am misinterpreting this, and the longer time step is violating the CFL condition and causing the crash? With the 15s time step, though, it takes about 1 hour of computing time per hour of simulation time, which seems slow. Or maybe this is a question for CISL?
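To make the arithmetic concrete, here is a small Python sketch of how I am reading these settings. It assumes that time_step applies to the outermost domain, that each nest's step is its parent's step divided by its parent_time_step_ratio, and that each domain is nested inside the previous one. The function name nest_time_steps is just something I made up for illustration; it is not part of WRF.

```python
from fractions import Fraction


def nest_time_steps(time_step, parent_time_step_ratio):
    """Per-domain time steps in seconds, outermost domain first.

    Assumptions (my reading, not confirmed): time_step is the step of
    domain 1, and domain i is nested directly inside domain i-1.
    """
    steps = [Fraction(time_step)]
    # The first ratio belongs to the outermost domain and is skipped.
    for ratio in parent_time_step_ratio[1:]:
        steps.append(steps[-1] / ratio)
    return steps


# Current setting: time_step = 15 with ratios 1, 3, 3
for domain, step in enumerate(nest_time_steps(15, [1, 3, 3]), start=1):
    print(f"d{domain:02d}: {float(step):.3f} s")

# What I had in mind: time_step = 135 with the same ratios
for domain, step in enumerate(nest_time_steps(135, [1, 3, 3]), start=1):
    print(f"d{domain:02d}: {float(step):.3f} s")
```

If this reading is right, my 15s run is already stepping the inner domain at under 2 seconds, and reaching 15s on the inner domain would need time_step = 135 on the outer one.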
Attached is my namelist.input file. Thanks for any insight on this (I am fairly new to all this).