
FATAL ERROR when increasing number of processors

This post is from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

delmoral

New member
Good morning,

I am running a version of WRF v4.0 that I compiled myself on Cheyenne, configured like the pre-compiled builds (dmpar with the Intel compiler).
I have three nested domains (15-3-1 km) with the following extents and grid points:

&geogrid
parent_id = 1, 1, 2,
parent_grid_ratio = 1, 5, 3,
i_parent_start = 1, 35,110,
j_parent_start = 1, 40,45,
e_we = 100, 231,322,
e_sn = 90, 201,298,


I am running the batch job with the following configuration:
#PBS -l walltime=08:00:00
#PBS -l select=2:ncpus=36:mpiprocs=36


With this configuration the run takes about 6 hours.
I tried to increase the number of nodes, but with more than two I get this message:

For domain 1 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
e_we = 100, nproc_x = 20, with cell width in x-direction = 5
e_sn = 90, nproc_y = 27, with cell width in y-direction = 3
--- ERROR: Reduce the MPI rank count, or redistribute the tasks.
For domain 2 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
e_we = 231, nproc_x = 20, with cell width in x-direction = 11
e_sn = 201, nproc_y = 27, with cell width in y-direction = 7
--- ERROR: Reduce the MPI rank count, or redistribute the tasks.
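The patch widths reported in that log follow from simple integer division of each domain's dimensions by the rank counts. A minimal sketch (the helper name and the loop are my own, not WRF code) that reproduces the numbers above:

```python
MIN_PATCH = 10  # minimum decomposed patch width that WRF's check reports

def patch_widths(e_we, e_sn, nproc_x, nproc_y):
    """Approximate per-rank patch widths for an nproc_x x nproc_y decomposition."""
    return e_we // nproc_x, e_sn // nproc_y

# the three domains from the &geogrid section, checked against nproc_x=20, nproc_y=27
for dom, (e_we, e_sn) in enumerate([(100, 90), (231, 201), (322, 298)], start=1):
    wx, wy = patch_widths(e_we, e_sn, 20, 27)
    ok = wx >= MIN_PATCH and wy >= MIN_PATCH
    print(f"domain {dom}: x-width {wx}, y-width {wy}, ok={ok}")
```

This yields widths of 5 and 3 cells for domain 1 and 11 and 7 for domain 2, matching the error log; only domain 3 (16 and 11) clears the 10-cell minimum, which is why the log complains about domains 1 and 2.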


Is the fix simply to increase the number of grid points in those domains?
Also, even with a generous walltime, and even though my full simulation (30 hours in total) completes, the exit status is never 0, and the generated case file contains the following message:
PBS: job killed: walltime 43216 exceeded limit 43200

Why am I still getting this message even though my entire simulation completes?
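As a quick sanity check on the numbers (my own arithmetic, not from the log): the limit in that message, 43200 s, is 12 hours, whereas the script above requests 08:00:00, i.e. 28800 s, so the message seems to come from a job with a different walltime request:

```python
def hms_to_seconds(hms):
    # convert a PBS-style HH:MM:SS walltime string to seconds
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s

print(hms_to_seconds("08:00:00"))  # 28800 s requested in the job script shown
print(43200 // 3600)               # the limit in the kill message is 12 hours
print(43216 - 43200)               # the killed job overran that limit by 16 s
```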
Thanks a lot in advance.
Hi,
The problem with trying to use more processors is that your domain 1 is small; you may actually already be using too many for that domain. Take a look at this FAQ regarding choosing an appropriate number of processors based on domain size. I would perhaps advise increasing the size of your outer domain, as that is likely not going to cost much compared to the expense of your inner domains.

As for why you are getting the message that you are running out of wallclock time, I am unsure. That may be a question for CISL support, as it relates to their system. Make sure you are not looking at an older out file.
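The usual rule of thumb for sizing rank counts (a rough heuristic; check the FAQ itself for the authoritative guidance) is to keep each rank's patch between roughly 25x25 and 100x100 cells. A sketch under that assumption, applied to the three domains in this post:

```python
def rank_bounds(e_we, e_sn):
    # Rough bounds on MPI ranks for one domain, assuming a rule of thumb
    # of ~100x100 cells per rank (fewest ranks) to ~25x25 (most ranks).
    fewest = max((e_we // 100) * (e_sn // 100), 1)
    most = max((e_we // 25) * (e_sn // 25), 1)
    return fewest, most

for dom, (e_we, e_sn) in enumerate([(100, 90), (231, 201), (322, 298)], start=1):
    print(f"domain {dom}: roughly {rank_bounds(e_we, e_sn)} ranks")
```

By this estimate, domain 1 (100x90) tops out around 12 ranks, while domains 2 and 3 comfortably accommodate the 72 ranks of a two-node job, which is why domain 1 is the bottleneck when scaling up.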