
FATAL ERROR when increasing number of processors

This post is from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled; if you have follow-up questions related to this post, please start a new thread from the forum home page.

delmoral

New member
Good morning,

I am running a version of WRF v4.0 that I compiled myself on Cheyenne, configured like the pre-compiled builds (dmpar with the Intel compiler).
I have three nested domains (15-3-1 km) with the following extents and grid points:

&geogrid
parent_id = 1, 1, 2,
parent_grid_ratio = 1, 5, 3,
i_parent_start = 1, 35,110,
j_parent_start = 1, 40,45,
e_we = 100, 231,322,
e_sn = 90, 201,298,


I am running the batch job with the following configuration:
#PBS -l walltime=08:00:00
#PBS -l select=2:ncpus=36:mpiprocs=36


With this configuration the run takes about 6 hours.
I tried to increase the number of nodes, but with more than two I get this message:

For domain 1 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
e_we = 100, nproc_x = 20, with cell width in x-direction = 5
e_sn = 90, nproc_y = 27, with cell width in y-direction = 3
--- ERROR: Reduce the MPI rank count, or redistribute the tasks.
For domain 2 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
e_we = 231, nproc_x = 20, with cell width in x-direction = 11
e_sn = 201, nproc_y = 27, with cell width in y-direction = 7
--- ERROR: Reduce the MPI rank count, or redistribute the tasks.
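The patch widths reported in that log follow from simple integer division of each domain's dimensions by the rank counts. A minimal sketch (the helper name and the loop are my own, not WRF code) that reproduces the numbers above:

```python
MIN_PATCH = 10  # minimum decomposed patch width that WRF's check reports

def patch_widths(e_we, e_sn, nproc_x, nproc_y):
    """Approximate per-rank patch widths for an nproc_x x nproc_y decomposition."""
    return e_we // nproc_x, e_sn // nproc_y

# the three domains from the &geogrid section, checked against nproc_x=20, nproc_y=27
for dom, (e_we, e_sn) in enumerate([(100, 90), (231, 201), (322, 298)], start=1):
    wx, wy = patch_widths(e_we, e_sn, 20, 27)
    ok = wx >= MIN_PATCH and wy >= MIN_PATCH
    print(f"domain {dom}: x-width {wx}, y-width {wy}, ok={ok}")
```

This yields widths of 5 and 3 cells for domain 1 and 11 and 7 for domain 2, matching the error log; only domain 3 (16 and 11) clears the 10-cell minimum, which is why the log complains about domains 1 and 2.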


Is the fix simply to increase the number of grid points in those domains?
Also, even with a generous walltime, and even though my full simulation (30 hours in total) completes, the exit status is never 0, and the generated case file contains the following message:
PBS: job killed: walltime 43216 exceeded limit 43200

Why am I still getting this message even though my entire simulation completes?
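As a quick sanity check on the numbers (my own arithmetic, not from the log): the limit in that message, 43200 s, is 12 hours, whereas the script above requests 08:00:00, i.e. 28800 s, so the message seems to come from a job with a different walltime request:

```python
def hms_to_seconds(hms):
    # convert a PBS-style HH:MM:SS walltime string to seconds
    h, m, s = (int(part) for part in hms.split(":"))
    return 3600 * h + 60 * m + s

print(hms_to_seconds("08:00:00"))  # 28800 s requested in the job script shown
print(43200 // 3600)               # the limit in the kill message is 12 hours
print(43216 - 43200)               # the killed job overran that limit by 16 s
```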
Thanks a lot in advance.
Hi,
The problem with trying to use more processors is that your domain 1 is small; you may actually already be using too many for that domain. Take a look at this FAQ regarding choosing an appropriate number of processors based on domain size. I would perhaps advise increasing the size of your outer domain, as that is likely not going to cost much compared to the expense of your inner domains.

As for why you are getting the message that you are running out of wallclock time, I am unsure. That may be a question for CISL support, as it relates to their system. Make sure you are not looking at an older out file.
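The usual rule of thumb for sizing rank counts (a rough heuristic; check the FAQ itself for the authoritative guidance) is to keep each rank's patch between roughly 25x25 and 100x100 cells. A sketch under that assumption, applied to the three domains in this post:

```python
def rank_bounds(e_we, e_sn):
    # Rough bounds on MPI ranks for one domain, assuming a rule of thumb
    # of ~100x100 cells per rank (fewest ranks) to ~25x25 (most ranks).
    fewest = max((e_we // 100) * (e_sn // 100), 1)
    most = max((e_we // 25) * (e_sn // 25), 1)
    return fewest, most

for dom, (e_we, e_sn) in enumerate([(100, 90), (231, 201), (322, 298)], start=1):
    print(f"domain {dom}: roughly {rank_bounds(e_we, e_sn)} ranks")
```

By this estimate, domain 1 (100x90) tops out around 12 ranks, while domains 2 and 3 comfortably accommodate the 72 ranks of a two-node job, which is why domain 1 is the bottleneck when scaling up.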