Issue with LANDUSEF variable during geogrid.exe

After running geogrid.exe, the variable LANDUSEF in geo_em.d04.nc appears to have incorrect values.

Using ncview, the range shows 0 to 1e+20, whereas for geo_em.d03.nc and the other domains the range is correctly 0 to 1. I'm not sure whether anything is actually wrong, but it seems unusual. The geogrid.log and namelist.wps files are attached. I tried to compress the geo_em files, but the size didn't shrink much; if there is another way to send large files (about 250 MB), please let me know.
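
In case it helps to compare the domains outside ncview, here is a minimal sketch of the same check in Python (an assumption on my side: the netCDF4 and numpy packages are available, and LANDUSEF has the usual (Time, land_cat, south_north, west_east) ordering):

# Minimal sketch: compare the LANDUSEF field in two geogrid output files.
# Assumes netCDF4/numpy are installed and the files sit in the current directory.
import numpy as np
from netCDF4 import Dataset

for fname in ["geo_em.d03.nc", "geo_em.d04.nc"]:
    with Dataset(fname) as nc:
        # netCDF4 returns a masked array if a _FillValue/missing_value is set
        landusef = np.ma.masked_invalid(nc.variables["LANDUSEF"][:])
        print(f"{fname}: LANDUSEF min = {landusef.min():.4g}, max = {landusef.max():.4g}")
        # The land-use fractions are expected to sum to roughly 1 at each grid
        # cell; axis=1 assumes the land_cat dimension comes second.
        cat_sum = landusef.sum(axis=1)
        print(f"  per-cell category sums: {cat_sum.min():.4g} to {cat_sum.max():.4g}")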

Any ideas would be appreciated.
 

Attachments

  • geogrid.log (104.5 KB)
  • namelist.wps (1.3 KB)
Hi,
I ran a test with your namelist.wps file and I see the same thing you do; however, I don't easily see any issue with the geo_em.d04 file. Can you try to run ungrib and metgrid and see if things work and look reasonable? If so, continue to run real and wrf, and if everything looks okay, it should be okay to ignore the issue.
 
Thank you for your helpful reply.

The same issue is observed with the met_em.d04* and wrfinput_d04 files, while everything seems normal for the met_em and wrfinput files in other domains.

I should mention that when running "mpirun -np 36 ./wrf.exe" (or with other -np values, which I believe are not relevant to this issue), the model runs normally for max_dom = 3. However, when max_dom = 4, the run stops and displays the following message:

"""
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 20962 RUNNING AT compute13
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

"""
On the other hand, using "mpirun -np 16 ./wrf.exe" seems to make the model run for max_dom = 4, albeit with very slow progress.

The namelist.input and rsl files for max_dom = 4 and "mpirun -np 36 ./wrf.exe" are attached.

Any feedback or comments would be greatly appreciated.
 

Attachments

  • namelist.input (4.5 KB)
  • rsl_code9.tar.gz (28.9 KB)
I would like to add that the run for max_dom = 4, using a Slurm script with 2 nodes and 36 tasks per node, is currently in progress. I am uncertain about the LANDUSEF values and am waiting for the WRF run to complete so I can check the outputs for any anomalies.

On the other hand, considering that the resolutions of the third and fourth domains are 1 km and 333.33 m respectively, I am wondering whether MYJ is an appropriate bl_pbl_physics option at these resolutions, or whether I should switch to the Shin-Hong scheme for all domains.
 
The max_dom = 4 run mentioned above has core dumped!

However, the results appear reasonable, even though the run only progressed for a few minutes before failing.
 

For max_dom = 4, using a Slurm script with 2 nodes and 36 (and even 16) tasks per node, and setting bl_pbl_physics = 11 (Shin-Hong) and sf_sfclay_physics = 1 (Revised MM5), the run again ended with a core dump. The only difference from the run with the problematic CONUS bl_pbl_physics and sf_sfclay_physics settings was that a few time steps completed successfully.

If a core dump occurs solely because of the increased resolution from adding a nested domain, what is the typical solution? Would modifying the physics configuration help? It did not seem effective in my case: as mentioned above, I tried different options such as bl_pbl_physics = 11 (Shin-Hong) and sf_sfclay_physics = 1 (Revised MM5).

Any comments or suggestions would be greatly appreciated.
 
Hi,
I think the problem is that you're using entirely too few processors (max 36) for your domain sizes, which are:

e_we = 239, 493, 967, 1840,
e_sn = 150, 316, 595, 1129,

Domain 4 is much larger than domain 1, which may make these two domain sizes difficult to run together. See Choosing an Appropriate Number of Processors for details.
Thank you for your response.

Based on my calculations, the maximum number of processors is about 3323 and the minimum is about 3. I also tested with -np 72; both 36 and 72 exceed that minimum. I plan to try a larger number of processes if our servers permit it.

If you believe the issue could be linked to the number of processors, I would greatly appreciate any suggestions you might have for a number of processors that works well.

In the meantime, could you please confirm whether the namelist settings appear correct? Specifically, I am curious whether odd values such as 1129, 967, or 493 might have any unintended effects. With max_dom = 3, however, I can confirm that wrf.exe ran successfully.

I look forward to hearing from you.
 
Apologies for the delay. To determine the number of processors you can use, you have to base it off the max # that can be used for the smallest domain and the minimum # that can be used for the largest domain. Your smallest domain is d01 (239x150). Based on the rough rule of thumb calculation shown in Choosing an Appropriate Number of Processors, the max you should use is about 54 processors.

However, your largest domain is d04 (1840x1129), and again, you need to calculate the minimum number of processors that can be used for the largest domain. Per the rough calculation, the minimum number that can be used for this domain is about 207.

This means if you want to run all of these domains together, simultaneously, you cannot use more than 54 processors, but can't use fewer than 207, which, of course, is impossible. Therefore you are going to need to use the ndown program to run d04 separately from the other domains. This will allow you to use more processors when processing d04 alone.
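
For anyone who wants to reproduce these figures, here is a small sketch of that rough rule of thumb as I read it from the Choosing an Appropriate Number of Processors page (each MPI patch should be no smaller than roughly 25x25 and no larger than roughly 100x100 grid points; the exact rounding convention there may differ, so treat the numbers as indicative only):

# Rough rule-of-thumb sketch: decomposition patches between ~25x25 and ~100x100
# grid points per processor. Domain sizes are taken from the namelist above;
# the flooring below is my assumption and may differ slightly from the page.
domains = {
    "d01": (239, 150),
    "d02": (493, 316),
    "d03": (967, 595),
    "d04": (1840, 1129),
}

for name, (e_we, e_sn) in domains.items():
    max_procs = (e_we // 25) * (e_sn // 25)    # smallest acceptable patches (~25x25)
    min_procs = (e_we // 100) * (e_sn // 100)  # largest acceptable patches (~100x100)
    print(f"{name}: roughly {min_procs} to {max_procs} processors")

# A single run of all four domains is capped by the smallest domain's maximum
# (d01) yet needs at least the largest domain's minimum (d04); those ranges do
# not overlap here, which is why running d04 separately via ndown is suggested.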
 
I appreciate your helpful and detailed reply.

I’ll give it a try and get back to you with the results—likely in a few weeks.
 