
Static Initialization Failure - 15km uniform mesh

This post is from a previous version of the WRF & MPAS-A Support Forum. New replies have been disabled. If you have follow-up questions related to this post, please start a new thread from the forum home page.

paige_d

New member
Good morning,

I am a new MPAS user and am struggling to complete the static initialization of one of the uniform-resolution grids (15 km - x1.2621442). I have managed to complete this without any problems on several other meshes (15-km variable-resolution, 25-km variable-resolution, 240-km uniform-resolution, and 240-km variable-resolution).
The job is terminated at the same point on every attempt with exit code 137, and the static error file (below) suggests that this is due to a memory limit.

I would appreciate any advice as to how I may resolve this. Thanks!

Operating system error: Cannot allocate memory
Allocation would exceed memory limit

Error termination. Backtrace:

Could not print backtrace: mmap: Cannot allocate memory
#0 0x2aaaab8b1d8a
#1 0x2aaaab8b2865
#2 0x2aaaab8b2a62
#3 0x55d506
#4 0x5082c5
#5 0x50f886
#6 0x43a88c
#7 0x408a60
#8 0x40802b
#9 0x2aaaac551b34
#10 0x408062
#11 0xffffffffffffffff
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cnode0629 exited on signal 9 (Killed).
--------------------------------------------------------------------------
 
The current implementation of the static initialization is quite memory intensive. This often leads to this type of error, since there is typically a limit on the amount of memory that a single process can allocate (i.e., the stack size).

Depending on your chosen shell, you can remove the restriction on the stack size with one of the following commands on the command line before running the static initialization:

For sh based shells (sh, bash, zsh):
Code:
ulimit -s unlimited

For csh based shells (csh, tcsh):
Code:
limit stacksize unlimited

If you are not sure which shell you are using, you can find out by printing the value of the SHELL environment variable:
Code:
echo $SHELL


This should remove the limit on the stack size and therefore allow the static initialization to complete successfully. However, please let us know if it does not!
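For batch jobs, the same command can go in the job script itself, before the model executable is launched. A minimal sketch for a PBS-style scheduler (the directives, paths, and executable name here are assumptions for illustration, not a prescribed setup):

```shell
#!/bin/bash
# Hypothetical PBS job script for illustration only; adapt the
# directives, working directory, and module loads to your own system.
#PBS -N static_init
#PBS -l select=1:ncpus=1

# Remove the stack-size limit before launching the static interpolation.
ulimit -s unlimited

cd "$PBS_O_WORKDIR"
./init_atmosphere_model
```

The key point is that `ulimit -s unlimited` must run in the same shell session that launches the executable, which is why it belongs inside the job script rather than on the login node.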
 
Thank you very much for your response.

I have tried the static initialization again after including 'ulimit -s unlimited' in my qsub script; however, the issue persists and the job stops at the same point as before.

Are there any files/scripts I could provide that may help us identify the problem?

Thanks for your time!
 
By using a mesh decomposition in which each partition represents a convex region, it is possible to interpolate the static terrestrial fields in parallel following the steps outlined on the mesh download page for the 10-km quasi-uniform and 3-km quasi-uniform meshes. Although we don't yet have a convex partition file for the 15-km quasi-uniform mesh, I can produce one and make it available very soon.
 
I've attached a gzipped convex partition file with 16 partitions to this post. You can interpolate the static terrestrial fields in parallel for the 15-km quasi-uniform mesh using this file, and by spreading the 16 MPI tasks (corresponding to the 16 partitions) across multiple nodes, you can make use of more aggregate memory, hopefully working around the memory allocation error you had previously encountered.
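Concretely, after unpacking the attachment, the parallel static run might look like the following sketch (the launcher options and executable path are assumptions for illustration and will depend on your MPI installation and scheduler):

```shell
# Unpack the attached partition file.
gunzip x1.2621442.cvt.part.16.gz

# Launch the init_atmosphere core with exactly 16 MPI tasks (one per
# partition). Spreading the tasks over several nodes -- e.g., 4 tasks
# per node on 4 nodes -- gives each task access to more aggregate memory.
mpirun -np 16 ./init_atmosphere_model
```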

As mentioned on the mesh download page for the quasi-uniform 10-km and 3-km meshes, you'll need to comment out lines 217 through 222 of src/core_init_atmosphere/mpas_init_atm_cases.F. It might be good to also set the io_type for the "output" stream to io_type="pnetcdf,cdf5". The only other change that's needed is in the namelist.init_atmosphere file, where you can set
Code:
&decomposition
    config_block_decomp_file_prefix = 'x1.2621442.cvt.part.'
/
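One detail worth noting (this is my understanding of how the decomposition prefix is used): the number of MPI tasks is appended to config_block_decomp_file_prefix to form the partition filename that is opened at run time, so the prefix and the task count must together match the unpacked file's name exactly:

```shell
# The configured prefix plus the MPI task count gives the partition
# filename that the init_atmosphere core will try to open.
prefix='x1.2621442.cvt.part.'
ntasks=16
partfile="${prefix}${ntasks}"
echo "${partfile}"   # -> x1.2621442.cvt.part.16
```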

I've also updated the 15-km mesh download so that the convex partition file with 16 partitions will be available for anyone who downloads that mesh in future. If you encounter any other issues in interpolating the static fields for this mesh, please don't hesitate to follow-up in this thread.
 

Attachments

  • x1.2621442.cvt.part.16.gz
    163.4 KB · Views: 74
Thank you so much for the detailed response.

I do not currently have access to the mpas_init_atm_cases.F file, so it may take some time for me to test this. I will post an update to let you know if I was successful as soon as possible.

Thanks!
 
UPDATE:

Following the steps above, once I was able to comment out lines 217 - 222 of src/core_init_atmosphere/mpas_init_atm_cases.F, my static initialization completed successfully!

I am extremely grateful for the help, thank you so much.
 
Thanks very much for the update, and it's good to know that the use of a CVT partition file to interpolate the static fields in parallel worked!
 
mgduda said:
I've attached a gzipped convex partition file with 16 partitions to this post. You can interpolate the static terrestrial fields in parallel for the 15-km quasi-uniform mesh using this file, and by spreading the 16 MPI tasks (corresponding to the 16 partitions) across multiple nodes, you can make use of more aggregate memory, hopefully working around the memory allocation error you had previously encountered.

As mentioned on the mesh download page for the quasi-uniform 10-km and 3-km meshes, you'll need to comment-out lines 217 through 222 of src/core_init_atmosphere/mpas_init_atm_cases.F . It might be good to also set the io_type for the "output" stream to io_type="pnetcdf,cdf5". The only other change that's needed is in the namelist.init_atmosphere file, where you can set
Code:
&decomposition
    config_block_decomp_file_prefix = 'x1.2621442.cvt.part.'
/

I've also updated the 15-km mesh download so that the convex partition file with 16 partitions will be available for anyone who downloads that mesh in future. If you encounter any other issues in interpolating the static fields for this mesh, please don't hesitate to follow-up in this thread.


mgduda, how can I get a convex partition file with 16 or more partitions for the 10-km uniform mesh?
Thank you
 
Sorry for the delayed reply. I think the 10-km mesh download should include a convex partition file with 64 partitions (x1.5898242.cvt.part.64). If you encounter any issues with this partition file, just let me know and I'll be glad to help!
 
I am very sorry for the late reply. I was still waiting for your response and kept refreshing the same page, but it never showed your reply; I had to close and reopen the forum before I could see your response.
Thank you, I will check the mesh tar file I downloaded.

Thank you once again.
 