
MPI memory address error before writing restart file

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

daniel_lloveras

New member
Hi there,

I'm running a high-resolution idealized baroclinic wave simulation in WRF 3.6.1 (2400 x 1800 horizontal grid points, 4 km resolution) on an HPC machine. When I run wrf.exe with the attached namelist.input file, the model runs fine until the restart file is due to be written (2.5 days into the simulation). At that point, before it even attempts to write the restart file, the model aborts with the following error, which appears at about line 18035 of the attached rsl.error file:

0: MPICH2 Error: Failed to register memory address 0x2aabf5ba4000 with length of 0x92000 (598016) bytes.
0: Unable to register memory at the requested address. This may indicate an application bug. See process virtual memory mappings below:

I was wondering if anyone has seen this weird MPI memory address error before, and if so, how I may go about attacking it?

One option, if the problem is that I do not have enough memory to write a single wrfrst file, would be to use "io_form_restart = 102" to write a separate restart file for every processor. However, on another run I wrote output frequently (in contrast to the attached namelist.input file, in which I specified that no wrfout files would be written), and the output went into a single wrfout file that is larger than I expect the restart file to be. So it appears that I have enough memory to write large files, and that this problem is specific to restart files.

I'm going to eventually try the "io_form_restart = 102" namelist option and the joiner script given here: http://www2.mmm.ucar.edu/wrf/users/special_code.html, but I figured it'd be best to ask here for other potential solutions before trying new things like that.

Thanks in advance for the help!

Best,

Daniel
 

Attachments: namelist.input (5.8 KB), rsl.error.txt (1.5 MB)
Hi Daniel,
This could also have to do with the actual file size. Do you know whether your code was built with large file support? You would have set something like this prior to configuring:
Code:
setenv WRFIO_NCD_LARGE_FILE_SUPPORT 1
If not, then the model cannot write a file larger than 2 GB. Restart files are significantly larger than standard wrfout* files. Even when building with large file support, you would still be limited to 4 GB files. If you think the files may be larger than 2 GB but smaller than 4 GB, then the large file support option would be the way to go. If, however, you've already tried that, or the files are larger than 4 GB, then yes, using the 102 option with the joiner script would be the next best thing.
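If you're not sure how an existing wrfout file was written, one way to check is to look at its first few bytes. Below is a minimal Python sketch of that idea. The function name `netcdf_kind` is just illustrative, and the magic numbers are the ones given in the netCDF file format specification. As far as I know, a build with large file support writes the 64-bit offset variant, but treat that as an assumption and verify it on your own output.

```python
# Magic numbers from the netCDF file format specification.
_MAGIC = {
    b"CDF\x01": "classic (32-bit offsets)",
    b"CDF\x02": "64-bit offset (large file support)",
    b"CDF\x05": "64-bit data (CDF-5)",
}

# netCDF-4 files are HDF5 files. This sketch assumes the HDF5 signature
# sits at byte 0 (i.e. the file has no HDF5 user block).
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_kind(path):
    """Return a short description of the netCDF variant stored at `path`."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(_HDF5_SIGNATURE):
        return "netCDF-4 (HDF5)"
    return _MAGIC.get(head[:4], "not a netCDF file")
```

Calling `netcdf_kind()` on one of your wrfout files should tell you which variant it is. If the netCDF utilities are installed, `ncdump -k <file>` reports the same information.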
 
Thanks for the reply!

I did compile with large file support enabled, but here's where I'm a bit confused. Firstly, is the 4 GB limit imposed by WRF, or by the netCDF format itself? Secondly, does the 4 GB limit apply to whole files, or to individual variables? I've found various documentation pages and forum posts online that suggest there is a 4 GB limit on variable size, and that this limit does not exist for netCDF-4 files (which I haven't been using).

Also, I saw in the presentation on the joiner script that apparently it is not applicable for restart files. I know that the joiner script isn't supported, but do you happen to know if this is still the case?

Thanks again,

Daniel
 
Hi Daniel,
I spoke to our software engineer about this to make sure I'm giving you correct answers. The 4 GB limit is imposed by NetCDF, not by the WRF model itself. The limit is per record (variable), not per file. In theory there should be no such limit with NetCDF-4; however, even with NetCDF-4, particular NetCDF settings must be configured before compiling WRF for this to work properly.

His recommendation is to simply do as you are doing: use io_form_restart = 102 for restart output. Once you have that output and are ready to run a restart, there is no need to patch it back together with the joiner script (which, you are correct, does not stitch wrfrst files together). The WRF model will read the split NetCDF files as they are, without them needing to be stitched. If you are not having trouble writing wrfout* files in standard NetCDF format (=2), then you can set io_form_history = 2 and you won't need the joiner script at all.
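To put the per-variable limit in perspective for the 2400 x 1800 grid mentioned in this thread, here is a rough back-of-the-envelope estimate in Python. It assumes 4-byte (single-precision) values, ignores the extra point on staggered dimensions, and treats the limit as exactly 4 GiB. The helper names are illustrative only. It estimates variable sizes and nothing more; it does not diagnose the MPI memory registration error itself.

```python
BYTES_PER_VALUE = 4            # assumption: single-precision output
LIMIT_BYTES = 4 * 1024**3      # the ~4 GB per-variable limit discussed above


def variable_bytes(nx, ny, nz=1, bytes_per_value=BYTES_PER_VALUE):
    """Size in bytes of one variable on an nx x ny x nz grid."""
    return nx * ny * nz * bytes_per_value


def max_levels_under_limit(nx, ny, limit=LIMIT_BYTES,
                           bytes_per_value=BYTES_PER_VALUE):
    """Largest number of vertical levels that keeps one 3D variable under `limit`."""
    return limit // variable_bytes(nx, ny, 1, bytes_per_value)


nx, ny = 2400, 1800
print(variable_bytes(nx, ny))          # 17280000 bytes per horizontal level
print(max_levels_under_limit(nx, ny))  # 248
```

So on this grid a single 3D single-precision variable would only reach the per-variable limit with roughly 250 or more vertical levels. The file as a whole, which holds many such variables, gets large much sooner.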
 