geogrid and metgrid displaying error messages when run in parallel

maxhbalsmeier

I have an application that is very performance-critical, so I would like to run WPS, and metgrid in particular, with MPI.

However, I get messages of the form ERROR: Error in ext_pkg_open_for_write_begin.

I am using WPS 4.6.0 and I run it with

mpirun -np 4 ./geogrid.exe.

I compiled with the MPI option.

Attached are the namelist and the console output for geogrid.

Is this potentially a bug?
Thank you.
 

Attachments

  • namelist.wps (273 bytes)
  • geogrid_console_output.txt (2.9 KB)
Apologies for the long delay in response while our team tended to time-sensitive obligations. Thank you for your patience.

If you're still experiencing this issue, for clarification, are you getting this error when running metgrid.exe or geogrid.exe? This error message typically indicates an issue with permissions to write to the directory, but if you were able to write the geo_em* files, then you should be able to do the same with the met_em* files, unless there have been changes to the system. If it's with geogrid.exe, then perhaps check the write permissions for the directory to which you're trying to write.

When you say you're using a performance-critical application, do you mean an app you're using to run the WPS processes?
 
Thank you for your help!

I am getting this error with both geogrid and metgrid, but I only provided the geogrid namelist and console output because it is quicker and easier to reproduce.
The geo_em* files and the met_em* files do exist, but I am not sure if I can rely on them. I could in theory just ignore this error but I do not want to unknowingly work with corrupted files.
Maybe I should mention I built with CMake. I did this on two different Ubuntu systems and the error persists.

I observed the following: when I compile in serial mode and run geogrid in parallel, I get the same error, which is consistent with a write error caused by two processes trying to write the same file at the same time. So maybe WPS was in fact built serially and not in parallel as I thought. Can I somehow verify geogrid was actually built with the MPI flags?

Regarding the performance-criticality: I am doing semi-operational WRF NWP runs, and users rely on the output being available on time. I would therefore like to squeeze the last bit of performance out of WPS. Generating all the initial and boundary conditions for two or three domains takes a few minutes, and I would like to make this faster.
 
"Can I somehow verify geogrid was actually built with the MPI flags?"
You can look in the configure.wps file to determine how it was built. Search for either the word "serial" (built serially) or "dmpar" (built for parallel processing).
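A minimal sketch of that search in Python, for anyone who prefers a script to a text editor. It only lists the lines that mention either keyword and leaves the interpretation to the reader; the function name and the sample text are made up for illustration, and a real configure.wps will look different.

```python
import re

def find_build_mode_lines(config_text):
    """Return (line_number, line) pairs that mention 'serial' or 'dmpar'."""
    pattern = re.compile(r"serial|dmpar", re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if pattern.search(line)
    ]

# Hypothetical excerpt, not copied from a real configure.wps.
sample = "# made-up header line (dmpar)\nFC = mpif90\n"
for number, line in find_build_mode_lines(sample):
    print(number, line)
```

On a real build, the text would come from the file itself, for example find_build_mode_lines(open("configure.wps").read()).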

We typically only recommend parallel processing for WPS programs when the domain sizes are extremely large (thousands x thousands of grid spaces). The WPS programs run relatively quickly, so you likely won't see much speed-up from running them in parallel. Parallel WPS builds can also cause annoying issues of their own, which often makes them not worth the hassle.

Since you do have the output from geogrid and metgrid, take a look at the files using something like 'ncview.' As long as you aren't missing any data and it looks how you would expect it to, I think you are safe to move forward to run real and wrf.
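Before opening a file in ncview, a quick first check is whether it starts with one of the known netCDF signatures at all. The sketch below does only that, in Python with the standard library. The function name and the throwaway demo files are illustrative, and passing this check does not prove the data are intact; it only catches files that are not recognizable as netCDF in the first place.

```python
import os
import tempfile

# Leading bytes of the netCDF on-disk formats.
NETCDF_SIGNATURES = {
    b"CDF\x01": "netCDF classic",
    b"CDF\x02": "netCDF 64-bit offset",
    b"CDF\x05": "netCDF 64-bit data (CDF-5)",
    b"\x89HDF": "netCDF-4 (HDF5 container)",
}

def sniff_netcdf(path):
    """Return a format label if the file starts with a known signature, else None."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    return NETCDF_SIGNATURES.get(head)

# Demonstration on two throwaway files (not real WPS output).
with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "fake_classic.nc")
    bad = os.path.join(workdir, "garbage.nc")
    with open(good, "wb") as handle:
        handle.write(b"CDF\x01" + bytes(28))
    with open(bad, "wb") as handle:
        handle.write(b"not a netCDF file")
    print(sniff_netcdf(good))
    print(sniff_netcdf(bad))
```

A file that passes still needs a look at its contents, with ncview as suggested above or with ncdump -h.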
 
When I run geogrid with "mpirun ./geogrid.exe" the resulting nc file is indeed corrupted and ncview cannot even open it. In serial mode, it works.

I am building with CMake, so I don't think I have a configure.wps. However, my _build/wps_config.cmake suggests that it was indeed compiled with MPI (see attached).

However, I noticed the following: when I run the configure_new script and answer the "Use MPI?" question with y, the USE_MPI flag stays at "OFF" in the wps_config.cmake file. I have to enter "yy" to set it to "ON", which I find strange.
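For repeated builds, the value of USE_MPI can also be read with a short script. The sketch below is in Python and makes an assumption about the file layout: it expects the name USE_MPI followed, after some punctuation or whitespace, by ON or OFF. The exact syntax of wps_config.cmake is not shown in this thread, so the sample strings are hypothetical and the pattern may need adjusting.

```python
import re

def read_cmake_switch(config_text, name="USE_MPI"):
    """Return True for ON, False for OFF, None if the switch is not found."""
    # Assumed layout: NAME, then non-word characters, then ON or OFF.
    pattern = rf"{re.escape(name)}\b\W+(ON|OFF)\b"
    match = re.search(pattern, config_text, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).upper() == "ON"

# Hypothetical file contents, for illustration only.
print(read_cmake_switch("set( USE_MPI OFF )"))
print(read_cmake_switch("set( USE_MPI ON )"))
```

On a real build tree the text would come from open("_build/wps_config.cmake").read().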

Sorry for not giving up, but I really think there must be a bug somewhere and I'd like to help find it.
So could this potentially be an issue with the CMake MPI build of WPS?
 
I'm not very familiar with the CMake build option. I'm going to reach out to one of our software engineers to inquire about this.
 
I believe I've narrowed it down to two independent issues.

First, the prompt does not behave as expected because not() is applied to the user input. The appropriate fix is to remove that not(), so that MPI is off by default and is turned on only when the user enters a valid yes value.
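The reported symptom ("y" leaves MPI off, "yy" turns it on) is what an inverted test would produce. The toy model below, in Python, shows this. It is not the configure_new code: the function names and the set of accepted answers are hypothetical, and the real script may differ in detail.

```python
YES_VALUES = {"y", "yes"}  # hypothetical set of accepted "yes" answers

def is_yes(answer):
    """True if the answer counts as a valid yes."""
    return answer.strip().lower() in YES_VALUES

def use_mpi_buggy(answer):
    # Inverted test: a valid yes leaves MPI off, anything else turns it on.
    return not is_yes(answer)

def use_mpi_fixed(answer):
    # MPI stays off unless the user gives a valid yes.
    return is_yes(answer)

for answer in ("y", "yy", ""):
    print(repr(answer), use_mpi_buggy(answer), use_mpi_fixed(answer))
```

In this model "yy" is not a valid yes, so the inverted version switches MPI on for it, which matches the behavior described earlier in the thread.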

The second issue is that, even when the MPI configuration is selected (either with the above fix or by manually editing _build/wps_config.cmake), the MPI libraries are linked correctly but the compilation lacks a single preprocessor definition, _MPI, which controls conditional compilation of certain sections of the code. This definition should be added.

These two changes should enable WPS to work as expected.
 
Thank you, it's great that this is fixed now!
I will check after the next WPS release and report back if I encounter problems.
 