
cxil_map: write error with real.exe

avipwrfhelp

Hello,
I have been getting this error repeatedly with WRF (v4.1.3, 4.2.2) built with Intel MPICH on Sapphire Rapids. I think I saw this issue on the forum, but there it involved nested domains on Derecho, whereas this is a single large (CONUS) domain. I also checked the lower bound on the number of processor cores for this run. Attaching my namelist.input and rsl.error.0000 files.

Thanks for any suggestions.
 

Attachments

  • namelist.input
    6.2 KB · Views: 4
  • rsl.error.0000
    62.1 KB · Views: 4
Hi,
Are you using the standard NCAR-managed WRF? I'm asking because the error message that prints out is

Code:
cxil_map: write error

and I can't find anything related to "cxil" in the WRF code. If so,

1) Have you made any modifications to the code?
2) Can you try this with V4.6.0 and see if the problem still exists?
3) Make sure you have enough disk space to write the large files from real.exe
4) Please package all rsl* files into a single *.tar file and attach that (if it's too large, see the forum home page for instructions on sharing large files).
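
For point 3, a quick free-space check on the run directory can be sketched as follows. This is only an illustration: the path `"."` and the 50 GiB threshold are placeholder assumptions, not values taken from this case.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Hypothetical requirement: 50 GiB for the real.exe output files.
required = 50 * 1024**3
print("enough space:", has_free_space(".", required))
```

Note that this reports free space on the filesystem only; a per-user or per-project quota on an HPC system may be lower and needs to be checked with the site's own tools.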
 
Hello,

The source was obtained from https://github.com/wrf-model/WRF/ and no modifications were made to the code.

Please see the ticket

Running WRF Domain Configuration on Derecho That Was Stable on Cheyenne

dated 5/1/24, where this "cxil_map: write error" was also mentioned. The suggested solution there apparently involved the domain decomposition of a nested domain, but my case is one large domain.

The tarred up rsl.* files are attached.
 

Attachments

  • rsl-output-cxil.tar
    2.3 MB · Views: 2
Hi,
Apologies for the delay. I don't have access to that ticket. Our group is not part of the CISL support group. We only work with the WRF model. This does seem to be something specific to the system or environment setup, though. If you have all your files in a directory on Derecho, would you mind sharing that with me? I can test it out on my end, and I may end up having to point you to CISL support, but I can at least try it first.
 
No, that was not my ticket and I don't have access to Derecho. I was only pointing out that both tickets show the same error message, although the solution provided for the Derecho case was related to the domain decomposition, while mine is a single domain on the Kestrel (NREL) system. Maybe I can request that a CISL person assist us on this issue?
 
Maybe I can request that a CISL person assist us on this issue?
No, if this isn't happening on Derecho, we can't ask them for help. I just looked more closely at your namelist.input file and I see you have history_interval set to 30000000. I assume that wasn't intentional. Can you modify that back to a more reasonable value and see if that happens to be the problem you're seeing?
 
Unfortunately, that was not what was causing the problem. Changing it to the default value of 0 in namelist.input for the real.exe runs didn't fix it. I am running out of ideas about what could be causing this.
 
Thanks for trying that. There is a lot going on with your setup, so let's try to narrow down the issue.

1) First, I am curious what the resolution of the input data is. I notice you have dx/dy = 2000, which is very high resolution. The ratio of the input data resolution to your domain resolution should not be more than 5:1. If it is, you will need to add a parent domain around the 2 km domain to buffer the resolution difference. See Best Practices for WPS for suggestions on setting up a reasonable domain.

2) After that, if you're still getting the error, I would suggest trying your domain and dates with a very basic namelist.input. Grab the default namelist and just modify the components necessary to run your specific domain (only in the &time_control and &domain section). Don't make any other modifications or add any additional options. This will help us to know whether it's specifically your domain and/or data causing the problem, or whether the issue is related to one or more of the options you've chosen. If it fails again, please send your new rsl* files and your new namelist.
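
The 5:1 guideline in point 1 can be checked with a few lines of arithmetic. The sketch below assumes grid spacings in metres; the input-data spacings in the loop are hypothetical examples, since the actual input resolution has not been stated in this thread.

```python
def resolution_ratio(input_dx_m, domain_dx_m):
    """Ratio of input-data grid spacing to model-domain grid spacing."""
    return input_dx_m / domain_dx_m


def needs_parent_domain(input_dx_m, domain_dx_m, max_ratio=5.0):
    """True if the ratio exceeds the 5:1 guideline quoted above."""
    return resolution_ratio(input_dx_m, domain_dx_m) > max_ratio


domain_dx_m = 2000.0  # dx/dy from the namelist discussed above

# Hypothetical input-data spacings, for illustration only.
for input_dx_m in (10000.0, 12000.0, 30000.0):
    ratio = resolution_ratio(input_dx_m, domain_dx_m)
    flag = needs_parent_domain(input_dx_m, domain_dx_m)
    print(f"input {input_dx_m / 1000:.0f} km -> ratio {ratio:.1f}:1, parent domain needed: {flag}")
```

Under this guideline, a 2 km domain would call for input data at roughly 10 km spacing or finer, or an intermediate parent domain otherwise.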
 
So let me set the stage for the errors I am getting.

1) We have used this domain before and it ran fine, but on a different HPC system. This is a new HPC system with Sapphire Rapids processors and a Cray software stack and network. Also, memory does not appear to be an issue:

[apurkaya@kl3 test2] CPU $ seff 5911991
Job ID: 5911991
Cluster: kestrel
User/Group: apurkaya/apurkaya
State: COMPLETED (exit code 0)
Nodes: 11
Cores per node: 104
CPU Utilized: 13:47:09
CPU Efficiency: 61.10% of 22:33:44 core-walltime
Job Wall-clock time: 00:01:11
Memory Utilized: 1.78 TB (estimated maximum)
Memory Efficiency: 68.84% of 2.58 TB (2.31 GB/core)

2) I stripped out the physics and dynamics options, but then realized I needed a few of them, so I added some back. It ran, but with the same errors. The namelist.input and rsl.error.0000 files are attached.
 

Attachments

  • rsl.error.0000
    72.9 KB · Views: 1
  • namelist.input
    4.6 KB · Views: 2
Unfortunately each machine environment is totally different and compilers, executables, libraries, etc. perform differently on them. Have you tried using more processors? I'm not sure why I haven't mentioned this before, but for the size of your domain, you could be using a LOT more - probably up to 7000 or so. With the high resolution you're using and the large domain size, this could need more. I am still interested to know the resolution of your input data.

When I suggested using a default namelist with only your domain size/dates in it, perhaps I wasn't clear with my statement. I wanted you to try a run with no extra options (e.g., auxhist*, eta_levels, etc.). I took a default namelist and made the modifications based on your domain/dates. I'm attaching it here so you can try to run with these settings.
 

Attachments

  • namelist.input
    3.9 KB · Views: 1
I ran with the above namelist file and up to 7000 cores or so, but still got the same error. I am now at the end of my rope.

% more rsl.error.0000

taskid: 0 hostname: x1005c0s0b0n0
module_io_quilt_old.F 2931 T
Ntasks in X 80 , ntasks in Y 90
*************************************
Configuring physics suite 'conus'

mp_physics: 8
cu_physics: 6
ra_lw_physics: 4
ra_sw_physics: 4
bl_pbl_physics: 2
sf_sfclay_physics: 2
sf_surface_physics: 2

*************************************
REAL_EM V4.1.3 PREPROCESSOR

*************************************

Parent domain
ids,ide,jds,jde 1 2650 1 1950
ims,ime,jms,jme -4 41 -4 29
ips,ipe,jps,jpe 1 34 1 22

*************************************

DYNAMICS OPTION: Eulerian Mass Coordinate

alloc_space_field: domain 1 , 82089636 bytes allocated
Yes, this special data is acceptable to use: OUTPUT FROM METGRID V4.1
Input data is acceptable to use: met_em.d01.2014-01-01_00:00:00.nc
metgrid input_wrf.F first_date_input = 2014-01-01_00:00:00
metgrid input_wrf.F first_date_nml = 2014-01-01_00:00:00

cxil_map: write error
cxil_map: write error
:
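
The decomposition reported in the log above (80 x 90 tasks on a 2650 x 1950 grid, with task 0 holding a 34 x 22 patch) can be reproduced with a short sketch. The task-count range in `suggested_task_range` uses a rule of thumb often quoted for WRF, (e_we/100)*(e_sn/100) up to (e_we/25)*(e_sn/25); that rule is an assumption brought in for illustration and does not come from this thread.

```python
import math


def patch_extent(n_points, n_tasks):
    """Return (smallest, largest) patch width when n_points are split across n_tasks."""
    return n_points // n_tasks, math.ceil(n_points / n_tasks)


def suggested_task_range(e_we, e_sn):
    """Rule-of-thumb (low, high) task counts; an assumption, not from this thread."""
    low = (e_we / 100) * (e_sn / 100)
    high = (e_we / 25) * (e_sn / 25)
    return math.ceil(low), math.floor(high)


e_we, e_sn = 2650, 1950       # ide and jde from the log
ntasks_x, ntasks_y = 80, 90   # "Ntasks in X 80 , ntasks in Y 90"

print("patch width in x:", patch_extent(e_we, ntasks_x))   # (33, 34)
print("patch width in y:", patch_extent(e_sn, ntasks_y))   # (21, 22)
print("total tasks:", ntasks_x * ntasks_y)                 # 7200
print("rule-of-thumb range:", suggested_task_range(e_we, e_sn))  # (517, 8268)
```

The largest patch, 34 x 22, matches the ips/ipe and jps/jpe values printed for task 0, and 7200 tasks sits inside the rule-of-thumb range, so the decomposition itself looks plausible for this grid.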
 
Okay, can you try recompiling the code with lower optimization? Issue the following from the top-level WRF directory:

1) ./clean -a
2) ./configure

Then open the configure.wrf file, look for instances of "-O2" or "-O3", and change those to "-O1". Save the file and recompile the code. After that, try running the simulation again to see if there's any difference.
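
If editing by hand is tedious, the flag substitution described above could be scripted. The sketch below is a minimal example operating on a text string; the two sample lines are made up for illustration and are not copied from a real configure.wrf.

```python
import re

# Match -O2 or -O3 only when it is a whole whitespace-delimited token.
_OPT_FLAG = re.compile(r"(?<!\S)-O[23](?!\S)")


def lower_optimization(text, new_flag="-O1"):
    """Replace every standalone -O2/-O3 flag in `text` with `new_flag`."""
    return _OPT_FLAG.sub(new_flag, text)


# Illustrative sample lines (hypothetical content).
sample = "FCOPTIM = -O3 -ip\nCFLAGS_LOCAL = -w -O2\n"
print(lower_optimization(sample))
```

To apply this to the real file, one would read configure.wrf into a string, keep a backup copy, and write the result back before recompiling. Flags embedded in longer tokens are deliberately left untouched, so it is still worth reviewing the edited file.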
 
This seems closely related to an issue we have been tracking on Derecho for a while, and there might be a fix:

If possible, please replace the contents of your "frame/collect_on_comm.c" with this:

and try running the case again?
 
Thanks.
I had the same error when writing restart files in WRF v4.6 with urban option 3 (BEP-BEM) on Derecho.
The issue was resolved by using the revised frame/collect_on_comm.c.
 