calc_ecmwf_p tries to allocate too much memory

lnpilz

Hi everybody,

I'm seeing an intermittent problem (I haven't yet worked out when it does and doesn't occur): `calc_ecmwf_p` tries to allocate about 2.5 exabytes of memory and, predictably, crashes. The line where this happens is the following: WPS/util/src/calc_ecmwf_p.F at 884c1d15ffc97407cb174a64e540eaf873aad997 · wrf-model/WPS.

The error is:
In file 'calc_ecmwf_p.f90', around line 292: Error allocating 2506657909410179716 bytes: Cannot allocate memory



Error termination. Backtrace:
#0 0x403ca4 in ???
#1 0x405ff9 in ???
#2 0x7ffff6878cf2 in ???
#3 0x4016dd in ???
#4 0xffffffffffffffff in ???
srun: error: l10511: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=6789817.0
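
For scale, the byte count in the error message above can be converted to more familiar units. The short Python sketch below does only that arithmetic. The comparison grid (1440 x 721 points, 137 levels, 4-byte reals) is an assumption chosen for illustration and is not taken from the poster's setup.

```python
# Byte count copied from the gfortran error message above.
requested_bytes = 2506657909410179716

EB = 10**18    # exabyte (decimal)
EIB = 2**60    # exbibyte (binary)

print(f"requested: {requested_bytes / EB:.2f} EB ({requested_bytes / EIB:.2f} EiB)")

# Hypothetical 3-d field for comparison (assumed values, not from the post):
# 1440 x 721 horizontal points, 137 levels, 4 bytes per value.
plausible_bytes = 1440 * 721 * 137 * 4
print(f"hypothetical 3-d field: {plausible_bytes / 2**20:.0f} MiB")
print(f"ratio: {requested_bytes / plausible_bytes:.1e}")
```

A request this many orders of magnitude above any realistic field size would be consistent with array dimensions that hold garbage values, although the arithmetic alone does not show where those values come from.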

Could somebody help with this issue?

Thanks!

PS:
One hint might be that, in the cases where it works correctly, calc_ecmwf_p writes this as the next line in the log file: "WARNING: Either SOILHGT or SOILGEO are required to create 3-d GHT field, which is required for a correct vertical interpolation in real."

So maybe it finds some (malformed) height fields in the files, which confuse it? I have no idea why this should happen only sometimes, though. I tried to look into the FILE_ML files (with xxd, because I don't know how to inspect these custom intermediate files) and I could find some mentions of "HGT" but not of "SOILHGT" or "SOILGEO".
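
As an alternative to xxd, the field names in an intermediate file can be listed with a small script. The Python sketch below assumes the file follows the WPS intermediate format, version 5, written as big-endian Fortran unformatted sequential records with 4-byte record markers; if the files were produced differently, the offsets will not match. The function names `read_record` and `list_fields` are made up for this example. It also checks that each data slab has the size implied by its NX and NY header values, which is one way a malformed field could show up.

```python
import struct
import sys

def read_record(f, required=True):
    """Read one Fortran sequential record (assumed 4-byte big-endian markers)."""
    head = f.read(4)
    if not head and not required:
        return None  # clean end of file
    if len(head) != 4:
        raise ValueError("truncated record marker")
    (n,) = struct.unpack(">i", head)
    if n < 0:
        raise ValueError("negative record length")
    data = f.read(n)
    tail = f.read(4)
    if len(data) != n or len(tail) != 4 or struct.unpack(">i", tail)[0] != n:
        raise ValueError("truncated or malformed record")
    return data

def list_fields(path):
    """Return (hdate, field, xlvl, nx, ny, slab_ok) for every field in the file."""
    fields = []
    with open(path, "rb") as f:
        while True:
            rec = read_record(f, required=False)
            if rec is None:
                break
            (version,) = struct.unpack(">i", rec[:4])
            if version != 5:
                raise ValueError(f"unsupported format version {version}")
            # Assumed header layout: HDATE(24) XFCST(4) MAP_SOURCE(32)
            # FIELD(9) UNITS(25) DESC(46) XLVL(4) NX(4) NY(4) IPROJ(4)
            hdr = read_record(f)
            hdate = hdr[0:24].decode("ascii", "replace").strip()
            field = hdr[60:69].decode("ascii", "replace").strip()
            xlvl, nx, ny, _iproj = struct.unpack(">fiii", hdr[140:156])
            read_record(f)          # projection parameters (skipped)
            read_record(f)          # IS_WIND_GRID_REL flag (skipped)
            slab = read_record(f)   # NX*NY 4-byte reals
            fields.append((hdate, field, xlvl, nx, ny, len(slab) == nx * ny * 4))
    return fields

if __name__ == "__main__" and len(sys.argv) > 1:
    for hdate, field, xlvl, nx, ny, ok in list_fields(sys.argv[1]):
        flag = "" if ok else "  <-- slab size does not match NX*NY"
        print(f"{hdate} {field:<9} level={xlvl:<10g} {nx}x{ny}{flag}")
```

Run it with the path of one intermediate file as the only argument. It prints one line per field, so the presence or absence of SOILHGT or SOILGEO can be checked directly.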
 
Hi Ming, thanks for getting back to me.

I'm trying to process ERA5 data. As indicated in the relevant tutorials and posts, I first ungrib the model-level data (because it is GRIB2) and then the surface-level data (as it is GRIB1). I then run `./util/avg_tsfc.exe` (which works fine) and then `./util/calc_ecmwf_p.exe`, which sometimes crashes.

I tried to make the crashes reproducible, but it turns out that I can't do so reliably.
On the login nodes of my HPC: I ran exactly the same configuration several times. The first run crashes, while the subsequent runs work. When I then wait a couple of minutes (~2) and try again, it crashes again. That sounds like uninitialized memory to me...
On the compute nodes of my HPC: It fails reproducibly.

I have attached the relevant Vtable files and the calc_ecmwf_p logs (one from a working run; the _06 log contains the crash). If you need more information, don't hesitate to reach out.
 

Attachments

  • calc_ecmwf_p.log (1.8 MB)
  • Vtable.ECMWF.grib1.txt (2.5 KB)
  • Vtable.ECMWF.grib2.txt (1.3 KB)
  • calc_ecmwf_p_06.log (5.5 KB)
Hey Ming,
I just wanted to check whether you have any new ideas to report. Unfortunately, the problem still persists.
Best,
Lukas
 
Lukas,
This looks more like a memory issue. Your login node probably has more memory than your compute nodes, which would explain why the code can run on the login node. We are aware of issues related to the model-level ERA5 data. As a compromise, the NCAR RDA archives ERA5 data on pressure levels at quarter-degree resolution. Please see details here. I hope the RDA dataset can meet your requirements.
If not, then a parallel version of calc_ecmwf_p.F might be required to process the model-level ERA5 data. Unfortunately, due to limited staffing in our group, this is not a priority for us at present.
 