
rsl_malloc failed allocating -2076874848 bytes, called rsl_bcast.c, line 290, try 3

foji

Hello everyone,

As shown in the title, I am facing a perplexing problem with wrf.exe.
It does not usually occur; it appears only with certain domain settings and a small number of MPI processes.

I'm using dm+sm mode.
16 MPI processes, 9 OMP threads/process -> fails
16 MPI processes, 18 OMP threads/process -> fails
32 MPI processes, 9 OMP threads/process -> OK
48 MPI processes, 9 OMP threads/process -> OK

I compiled with the Intel oneAPI 2023.0.0 compilers and Intel MPI.
I use a large HPC system with over 100 GB of memory per node.
I have already run several calculations with this setup, but this is the first time I have hit this problem.
The error message always appears only in rsl.error.0005, so the problem is occurring in process 5.
I think it fails because the allocation size is negative, but I don't know why it would be negative.
Has anyone else faced a similar problem?
I would be happy to get your advice.

--- rsl.error.0005 ---
taskid: 5 hostname: aaaa5
module_io_quilt_old.F 2931 F
Quilting with 1 groups of 0 I/O tasks.
Ntasks in X 4 , ntasks in Y 4
Domain # 1: dx = 3333.000 m
Domain # 2: dx = 1111.000 m
WRF V4.4.2 MODEL
git commit 6233639c599119e76fca17dba9ea211af53a0ba9
*************************************
Parent domain
ids,ide,jds,jde 1 626 1 798
ims,ime,jms,jme 151 320 194 406
ips,ipe,jps,jpe 158 313 201 399
*************************************
DYNAMICS OPTION: Eulerian Mass Coordinate
alloc_space_field: domain 1 , 4073073176 bytes allocated
med_initialdata_input: calling input_input
Input data is acceptable to use:
CURRENT DATE = 2023-03-22_00:00:00
SIMULATION START DATE = 2023-03-22_00:00:00
Max map factor in domain 1 = 1.04. Scale the dt in the model accordingly.
D01: Time step = 15.0000000000000 (s)
D01: Grid Distance = 3.33300000000000 (km)
D01: Grid Distance Ratio dt/dx = 4.50045004500450 (s/km)
D01: Ratio Including Maximum Map Factor = 4.68098881221054 (s/km)
D01: NML defined reasonable_time_step_ratio = 6.00000000000000
Normal ending of CAMtr_volume_mixing_ratio file
GHG annual values from CAM trace gas file
Year = 2023 , Julian day = 81
CO2 = 4.224513890410958E-004 volume mixing ratio
N2O = 3.342180328767123E-007 volume mixing ratio
CH4 = 1.939667246575342E-006 volume mixing ratio
CFC11 = 2.106170986301369E-010 volume mixing ratio
CFC12 = 4.820791205479451E-010 volume mixing ratio
INPUT LandUse = "MODIFIED_IGBP_MODIS_NOAH"
LANDUSE TYPE = "MODIFIED_IGBP_MODIS_NOAH" FOUND 61 CATEGORIES 2 SEASONS WATER CATEGORY = 17 SNOW CATEGORY = 15
INITIALIZE THREE Noah LSM RELATED TABLES
*************************************
Nesting domain
ids,ide,jds,jde 1 484 1 484
ims,ime,jms,jme 112 252 112 252
ips,ipe,jps,jpe 122 242 122 242
INTERMEDIATE domain
ids,ide,jds,jde 205 371 296 462
ims,ime,jms,jme 237 297 328 388
ips,ipe,jps,jpe 247 287 338 378
*************************************
alloc_space_field: domain 2 , 156802940 bytes allocated
alloc_space_field: domain 2 , 2242279532 bytes allocated
rsl_malloc failed allocating -2076874848 bytes, called rsl_bcast.c, line 290, try 1
: Cannot allocate memory
mallinfo: arena 6213632
mallinfo: ordblks 10
mallinfo: smblks 8
mallinfo: hblks 22
mallinfo: hblkhd 125284352
mallinfo: usmblks 0
mallinfo: fsmblks 640
mallinfo: uordblks 4911040
mallinfo: fordblks 1302592
mallinfo: keepcost 106096
rsl_malloc failed allocating -2076874848 bytes, called rsl_bcast.c, line 290, try 2
: Cannot allocate memory
mallinfo: arena 6213632
mallinfo: ordblks 10
mallinfo: smblks 8
mallinfo: hblks 22
mallinfo: hblkhd 125284352
mallinfo: usmblks 0
mallinfo: fsmblks 640
mallinfo: uordblks 4911040
mallinfo: fordblks 1302592
mallinfo: keepcost 106096
sh: lsps: Cannot find command
rsl_malloc failed allocating -2076874848 bytes, called rsl_bcast.c, line 290, try 3
: Cannot allocate memory
mallinfo: arena 6213632
mallinfo: ordblks 10
mallinfo: smblks 8
mallinfo: hblks 22
mallinfo: hblkhd 125284352
mallinfo: usmblks 0
mallinfo: fsmblks 640
mallinfo: uordblks 4911040
mallinfo: fordblks 1302592
mallinfo: keepcost 106096
sh: lsps: Cannot find command
sh: lsps: Cannot find command
Abort(9) on node 5 (rank 5 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 9) - process 5
-----
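
As a reading aid for the log above, the index ranges printed for each domain can be turned into per-rank sizes. The sketch below assumes the usual WRF naming, where the `d`, `m`, and `p` prefixes stand for the full domain, the memory (patch plus halo) region, and the patch owned by this rank. The dictionary layout and function names are illustrative only and are not part of WRF.

```python
# Index ranges copied from rsl.error.0005 (rank 5, 4 x 4 MPI decomposition).
# Each tuple is (i_start, i_end, j_start, j_end).
RANGES = {
    "parent": {
        "domain": (1, 626, 1, 798),
        "memory": (151, 320, 194, 406),
        "patch": (158, 313, 201, 399),
    },
    "nest": {
        "domain": (1, 484, 1, 484),
        "memory": (112, 252, 112, 252),
        "patch": (122, 242, 122, 242),
    },
}


def extents(bounds):
    """Return the (i, j) number of grid points covered by inclusive bounds."""
    i0, i1, j0, j1 = bounds
    return (i1 - i0 + 1, j1 - j0 + 1)


def summarize(ranges):
    """Map each grid and region kind to its horizontal extent."""
    return {
        grid: {kind: extents(bounds) for kind, bounds in kinds.items()}
        for grid, kinds in ranges.items()
    }


summary = summarize(RANGES)
for grid, kinds in summary.items():
    print(grid, kinds)
```

With 16 MPI processes, rank 5 owns a 156 x 199 patch of the 626 x 798 parent domain and a 121 x 121 patch of the 484 x 484 nest. Using more MPI processes would be expected to shrink these per-rank patches, although the exact split WRF chooses for 32 or 48 processes is not shown in this log.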

Sincerely yours,
foji
 
Foji,
I would suggest that you recompile WRF in dmpar mode and then try again. We occasionally see OpenMP-related issues and are not 100% sure how to fix them. Sorry for not being able to help more with this issue.
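
On the negative size itself, one possible reading (not confirmed against the RSL_LITE source) is a 32-bit signed integer wraparound. If the byte count used at rsl_bcast.c line 290 is held in a 32-bit `int`, any request above 2147483647 bytes wraps to a negative number. The sketch below only does the arithmetic on the value reported in rsl.error.0005; the function names are hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed int can hold


def wrap_int32(nbytes: int) -> int:
    """Value a 32-bit signed C int would hold after storing nbytes."""
    return ctypes.c_int32(nbytes).value


def unwrap_int32(reported: int) -> int:
    """Smallest non-negative size that wraps to the reported value."""
    return reported % 2**32


reported = -2076874848  # size printed by rsl_malloc in rsl.error.0005
intended = unwrap_int32(reported)

print(intended)                         # candidate size actually requested
print(intended > INT32_MAX)             # True if it exceeds the int32 range
print(wrap_int32(intended) == reported)  # round-trip check
```

Under that assumption the intended request would be 2218092448 bytes (about 2.2 GB), just over the 32-bit limit. That would be consistent with the run succeeding with 32 or 48 MPI processes, where each rank handles a smaller piece, but this is an inference from the log rather than a verified diagnosis.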
 