
possible memory issue when reading wrfbdy_d01

xjhuang

New member
I am using chem_opt=202 (MOZART-MOSAIC-aq) and mozbc to write the chemical boundary and initial conditions from CAM-Chem.

I had a couple of successful runs before:
1) chem_opt = 201 + bc + ic
2) chem_opt = 202 + ic;
However, chem_opt = 202 + bc + ic always fails, with varying error messages:

1. Using 96 cores:
d01 2017-06-10_00:00:00 Input data is acceptable to use: wrfbdy_d01
forrtl: severe (174): SIGSEGV, segmentation fault occurred
double free or corruption (out)

2. Using 96 cores, but with significantly more requested memory to rule out a memory issue:
d01 2017-06-10_00:00:00 Input data is acceptable to use: wrfbdy_d01
Then it just hangs, with no further output, until the job times out.

3. Using 196 cores:
The program gets past wrfbdy_d01 without error, but then fails later:
d02 2017-06-10_00:00:00 Input data is acceptable to use: wrfinput_d03
corrupted size vs. prev_size

4. Using 196 cores, also with significantly more requested memory:
It runs! But after about 7 minutes of processing, the error comes back:
corrupted size vs. prev_size

I would like to better understand what these corruption messages mean and what the possible ways to fix them are.
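
To see which MPI rank reports one of these messages first, the per-rank log files can be scanned. The following is a minimal sketch, assuming the run directory contains the usual rsl.error.* files from a distributed-memory WRF run; the function name scan_rsl and the pattern list are hypothetical and cover only the messages quoted in this post.

```python
from pathlib import Path

# Error strings quoted in this thread; extend the list as needed.
PATTERNS = [
    "SIGSEGV",
    "double free or corruption",
    "corrupted size vs. prev_size",
    "corrupted double-linked list",
    "free(): invalid pointer",
]


def scan_rsl(run_dir="."):
    """Return {file name: [(line number, line text), ...]} for every
    rsl.error.* file in run_dir that contains one of PATTERNS."""
    hits = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        found = []
        # errors="replace" keeps the scan going if a log has stray bytes.
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if any(p in line for p in PATTERNS):
                    found.append((lineno, line.strip()))
        if found:
            hits[path.name] = found
    return hits


def report(run_dir="."):
    """Print one 'file:line: message' row per match."""
    for name, found in scan_rsl(run_dir).items():
        for lineno, text in found:
            print(f"{name}:{lineno}: {text}")
```

Calling report() from the run directory lists the ranks that logged a match. If only a few ranks show up, that may hint at which tiles are involved, though it does not by itself identify the cause.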

Does this mean I always need to use a very large number of CPUs and a lot of memory?

One note: my wrfbdy_d01 has exactly 3000 4D variables. That seems surprisingly large to me; is this normal?
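
For anyone who wants to check a count like this on their own file, here is a minimal sketch that tallies variables by number of dimensions from the text printed by `ncdump -h wrfbdy_d01`. It assumes the standard CDL header layout; the function name count_by_rank is hypothetical, and variables declared without any dimensions are not counted.

```python
import re
from collections import Counter

# Matches CDL declarations such as:
#   float U_BXS(Time, bdy_width, bottom_top, south_north) ;
# Attribute lines ("U_BXS:units = ...") do not match because of the colon.
VAR_DECL = re.compile(r"^\s*\w+\s+(\w+)\(([^)]*)\)\s*;")


def count_by_rank(header_text):
    """Count variables in `ncdump -h` output, keyed by number of dimensions."""
    counts = Counter()
    in_variables = False
    for line in header_text.splitlines():
        stripped = line.strip()
        if stripped == "variables:":
            in_variables = True
            continue
        # The variables section ends at the global attributes or data section.
        if stripped.startswith("// global attributes") or stripped == "data:":
            in_variables = False
        if in_variables:
            match = VAR_DECL.match(line)
            if match:
                dims = [d for d in match.group(2).split(",") if d.strip()]
                counts[len(dims)] += 1
    return counts
```

One way to use it is to save the header with `ncdump -h wrfbdy_d01 > header.txt`, read that file into a string, and pass the string to count_by_rank; the entry under key 4 is then the number of 4D variables.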


My namelist.input is attached.

Thanks for any help!
 

Apologies: in my previous namelist, I had a couple of bc options turned on for domains d02 and d03. I have corrected these and don't think they are relevant to the issue above.

A previous discussion thread mentioned that mp_physics=2 (Lin et al. scheme) would work, but it does not support wet scavenging, so I ran the following tests:

___________________________________________________________________________________________________________________________
Test results (all with 96 CPUs unless otherwise specified):

1. mp_physics=2, wetscav_onoff=0, cldchem_onoff=1: works

2. mp_physics=10, wetscav_onoff=0, cldchem_onoff=1: works

3. mp_physics=2, wetscav_onoff=1, cldchem_onoff=1: not allowed. This is a namelist conflict, since mp_physics=2 cannot be combined with wetscav_onoff=1

4. mp_physics=10, wetscav_onoff=1, cldchem_onoff=0: seems to work, but I get "free(): invalid pointer" at the end

5. mp_physics=10, wetscav_onoff=1, cldchem_onoff=1: this is the previously failed case



5.1) Tested it again with 96 CPUs; it still failed with this message:

d01 2017-06-10_00:00:00 Input data is acceptable to use: wrfbdy_d01

forrtl: severe (174): SIGSEGV, segmentation fault occurred



5.2) Tested it again with 128 CPUs; it hung with this message:

d01 2017-06-10_00:00:00 Input data is acceptable to use: wrfbdy_d01

corrupted size vs. prev_size

In practice this is a failure, because the run does not proceed.



5.3) Then I turned off wetscav_onoff for the first domain only, keeping mp_physics=10, wetscav_onoff=1, cldchem_onoff=1 for the other domains. I received a different error message, and then it hung:

d02 2017-06-10_00:00:00 Input data is acceptable to use: wrfinput_d03

corrupted size vs. prev_size



6. mp_physics=2, wetscav_onoff=0, cldchem_onoff=0: surprisingly, it hung here:

d01 2017-06-10_00:00:00 Input data is acceptable to use: wrfchemi_d01_2017-06-10_00:00:00

corrupted double-linked list

I count this as a failure, because the run usually does not proceed after this message.



The behavior seems somewhat random, and I find it hard to explain.



7. I removed the boundary conditions to test whether the run works with initial conditions only:

mp_physics=10, wetscav_onoff=1, cldchem_onoff=1, 96 CPUs: works

___________________________________________________________________________________________________________________________

My final goal is still to run mp_physics=10, wetscav_onoff=1, cldchem_onoff=1 on 96 CPUs with both bc and ic on. I would appreciate it if someone could help investigate this further and offer some suggestions.

Thanks!
 