MPAS-A stops running on the global variable-resolution 15km-3km mesh but does not raise any errors

jkukulies

New member
I am trying to run MPAS with the global variable-resolution 15km-3km mesh, with the refinement region located over CONUS. I have successfully created the static file and initial conditions, but the model run stops after 30 minutes and 18 seconds. I get no errors, and the first history file is created with output up to the point where the simulation stops.

I have set the model time step to 18 seconds. I have tried both the mesoscale_reference and the convection_permitting physics suites. I have also tried different combinations of batch nodes and MPI processes, including undersubscribing the batch nodes, as suggested for this high-resolution mesh. The output format for history files is set to "pnetcdf,cdf5", and I have added config_apvm_upwinding = 0.0 to namelist.atmosphere.
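
For reference, the relevant lines in my namelist.atmosphere look roughly like this (paraphrased; everything not shown is left at the suite defaults, and the "pnetcdf,cdf5" io_type is set on the history stream in streams.atmosphere):

    &nhyd_model
        config_dt             = 18.0    ! model time step in seconds
        config_apvm_upwinding = 0.0
    /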

Attached are my namelist, streams file, and the log output. My working directory on Derecho is /glade/work/kukulies/MPAS-Model.

Any help would be much appreciated!

Thanks!
Julia
 

Attachments

  • log.atmosphere.0000.out.txt (88.3 KB)
  • streams.atmosphere.txt (855 bytes)
  • namelist.atmosphere.txt (2.3 KB)
Hi Julia,

I assume at least one of the jobs showing this behavior is associated with the "/glade/work/kukulies/MPAS-Model/mpas_atmosphere.o7705618" file. Looking at that job output file and the "log.atmosphere.0291.err" file, it seems your MPAS run terminated itself due to "CRITICAL ERROR: NaN detected in 'w' field." In other words, your simulation is eventually performing invalid arithmetic. Or is there some other job I should be looking at?

Someone else may need to check whether your namelist settings introduce an issue; they look reasonable to me, but I'm not an expert on valid combinations of namelist settings.

I will try looking into this by running a similar job myself. What modules are you running with on Derecho? Could you extend your "run_mpas.sh" to explicitly load those modules and run the module list command?
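
For example, near the top of run_mpas.sh, something along these lines (the module names below are just placeholders; load exactly the set your MPAS executable was built against):

    #!/bin/bash
    # ... existing PBS directives ...

    # Explicitly load the build-time modules, then record them in the job log
    module load ncarenv craype intel ncarcompilers cray-mpich
    module load hdf5 netcdf
    module list

    # ... existing mpiexec line that launches atmosphere_model ...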

If you have the time, it may help to visualize the Conus_init.nc file to see if that seems like a reasonable input.

Cheers,
Dylan
 
Hi Dylan,

Thanks for your prompt reply!

The job output and the error file you looked at were from an earlier attempt in which I replaced the convection scheme of the physics suite. Please look at the latest job instead, which only produced log.atmosphere.0000.out and no error log files. That is the problem I keep encountering across different namelist configurations and processor counts. The NaN values in w only appeared for one specific physics setup with the convection_permitting suite.

Will get back to you shortly with a visualization of my init file!

Thanks for looking into this!

Cheers,
Julia
 
Hi Dylan,

An update: I got the simulation to work, and I am pretty sure the only thing I changed was that I explicitly loaded the modules, so thank you for that tip!

I ran the simulation with 3084 processors, and this is the module list output:

Currently Loaded Modules:

  1) ncarenv/23.09 (S)      5) cray-mpich/8.1.27    9) ncl/6.6.2
  2) craype/2.7.23          6) hdf5/1.12.2         10) ncview/2.1.9
  3) intel/2023.2.1         7) netcdf/4.9.2
  4) ncarcompilers/1.0.0    8) nco/5.2.4

That said, I am still struggling to run the same simulation with the convection_permitting suite, and that goes back to the previous error "CRITICAL ERROR: NaN detected in 'w' field." The CONUS_init.nc file seems to look fine (attached is a visualization of the surface temperatures).

Thanks!
Julia
 

Attachments

  • init_temperature_field.png (1.1 MB)
I'm glad the modules helped you out! As a non-scientist, I'd agree that the temperatures in your image look feasible.

I'll try to get some time in the coming days to try your setup. I also wouldn't expect switching to the convection_permitting suite to cause NaNs to occur.
 
OK, you are right. I did some more systematic testing, and it turns out that I can run the convection_permitting suite without problems when keeping exactly the same physics packages or when replacing certain schemes, e.g., the surface and boundary layer schemes.

However, I get the "NaN detected in 'w' field" error when I choose config_lsm_scheme = sf_noahmp. Sorry for the confusion earlier. So the NaN error has nothing to do with the physics suite but shows up when choosing Noah MP as the land surface model. This happens whether I run my setup with mesoscale_reference or with convection_permitting. Should I maybe open a new thread for this, since I know that the Noah MP capability is a rather recent addition? I am confused, though, because I cannot think of a reason why the land surface scheme would cause unrealistic values in the vertical velocity field.
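
For completeness, here is roughly what the relevant part of my &physics block looks like in the failing run (paraphrased from my namelist; everything else is left at the suite defaults):

    &physics
        config_physics_suite = 'convection_permitting'
        config_lsm_scheme    = 'sf_noahmp'    ! NaNs in w with this line; the run completes with the suite's default Noah LSM
    /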

I should probably also add that the namelist settings that fail on the variable-resolution 15-3km mesh run successfully on the global uniform 15 km mesh. So I hope I am tracing the error to the right source, but it does seem like the land surface option is what is causing it.
 
Thanks for the investigation! Yes, a new thread would be great to help boost the signal on this. The Noah MP land surface model is one of our newest additions, and there may be interactions or even namelist settings that aren't yet well understood for correct use.

(Personally, I think an impact on the w field is quite plausible; the surface temperature fluxes would likely feed into the vertical wind velocity.)
 
You're absolutely right, that makes sense. I've heard from colleagues who've tested Noah MP on coarser grids, though not down to 3km, so that could be where the issue lies. I'll start a new thread to dive deeper into this. Thank you so much for your help so far, and apologies if my comments have been a bit all over the place! I really appreciate your time and insights.
 