MPAS-8.1 Restart Issues: Jobs Stopping Prematurely

makinde

New member
Dear everyone,
I am experiencing a strange and persistent issue with MPAS-8.1 when trying to restart from its own restart files. Someone else had reported a similar problem here, but I believe my case is unique to MPAS-8.1, as I have not encountered this issue with previous versions. Additionally, I am unsure what version of MPAS was used in that thread, which is why I decided to open a new discussion.

I hope someone here might be able to offer some insights or solutions.

Issue Description
I've observed that after the first successful run of MPAS-8.1 (compiled with parallel netcdf and SMIOL), any subsequent runs using restart files stop prematurely after about a day and six hours of simulation (output is every 6 hours). Here are the key points of the issue:

I performed a 2-year simulation (e.g., 2020-01-01 to 2021-12-31 with outputs every 6 hours and restarts every 2 days) using MPAS-8.1 as an initial run without restarting. The simulation runs successfully. However, if I restart the run using one of the restart files (e.g., 2020-06-06_00:00:00) with `config_run_duration = '365_00:00:00'`, which is supposed to run for one year, it stops after around 1 day and 6 hours of simulation, regardless of the restart time and restart file.

Despite that, MPAS-8.1 does not produce any error files (e.g., log.atmosphere.0000.err) but does produce outputs (including diag, history, and sometimes a new restart file if it reaches the time to write a restart file).

I checked that init_atmosphere for MPAS-8.1 works fine and the output files are good (upon plotting the skin temperature, topography, and SST-Delta). Also, all libraries needed by MPAS are sourced correctly.

Short runs that do not require a restart are completed successfully. Restarting from a specific point in the short run also results in the job stopping prematurely. I have attached my namelist.atmosphere and the standard out and error files.

Given the above details, I would appreciate any advice or suggestions on possible reasons why the simulation stops prematurely upon restart or diagnostic steps or troubleshooting tips that could help pinpoint the root cause of this issue.

Thank you in advance for your help!
 

Attachments

  • atmos_stdout.txt (6.1 MB)
  • atmos_stderr.txt (294.7 KB)
  • namelist_atmosphere.txt (1.5 KB)
In your atmos_stdout.txt, there are many messages like "Flerchinger USEd in NEW version. Iterations= 10", which indicates that something is going wrong in the physics of your simulation. The errors possibly occur in the LSM (land surface model). This is why your restart cannot run to the end.
 
Thank you so much for the reply. Yeah, I noticed the many "Flerchinger USEd in NEW version. Iterations= 10" messages in the standard output, but I thought they were just informational. As far as I know, I have not changed any of the physics schemes from the default mesoscale_reference suite, so I can't think of where I could have gone wrong. One thing to note is that the model runs fine as long as I do not use any restart files, which could be misleading. Do you have any suggestions on where I should look, or further diagnostics that could help resolve this issue?

Thanks.
 
Hi,

Please clarify: if you run MPAS continuously over a period, it can run to the end successfully; however, if you run the same case over the same period using the 'restart' option from any time within that period, the model crashes. Is that correct?

I ran a test case using the 240-km mesh, and the 'restart' option works just fine. Your case is a bigger one, but I don't think that should be a concern.

Would you please issue the command ncdump -h <your MPAS restart file> > log and send me the log file to take a look?

Also, can you save the soil data (i.e., smcrel, sh2o, smois, tslb) at the time of restarting and compare them with the data in your restart file? I would expect them to be the same.
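If it helps, here is a minimal Python sketch for that comparison (the file names are placeholders, and it assumes the soil fields are available in one of your history/diag output files at the restart time):

Code:
# Compare soil fields between an MPAS history/diag file and a restart file
# valid at the same time; the file names below are placeholders.
from netCDF4 import Dataset
import numpy as np

soil_vars = ["smcrel", "sh2o", "smois", "tslb"]

with Dataset("history.2020-06-06_00.00.00.nc") as hist, \
     Dataset("restart.2020-06-06_00.00.00.nc") as rst:
    for var in soil_vars:
        a = hist.variables[var][:]
        b = rst.variables[var][:]
        print(f"{var}: max |history - restart| = {float(np.abs(a - b).max()):.6e}")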
 
Thank you, @Ming, for your reply.
Sorry for the late response. It took me some time to set up a new simulation that includes the variables you requested.

To answer your question:
Yes, that was what I mentioned previously. However, after running new simulations and performing various tests over the last few days, I discovered the following:
  1. MPAS crashes after 1 month, 5 days, and 6 hours when performing a long (1979-12-01 to 1980-12-31) continuous simulation at 60km mesh resolution, with outputs every 6 hours and restarts every 2 days.
  2. It also crashes after 1 month and 20 days when using a 30km uniform mesh for the same long, continuous case.
  3. Any attempt to restart MPAS from the two long cases mentioned above also results in MPAS crashing with a "forrtl: severe (174): SIGSEGV, segmentation fault occurred," regardless of which restart file is used.
  4. Note that for these simulations, I no longer get the "Flerchinger USEd in NEW version. Iterations= 10" messages in the atmos_stdout.
Hence, I ran some short test simulations using the 60km resolution and discovered that:
a. MPAS runs successfully for a continuous simulation over a period of 5 days at 60km mesh resolution, with outputs (including restart files) generated every hour.
b. It continues to run successfully if I restart the same case simulation using any of the restart files.
c. However, MPAS crashed again the moment I increased the duration from 5 days to 365 days using the same restart file. It crashes 1 month and 3 days after the restart.

As requested, I have attached the ncdump of the last restart file, log.restart_1980-01-04_00-00-00.txt, written before MPAS crashed during the long run. Additionally, I extracted the soil data (i.e., smcrel, sh2o, smois, tslb) from the diag and restart files (both at the timestamp 1980-01-04_00:00:00) just before the crash and compared them by subtracting (restart minus diag). They are identical, as shown by the attached zero-difference plots. In addition, I have attached the namelist.atmosphere.txt, log.atmosphere.0000.out.txt, and atmos_stderr.txt for the short test simulation.
 

Attachments

  • diff_sh2o.gif (6.5 KB)
  • diff_smcrel.gif (6.4 KB)
  • diff_smois.gif (6 KB)
  • diff_tslb.gif (6 KB)
  • log.atmosphere.0000.out.txt (5.3 MB)
  • namelist.atmosphere.txt (1.5 KB)
  • atmos_stderr.txt (267.8 KB)
  • log.restart_1980-01-04_00-00-00.txt (42.5 KB)
Hi,

Thank you for the update and the detailed description. These tests make sense, and they address my concerns about the restart capability of MPAS.

Apparently the MPAS "restart" capability works as expected. The model crash is caused by errors in the physics. Note that once the model has crashed, it cannot continue to run from a restart file, because the same physics/dynamics errors will again lead to a model blow-up.

Your namelist.atmosphere looks fine, and I don't see any inappropriate options. It seems that MPAS has trouble with long-term climate simulations. Now we need to identify the culprit responsible for the model crash.

Please recompile MPAS in debug mode, i.e.,

Code:
make clean CORE=atmosphere
make your-compiler CORE=atmosphere DEBUG=true

Then rerun this case from the latest restart file written before the crash and save all log files. Hopefully we can find exactly when and where the model starts to produce unreasonable results.
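If useful, here is a minimal Python sketch (an assumption on my part, not an official MPAS utility) that scans the log files for the first line containing NaN or Inf, which can help narrow down when and where the solution goes bad:

Code:
# Report the first NaN/Inf occurrence in each MPAS log file.
# The log file name pattern and the simple text search are assumptions.
import glob
import re

pattern = re.compile(r"\bnan\b|\binf(inity)?\b", re.IGNORECASE)

for path in sorted(glob.glob("log.atmosphere.*.out")):
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")
                break  # only the first suspicious line per file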
 
Thank you, @Ming, for your responses.

I recompiled MPAS-8.1.0 in debug mode as you suggested and then restarted both the 60km long run and the 60km short test that I later increased to 365 days. Unfortunately, both of them still crash but with different floating-point errors in the atmos_stderr files. I have attached the log files:

- 60km long run: atmos_stderr_long.txt
- 60km short test: atmos_stderr_test.txt

Apart from this, I did not notice any other differences when running MPAS-8.1.0 in debug mode.

Please advise on what else I should do. Would you suggest that MPAS-8.2 might overcome these challenges?

Thank you for your assistance.
 

Attachments

  • atmos_stderr_test.txt (2.5 KB)
  • atmos_stderr_long.txt (1.3 KB)
Thank you for the update.

It seems that MPAS has trouble with long-term climate simulations, while it works fine for short-term runs. I don't think MPAS v8.2 would overcome this issue, because no physics update in v8.2 addresses the relevant problems.

Would you please add the option below:

Code:
config_bucket_update = '5_00:00:00',

Then rerun this case from the beginning (i.e., not restart, just start from your initial condition).

Please keep me updated on the result. Thanks.
 
Thank you again, @Ming.
I just finished another run of MPAS-8.1.0, set up for a long simulation from 1979-12-01 to 1980-12-31 as in the previous case, but with `config_bucket_update = '5_00:00:00'`. I ran the model from the initial condition (i.e., not using any restart file) with diag, history, and restart output intervals of 6 hours, 1 day, and 2 days, respectively. However, MPAS crashed after 1 day of simulation (diag.1979-12-01_00.00.00.nc to diag.1979-12-02_00.00.00.nc) with `forrtl: error (73): floating divide by zero`. Please find attached the atmos_stderr.txt and log.atmosphere.0000.out.txt files.
Thanks.
 

Attachments

  • atmos_stderr.txt (1.1 KB)
  • log.atmosphere.0000.out.txt (206.6 KB)
Please upload your namelist and streams files for me to take a look. Also, what data did you use to produce the initial condition for this case?
 
Thanks, @Ming Chen
I used the NCEP CFSR 6-hourly product (NCAR RDA dataset d093000), downloaded from the NCAR RDA. Usually, I download the dataset in GRIB2 format and use the WRF ungrib tool to produce the intermediate-format files for MPAS initialization. In the past, I have used this dataset to run MPAS-7.x for 30- and 60-year climatologies.

Please find attached the namelist and streams files I used for the MPAS-8.1.0 run I described in the last post.
 

Attachments

  • namelist.atmosphere.txt (1.7 KB)
  • stream_list.atmosphere.diagnostics.txt (1.3 KB)
  • stream_list.atmosphere.output.txt (927 bytes)
  • stream_list.atmosphere.surface.txt (9 bytes)
  • streams.atmosphere.txt (1.9 KB)
  • streams.init_atmosphere.txt (920 bytes)