MPAS-A Crashed with failure opening MP_THOMPSON_QRacrQG.DBL

Feng Liu

Member
Hi,
I encountered the following microphysics data error when running MPAS-A with an initial conditions file prepared from ERA5 data. Does anyone have any suggestions or insights that could help?

----------------------------------------------------------------------
Beginning MPAS-atmosphere Error Log File for task 0 of 36
Opened at 2025/02/04 14:23:15
----------------------------------------------------------------------

ERROR:
ERROR: ------------------------------ FATAL CALLED ------------------------------
ERROR: subroutine thompson_init: failure opening MP_THOMPSON_QRacrQG.DBL
CRITICAL ERROR: MPAS core_physics abort
Logging complete. Closing file at 2025/02/04 14:23:15
 
Is the file present in your run directory? If not, these MP_THOMPSON*.DBL files are generated by the build_tables executable. If you have a successful MPAS-A build, you can find build_tables at the top level of the MPAS-Model directory.
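A rough sketch of that process, assuming a working build (the ${PATH_TO} values are placeholders for your own paths):
Bash:
# A minimal sketch; ${PATH_TO} values are placeholders for your own paths.
cd ${PATH_TO}/MPAS-Model
./build_tables                    # writes the MP_THOMPSON*.DBL lookup tables here

# Copy (or symlink) the tables into your run directory
cp MP_THOMPSON*.DBL ${PATH_TO}/run_dir/
ls ${PATH_TO}/run_dir/MP_THOMPSON*.DBL    # confirm the files are present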
 
Thank you so much! The following files were generated after I ran build_tables:

  • MP_THOMPSON_QIautQS_DATA.DBL
  • MP_THOMPSON_freezeH2O_DATA.DBL
  • MP_THOMPSON_QRacrQS_DATA.DBL
  • MP_THOMPSON_QRacrQG_DATA.DBL
However, the model is not proceeding; it aborts with a core dump. The log file (log.atmosphere.0000.out, renamed to log.atmosphere.txt), which I’ve attached, doesn’t show any errors or provide any hints about what went wrong.

On the other hand, a different case with a lower-resolution mesh ran successfully. I’ve also attached the corresponding log file for this case (log.atmosphere.test.txt).

I appreciate your time and help.

Feng
 

Attachments

  • log.atmosphere.txt
    16.2 KB
  • log.atmosphere.test.txt
    380.1 KB
I'm unsure from the logs you provided. All I can say for certain is that your job died during the initialization phase of the atmosphere core.

Typically, information for this type of failure is written to STDOUT or STDERR. If you are running on an HPC system, these are usually redirected to a file for you, or you may have saved this output yourself. Otherwise, you'd need to look back in the history of whatever shell you ran the atmosphere_model from (or just re-run).
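For example (the file names below are common scheduler defaults, not guarantees, and the launcher may differ on your system):
Bash:
# Common places the streams end up on HPC systems (names vary by site/scheduler):
#   Slurm: slurm-<jobid>.out                (stdout and stderr merged by default)
#   PBS:   <jobname>.o<jobid> and <jobname>.e<jobid>
# Or capture them yourself when launching interactively:
mpiexec -n 36 ./atmosphere_model > run.stdout 2> run.stderr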

You mention a "core" file. Assuming this is the literal name and it's present in your run directory, you could use the gdb tool to examine the core file to understand the failure. A procedure for this generally looks like:
Bash:
# Set up the environment to use the same software you compiled/ran MPAS with.
# This is system dependent so I'm not adding the commands here

# Start gdb, replace each PATH_TO with your values (if needed)
gdb ${PATH_TO}/atmosphere_model ${PATH_TO}/core
# There will then be quite a bit of output. You should look at the last few lines
#  before the prompt. It may contain something like SIGFPE or SIGSEGV to
#  tell you how it failed.
# You should also see a line above the prompt that tells you what the last
#  line of code executed was. Depending on a few factors, you could get
#  more info with the backtrace (or 'bt') command

(gdb) bt
# gdb will spit out many lines of output. There's usually not much else I do
#  inside gdb. You could research about gdb to see how else to use the tool.
# I typically just exit/quit here and look at the MPAS code to understand the error.
(gdb) quit

You may need to re-compile with DEBUG=true in your make command to get the most useful information from gdb. I can usually get at least the last line before the failure occurred even without building with debug flags.
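For reference, a debug rebuild would look roughly like this; "gfortran" is just an example target, so use whichever one you normally build with:
Bash:
# Rebuild with debug flags; replace "gfortran" with your usual make target.
make clean CORE=atmosphere
make gfortran CORE=atmosphere DEBUG=true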
 
Thank you for your response. I checked the core file with the debugging tool and obtained the following results, which don't seem very helpful.

Interestingly, the model ran successfully after I switched run directories and linked namelist.atmosphere, streams.atmosphere, region.init.nc, and all the lbc* files into the MPAS_Model folder where I built the MPAS model. As expected, I found three output files: log.atmosphere.0000.out, diag.2017-06-14_00.00.00.nc, and history.2017-06-14_00.00.00.nc.
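For reference, the linking looked roughly like this (the /path/to/run prefix below stands in for my actual directories):
Bash:
# /path/to/run stands in for the original run directory.
cd MPAS-Model-8.2.2
ln -s /path/to/run/namelist.atmosphere .
ln -s /path/to/run/streams.atmosphere .
ln -s /path/to/run/region.init.nc .
ln -s /path/to/run/lbc* .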

Where should I set up the environment parameters, such as STDOUT or STDERR, that you mentioned?

Thanks again for your time and assistance!
-------------
gdb atmosphere_model core.90312
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from MPAS/MPAS_Model/MPAS-Model-8.2.2/atmosphere_model...done.

warning: core file may not match specified executable file.
[New LWP 90312]
[New LWP 90334]
[New LWP 90333]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".

warning: the debug information found in "/usr/lib/debug/usr/lib64/libz.so.1.2.7.debug" does not match "/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1" (CRC mismatch).


warning: the debug information found in "/usr/lib/debug//usr/lib64/libz.so.1.2.7.debug" does not match "/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libz.so.1" (CRC mismatch).
.....
Core was generated by `./atmosphere_model'.
Program terminated with signal 11, Segmentation fault.
#0 mca_btl_smcuda_sendi () at ../../../../../opal/mca/btl/smcuda/btl_smcuda.c:938
938 ../../../../../opal/mca/btl/smcuda/btl_smcuda.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install libibverbs-41mlnx1-OFED.4.5.0.1.0.45101.x86_64 libmlx4-41mlnx1-OFED.4.5.0.0.3.45101.x86_64 libmlx5-41mlnx1-OFED.4.5.0.3.8.45101.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-41mlnx1-OFED.4.2.0.1.3.45101.x86_64 librxe-41mlnx1-OFED.4.4.2.4.6.45101.x86_64 nspr-4.35.0-1.el7_9.x86_64 nss-3.90.0-2.el7_9.x86_64 nss-softokn-freebl-3.90.0-6.el7_9.x86_64 nss-util-3.90.0-1.el7_9.x86_64 numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-26.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
 
The first part of this message focuses on your now-successful run. More information about gdb and other details that aren't immediately helpful is provided at the end.

Interestingly, the model ran successfully after I switched run directories and linked namelist.atmosphere, streams.atmosphere, region.init.nc, and all the lbc* files into the MPAS_Model folder where I built the MPAS model.
As I understand this sentence, you had issues using some other run directory (outside your clone of MPAS-Model). When you instead configured things so your desired namelist, streams, input file, and lbc files were inside your MPAS_Model directory, things ran as you expected. My suspicion would be that the initial run directory was missing some of the physics lookup tables that MPAS-A requires.
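If you'd like to check that suspicion yourself first, comparing the lookup tables in the two directories would be a quick test (the directory names below are placeholders):
Bash:
# Directory names are placeholders; MPAS-A expects the *.TBL and *.DBL
# physics lookup tables to be present in the run directory.
ls failed_run_dir/*.TBL failed_run_dir/*.DBL
ls MPAS_Model_dir/*.TBL MPAS_Model_dir/*.DBL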

Could you please provide a listing of the previous run-directory (the failed run) and another listing of the MPAS_Model folder you mention? The output of the ls command would be enough.



Where should I set up the environment parameters, such as STDOUT or STDERR, that you mentioned?
Apologies for my incorrect formatting and the confusion. I meant to refer to the "standard output stream" (a.k.a. stdout) of your shell/terminal and the "standard error stream" (a.k.a. stderr). Any command you use in your terminal sends its output to one of these streams. A full explanation of these streams and how to use them is better found outside this forum. Some helpful links, based on my personal opinion:



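In the meantime, a tiny self-contained demonstration of the two streams (missing_file is deliberately bogus; namelist.atmosphere stands in for any file that exists):
Bash:
# missing_file is deliberately bogus so that ls writes to both streams.
ls namelist.atmosphere missing_file > out.txt 2> err.txt
cat out.txt   # stdout: the listing for namelist.atmosphere
cat err.txt   # stderr: the "No such file or directory" message for missing_file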
I agree with you; that gdb output doesn't seem very helpful at all. Though I see two lines that may be helpful in the future:

warning: core file may not match specified executable file.
This may relate to an important detail that I didn't say explicitly: you should always give gdb the same executable that generated the core file. If you want to use a debug version of the code to get more information from gdb, you will need to run the code again with that new executable. That should create a new core file if nothing else changed. I only say "should" because some errors may be tied to the flags used during the build. It is all heavily dependent on the software you use, the error that caused the crash, your run-time variables, and many other factors.

Although, sometimes the information you get from gdb is exactly as helpful as you've seen here... (Again due to many very specific details.)
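One more hedged note while you're re-running: on many systems core dumps are disabled by default, so you may need to raise the limit first:
Bash:
# Core dumps are often disabled by default; this enables them in bash/zsh
# for the current shell session only.
ulimit -c unlimited
./atmosphere_model    # a crash should now leave a core* file behind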

Program terminated with signal 11, Segmentation fault.
A "segmentation fault" usually means the code tried to access some region of code, memory, or data on disk that either doesn't exist or isn't accessible (for various reasons). This mca_btl_smcude_sendi command accessed some region of memory (a variable or part of one) that it wasn't allowed to. An example of this would be trying to access a(9) when the array only has 8 elements or hasn't been allocated.
 
As I understand this sentence, you had issues using some other run directory (outside your clone of MPAS-Model). When you instead configured things so your desired namelist, streams, input file, and lbc files were inside your MPAS_Model directory, things ran as you expected. My suspicion would be that the initial run directory was missing some of the physics lookup tables that MPAS-A requires.

Could you please provide a listing of the previous run-directory (the failed run) and another listing of the MPAS_Model folder you mention? The output of the ls command would be enough.
Thank you for your detailed response.

Attached are the files generated using the "tree" command: list_run_folder.txt for the successful run directory and list_failed_folder.txt for the failed run directory.
 

Attachments

  • list_run_folder.txt
    6.2 KB
  • list_failed_folder.txt
    4 KB
Apologies for my incorrect formatting and the confusion. I meant to refer to the "standard output stream" (a.k.a. stdout) of your shell/terminal and the "standard error stream" (a.k.a. stderr). Any command you use in your terminal sends its output to one of these streams. A full explanation of these streams and how to use them is better found outside this forum. Some helpful links, based on my personal opinion:
Thank you for the explanation and the references about the "standard output stream".
 
This may relate to an important detail that I didn't say explicitly: you should always give gdb the same executable that generated the core file. If you want to use a debug version of the code to get more information from gdb, you will need to run the code again with that new executable. That should create a new core file if nothing else changed. I only say "should" because some errors may be tied to the flags used during the build. It is all heavily dependent on the software you use, the error that caused the crash, your run-time variables, and many other factors.

Although, sometimes the information you get from gdb is exactly as helpful as you've seen here... (Again due to many very specific details.)
I did use the same executable that generated the core file with gdb.
 
A "segmentation fault" usually means the code tried to access some region of code, memory, or data on disk that either doesn't exist or isn't accessible (for various reasons). This mca_btl_smcude_sendi command accessed some region of memory (a variable or part of one) that it wasn't allowed to. An example of this would be trying to access a(9) when the array only has 8 elements or hasn't been allocated.
I’m trying to understand why the MPAS-A run failed in that directory. Perhaps you can help me identify the issue by reviewing the files listed in the two directories.
Thank you again for your time and help.
 
Unfortunately, I don't think I see anything obvious from your list_*_folder.txt files. (Thanks for mentioning the tree command; I'll keep that for future use.)

I've tried looking at the log.atmosphere.*txt files you shared earlier, but the runs are rather different. I'd also like to understand why one run succeeded and the other didn't. Could you bundle together and post the namelist.atmosphere, streams.atmosphere, and log.atmosphere.0000.out files from the run inside the MPAS_Model directory and from the failed run?

One other suggestion to try would be running something like diff failed_dir success_dir. This will give a lot of output as it compares every file in each directory, but you may find something useful in there.
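For example, with placeholder directory names ("-q" keeps the output down to just which files differ):
Bash:
# -r recurses into subdirectories; -q reports only which files differ.
diff -rq failed_dir success_dir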
 