
Error in wrf.exe program (Assertion failed)

mnardozzi74

New member
Hello All,

I am getting an "Assertion failed" error when I execute wrf.exe. When I run my real model simulation on real hardware, the following command fails with the error shown below it:

mpirun -np 12 ./wrf.exe

Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 590: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO


Please note: wrfout_* files are written for each domain, but at some point wrf.exe crashes before the simulation is completed. I have attached my namelist.input, namelist.wps, and error log files.

My WRF-ARW version is WRFV4.5.1

Hardware Specs:
32 GB system memory
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD Opteron(tm) Processor 6338P
Stepping: 0
Frequency boost: enabled
CPU MHz: 1398.962
CPU max MHz: 2300.0000
 

Attachments

  • namelist.wps
    741 bytes · Views: 0
  • namelist.input
    4.1 KB · Views: 2
  • rsl.error.0001.txt
    6.8 KB · Views: 3
Please modify the options as follows in your namelist.input:
dx = 9000, 3000,
dy = 9000, 3000,
radt = 9, 9,
cu_physics = 11, 0

Then try again. If the case still fails, please turn off the lightning option (i.e., lightning_option = 0, 0) and see whether it runs successfully.

This will help to determine whether the lightning option caused the model crash.
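
As a quick sanity check on the grid spacings suggested above, the nest spacing should equal the parent spacing divided by the nesting ratio. The following is a minimal sketch, assuming parent_id = 1, 1 and parent_grid_ratio = 1, 3 (implied by 9000/3000 but not shown in this thread); the helper name check_nest_spacing is hypothetical and not part of WRF.

```python
def check_nest_spacing(dx, parent_id, parent_grid_ratio, tol=1e-6):
    """Return (domain, expected_dx, actual_dx) for every nest whose dx
    does not equal its parent's dx divided by parent_grid_ratio."""
    mismatches = []
    for dom in range(1, len(dx)):           # index 0 is domain 1, which has no parent
        parent = parent_id[dom] - 1         # namelist domain ids are 1-based
        expected = dx[parent] / parent_grid_ratio[dom]
        if abs(dx[dom] - expected) > tol:
            mismatches.append((dom + 1, expected, dx[dom]))
    return mismatches


# Values from the suggestion above; parent_id and parent_grid_ratio are assumed.
dx = [9000.0, 3000.0]
parent_id = [1, 1]
parent_grid_ratio = [1, 3]
print(check_nest_spacing(dx, parent_id, parent_grid_ratio))  # [] means consistent
```

The same check applies to dy. An empty list only means the spacings are mutually consistent; it says nothing about the cause of the crash.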
 
Hi Ming,

Thank you for the timely response. I will try your suggestions and get back to you ASAP. I need the lightning option, so I hope I can still use it.

Best regards
 
Hi All,

Based on your suggestions above I performed the following actions and here are my results:

1. Modified namelist.input:
dx = 9000, 3000,
dy = 9000, 3000,
radt = 9, 9,
cu_physics = 11, 0

After the changes above, I ran wrf.exe and I am still getting the "Assertion failed" error causing the wrf.exe program to crash.

2. I turned off the lightning_option by setting lightning_option = 0, 0

After these changes, I ran wrf.exe and I am still getting the "Assertion failed" error causing the wrf.exe program to crash.

I have attached the namelist.input and namelist.wps files showing the changes above. I have also captured strace when the program wrf.exe was running and in one of the straces for wrf.exe you can see the error as well.

Is there anything else I can do to troubleshoot the wrf.exe "Assertion failed" error? Should I perhaps use a previous version of WRF?

Thank you
 

Attachments

  • strace.165896.txt
    4 KB · Views: 0
  • namelist.wps
    753 bytes · Views: 0
  • namelist.input
    4.1 KB · Views: 1
  • rsl.error.0007.txt
    7.4 KB · Views: 1
 
Your namelist.input looks fine. However, the rsl file indicates that this case crashed immediately after ./wrf.exe started. Is this correct? Please let me know if I am wrong.

What data did you use to drive this case? Can you try running it again without nesting, i.e., max_dom = 1? Let me know whether it works.
If not, I suspect that this is either a machine issue or a problem with your input data.

By the way, have you ever run WRF successfully on this machine before?

If the case keeps failing, you may recompile WRF in debug mode and rerun the case. The log file will then show exactly when and where the model first failed, which should provide some hints about what is wrong.
 
A couple of questions:

Did you build WRF with OpenMP and MPICH?

Have you been able to run WRF successfully on this machine before?

Can you try mpirun -np 6 ./wrf.exe instead of 12? Sometimes, when WRF occupies every core, the machine's background processes cannot run. I see that you have 12 CPUs, so using only half of them might help. I'm thinking that the other processors might be needed for running the OS.
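
The figure of 6 also matches the physical core count in the lscpu output from the first post (1 socket, 6 cores per socket, 2 threads per core, 12 logical CPUs). A minimal sketch of that arithmetic follows; the helper names are hypothetical, and whether matching ranks to physical cores resolves this crash is an assumption, not something confirmed in the thread.

```python
def physical_cores(sockets, cores_per_socket):
    """Physical cores = sockets x cores per socket (hardware threads excluded)."""
    return sockets * cores_per_socket


def mpirun_command(n_ranks, exe="./wrf.exe"):
    """Build the mpirun argument list for the given number of MPI ranks."""
    return ["mpirun", "-np", str(n_ranks), exe]


# Values copied from the lscpu output in the first post.
sockets = 1
cores_per_socket = 6
threads_per_core = 2

n_physical = physical_cores(sockets, cores_per_socket)
n_logical = n_physical * threads_per_core

print(n_logical)                                   # 12
print(" ".join(mpirun_command(n_physical)))        # mpirun -np 6 ./wrf.exe
```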
 
The model crashes at different points in the simulation. For example, I am running a 24-hour simulation that produces hourly wrfout_d01* and wrfout_d02* files. Sometimes the model runs and produces 6 or more wrfout_d01* and wrfout_d02* files. Other times, wrf.exe crashes after 2 hours (or more) of running the simulation. This is the concerning part: no matter what I do, I cannot get a repeatable point in time at which wrf.exe crashes. I have attached multiple rsl.* files so you can see evidence of the latest simulation and the point at which wrf.exe crashed.
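
To compare crash points across runs, one option is to pull the last model time each rank reported from its rsl file. The sketch below assumes the rsl files contain the usual "Timing for main: time <date> on domain <n>" lines; that line format is an assumption based on typical WRF output and is not shown in this thread, and the function names are hypothetical.

```python
import glob
import re

# Assumed format: "Timing for main: time 2023-04-21_01:00:00 on domain   1: ..."
TIMING = re.compile(r"Timing for main: time (\S+) on domain\s+(\d+)")


def last_timestep(path):
    """Return {domain: last model time seen} for one rsl file."""
    last = {}
    with open(path, errors="replace") as fh:
        for line in fh:
            match = TIMING.search(line)
            if match:
                last[int(match.group(2))] = match.group(1)
    return last


def summarize(pattern="rsl.error.*"):
    """Map each rsl file matching the glob pattern to its last timesteps."""
    return {path: last_timestep(path) for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    for path, last in summarize().items():
        print(path, last)
```

Run from the WRF run directory after a crash. If the last reported times differ from run to run for the same configuration, that is consistent with the non-repeatable behavior described above.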

I am using the NAM 12 km data set to drive the simulation (nam.t00z.awphys00.tm00.grib2 files downloaded via 'https://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod/nam.$YYYY$MM$DD/nam.t${HH}z.awphys${FHR}.tm00.grib2').

I ran the model without nesting (max_dom=1) and it runs fine, producing the 24-hour forecast without errors. As soon as I set max_dom=2, it crashes randomly as I described above. Therefore, I think we can say there is an issue with running simulations where max_dom > 1.

I have only been able to run WRF-ARW successfully on this machine with max_dom=1, i.e., a non-nested run. Maybe the best path forward is to recompile WRF in debug mode.

Please advise

Thank you
 

Attachments

  • namelist.input
    4.1 KB · Views: 0
  • namelist.wps
    753 bytes · Views: 0
  • rsl.error.0000.txt
    381.7 KB · Views: 0
  • rsl.error.0010.txt
    10.9 KB · Views: 0
Thank you for the suggestion. I will set np to 6 and see if this fixes the crash issue.
 