
Error while Running WRF

etmolina95

New member
Good day. I am confused about how to prevent this error while running WRF. I am trying to run two domains, with the outer domain at 15-km and the inner domain at 3-km spatial resolution. My Ubuntu system has 16 GB of RAM, and I usually run WRF with two processors (mpirun -np 2 ./real.exe and mpirun -np 2 ./wrf.exe). The more processors I use, the more often WRF hits this error. I first ran 6 hours of FNL data to produce restart files, because when I set the run to 12 hours, it failed with this error at hour 10. After producing the 12-hr restart file, WRF can no longer run and produces the same error again. The error is:

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 11701 RUNNING AT meteo14-Aspire-TC-1750
= EXIT CODE: 136
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

I have attached my namelist.wps and namelist.input files for reference, together with the rsl.out and rsl.error files. Looking forward to your help. Thank you.
 

Attachments

  • namelist.wps
    1.1 KB · Views: 5
  • namelist.input
    3.7 KB · Views: 6
  • rsl.error.0000
    9.7 KB · Views: 6
  • rsl.out.0000
    8.3 KB · Views: 4
Hi,
Since the model first fails after 10 hours during your initial simulation, let's use that case to address the issue. It should not be failing after 10 hours. Using a restart case only makes the problem a bit more complicated. Can you go back to your initial (non-restart) simulation and run that again? After you do, please send me the updated namelist.input file, along with all of the rsl* files (you can package them all into a single *.tar file and attach that). Thanks!
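For anyone following along, the packaging step requested above can be sketched in Python. This is only an illustration: the function name package_run_files and the output name wrf_run_files.tar are hypothetical, and the sketch assumes the rsl.* files and namelist.input sit together in the WRF run directory.

```python
import glob
import os
import tarfile


def package_run_files(run_dir, tar_path):
    """Bundle every rsl.* file plus namelist.input from run_dir into one tar.

    Returns the list of file names that were added (hypothetical helper,
    not part of WRF).
    """
    # Collect rsl.out.* and rsl.error.* in a stable order.
    paths = sorted(glob.glob(os.path.join(run_dir, "rsl.*")))

    # Add the namelist only if it is actually present.
    namelist = os.path.join(run_dir, "namelist.input")
    if os.path.isfile(namelist):
        paths.append(namelist)

    # Store files under their base names so the archive unpacks flat.
    with tarfile.open(tar_path, "w") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))

    return [os.path.basename(p) for p in paths]
```

Calling package_run_files(".", "wrf_run_files.tar") from the run directory would produce a single file to attach.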
 
Hello, sir. Attached is the compressed file containing the namelist.input and rsl files from the WRF run that produced the error. Thank you very much for your response. I am still hoping for a solution to this error.
 

Attachments

  • Error_Test_run.zip
    39.2 KB · Views: 2
Hi,
Thanks for sending those. Since the error you're getting is a floating point exception, you may need to do some additional debugging. However, before you do that, I'd like to suggest a couple of things for your setup. First, you should make domain 01 a little larger. Domains should not be any smaller than 100x100 grid points. Second, you may want to try using more processors. If you have access to them, try using 9 or 36 processors to see if that happens to make any difference. If this is truly an "erroneous arithmetic operation," you will need to set up debugging to see where the model is stopping and what is causing that issue. Take a look at "How can I debug the code to find where the model is stopping?" for some instructions. And just to verify: you didn't make any modifications to the code, did you?
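As a side note on reading the error banner: exit code 136 matches the "signal 8" in the message if the launcher follows the common shell convention of reporting 128 plus the signal number. A minimal sketch, assuming that convention applies here (describe_exit_code is a hypothetical helper name):

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a 128 + N style exit code, else None.

    Assumes the common shell convention that a process killed by
    signal N is reported with exit code 128 + N.
    """
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # The remainder does not correspond to a known signal number.
        return None
```

Here describe_exit_code(136) gives "SIGFPE", the floating point exception signal, which agrees with the exit string in the original post. It says what kind of fault occurred, not where in the model it happened.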
 
Good day, sir. I have not made any modifications to the code. I will follow your suggestions and go through the debugging process. I'll post an update soon if I run into any problems with these steps. Thank you very much.
 
I ran a modified setup that addresses the suggestion on domain size and includes another domain (now three domains in total). I also ran it on 16 processors, since I only have 20 available. However, the error still persists, this time after 7 hours of hourly output. Attached is the compressed file containing the namelist.input and rsl files. Next I will go through the debugging process to see if it resolves the issue.
 

Attachments

  • Error_16_processors.zip
    91.7 KB · Views: 2
And this is the result after debugging. The error still persists.
 

Attachments

  • After_Debugging.zip
    92 KB · Views: 1
I don't expect the debugging process to fix the problem. It usually just helps to tell you where the issue is happening.

When we are trying to help solve the issue, we need everything to remain as simplified as possible, so adding a new 3rd domain creates a new complication. Additionally, you're now using a parent domain resolution of 45 km, which shouldn't be necessary. Can you just try your 2-domain simulation again, using a 15 km parent domain, but with the size at least 100x100? I expect it will likely still fail when you run it. If that's the case, will you share your files with me so that I can test this on my system? I will need the new namelist.input file, as well as your wrfinput* files and your wrfbdy_d01 file. Please also share the latest rsl* file package, just so that I can take a look. These files will likely be too large to attach here, so take a look at the home page of this forum for instructions on sharing large files. Thanks.
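The 100x100 guideline mentioned above can be checked before running by reading e_we and e_sn from the &domains section of namelist.input. The following is a rough sketch, not a full namelist parser: it assumes each key sits on one line with comma-separated integers, and the helper names and the 100-point threshold constant are illustrative (the threshold itself is taken from the advice in this thread).

```python
import re

MIN_POINTS = 100  # minimum grid points per side, per the advice in this thread


def domain_sizes(namelist_text):
    """Return a list of (e_we, e_sn) pairs, one per domain column."""
    def values(key):
        # Match "key = 1, 2, 3" on a single line, stopping at a "!" comment.
        m = re.search(rf"^\s*{key}\s*=\s*([^\n!]*)", namelist_text, re.MULTILINE)
        return [int(v) for v in re.findall(r"\d+", m.group(1))] if m else []

    return list(zip(values("e_we"), values("e_sn")))


def undersized_domains(namelist_text):
    """Return 1-based indices of domains smaller than MIN_POINTS on a side."""
    return [
        i
        for i, (we, sn) in enumerate(domain_sizes(namelist_text), start=1)
        if we < MIN_POINTS or sn < MIN_POINTS
    ]
```

For example, a namelist with e_we = 90, 151 and e_sn = 100, 151 (made-up values) would report domain 1 as undersized. Note that the sketch checks every column listed, including any beyond max_dom.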
 
Good day. Attached is the zip file containing the updated files for running 2 domains, now meeting the minimum 100x100 domain size. I ran it with various numbers of processors (16, 9, 4, and 2), but the error still persists. Thank you.
 
EDITED:

Hi,
Thanks for sending those. I ran a test with your case, using your namelist and your wrfbdy and wrfinput* files with WRF v4.5.1. I was able to run the full case without any issues. However, I then recompiled with debugging (i.e., I configured with "./configure -D"), and when I did that, WRF failed immediately. It didn't even run as long as your simulation did.

1) First, did you happen to make any modifications to your WRF code that could be causing the problem? Or is this pristine, out-of-the-box code? If you did make any modifications, can you try with unmodified code to see if it runs? If so, then you'll know there is something wrong with the modifications.

2) When you first ran into this issue, had you already built the code with a "-D" configuration? What happens if you just compile WRF without configuring with "-D"? Can you try that to see if it still stops in the same place?

Just so you know, I will be out of the office until Jan 8th, so if you need help during that time, you may need to start a new thread so that my colleague(s) will see it. You can point to this thread if you don't want to explain everything again. Otherwise, I'll look at this again when I return.
 
Hi again,
Just another update. I tried this test again with WRF v4.5.2, which was just released today, and even when I configured with "-D," your test runs to completion. Can you try this new version and see if it works for you? Thanks!
 
Good day, sir. Thank you very much for the assistance. I had been configuring WRF with "-D". After recompiling with a plain ./configure, I was finally able to run WRF successfully without the error mentioned above. I also tested the latest version configured with -D, but that still did not work.
 