'BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES' during wrf.exe

kinguEnt

I successfully ran real.exe for four domains for the simulation period start_date = '2012-06-27_00:00:00', end_date = '2012-09-30_18:00:00'. However, when I start 'mpirun -np 8 ./wrf.exe', the following error message appears:


BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 84908 RUNNING AT negusu-OptiPlex-3060
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions



Can anyone suggest how to solve this?

Thanks
 

Attachments

  • error_files.zip
    22.1 KB · Views: 9
I have had the exact same issue for a few days now. I've searched the forum in multiple ways but have not found a solution. All other executables have run successfully on my system using 32 cores. My system is a Linux Ubuntu server (x86_64 GNU/Linux), CPU(s): 72, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz. Following the guidance on choosing an appropriate number of processors, I found that 32 is appropriate for my case, and both metgrid and real run successfully with "mpirun -np 32 ./ ". I only get the above issue with wrf.exe. I have installed the latest available version of the model, 4.5, for both WRF and WPS, and I use input and boundary data from GFS, with SST_FIXED also from GFS. I also used the Domain Wizard web tool to create my domains for WPS.
I've also noticed this:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

in 4 of my rsl.error.00* files (rsl.error.0007,rsl.error.0008,rsl.error.0016 and rsl.error.0023)

I ran version 4.2 of the model before, but I'm new to running version 4.5 without help, so any advice would be greatly appreciated.
Best,
Zoi
 

Attachments

  • error_files.tar
    1 MB · Views: 7
Hi,

I'm encountering the same issues and am currently working on optimizing my time step. I recommend decreasing your time step to see if that resolves the problem. This adjustment worked for me. For example, try 90 instead of 108. The same goes for kingu: try something lower than 162, such as 135. Here I am using 5*dx.

Also, when fine-tuning the time step for each domain, don't hesitate to use parent_time_step_ratio.

Best,

Vazquez Ballesta Manuarii
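
To make the arithmetic behind these numbers explicit, here is a minimal Python sketch. It assumes the usual WRF guidance that time_step (in seconds) should be at most 6*dx with dx in km, and compares that with the more conservative 5*dx suggested above. The 27 km spacing is kingu's d01; the 18 km spacing is only inferred from 108 = 6 * 18 and is not stated in the thread. `time_step_for` is a hypothetical helper name.

```python
def time_step_for(dx_m, factor=5):
    """Return a time step in seconds computed as factor * dx, with dx in km.

    dx_m is the grid spacing of the coarsest domain in metres,
    as it would appear in the dx entry of namelist.input.
    """
    return factor * dx_m / 1000.0


# 27000 m is kingu's d01 spacing; 18000 m is an assumption (inferred, not stated).
for dx_m in (27000, 18000):
    upper = time_step_for(dx_m, factor=6)   # common upper limit, 6*dx
    safer = time_step_for(dx_m, factor=5)   # more conservative, 5*dx
    print(f"dx = {dx_m / 1000:.0f} km: 6*dx = {upper:.0f} s, 5*dx = {safer:.0f} s")
```

A smaller factor only makes the run more conservative. It does not guarantee stability, so treat the result as a starting point.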
 
Thank you so much for your recommendation,
I changed the time step to 90 and the same error occurs. The only difference is that the error I mentioned above now appears in only 3 rsl.error.* files. I will try different time steps to see if that fixes it, as you suggested.
Kindly,
Zoi
 
OK, if you want an example: for my configuration, I found that a time step of 10 s helps (with a slightly different time step ratio, because the inner domains below 1 km require a smaller time step). I have the following in my namelist:

&domains
time_step = 10,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 5,
e_we = 106, 100, 100, 175, 169,
e_sn = 100, 100, 121, 178, 154,
e_vert = 46, 46, 46, 46, 46,
vert_refine_method = 0, 0, 0, 0, 0,

eta_levels(1:46) = 1.0000, 0.9987, 0.9974, 0.9962, 0.9949,
0.9924, 0.9899, 0.9859, 0.9809, 0.9759,
0.9709, 0.9659, 0.9606, 0.9520, 0.9427,
0.9326, 0.9219, 0.9077, 0.8932, 0.8769,
0.8656, 0.8574, 0.8462, 0.8351, 0.8235,
0.8113, 0.7958, 0.7756, 0.7494, 0.7133,
0.6742, 0.6323, 0.5876, 0.5406, 0.4915,
0.4409, 0.3895, 0.3379, 0.2871, 0.2378,
0.1907, 0.1465, 0.1056, 0.0682, 0.0332,
0.0000,


p_top_requested = 5000,
num_metgrid_levels = 38,
num_metgrid_soil_levels = 4,
dx = 9000, 3000, 1000, 333.333, 111.111,
dy = 9000, 3000, 1000, 333.333, 111.111,
grid_id = 1, 2, 3, 4, 5,
parent_id = 0, 1, 2, 3, 4,
i_parent_start = 1, 50, 30, 11, 75,
j_parent_start = 1, 35, 30, 27, 47,
parent_grid_ratio = 1, 3, 3, 3, 3,
parent_time_step_ratio = 1, 3, 3, 4, 3,
feedback = 1,
smooth_option = 0,
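
To see what these settings mean for each nest, here is a small Python sketch that derives the effective time step per domain from time_step, parent_id and parent_time_step_ratio. It assumes each nest runs at its parent's time step divided by its own ratio. `effective_time_steps` is a hypothetical helper, not part of WRF.

```python
from fractions import Fraction


def effective_time_steps(time_step, parent_id, parent_time_step_ratio):
    """Return the time step (seconds) of every domain as exact fractions.

    parent_id uses 1-based domain numbers; the entry for d01 is ignored.
    """
    steps = []
    for pid, ratio in zip(parent_id, parent_time_step_ratio):
        if not steps:
            # d01 runs at the namelist time_step itself
            steps.append(Fraction(time_step))
        else:
            # a nest runs at its parent's step divided by its own ratio
            steps.append(steps[pid - 1] / ratio)
    return steps


# Values copied from the &domains block above
steps = effective_time_steps(10, [0, 1, 2, 3, 4], [1, 3, 3, 4, 3])
for domain, dt in enumerate(steps, start=1):
    print(f"d{domain:02d}: {float(dt):.4f} s")
```

With these values the innermost domain ends up stepping at well under a tenth of a second, which is why runs with this many nests get expensive quickly.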
 
I have tried time steps of 108, 90, 72, 54, 36, 18, and even 10 s, and I keep getting the same error. So unfortunately I don't think it is a time step issue.
 
Hi @zoidimitriadou and @Manuarii
The error message "BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES" simply means the simulation failed for some reason. Even if the rsl files don't seem to reveal any specific errors, the reasons for your failures are likely all very different, so it would be best if each of you posts a new thread to discuss your issue if you are still experiencing it. Please make sure to include your namelist.input file and all of your rsl files in that post as well. Thank you, and I apologize for the inconvenience.
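
Before packaging the rsl files, it can help to check which of them contain an obvious error string, since the failing rank is often not rank 0. Below is a minimal Python sketch, assuming the files are named rsl.error.* and sit in the run directory. The search strings (SIGSEGV, FATAL, cfl) are only a guess at common markers, not an exhaustive list, and `scan_rsl` is a hypothetical helper name.

```python
import glob
import os
import re

# Assumed markers: a gfortran segfault, a WRF fatal error, a CFL warning.
MARKERS = re.compile(r"SIGSEGV|FATAL|cfl")


def scan_rsl(directory="."):
    """Map each rsl.error.* file to the (line number, text) pairs that match."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(directory, "rsl.error.*"))):
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if MARKERS.search(line):
                    hits.setdefault(path, []).append((lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for path, lines in scan_rsl().items():
        print(path)
        for lineno, text in lines:
            print(f"  {lineno}: {text}")
```

Files that produce no output here may still hold useful context, so please attach all of them rather than only the ones flagged.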
 
Thanks Manuarii for your recommendation.
However, I found that reducing the time_step does not resolve my problem.
 
Unfortunately, there is no alternative to adding processors. In the rsl* files you sent at the beginning, the model seems to stop almost immediately, so I don't see where it ran for 6 hours. Even so, this can still be caused by a lack of processors. Since your d01 is smaller than d03, you could try running a single-domain simulation to see if that fails. Although 8 processors is very few, I think it should still be able to handle d01's size. If that works, you could then add d02, and then d03, until you find which domain causes the failure. You could also try using smaller domains for all 4 domains to see if that runs. If it does, that points even more strongly to the number of processors being the issue.

Another thing I notice is that your d01 is using a resolution of 27km, which is probably too coarse, depending on the resolution of the input data you're using. What is the resolution of your input data?
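
As a rough illustration of how domain size relates to processor count, here is a Python sketch of the rule of thumb often quoted on this forum: at least (e_we/100)*(e_sn/100) processors for the largest domain, and at most (e_we/25)*(e_sn/25) for the smallest. The domain sizes in the example are hypothetical, not taken from your namelist, and `processor_range` is a made-up helper name.

```python
def processor_range(domains):
    """Return (fewest, most) processors for a list of (e_we, e_sn) pairs.

    fewest comes from the largest domain, most from the smallest one.
    """
    largest = max(domains, key=lambda d: d[0] * d[1])
    smallest = min(domains, key=lambda d: d[0] * d[1])
    fewest = (largest[0] / 100) * (largest[1] / 100)
    most = (smallest[0] / 25) * (smallest[1] / 25)
    return fewest, most


# Hypothetical domain sizes, for illustration only
fewest, most = processor_range([(100, 100), (200, 200)])
print(f"use roughly {fewest:.0f} to {most:.0f} processors")
```

This is only a guideline. If the lower bound for your largest domain is well above 8, that supports testing the domains one at a time as described above.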
 
Thank you kwerner.
I use the NCEP Final Analysis (GFS-FNL) with 1-degree spatial resolution.
 
Thanks. I suppose it then makes sense to use a 27 km parent domain; however, you may want to consider using higher-resolution input (e.g., GFS 0.25-degree data). It may not make much of a difference, but we typically advise using the highest-resolution option available.
 
Thanks, kwerner.
 