Segmentation Fault for WRF simulation

mlwasserstein

Hello:

I am attempting to run a WRF-LES simulation with 4 nested domains and an inner domain resolution of 333 m. I previously had no issues with this simulation when I used 3 nested domains, but when I add a fourth with the highest resolution, I continue to run into a segmentation fault at hour 5 in the simulation.

The error information (in the file rsl.error.0007) reads:

0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000001fdb7f2 module_mp_thompson_mp_mp_thompson_() ???:0
2 0x0000000001fc5746 module_mp_thompson_mp_mp_gt_driver_() ???:0
3 0x0000000001d03d91 module_microphysics_driver_mp_microphysics_driver_() ???:0
4 0x00000000016d04e7 solve_em_() ???:0
5 0x0000000001505038 solve_interface_() ???:0
6 0x00000000005b8c41 module_integrate_mp_integrate_() ???:0
7 0x00000000005b925e module_integrate_mp_integrate_() ???:0
8 0x00000000005b925e module_integrate_mp_integrate_() ???:0
9 0x00000000005b925e module_integrate_mp_integrate_() ???:0
10 0x0000000000416451 module_wrf_top_mp_wrf_run_() ???:0
11 0x000000000041640f MAIN__() ???:0
12 0x00000000004163a2 main() ???:0
13 0x000000000003ad85 __libc_start_main() ???:0
14 0x00000000004162ae _start() ???:0


I am not totally sure how to interpret this, but the issue seems to be related to the microphysics (Thompson). It is also strange that I am able to get over 5 hours of the simulation but *then* run into the segmentation fault. I would appreciate any insight to help me figure out what is causing this issue.
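One way to read a backtrace like the one above: the frames are listed innermost first, so the first frame that names a WRF module shows which routine was running when the fault occurred. Here that symbol appears to point at the Thompson microphysics routine. The following is a minimal Python sketch, assuming each frame line has the form `index address symbol() location` as in the excerpt. `parse_frames` and `first_wrf_frame` are hypothetical helper names, not part of WRF.

```python
import re

# Matches lines such as:
#   1 0x0000000001fdb7f2 module_mp_thompson_mp_mp_thompson_() ???:0
FRAME_RE = re.compile(r"^\s*(\d+)\s+(0x[0-9a-fA-F]+)\s+(\S+?)\(\)\s+(\S+)\s*$")

def parse_frames(text):
    """Return (index, address, symbol) tuples for each backtrace line found."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3)))
    return frames

def first_wrf_frame(frames):
    """Return the innermost frame whose symbol starts with 'module_', or None."""
    for frame in frames:
        if frame[2].startswith("module_"):
            return frame
    return None

# First three frames copied from the rsl.error.0007 excerpt above.
sample = """\
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000001fdb7f2 module_mp_thompson_mp_mp_thompson_() ???:0
2 0x0000000001fc5746 module_mp_thompson_mp_mp_gt_driver_() ???:0
"""
print(first_wrf_frame(parse_frames(sample)))
```

The addresses carry no line information (`???:0`), so this only narrows the fault down to a routine, not to a statement within it.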

I have attached my namelist.input file and the rsl.error.0007 file.

Thanks for your help,
Michael
 

Attachments

  • namelist.input.txt
    4.4 KB · Views: 6
  • rsl.error.0007.txt
    18.5 KB · Views: 4
I looked at your namelist.input. You have the option below:

km_opt = 4, 4, 4, 3,

Is there any special reason that you set km_opt=3 for D04?

You also turned off the PBL scheme in D04, so LES is active for this domain. Note, however, that dx = 333 m is a typical grey-zone resolution, at which neither LES nor a PBL scheme works well. This is possibly why the case failed.

I would suggest that you turn on the PBL scheme for D04 and set km_opt = 4, 4, 4, 4. I hope this leads to a successful run. Please let me know whether it works.
 
Thanks for your reply and suggestion. I gave this a try, and the simulation successfully ran for 11 hours of simulation time (rather than the 5 hours I had previously run for), but I still get the same segmentation fault. I have attached the rsl.error.0022.txt file, which has information about this segmentation fault.

-Michael
 

Attachments

  • rsl.error.0022.txt
    26.1 KB · Views: 4
Hi Michael,
It's possible that you need to use more processors to run this. It may also help if your decomposition were more even. It's currently 8x16, but if you could use 16x16, or even 32x32, that may work better. Take a look at "Choosing an Appropriate Number of Processors", which discusses decomposition. If you use more processors and it still fails, please package all of your rsl* files into a single *.tar file and attach that file, as well as your latest namelist.input file. Thanks!
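To illustrate what a "more even" decomposition means: for a given total number of MPI tasks, the candidate layouts are the factor pairs of that total, and the pair closest to square is the most even. Below is a short Python sketch. The helper names `factor_pairs` and `most_even` are hypothetical; `nproc_x` and `nproc_y` are the namelist variables in `&domains` that set the layout explicitly.

```python
def factor_pairs(total_tasks):
    """Return all (nproc_x, nproc_y) pairs with nproc_x <= nproc_y
    whose product equals total_tasks."""
    pairs = []
    x = 1
    while x * x <= total_tasks:
        if total_tasks % x == 0:
            pairs.append((x, total_tasks // x))
        x += 1
    return pairs

def most_even(total_tasks):
    """Return the factor pair whose two sides are closest together."""
    return min(factor_pairs(total_tasks), key=lambda p: p[1] - p[0])

# 128 tasks corresponds to the current 8x16 layout; 256 and 1024
# correspond to the 16x16 and 32x32 layouts suggested above.
for n in (128, 256, 1024):
    print(n, most_even(n))
```

With 128 tasks the most even layout available is already 8x16, so a squarer decomposition requires changing the task count, for example to 256.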
 
Thanks for your reply. I've tried a 16x16 decomposition and still get a failure. This run, however, did last 13 hours, which is the longest I've gotten so far. I've packaged the rsl* files into a .tar file and attached it, along with my latest namelist.input file.
 

Attachments

  • namelist.input.txt
    4.4 KB · Views: 2
  • rsl_files_wrf_03.tar.bz2
    5.5 MB · Views: 1
Thanks for doing that test. I have a couple more tests for you to try.
1) You may simply need even more processors, since each increase so far has let the run get further. Can you try 32x32 (1024 in total) to see if you get further?
2) If not, can you test just running a single domain to see if it still stops? If that runs to completion, add d02 and try again. Then add d03, then d04. This may help us to narrow down which domain may be causing the issue (if one is).
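The one-domain-at-a-time test in item 2 amounts to changing `max_dom` in the `&domains` section of namelist.input between runs. Below is a minimal Python sketch that rewrites that one entry, assuming `max_dom` appears exactly once in the form `max_dom = N`. `set_max_dom` is a hypothetical helper, and the sample text is illustrative, not taken from the attached namelist.

```python
import re

def set_max_dom(namelist_text, n):
    """Return namelist_text with the single max_dom entry set to n."""
    pattern = re.compile(r"^(\s*max_dom\s*=\s*)\d+", flags=re.MULTILINE)
    # A function replacement avoids ambiguity between the group
    # reference and the digits of n.
    new_text, count = pattern.subn(lambda m: m.group(1) + str(n), namelist_text)
    if count != 1:
        raise ValueError(f"expected exactly one max_dom entry, found {count}")
    return new_text

# Illustrative fragment only; a real &domains section has many more entries.
sample = """&domains
 max_dom = 4,
/
"""

for n in (1, 2, 3, 4):
    print(set_max_dom(sample, n).splitlines()[1])
```

The per-domain columns in the rest of the namelist can stay as they are, since WRF reads only the first `max_dom` values of each entry.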
 
Thanks for your response.

1) I tried 32x32 and got a new error, which occurs almost immediately after the run begins:

Code:
*************************************
  Domain # 1: dx =  9000.000 m
  Domain # 2: dx =  3000.000 m
  Domain # 3: dx =  1000.000 m
  Domain # 4: dx =   333.333 m
   For domain            1 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
   Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
  e_we =   260, nproc_x =   32, with cell width in x-direction =    8
  e_sn =   250, nproc_y =   32, with cell width in y-direction =    7
  --- ERROR: Reduce the MPI rank count, or redistribute the tasks.
   For domain            2 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
   Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
  e_we =   220, nproc_x =   32, with cell width in x-direction =    6
  e_sn =   244, nproc_y =   32, with cell width in y-direction =    7
  --- ERROR: Reduce the MPI rank count, or redistribute the tasks.
   For domain            4 , the domain size is too small for this many processors, or the decomposition aspect ratio is poor.
   Minimum decomposed computational patch size, either x-dir or y-dir, is 10 grid cells.
  e_we =   298, nproc_x =   32, with cell width in x-direction =    9
  e_sn =   298, nproc_y =   32, with cell width in y-direction =    9
  --- ERROR: Reduce the MPI rank count, or redistribute the tasks.
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:    2793
NOTE:       1 namelist settings are wrong. Please check and reset these options
-------------------------------------------

2) I've previously had success running nested simulations with 3 domains (d01, d02, and d03). But when I add this additional domain, I get issues.
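The patch widths reported in the error under item 1 match integer division of each domain dimension by the number of tasks in that direction (for example, 260 // 32 = 8). The following Python sketch reproduces the check for the domain sizes quoted in the log. The minimum of 10 cells comes from the message itself; the floor-division formula is an assumption inferred from those numbers, not taken from the WRF source. Domain 3 is left out because its dimensions do not appear in the log.

```python
MIN_PATCH = 10  # minimum patch size stated in the error message

# (e_we, e_sn) for the domains whose sizes appear in the log above.
domains = {1: (260, 250), 2: (220, 244), 4: (298, 298)}

def patch_widths(e_we, e_sn, nproc_x, nproc_y):
    """Approximate per-task patch width in x and y (assumed floor division)."""
    return e_we // nproc_x, e_sn // nproc_y

def too_small(domains, nproc_x, nproc_y):
    """Return ids of domains whose patch is under MIN_PATCH cells in x or y."""
    bad = []
    for dom_id, (e_we, e_sn) in sorted(domains.items()):
        wx, wy = patch_widths(e_we, e_sn, nproc_x, nproc_y)
        if wx < MIN_PATCH or wy < MIN_PATCH:
            bad.append(dom_id)
    return bad

print(too_small(domains, 32, 32))  # the failing 32x32 layout
print(too_small(domains, 16, 16))  # the earlier 16x16 layout
```

Under this assumption the 32x32 layout fails for domains 1, 2, and 4, as in the log, while 16x16 leaves every listed domain with at least 13 cells per patch.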

Thanks for your help.
 
1) Apologies for that. I didn't realize the domain sizes would be too small for that test.
2) When you previously ran nested simulations with 3 domains, was it for this exact same namelist, dates, domain, input data, physics options, etc.? If not, can you try the test I mention above, just to see? Thanks.
 
Yes, when I previously ran nested simulations with 3 domains, it was for the exact same namelist.

As an update to this problem, I have now successfully run a simulation for the inner 333-m domain (d04) using nesting down (ndown.exe) and no PBL parameterization. My goal is to be able to do this using two-way nesting, but maybe this is a good start? Let me know if you have any additional thoughts about how I can tackle this problem using two-way nesting.

Thanks for your help,
Michael
 
Here is another update:

As a test, I made my inner 333-m domain very small (31 indices in the west-east direction and 31 in the south-north direction) and ran an LES simulation using two-way nesting. This simulation ran to completion with no segmentation faults. To me, this suggests that I'm running into some sort of memory issue when I run with my full domain, rather than a computational stability issue. As a next step, I'll try to reduce the size of my inner domain so that it resolves only the most important terrain features. Let me know if you have other thoughts about how I could proceed. I have attached the namelist file with the small inner domain that I used for this test.

-Michael
 

Attachments

  • namelist.input.txt
    4.4 KB · Views: 1
Michael,
I am sorry for the late reply; I was out of the office for the past two weeks.
I don't think the failure of your previous run was caused by a memory issue. There might be something wrong in the physics or dynamics. However, please try your new options and keep us updated on the results. As I mentioned previously, 333-m resolution is always a concern in WRF simulations. We hope to learn more from users' experience at this resolution.
 