SIGSEGV: Segmentation fault - invalid memory reference

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

sekluzia

Member
Dear Colleagues,

I am running the WRF model on a cluster using Open MPI, with 400-440 CPUs per run. The model runs for the first 10 minutes of simulation and then stops with "Program received signal SIGSEGV: Segmentation fault - invalid memory reference."

Please find attached my rsl* and namelist files.
You can download my wrfinput and wrfbdy files at:

https://figshare.com/s/2fb8d792e4099bb512ba


I assume that this issue could be due to the number of CPUs used.

Best regards,
Artur
 

Attachments

  • namelist.input
    6.5 KB · Views: 158
  • rsl.error.0000.txt
    5.7 MB · Views: 159
  • rsl.out.0000.txt
    5.7 MB · Views: 147
Artur,
Yes, it's very possible that you're using too many processors. Take a look here for information regarding choosing an appropriate number of processors, based on your domain size:
https://forum.mmm.ucar.edu/phpBB3/viewtopic.php?f=73&t=5082
 
Hi,

Can I kindly ask you to compute the number of necessary processors for my case based on my namelist.input file attached previously?

Best regards,
Artur
 
Hi,

I do not believe that this is an issue of processor numbers, because I tried running the model with both a low and a high number of processors. Yes, there is a limit on the number of processors, and exceeding it results in an error saying that too many processors are used. However, below this limit WRF shows other errors:

Timing for Writing wrfout_d03_2018-08-10_00:10:00 for domain 3: 0.60426 elapsed seconds

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2B3B825D1347
#1 0x2B3B825D194E
#2 0x2B3B830315CF
#3 0x1E74EA0 in taugb3.6163 at module_ra_rrtmg_lw.f90:?
#4 0x1E98989 in __rrtmg_lw_taumol_MOD_taumol
#5 0x1EB0826 in __rrtmg_lw_rad_MOD_rrtmg_lw
#6 0x1EC4AB2 in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7 0x18AB82E in __module_radiation_driver_MOD_radiation_driver
#8 0x19A5D97 in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9 0x131DA8F in solve_em_
#10 0x11ACE15 in solve_interface_
#11 0x477ED2 in __module_integrate_MOD_integrate
#12 0x4784B3 in __module_integrate_MOD_integrate
#13 0x4784B3 in __module_integrate_MOD_integrate
#14 0x407693 in __module_wrf_top_MOD_wrf_run

I am not able to understand what the problem is.
I have reduced the simulation domains.

Please, find attached my rsl and namelist files.
I am also submitting my wrfinput and wrfbdy files which you can download at:
https://figshare.com/s/b6c5cc6f0fb692d7acd5

Could you run WRF with these files and my namelist file to see whether the problem also occurs on your side? Last time I used 56 CPUs to run wrf.exe.


Best regards,
Artur
 

Attachments

  • namelist.input
    6.5 KB · Views: 111
  • rsl.error.0000.txt
    5.7 MB · Views: 80
  • rsl.out.0000.txt
    5.7 MB · Views: 72
There may be additional problems, but until you're running with a more reasonable number of processors, it's difficult to track them down, because the model could be stopping first due to the number of processors. Based on the FAQ I sent you, you should be able to calculate that you need somewhere between ~12 and ~176 processors for your domain sizes. It's okay to use something closer to the upper end to run faster. Can you try a number smaller than 177? (I'm not sure how many processors you have available per node, but you can do the math to determine the best number of nodes to use.)

Please also set debug_level = 0. We have taken this option out of the default namelists because it doesn't typically provide any useful information and just adds a lot of junk to the rsl files, making them nearly impossible to read through. It can also sometimes make the rsl files so large that they use up all disk space, causing segmentation faults.

With those 2 updates, if you are still getting failures, please send your new rsl* files. Please package all of your rsl files together into one *.TAR file and attach that here. I am unable to open *.rar files, so please make sure they are in *.tar format. Thanks!
 
Thanks for your reply!

I am attaching the rsl.error and namelist files for my run using 112 CPUs. Please also see the attached Fortran files that are referenced in the rsl.error file. I can use up to 400 CPUs located in 20 slots. However, when I increase the number of CPUs to more than 124, WRF complains that too many CPUs are used.

Best regards,
Artur
 

Attachments

  • wrf.f90
    219 bytes · Views: 69
  • mediation_wrfmain.f90
    181.6 KB · Views: 65
  • start_domain.f90
    9.2 KB · Views: 74
  • start_em.f90
    109 KB · Views: 72
  • module_physics_init.f90
    216.7 KB · Views: 69
  • module_mp_nssl_2mom.f90
    429.3 KB · Views: 70
  • namelist.input
    6.5 KB · Views: 104
  • rsl.error.0000_112CPU.txt
    2.3 KB · Views: 86
  • rsl.out.0000_112CPU.txt
    1.2 KB · Views: 86
Hi,
Decomposition is broken down by the 2 closest factors that make up the number of processors. So if you chose 124, then the 2 nearest factors are 4 and 31. Your d03 is 364 x 304. Therefore, the model will assign 4 tiles in the x-direction, and 31 in the y-direction. This means that in the x-direction, you will have 364/4 = 91 grid cells per tile. But in the y-direction, you will have 304/31 = 9.8 grid cells per tile. The rule is that you cannot have fewer than 10 grid cells per tile, so this value violates that rule and gives an error.
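
To make the arithmetic above easier to repeat for other processor counts, here is a minimal Python sketch. It only mirrors the explanation in this post (two closest factors, smaller one in the x-direction, at least 10 grid cells per tile) and is not WRF's actual decomposition code; the function names are made up for illustration, and only the d03 size quoted above is checked, whereas the real check may apply to every domain.

```python
import math

MIN_CELLS = 10  # minimum grid cells per tile, per the rule described in this post


def closest_factor_pair(nprocs):
    """Return the two factors of nprocs that are closest together, as (small, large)."""
    small = math.isqrt(nprocs)
    while nprocs % small != 0:
        small -= 1
    return small, nprocs // small


def cells_per_tile(nprocs, e_we, e_sn):
    """Grid cells per tile in x and y, assuming the smaller factor goes to x."""
    nx, ny = closest_factor_pair(nprocs)
    return e_we / nx, e_sn / ny


# d03 size quoted in this thread: 364 x 304
for nprocs in (112, 124):
    nx, ny = closest_factor_pair(nprocs)
    x_cells, y_cells = cells_per_tile(nprocs, 364, 304)
    ok = min(x_cells, y_cells) >= MIN_CELLS
    print(f"{nprocs} procs -> {nx} x {ny} tiles, "
          f"{x_cells:.1f} x {y_cells:.1f} cells per tile, ok={ok}")
```

With 124 processors this reproduces the 4 x 31 split and the 9.8 cells per tile in the y-direction mentioned above.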

As for the floating-point exception, to debug this, you may need to put in some print statements to see what kind of values you are getting. It looks like the last place the model goes is in the module_mp_nssl_2mom.f90 file, line 866. You will need to find that corresponding line in the module_mp_nssl_2mom.F file (it could be different) and put in some prints to see what kind of values it's giving for variables in that line (or perhaps just above or below it). You'll then need to recompile the code (but do not need to issue a "clean -a" or reconfigure - just simply recompile, and it should be fairly quick). When you do this, it will write a new module_mp_nssl_2mom.f90 file (which is why you don't want to put the modifications in that .f90 file). You can then run the model again to see if you see anything useful in the rsl.out.* files, perhaps giving bad values that may be causing the error.
 
Thanks for your reply! Before recompiling, I would like to clarify how to add print statements to the module_mp_nssl_2mom.F file.

I am showing the lines where the error occurs in the module_mp_nssl_2mom.f90 file, line 866:

855 ccn = Abs( nssl_params(1) )
856 alphah = nssl_params(2)
857 alphahl = nssl_params(3)
858 cnoh = nssl_params(4)
859 cnohl = nssl_params(5)
860 cnor = nssl_params(6)
861 cnos = nssl_params(7)
862 rho_qh = nssl_params(8)
863 rho_qhl = nssl_params(9)
864 rho_qs = nssl_params(10)
865
866 IF ( Nint(nssl_params(13)) == 1 ) THEN
867
868
869 turn_on_ccna = .true.
870 irenuc = 7
871
872 ENDIF



here are the corresponding lines in the module_mp_nssl_2mom.F:

! set some global values from namelist input
!

ccn = Abs( nssl_params(1) )
alphah = nssl_params(2)
alphahl = nssl_params(3)
cnoh = nssl_params(4)
cnohl = nssl_params(5)
cnor = nssl_params(6)
cnos = nssl_params(7)
rho_qh = nssl_params(8)
rho_qhl = nssl_params(9)
rho_qs = nssl_params(10)

IF ( Nint(nssl_params(13)) == 1 ) THEN
! hack to switch CCN field to CCNA (activated ccn)
! invertccn = .true.
turn_on_ccna = .true.
irenuc = 7

ENDIF


So, could you kindly suggest where and what I should add (or change) in the module_mp_nssl_2mom.F file before recompiling WRF?

Best regards,
Artur
 
Perhaps it will be easier for me to try to repeat your error. If you're able to attach the wrfbdy_d01 and wrfinput* files, please do so. The files are likely too large to attach, though, so please see the home page of this forum for instructions on submitting large data files.
 
Hello,

I would like to update this issue. I have recompiled the WRF model on the cluster I use, this time with the Intel Fortran compiler.
When I use two domains in my simulations (9km-3km) the wrf runs successfully. The problem occurs when I add the d03, i.e. 9km-3km-1km. Please note, I use the same model configuration and 3:1 parent/nest ratio in both double and triple nesting experiments. The namelist_d02.input and namelist_d03.input files are attached.
I use ECMWF HRES data (interpolated to 0.08x0.08 deg. grid) to run my simulations.

The real.exe is successfully run creating the wrfinput* and wrfbdy files. You can download the wrfinput and wrfbdy files here:

https://figshare.com/s/0eb04a1f37d3ab212210


Then, the wrf.exe produces the first files:

-rw-r--r-- 1 gevorgya root 787565420 Oct 26 22:07 wrfout_d01_2018-06-24_00:00:00
-rw-r--r-- 1 gevorgya root 726924644 Oct 26 22:07 wrfout_d02_2018-06-24_00:00:00
-rw-r--r-- 1 gevorgya root 676866260 Oct 26 22:07 wrfout_d03_2018-06-24_00:00:00

and stops at the next forecast time-step after writing the file:
-rw-r--r-- 1 gevorgya root 676866260 Oct 26 22:08 wrfout_d03_2018-06-24_00:10:00

I reduced the model time step to 3*dx = 27 s (namelist_d03.input) in order to avoid CFL issues. However, my simulation domains are characterized by steep mountain topography (the d03 topography map is attached). Therefore, I think that the model becomes unstable at 1 km spatial resolution and stops. Please also find attached the rsl* files from my last run using 160 CPUs. I have tested different numbers of CPUs, varying from 60 to 400, but the same error is shown.

Do you think that the issue is related to model instability in the high-resolution (1 km) simulation?
If yes, how can I deal with this problem?

Best regards,
Artur
 

Attachments

  • namelist_d02.input
    6.2 KB · Views: 96
  • namelist_d03.input
    6.5 KB · Views: 99
  • rsl.out.0000.txt
    30.5 KB · Views: 76
  • rsl.error.0000.txt
    32.5 KB · Views: 87
Hi,
Okay, I am able to repeat your problem, and I'm also able to repeat the problem if I use my own prepared input data. I've at least tracked the problem down to the NSSL microphysics scheme you are using. When I change to something that is used a bit more often, and highly-tested, like Thompson (option 8), the run progresses without any problem. However, there was an additional physics modification I had to make when working with the input data you provided to me. I had to change from the Noah LSM to the Noah-MP LSM. Once I did that, I was able to run with your data. Would you be willing to change those two schemes, or do you absolutely need to use NSSL and Noah?

A couple other things to note:
1) Although changing the microphysics helped to get past the initial problem, I did still run into problems with CFL errors. However, if you add the following 2 options to your namelist, and rerun real.exe, this helps to get rid of those errors:
Code:
&domains
smooth_cg_topo = .true.

&dynamics
epssm = 0.2, 0.2, 0.2

2) When setting a time_step, it's best to choose a value that divides evenly into round time intervals. I changed your time step from 27 to 30, which allows output at nice intervals, and still works okay.
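
As a quick illustration of point 2, the following Python sketch checks whether a time step divides evenly into an output interval. The 10-minute interval is an assumption inferred from the wrfout file times shown earlier in this thread (00:00:00 and 00:10:00), and the helper name is made up for illustration.

```python
def divides_evenly(time_step_s, interval_s):
    """True if a whole number of time steps fits exactly into the interval."""
    return interval_s % time_step_s == 0


# Assumed output interval: 10 minutes, based on the wrfout times in this thread.
history_interval_s = 10 * 60

results = {}
for time_step in (27, 30):
    results[time_step] = divides_evenly(time_step, history_interval_s)
    steps = history_interval_s / time_step
    print(f"time_step={time_step} s: {steps:.2f} steps per output interval, "
          f"even={results[time_step]}")
```

A 30 s step fits exactly 20 times into 10 minutes, whereas a 27 s step leaves a remainder.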
 
Hi,

Thanks a lot for your reply and helpful suggestions! Indeed, I also suspected that the problem might be the NSSL mp scheme. I had only tried switching from NSSL to the Morrison scheme (option 10), but the same problem occurred.
Being able to use the NSSL scheme is crucial. A recent sensitivity study demonstrated that the NSSL scheme (option 22) gives the best simulation of convective rainfall over the mountainous terrain of Armenia (https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2017JD028247). Unfortunately, the Thompson scheme you mentioned (option 8) showed worse results. Therefore, I really need to use the NSSL scheme (option 22) in my simulations. Can we do that?

Regarding the Noah-MP LSM scheme, I have tested it and I am very happy with it, especially after the improvements implemented in version 4.1! By the way, can you suggest any additional Noah-MP options that are worth testing to further improve the simulation of convective rainfall, hailstorms, etc.? I would much appreciate it!

Looking forward to your feedback!
Artur
 
Hi,

I tried again to run the WRF model using your suggestions on Noah-MP, the time step, and the following namelist options:

&domains
smooth_cg_topo = .true.

&dynamics
epssm = 0.2, 0.2, 0.2

Now, the WRF runs at 1 km spatial resolution with the NSSL mp scheme.

Thank you so much for your help!

Artur
 
Hi,

I have to get back to this issue.
When I run another WRF simulation covering slightly extended domains, and thus a larger number of grid points (the namelist.input file is attached), wrf.exe again shows errors (the rsl.error and rsl.out files are attached). I am using 160 CPUs in my run. You can download my WRF input files at:

https://figshare.com/s/46d03be6c5415c6f69c4

Is this the same error associated with the NSSL mp scheme? If yes, how can I solve it?

Best regards,
Artur
 

Attachments

  • namelist.input
    6.5 KB · Views: 61
  • rsl.out.0000.txt
    19.8 KB · Views: 32
  • rsl.error.0000.txt
    21.7 KB · Views: 43
Hello,

I'm using WRF-3.9.1.1. I'm having the following error:

d01 2020-06-22_04:47:00 1 points exceeded cfl=2 in domain d01 at time 2020-06-22_04:47:00 hours
d01 2020-06-22_04:47:00 MAX AT i,j,k: 269 195 8 vert_cfl,w,d(eta)= 2.05959916 0.637236774 2.30002403E-03


Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7F7998EBE697
#1 0x7F7998EBECDE
#2 0x7F79983B93FF
#3 0x1B66BE7 in taugb3.5950 at module_ra_rrtmg_lw.f90:?
#4 0x1B87B19 in __rrtmg_lw_taumol_MOD_taumol
#5 0x1B9F86B in __rrtmg_lw_rad_MOD_rrtmg_lw
#6 0x1BB31DC in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7 0x16C6199 in __module_radiation_driver_MOD_radiation_driver
#8 0x17B0BCC in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9 0x11B1ED9 in solve_em_
#10 0x108920A in solve_interface_
#11 0x47234A in __module_integrate_MOD_integrate
#12 0x407C73 in __module_wrf_top_MOD_wrf_run

I've used 80 cores on 4 nodes for my analysis. I'm running a 1-month simulation with NARR datasets. However, WRF crashes on day 22. I've checked all output files using ncdump and found all the time steps there.

Please see the attached rsl.error and namelist files and help me out.
 

Attachments

  • namelist.input.txt
    6.2 KB · Views: 64
  • rsl.error.0052.txt
    218.1 KB · Views: 20
  • rsl.error.0053.txt
    50.3 KB · Views: 21
  • rsl.error.0068.txt
    49.9 KB · Views: 16
  • rsl.error.0060.txt
    265.4 KB · Views: 21
  • rsl.error.0061.txt
    50.3 KB · Views: 20
Your namelist.input looks fine. All settings are reasonable. If the model crashes after 22 days of integration, it often indicates that some physical process went wrong. The CFL violation shows that the model cannot integrate stably.

I would suggest you reduce the time step to 48, then restart the run from the most recent time for which a wrfrst file is available. Hopefully the model can then get past the crashing point. Please save wrfrst files as frequently as possible, for example with restart_interval = 60. That way, if the model crashes again, you can restart from a time close to the crash and save wrfout every time step. Then please check these wrfout files and find when and where something first goes wrong.
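
Before looking at the wrfout files, the CFL messages in the rsl files can already hint at when and where the trouble starts. Here is a minimal Python sketch that pulls the time and grid location out of lines like the "MAX AT i,j,k" one quoted earlier in this thread. The regular expression is written only against that quoted line, so other WRF versions may format it differently, and the function name is made up for illustration.

```python
import re

# Pattern based on the "MAX AT i,j,k" line quoted in this thread (an assumption
# about the format; adjust it if your rsl files look different).
CFL_MAX = re.compile(
    r"^\s*(d\d+)\s+(\S+)\s+MAX AT i,j,k:\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"vert_cfl,w,d\(eta\)=\s*(\S+)\s+(\S+)\s+(\S+)"
)


def find_cfl_maxima(lines):
    """Return one dict per 'MAX AT i,j,k' line found in the given lines."""
    hits = []
    for line in lines:
        m = CFL_MAX.match(line)
        if m:
            domain, when, i, j, k, cfl, w, deta = m.groups()
            hits.append({
                "domain": domain, "time": when,
                "i": int(i), "j": int(j), "k": int(k),
                "vert_cfl": float(cfl), "w": float(w), "d_eta": float(deta),
            })
    return hits


# The two lines quoted in this thread; for a real file, pass open(path) instead.
sample = [
    "d01 2020-06-22_04:47:00 1 points exceeded cfl=2 in domain d01 at time 2020-06-22_04:47:00 hours",
    "d01 2020-06-22_04:47:00 MAX AT i,j,k: 269 195 8 vert_cfl,w,d(eta)= 2.05959916 0.637236774 2.30002403E-03",
]

hits = find_cfl_maxima(sample)
# WRF time stamps in this format sort correctly as plain strings.
first = min(hits, key=lambda h: h["time"])
print(first["domain"], first["time"], "at i,j,k =", first["i"], first["j"], first["k"])
```

Running this over every rsl.error.* file and keeping the earliest hit gives a starting grid point to inspect in the wrfout files.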
 