SIGSEGV: Segmentation fault - invalid memory reference

Topics specifically related to the wrf.exe program
sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Wed Oct 09, 2019 6:15 pm

Dear Colleagues,

I am running the WRF model on a cluster using Open MPI, with 400-440 CPUs per run. The model runs for the first 10 minutes of simulation and then stops with "Program received signal SIGSEGV: Segmentation fault - invalid memory reference."

Please find attached my rsl* and namelist files.
You can download my wrfinput and wrfbdy files from:

https://figshare.com/s/2fb8d792e4099bb512ba


I assume that this issue could be due to the number of CPUs used.

Best regards,
Artur
Attachments
rsl.out.0000.txt
(5.73 MiB) Downloaded 55 times
rsl.error.0000.txt
(5.74 MiB) Downloaded 53 times
namelist.input
(6.47 KiB) Downloaded 57 times

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Thu Oct 10, 2019 3:02 pm

Artur,
Yes, it's very possible that you're using too many processors. Take a look here for information regarding choosing an appropriate number of processors, based on your domain size:
https://forum.mmm.ucar.edu/phpBB3/viewt ... =73&t=5082
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Thu Oct 10, 2019 7:52 pm

Hi,

Could you kindly compute the number of processors needed for my case, based on the namelist.input file I attached previously?

Best regards,
Artur

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Fri Oct 11, 2019 9:12 pm

Hi,

I do not believe that this is an issue of processor count, because I tried running the model with both low and high numbers of processors. Yes, there is a limit on the number of processors, and exceeding it results in an error saying that too many processors are used. However, below this limit WRF shows other errors:
Timing for Writing wrfout_d03_2018-08-10_00:10:00 for domain 3: 0.60426 elapsed seconds

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x2B3B825D1347
#1 0x2B3B825D194E
#2 0x2B3B830315CF
#3 0x1E74EA0 in taugb3.6163 at module_ra_rrtmg_lw.f90:?
#4 0x1E98989 in __rrtmg_lw_taumol_MOD_taumol
#5 0x1EB0826 in __rrtmg_lw_rad_MOD_rrtmg_lw
#6 0x1EC4AB2 in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7 0x18AB82E in __module_radiation_driver_MOD_radiation_driver
#8 0x19A5D97 in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9 0x131DA8F in solve_em_
#10 0x11ACE15 in solve_interface_
#11 0x477ED2 in __module_integrate_MOD_integrate
#12 0x4784B3 in __module_integrate_MOD_integrate
#13 0x4784B3 in __module_integrate_MOD_integrate
#14 0x407693 in __module_wrf_top_MOD_wrf_run
I cannot understand what the problem is.
I reduced the simulation domains.

Please find attached my rsl and namelist files.
I am also submitting my wrfinput and wrfbdy files, which you can download from:
https://figshare.com/s/b6c5cc6f0fb692d7acd5

Could you run WRF with my input and namelist files and see whether the problem also occurs on your side? Last time I used 56 CPUs to run wrf.exe.


Best regards,
Artur
Attachments
rsl.out.0000.txt
(5.73 MiB) Downloaded 29 times
rsl.error.0000.txt
(5.74 MiB) Downloaded 31 times
namelist.input
(6.47 KiB) Downloaded 44 times

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Thu Oct 17, 2019 5:54 pm

There may be additional problems, but until you're running with a more reasonable number of processors, it's difficult to track them down, because the model could be stopping first due to the number of processors. Based on the FAQ I sent you, you should be able to calculate that you need somewhere between ~12 and ~176 processors for your domain sizes. It's okay to use something closer to the upper end to run faster. Can you try a number smaller than 177? (I'm not sure how many processors you have available per node, but you can do the math to determine the best number of nodes to use.)
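As a rough illustration of where the upper bound comes from, here is a minimal sketch. It assumes the rule of thumb in the linked FAQ is (e_we/25) * (e_sn/25) for the largest recommended processor count and (e_we/100) * (e_sn/100) for the smallest. That formula is not spelled out in this thread, so treat it as an assumption. The 364 x 304 grid is the d03 size quoted later in the thread, and the function names are hypothetical.

```python
def max_processors(e_we: int, e_sn: int) -> float:
    """Assumed upper bound: tiles no smaller than about 25 x 25 grid cells."""
    return (e_we / 25) * (e_sn / 25)


def min_processors(e_we: int, e_sn: int) -> float:
    """Assumed lower bound: tiles no larger than about 100 x 100 grid cells."""
    return (e_we / 100) * (e_sn / 100)


# d03 grid size as quoted later in this thread (364 x 304).
upper = max_processors(364, 304)
print(f"d03 upper bound: about {upper:.1f} processors")
```

Under that assumption, the 364 x 304 domain gives roughly 177, which is consistent with the advice above to try a number smaller than 177. The lower bound would normally be computed from the largest domain, whose size is not given in the thread.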

Please also set debug_level = 0. We have taken this option out of the default namelists because it doesn't typically provide any useful information and just adds a lot of junk to the rsl files, making them nearly impossible to read through. It can also sometimes make the rsl files so large that they use up all disk space, causing segmentation faults.

With those 2 updates, if you are still getting failures, please send your new rsl* files. Please package all of your rsl files together into one *.TAR file and attach that here. I am unable to open *.rar files, so please make sure they are in *.tar format. Thanks!
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Sun Oct 20, 2019 8:35 am

Thanks for your reply!

I am attaching the rsl.error and namelist files for my run using 112 CPUs. Please also see the attached Fortran files that are referenced in the rsl.error file. I can use up to 400 CPUs located in 20 slots. However, when I increase the number of CPUs beyond 124, WRF complains that too many CPUs are used.

Best regards,
Artur
Attachments
rsl.out.0000_112CPU.txt
(1.24 KiB) Downloaded 37 times
rsl.error.0000_112CPU.txt
(2.29 KiB) Downloaded 32 times
namelist.input
(6.46 KiB) Downloaded 46 times
module_mp_nssl_2mom.f90
(429.31 KiB) Downloaded 30 times
module_physics_init.f90
(216.74 KiB) Downloaded 29 times
start_em.f90
(109.01 KiB) Downloaded 30 times
start_domain.f90
(9.23 KiB) Downloaded 31 times
mediation_wrfmain.f90
(181.64 KiB) Downloaded 29 times
wrf.f90
(219 Bytes) Downloaded 29 times

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Mon Oct 21, 2019 6:09 pm

Hi,
Decomposition is determined by the 2 closest factors of the number of processors. So if you choose 124, the 2 closest factors are 4 and 31. Your d03 is 364 x 304. Therefore, the model will assign 4 tiles in the x-direction and 31 in the y-direction. This means that in the x-direction you will have 364/4 = 91 grid cells per tile, but in the y-direction you will have 304/31 = 9.8 grid cells per tile. The rule is that you cannot have fewer than 10 grid cells per tile, so this value violates that rule and gives an error.
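The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not the actual WRF decomposition code: the function names are hypothetical, and assigning the smaller factor to the x-direction is an assumption taken from the 124-processor example.

```python
import math


def closest_factors(nprocs: int) -> tuple[int, int]:
    """Return the factor pair (small, large) of nprocs closest to a square."""
    for small in range(math.isqrt(nprocs), 0, -1):
        if nprocs % small == 0:
            return small, nprocs // small
    raise ValueError("nprocs must be a positive integer")


def cells_per_tile(e_we: int, e_sn: int, nprocs: int) -> tuple[float, float]:
    """Grid cells per tile in x and y, assuming the smaller factor goes to x."""
    nx, ny = closest_factors(nprocs)
    return e_we / nx, e_sn / ny


MIN_CELLS = 10  # minimum grid cells per tile, as stated in the post above

# Processor counts mentioned in this thread, applied to the 364 x 304 d03.
for nprocs in (112, 124, 160):
    x_cells, y_cells = cells_per_tile(364, 304, nprocs)
    ok = min(x_cells, y_cells) >= MIN_CELLS
    print(nprocs, closest_factors(nprocs), round(x_cells, 1), round(y_cells, 1), ok)
```

With these assumptions, 124 processors splits into 4 x 31 and fails the 10-cell check in y, while 112 (8 x 14) and 160 (10 x 16) pass it.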

As for the floating-point exception, to debug this, you may need to put in some print statements to see what kind of values you are getting. It looks like the last place the model goes is in the module_mp_nssl_2mom.f90 file, line 866. You will need to find that corresponding line in the module_mp_nssl_2mom.F file (it could be different) and put in some prints to see what kind of values it's giving for variables in that line (or perhaps just above or below it). You'll then need to recompile the code (but do not need to issue a "clean -a" or reconfigure - just simply recompile, and it should be fairly quick). When you do this, it will write a new module_mp_nssl_2mom.f90 file (which is why you don't want to put the modifications in that .f90 file). You can then run the model again to see if you see anything useful in the rsl.out.* files, perhaps giving bad values that may be causing the error.
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Tue Oct 22, 2019 1:28 pm

Thanks for your reply! Before recompiling, I would like to clarify how to add print statements to the module_mp_nssl_2mom.F file.

Here are the lines around line 866 of the module_mp_nssl_2mom.f90 file, where the error occurs:
855 ccn = Abs( nssl_params(1) )
856 alphah = nssl_params(2)
857 alphahl = nssl_params(3)
858 cnoh = nssl_params(4)
859 cnohl = nssl_params(5)
860 cnor = nssl_params(6)
861 cnos = nssl_params(7)
862 rho_qh = nssl_params(8)
863 rho_qhl = nssl_params(9)
864 rho_qs = nssl_params(10)
865
866 IF ( Nint(nssl_params(13)) == 1 ) THEN
867
868
869 turn_on_ccna = .true.
870 irenuc = 7
871
872 ENDIF


Here are the corresponding lines in module_mp_nssl_2mom.F:
! set some global values from namelist input
!

ccn = Abs( nssl_params(1) )
alphah = nssl_params(2)
alphahl = nssl_params(3)
cnoh = nssl_params(4)
cnohl = nssl_params(5)
cnor = nssl_params(6)
cnos = nssl_params(7)
rho_qh = nssl_params(8)
rho_qhl = nssl_params(9)
rho_qs = nssl_params(10)

IF ( Nint(nssl_params(13)) == 1 ) THEN
! hack to switch CCN field to CCNA (activated ccn)
! invertccn = .true.
turn_on_ccna = .true.
irenuc = 7

ENDIF

So, could you kindly suggest where and what I should add (or change) in the module_mp_nssl_2mom.F file before recompiling WRF?

Best regards,
Artur

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Tue Oct 22, 2019 9:06 pm

Perhaps it will be easier for me to try to repeat your error. If you're able to attach the wrfbdy_d01 and wrfinput* files, please do so. The files are likely too large to attach, though, so please see the home page of this forum for instructions on submitting large data files.
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Wed Oct 23, 2019 6:37 am

Please download the files (tar_wrf.tar.gz) from:

https://figshare.com/s/690174d337d030adf068

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Mon Oct 28, 2019 11:45 am

Hello,

I would like to update this issue. I have recompiled the WRF model on the cluster I use, with the Intel Fortran compiler.
When I use two domains in my simulations (9 km-3 km), WRF runs successfully. The problem occurs when I add d03, i.e. 9 km-3 km-1 km. Please note that I use the same model configuration and a 3:1 parent/nest ratio in both the double and triple nesting experiments. The namelist_d02.input and namelist_d03.input files are attached.
I use ECMWF HRES data (interpolated to a 0.08 x 0.08 deg. grid) to run my simulations.

real.exe runs successfully and creates the wrfinput* and wrfbdy files. You can download the wrfinput and wrfbdy files here:

https://figshare.com/s/0eb04a1f37d3ab212210


Then, the wrf.exe produces the first files:

-rw-r--r-- 1 gevorgya root 787565420 Oct 26 22:07 wrfout_d01_2018-06-24_00:00:00
-rw-r--r-- 1 gevorgya root 726924644 Oct 26 22:07 wrfout_d02_2018-06-24_00:00:00
-rw-r--r-- 1 gevorgya root 676866260 Oct 26 22:07 wrfout_d03_2018-06-24_00:00:00

and stops at the next forecast time-step after writing the file:
-rw-r--r-- 1 gevorgya root 676866260 Oct 26 22:08 wrfout_d03_2018-06-24_00:10:00

I reduced the model time step to 3*dx = 27 s (namelist_d03.input) in order to avoid CFL issues. However, my simulation domains are characterized by steep mountain topography (the d03 topography map is attached). Therefore, I think that the model becomes unstable at 1 km spatial resolution and stops. Please also find attached the rsl* files from my last run using 160 CPUs. I have tested different numbers of CPUs, from 60 to 400, but the same error appears.

Do you think that the issue is related to model instability at high resolution (1 km)?
If so, how can I cope with this problem?

Best regards,
Artur
Attachments
rsl.error.0000.txt
(32.54 KiB) Downloaded 39 times
rsl.out.0000.txt
(30.51 KiB) Downloaded 32 times
namelist_d03.input
(6.46 KiB) Downloaded 40 times
namelist_d02.input
(6.21 KiB) Downloaded 34 times
Last edited by sekluzia on Wed Dec 02, 2020 10:11 pm, edited 1 time in total.

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Fri Nov 01, 2019 9:54 pm

Hi,
Okay, I am able to repeat your problem, and I'm also able to repeat the problem if I use my own prepared input data. I've at least tracked the problem down to the NSSL microphysics scheme you are using. When I change to something that is used a bit more often, and highly-tested, like Thompson (option 8), the run progresses without any problem. However, there was an additional physics modification I had to make when working with the input data you provided to me. I had to change from the Noah LSM to the Noah-MP LSM. Once I did that, I was able to run with your data. Would you be willing to change those two schemes, or do you absolutely need to use NSSL and Noah?

A couple other things to note:
1) Although changing the microphysics helped to get past the initial problem, I still ran into CFL errors. However, if you add the following 2 options to your namelist and rerun real.exe, this helps to get rid of those errors:


&domains
smooth_cg_topo = .true.

&dynamics
epssm = 0.2, 0.2, 0.2 
2) When setting time_step, it's best to choose a value that divides evenly into rounded time intervals. I changed your time step from 27 to 30, which allows output at nice intervals and still works okay.
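To illustrate point 2, here is a minimal sketch that checks whether an output interval is a whole number of model time steps. The 10-minute (600 s) interval is an assumption inferred from the wrfout file names earlier in the thread (00:00:00 and 00:10:00). The actual history_interval in the namelist may differ, and the function name is hypothetical.

```python
def divides_evenly(interval_s: int, time_step_s: int) -> bool:
    """True if an output interval is a whole number of model time steps."""
    return interval_s % time_step_s == 0


# Assumed 10-minute output interval, inferred from the wrfout file names.
HISTORY_INTERVAL_S = 600

for time_step in (27, 30):
    steps = HISTORY_INTERVAL_S / time_step
    print(time_step, divides_evenly(HISTORY_INTERVAL_S, time_step), round(steps, 2))
```

With a 27 s step, 600 s is not a whole number of steps (about 22.22), whereas a 30 s step fits exactly 20 times.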
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Mon Nov 04, 2019 6:31 pm

Hi,

Thanks a lot for your reply and helpful suggestions! Indeed, I also suspected that the problem might be the NSSL mp scheme. I had only tried changing from NSSL to the Morrison scheme (option 10), but the same problem occurred.
Using the NSSL scheme is crucial for me. A recent sensitivity study demonstrated that the NSSL scheme (option 22) best simulates convective rainfall over the mountainous terrain of Armenia (https://agupubs.onlinelibrary.wiley.com ... 17JD028247). Unfortunately, the Thompson scheme you mentioned (option 8) showed worse results. Therefore, I really need to use the NSSL scheme (option 22) in my simulations. Can we do that?

Regarding the Noah-MP LSM scheme, I tested it and I am very happy with it, especially after the improvements implemented in version 4.1! By the way, can you suggest any additional Noah-MP options that are worth testing to further improve the simulation of convective rainfall, hailstorms, etc.? I would much appreciate it!

Looking forward to your feedback!
Artur

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Mon Nov 11, 2019 6:27 pm

Hi,

I ran the WRF model again using your suggestions on Noah-MP, the time step, and the following namelist options:

&domains
smooth_cg_topo = .true.

&dynamics
epssm = 0.2, 0.2, 0.2

Now, the WRF runs at 1 km spatial resolution with the NSSL mp scheme.

Thank you so much for your help!

Artur

kwerner
Posts: 2287
Joined: Wed Feb 14, 2018 9:21 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by kwerner » Thu Nov 14, 2019 12:16 am

Great! Thanks for updating the topic.
NCAR/MMM

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Wed Nov 27, 2019 6:12 pm

Hi,

I have to get back to this issue.
When I run other WRF simulations covering slightly extended domains, and thus a larger number of grid points (the namelist.input file is attached), wrf.exe again shows errors (the rsl.error and rsl.out files are attached). I am using 160 CPUs in my run. You can download my WRF input files from:

https://figshare.com/s/46d03be6c5415c6f69c4

Is this the same error associated with the NSSL mp scheme? If so, how can I solve it?

Best regards,
Artur
Attachments
rsl.error.0000.txt
(21.74 KiB) Downloaded 18 times
rsl.out.0000.txt
(19.84 KiB) Downloaded 10 times
namelist.input
(6.54 KiB) Downloaded 27 times

Ming Chen
Posts: 1456
Joined: Mon Apr 23, 2018 9:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by Ming Chen » Mon Dec 02, 2019 5:52 pm

Can you increase the value of epssm to 0.5 or even larger (e.g. 0.9), then try again?
WRF Help Desk

sekluzia
Posts: 187
Joined: Mon Oct 15, 2018 12:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by sekluzia » Sun Dec 08, 2019 8:22 pm

Hello Ming,

Thanks a lot, that helps!

Artur

mrasel
Posts: 2
Joined: Sat Feb 06, 2021 11:20 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by mrasel » Fri May 14, 2021 8:56 pm

Hello,

I'm using WRF-3.9.1.1. I'm having the following error:

d01 2020-06-22_04:47:00 1 points exceeded cfl=2 in domain d01 at time 2020-06-22_04:47:00 hours
d01 2020-06-22_04:47:00 MAX AT i,j,k: 269 195 8 vert_cfl,w,d(eta)= 2.05959916 0.637236774 2.30002403E-03


Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7F7998EBE697
#1 0x7F7998EBECDE
#2 0x7F79983B93FF
#3 0x1B66BE7 in taugb3.5950 at module_ra_rrtmg_lw.f90:?
#4 0x1B87B19 in __rrtmg_lw_taumol_MOD_taumol
#5 0x1B9F86B in __rrtmg_lw_rad_MOD_rrtmg_lw
#6 0x1BB31DC in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7 0x16C6199 in __module_radiation_driver_MOD_radiation_driver
#8 0x17B0BCC in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9 0x11B1ED9 in solve_em_
#10 0x108920A in solve_interface_
#11 0x47234A in __module_integrate_MOD_integrate
#12 0x407C73 in __module_wrf_top_MOD_wrf_run

I used 80 cores on 4 nodes for my run. I'm running 1 month of NARR data. However, WRF crashes on day 22. I've checked all output files using ncdump and found that all the time steps are there.

Please see the attached rsl.error and namelist files and help me out.
Attachments
rsl.error.0061.txt
(50.28 KiB) Not downloaded yet
rsl.error.0060.txt
(265.44 KiB) Not downloaded yet
rsl.error.0068.txt
(49.89 KiB) Not downloaded yet
rsl.error.0053.txt
(50.28 KiB) Not downloaded yet
rsl.error.0052.txt
(218.12 KiB) Not downloaded yet
namelist.input.txt
(6.16 KiB) Downloaded 5 times

Ming Chen
Posts: 1456
Joined: Mon Apr 23, 2018 9:42 pm

Re: SIGSEGV: Segmentation fault - invalid memory reference

Post by Ming Chen » Tue May 18, 2021 5:05 pm

Your namelist.input looks fine. All settings are reasonable. If the model crashes after 22 days of integration, it often indicates that some physical process went wrong. The CFL violation shows that the model cannot integrate stably.

I would suggest you reduce the time step to 48, then restart the run from the most recent time for which a wrfrst file is available. Hopefully the model can get past the crashing point. Please save wrfrst files as frequently as possible, for example with restart_interval = 60, so that if the model crashes again, you can restart from a time close to the crash and save wrfout at every time step. Then please check these wrfout files to find when and where something first goes wrong.
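Before going through the wrfout files, it may also help to find the earliest CFL message in the rsl files. The following is a minimal sketch, assuming the messages follow the format quoted earlier in this thread ("d01 2020-06-22_04:47:00 1 points exceeded cfl=2 ..."). The regular expression and function name are hypothetical, and other WRF versions may word the message differently.

```python
import re
from datetime import datetime

# Matches lines such as:
# d01 2020-06-22_04:47:00 1 points exceeded cfl=2 in domain d01 at time ...
CFL_LINE = re.compile(
    r"^\s*(d\d+)\s+(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2})\s+(\d+)\s+points exceeded cfl"
)


def first_cfl_violation(lines):
    """Return (time, domain, n_points) for the earliest CFL message, or None."""
    hits = []
    for line in lines:
        match = CFL_LINE.match(line)
        if match:
            domain, stamp, n_points = match.groups()
            when = datetime.strptime(stamp, "%Y-%m-%d_%H:%M:%S")
            hits.append((when, domain, int(n_points)))
    return min(hits) if hits else None


# Two lines copied from earlier posts in this thread.
sample = [
    "Timing for Writing wrfout_d03_2018-08-10_00:10:00 for domain 3: 0.60426 elapsed seconds",
    "d01 2020-06-22_04:47:00 1 points exceeded cfl=2 in domain d01 at time 2020-06-22_04:47:00 hours",
]
print(first_cfl_violation(sample))
```

To scan a real file, pass an open file object instead of the sample list, for example `first_cfl_violation(open("rsl.error.0060"))`, and compare the result across the rsl.error.* files to see which tile reports the problem first.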
WRF Help Desk
