
SIGSEGV when increasing # of processors


svangorder

Hi,

wrf.exe runs successfully with 12, 20, or 30 processors, but crashes with a segmentation fault at 42 processors and above. Here is the error output from a run with 48 processors.

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
coawstM 000000000304AE25 Unknown Unknown Unknown
coawstM 0000000003048A47 Unknown Unknown Unknown
coawstM 0000000002FE8934 Unknown Unknown Unknown
coawstM 0000000002FE8746 Unknown Unknown Unknown
coawstM 0000000002F74A76 Unknown Unknown Unknown
coawstM 0000000002F7B550 Unknown Unknown Unknown
libpthread.so.0 00002AC3EA1C55E0 Unknown Unknown Unknown
coawstM 00000000029B618C Unknown Unknown Unknown
coawstM 00000000029B2205 Unknown Unknown Unknown
coawstM 00000000023AF30D Unknown Unknown Unknown
coawstM 0000000001BB173F Unknown Unknown Unknown
coawstM 000000000131D228 Unknown Unknown Unknown
coawstM 00000000011A7FD0 Unknown Unknown Unknown
coawstM 000000000050F807 Unknown Unknown Unknown
coawstM 000000000040CAC1 Unknown Unknown Unknown
coawstM 000000000040CA7F Unknown Unknown Unknown
coawstM 000000000040CA1E Unknown Unknown Unknown
libc.so.6 00002AC3EA3F3C05 Unknown Unknown Unknown
coawstM 000000000040C929 Unknown Unknown Unknown

And here is the error output with wrf.exe built with "configure -D":

forrtl: severe (408): fort: (3): Subscript #2 of the array ZO has value -858993425 which is less than the lower bound of 1

Image PC Routine Line Source
coawstM 000000000BF17136 Unknown Unknown Unknown
coawstM 0000000009F4FABB Unknown Unknown Unknown
coawstM 0000000009F2A29D Unknown Unknown Unknown
coawstM 00000000078181DD Unknown Unknown Unknown
coawstM 000000000469AA1F Unknown Unknown Unknown
coawstM 000000000360BC7A Unknown Unknown Unknown
coawstM 00000000030C647B Unknown Unknown Unknown
coawstM 000000000052DB2A Unknown Unknown Unknown
coawstM 000000000040D53D Unknown Unknown Unknown
coawstM 000000000040CA25 Unknown Unknown Unknown
coawstM 000000000040C9DE Unknown Unknown Unknown
libc.so.6 00002B21AC1D5C05 Unknown Unknown Unknown
coawstM 000000000040C8E9 Unknown Unknown Unknown

I think the problem occurs just after the first radiation time step, since the last output I get without "configure -D" is:

Timing for main: time 2010-04-13_00:00:30 on domain 1: 25.13916 elapsed seconds

I'm running WRF Version 4.0.3 with the COAWST modeling system.
ifort version 16.0.2

Any help you could give me would be greatly appreciated!

Regards,
Steve


View attachment namelist.input

View attachment wrf.err.cfg-D.txt

View attachment wrf.out.cfg-D.txt

View attachment wrf.err.txt

View attachment wrf.out.txt
 
Hi,
Do you also have each of the rsl.error.* and rsl.out.* files for the run with bounds checking on (configure -D)? Can you package those into one *.tar file and attach that? The individual files could possibly be more helpful than the one file with all the information in it. Thanks!
 
Hi Kelly,

Sorry to take so long responding. I was out of town for the holiday.

Unfortunately, I don't have individual rsl files. The WRF in the COAWST modeling system has been modified to write to a single standard output file and a single standard error file. I don't know if there is a way around this.

Steve
 
Steve,
I ran this on my machine, but used my own data. Otherwise, I used your namelist.input file and V4.0.3. These are the tests and results I saw:
1) 360 processors - runs successfully
2) 48 processors - fails with a seg-fault
3) 36 processors - fails with a seg-fault
4) 72 processors - fails with a seg-fault
5) 108 processors - runs successfully

So for me, the sweet spot seems to be somewhere between 108 and 360 (possibly more; these numbers come from the fact that we have 36 cores per node on our machine). You have a pretty large domain, so my results make sense to me. I am actually a little shocked that running with fewer than 42 processors was working for you. Is it possible for you to try much larger processor counts?
 
Hi Kelly,

I'll try to run with more processors first thing Monday. In the meantime, I was wondering why you would expect a smaller number of processors to cause a segmentation fault?

Steve
 
Hi,
Simply because there isn't enough computing "power" to run such a large domain: with too few processors, the piece of the grid assigned to each processor becomes too large for a single processor to handle. There is some additional information in this FAQ; I'd advise reading through it to understand a bit more about how to choose a "good" number of processors. Just remember that each case is different, and this is meant as a rough guide. The numbers could vary with each case.
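
As a rough illustration of the arithmetic, here is a minimal sketch. The 25x25 / 100x100 patch-size guideline is the commonly cited rule of thumb (not something specific to your case), and the e_we / e_sn values are placeholders rather than the dimensions from your namelist:
Code:
program proc_range
  ! Rule of thumb (commonly cited WRF guidance): each MPI patch should be
  ! no smaller than about 25 x 25 grid points and no larger than about
  ! 100 x 100 grid points.  e_we and e_sn below are placeholder domain
  ! dimensions, not the values from the attached namelist.input.
  implicit none
  integer, parameter :: e_we = 601, e_sn = 601
  integer :: nproc_min, nproc_max

  nproc_min = (e_we / 100) * (e_sn / 100)  ! fewer than this: patches get very large
  nproc_max = (e_we / 25)  * (e_sn / 25)   ! more than this: patches get too small

  print '(a,i6,a,i6)', 'suggested processor range: ', nproc_min, ' to ', nproc_max
end program proc_range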
 
Kelly,

I did debug runs of WRF (compiled with configure -D) with 96 and 192 processors. Both failed with a seg-fault.

The only potentially informative error message is ...

forrtl: severe (408): fort: (3): Subscript #2 of the array ZO has value -858993425 which is less than the lower bound of 1

This looks like an array bounds error.
I get this exact same error for all debug runs with 48, 96 and 192 processors. I haven't done a 42 processor debug run.

My code is definitely behaving differently from yours. To summarize,

12 processors - runs successfully
20 processors - runs successfully
30 processors - runs successfully
42 processors - fails with a seg-fault
48 processors - fails with a seg-fault
96 processors - fails with a seg-fault
192 processors - fails with a seg-fault

Steve
 
Hi Steve,

Is there any way for you to run this without the full COAWST system, and just with a basic build of WRF? I'm just curious if the problem is somehow related to the combination of the 2 models. If not, or if that doesn't help, can you attach the input files you are using to run this (basically any files I would need to run this in the WRF model)? I'd like to try to repeat the case with your actual files.
 
Hi Kelly,

I could give that a shot; I have the original WRF Model Version 4.0.3 (December 18, 2018) code. The difference should be the modifications that John Warner has made to the WRF 4.0.3 code. But before I do that, here is another clue ...

We also plan to do runs with MYNN2 PBL and surface layer physics (bl_pbl_physics = 5, sf_sfclay_physics = 5) in addition to GB PBL with the revised MM5 Monin-Obukhov surface layer (bl_pbl_physics = 12, sf_sfclay_physics = 1). If I specify MYNN2, WRF runs successfully! I did runs with 12, 48, 96 and 192 processors. Only the 96-processor standard error file reports a SEGV, which for some strange reason occurs after the code completes successfully. I think that the output is OK; I will check it tomorrow.

So it seems that the array bounds error is associated with the modules for the GB PBL and revised MM5 Monin-Obukhov surface layer.

I attached the namelist.input file for these runs. See if you get the same result.
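
For reference, a minimal sketch of how the two configurations differ in the &physics section of namelist.input (only the options discussed above are shown; all other settings are as in the attached file):
Code:
! GB PBL with the revised MM5 Monin-Obukhov surface layer (the combination showing the problem)
&physics
 bl_pbl_physics    = 12,
 sf_sfclay_physics = 1,
/

! MYNN2 PBL and surface layer (the combination that runs cleanly)
&physics
 bl_pbl_physics    = 5,
 sf_sfclay_physics = 5,
/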

Steve
 

Attachments

  • namelist.MYNN.input.txt
Hi Kelly,

My input files are attached. I added the .nc file extension to get them to attach.

You should have the same additional files WRF reads for the RRTMG radiation scheme (ra_*_physics = 4) that come with the WRF distribution.

ozone.formatted
ozone_lat.formatted
ozone_plev.formatted
RRTM_DATA
RRTMG_LW_DATA
RRTMG_SW_DATA

Steve
 

Attachments

  • wrfbdy_d01.nc
  • wrfinput_d01.nc
  • wrflowinp_d01.nc
Hi Kelly,

I've completed runs with stand-alone WRF-4.0.3 from the GitHub archive (https://github.com/wrf-model/WRF/releases/tag/v4.0.3).
I found that the behavior of the COAWST WRF and stand-alone WRF-4.0.3 is essentially the same and is summarized as follows ...

------------------------------------------------------------------------------------------------------------
With GB PBL and revised MM5 Monin-Obukhov surface layer (bl_pbl_physics = 12, sf_sfclay_physics = 1),

12 processors - runs successfully
20 processors - runs successfully
30 processors - runs successfully
42 processors - fails with a seg-fault in 10 of the 42 processors.
48 processors - fails with a seg-fault in 18 of the 48 processors.
96 processors - runs successfully, despite a seg-fault generated by 1 processor [note: 1]
192 processors - runs successfully.


With MYNN2 PBL and surface layer physics (bl_pbl_physics = 5, sf_sfclay_physics = 5), all cases run successfully!

12 processors - runs successfully
20 processors - runs successfully
30 processors - runs successfully
42 processors - runs successfully
48 processors - runs successfully
96 processors - runs successfully, despite a seg-fault generated by 1 processor [note: 2]
192 processors - runs successfully.

[note: 1] This segmentation fault occurs on only 1 processor. The seg-fault error message comes immediately after "wrf: SUCCESS COMPLETE WRF" in the single file rsl.error.0000. For this case, the fault only occurs with stand-alone WRF-4.0.3; COAWST-WRF generates no seg-fault.

[note: 2] This segmentation fault occurs on only 1 processor, both with stand-alone WRF-4.0.3 and COAWST-WRF. For stand-alone WRF-4.0.3, the seg-fault error message appears immediately after "wrf: SUCCESS COMPLETE WRF" in the single file rsl.error.0000. In COAWST-WRF, the single seg-fault error message appears in the combined standard error file immediately after the 96th "wrf: SUCCESS COMPLETE WRF".
------------------------------------------------------------------------------------------------------------

Initially I had mistakenly told you that COAWST-WRF GB/MM5 crashed with a segmentation fault whenever the number of processors was 42 or above. This was because I ran the 42-192 processor cases only with the debug code (configure -D), which reported an array bounds error in each case. I had assumed, without checking, that this meant the non-debug code would crash with a seg-fault. As you can see, this is not necessarily the case.

For an intermediate number of processors (around 48) and with GB/MM5 physics, both COAWST-WRF and stand-alone WRF-4.0.3 crash with a segmentation fault. For a small or large number of processors there is no segmentation fault. However, for ALL cases, 12-192 processors, the debug code (configure -D) reports an array bounds error and then aborts. The separate rsl.error.* and rsl.out.* files from the stand-alone WRF-4.0.3 debug code contain a better traceback than COAWST-WRF, pointing to a particular line of code (attached). The array bounds error and traceback are exactly the same for all GB/MM5 cases.

So there appears to be an array bounds problem in the code involved in the GB/MM5 physics that produces a segmentation fault when running with an intermediate number of processors. With MYNN2 physics, both COAWST-WRF and stand-alone WRF-4.0.3 run fine regardless of the number of processors, and the debug code reports no array bounds issues.

I suppose one could just say, "stay away from 48 processors or so and WRF will be fine." But I'm about to start coupling WRF-ROMS-SWAN, and my understanding is that a coupled COAWST model is a single executable. I'm no expert on how COAWST works, but trying to run a coupled model with an array bounds issue lurking around in WRF is a bit worrisome to me. While I certainly plan to forge ahead with the coupled model, I might need some more help if I run into problems! If you guys were to reassure me that this array bounds issue is really not a problem, or better yet come up with a fix, I'd feel way better! I think I read somewhere that Fortran programmers sometimes intentionally do things like this for various reasons, and it is even considered a valuable "feature" that Fortran allows it.

Please let me know if you want any additional info or files.

Best,
Steve
 

Attachments

  • traceback.txt
Steve,
Thank you so much for doing these tests. I have been able to reproduce most of what you are seeing, as well. I found that when I didn't compile with "-D" I was able to run without problems; however, I didn't test all the variations of number of processors that you did. I have been trying to track down what may be causing the problem, but without much luck so far. Unfortunately our super-computing system keeps going down for various reasons, making this quite difficult. I will continue to work on it, though, and will keep you posted when I figure something out. Again, thanks for the tests - they definitely help!
 
Hi Steve,
I have FINALLY tracked down the problem! Although the error kicks out during the cumulus scheme, it actually stems from the PBL scheme (Option 12, Grenier-Bretherton-McCaa). In the file phys/module_bl_gbmpbl.F there is a variable (kpbl2dx) that is used in the subroutine "pblhgt" but never initialized. Look for this section of the code:
Code:
    !     We should now have tops and bottoms for iconv layers
    ! NT not clear how kpbl2dx should work, but doesn't matter since
    ! NT only kmix2dx is used no matter what kpbl2dx is.
    ! NT Looks like it could be used to choose between mixing pbl and
    ! NT convective pbl height if there are more than 1 unstable layers
                        
    if(iconv.gt.0)then
       if(kbot(iconv).eq.kte+1)then
          kmix2dx = ktop(iconv)
          if(kpbl2dx.ge.0)then
             if(iconv.gt.1)then
                kpbl2dx = ktop(iconv-1)
             else
                kpbl2dx = kmix2dx
             endif
          else

Just above "if(iconv.gt.0)then", and after the last comment, add this line:
Code:
kpbl2dx = 0

Save that file and recompile the code (there is no need to issue a 'clean -a' or to reconfigure; just recompile, and it should be fairly quick), then try again. Let me know if that fixes the problem for you.
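
For clarity, here is how that region of phys/module_bl_gbmpbl.F would read with the initialization added (this is just the excerpt above with the one new line inserted; nothing else changes):
Code:
    !     We should now have tops and bottoms for iconv layers
    ! NT not clear how kpbl2dx should work, but doesn't matter since
    ! NT only kmix2dx is used no matter what kpbl2dx is.
    ! NT Looks like it could be used to choose between mixing pbl and
    ! NT convective pbl height if there are more than 1 unstable layers

    kpbl2dx = 0     ! added: initialize before the first test below

    if(iconv.gt.0)then
       if(kbot(iconv).eq.kte+1)then
          kmix2dx = ktop(iconv)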
 
Hi Kelly,

Sorry to take so long to get back to you, and thank you very much for working on this issue. I know that you have spent a lot of time on it (as have I). The good news is that the GBM PBL module fix you provided has eliminated all segmentation faults that cause WRF to crash. However, in some cases (48 & 96 CPU with GB PBL, 96 CPU with MYNN2 PBL), I am still getting the single segmentation fault from task 0 during WRF's shutdown phase, after it has written all output and reported "wrf: SUCCESS COMPLETE WRF". Here is the traceback from rsl.error.0000 when this occurs.


View attachment traceback.txt


--------------------------------------------------------
Here is the relevant code from WRF/frame/module_dm.f90

SUBROUTINE wrf_dm_shutdown
IMPLICIT NONE
INTEGER ierr
CALL MPI_FINALIZE( ierr ) <-- line 6427 of module_dm.f90
RETURN
END SUBROUTINE wrf_dm_shutdown
--------------------------------------------------------

I don't think that this "shutdown" seg-fault has any effect on the model output, because I have examples where WRF produces the same output when it terminates with and without the seg-fault. However, this could be a problem for my COAWST WRF-ROMS-SWAN coupled model if WRF shuts down before either of the other two models has completed. So my first question for you is: do you have any ideas about what is causing this shutdown segmentation fault and how to fix it?



Unfortunately, I am now encountering a second serious problem with the output from WRF that I hope you can help me with. Even though WRF runs to completion, the model output can depend on the number of processors used! I don't think that this is related to the shutdown segmentation fault issue because, as I said, there are examples both with and without the seg-fault where the output is the same. If you want to move this to another thread, that's fine, but I will tell you about it here and apologize in advance for the length of this.

I initially thought things were OK because all the output from the 10-minute GB/MM5 (bl_pbl_physics = 12, sf_sfclay_physics = 1) test runs that I did for the segmentation fault issue was the same, regardless of the number of processors (#CPU = 12, 20, 30, 42, 48, 96, 192) and regardless of whether or not the GBM PBL module was fixed. That is, the output from the successful 10-minute runs with the unfixed GBM PBL module and the output from all the 10-minute runs with the fixed GBM PBL module was the same. However, when I checked the output from the 10-minute MYNN2 test runs, I found that it WAS dependent on the number of processors, despite the fact that the MYNN2 runs never crashed with a segmentation fault and never generated any array bounds errors. Strangely enough, the fix to the GBM PBL module also changed the MYNN2 output in some of the cases I looked at, but at this point I do not know whether or not the "fixed" MYNN2 output is dependent on the number of processors.

To make matters worse, I subsequently did some 1-day GB/MM5 test runs with 12, 48, 99 & 195 processors, and they generate two different sets of (hourly) output as follows ...

output set 1: 12 & 48 processors
output set 2: 99 & 195 processors

All variables in the history and restart files of each output set are identical except for the variable MIN_PTCHSZ in the restart file (I believe MIN_PTCHSZ should be #CPU dependent). There are 213 variables in each history file; 96 have values that differ between the two output sets. These 1-day test results were done with COAWST-WRF, but I also did some runs to verify that the output from stand-alone NCAR WRF-4.0.3 behaves the same way.

Here is an example showing the difference in the variable HFX (UPWARD HEAT FLUX AT THE SURFACE in W m-2) between the two output sets.

Over the entire 24-hour integration, the magnitude of the maximum difference in HFX between output sets 1 and 2 at any point in the field is 566 W m-2. This is about 20 times the mean value of HFX over all time and space, 29 W m-2. The maximum difference grows from zero at the start of the integration to 566 W m-2 at hour 19 and then decays to about 70 W m-2 at hour 24. Here is the output from a Matlab program that checks the difference between two netCDF files.

check HFX: NOT EQUAL
Set 1 mean value: 29.07323265
Set 1 min value: -101.4958344
Set 1 max value: 628.4731445
Set 2 mean value: 29.09247398
Set 2 min value: -101.463707
Set 2 max value: 627.6217041
max difference Set 2 & 1: 565.6324463
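
For anyone wanting to reproduce this kind of check, here is a minimal sketch in Fortran using the netCDF library (the file names are placeholders, it assumes HFX has the same shape in both files, and it is not the script that produced the numbers above):
Code:
program compare_hfx
  ! Read the same variable from two wrfout files and report basic statistics
  ! and the maximum absolute difference.  File names are placeholders.
  use netcdf
  implicit none
  integer :: nc1, nc2, var1, var2
  integer :: dimids(3), nx, ny, nt
  real, allocatable :: a(:,:,:), b(:,:,:)

  call check( nf90_open('wrfout_set1.nc', NF90_NOWRITE, nc1) )
  call check( nf90_open('wrfout_set2.nc', NF90_NOWRITE, nc2) )
  call check( nf90_inq_varid(nc1, 'HFX', var1) )
  call check( nf90_inq_varid(nc2, 'HFX', var2) )

  ! HFX is (west_east, south_north, Time); take the sizes from file 1.
  call check( nf90_inquire_variable(nc1, var1, dimids=dimids) )
  call check( nf90_inquire_dimension(nc1, dimids(1), len=nx) )
  call check( nf90_inquire_dimension(nc1, dimids(2), len=ny) )
  call check( nf90_inquire_dimension(nc1, dimids(3), len=nt) )
  allocate(a(nx,ny,nt), b(nx,ny,nt))

  call check( nf90_get_var(nc1, var1, a) )
  call check( nf90_get_var(nc2, var2, b) )

  print '(a,f14.7)', 'Set 1 mean value: ', sum(a) / size(a)
  print '(a,f14.7)', 'Set 2 mean value: ', sum(b) / size(b)
  print '(a,f14.7)', 'Set 1 min value:  ', minval(a)
  print '(a,f14.7)', 'Set 2 min value:  ', minval(b)
  print '(a,f14.7)', 'Set 1 max value:  ', maxval(a)
  print '(a,f14.7)', 'Set 2 max value:  ', maxval(b)
  print '(a,f14.7)', 'max difference Set 2 & 1: ', maxval(abs(b - a))

  call check( nf90_close(nc1) )
  call check( nf90_close(nc2) )

contains

  subroutine check(status)
    ! Stop with the netCDF error string if a call fails.
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      stop 1
    end if
  end subroutine check

end program compare_hfx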

Here are plots of HFX (Set 2) and the difference in HFX (Set 2 - Set 1) at hour 19; the difference is largest over land.


[Figure: HFX (Set 2) at hour 19 - HFX_99_cpu.19hr.png]

[Figure: HFX difference (Set 2 - Set 1) at hour 19 - HFX_99-48_cpu.19hr.png]

At hour 11 you can more easily see the difference between Sets 1 & 2 over water.


[Figure: HFX (Set 2) at hour 11 - HFX_99_cpu.11hr.png]

[Figure: HFX difference (Set 2 - Set 1) at hour 11 - HFX_99-48_cpu.11hr.png]

Here is a close-up of the difference in HFX (Set 2 - Set 1) at hour 11 in the region 86.84W - 83.18W and 22.91N - 25.24N in the southeastern Gulf of Mexico.


[Figure: close-up of the HFX difference (Set 2 - Set 1) at hour 11, SE Gulf of Mexico - HFX_99-48_cpu.11hr-SE-GOM.png]

The difference is clearly visible to the eye in these close-ups of the Set 1 (12-48 processor) HFX field and the Set 2 (99-195 processor) HFX field at hour 11.


[Figure: close-up of the Set 1 HFX field at hour 11 - HFX_48_cpu.11hr-SE-GOM.png]

[Figure: close-up of the Set 2 HFX field at hour 11 - HFX_99_cpu.11hr-SE-GOM.png]




Kelly, do you have any ideas about what is going on here? Why should WRF output be dependent on the number of processors?
This is really quite frustrating. I thank you for your help so far and any help you can give me with this issue.

Steve
 
Hi Kelly,

Update on the 1-day GB/MM5 test runs: At the time of my previous post I had not actually checked the output from the 12 CPU GB/MM5 test run, assuming that it would be the same as the lower #CPU output (Set 1). In fact, the 12 CPU GB/MM5 output is the same as Set 2. I have also included results for standalone NCAR WRF 4.0.3, which are consistent with the COAWST-WRF results.

Here are two tables summarizing my 1-day GB/MM5 tests. One for COAWST WRF 4.0.3 and one for NCAR standalone WRF 4.0.3. For all runs there are only two different outputs, referred to as Set 1 and Set 2. The columns of the tables indicate the number of processors, the processor tiling, whether or not the code was built with your fixed phys/module_bl_gbmpbl.F, and the output.


View attachment tables.txt


Best,
Steve
 
Hi Steve,
I apologize for the delay on my part this time. I wanted to discuss this issue with some people here to see what they think.

1) For the first problem, in which there is a seg-fault during the shut-down process (after WRF completes successfully), we believe this is likely related to the communication within the coupled system. As WRF is actually completing, it's likely not a stand-alone WRF problem. If you are seeing this problem when running WRF alone (without the ROMS and SWAN components of COAWST), then we will definitely need to look into that a bit further. Let me know if that is the case.

2) As for the difference in output with different numbers of processors - unfortunately this is a problem that we do see with certain physics routines. Because these are written and distributed to us by outside sources, they can sometimes be problematic. We do our best to test them, but occasionally there are particular combinations of options that cause problems. Sometimes, as the WRF code evolves while the physics routines remain the same, problems can be introduced. I ran a very basic test with the options you mentioned (MM5/GBM), once with 18 processors and once with 36 processors, and I see the same problem that you do. In order to correct this, we will likely have to do a great deal of testing and probably send it back to the developers to see what they want to do about it, which could take quite some time. I apologize for this and wish I had an easy solution for it.
 
Hi Kelly,

Geeze ... somehow I got unsubscribed from this topic, so I missed your post until I happened to check just now. My turn to apologize again!

1) With regard to the seg-fault during the MPI shut-down process: yes, this does occur when running WRF without the ROMS and SWAN components of COAWST. The traceback and code snippet in my Sept. 3 post are from a 48-processor run with stand-alone WRF 4.0.3.

As noted in that post, I don't think the shut-down seg-fault is affecting the output, as I have runs with and without seg-faults that produce the same output. My main concern is that WRF could crash a coupled run if it completes first and generates a seg-fault before one or more of the other models completes. I now have a COAWST coupled WRF-ROMS run that ends with the shut-down seg-fault and appears to be OK ... but in that run ROMS completed before WRF. I suppose that if COAWST is set up so all the models are allowed to complete before any MPI shut-down process is initiated, then there may never be a problem, but I don't know if this is the case.

2) If you ever do make any progress on the output issue, please let me know. Bear in mind that I saw output dependent on the number of processors with both the MM5/GBM and MYNN2 schemes.
 
Steve,
I have tried to reproduce the segmentation fault you are still seeing, but without any success. The run completes for me with a variety of processor counts (including 48 and 96). I am using the namelist you sent for the time starting at 2010-04-13_00:00:00, with the GBM PBL, and I ran for the 10 minutes you had it set to run (which I believe is the same length you are running). I used V4.0.3, compiled with Intel, and with the GBM fix I provided earlier. I am afraid that the segmentation fault (after WRF completes) may be system/environment-related, since I'm not able to reproduce it. Unfortunately, at this point I would have to recommend either using a number of processors that doesn't cause the problem, or working with your systems administrators to see if they are able to track down the problem. If you are able to figure it out, please update us so that the answer may be helpful to someone else in the future.

I will also let you know if we solve the discrepancy issue with the physics options.
 
Hi Kelly,

Yes, I believe that you are correct - there appears to be an issue with the MPI implementation here at FSU. I am now getting the same shutdown seg-faults with some recent COAWST ROMS-only runs. I had done many ROMS-only runs months ago with no issues, but there have been several system "upgrades" in the interim. Unfortunately, WRF mistakenly took the blame because the shut-down faults first appeared in my WRF-only runs after we (you) solved the other segmentation fault issue in the GBM PBL code. Sorry for sending you down a rabbit hole. I will certainly let you know if our support folks figure out what is going on ...

And yes, please let me know if you guys make any progress on the output discrepancy issue with the physics options.

Thanks again Kelly for all your help,

Steve
 