
WRF on AWS segmentation fault

This post was from a previous version of the WRF&MPAS-A Support Forum. New replies have been disabled and if you have follow up questions related to this post, then please start a new thread from the forum home page.

sym04110

Hi all,

I am trying to conduct a month-long WRF run on Amazon Web Services using ECMWF ERA-Interim data. Every time I attempt to run it, wrf.exe terminates after the 8th day of my simulation. The termination message printed to the screen is:

Code:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11293 RUNNING AT ip-172-31-25-52
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
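
For reference, the exit code in the message above follows the usual shell convention: a status above 128 generally means the process was killed by signal (status - 128), so 139 corresponds to signal 11, matching the "Segmentation fault (signal 11)" string. A minimal Python sketch of that arithmetic (the helper name `describe_exit_code` is only illustrative and is not part of WRF or MPI):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Statuses above 128 conventionally mean 'killed by signal (code - 128)'.
    """
    if code > 128:
        signum = code - 128
        try:
            # Look up the platform's name for this signal number.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```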

The error given in my rsl.error.0000 file is:

Code:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2AAFB9D3BE08
#1  0x2AAFB9D3AF90
#2  0x2AAFBA59F4AF
#3  0x1B1626C in taugb3.7253 at module_ra_rrtmg_lw.f90:?
#4  0x1B38A6D in __rrtmg_lw_taumol_MOD_taumol
#5  0x1B5CA68 in __rrtmg_lw_rad_MOD_rrtmg_lw
#6  0x1B6DE01 in __module_ra_rrtmg_lw_MOD_rrtmg_lwrad
#7  0x160B173 in __module_radiation_driver_MOD_radiation_driver
#8  0x16CCD33 in __module_first_rk_step_part1_MOD_first_rk_step_part1
#9  0x11FC36A in solve_em_
#10  0x10AFD52 in solve_interface_
#11  0x4713D5 in __module_integrate_MOD_integrate
#12  0x407B93 in __module_wrf_top_MOD_wrf_run

I’ve read through the WRF FAQ (http://www2.mmm.ucar.edu/wrf/users/FAQ_files/FAQ_wrf_runtime.html) and have tried changing the stack size, but I still get the same error. If I set my namelist to start on the 8th, it runs through the 9th and then terminates later in the month with the same error message. Has anyone seen this before? Does anyone have any ideas about what could be causing this? I've attached both my namelists.

Thanks!
Sarah
 

Attachments

  • namelist.input.txt
    2.8 KB · Views: 63
  • namelist.wps.txt
    1.1 KB · Views: 50
Hi Sarah,
When you start the run on the 8th and it stops later in the month, does it stop after the same number of simulation hours as when it stops after the 8th day? Could you package all of your rsl.* files into one *.TAR file, move it to your local system, and attach it here on the forum? (To attach files, open the text post window and click on the tab with the 3 horizontal bars; that will take you through the instructions for attaching.) Can you also let me know which version of WRF you are using? Thanks!
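
If it helps, the packaging step can also be done from Python. This is only a sketch: the function name `bundle_rsl_files` and the default archive name are assumptions chosen for illustration, and it simply collects every file matching rsl.* in the given run directory.

```python
import tarfile
from pathlib import Path


def bundle_rsl_files(run_dir, archive_name="rsl_files.tar.gz"):
    """Pack every rsl.* file in run_dir into one gzip-compressed tar archive."""
    run_dir = Path(run_dir)
    rsl_files = sorted(run_dir.glob("rsl.*"))
    archive_path = run_dir / archive_name
    with tarfile.open(archive_path, "w:gz") as tar:
        for rsl_file in rsl_files:
            # Store only the file name so the archive unpacks flat.
            tar.add(rsl_file, arcname=rsl_file.name)
    return archive_path, len(rsl_files)


# Example (hypothetical path): bundle_rsl_files("/path/to/WRF/run")
```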
 
Thanks for looking into this!

When I start the run on the 8th, it runs until the 21st, so it isn't the same number of simulation hours. The .tar file containing all of my rsl.* files is attached. I am using the most recent version of WRF, downloaded from https://github.com/wrf-model/WRF

Thanks again!
Sarah
 

Attachments

  • rsl_files.tar.gz
    54.3 MB · Views: 53
Thanks for sending those! I would first advise you to check your disk space on your instance to make sure you have room to store output files. You can read how to check your space in the command line from this page:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-describing-volumes.html
(look for "Viewing Free Disk Space")
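
As an alternative to the command line, free space can be checked from Python with the standard library. A minimal sketch, assuming the run directory lives on the volume you want to check (the helper name `free_space_gb` is illustrative):

```python
import shutil


def free_space_gb(path="."):
    """Return the free space, in GiB, on the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024**3


print(f"Free space here: {free_space_gb('.'):.1f} GiB")
```

Pointing `path` at the WRF run directory, rather than the current directory, reports the volume that will actually receive the wrfout files.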

I would also advise setting debug_level = 0. We recently removed this option from the default namelist because we don't really want users using it. It rarely provides useful information and just adds a lot of junk to the rsl files, making them difficult to read and large, sometimes so large that they take up all the disk space. This likely isn't what's happening in your case, but it's good to verify the space and to keep these files as small as possible.

Assuming that won't make any difference, and you'll still get the same segmentation fault, can you then attach the input files (wrfinput_d01, wrfbdy_d01, & wrflowinp_d01) I would need to run your case? I would like to try to see if I can repeat the problem here.

Thanks!
 
I have 670 GB left on my AWS instance, which I hope should be plenty to run for a month. Thanks for the tip about the debug level. I had previously changed it to a higher number in hopes of getting more information about why I was getting this error. I ran my case again and, as expected, got the segmentation fault at the same place.

The wrfinput_d01, wrfbdy_d01, and wrflowinp_d01 files were too large to attach here, so I uploaded them to my OneDrive account:
https://emailwsu-my.sharepoint.com/:u:/g/personal/sarah_y_murphy_wsu_edu/EVABS42aAlZIjndgt1KK74YBypE3smdOwHVSXFaJauplWg?e=8hq5s9

If there is an easier way to get these files to you please let me know. Thanks again!
Sarah
 
Hello,
I'm also getting the same error message. Previously, with mp_physics = 6 and ra_lw_radiation and ra_sw_radiation = 1, the model ran and produced results. I have only changed the domain; the other physics options are the same. Now it does not run: it terminates after about 1 minute, sometimes without showing any error. I'm attaching the tail of my rsl.out and rsl.error files, and also my namelist. I also noticed that the model runs and produces all results when I set ra_lw_radiation and ra_sw_radiation = 0, which is obviously not correct. Kindly help me with this as soon as possible.
Thank you
 

Attachments

  • namelist.input
    3.1 KB · Views: 58
  • tail_rsl_error_file.txt
    525 bytes · Views: 54
  • slurm_out_file.txt
    727 bytes · Views: 52
  • tail_rsl_out.txt
    700 bytes · Views: 53
lstwrf,
Are you actually running your simulation on an Amazon Web Services cloud instance, or is this a segmentation fault on a local machine/cluster?
 
Sarah,
Thank you for sending those files. I am able to repeat your problem, and it doesn't seem to be related to running on an AWS instance. My run stops in the exact same place, with a lot of print-outs in the rsl.out* files that look like:
Code:
Flerchinger USEd in NEW version. Iterations=          10
 Flerchinger USEd in NEW version. Iterations=          10
 Flerchinger USEd in NEW version. Iterations=          10

This print comes from the Noah surface scheme, but even if you weren't using that scheme, you would likely still get a segmentation fault. This message often indicates a problem with the input soil data. I notice that in the initial condition file (wrfinput_d01), the variable TSLB (soil temperature) has several spots with values of 0 K. Later in the run (after 78 output times, or 78 hours), I see these 0 K values coming back in the same general area for T2, T, and TSLB, which I believe leads to the demise of the run. I would be curious to know whether the met_em* files also show similar values. You may need to experiment with the input data. Take a look at this post from another WRF forum:
http://forum.wrfforum.com/viewtopic.php?p=26565
The user ronbeag had a similar problem (theirs was related to soil moisture instead of soil temp), and perhaps may have some information that will be useful for you.
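
One way to locate such values is to scan the field for entries below a plausibility threshold. The sketch below uses NumPy; the function name `find_nonphysical` and the 100 K cutoff are assumptions chosen for illustration, not values taken from WRF.

```python
import numpy as np


def find_nonphysical(field, min_valid=100.0):
    """Return (count, indices) of entries in field that fall below min_valid.

    min_valid is an illustrative cutoff in kelvin; adjust it per variable.
    """
    field = np.asarray(field)
    bad = np.argwhere(field < min_valid)  # one row of indices per bad point
    return len(bad), bad


# Hypothetical usage with the netCDF4 package (not run here):
# from netCDF4 import Dataset
# with Dataset("wrfinput_d01") as nc:
#     tslb = nc.variables["TSLB"][:]  # (Time, soil_layers_stag, south_north, west_east)
#     count, indices = find_nonphysical(tslb)
#     print(count, indices[:10])
```

Running the same check on the met_em* files would show whether the zeros are already present before real.exe.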
 
Thanks for looking at my files. I've read the wrfinput file into Python and can see that there are some zero values. I've edited the file so that these zero values are more reasonable and have saved a new netCDF file. When I put this file in the /run directory and try to run wrf.exe again, I get an error opening the file in my rsl.error files (see below). I assume this is because the way Python saves the file differs slightly from the way WRF expects it to be created. I'm not familiar with NCL, but I see that the other topic you linked suggests using NCL for editing. Do you have any suggestions for making the file I've created with Python readable by WRF, or would you advise that I learn enough NCL to do it with that? I've included the file here. I've changed the file name here for organization purposes, but in the /run directory the filename is 'wrfinput_d01'.

Code:
-------------- FATAL CALLED ---------------
FATAL CALLED FROM FILE:  <stdin>  LINE:      70
 program wrf: error opening wrfinput_d01 for reading ierr=       -1021

Thanks again!
Sarah
 

Attachments

  • wrfinput-mine.nc
    35 MB · Views: 55
Sarah,
I don't personally have any experience doing this with Python, but since you already know Python, it's likely easier to troubleshoot your Python script than to start from scratch with NCL.

Your wrfinput_d01 file does seem to be in NetCDF format, as I'm able to use an 'ncdump' command on it without problems. What I do notice is that the time in the file is in an unexpected format. If I issue 'ncdump -v Times wrfinput_d01', I would expect to see something like:
Code:
 Times =
  "2016-03-23_00:00:00" ;
at the bottom of the print-out. But instead I see:
Code:
 Times =
  "2",
  "0",
  "1",
  "5",
  "-",
  "0",
  "1",
  "-",
  "0",
  "1",
  "_",
  "0",
  "0",
  ":",
  "0",
  "0",
  ":",
  "0",
  "0" ;

So it looks like your Python script is writing the time string out one character per line. Perhaps you can find something in the script that explains this and correct it?
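
For illustration, WRF files normally store Times as a character variable with dimensions (Time, DateStrLen), where DateStrLen is 19, so one timestamp occupies a single row of 19 characters. The NumPy sketch below builds such a row; the helper name `encode_wrf_time` is hypothetical, and whether this matches the cause in your script is an assumption.

```python
import numpy as np


def encode_wrf_time(timestamp: str) -> np.ndarray:
    """Encode 'YYYY-MM-DD_hh:mm:ss' as a (1, 19) array of single characters."""
    if len(timestamp) != 19:
        raise ValueError("expected a 19-character timestamp")
    # One row (Time = 1) holding 19 one-byte characters (DateStrLen = 19).
    return np.array(list(timestamp), dtype="S1").reshape(1, 19)


times = encode_wrf_time("2015-01-01_00:00:00")
print(times.shape)

# Hypothetical write with the netCDF4 package, assuming 'Times' was created
# as a char ('S1') variable with dimensions ("Time", "DateStrLen"):
# nc.variables["Times"][:] = times
```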
If, however, you are unable to even issue simple commands on the wrfinput_d01 file (such as the ncdump command I mentioned above), then it could also be that the new wrfinput_d01 file is in NetCDF-4 format, and perhaps your version of WRF was built with NetCDF-3. If that's the case, you'll need to recompile WRF with NetCDF-4, or somehow output from Python in NetCDF-3 format. You will still need to fix the 'Times' issue, regardless.
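
A quick way to tell which format a file is in, without relying on how WRF was built, is to look at its first bytes: NetCDF-3 files start with "CDF" followed by a version byte, while NetCDF-4 files are HDF5 files and normally start with the HDF5 signature. A standard-library sketch (the function name `netcdf_file_kind` is illustrative, and it assumes the signature sits at offset 0, which is the usual case):

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
CDF_VERSIONS = {1: "NetCDF-3 classic", 2: "NetCDF-3 64-bit offset", 5: "CDF-5"}


def netcdf_file_kind(path):
    """Guess the on-disk NetCDF flavour from the file's leading bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(8)
    if len(magic) >= 4 and magic[:3] == b"CDF":
        # The fourth byte is the classic-format version number.
        return CDF_VERSIONS.get(magic[3], "unknown CDF variant")
    if magic == HDF5_SIGNATURE:
        return "NetCDF-4 (HDF5)"
    return "not recognized"


# Example: print(netcdf_file_kind("wrfinput_d01"))
```

If the file turns out to be NetCDF-4 and a NetCDF-3 file is needed, the netCDF4 Python package accepts a `format` argument when creating a dataset (for example `format="NETCDF3_64BIT_OFFSET"`).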
 
Hi again,

I'd like to give you an update on this situation in hopes you can provide more assistance. I was able to edit the files with the correct 'Times' variables but was unable to prevent the segmentation fault. I examined and edited all variables to ensure that nothing was showing values where it shouldn't, yet it didn't seem to fix the issue.

Since then, I've moved on to attempting a nested run. I get the same error as before, but after only one output file per domain is created. I've attached the namelist files for this nested run.

Do you have any other suggestions as to why I could be getting this segmentation fault? It seems that I cannot complete a run regardless of my domain without getting this error.
 

Attachments

  • namelist.wps-smurphy-nested.txt
    833 bytes · Views: 59
  • namelist.input-smurphy-nested.txt
    2.3 KB · Views: 59
Hi,
I am no longer able to access the previous link with your wrfinput/wrfbdy/etc. files, and I do not have them anymore. Do you mind sending me those files again (the same method as before is fine)? Can you also send me the raw input data you are using when running ungrib, along with the Vtable.* you are linking to? I'd like to start from that point to see if I can track down the issue. Thanks!
 