Segmentation fault 8 days into WRF simulation

mehtut

Hello,

I cannot figure out why my WRF simulation is failing after eight days with a segmentation fault. I have read this post and have tried to address the potential issues listed, but have not had any success. For reference, I am running WRF version 4.4.1 with ERA5 data for the initial and boundary conditions. I ran with 3 nodes and 128 tasks per node. I have attached my namelist.input file and several of the rsl files (the tar file with all of the rsl files was too big to upload). I noticed that some of the rsl files contain the line "Warning BAD MEMORY ORDER" and I'm not sure what that means. Any help would be greatly appreciated!

Thanks,
Emily
 

Attachments

  • namelist.input
    5.4 KB · Views: 4
  • rsl.error.0357.txt
    42 KB · Views: 1
  • rsl.error.0383.txt
    41.5 KB · Views: 1
  • rsl.out.0357.txt
    17 KB · Views: 2
  • rsl.out.0383.txt
    17 KB · Views: 2
Hi Emily,

I've actually never seen the "BAD MEMORY ORDER" warning before, and it's possible that is causing issues, but the warning starts after 2 days of simulation and it continues to run until the 8th day. I first want to check that you are not running out of disk space or wallclock time (if you're submitting to a queueing system). Can you let me know if those are okay? And I know the packaged rsl* tar file is large, but can you take a look at the home page of this forum that gives instructions for sharing large files and share it with me that way? Thanks!
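For anyone triaging a similar crash, a minimal Python sketch like the one below can help show when the warning first appears in each rsl file and what each MPI task printed last. It assumes the rsl.error.* files sit in the current directory; the helper names (`first_match`, `summarize_rsl`) are made up for illustration and are not part of WRF.

```python
from pathlib import Path


def first_match(lines, needle):
    """Return the 1-based number of the first line containing needle, or None."""
    for number, line in enumerate(lines, start=1):
        if needle in line:
            return number
    return None


def summarize_rsl(directory=".", needle="BAD MEMORY ORDER"):
    """Map each rsl.error.* file name to (first line with needle, last non-empty line)."""
    summary = {}
    for path in sorted(Path(directory).glob("rsl.error.*")):
        lines = path.read_text(errors="replace").splitlines()
        non_empty = [line for line in lines if line.strip()]
        last = non_empty[-1].strip() if non_empty else ""
        summary[path.name] = (first_match(lines, needle), last)
    return summary


if __name__ == "__main__":
    # Assumption: run from the WRF run directory that holds the rsl files.
    for name, (line_no, last) in summarize_rsl().items():
        print(f"{name}: first warning at line {line_no}; last line: {last}")
```

Tasks whose last line differs from the rest are usually the first place to look, although in this thread the rsl files did not contain an explicit error.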
 
Hi,

Thanks for getting back to me. I am not running out of disk space (I'm at 46.7% of my allocated space). I requested 4 hours of run time, and the simulation crashed after approximately an hour. I uploaded the rsl tar file using Nextcloud. The file is called rsl_files.tar.

Thanks!
Emily
 
Thanks for sending those and for verifying the disk space and wallclock time. Since none of the files contain errors or other helpful information, can you try a couple of things to narrow down the issue?

1) Try to run this with V4.5.1 (the latest WRF version) to see if that changes anything
2) Try running with a very basic namelist - as close to the default version as is possible, obviously modifying your dates, domain size, resolution, etc., but leave out all the extra output options (aux*) just to make sure none of those are causing the problem.

Then let me know the result. Thanks!
 
I tried both of your suggestions and found the following:

1. Using WRF v4.5.1 and the original namelist that I used still resulted in the same crash point.

2. Using WRF v4.5.1 and a more basic namelist (attached) allowed me to get past the crash point. This makes me wonder if the aux options are causing the problem, specifically the SST update. Because of the domain for this simulation, I'd like to use the SST update option. Do you have any advice on how to proceed?

Thanks!
Emily
 

Attachments

  • namelist.input
    3.6 KB · Views: 5
Emily,
Thanks for trying those tests! Since you know the basic namelist works, have you then tried adding back the sst_update (and corresponding aux*) option to see if that causes it to fail again? I would try that, and then if that happens to work, you can rule that out. You can then try adding one of the other aux* options, until you find the culprit.
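The one-option-at-a-time approach described above can be sketched in Python as follows. This is only an illustration: the baseline values and the auxhist2 group are placeholders rather than settings from Emily's namelist, the interval value is arbitrary, and `one_at_a_time` is a hypothetical helper. Each variant would still need to be written into a real namelist.input by hand and run separately.

```python
def one_at_a_time(baseline, groups):
    """Yield (group_name, settings) with exactly one option group added to the baseline."""
    for name, extra in groups.items():
        settings = dict(baseline)  # copy so the baseline itself is never modified
        settings.update(extra)
        yield name, settings


# Placeholder baseline: stands in for the basic namelist that ran past the crash point.
baseline = {"io_form_history": 2, "io_form_input": 2}

# Candidate option groups to re-enable, one per test run (illustrative values only).
groups = {
    "sst_update": {
        "sst_update": 1,
        "auxinput4_inname": "wrflowinp_d<domain>",
        "auxinput4_interval": 360,
        "io_form_auxinput4": 2,
    },
    "auxhist2": {
        "auxhist2_interval": 60,
        "io_form_auxhist2": 2,
    },
}

if __name__ == "__main__":
    for name, settings in one_at_a_time(baseline, groups):
        added = sorted(set(settings) - set(baseline))
        print(f"test run '{name}': adds {added}")
```

The first variant that reproduces the crash points at the option group responsible.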
 
I tried adding back the sst_update option and the corresponding aux settings and my simulation is crashing on the eighth day again. I looked at the wrflowinp_d01 file and don't see anything suspicious that stands out. Would it be helpful if I shared the files generated from real.exe?

Emily
 
I'm attaching the namelist.input file that I used and have put all of the initial and boundary conditions in a file called ICBC_sst_update.tar.gz on Nextcloud. I also put the rsl files on Nextcloud in a file called rsl_files_sst_update.tar.

Thanks,
Emily
 

Attachments

  • namelist.input
    3.8 KB · Views: 4
Hello,

I was wondering if you had a chance to look at my recent files yet. Any help would be much appreciated!

Thanks,
Emily
 
Hi Emily,
Apologies, as I've gotten a bit behind on the forum due to some other deadlines. I'll hopefully get a chance to look at this more closely today or tomorrow and will get back to you.
 
Hi Emily,
Okay, I've been working on this a bit now, but unfortunately still don't have an answer for you. I am able to reproduce the stop after 8 days; however, I do not get the "BAD MEMORY ORDER" warning. Mine just stops, without any good explanation. I even tried compiling with debugging options (using ./configure -D), but still don't get anything. I have a question for you, though. Are you actually using additional SST fields at a higher resolution (temporal and/or spatial) than the ERA5 data, or are you just turning on sst_update when you run WRF?
 
I am not using additional SST fields other than the ERA5 data. I was just turning on the sst_update option when I ran real.exe and wrf.exe.

Emily
 
Hello,

I was wondering if there are any more steps that I could take to try to resolve this error.

Thanks for the help,
Emily
 
@mehtut @kwerner

I have seen this error before, but only when the input formats were different.

I noticed that you are using pnetcdf (io_form_input = 11) for the ERA files and netcdf (io_form_auxinput4 = 2) for the aux input.

I believe WRF handles the two file formats differently when reading them in.

Perhaps changing them all to either pnetcdf or netcdf (11 or 2) might resolve this issue
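
A quick way to check a namelist for mixed formats is sketched below in Python. It is a rough text scan, not a full Fortran namelist parser: it assumes each io_form_* setting sits on its own line with a single integer value, and the function names are made up for illustration. Mixed values are not necessarily an error, so treat the result only as a prompt to look closer.

```python
import re

# Matches lines such as " io_form_input = 11," and skips lines commented out with "!".
IO_FORM = re.compile(r"^\s*(io_form_\w+)\s*=\s*(\d+)", re.MULTILINE | re.IGNORECASE)


def io_forms(namelist_text):
    """Return a dict of every io_form_* setting found and its integer value."""
    return {name.lower(): int(value) for name, value in IO_FORM.findall(namelist_text)}


def mixed_formats(namelist_text):
    """Return the io_form_* settings if they use more than one format, else {}."""
    forms = io_forms(namelist_text)
    return forms if len(set(forms.values())) > 1 else {}


if __name__ == "__main__":
    # Assumption: namelist.input is in the current directory.
    with open("namelist.input") as handle:
        found = mixed_formats(handle.read())
    if found:
        print("Mixed I/O formats:", found)
    else:
        print("All io_form_* settings use the same format.")
```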


I've also attached a namelist.input file that I have used with sst_update turned on and that works for me. Maybe between the two of you, you can see what might be needed.

regards,
Will
 

Attachments

  • sst_update_namelist.input
    12.6 KB · Views: 3
Emily,
Regarding Will's remark above, when I was running your simulations and getting the errors, I was using output 2, so I don't think your issue is related to the output format.

I've done another test. I used my own input data (GFS 0.25 degree data) and went through the WPS process, all the way through WRF. I kept everything else the same as yours: your namelist.input, domains, dates, etc. I ran a total of 9 days, getting me past the "crash" point of your test, and it worked seamlessly. This indicates to me that the issue is potentially related to the input. I'm not sure whether something is actually wrong with the input data, or whether some component of the input causes the model to stop once it goes through the calculations.

1) Are you set on using ERA5 input? If not, you could try the GFS data and see if that works for you.
2) You could try playing around with the physics options - perhaps start with the "tropical" physics suite, since your domain covers much of the tropical ocean area.
 