Crash with 'Domain average of dpsdt, dmudt NaN'

I am encountering an issue where WRF crashes at a specific time. I would appreciate any assistance you can provide.

I have installed the latest WRF v4.6.0 on my Linux server to use the new wind-farm options. After successfully running an earlier case, I reviewed the literature and planned sensitivity tests for Typhoon 'Infa', modifying some physics options and adjusting the domain extents.

I created the initial and lateral boundary conditions from FNL data (ds083.3) and ran WRF. However, it consistently crashes at a certain point without any clear error message. Upon investigation, I found NaN values appearing in the log shortly before the crash, and I don't know how to resolve this.

Here is the relevant part of the log:
Code:
Timing for main: time 2021-07-22_18:44:30 on domain   2:    0.73602 elapsed seconds
 d02   Domain average of dpsdt, dmudt (mb/3h):    404.5000       8.397835       3.262286
 d02   Max mu change time step:          110           2  3.2546468E-02
 d02   Domain average of dardt, drcdt, drndt (mm/sec):    404.5000      1.7035587E-04  3.0214911E-05  1.4014094E-04
 d02   Domain average of rt_sum, rc_sum, rnc_sum (mm):    404.5000       2.770452      0.6746788       2.095775
 d02   Max Accum Resolved Precip,   I,J  (mm):    165.6131             211         186
 d02   Max Accum Convective Precip,   I,J  (mm):    16.00648             200         202
 d02   Domain average of sfcevp, hfx, lh:    404.5000       1.401203       13.21041       143.0962
Timing for main: time 2021-07-22_18:45:00 on domain   2:    0.73642 elapsed seconds
Timing for main: time 2021-07-22_18:45:00 on domain   1:    2.69181 elapsed seconds
 d01   Domain average of dpsdt, dmudt (mb/3h):    405.0000                NaN            NaN
 d01   Max mu change time step:          143         104  7.5008301E-04
 d01   Domain average of dardt, drcdt, drndt (mm/sec):    405.0000      1.1270608E-04  4.6179335E-05  6.6526765E-05
 d01   Domain average of rt_sum, rc_sum, rnc_sum (mm):    405.0000       1.703290      0.6249699       1.078320
 d01   Max Accum Resolved Precip,   I,J  (mm):    156.9319             146          98
 d01   Max Accum Convective Precip,   I,J  (mm):    15.58068             224          74
 d01   Domain average of sfcevp, hfx, lh:    405.0000       1.006149       10.36697       102.8982
 d02   Domain average of dpsdt, dmudt (mb/3h):    405.0000                NaN            NaN
 d02   Max mu change time step:          217         118  5.8851804E-04
 d02   Domain average of dardt, drcdt, drndt (mm/sec):    405.0000      1.6426136E-04  3.0271098E-05  1.3399025E-04
 d02   Domain average of rt_sum, rc_sum, rnc_sum (mm):    405.0000       2.775380      0.6755869       2.099794
 d02   Max Accum Resolved Precip,   I,J  (mm):    165.7204             211         186
 d02   Max Accum Convective Precip,   I,J  (mm):    16.02392             200         202
 d02   Domain average of sfcevp, hfx, lh:    405.0000       1.402918       13.21963       143.1080
Timing for main: time 2021-07-22_18:45:30 on domain   2:    1.07063 elapsed seconds

I changed the time step from 90 to 60 seconds.
Although the situation improved somewhat, NaN values appeared again after the model had been running for more than 2 hours, and the run was interrupted.
Should I keep reducing the time step?
Code:
Timing for main: time 2021-07-24_08:30:00 on domain   2:    0.74259 elapsed seconds
Timing for main: time 2021-07-24_08:30:00 on domain   1:    2.70604 elapsed seconds
 d01   Domain average of dpsdt, dmudt (mb/3h):    2670.000                NaN            NaN
 d01   Max mu change time step:           51          96   69.82448
 d01   Domain average of dardt, drcdt, drndt (mm/sec):    2670.000      1.0824289E-04  4.7659465E-05  6.0583436E-05
 d01   Domain average of rt_sum, rc_sum, rnc_sum (mm):    2670.000       15.55473       4.840905       10.71383
 d01   Max Accum Resolved Precip,   I,J  (mm):    674.7068             141         109
 d01   Max Accum Convective Precip,   I,J  (mm):    117.8609             254          64
 d01   Domain average of sfcevp, hfx, lh:    2670.000       8.766058       27.33279       158.9171
 d02   Domain average of dpsdt, dmudt (mb/3h):    2670.000       10.43634       2.705757
 d02   Max mu change time step:          377         127  4.8635740E-04
 d02   Domain average of dardt, drcdt, drndt (mm/sec):    2670.000      1.3764304E-04  2.0445495E-05  1.1719757E-04
 d02   Domain average of rt_sum, rc_sum, rnc_sum (mm):    2670.000       24.52558       4.057384       20.46820
 d02   Max Accum Resolved Precip,   I,J  (mm):    730.6641             204         217
 d02   Max Accum Convective Precip,   I,J  (mm):    72.22220             194         212
 d02   Domain average of sfcevp, hfx, lh:    2670.000       10.14799       14.81878       151.7243
Timing for main: time 2021-07-24_08:30:20 on domain   2:    3.16795 elapsed seconds
 d02   Domain average of dpsdt, dmudt (mb/3h):    2670.333       10.42248       2.691759
 d02   Max mu change time step:          377         127  4.6444396E-04
 d02   Domain average of dardt, drcdt, drndt (mm/sec):    2670.333      1.3754175E-04  2.0391140E-05  1.1715063E-04
 d02   Domain average of rt_sum, rc_sum, rnc_sum (mm):    2670.333       24.52834       4.057792       20.47054
 d02   Max Accum Resolved Precip,   I,J  (mm):    730.6641             204         217
 d02   Max Accum Convective Precip,   I,J  (mm):    72.22220             194         212
 d02   Domain average of sfcevp, hfx, lh:    2670.333       10.14920       14.78852       151.6871
Timing for main: time 2021-07-24_08:30:40 on domain   2:    0.73734 elapsed seconds
 d02   Domain average of dpsdt, dmudt (mb/3h):    2670.667       10.43617       2.691706
 d02   Max mu change time step:          377         127  4.7271789E-04
 d02   Domain average of dardt, drcdt, drndt (mm/sec):    2670.667      1.3748354E-04  2.0381407E-05  1.1710214E-04
 d02   Domain average of rt_sum, rc_sum, rnc_sum (mm):    2670.667       24.53109       4.058199       20.47288
 d02   Max Accum Resolved Precip,   I,J  (mm):    730.6641             204         217
 d02   Max Accum Convective Precip,   I,J  (mm):    72.22220             194         212
 d02   Domain average of sfcevp, hfx, lh:    2670.667       10.15042       14.79035       151.6866
Timing for main: time 2021-07-24_08:31:00 on domain   2:    0.73843 elapsed seconds
Timing for main: time 2021-07-24_08:31:00 on domain   1:    5.78849 elapsed seconds
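For reference, the change I made corresponds to a single entry in the &domains block of my namelist.input (a minimal sketch; everything else is unchanged):
Code:
&domains
 ! time step in seconds; the usual rule of thumb is about 6*dx (dx in km),
 ! reduced toward 3*dx when a run goes unstable (NaN values, CFL errors)
 time_step = 60,   ! reduced from 90
/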

I have attached my namelist files, including those for both successful and failed runs, along with the rsl.error files.

Additionally, I have uploaded the shell scripts used to run the WPS and WRF steps, which may be helpful.

Thank you in advance.
 

Attachments

  • namelist-without-crush.input
  • wps.sh.txt
  • wrf-pbl2_mp6_cu11_fdda0_isftcflx2-namelist.input
  • wrf-pbl2_mp6_cu11_fdda0_isftcflx2-rsl.error.0000
  • wrf-pbl2_mp24_cu11_fdda0_isftcflx2.sh.txt
  • wrf-pbl2_mp24_cu11_fdda0_isftcflx2-namelist.input
  • wrf-pbl2_mp24_cu11_fdda0_isftcflx2-rsl.error.0000
Sorry for the delay, and thank you for your patience. There are a few differences I notice between the simulation that worked and the one that didn't, which suggest some tests to see which component is causing the issue.

- You are using a different surface layer physics scheme. Can you try running the failed case (mp24) with sf_sfclay_physics = 1, 1 to see if that makes a difference? (See the sketch just after this list.)
- I believe you're using a different input for this new test? It looks like the input was available every 3 hours for the run that worked, and every 6 hours for the failed case. If it was available every 3 hours, was it the higher-resolution GFS input (0.25 degrees)? If so, can you try using that same input for the failed case? It's always best to use the highest-resolution input, if possible.
- The domains are much larger for the failed case, which means you could use many more processors for the run. I don't think this is related to the NaN values you're seeing, but can you try using more to see how that affects the simulation? You could probably use 300+ for this simulation.
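For the first test, the only change needed is in the &physics block of the failed-case namelist (a sketch showing just that line; keep everything else exactly as it is):
Code:
&physics
 sf_sfclay_physics = 1, 1,   ! revert the surface layer option from 2 back to 1 for both domains
/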
 
I'm really glad to have received your reply. I'm now planning a nested 9 km - 3 km configuration to avoid the effects of the gray zone. But similarly, once I increase the time step to 6 times dx, I start getting NaN values within just one hour of simulation. With 3 times dx everything runs normally, but the calculation is very slow, which is a real burden for me, and I'm quite frustrated about it. My sensitivity tests haven't officially started yet because of uncertainty about the scheme parameters and cost constraints, so I'm urgently seeking help from the experts here.
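For context, the relevant part of the &domains block I'm testing looks roughly like this (a sketch: grid dimensions are omitted, and the 3:1 parent_time_step_ratio is the usual choice rather than something I stated above):
Code:
&domains
 max_dom                = 2,
 dx                     = 9000, 3000,   ! 9 km parent, 3 km nest
 dy                     = 9000, 3000,
 parent_grid_ratio      = 1, 3,
 parent_time_step_ratio = 1, 3,
 time_step              = 27,           ! 3*dx (in km) runs stably; 54 s (6*dx) gives NaN within an hour
/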
- You are using a different surface layer physics scheme. Can you try running the failed case (mp24) with sf_sfclay_physics = 1, 1 to see if that makes a difference?
I'll give it a try, but I'm curious about the reasoning. I set sf_sfclay_physics to the surface layer scheme that is commonly paired with my PBL scheme. Honestly, I lack experience in selecting this parameter, and any insights you could share would be appreciated.
- I believe you're using a different input for this new test? It looks like the input was available every 3 hours for the run that worked, and every 6 hours for the failed case. If it was available every 3 hours, was it the higher-resolution GFS input (0.25 degrees)? If so, can you try using that same input for the failed case? It's always best to use the highest-resolution input, if possible.
Yes, this was also meant to make the simulation more meaningful. My understanding is that GDAS (ds083.3) has an edge over GFS (ds084.1)? Both are 0.25-degree datasets; the former is assimilated analysis data and the latter is forecast data. I also want to use data with a three-hour interval as input. Honestly, I don't have time right now to compare the simulation results from the two sources, so I can't tell which one is better. Choosing the better data source has always been a puzzle for me, and I'd really appreciate any advice you can share.
- The domains are much larger for the failed case, which means you could use many more processors for the run. I don't think this is related to the NaN values you're seeing, but can you try using more to see how that affects the simulation? You could probably use 300+ for this simulation.
I'd like to know how to determine the specific number of cores to use for a run. From what I understood previously, the maximum shouldn't exceed the product of the first domain's grid dimensions divided by 25, i.e. (e_we/25) x (e_sn/25), and the minimum shouldn't be less than the product of the second domain's grid dimensions divided by 100, i.e. (e_we/100) x (e_sn/100).
 
I'll give it a try, but I'm curious about the reasoning. I set sf_sfclay_physics to the surface layer scheme that is commonly paired with my PBL scheme. Honestly, I lack experience in selecting this parameter, and any insights you could share would be appreciated.
I'm only asking for this test to see whether the change from option 1 to option 2 caused the problem. It shouldn't have, but since it's one of the variables that changed, it's worth a test.


Yes, this was also meant to make the simulation more meaningful. My understanding is that GDAS (ds083.3) has an edge over GFS (ds084.1)? Both are 0.25-degree datasets; the former is assimilated analysis data and the latter is forecast data. I also want to use data with a three-hour interval as input. Honestly, I don't have time right now to compare the simulation results from the two sources, so I can't tell which one is better. Choosing the better data source has always been a puzzle for me, and I'd really appreciate any advice you can share.
I was always taught to choose the input with the highest resolution. That said, GFS FNL does include updated analyses, so you could argue it's more accurate than the GFS forecast data. But the resolution of the FNL data is so coarse that I'm not actually sure which is the better option. I realize I'm contradicting my statement from yesterday; I suppose this is just an example of how everything in the modeling world is trial and error. Sometimes we have to run a few different tests to determine what works best for a specific application.


I'd like to know how to determine the specific number of cores to use for a run. From what I understood previously, the maximum shouldn't exceed the product of the first domain's grid dimensions divided by 25, i.e. (e_we/25) x (e_sn/25), and the minimum shouldn't be less than the product of the second domain's grid dimensions divided by 100, i.e. (e_we/100) x (e_sn/100).
You aren't wrong about what you learned previously, but that is a very rough rule of thumb. See Choosing an Appropriate Number of Processors, which mentions the rule you're referring to. When I choose the number of processors, I typically base it on the information in the first paragraph of that FAQ: I try to keep each domain dimension, divided by the number of processors in that direction, from dropping below about 10, while getting it as close to 10 as possible so that I can use as many processors as possible. This typically works well for me.
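As a hypothetical illustration (your real numbers come from e_we and e_sn in your namelist): a 500 x 400 grid would allow at most (500/25) x (400/25) = 20 x 16 = 320 processors and require at least (500/100) x (400/100) = 5 x 4 = 20; arithmetic like that is where my earlier "300+" estimate came from.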
 