
Segmentation fault

kinguEnt

Member
Hello,

I have tried various suggestions related to a segmentation fault, but the problem is still not solved.
Is there anyone who can help me fix it?

Thanks,

Loading openmpi/4.1.6/gcc11.4.0-cuda12.3.2
Loading requirement: cuda/12.3.2 ucx/1.15.0/cuda12.3.2
starting wrf task 36 of 48
starting wrf task 2 of 48
starting wrf task 5 of 48
starting wrf task 11 of 48
starting wrf task 14 of 48
starting wrf task 39 of 48
starting wrf task 8 of 48
starting wrf task 10 of 48
starting wrf task 43 of 48
starting wrf task 3 of 48
starting wrf task 18 of 48
starting wrf task 6 of 48
starting wrf task 13 of 48
starting wrf task 20 of 48
starting wrf task 26 of 48
starting wrf task 37 of 48
starting wrf task 44 of 48
starting wrf task 42 of 48
starting wrf task 25 of 48
starting wrf task 7 of 48
starting wrf task 17 of 48
starting wrf task 41 of 48
starting wrf task 47 of 48
starting wrf task 34 of 48
starting wrf task 19 of 48
starting wrf task 22 of 48
starting wrf task 31 of 48
starting wrf task 46 of 48
starting wrf task 35 of 48
starting wrf task 27 of 48
starting wrf task 15 of 48
starting wrf task 29 of 48
starting wrf task 30 of 48
starting wrf task 45 of 48
starting wrf task 28 of 48
starting wrf task 38 of 48
starting wrf task 23 of 48
starting wrf task 24 of 48
starting wrf task 12 of 48
starting wrf task 0 of 48
starting wrf task 32 of 48
starting wrf task 40 of 48
starting wrf task 21 of 48
starting wrf task 9 of 48
starting wrf task 33 of 48
starting wrf task 1 of 48
starting wrf task 4 of 48
starting wrf task 16 of 48
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 24 with PID 0 on node bnode045 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
============================================================
Request ID: 440571.nqsv
Request Name: runwrf.sh
Queue: gpu@nqsv
Number of Jobs: 1
Created Request Time: Mon Oct 21 16:55:31 2024
Started Request Time: Mon Oct 21 16:56:04 2024
Ended Request Time: Mon Oct 21 18:36:57 2024
Resources Information:
Elapse: 6057S
Remaining Elapse: 80343S
============================================================
 

Attachments

  • namelist.input (4.9 KB)
  • rsl.error.zip (3.3 KB)
I can't say for sure that this is the problem, but given the size of your domain, you likely need to use more processors, possibly many more. See Choosing an Appropriate Number of Processors.

If you add many more processors and it still fails, please package all of your rsl files into another zipped file and share it. Also, make sure the issue isn't related to available disk space.
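As a rough illustration of the rule of thumb behind Choosing an Appropriate Number of Processors, the sketch below computes the suggested range of MPI task counts from a domain's grid dimensions: each rank should own a patch no smaller than about 25x25 and no larger than about 100x100 grid points. The e_we/e_sn values used here are hypothetical, not taken from the attached namelist.input.

```python
# Hedged sketch of the WRF processor-count rule of thumb.
# Assumption: patches between ~25x25 and ~100x100 grid points per rank.

def processor_bounds(e_we: int, e_sn: int) -> tuple[int, int]:
    """Return (min, max) suggested MPI task counts for a domain."""
    smallest = (e_we // 100) * (e_sn // 100)  # fewest ranks: ~100x100 patches
    largest = (e_we // 25) * (e_sn // 25)     # most ranks: ~25x25 patches
    return max(smallest, 1), largest

# Example with a hypothetical 425 x 300 domain:
lo, hi = processor_bounds(425, 300)
print(lo, hi)
```

If the 48 tasks used in the failing run fall below the lower bound for the actual domain, that by itself can cause crashes.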
 
Thank you, kwerner.

I agree it would be fine if I used more processors. Based on the guideline 'Choosing an Appropriate Number of Processors', I should use more than 74. Unfortunately, that's not possible at the moment.
Editing the following parameters in namelist.input seems to resolve the issue, likely because the feedback from the previous setup was not functioning properly:
&domains
feedback = 0,
&physics
cu_physics = 0,
!cu_rad_feedback = .true.,
!kfeta_trigger = 2,
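For anyone debugging a similar failure, a small helper like the sketch below can scan the per-rank rsl.error.* files for crash-related lines (segfault messages, Fortran runtime errors, CFL warnings), so only the relevant rank's log needs a close look. The filename pattern and search keywords are assumptions based on standard WRF output naming, not something confirmed in this thread.

```python
# Hypothetical helper: find which WRF rank logged a crash hint.
# Scans rsl.error.* files for common failure signatures.
import glob
import re

def find_crash_hints(pattern: str = "rsl.error.*"):
    """Return (filename, line) pairs matching common WRF failure messages."""
    hints = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            for line in fh:
                if re.search(r"SIGSEGV|Segmentation|forrtl|cfl", line, re.I):
                    hints.append((path, line.strip()))
    return hints
```

Since mpirun reported rank 24 as the one that received signal 11, its log (typically rsl.error.0024) is the first place to check.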
 