
wrfchem run stops after some time


goharali

Dear wrfchem forum members;

I am trying to run wrfchem on 56 nodes of a supercomputer. The simulation starts without any problem but stops after running for some time, without showing any error in "rsl.error.0000". On checking the output from the other nodes, I found the following error message on some of them (e.g. nodes 33 and 44):



-------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
wrf.exe 00000000043A168E for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B85C00AD370 Unknown Unknown Unknown
libpthread-2.17.s 00002B85C00AA213 pthread_spin_lock Unknown Unknown



I have tried different options to overcome this issue but have been unsuccessful so far.

Please help me resolve this problem. My namelist.input file is attached herewith.

Thanking you in anticipation.
Gohar Ali
 

Attachments

  • namelist.input
    8 KB · Views: 78
hi Gohar Ali,

Can you give us more information about the model setup? What region does your domain cover, and at what time does the simulation stop? Also, can you post the text from the rsl.error.0033 and rsl.error.0044 files that appears above the SIGTERM error you posted? Please also post the text from the rsl.out.0033 and rsl.out.0044 files, as these sometimes have extra information.

Could you also try configuring the model using "./configure -d", so that it is compiled with debugging flags on? Running with executables compiled this way should give you more information about where in the model code the crash occurs. If the rsl.error.* files contain more information when you run like that, could you post that too?

cheers,
Doug
 
hi Doug,

Thank you very much for your reply. My study domain is Pakistan. The rsl.error.0033 file is attached herewith.
I have been trying different options over the last few days. Today I ran the model with the same settings as the run before my first message on 26th April; the corresponding rsl.error.0000 and rsl.out.0000 files are also attached. This time I didn't get the error message on any of the nodes, but the simulation stalled after 5 days (2015-07-05_02:28:00). Although the job status on the server is "Running", no progress has been shown for the last 4 to 5 hours.

I look forward to hearing from you as soon as possible.
Thanking you in anticipation.

best regards
Gohar Ali

NB: The attached files are large, so if downloading them is a problem, please let me know and I will send only the last few lines of each file.
 

Attachments

  • rsl.error.0000_1-May.txt
    45.1 MB
  • rsl.out.0000_1-May.txt
    45.4 MB
  • rsl.error.0033_26-April.txt
    9.6 MB
hi Gohar Ali,

I suspect that your aerosol fields are causing the problem. Warnings such as this one (from your rsl.error.0000) show that you have layers of very concentrated aerosols in your model, which can cause the radiation routines to fail:
-------------------------
WARNING: Large total sw optical depth of 36.51 at point i,j,nb= 12 4 7
Diagnostics 1: k, tauaer300, tauaer400, tauaer600, tauaer999, tauaer
1 0.05 0.05 0.05 0.05 0.05
2 0.01 0.01 0.01 0.01 0.01
3 0.00 0.00 0.00 0.00 0.00
4 0.02 0.02 0.02 0.02 0.02
5 3.13 3.12 3.26 3.45 3.42
6 0.00 0.00 0.00 0.00 0.00
7 0.00 0.00 0.00 0.00 0.00
8 0.00 0.00 0.00 0.00 0.00
9 0.00 0.00 0.00 0.00 0.00
10 0.00 0.00 0.00 0.00 0.00
11 0.00 0.00 0.00 0.00 0.00
12 0.00 0.00 0.00 0.00 0.00
13 0.00 0.00 0.00 0.00 0.00
14 0.00 0.00 0.00 0.00 0.00
15 0.01 0.01 0.01 0.00 0.00
16 9.38 9.36 9.82 10.43 10.36
17 0.00 0.00 0.00 0.00 0.00
18 0.00 0.00 0.00 0.00 0.00
19 0.00 0.00 0.00 0.00 0.00
20 0.00 0.00 0.00 0.00 0.00
21 3.01 3.02 3.10 3.37 3.37
22 0.41 0.42 0.42 0.44 0.46
23 1.16 1.14 1.18 1.33 1.29
24 0.48 0.48 0.49 0.58 0.56
25 0.19 0.19 0.19 0.24 0.24
26 1.36 1.42 1.41 1.45 1.50
27 1.02 1.03 1.11 1.26 1.27
28 1.89 1.90 1.90 2.34 2.34
29 1.04 1.08 1.09 1.36 1.40
30 1.16 1.18 1.19 1.49 1.50
31 3.73 3.87 3.87 4.09 4.24
32 0.28 0.30 0.33 0.39 0.41
33 0.29 0.31 0.33 0.40 0.43
34 0.42 0.42 0.43 0.47 0.46
35 0.12 0.13 0.12 0.13 0.13
36 0.55 0.55 0.57 0.69 0.69
37 0.00 0.00 0.00 0.00 0.00
38 0.06 0.06 0.06 0.06 0.06
39 0.34 0.35 0.34 0.43 0.44
40 0.03 0.03 0.03 0.03 0.03
41 0.08 0.08 0.08 0.09 0.10
42 0.61 0.60 0.62 0.72 0.71
43 0.06 0.07 0.07 0.07 0.08
44 0.55 0.55 0.56 0.66 0.65
45 0.15 0.15 0.16 0.19 0.19
46 0.00 0.00 0.00 0.00 0.00
47 0.00 0.00 0.00 0.00 0.00
48 0.03 0.03 0.03 0.03 0.03
49 0.05 0.05 0.05 0.06 0.06

In this case, WRF coped with these aerosols and continued running a little longer, but aerosols in another patch may have caused it to fail later. (I believe that WRF can hang, without properly crashing and throwing an error, when a NaN is generated somewhere in the radiative calculations; the NaN then propagates to the rest of the domain and causes the MPI communications to fail.)
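To see whether other patches report the same kind of warning, the per-rank log files can be scanned automatically. The following is only a minimal Python sketch: it assumes the logs are named rsl.error.* and sit in the run directory, and that the warning line has exactly the format shown in the excerpt above. The function names are made up for illustration and are not part of WRF.

```python
import re
from pathlib import Path

# Matches the warning line format quoted in the log excerpt, e.g.
# "WARNING: Large total sw optical depth of 36.51 at point i,j,nb= 12 4 7"
WARNING_RE = re.compile(
    r"WARNING: Large total sw optical depth of\s+([0-9]*\.?[0-9]+)\s+"
    r"at point i,j,nb=\s*(\d+)\s+(\d+)\s+(\d+)"
)


def find_optical_depth_warnings(text):
    """Return a list of (optical_depth, i, j, nb) tuples found in `text`."""
    hits = []
    for m in WARNING_RE.finditer(text):
        hits.append(
            (float(m.group(1)), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        )
    return hits


def scan_rsl_files(run_dir="."):
    """Map each rsl.error.* file name to its largest reported optical depth."""
    worst = {}
    for path in sorted(Path(run_dir).glob("rsl.error.*")):
        hits = find_optical_depth_warnings(path.read_text(errors="replace"))
        if hits:
            # Tuples compare on their first element, the optical depth.
            worst[path.name] = max(hits)
    return worst


if __name__ == "__main__":
    for name, (tau, i, j, nb) in scan_rsl_files().items():
        print(f"{name}: max sw optical depth {tau:.2f} at i,j,nb = {i} {j} {nb}")
```

Run from the directory that holds the rsl files, this prints one line per rank that logged a warning, which may help narrow down the tiles where the aerosol load is largest.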

Your first test is to turn off aerosol radiative feedbacks - to see if that solves the problem for you. If it does then your problem is definitely your aerosols. Then you need to check individual aerosol sources. I suspect that your biomass burning emissions might be an issue - so you could try running without those. Alternatively it could be your boundary conditions - what are you using (MOZART or something else)?
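For the first test, the switch can be edited by hand in namelist.input or with a small script. The Python sketch below rests on an assumption: that the aerosol radiative feedback switch is the `aer_ra_feedback` entry in the `&chem` section, which should be checked against the user's guide for your WRF-Chem version. The helper name `set_namelist_option` is hypothetical, and the script only rewrites text; it does not validate the namelist.

```python
import re


def set_namelist_option(namelist_text, key, value):
    """Replace every per-domain value of `key` with `value`, keeping the entry count."""
    # Capture "key = " and then the values up to a "!" comment or the end of the line.
    pattern = re.compile(
        rf"^(\s*{re.escape(key)}\s*=\s*)([^!\n]*)",
        re.MULTILINE | re.IGNORECASE,
    )

    def _replace(match):
        old_values = [v for v in match.group(2).split(",") if v.strip()]
        new_values = ", ".join([str(value)] * max(len(old_values), 1))
        return f"{match.group(1)}{new_values},"

    new_text, count = pattern.subn(_replace, namelist_text)
    if count == 0:
        raise KeyError(f"{key} not found in namelist")
    return new_text


if __name__ == "__main__":
    # Illustrative two-domain fragment, not taken from the attached namelist.input.
    example = " &chem\n chem_opt = 2, 2,\n aer_ra_feedback = 1, 1,\n /\n"
    print(set_namelist_option(example, "aer_ra_feedback", 0))
```

Keeping one value per domain matters because WRF namelists list these options once per nested domain; writing to a copy of namelist.input makes it easy to switch the feedback back on for the follow-up tests.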

cheers,
Doug
 
hi Doug,
Your suggestions worked for me. Turning off the radiative feedbacks solved the problem. I then turned them back on and removed the "gocart background data", as a first attempt to check the aerosol sources, and the simulation ran without any problem.

Thank you very much once again.

best regards
Gohar Ali
 