Page MenuHomePhabricator

Queue consume failed with code 137 - (ERROR Killed) on the donations queue consumer and recurring queue consumer
Closed, ResolvedPublic

Description

We've noticed this since 2024-08-27 , it doesn't happen too often but when it does we completely lose the message and it doesn't get retried or go to the damaged message

wfan@frlog1002:/srv/archive/frlog1002/logs$ zgrep 'Donations queue consume failed with code 137' error-2024*.gz | wc -l
16
wfan@frlog1002:/srv/archive/frlog1002/logs$ zgrep 'Recurring queue consume failed with code 137' error-2024*.gz | wc -l
18
wfan@frlog1002:/srv/archive/frlog1002/logs$ zgrep 'Gravy audit parser failed with code 137' error-2025*.gz | wc -l
6

I don't see any of these type of failmails for other queues

Event Timeline

XenoRyet moved this task from Triage to Chaos Crew Backlog on the Fundraising-Backlog board.

Exit code 137 = 128 + 9 (SIGKILL signal) so it's more related with running out of memory (OOM) error, so what we can do is to log the memory usage

Change #1162074 had a related patch set uploaded (by Wfan; author: Wfan):

[wikimedia/fundraising/crm@master] Log memory usage for queue message

https://gerrit.wikimedia.org/r/1162074

Dwisehaupt subscribed.

Examples of timeframes for 2025:

Infrequent with donations queue consumer:

$ zgrep 'Donations queue consume failed with code 137' /srv/archive/frlog1002/logs/error-2025*.gz
/srv/archive/frlog1002/logs/error-20250117.gz:Jan 16 13:21:06 civi1002 run-job[3277061]: Donations queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250327.gz:Mar 26 15:21:12 civi1002 run-job[2493880]: Donations queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250610.gz:Jun  9 04:51:01 civi1002 run-job[1897359]: Donations queue consume failed with code 137

Occasional with recurring queue consumer:

$ zgrep 'Recurring queue consume failed with code 137' /srv/archive/frlog1002/logs/error-2025*.gz
/srv/archive/frlog1002/logs/error-20250101.gz:Dec 31 01:51:11 civi1002 run-job[342733]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250101.gz:Dec 31 03:21:04 civi1002 run-job[450234]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250123.gz:Jan 22 02:51:04 civi1002 run-job[2957366]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250123.gz:Jan 22 23:42:09 civi1002 run-job[23472]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250305.gz:Mar  4 17:51:16 civi1002 run-job[4110235]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250321.gz:Mar 20 09:51:14 civi1002 run-job[1837710]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250415.gz:Apr 14 19:51:12 civi1002 run-job[1330867]: Recurring queue consume failed with code 137
/srv/archive/frlog1002/logs/error-20250608.gz:Jun  7 06:21:09 civi1002 run-job[3314147]: Recurring queue consume failed with code 137

Almost daily for the audit parser:

$ zgrep 'Gravy audit parser failed with code 137' /srv/archive/frlog1002/logs/error-2025*.gz
/srv/archive/frlog1002/logs/error-20250607.gz:Jun  6 02:54:03 civi1002 run-job[1511437]: Gravy audit parser failed with code 137
/srv/archive/frlog1002/logs/error-20250608.gz:Jun  7 02:55:24 civi1002 run-job[3058140]: Gravy audit parser failed with code 137
/srv/archive/frlog1002/logs/error-20250609.gz:Jun  8 02:55:06 civi1002 run-job[307258]: Gravy audit parser failed with code 137
/srv/archive/frlog1002/logs/error-20250610.gz:Jun  9 02:56:46 civi1002 run-job[1745741]: Gravy audit parser failed with code 137
/srv/archive/frlog1002/logs/error-20250611.gz:Jun 10 02:54:32 civi1002 run-job[3189013]: Gravy audit parser failed with code 137
/srv/archive/frlog1002/logs/error-20250613.gz:Jun 12 02:55:19 civi1002 run-job[2217252]: Gravy audit parser failed with code 137

link to host metrics in grafana for this spate of jobs being culled: https://frmon.wikimedia.org/d/000000377/host-overview?orgId=1&from=2025-06-02T01:00:00.000Z&to=2025-06-15T23:59:59.000Z&timezone=utc&var-host=civi1002

Looking at an instance of the Gravy audit parser.

Edited for conciseness snippet from kernel log of oomkiller:

Jun  8 02:55:03 civi1002 kernel: [986192.967320] postqueue invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_sc
Jun  8 02:55:03 civi1002 kernel: [986192.967330] CPU: 1 PID: 336042 Comm: postqueue Not tainted 6.1.0-37-amd64 #1  Debian 6.1.140-1

Jun  8 02:55:03 civi1002 kernel: [986192.967572] Tasks state (memory values in pages):
Jun  8 02:55:03 civi1002 kernel: [986192.967572] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun  8 02:55:03 civi1002 kernel: [986192.967883] [ 277589]   205 277589    44631     2952   114688        0             0 run-job
Jun  8 02:55:03 civi1002 kernel: [986192.967886] [ 277634]   205 277634    63069     2954   118784        0             0 run-job
Jun  8 02:55:03 civi1002 kernel: [986192.967890] [ 278016]   205 278016     9039     3658   110592        0             0 python3
Jun  8 02:55:03 civi1002 kernel: [986192.967893] [ 307221]     0 307221     2506      141    57344        0             0 cron
Jun  8 02:55:03 civi1002 kernel: [986192.967896] [ 307240]   205 307240      645       24    40960        0             0 sh
Jun  8 02:55:03 civi1002 kernel: [986192.967899] [ 307258]   205 307258    63068     2953   122880        0             0 run-job
Jun  8 02:55:03 civi1002 kernel: [986192.967902] [ 307311]   205 307311      645       24    45056        0             0 wmf-cv
Jun  8 02:55:03 civi1002 kernel: [986192.967906] [ 307312]   205 307312     2832      152    61440        0             0 sudo
Jun  8 02:55:03 civi1002 kernel: [986192.967908] [ 307324]    33 307324 15711139 15626497 125485056        0             0 php
Jun  8 02:55:03 civi1002 kernel: [986192.967913] [ 325441]   113 325441    10764        1    69632      166             0 showq
Jun  8 02:55:03 civi1002 kernel: [986192.967916] [ 326204]   113 326204    10764       62    77824      104             0 pickup
Jun  8 02:55:03 civi1002 kernel: [986192.967920] [ 333084]     0 333084     2506      141    57344        0             0 cron
Jun  8 02:55:03 civi1002 kernel: [986192.967923] [ 333116]   205 333116      645       23    45056        0             0 sh
Jun  8 02:55:03 civi1002 kernel: [986192.967926] [ 333131]   205 333131    63068     2952   118784        0             0 run-job
Jun  8 02:55:03 civi1002 kernel: [986192.967929] [ 333197]   205 333197    36216     2438   233472        0             0 php
Jun  8 02:55:03 civi1002 kernel: [986192.967932] [ 333843]     0 333843     2506      141    57344        0             0 cron
Jun  8 02:55:03 civi1002 kernel: [986192.967936] [ 333862]   205 333862      645       22    45056        0             0 sh
Jun  8 02:55:03 civi1002 kernel: [986192.967939] [ 333875]   205 333875    63068     2957   118784        0             0 run-job
Jun  8 02:55:03 civi1002 kernel: [986192.967941] [ 333888]   205 333888      645       22    45056        0             0 drush
Jun  8 02:55:03 civi1002 kernel: [986192.967944] [ 333890]   205 333890     2832      151    65536        0             0 sudo
Jun  8 02:55:03 civi1002 kernel: [986192.967946] [ 333897]    33 333897    35668     1989   225280        0             0 php
Jun  8 02:55:03 civi1002 kernel: [986192.967949] [ 333909]    33 333909      645       23    49152        0             0 sh
Jun  8 02:55:03 civi1002 kernel: [986192.967952] [ 333910]    33 333910    78642    26191   454656        0             0 php

Jun  8 02:55:03 civi1002 kernel: [986192.968130] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=php,pid=307324,uid=33
Jun  8 02:55:03 civi1002 kernel: [986192.968147] Out of memory: Killed process 307324 (php) total-vm:62844556kB, anon-rss:62505988kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:122544kB oom_score_adj:0

You can see the process terminated 307324 is using many orders of magnitude more memory than the other standard run-job/drush/php jobs lauched at the same time. Since it is the largest memory user, it is picked as the process to terminate.

Is it possible that the audit parser could be altered to use memory more effectively?

I'll look into an instance of the recurring queue consumer next.

Also of note of that run on Jun 8, there are ~8682729 log line entries like this in the misc log:

$ zgrep "SmashPig: gravy_audit" /srv/archive/frlog1002/logs/fundraising-misc-20250609
Jun  8 02:10:41 civi1002 SmashPig: gravy_audit | Column name status not found! |  | 
Jun  8 02:10:41 civi1002 SmashPig: gravy_audit | Column name status not found! |  | 
Jun  8 02:10:41 civi1002 SmashPig: gravy_audit | Column name status not found! |  | 

Perhaps this caused the memory bloat which caused the process to be terminated.

Looking at an instance of the Gravy recurring queue consumer.

Edited for conciseness snippet from kernel log of oomkiller:

Jun  7 06:21:06 civi1002 kernel: [912158.235337] psad invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0 
Jun  7 06:21:06 civi1002 kernel: [912158.235344] CPU: 5 PID: 326784 Comm: psad Not tainted 6.1.0-37-amd64 #1  Debian 6.1.140-1

Jun  7 06:21:06 civi1002 kernel: [912158.235524] Tasks state (memory values in pages):
Jun  7 06:21:06 civi1002 kernel: [912158.235524] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun  7 06:21:06 civi1002 kernel: [912158.235747] [3314099]     0 3314099     2506      141    65536        1             0 cron
Jun  7 06:21:06 civi1002 kernel: [912158.235749] [3314100]     0 3314100     2506      141    65536        1             0 cron
Jun  7 06:21:06 civi1002 kernel: [912158.235750] [3314121]   205 3314121      645       23    45056        0             0 sh
Jun  7 06:21:06 civi1002 kernel: [912158.235752] [3314126]   205 3314126      645       23    45056        0             0 sh
Jun  7 06:21:06 civi1002 kernel: [912158.235754] [3314145]   205 3314145    63068     2952   122880        0             0 run-job
Jun  7 06:21:06 civi1002 kernel: [912158.235756] [3314147]   205 3314147    63068     2952   122880        0             0 run-job
Jun  7 06:21:06 civi1002 kernel: [912158.235758] [3314216]   205 3314216    36216     2418   237568        0             0 php
Jun  7 06:21:06 civi1002 kernel: [912158.235759] [3314246]   205 3314246      645       24    45056        0             0 wmf-cv
Jun  7 06:21:06 civi1002 kernel: [912158.235761] [3314252]   205 3314252     2832      152    61440        0             0 sudo
Jun  7 06:21:06 civi1002 kernel: [912158.235763] [3314267]    33 3314267 251767673 15184812 121987072        0             0 php
Jun  7 06:21:06 civi1002 kernel: [912158.235766] [3315064]     0 3315064     2506      141    65536        1             0 cron
Jun  7 06:21:06 civi1002 kernel: [912158.235767] [3315083]   205 3315083      645       22    45056        0             0 sh
Jun  7 06:21:06 civi1002 kernel: [912158.235769] [3315096]   205 3315096    63068     2951   114688        0             0 run-job
Jun  7 06:21:06 civi1002 kernel: [912158.235771] [3315120]   205 3315120      645       23    45056        0             0 drush
Jun  7 06:21:06 civi1002 kernel: [912158.235773] [3315122]   205 3315122     2832      151    61440        0             0 sudo
Jun  7 06:21:06 civi1002 kernel: [912158.235774] [3315132]    33 3315132    35668     2033   229376        0             0 php
Jun  7 06:21:06 civi1002 kernel: [912158.235776] [3315143]    33 3315143      645       24    45056        0             0 sh
Jun  7 06:21:06 civi1002 kernel: [912158.235778] [3315144]    33 3315144    68599    16619   380928        0             0 php

Jun  7 06:21:06 civi1002 kernel: [912158.235780] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=php,pid=3314267,uid=33
Jun  7 06:21:06 civi1002 kernel: [912158.235792] Out of memory: Killed process 3314267 (php) total-vm:1007070692kB, anon-rss:60739248kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:119128kB oom_score_adj:0

Just like in the previous case, you can see the process 3314267 terminated is using many orders of magnitude more memory than the other standard run-job/drush/php jobs lauched at the same time. Since it is the largest memory user, it is picked as the process to terminate.

Also of note, here are the logs related to that process id before the termination (with sensitive data lines removed):

$ grep 3314267 fundraising-misc-20250608 | egrep -v "(email|QueueConsumer array|recipient_address|subscriber id:)" 
Jun  7 06:20:01 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Executing: recurring {"queue_consumer":"recurring"} []
Jun  7 06:20:01 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: GBP -> USD {"old":"GBP","new":"USD"} []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: MYR -> USD {"old":"MYR","new":"USD"} []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: thank_you: Calling ThankYou::render [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: thank_you: Done ThankYou::render [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: thank_you: Creating CiviMail record [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: thank_you: Done creating CiviMail record [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:02 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: EUR -> USD {"old":"EUR","new":"USD"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Aborting DB transaction. [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.ERROR: wmf_common: Failure while processing message: MISSING_PREDECESSOR Recurring queue processed before donations queue Source: array ( ) {"message":"MISSING_PREDECESSOR Recurring queue processed before donations queue\nSource: array (\n)"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: JPY -> USD {"old":"JPY","new":"USD"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.DEBUG: wmf_civicrm: Successfully created contact: 67890282 {"id":67890282} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: JPY -> USD {"old":"JPY","new":"USD"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Committing DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: EUR -> USD {"old":"EUR","new":"USD"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.DEBUG: wmf_civicrm: Successfully created contact: 67890283 {"id":67890283} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Aborting DB transaction. [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.ERROR: wmf_common: Failure while processing message: MISSING_PREDECESSOR Recurring queue processed before donations queue Source: array ( ) {"message":"MISSING_PREDECESSOR Recurring queue processed before donations queue\nSource: array (\n)"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: Beginning DB transaction [] []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_civicrm: Converting to settlement currency: MYR -> USD {"old":"MYR","new":"USD"} []
Jun  7 06:20:21 civi1002 civicrm.wmf[3314267]: civicrm.wmf.INFO: wmf_common: Aborting DB transaction. [] []

What shows up in the error log for that PID are similar MISSING_PREDECESSOR errors:

$ zgrep 3314267 /srv/archive/frlog1002/logs/error-20250608.gz 
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.ERROR: wmf_common: Failure while processing message: MISSING_PREDECESSOR Recurring queue processed before donations queue Source: array ( ) {"message":"MISSING_PREDECESSOR Recurring queue processed before donations queue\nSource: array (\n)"} []
Jun  7 06:20:03 civi1002 civicrm.wmf[3314267]: civicrm.wmf.ERROR: wmf_common: Failure while processing message: MISSING_PREDECESSOR Recurring queue processed before donations queue Source: array ( ) {"message":"MISSING_PREDECESSOR Recurring queue processed before donations queue\nSource: array (\n)"} []
Jun  7 06:21:06 civi1002 kernel: [912158.235792] Out of memory: Killed process 3314267 (php) total-vm:1007070692kB, anon-rss:60739248kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:119128kB oom_score_adj:0

Perhaps it didn't handle the aborted DB connections well since there doesn't appear to be a lot else in that process.. *shrug*

And finally, for Donations queue consumer.

Edited for conciseness snippet from kernel log of oomkiller:

Jun  9 02:56:43 civi1002 kernel: [1072690.994074] php invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0 
Jun  9 02:56:43 civi1002 kernel: [1072690.994080] CPU: 15 PID: 29816 Comm: php Not tainted 6.1.0-37-amd64 #1  Debian 6.1.140-1

Jun  9 02:56:43 civi1002 kernel: [1072690.994253] Tasks state (memory values in pages):
Jun  9 02:56:43 civi1002 kernel: [1072690.994254] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jun  9 02:56:43 civi1002 kernel: [1072690.994463] [1716046]     0 1716046     2506       16    57344      125             0 cron
Jun  9 02:56:43 civi1002 kernel: [1072690.994465] [1716083]   205 1716083      645        1    49152       22             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994467] [1716097]   205 1716097    44631      534   110592     2418             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994469] [1716254]   205 1716254    63069     1061   114688     1892             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994471] [1716484]   205 1716484     9039     1704   110592     1954             0 python3
Jun  9 02:56:43 civi1002 kernel: [1072690.994473] [1745705]     0 1745705     2506       12    57344      129             0 cron
Jun  9 02:56:43 civi1002 kernel: [1072690.994475] [1745722]   205 1745722      645        0    45056       24             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994477] [1745741]   205 1745741    63068      212   122880     2740             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994479] [1745804]   205 1745804      645        2    49152       21             0 wmf-cv
Jun  9 02:56:43 civi1002 kernel: [1072690.994480] [1745805]   205 1745805     2832        5    61440      147             0 sudo
Jun  9 02:56:43 civi1002 kernel: [1072690.994482] [1745810]    33 1745810 15899043 15642758 127062016   181059             0 php
Jun  9 02:56:43 civi1002 kernel: [1072690.994486] [1771778]     0 1771778     2506      127    57344       14             0 cron
Jun  9 02:56:43 civi1002 kernel: [1072690.994488] [1771818]   205 1771818      645       23    40960        0             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994490] [1771828]   205 1771828    63068     2957   114688        0             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994492] [1771877]   205 1771877    36216     2438   233472        0             0 php
Jun  9 02:56:43 civi1002 kernel: [1072690.994494] [1772524]     0 1772524     2506      127    57344       14             0 cron
Jun  9 02:56:43 civi1002 kernel: [1072690.994495] [1772544]   205 1772544      645       23    49152        0             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994497] [1772557]   205 1772557    63068     2957   118784        0             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994499] [1772574]   205 1772574      645       23    45056        0             0 drush
Jun  9 02:56:43 civi1002 kernel: [1072690.994501] [1772576]   205 1772576     2832      152    65536        0             0 sudo
Jun  9 02:56:43 civi1002 kernel: [1072690.994502] [1772587]    33 1772587    35668     1987   221184        0             0 php
Jun  9 02:56:43 civi1002 kernel: [1072690.994504] [1772595]    33 1772595      645       24    40960        0             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994506] [1772596]    33 1772596    71789    19810   417792        0             0 php
Jun  9 02:56:43 civi1002 kernel: [1072690.994508] [1773214]   113 1773214    12596       30    90112      460             0 smtpd
Jun  9 02:56:43 civi1002 kernel: [1072690.994510] [1773216]   113 1773216    10764       37    81920      133             0 anvil
Jun  9 02:56:43 civi1002 kernel: [1072690.994511] [1773217]   113 1773217    10767      171    73728        0             0 trivial-rewrite
Jun  9 02:56:43 civi1002 kernel: [1072690.994513] [1773674]   113 1773674    10764      167    73728        0             0 showq
Jun  9 02:56:43 civi1002 kernel: [1072690.994515] [1775382]     0 1775382     2506      131    57344       12             0 cron
Jun  9 02:56:43 civi1002 kernel: [1072690.994517] [1775397]   205 1775397      645       23    45056        0             0 sh
Jun  9 02:56:43 civi1002 kernel: [1072690.994519] [1775414]   205 1775414    63068     2971   122880        0             0 run-job
Jun  9 02:56:43 civi1002 kernel: [1072690.994521] [1775440]   205 1775440      645       22    49152        0             0 wmf-cv
Jun  9 02:56:43 civi1002 kernel: [1072690.994523] [1775441]   205 1775441     2832      163    69632        0             0 sudo
Jun  9 02:56:43 civi1002 kernel: [1072690.994524] [1775467]    33 1775467    53016    19430   393216        0             0 php

Jun  9 02:56:43 civi1002 kernel: [1072690.994533] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=php,pid=1745810,uid=33 
Jun  9 02:56:43 civi1002 kernel: [1072690.994546] Out of memory: Killed process 1745810 (php) total-vm:63596172kB, anon-rss:62570948kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:124084kB oom_score_adj:0

This one is a little more interesting when it gets to the process logs. It looks like it is spewing a ton of messages at rsyslog which then rate limits it.

$ grepr 1745810 fundraising-misc-20250610
Jun  9 02:10:02 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Parsing /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_09-010201.csv [] []
Jun  9 02:10:03 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: 11430 results found in 0.5330638885498 seconds  [] []
Jun  9 02:10:03 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_09-010201.csvget missing [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Transactions [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: main|  total : 0    | found : 11430  (visa: 4797,mastercard: 6318,amex: 264,other: 3,diners_club: 3,discover: 8,: 37)     | missing: 0  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: refund|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: cancel|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: chargeback|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: 0 missing transactions (of a possible 11430) identified in 15.961488008499 seconds  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Moved /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_09-010201.csv to /var/spool/audit//gravy/completed//gravy_settlement_report_2025_06_09-010201.csv [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Parsing /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_07-010202.csv [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: 12719 results found in 0.59647107124329 seconds  [] []
Jun  9 02:10:19 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_07-010202.csvget missing [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Transactions [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: main|  total : 0    | found : 12717  (discover: 6,mastercard: 7398,visa: 4940,amex: 285,: 86,other: 1,diners_club: 1)     | missing: 2 (visa: 2) [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: refund|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: cancel|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: chargeback|  total : 0    | found : 0       | missing: 0  [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: 2 missing transactions (of a possible 12719) identified in 19.146283149719 seconds  [] []
Jun  9 02:10:38 civi1002 civicrm.wmf[1745810]: civicrm.wmf.NOTICE: Parsing /var/spool/audit//gravy/incoming//gravy_settlement_report_2025_06_06-010201.csv [] []
Jun  9 02:10:39 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php] from <civi1002:SmashPig>: begin to drop messages due to rate-limiting
Jun  9 02:10:44 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php]: 339380 messages lost due to rate-limiting (25000 allowed within 5 seconds)
Jun  9 02:10:44 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php] from <civi1002:SmashPig>: begin to drop messages due to rate-limiting
Jun  9 02:10:50 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php]: 404918 messages lost due to rate-limiting (25000 allowed within 5 seconds)

<--- cutting out an additional 688 lines --->

Jun  9 02:55:02 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php]: 376071 messages lost due to rate-limiting (25000 allowed within 5 seconds)
Jun  9 02:55:02 civi1002 rsyslogd: imuxsock[pid: 1745810, name: php] from <civi1002:SmashPig>: begin to drop messages due to rate-limiting

That is a sum total of 117939623 log lines. It seems a bit over the top to me to have over 10318 log lines per entry in the csv file. Not sure what went wrong there.

Only happen 4 times in the whole year 2025

zgrep 'Donations queue consume failed with code 137' error-20250*.gz | wc -l
4
let's close this ticket and reopen if happen more
error-20250725.gz-Jul 24 19:50:46 pfw1-eqiad xntpd[31101] NTP Server 10.3.0.6 is Unreachable
error-20250725.gz-Jul 24 19:51:06 civi1002 kernel: [5021451.139436] Out of memory: Killed process 3668504 (php) total-vm:1001857928kB, anon-rss:61280532kB, file-rss:0kB, shmem-rss:0kB, UID:33 pgtables:120172kB oom_score_adj:0
error-20250725.gz:Jul 24 19:51:10 civi1002 run-job[3668431]: Donations queue consume failed with code 137

error-20250610.gz-Jun 9 04:51:01 civi1002 opendkim[1023]: 40C019FDDF: key retrieval failed (s=s2048, d=yahoo.com): 's2048._domainkey.yahoo.com' query timed out
error-20250610.gz:Jun 9 04:51:01 civi1002 run-job[1897359]: Donations queue consume failed with code 137
error-20250610.gz-Jun 9 04:51:02 pay-lb1002 bird: bfd1: Socket error: Operation not permitted
error-20250610.gz-Jun 9 04:51:03 pfw1-codfw xntpd[16148] NTP Server 10.3.0.5 is Unreachable

Change #1187915 had a related patch set uploaded (by Wfan; author: Wfan):

[wikimedia/fundraising/SmashPig@master] Log queue message before consume to avoid message loss due to memory exceed

https://gerrit.wikimedia.org/r/1187915

Change #1162074 merged by jenkins-bot:

[wikimedia/fundraising/crm@master] Log queue message incase we have memory lost then message lost

https://gerrit.wikimedia.org/r/1162074

Change #1187915 merged by jenkins-bot:

[wikimedia/fundraising/SmashPig@master] Log queue message before consume to avoid message loss due to memory exceed

https://gerrit.wikimedia.org/r/1187915

XenoRyet set Final Story Points to 2.