
Restarting fifo-log-demux should not restart nginx
Closed, Resolved (Public)

Description

Deploying a change to the systemd unit for fifo-log-demux caused nginx to be restarted on the ncredir servers:

# Pastebin apLf3fOA
2024-01-24T16:54:06.839601+00:00 ncredir3004 puppet-agent[2251389]: Applying configuration version '(796cfab564) BCornwall - fifo-log-demux: Update project homepage'            
2024-01-24T16:55:00.133063+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)                                                                                                                                                                                
2024-01-24T16:55:00.133252+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) --- /lib/systemd/system/fifo-log-demux@ncredir_access_log.service#0112023-10-11 17:47:40.903934095 +0000                                                                       
2024-01-24T16:55:00.133287+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) +++ /tmp/puppet-file20240124-2251389-7fuyuv#0112024-01-24 16:55:00.097526921 +0000                                                                                             
2024-01-24T16:55:00.133316+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) @@ -1,6 +1,6 @@                                                                                                                                                                
2024-01-24T16:55:00.133348+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)  [Unit]                                                                                                                                                                        
2024-01-24T16:55:00.133376+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)  Description=FIFO log demultiplexer (instance %i)                                                                                                                              
2024-01-24T16:55:00.133406+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) -Documentation=https://github.com/wikimedia/operations-software-fifo-log-demux                                                                                                 
2024-01-24T16:55:00.133437+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) +Documentation=https://gitlab.wikimedia.org/repos/sre/fifo-log-demux                                                                                                           
2024-01-24T16:55:00.133464+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)                                                                                                                                                                                
2024-01-24T16:55:00.133496+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)  [Service]                                                                                                                                                                     
2024-01-24T16:55:00.133524+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
)  Restart=always                                                                                                                                                                
2024-01-24T16:55:00.140571+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]/content
) content changed '{sha256}9f0f45003b5f1dcb297103e11340d05c8922d78f9153d99557f083be0b36a7b0' to '{sha256}1ffd4edfa051d095561a286d435302ebb68c2933311166b78bf82d735f9968ec'       
2024-01-24T16:55:00.141553+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/File[/lib/systemd/system/fifo-log-demux@ncredir_access_log.service]) Schedu
ling refresh of Exec[systemd daemon-reload for fifo-log-demux@ncredir_access_log.service (fifo-log-demux@ncredir_access_log)]                                                    
2024-01-24T16:55:00.446842+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/Exec[systemd daemon-reload for fifo-log-demux@ncredir_access_log.service (f
ifo-log-demux@ncredir_access_log)]) Triggered 'refresh' from 1 event                                                                                                             
2024-01-24T16:55:00.446925+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Systemd::Unit[fifo-log-demux@ncredir_access_log]/Exec[systemd daemon-reload for fifo-log-demux@ncredir_access_log.service (f
ifo-log-demux@ncredir_access_log)]) Scheduling refresh of Service[fifo-log-demux@ncredir_access_log]                                                                             
2024-01-24T16:55:00.598900+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Profile::Ncredir/Profile::Ncredir::Log[access_log]/Fifo_log_demux::Instance[ncredir_access_log]
/Systemd::Service[fifo-log-demux@ncredir_access_log]/Service[fifo-log-demux@ncredir_access_log]) Triggered 'refresh' from 1 event                                                
2024-01-24T16:55:05.789191+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Nginx/Service[nginx]/ensure) ensure changed 'stopped' to 'running' (corrective)                
2024-01-24T16:55:05.789974+00:00 ncredir3004 puppet-agent[2251389]: (/Stage[main]/Nginx/Service[nginx]) Unscheduling refresh on Service[nginx]                                   
2024-01-24T16:55:08.114384+00:00 ncredir3004 puppet-agent[2251389]: Applied catalog in 62.68 seconds

fifo-log-demux should be able to be restarted without bringing nginx down with it.

Details

Title | Reference | Author | Source Branch | Dest Branch
fifo-log-demux: Exit(70) on unable to spawn server | repos/sre/fifo-log-demux!1 | brett | T355905-exit-70-on-server-fail | main

Event Timeline

BCornwall changed the task status from Open to In Progress. (Jan 25 2024, 6:21 PM)
BCornwall triaged this task as Low priority.

Change 993804 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] ncredir: Set fifo_log_demux/nginx as wanted_by

https://gerrit.wikimedia.org/r/993804

@BBlack, @ssingh, @Vgutierrez: T284555 added a BindsTo= so that fifo-log-demux and nginx depend on each other. I just want to make sure that we do, in fact, want fifo-log-demux to have a soft dependency instead. With the combination of WantedBy= and automatic service restarts, fifo-log-demux *should* be fine if it goes down, even if it's not strictly bound to nginx.

IIRC that was done to smooth the reimage process and first puppet run on various roles using fifo-log-demux.

nginx needs something (fifo-log-demux) consuming log messages on the pipe to be able to process requests as expected, so fifo-log-demux can't be restarted automatically by puppet, nor can it trigger an nginx restart, because that effectively depools the node.

For cp nodes the picture is slightly better: ATS won't drop requests, but we lose tons of metrics if fifo-log-demux isn't there, so it's far from ideal.
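
For reference, the soft-dependency idea discussed above could look roughly like the unit fragment below. This is only a sketch under assumptions, not the content of the merged change: the unit name nginx.service and the commented-out BindsTo= line are illustrative, and the real template in operations/puppet may differ. The Description= and Restart= lines are taken from the diff in the task description.

```ini
# Hypothetical sketch of fifo-log-demux@.service; not the actual puppet template.
[Unit]
Description=FIFO log demultiplexer (instance %i)
# Hard coupling (the behaviour this task wants to avoid). With BindsTo=,
# stopping the bound unit also stops the unit that declares it, so a
# restart on one side can take the other side down:
# BindsTo=nginx.service

[Service]
# Bring the demultiplexer back on its own if it exits.
Restart=always

[Install]
# Soft dependency: enabling this unit adds a Wants= link from nginx.service
# (assumed unit name). Starting nginx then pulls in fifo-log-demux, but
# stopping or restarting fifo-log-demux is not propagated to nginx.
WantedBy=nginx.service
```

Note that this only addresses restart propagation between the units. It does not change the concern raised above that nginx needs a reader on the pipe while fifo-log-demux is down.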

Mentioned in SAL (#wikimedia-operations) [2024-03-04T20:34:21Z] <brett@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905

Mentioned in SAL (#wikimedia-operations) [2024-03-04T20:34:38Z] <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905

Mentioned in SAL (#wikimedia-operations) [2024-03-04T22:19:22Z] <brett@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905

Mentioned in SAL (#wikimedia-operations) [2024-03-04T22:19:28Z] <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp5025.eqsin.wmnet with reason: T355905

Mentioned in SAL (#wikimedia-operations) [2024-03-05T20:46:07Z] <brett> Start rolling out updated fifo-log-demux and configuration to A:cp and A:ncredir - T355905

Mentioned in SAL (#wikimedia-operations) [2024-03-05T20:46:27Z] <brett> Disable puppet on A:cp and A:ncredir - T355905

Change 993804 merged by BCornwall:

[operations/puppet@production] fifo-log-demux: Decouple service from nginx/ats

https://gerrit.wikimedia.org/r/993804