
puppet restarts nginx instead of reloading it on ncredir servers
Closed, Resolved, Public

Description

Today Icinga registered a TLS handshake failure on ncredir2001:

09:15:07 PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
09:16:40 RECOVERY - HTTPS non-canonical-redirect-1 on ncredir2001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 431002 seconds left:Certificate wikipedia.com valid until 2019-10-29 08:00:32 +0000 (expires in 36 days) https://wikitech.wikimedia.org/wiki/Ncredir

The ncredir2001 logs show that the alert coincides with an OCSP response renewal, which triggered an nginx restart instead of a reload:

Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/ec-prime256v1.ocsp
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-1]/File[/etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/ec-prime256v1.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/ec-prime256v1.ocsp to puppet with sum 9ac96d01c6688ace70cd0be528291fff
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-1]/File[/etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/ec-prime256v1.ocsp]/content) content changed '{md5}9ac96d01c6688ace70cd0be528291fff' to '{md5}4fc57adcea379a06dbf9aac20c5050f1'
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/rsa-2048.ocsp
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-1]/File[/etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/rsa-2048.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/rsa-2048.ocsp to puppet with sum 14540e85d398a196864e574480bd4451
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-1]/File[/etc/acmecerts/non-canonical-redirect-1/1dbc98ba20d04ec5ae501e6c1bdddeda/rsa-2048.ocsp]/content) content changed '{md5}14540e85d398a196864e574480bd4451' to '{md5}597b43371d3b5b1d86e4fc6571dab04b'
Sep 22 09:14:55 ncredir2001 puppet-agent[15013]: (/etc/acmecerts/non-canonical-redirect-1) Scheduling refresh of Service[nginx]
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/ec-prime256v1.ocsp
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-3]/File[/etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/ec-prime256v1.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/ec-prime256v1.ocsp to puppet with sum 2b55f10204b5b61a475b652f73131b3c
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-3]/File[/etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/ec-prime256v1.ocsp]/content) content changed '{md5}2b55f10204b5b61a475b652f73131b3c' to '{md5}4727e58309653f5361e5a33892e8d8c8'
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/rsa-2048.ocsp
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-3]/File[/etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/rsa-2048.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/rsa-2048.ocsp to puppet with sum 409d400c1504f00f79acf48cf9326527
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-3]/File[/etc/acmecerts/non-canonical-redirect-3/3e8e91962d894366ab26da256de2d54b/rsa-2048.ocsp]/content) content changed '{md5}409d400c1504f00f79acf48cf9326527' to '{md5}b2f7d27ef3a8c6c3f308ba62488036c9'
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/etc/acmecerts/non-canonical-redirect-3) Scheduling refresh of Service[nginx]
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/ec-prime256v1.ocsp
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-4]/File[/etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/ec-prime256v1.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/ec-prime256v1.ocsp to puppet with sum a36d818a125242918919014daa2df6a0
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-4]/File[/etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/ec-prime256v1.ocsp]/content) content changed '{md5}a36d818a125242918919014daa2df6a0' to '{md5}893c8b5d5b0fddb76d23c8b69052ee9f'
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/rsa-2048.ocsp
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-4]/File[/etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/rsa-2048.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/rsa-2048.ocsp to puppet with sum 02c90e09010000f523731badbf29a2d2
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-4]/File[/etc/acmecerts/non-canonical-redirect-4/4bc0798dc32b45cd92129ac1f15d7988/rsa-2048.ocsp]/content) content changed '{md5}02c90e09010000f523731badbf29a2d2' to '{md5}ba52b0eb2da8b4a8e102765b5568f2be'
Sep 22 09:14:56 ncredir2001 puppet-agent[15013]: (/etc/acmecerts/non-canonical-redirect-4) Scheduling refresh of Service[nginx]
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/ec-prime256v1.ocsp
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-5]/File[/etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/ec-prime256v1.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/ec-prime256v1.ocsp to puppet with sum 0af36c366e6cfd8e7e5ab0cabaa0798a
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-5]/File[/etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/ec-prime256v1.ocsp]/content) content changed '{md5}0af36c366e6cfd8e7e5ab0cabaa0798a' to '{md5}a116b616b44d8be95ee8955c7487ff63'
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: Computing checksum on file /etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/rsa-2048.ocsp
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-5]/File[/etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/rsa-2048.ocsp]) Filebucketed /etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/rsa-2048.ocsp to puppet with sum 81118f19d556fcd424210abdd9bf0d47
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: (/Stage[main]/Profile::Ncredir/Acme_chief::Cert[non-canonical-redirect-5]/File[/etc/acmecerts/non-canonical-redirect-5/f78e9c1310ce48e29ec058e32cb3a515/rsa-2048.ocsp]/content) content changed '{md5}81118f19d556fcd424210abdd9bf0d47' to '{md5}29c4554ba4305a34accf7553576f9305'
Sep 22 09:14:57 ncredir2001 puppet-agent[15013]: (/etc/acmecerts/non-canonical-redirect-5) Scheduling refresh of Service[nginx]
Sep 22 09:15:00 ncredir2001 systemd[1]: Stopping A high performance web server and a reverse proxy server...
Sep 22 09:15:05 ncredir2001 systemd[1]: Stopped A high performance web server and a reverse proxy server.
Sep 22 09:15:05 ncredir2001 systemd[1]: Starting A high performance web server and a reverse proxy server...
Sep 22 09:15:05 ncredir2001 systemd[1]: Started A high performance web server and a reverse proxy server.
Sep 22 09:15:05 ncredir2001 puppet-agent[15013]: (/Stage[main]/Nginx/Service[nginx]) Triggered 'refresh' from 4 events
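The log matches Puppet's default refresh behaviour for a `service` resource: every changed resource that notifies `Service[nginx]` schedules a refresh, Puppet coalesces them (here "4 events") into a single refresh, and for a service that refresh is a restart. Systemd therefore stops nginx for about five seconds, which is when Icinga saw "Connection refused". A minimal sketch of that pattern, assuming hypothetical paths and a made-up `source` URL rather than the real acme_chief manifests:

```puppet
# Hypothetical illustration of the problematic pattern; not the actual acme_chief code.
service { 'nginx':
  ensure => running,
}

# An OCSP staple managed by Puppet (path and source are invented for this sketch).
file { '/etc/acmecerts/example/live/rsa-2048.ocsp':
  ensure => file,
  source => 'puppet:///modules/example/rsa-2048.ocsp',
  # Any content change sends a refresh to Service[nginx]; Puppet handles a
  # service refresh as a restart, so nginx briefly stops accepting connections.
  notify => Service['nginx'],
}
```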

Event Timeline

Restricted Application added a project: Operations. Sun, Sep 22, 11:55 AM
Restricted Application added a subscriber: Aklapper.
Vgutierrez triaged this task as Normal priority. Mon, Sep 23, 7:52 AM

Change 538567 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] acme_chief: Allow specifying a custom resource to get notified on cert updates

https://gerrit.wikimedia.org/r/538567

Change 538568 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ncredir: Notify Exec[nginx-reload] instead of Service[nginx] on TLS material changes

https://gerrit.wikimedia.org/r/538568

Change 538567 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Allow specifying a custom resource to get notified on cert updates

https://gerrit.wikimedia.org/r/538567

Change 538568 merged by Vgutierrez:
[operations/puppet@production] ncredir: Notify Exec[nginx-reload] instead of Service[nginx] on cert changes

https://gerrit.wikimedia.org/r/538568
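The fix in change 538568 routes the notification to `Exec[nginx-reload]` instead of `Service[nginx]`. The parameter that change 538567 added to `acme_chief::cert` and the real definition of `Exec[nginx-reload]` are not shown in this task, so the following is only a sketch of the general reload pattern, using a plain `file` resource, a hypothetical path and source URL, and an assumed `systemctl reload` command:

```puppet
# Hypothetical sketch of the reload pattern; the production Exec['nginx-reload']
# may be defined differently.
exec { 'nginx-reload':
  command     => '/bin/systemctl reload nginx',
  refreshonly => true,  # runs only when notified, never on an ordinary agent run
}

file { '/etc/acmecerts/example/live/rsa-2048.ocsp':
  ensure => file,
  source => 'puppet:///modules/example/rsa-2048.ocsp',
  # Notifying the Exec triggers a graceful reload: the nginx master re-reads its
  # configuration and TLS material and replaces workers without closing the
  # listening sockets.
  notify => Exec['nginx-reload'],
}
```

Puppet's `service` type also accepts a `restart` attribute, but setting it would turn every refresh of `Service[nginx]` into a reload, including changes that genuinely need a restart; a separate refresh-only Exec keeps the two paths distinct.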

Change 538573 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] acme_chief: Simplify notification handling

https://gerrit.wikimedia.org/r/538573

Change 538573 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Simplify notification handling on acme_chief::cert

https://gerrit.wikimedia.org/r/538573

Change 538574 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] install_server: Reload nginx instead of restarting it on cert updates

https://gerrit.wikimedia.org/r/538574

Vgutierrez closed this task as Resolved. Mon, Sep 23, 9:12 AM
Vgutierrez claimed this task.

Change 538574 merged by Vgutierrez:
[operations/puppet@production] install_server: Reload nginx instead of restarting it on cert updates

https://gerrit.wikimedia.org/r/538574