Page MenuHomePhabricator

ats-tls-restart failed on cp4027
Closed, ResolvedPublic

Description

I've upgraded trafficserver to 8.0.5-1wm10 on cp4027, and restarted successfully the backend with ats-backend-restart.

After that, I've tried restarting the tls instance too, but the command failed:

root@cp4027:~# ats-tls-restart
ulsfo/cache_text/nginx/cp4027.ulsfo.wmnet: pooled changed yes => no
trafficserver-tls.service is not active, cannot reload.

Relevant logs at that time:

Nov 05 15:22:46 cp4027 traffic_manager[14370]: [TrafficManager] ==> Cleaning up and reissuing signal #15
Nov 05 15:22:46 cp4027 traffic_manager[14370]: [Nov  5 15:22:46.723] {0x7feefdd394c0} NOTE: [TrafficManager] ==> Cleaning up and reissuing signal #15
Nov 05 15:22:46 cp4027 systemd[1]: Stopping trafficserver-tls.service...
Nov 05 15:22:47 cp4027 traffic_manager[14370]: imum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:10:22.024] {0x2ae813131700} NOTE: Skipping the current log entry for tls.pipe because its size (19376) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:13:42.905] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (20448) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:18:05.269] {0x2ae813131700} NOTE: Skipping the current log entry for tls.pipe because its size (19696) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:18:18.764] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (9360) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:18:42.663] {0x2ae813131700} NOTE: Skipping the current log entry for tls.pipe because its size (20352) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:18:57.064] {0x2ae8f84e7700} NOTE: Skipping the current log entry for tls.pipe because its size (9664) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:21:18.585] {0x2ae8f82a3700} NOTE: Skipping the current log entry for tls.pipe because its size (9504) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:54:37.035] {0x2ae836d31700} NOTE: Skipping the current log entry for tls.pipe because its size (9056) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 14:57:51.016] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (9024) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:07:38.073] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (9360) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:09:07.323] {0x2ae837a49700} NOTE: Skipping the current log entry for tls.pipe because its size (9136) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:14:07.339] {0x2ae8fb55f700} NOTE: loading SSL certificate configuration from /srv/trafficserver/tls/etc/ssl_multicert.config
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:14:07.339] {0x2ae8fb55f700} NOTE: successfully initialized stapling for oscerts/localcerts/globalsign-2018-ecdsa-unified.chained.crt into SSL_CTX: 0x2ae96
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:14:07.340] {0x2ae8fb55f700} NOTE: successfully initialized stapling for oscerts/localcerts/globalsign-2018-rsa-unified.chained.crt into SSL_CTX: 0x2ae9640
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:14:07.341] {0x2ae8fb55f700} NOTE: ssl_multicert.config done reloading!
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:17:31.300] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (19040) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:18:09.122] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (19696) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:18:47.431] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (20032) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:18:53.413] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (20352) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:19:01.262] {0x2ae813131700} NOTE: Skipping the current log entry for tls.pipe because its size (18928) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:19:01.318] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (11168) exceeds the maximum payload space in a log buffer
Nov 05 15:22:47 cp4027 traffic_manager[14370]: [Nov  5 15:20:35.659] {0x2ae835647700} NOTE: Skipping the current log entry for tls.pipe because its size (19920) exceeds the maximum payload space in a log buffer
Nov 05 15:22:48 cp4027 traffic_manager[14370]: [TrafficManager] ==> signal #15
Nov 05 15:22:48 cp4027 traffic_manager[14370]: [Nov  5 15:22:48.082] {0x7feefdd394c0} NOTE: [TrafficManager] ==> signal #15
Nov 05 15:22:48 cp4027 systemd[1]: Stopped trafficserver-tls.service.
Nov 05 15:22:48 cp4027 systemd[1]: Starting trafficserver-tls.service...
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: touch: cannot touch '/srv/trafficserver/tls/etc/ssl_multicert.config': Read-only file system
Nov 05 15:22:50 cp4027 systemd[1]: trafficserver-tls.service: Unit cannot be reloaded because it is inactive.
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: run-parts: /etc/update-ocsp.d/hooks/trafficserver-tls-ocsp exited with return code 99
Nov 05 15:22:50 cp4027 traffic_manager[24947]: NOTE: using command line path as RUNROOT
Nov 05 15:22:50 cp4027 traffic_manager[24947]: NOTE: using command line path as RUNROOT
Nov 05 15:22:50 cp4027 traffic_manager[24947]: [E. Mgmt] log ==> [TrafficManager] using root directory '/srv/trafficserver/tls'
Nov 05 15:22:50 cp4027 traffic_manager[24947]: [Nov  5 15:22:50.222] {0x7f112f40f4c0} NOTE: updated diags config
Nov 05 15:22:50 cp4027 traffic_manager[24947]: [Nov  5 15:22:50.225] {0x7f112f40f4c0} NOTE: [LocalManager::listenForProxy] Listening on port: 443 (ipv4)
Nov 05 15:22:50 cp4027 traffic_manager[24947]: [Nov  5 15:22:50.226] {0x7f112f40f4c0} NOTE: [LocalManager::listenForProxy] Listening on port: 443 (ipv6)
Nov 05 15:22:50 cp4027 traffic_manager[24947]: [Nov  5 15:22:50.226] {0x7f112f40f4c0} NOTE: [TrafficManager] Setup complete
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.227] {0x7f112f40f4c0} NOTE: [ProxyStateSet] Traffic Server Args: ' -M'
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.227] {0x7f112f40f4c0} NOTE: [LocalManager::listenForProxy] Listening on port: 443 (ipv4)
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.227] {0x7f112f40f4c0} NOTE: [LocalManager::listenForProxy] Listening on port: 443 (ipv6)
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.227] {0x7f112f40f4c0} NOTE: [LocalManager::startProxy] Launching ts process
Nov 05 15:22:51 cp4027 traffic_manager[24947]: NOTE: using command line path as RUNROOT
Nov 05 15:22:51 cp4027 traffic_server[24961]: NOTE: using command line path as RUNROOT
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.258] {0x7f112f40f4c0} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '12'
Nov 05 15:22:51 cp4027 traffic_manager[24947]: [Nov  5 15:22:51.258] {0x7f112f40f4c0} NOTE: [Alarms::signalAlarm] Server Process born
Nov 05 15:22:53 cp4027 traffic_server[24961]: NOTE: --- traffic_server Starting ---
Nov 05 15:22:53 cp4027 traffic_server[24961]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 8.0.5 - (build # 110413 on Nov  4 2019 at 13:30:22)
Nov 05 15:22:53 cp4027 traffic_manager[24947]: [Nov  5 15:22:53.554] {0x2b76da00f180} ERROR: [CacheVolition] Can not open /srv/trafficserver/tls/etc/volume.config file : No such file or directory

The restart looked ok, so I've pooled the service manually with:

confctl --host --quiet select name=cp4027.ulsfo.wmnet,service=nginx set/pooled=yes

Event Timeline

ema triaged this task as Medium priority.Nov 5 2019, 3:35 PM
ema added a project: Traffic.

this is a known issue caused by update-ocsp-all

Nov 05 15:22:48 cp4027 systemd[1]: Starting trafficserver-tls.service...
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: touch: cannot touch '/srv/trafficserver/tls/etc/ssl_multicert.config': Read-only file system
Nov 05 15:22:50 cp4027 systemd[1]: trafficserver-tls.service: Unit cannot be reloaded because it is inactive.
Nov 05 15:22:50 cp4027 update-ocsp-all[24836]: run-parts: /etc/update-ocsp.d/hooks/trafficserver-tls-ocsp exited with return code 99

I fixed the touch issue with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550475/, but still update-ocsp-all tries to reload trafficserver-tls.service, which fails due to the unit being inactive. @Vgutierrez: ideas for how to tackle this?

Change 550650 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Avoid reloading trafficserver-tls if the service is being restarted

https://gerrit.wikimedia.org/r/550650

Change 550650 merged by Vgutierrez:
[operations/puppet@production] ATS: Avoid reloading trafficserver-tls if the service is being restarted

https://gerrit.wikimedia.org/r/550650

Mentioned in SAL (#wikimedia-operations) [2019-11-13T10:53:21Z] <vgutierrez> Testing ats-tls-restart on cp5007 - T237425

Vgutierrez claimed this task.

This has been fixed with the conditional reload:

vgutierrez@cp5007:~$ sudo -i ats-tls-restart
eqsin/cache_text/nginx/cp5007.eqsin.wmnet: pooled changed yes => no
eqsin/cache_text/nginx/cp5007.eqsin.wmnet: pooled changed no => yes