
port 80 paging on scheduled single host maintenance in text@esams
Closed, Resolved, Public

Description

Yesterday we got a few pages involving port 80 after triggering a reboot for a kernel upgrade on cp3050:

08:30:27 <fabfur> !log rebooting cp3051 and cp3051 for kernel upgrade (T335835)
08:33:07: <jinxer-wm> (ProbeDown) firing: (3) Service text:80 has failed probes (http_text_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

This is the first time we have experienced this kind of issue while working on T335835, which suggests it could be related to text@esams getting more traffic than the other DCs.

Event Timeline

My log message contained a typo: should've been "rebooting cp3050 and cp3051 for kernel upgrade (T335835)"

During the issue text@esams never went higher than ~400 rps on port 80 per instance:

image.png (500×1 px, 97 KB)

By contrast, a few days ago text@esams handled spikes of almost 3k rps per instance:

image.png (951×1 px, 101 KB)

pybal on lvs3005 and lvs3007 didn't report any healthcheck failures during the issue (besides the expected ones for cp3050/cp3051, which were under maintenance at that moment).

For both IPv4 and IPv6, the alert reports "context deadline exceeded":

target=http://[91.198.174.192]:80/wiki/Special:BlankPage msg="Error for HTTP request" err="Get \"http://91.198.174.192:80/wiki/Special:BlankPage\": context deadline exceeded"
target=http://[2620:0:862:ed1a::1]:80/wiki/Special:BlankPage msg="Error for HTTP request" err="Get \"http://[2620:0:862:ed1a::1]:80/wiki/Special:BlankPage\": context deadline exceeded"

The pybal timeout for ProxyFetch is set to 5s, while the prometheus blackbox http probe times out at 3s. This could explain the gap mentioned on https://phabricator.wikimedia.org/T339898#8948287
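To illustrate the gap between the two timeouts, here is a minimal sketch. Only the 5s and 3s values come from this task; the function and constant names are hypothetical, and it assumes each checker simply fails once the response takes longer than its timeout.

```python
# Timeouts quoted in this task; the names below are hypothetical.
PYBAL_PROXYFETCH_TIMEOUT_S = 5.0
BLACKBOX_HTTP_TIMEOUT_S = 3.0


def check_outcomes(response_time_s):
    """Return, per checker, whether a response of this duration counts as healthy.

    Assumption: a check passes only if the response arrives before the timeout.
    """
    return {
        "pybal": response_time_s < PYBAL_PROXYFETCH_TIMEOUT_S,
        "blackbox": response_time_s < BLACKBOX_HTTP_TIMEOUT_S,
    }


if __name__ == "__main__":
    for seconds in (1.0, 4.0, 6.0):
        print(seconds, check_outcomes(seconds))
```

Under that assumption, any response that takes between 3s and 5s pages through the blackbox probe while pybal still considers the backend healthy.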

Change 931625 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] mtail: Track locally processed requests in cache::haproxy

https://gerrit.wikimedia.org/r/931625

vgutierrez@prometheus3002:~$ curl -w @- -o /dev/null --resolve www.wikipedia.org:80:91.198.174.192 -s http://www.wikipedia.org <<'EOF'
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    ----------\n
         time_total:  %{time_total}\n
EOF
    time_namelookup:  0.000096
       time_connect:  0.000552
    time_appconnect:  0.000000
   time_pretransfer:  0.000699
      time_redirect:  0.000000
 time_starttransfer:  0.000909
                    ----------
         time_total:  0.001033
vgutierrez@prometheus3002:~$ curl -w @- -o /dev/null --resolve www.wikipedia.org:80:2620:0:862:ed1a::1 -s http://www.wikipedia.org <<'EOF'
    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    ----------\n
         time_total:  %{time_total}\n
EOF
    time_namelookup:  0.000038
       time_connect:  0.000486
    time_appconnect:  0.000000
   time_pretransfer:  0.000541
      time_redirect:  0.000000
 time_starttransfer:  24.387219
                    ----------
         time_total:  24.387403

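prometheus3002 is apparently having some issues with IPv6: the same request that completes in about 1 ms over IPv4 takes over 24 s to return the first byte over IPv6.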
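As a rough way to read the two curl runs above, the sketch below parses the `-w` output and compares it against the 3s blackbox timeout mentioned earlier. The timing values are copied from the transcripts; the helper names are hypothetical, and the comparison is only indicative because curl fetched `http://www.wikipedia.org` rather than the probe's `/wiki/Special:BlankPage` target.

```python
BLACKBOX_HTTP_TIMEOUT_S = 3.0  # probe timeout quoted earlier in this task

# curl -w output copied from the two runs on prometheus3002.
IPV4_OUTPUT = """
    time_namelookup:  0.000096
       time_connect:  0.000552
    time_appconnect:  0.000000
   time_pretransfer:  0.000699
      time_redirect:  0.000000
 time_starttransfer:  0.000909
                    ----------
         time_total:  0.001033
"""

IPV6_OUTPUT = """
    time_namelookup:  0.000038
       time_connect:  0.000486
    time_appconnect:  0.000000
   time_pretransfer:  0.000541
      time_redirect:  0.000000
 time_starttransfer:  24.387219
                    ----------
         time_total:  24.387403
"""


def parse_timings(text):
    """Turn 'name:  value' lines into a dict of floats, skipping other lines."""
    timings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or the '----------' separator
        timings[name.strip()] = float(value)
    return timings


def server_wait(timings):
    """Seconds between the request being ready to send and the first response byte."""
    return timings["time_starttransfer"] - timings["time_pretransfer"]


def exceeds_probe_timeout(timings):
    """True if the whole request took longer than the blackbox probe allows."""
    return timings["time_total"] > BLACKBOX_HTTP_TIMEOUT_S


if __name__ == "__main__":
    for label, output in (("IPv4", IPV4_OUTPUT), ("IPv6", IPV6_OUTPUT)):
        timings = parse_timings(output)
        print(label, round(server_wait(timings), 6), exceeds_probe_timeout(timings))
```

The connect phase is fast in both runs, so nearly all of the IPv6 delay sits between sending the request and receiving the first byte.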

Change 931625 merged by Vgutierrez:

[operations/puppet@production] haproxy,mtail: Track locally processed requests in cache::haproxy

https://gerrit.wikimedia.org/r/931625

Change 931878 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Capture host header in port 80 as well

https://gerrit.wikimedia.org/r/931878

Change 931878 merged by Vgutierrez:

[operations/puppet@production] haproxy: Capture host header in port 80 as well

https://gerrit.wikimedia.org/r/931878

Change 931881 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Add X-Cache-Status for port 80 responses

https://gerrit.wikimedia.org/r/931881

Change 931881 merged by Vgutierrez:

[operations/puppet@production] haproxy: Add self-id headers for port 80 responses

https://gerrit.wikimedia.org/r/931881

Change 931891 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache: Capture X-Cache-Status on port 80 frontend

https://gerrit.wikimedia.org/r/931891

Change 931891 merged by Vgutierrez:

[operations/puppet@production] cache: Capture X-Cache-Status on port 80 frontend

https://gerrit.wikimedia.org/r/931891

Mentioned in SAL (#wikimedia-operations) [2023-06-21T10:58:38Z] <vgutierrez> re-enable puppet in A:cp - T339898

Change 931947 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Let set port 80 timeouts via hiera

https://gerrit.wikimedia.org/r/931947

Change 931948 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Set stricter timeouts for port 80 on esams

https://gerrit.wikimedia.org/r/931948

Change 931947 merged by Vgutierrez:

[operations/puppet@production] haproxy: Let set port 80 timeouts via hiera

https://gerrit.wikimedia.org/r/931947

Change 931948 merged by Vgutierrez:

[operations/puppet@production] hiera: Set stricter timeouts for port 80 on esams

https://gerrit.wikimedia.org/r/931948

Change 932173 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Tighten haproxy port 80 timeouts globally

https://gerrit.wikimedia.org/r/932173

Change 932174 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Set port 80 maxconns to 2000

https://gerrit.wikimedia.org/r/932174

Change 932173 merged by Vgutierrez:

[operations/puppet@production] hiera: Tighten haproxy port 80 timeouts globally

https://gerrit.wikimedia.org/r/932173

Mentioned in SAL (#wikimedia-operations) [2023-06-22T08:50:34Z] <vgutierrez> tighten HAProxy timeouts on port 80 globally - T339898

Change 932174 merged by Vgutierrez:

[operations/puppet@production] haproxy: Set port 80 maxconns to 2000

https://gerrit.wikimedia.org/r/932174

Mentioned in SAL (#wikimedia-operations) [2023-06-22T09:12:38Z] <vgutierrez> increasing maxconns to 2000 in haproxy for port 80 - T339898

Mitigated by tightening port 80 timeouts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/932173/1/hieradata/common/profile/cache/haproxy.yaml) and increasing maxconns on port 80 to 2000. We shouldn't increase this any further: we are already allocating 16000 sessions per DC and cluster to serve 80->443 redirections, which should only be used by 3% of users (and some bots).
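The arithmetic behind the 16000 figure can be sketched as follows. This assumes the per-DC, per-cluster total is simply the port 80 maxconns multiplied by the number of HAProxy instances; the instance count is derived here, not stated in this task, and the names are hypothetical.

```python
# Values quoted in this task.
MAXCONNS_PER_INSTANCE = 2000
SESSIONS_PER_DC_AND_CLUSTER = 16000

# Assumption: total sessions = maxconns per instance * number of instances.
implied_instances = SESSIONS_PER_DC_AND_CLUSTER // MAXCONNS_PER_INSTANCE


def total_sessions(maxconns, instances=implied_instances):
    """Sessions reserved for port 80 per DC and cluster under the assumption above."""
    return maxconns * instances


if __name__ == "__main__":
    print("implied instances per DC and cluster:", implied_instances)
    print("sessions at maxconns=2000:", total_sessions(2000))
    print("sessions if maxconns were doubled:", total_sessions(4000))
```

Under that assumption, every further increase in maxconns scales the reserved sessions by the same factor across all instances, which is why raising it again is hard to justify for redirect-only traffic.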