
Increased latency, timeouts from wikifeeds since March 10th
Closed, Resolved · Public

Assigned To: Eevans
Authored By: Joe, Mar 21 2024, 8:58 AM

Description

Since March 10th, the REST gateway has been seeing an increased rate of timeouts from wikifeeds:

image.png (1×3 px, 489 KB)

which, upon inspection, all seem to come from the "featured feed" endpoints.

Specifically, it looks like the "domain_v1_aggregated_featured" p99 latency went from around 10 seconds to 1 minute (!!!)

image.png (1×3 px, 229 KB)

and so did "domain_v1_feed_onthisday_selected":

image.png (1×3 px, 200 KB)

This is also reflected in the overall average latency of the service:

image.png (1×3 px, 480 KB)

We already tried adding resources to the service and a rolling restart; neither had any effect on the timeouts or the latencies.

These p99 latencies are hardly acceptable for a service like wikifeeds. Any idea what happened on that date?

I'm tagging the Content-Transform-Team, as they should know the status of this.

Event Timeline

Joe triaged this task as High priority. Mar 21 2024, 8:58 AM

Note also the increase in RX (but not really TX) traffic that coincides with these:

Screenshot from 2024-03-21 11-59-27.png (1×1 px, 217 KB)

Also, this is only visible in eqiad.

This issue has paged SREs outside of business hours three times today for elevated 5xx errors (at 17:11, 18:07 and 19:36 UTC).

I created a silence (976da07f-44cd-4582-9c3f-247ec252364d) for ATSBackendErrorsHigh and backend rest-gateway (there is no label for wikifeeds) until Monday morning.
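
For reference, a silence like the one above can also be created programmatically via the Alertmanager v2 API. This is a minimal Node sketch, not the exact command used: the Alertmanager URL, duration, author and comment are assumptions; only the matchers (alertname=ATSBackendErrorsHigh, backend=rest-gateway) come from the comment above.

// Minimal sketch (assumptions: ALERTMANAGER_URL, duration, author, comment).
// Matchers mirror the alert mentioned above; requires Node 18+ for global fetch.
const ALERTMANAGER_URL = process.env.ALERTMANAGER_URL; // e.g. http://alertmanager.example:9093 (assumption)

async function createSilence() {
    const now = new Date();
    const end = new Date(now.getTime() + 3 * 24 * 60 * 60 * 1000); // roughly "until Monday morning"

    const res = await fetch(`${ALERTMANAGER_URL}/api/v2/silences`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({
            matchers: [
                { name: 'alertname', value: 'ATSBackendErrorsHigh', isRegex: false },
                { name: 'backend', value: 'rest-gateway', isRegex: false },
            ],
            startsAt: now.toISOString(),
            endsAt: end.toISOString(),
            createdBy: 'sre-oncall',
            comment: 'T360597: wikifeeds latency/timeouts, silencing until Monday',
        }),
    });
    console.log(await res.json()); // Alertmanager responds with { silenceID: ... }
}

createSilence();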

Change #1013948 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] Alert separately for api gateway backend errors

https://gerrit.wikimedia.org/r/1013948

Change #1013948 merged by jenkins-bot:

[operations/alerts@master] Alert separately for api gateway backend errors

https://gerrit.wikimedia.org/r/1013948

From the logs I think there are 2 things to investigate:

  • What happened since ~10th March?
  • Why was traffic completely dropped on the 19th of March in codfw?
    • Could it be related to the datacenter switchover?
    • Meanwhile, eqiad kept receiving traffic before and after that time.

codfw:

image.png (583×2 px, 335 KB)

eqiad:

image.png (591×2 px, 547 KB)

Turnilo says that this is mostly being caused by clients using the mobile apps, various versions. Logstash makes it clear that Wikifeeds itself is timing out when talking to restbase. The URL pattern in that query shows an oddity - request.url shows the featured link (or similar) but the failing internalURI is almost always an image. I can't find a reference for it, but I'm fairly sure we've seen this timeout-upon-image-request behaviour in restbase before, but with changeprop.

However, a much more direct impact related to this might be the restbase hosts pooled by @Eevans within a few minutes of the error rate spiking. There might be some kind of config drift, or this could be related to T360548. Puppet is disabled on these hosts, so I won't risk playing around with them.

> Why was traffic completely dropped on the 19th of March in codfw? Could it be related to the datacenter switchover?

It looks like wikifeeds was pooled in both DCs before the switchover and is now only pooled in eqiad. Concentrating all of the errors in one place is what caused the pages to start firing, but the issue at hand isn't related to the switchover, as it was happening long before.

I was trying to see if there is a correlation between this issue and switching parsoid over from restbase to MW core, but it doesn't look like they are related.

From the Logstash URLs, as @hnowlan pointed out, it looks like the main cause of the timeouts is outgoing requests to /<domain>/v1/page/summary/<title>, and more specifically to media files (svg, png, webm).
From restbase:

jgiannelos@deploy1002:~$ curl -v restbase.svc.codfw.wmnet:7233/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg

< HTTP/1.1 302
< location: /commons.wikimedia.org/v1/page/summary/File%3ACleopatra_poster.jpg

From one of the wikifeeds containers:

curl http://localhost:6503/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg -v

< HTTP/1.1 302 Found
< location: https://commons.wikimedia.org/api/rest_v1/page/summary/File%3ACleopatra_poster.jpg

Here the location is the public URL, which is unreachable from the internal network of the wikifeeds container.

It looks like this patch was not deployed using scap:
https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1009842

The last deployment via scap is:

Author: Yiannis Giannelos <jgiannelos@wikimedia.org>
Date:   Mon Feb 12 18:58:35 2024 +0200

    parsoid: Default to storage disabled

    Change-Id: I4c1eb5c57eb706efb72b5d2992b6d15c068ea8d6

But meanwhile new nodes were pooled:
https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-03-10

That means the restbase nodes are not all in the same state. Here is an example of the failing behaviour (a public URL causing a wikifeeds timeout):
https://phabricator.wikimedia.org/P58908

Some nodes are responding with the public URL on commons.wikimedia.org and some with the relative URL of the restbase node.
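
For illustration, a quick way to see which nodes return which form of redirect is to hit each host directly and print the Location header. A minimal Node sketch; the host range (1019-1042) and the per-host port 7233 are assumptions based on the pool and the curl above:

// Minimal sketch: print each eqiad restbase node's 302 Location for the example
// summary request, to separate "public URL" nodes from "relative URL" nodes.
// Assumptions: host range 1019-1042 and port 7233 (as in the curl above).
const http = require('node:http');

const path = '/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg';

for (let n = 1019; n <= 1042; n++) {
    const host = `restbase${n}.eqiad.wmnet`;
    http.get({ host, port: 7233, path }, (res) => {
        // http.get does not follow redirects, so the Location header is visible here.
        console.log(`${host}: ${res.statusCode} ${res.headers.location || ''}`);
        res.resume(); // drain the body so the socket is released
    }).on('error', (err) => console.log(`${host}: ${err.message}`));
}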

> However, a much more direct impact related to this might be the restbase hosts pooled by @Eevans within a few minutes of the error rate spiking. There might be some kind of config drift...

There does seem to be a difference between deployments (even if that difference should be a "noop"):

restbase1028
eevans@restbase1028:~$ sudo -u deploy-service git -C /srv/deployment/restbase/deploy log -n 1
commit e5ed8d0f95671701df291f786f4c0972d2e72142 (HEAD, tag: scap/sync/2024-02-19/0011, tag: scap/sync/2024-02-19/0010, tag: scap/sync/2024-02-19/0009, tag: scap/sync/2024-02-19/0008, tag: scap/sync/2024-02-19/0007, tag: scap/sync/2024-02-19/0006, tag: scap/sync/2024-02-19/0005, tag: scap/sync/2024-02-19/0004, tag: scap/sync/2024-02-19/0003, tag: scap/sync/2024-02-19/0002)
Author: Yiannis Giannelos <jgiannelos@wikimedia.org>
Date:   Mon Feb 12 18:58:35 2024 +0200

    parsoid: Default to storage disabled
    
    Change-Id: I4c1eb5c57eb706efb72b5d2992b6d15c068ea8d6
eevans@restbase1028:~$
restbase1042
eevans@restbase1042:/srv/deployment/restbase/deploy$ sudo -u deploy-service git -C /srv/deployment/restbase/deploy log -n 1
commit 7e5e72087d8331131669babfb8f40b269c024cd7 (HEAD, tag: scap/sync/2024-02-19/0011, origin/master, origin/HEAD, master)
Author: Yiannis Giannelos <jgiannelos@wikimedia.org>
Date:   Mon Feb 19 17:22:59 2024 +0100

    noop: Force new deployment
    
    Change-Id: I6461f403e758d4fd40ee7e97adfe88d10cd151f9
eevans@restbase1042:/srv/deployment/restbase/deploy$

> ...or this could be related to T360548. Puppet is disabled on these hosts, so I won't risk playing around with them.

This seems unlikely. That issue is resulting in a very low error rate (~5-7 errors per minute), and has not been constant since the 10th. For example, the most recent decommission completed on Friday and the corresponding errors subsided then, but the wikifeeds latency is unabated.

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:11:30Z] <urandom> depooling restbase10[34-42] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:36:13Z] <urandom> pooling restbase10[19-21] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:47:23Z] <urandom> pooling restbase10[31-33] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T16:47:32Z] <urandom> correction: depooling restbase10[31-33] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T17:06:17Z] <urandom> restarting restbase service, restbase1024 — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T17:09:05Z] <urandom> restarting restbase service, restbase1031 — T360597

It looks like errors/latency have stabilized after depooling some nodes:

image.png (1×3 px, 507 KB)

image.png (1×3 px, 639 KB)

The restbase deployments are somewhat out of sync, with HEAD (eqiad) looking like:

restbase1019.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1020.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1021.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1022.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1023.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1024.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1025.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1026.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1027.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1028.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1029.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1030.eqiad.wmnet: commit e5ed8d0f95671701df291f786f4c0972d2e72142
restbase1031.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1032.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1033.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1034.eqiad.wmnet: commit 897fc7e8e18544f04473ab7c965fa9eaf5fc9576
restbase1035.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1036.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1037.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1038.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1039.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1040.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1041.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7
restbase1042.eqiad.wmnet: commit 7e5e72087d8331131669babfb8f40b269c024cd7

The hosts with e5ed8d0f95671701df291f786f4c0972d2e72142 deployed correspond with the expected response in P58908; the others do not. After adjusting the list of pooled hosts to match this, the latency and error rate returned to normal:

image.png (835×1 px, 179 KB)

image.png (835×1 px, 165 KB)

After experimenting with targeted deployments, both to hosts that correlate with the correct behavior and to those that do not, the problem doesn't seem to be tied to a particular git commit (not entirely surprising, since there weren't any normative changes between the versions).

At present the situation is stable, but precarious, as it depends on having the right set of hosts pooled or depooled. The current pool state in eqiad:

# This file is managed by puppet.
# This is a golang template (http://golang.org/pkg/text/template/)
# for creating an external node pool reference file
#
# This file is generated from the etcd directories:
#  ["/pools/eqiad/restbase/restbase-ssl/"]

{ 'host': 'restbase1019.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1020.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1021.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1022.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1023.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1024.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1025.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1026.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1027.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1028.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1029.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1030.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'restbase1031.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1032.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1033.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1034.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1035.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1036.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1037.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1038.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1039.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1040.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1041.eqiad.wmnet', 'weight':10, 'enabled': False }
{ 'host': 'restbase1042.eqiad.wmnet', 'weight':10, 'enabled': False }

From IRC:

1:13 PM <nemo-yiannis> so apparently the 404 is expected behaviour for "cache-control: no-cache" when we have a file like the one on the example request
1:13 PM <nemo-yiannis> in order to get the internal path (aka relative) you need to be an internal request
1:14 PM <nemo-yiannis> i am curious if the bad nodes have a different config that the routing doesn't look like "internal"
1:16 PM <nemo-yiannis> Here is the logic: https://gerrit.wikimedia.org/g/mediawiki/services/change-propagation/deploy/+/464d3bcc59e68764be3e3765e5511e2e3eff4559/node_modules/hyperswitch/lib/server.js#287
1:18 PM <nemo-yiannis> For example a trace request from wikifeeds is tagged as external here: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.13?id=tbezdo4BWjhRzdxngVqZ
1:19 PM <nemo-yiannis> The client ip is: `2620:0:861:101:10:64:0:165`
1:21 PM <nemo-yiannis> which doesn't match the regex restbase uses to detect the local ip thus all the failures
1:21 PM <nemo-yiannis> Are the new (bad) nodes having something different in their network setup ?
1:28 PM <nemo-yiannis> This is an example of a request that worked properly (aka internal): https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.13?id=uVvVdo4B1Aouzw__lGAA
1:29 PM <nemo-yiannis> the client ip matches the regex to detect for local ip

Some internal requests seem to have an IPv4 source IP, and some an IPv6. Based on the logic RESTBase uses, it's not going to match the latter, and thus it returns an external URL for them.

if (/^::1|(?:::ffff:)?(?:10|127)\./.test(remoteAddr)) {
    if (req.headers['cache-control'] && /no-cache/i.test(req.headers['cache-control'])) {
        reqOpts.metrics = opts.child_metrics.internal_update;
        req.headers['x-request-class'] = 'internal_update';
    } else {
        reqOpts.metrics = opts.child_metrics.internal;
        req.headers['x-request-class'] = 'internal';
    }
}
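
For illustration, here is how that pattern classifies the client addresses quoted in this task (a minimal standalone check; the IPv4 samples are illustrative addresses from the 10/8 range shown in the `host` output further down):

// Standalone check of the hyperswitch "internal" pattern against sample addresses.
// Only the native IPv6 address from the failing trace fails to match.
const internalRe = /^::1|(?:::ffff:)?(?:10|127)\./;

const samples = [
    '10.64.0.100',                 // illustrative IPv4 from the 10/8 range: internal
    '::ffff:10.64.0.165',          // IPv4-mapped form: internal
    '2620:0:861:101:10:64:0:165',  // native IPv6 source from the failing trace: external
    '::1',                         // loopback: internal
];

for (const addr of samples) {
    console.log(addr, '=>', internalRe.test(addr) ? 'internal' : 'external');
}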

So the root cause looks like the following: the new nodes that serve the absolute redirects causing the incident must have a different network setup.

> Some internal requests seem to have an IPv4 source IP, and some an IPv6. Based on the logic RESTBase uses, it's not going to match the latter, and thus it returns an external URL for them.

I think the answer here will be to extend the pattern to match IPv6 as well, but let's have @cmooney or @ayounsi weigh in.
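
For illustration only (the fix ultimately applied was removing the AAAA records, per the follow-up below), one way the check could be extended is to also accept the production IPv6 prefix seen in the failing trace. The 2620:0:86x prefix range used here is an assumption and would need SRE confirmation:

// Sketch only, not a deployed change: treat the production IPv6 prefix
// (assumed here to be 2620:0:860::/46, which covers the 2620:0:861:... address above)
// as internal, alongside the existing loopback/IPv4 cases.
const internalRe = /^(?:::1|(?:::ffff:)?(?:10|127)\.|2620:0:86[0-3]:)/;

console.log(internalRe.test('2620:0:861:101:10:64:0:165')); // true  (internal)
console.log(internalRe.test('10.64.0.100'));                // true  (internal)
console.log(internalRe.test('2001:db8::1'));                // false (external)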

What changed here is that some hosts have been deployed with IPv6 DNS records:

eevans@restbase1019:~$ host restbase1031
restbase1031.eqiad.wmnet has address 10.64.0.165
restbase1031.eqiad.wmnet has IPv6 address 2620:0:861:101:10:64:0:165

vs.

eevans@restbase1019:~$ host restbase1019
restbase1019.eqiad.wmnet has address 10.64.0.100

Clients that find an IPv6 address will use it, which restbase has been incorrectly interpreting to mean "external".

We've had a small number of these around for a while, so I suspect the errors were there but did not rise to the point where we noticed them. With new hosts going online, and older ones being depooled, the number of hosts returning an external URL redirect increased significantly.

Mentioned in SAL (#wikimedia-operations) [2024-03-25T20:11:35Z] <urandom> pool restbase10[31-33] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T20:33:54Z] <urandom> pool restbase10[34-42] — T360597

Mentioned in SAL (#wikimedia-operations) [2024-03-25T20:45:52Z] <urandom> depooling restbase10[19-21].eqiad.wmnet — T360597

Eevans claimed this task.

The IPv6 DNS records for all restbase hosts have been removed (thanks @cmooney!), the hosts have been re-pooled, and error rates and latency appear normal. We'll need to make sure that any new restbase hosts that go online in the future are not provisioned with these records, at least until the restbase service has been removed from the cluster.
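
A small check like the following could help catch that during provisioning. A minimal Node sketch; the host range is taken from the pool list above and would need adjusting for future hosts:

// Minimal sketch: list restbase hosts that still resolve to an AAAA record.
// Host range 1019-1042 comes from the pool list above; adjust for new hosts.
const dns = require('node:dns').promises;

async function main() {
    for (let n = 1019; n <= 1042; n++) {
        const host = `restbase${n}.eqiad.wmnet`;
        try {
            const addrs = await dns.resolve6(host);
            console.log(`${host} has AAAA record(s): ${addrs.join(', ')}`);
        } catch (err) {
            // ENODATA / ENOTFOUND: no AAAA record, which is the desired state here.
        }
    }
}

main();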

Closing; feel free to reopen if something here is still unresolved.