
cache_text cluster consistently backlogged on purge requests
Closed, ResolvedPublic

Description

Purging queues are often backlogged in all datacenters, and in esams they are persistently backlogged. (Other locations tend to recover on their own, but esams is so loaded this doesn't happen there.)

image.png (682×1 px, 114 KB)

https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&from=now-7d&to=now&var-datasource=esams%20prometheus%2Fops&var-cluster=All&var-worker=Purger0&fullscreen&panelId=4

Work is ongoing in T249583: Create vhtcpd replacement to build and ship a newer, more scalable purged daemon to issue purges in parallel instead of serially. There's also T250205: Reduce rate of purges emitted by MediaWiki to lower the number of them, because as-is it's quite excessive.
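
For a sense of what "issue purges in parallel instead of serially" means in practice, here is a minimal, hypothetical Go sketch of a worker pool draining a purge queue concurrently. This is not the actual purged code; the worker count, cache endpoint, and URLs are made-up placeholders.

// Illustrative sketch only: issuing HTTP PURGE requests in parallel rather
// than one at a time. Not the real purged implementation; the endpoint,
// worker count and URLs below are placeholders.
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := make(chan string, 1024) // queue of URLs waiting to be purged
	var wg sync.WaitGroup

	const workers = 8 // hypothetical concurrency level
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{}
			for u := range urls {
				req, err := http.NewRequest("PURGE", u, nil)
				if err != nil {
					continue
				}
				resp, err := client.Do(req)
				if err != nil {
					fmt.Println("purge failed:", u, err)
					continue
				}
				resp.Body.Close()
			}
		}()
	}

	// Feed a couple of example URLs into the queue.
	urls <- "http://127.0.0.1:3128/wiki/Main_Page"
	urls <- "http://127.0.0.1:3128/static/images/project-logos/cswiki.png"
	close(urls)

	wg.Wait()
}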

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

@WDoranWMF The infrastructure side of things would probably fall under Traffic, but who would be the right maintainer of the MediaWiki maintenance scripts: your team (Platform Engineering), or Performance-Team?

WDoranWMF added subscribers: AMooney, Anomie.

@jcrespo Excellent question <- which is what people say when they aren't positive what the answer is. My reaction is that it likely should be Platform Engineering - the dev maintainers list does not have us as owners. I'm going to tag this task with CPT to triage and ask @AMooney and @Anomie to comment. We should update https://www.mediawiki.org/wiki/Developers/Maintainers to be us.

What is the severity for this task? It sounds like High but is it UBN?

It sounds like High but is it UBN?

I've done a very superficial triage, just pinging some teams to know who could help. Either @Urbanecm, releng, or traffic may know more about impact. Sorry I cannot be more useful.

Most likely the cause is that the Varnish rule for normalizing /static/ to the enwiki hostname hasn't been applied to ATS, and thus this isn't effectively purging /static from the ATS backend caches. Will check...

@WDoranWMF if BBlack is right, this may not need MW code changes; we should wait for that.

Yeah, the varnish (frontend) code for this is in modules/varnish/templates/text-frontend.inc.vcl.erb:

# normalize all /static to the same hostname for caching
if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }

# normalize all /w/static.php to the same wiki host for caching
# ignore urls without hash query as those are affected by multiversion
if (req.url ~ "^/w/(skins|resources|extensions)/.+\?[a-fA-F0-9]+$" ) {
        set req.http.host = "<%= @vcl_config.fetch("static_host") %>";
}

This code should apply to both PURGE and GET, normalizing both.

ATS doesn't appear to have similar code. However, now that I really think about this, it still should've worked in practice. Public requests would be normalized to the enwiki hostname before reaching ATS, and your purge to ATS also used the enwiki hostname...
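
To make the normalization concern concrete, here is a rough Go rendering of what the VCL above does: rewrite the Host of /static/ and hash-versioned /w/(skins|resources|extensions) requests to one canonical hostname, so that GETs and PURGEs land on the same cache key. This is only an illustration; the real logic lives in VCL (with the ATS-side equivalent being the open question here), and the static_host value used below is assumed.

// Illustration only: the host normalization the frontend VCL performs, so
// that /static/ GETs and PURGEs share one cache key. The real logic lives
// in VCL (and would need an ATS equivalent), not in Go.
package main

import (
	"fmt"
	"regexp"
)

var (
	staticPath    = regexp.MustCompile(`^/static/`)
	staticPHPPath = regexp.MustCompile(`^/w/(skins|resources|extensions)/.+\?[a-fA-F0-9]+$`)
)

// normalizeHost returns the hostname a request should be keyed under.
// staticHost corresponds to the VCL static_host setting; en.wikipedia.org
// is used here as an assumption.
func normalizeHost(host, url, staticHost string) string {
	if staticPath.MatchString(url) || staticPHPPath.MatchString(url) {
		return staticHost
	}
	return host
}

func main() {
	fmt.Println(normalizeHost("cs.wikipedia.org",
		"/static/images/project-logos/cswiki.png", "en.wikipedia.org"))
	// Output: en.wikipedia.org
}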

The maintenance script seems like it should be functioning, assuming any purging from MediaWiki is functioning (which I think we'd have heard about if it weren't). All the script does is take the input and pass it to the same CdnCacheUpdate class that is used by MediaWiki to send all purges.

Assuming that's even a correct URL to purge, I'd guess BBlack is probably right.

My reaction is that it likely should be Platform Engineering - the dev maintainers list does not have us as owners. I'm going to tag this task with CPT to triage and ask @AMooney and @Anomie to comment. We should update https://www.mediawiki.org/wiki/Developers/Maintainers to be us.

I'd be very wary of us claiming "all maintenance scripts". In general maintainership of a maintenance script should go with whatever component the maintenance script is maintaining IMO.

bblack@cumin1001:~$ sudo cumin A:cp-text 'curl -s https://en.wikipedia.org/static/images/project-logos/cswiki.png|md5sum'
36 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4027-4032].ulsfo.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                             
(8) cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet                                                                                        
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----                                                                                        
d67c283275f6441b49458077ea9e22ed  -                                                                                                                
===== NODE GROUP =====                                                                                                                             
(28) cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[4027-4032].ulsfo.wmnet                                                                                                                              
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----                                                                                        
de8be0dfd83481f739a05e837b054b90  -

Note that de8be0dfd83481f739a05e837b054b90 is the checksum of the new version, and d67c283275f6441b49458077ea9e22ed is the old one. So it appears that only esams has failed to purge this, while the other DCs are fine. Checking a little deeper...

So, the backend purging queues in esams are way behind. On the one node I'm staring at the most, there are currently about 87 million backlogged purge requests, which is probably somewhere in the ballpark of 10 hours of lag time. The backlog is in the local daemon on the esams hosts themselves (so this isn't a network issue with delivering the purges to the hosts over the WAN), so the culprit is likely the ATS daemon consuming them slowly.

This is text@eqiad GETs vs PURGEs over the past week. You can see GETs have the usual organic pattern, and PURGEs are fairly spiky, as we normally see.

2020-04-03-150927_1884x847_scrot.png (847×1 px, 370 KB)

Whereas with text@esams, we see a curious pattern in the PURGE traffic: it seems to be load-limited, somewhat recovering when organic traffic is low overnight:

2020-04-03-150943_1873x859_scrot.png (859×1 px, 211 KB)

<edited here>: Those purge rates are from the frontend varnish, but the frontend varnish PURGE queue doesn't receive entries until they've traversed the backend one, and the backend queue is backlogged; that's why we still see the effect there.
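
Put differently, purges traverse the backend (ats-be) queue before they ever reach the frontend (varnish-fe) one, so a backend backlog also depresses the frontend PURGE rate. A toy Go sketch of that chaining, with entirely invented rates and queue sizes, just to show the effect:

// Toy model of chained purge queues: the frontend queue is only fed by
// whatever the (slow) backend stage manages to drain. All numbers invented.
package main

import (
	"fmt"
	"time"
)

func main() {
	backendQueue := make(chan string, 1000)
	frontendQueue := make(chan string, 1000)

	// Backend stage: deliberately slow, simulating a loaded ats-be.
	go func() {
		for u := range backendQueue {
			time.Sleep(10 * time.Millisecond) // slow backend purge
			frontendQueue <- u                // only now does the frontend see it
		}
		close(frontendQueue)
	}()

	// Producer: purges arrive faster than the backend drains them.
	go func() {
		for i := 0; i < 100; i++ {
			backendQueue <- fmt.Sprintf("/wiki/Page_%d", i)
		}
		close(backendQueue)
	}()

	start := time.Now()
	n := 0
	for range frontendQueue {
		n++
	}
	fmt.Printf("frontend processed %d purges in %s\n", n, time.Since(start))
}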

It sounds like High but is it UBN?

I've done a very superficial triage, just pinging some teams to know who could help. Either @Urbanecm, releng, or traffic may know more about impact. Sorry I cannot be more useful.

The impact of this bug is that no one can update logos or other static resources (favicons and so on), because that requires being able to purge a URL.

I've been trying to purge https://wikitech.wikimedia.org/wiki/Incident_documentation dozens of times: by making null edits, by making action=purge submissions, and via mwscript purgeList.php. But nothing seems to evict it from the cache.

The logged-out view of that URL seems to be stuck at an old version, and using Minerva skin (T190384).

Logged out:
age: 52089
cache-control: s-maxage=86400, must-revalidate, max-age=0
content-length: 5993
content-type: text/html; charset=UTF-8
date: Fri, 03 Apr 2020 08:56:24 GMT
last-modified: Thu, 02 Apr 2020 23:40:01 GMT
server: labweb1001.wikimedia.org
server-timing: cache;desc="hit-front"
strict-transport-security: max-age=106384710; includeSubDomains; preload
vary: Accept-Encoding,Cookie,Authorization
x-cache: cp3060 hit, cp3062 hit/45
x-cache-status: hit-front

Logged in:
last-modified: Fri, 03 Apr 2020 23:27:21 GMT
x-cache: cp3052 miss, cp3050 pass
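
For anyone reproducing this kind of check, here is a small illustrative Go helper that fetches a URL and prints the cache-related response headers seen above. Header names match the dump; the program itself is not part of any Wikimedia tooling.

// Quick check of CDN cache headers for a URL, similar to the manual
// inspection above. Illustrative only.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	url := "https://wikitech.wikimedia.org/wiki/Incident_documentation"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	for _, h := range []string{"Age", "Last-Modified", "X-Cache", "X-Cache-Status", "Server-Timing"} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}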
Krinkle renamed this task from purgeList.php/HTCP purge doesn't seem to invalidate cache correctly to CDN cache not being purged (RESTBase/PCS, Wikitech, and purgeList.php). Apr 3 2020, 11:32 PM
Krinkle added subscribers: bearND, Pchelolo.
CDanis renamed this task from CDN cache not being purged (RESTBase/PCS, Wikitech, and purgeList.php) to esams cache_text cluster consistently backlogged on purge requests. Apr 4 2020, 3:52 PM

@CDanis @BBlack T169894 is likely the same (or similar) issue. https://en.wikipedia.org/wiki/User:GeoffreyT2000 says the user lives in California though...

Thanks. The reports from 2017 cannot be due to the same cause; however, the reports from other users starting March 15 this year are likely related.

Change 586111 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111

Change 586111 merged by CDanis:
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111

It isn't just esams that often has a backlog: looking at the past 10-20 minutes of monitoring data now available, codfw has some machines with 20 million entries of backlog, eqiad has some with 30 million entries, eqsin some with 40 million, and ulsfo some with 25 million. However, the other sites at least show a noticeable downward slope.

https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&var-datasource=esams%20prometheus%2Fops&fullscreen&panelId=2

Yes, I also think it's more than just esams, since the merged-in task I created (T249290) was about an experience on the east coast of the US. I very much doubt that my colleague there would have been served content by esams in Amsterdam. (OK, this is more anecdotal, and that particular purge delay has been resolved for this colleague by now.)

ema renamed this task from esams cache_text cluster consistently backlogged on purge requests to cache_text cluster consistently backlogged on purge requests. Apr 6 2020, 8:35 AM
ema moved this task from Backlog to Caching on the Traffic board.

Change 586390 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] disable rebound purges

https://gerrit.wikimedia.org/r/586390

We have discussed this during yesterday's Traffic meeting and the current plan to attack the issue is: (1) @BBlack tries to find the time to patch vhtcpd to support multiple TCP connections while (2) I work on T249583. Meanwhile, thanks @CDanis for working on this and for https://gerrit.wikimedia.org/r/586390.

I think this should have its priority increased to UBN. I'm receiving many reports from the community that purging doesn't work - even when the purge is MW-based. Since esams's case is worse, is it possible to...improve that :).

Change 586390 abandoned by CDanis:
reverse-proxy: Disable rebound purges

https://gerrit.wikimedia.org/r/586390

I think this should have its priority increased to UBN. I'm receiving many reports from the community that purging doesn't work - even when the purge is MW-based. Since esams's case is worse, is it possible to...improve that :).

The ones most likely to complain (editors with registered accounts) cannot see that their edits are not getting through, but there's a myriad of problems with logged-in users fixing things and logged-out readers not being able to see the fixed version (BLP issues and the like). Note also that even revision-deleted revisions remain visible (which is very much not ideal); I think this came up on enwiki VPT/ANI (I don't recall exactly).

The work on purged (T249583) is proceeding, and I think at this point I've fixed the dumbest bugs. The system is currently deployed on cp2029 (codfw) and cp3050 (esams). Long story kept short (for now): on both systems varnish-fe and ats-be are receiving and processing the same number of PURGEs, and there's no backlog in processing multicast HTCP packets.
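
For reference, both vhtcpd and purged consume purges as HTCP datagrams sent to a multicast group before forwarding PURGEs to the caches. Below is a minimal Go sketch of joining such a group and reading packets; the group address and port are assumptions for illustration, and the HTCP payload parsing is omitted.

// Minimal sketch of consuming multicast HTCP purge datagrams, which is what
// vhtcpd/purged do before forwarding PURGEs to the caches. The group address
// and port are placeholders; HTCP decoding is not shown.
package main

import (
	"fmt"
	"net"
)

func main() {
	group := &net.UDPAddr{IP: net.ParseIP("239.128.0.112"), Port: 4827}
	conn, err := net.ListenMulticastUDP("udp4", nil, group)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetReadBuffer(4 << 20) // large kernel buffer to absorb bursts

	buf := make([]byte, 1500)
	for {
		n, src, err := conn.ReadFromUDP(buf)
		if err != nil {
			fmt.Println("read error:", err)
			continue
		}
		// A real daemon would parse the HTCP packet here and enqueue the
		// URL it carries for purging.
		fmt.Printf("received %d-byte HTCP packet from %s\n", n, src)
	}
}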

If things are stable overnight I'm gonna switch more cache nodes to purged tomorrow, and we should hopefully begin to see improvements.

Also, @Joe and I have been looking at some of the PURGE traffic today and observed that we can probably reduce it quite a bit: T250261

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Edits_invisible_or_variably_visible_after_logging_out was the only VPT thread I know about. I just handled an OTRS ticket (Ticket#2020041510004397) about it too. Thank you to the developers and operations folk working on this.

Thank you @AntiCompositeNumber. During the EU night purged behaved well on cp3050:

Screenshot from 2020-04-16 08-50-04.png (1×2 px, 251 KB)

Also interesting is the PURGE traffic difference between cp3050 and cp3052:

Screenshot from 2020-04-16 08-50-55.png (1×2 px, 320 KB)

I am going to extend the deployment to the rest of text@esams now.

Change 589270 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: move text@esams to purged

https://gerrit.wikimedia.org/r/589270

Change 589270 merged by Ema:
[operations/puppet@production] cache: move text@esams to purged

https://gerrit.wikimedia.org/r/589270

Mentioned in SAL (#wikimedia-operations) [2020-04-16T09:17:04Z] <ema> text@esams: stop vhtcpd, start purged T249325

@Urbanecm, @AntiCompositeNumber: esams has now been running with the new system for the past few hours; the situation should have improved significantly for folks mapped there. Tomorrow I'll upgrade the other data centres too. Thanks!

Change 589540 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: use purged on cache_text

https://gerrit.wikimedia.org/r/589540

Change 589540 merged by Ema:
[operations/puppet@production] cache: use purged on cache_text

https://gerrit.wikimedia.org/r/589540

Mentioned in SAL (#wikimedia-operations) [2020-04-17T07:54:00Z] <ema> cache_text: puppet run to stop vhtcpd and start purged T249325

Thanks @ema!

ema claimed this task.

The cache_text cluster has now been running purged instead of vhtcpd for the past 10+ days. PURGE backlog is gone, closing.