
cache_text cluster consistently backlogged on purge requests
Closed, ResolvedPublic

Description

Purging queues are often backlogged in all datacenters, and in esams they are persistently backlogged. (Other locations tend to recover on their own, but esams is so loaded this doesn't happen there.)

image.png (682×1 px, 114 KB)

https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&from=now-7d&to=now&var-datasource=esams%20prometheus%2Fops&var-cluster=All&var-worker=Purger0&fullscreen&panelId=4

Work is ongoing in T249583: Create vhtcpd replacement to build and ship a newer, more scalable purged daemon to issue purges in parallel instead of serially. There's also T250205: Reduce rate of purges emitted by MediaWiki to lower the number of them, because as-is it's quite excessive.
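
For a sense of what "issue purges in parallel instead of serially" means in practice, here is a minimal, hypothetical Go sketch of a worker pool draining a purge queue concurrently. This is not the actual purged code; the worker count, cache endpoint, and URLs are made-up placeholders.

// Illustrative sketch only: issuing HTTP PURGE requests in parallel rather
// than one at a time. Not the real purged implementation; the endpoint,
// worker count and URLs below are placeholders.
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := make(chan string, 1024) // queue of URLs waiting to be purged
	var wg sync.WaitGroup

	const workers = 8 // hypothetical concurrency level
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{}
			for u := range urls {
				req, err := http.NewRequest("PURGE", u, nil)
				if err != nil {
					continue
				}
				resp, err := client.Do(req)
				if err != nil {
					fmt.Println("purge failed:", u, err)
					continue
				}
				resp.Body.Close()
			}
		}()
	}

	// Feed a couple of example URLs into the queue.
	urls <- "http://127.0.0.1:3128/wiki/Main_Page"
	urls <- "http://127.0.0.1:3128/static/images/project-logos/cswiki.png"
	close(urls)

	wg.Wait()
}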

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

@WDoranWMF The infrastructure side of things would probably fall under Traffic, but who would be the right maintainer of the MediaWiki maintenance scripts: your team (Platform Engineering), or Performance-Team?

WDoranWMF added subscribers: AMooney, Anomie.

@jcrespo Excellent question <- which is what people say when they aren't positive what the answer is. My reaction is that it likely should be Platform Engineering - the dev maintainers list does not have us as owners. I'm going to tag this task with CPT to triage and ask @AMooney and @Anomie to comment. We should update https://www.mediawiki.org/wiki/Developers/Maintainers to be us.

What is the severity for this task? It sounds like High but is it UBN?

It sounds like High but is it UBN?

I've done a very superficial triage, just pinging some teams to know who could help. Either @Urbanecm, releng, or traffic may know more about impact. Sorry I cannot be more useful.

Most likely the cause is that the Varnish rule for normalizing /static/ to the enwiki hostname hasn't been applied to ATS, and thus this isn't effectively purging /static from the ATS backend caches. Will check...

@WDoranWMF if BBlack is right, this may not need MW code changes; we should wait for that.

Yeah, the varnish (frontend) code for this is in modules/varnish/templates/text-frontend.inc.vcl.erb:

# normalize all /static to the same hostname for caching
if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }

# normalize all /w/static.php to the same wiki host for caching
# ignore urls without hash query as those are affected by multiversion
if (req.url ~ "^/w/(skins|resources|extensions)/.+\?[a-fA-F0-9]+$" ) {
        set req.http.host = "<%= @vcl_config.fetch("static_host") %>";
}

This code should apply to both PURGE and GET, normalizing both.

ATS doesn't appear to have similar code. However, now that I really think about this, it still should've worked in practice. Public requests would be normalized to the enwiki hostname before reaching ATS, and your purge to ATS also used the enwiki hostname...
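
To make the normalization concern concrete, here is a rough Go rendering of what the VCL above does: rewrite the Host of /static/ and hash-versioned /w/(skins|resources|extensions) requests to one canonical hostname, so that GETs and PURGEs land on the same cache key. This is only an illustration; the real logic lives in VCL (with the ATS-side equivalent being the open question here), and the static_host value used below is assumed.

// Illustration only: the host normalization the frontend VCL performs, so
// that /static/ GETs and PURGEs share one cache key. The real logic lives
// in VCL (and would need an ATS equivalent), not in Go.
package main

import (
	"fmt"
	"regexp"
)

var (
	staticPath    = regexp.MustCompile(`^/static/`)
	staticPHPPath = regexp.MustCompile(`^/w/(skins|resources|extensions)/.+\?[a-fA-F0-9]+$`)
)

// normalizeHost returns the hostname a request should be keyed under.
// staticHost corresponds to the VCL static_host setting; en.wikipedia.org
// is used here as an assumption.
func normalizeHost(host, url, staticHost string) string {
	if staticPath.MatchString(url) || staticPHPPath.MatchString(url) {
		return staticHost
	}
	return host
}

func main() {
	fmt.Println(normalizeHost("cs.wikipedia.org",
		"/static/images/project-logos/cswiki.png", "en.wikipedia.org"))
	// Output: en.wikipedia.org
}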

The maintenance script seems like it should be functioning, assuming any purging from MediaWiki is functioning (which I think we'd have heard about if it weren't). All the script does is take the input and pass it to the same CdnCacheUpdate class that is used by MediaWiki to send all purges.

Assuming that's even a correct URL to purge, I'd guess BBlack is probably right.

My reaction is that it likely should be Platform Engineering - the dev maintainers list does not have us as owners. I'm going to tag this task with CPT to triage and ask @AMooney and @Anomie to comment. We should update https://www.mediawiki.org/wiki/Developers/Maintainers to be us.

I'd be very wary of us claiming "all maintenance scripts". In general maintainership of a maintenance script should go with whatever component the maintenance script is maintaining IMO.

bblack@cumin1001:~$ sudo cumin A:cp-text 'curl -s https://en.wikipedia.org/static/images/project-logos/cswiki.png|md5sum'
36 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4027-4032].ulsfo.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                             
(8) cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet                                                                                        
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----                                                                                        
d67c283275f6441b49458077ea9e22ed  -                                                                                                                
===== NODE GROUP =====                                                                                                                             
(28) cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[4027-4032].ulsfo.wmnet                                                                                                                              
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----                                                                                        
de8be0dfd83481f739a05e837b054b90  -

Note that de8be0dfd83481f739a05e837b054b90 is the checksum of the new version, and d67c283275f6441b49458077ea9e22ed is the old one. So it appears that only esams has failed to purge this, while the other DCs are fine. Checking a little deeper...

So, the backend purging queues in esams are way behind. On the one node I'm staring at the most, there are currently about 87 million backlogged purge requests, which is probably somewhere in the ballpark of 10 hours of lag time. The backlog is in the local daemon on the esams hosts themselves (so this isn't a network issue with delivering the purges to the hosts over the WAN), so the culprit is likely the ATS daemon consuming them slowly.

This is text@eqiad GETs vs PURGEs over the past week. You can see GETs have the usual organic pattern, and PURGEs are fairly spiky, as we normally see.

2020-04-03-150927_1884x847_scrot.png (847×1 px, 370 KB)

Whereas with text@esams, we see a curious pattern in the PURGE traffic: it seems to be load-limited, somewhat recovering when organic traffic is low overnight:

2020-04-03-150943_1873x859_scrot.png (859×1 px, 211 KB)

<edited here>: Those purge rates are from the frontend varnish, but the frontend varnish PURGE queue doesn't receive entries until they've traversed the backend one, and the backend queue is backlogged; that's why we still see the effect there.
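
Put differently, purges traverse the backend (ats-be) queue before they ever reach the frontend (varnish-fe) one, so a backend backlog also depresses the frontend PURGE rate. A toy Go sketch of that chaining, with entirely invented rates and queue sizes, just to show the effect:

// Toy model of chained purge queues: the frontend queue is only fed by
// whatever the (slow) backend stage manages to drain. All numbers invented.
package main

import (
	"fmt"
	"time"
)

func main() {
	backendQueue := make(chan string, 1000)
	frontendQueue := make(chan string, 1000)

	// Backend stage: deliberately slow, simulating a loaded ats-be.
	go func() {
		for u := range backendQueue {
			time.Sleep(10 * time.Millisecond) // slow backend purge
			frontendQueue <- u                // only now does the frontend see it
		}
		close(frontendQueue)
	}()

	// Producer: purges arrive faster than the backend drains them.
	go func() {
		for i := 0; i < 100; i++ {
			backendQueue <- fmt.Sprintf("/wiki/Page_%d", i)
		}
		close(backendQueue)
	}()

	start := time.Now()
	n := 0
	for range frontendQueue {
		n++
	}
	fmt.Printf("frontend processed %d purges in %s\n", n, time.Since(start))
}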

It sounds like High but is it UBN?

I've done a very superficial triage, just pinging some teams to know who could help. Either @Urbanecm, releng, or traffic may know more about impact. Sorry I cannot be more useful.

The impact of this bug is that no one can update logos or other static resources (favicons and so on), because that requires being able to purge a URL.

I've been trying to purge https://wikitech.wikimedia.org/wiki/Incident_documentation dozens of times: by making null edits, by making action=purge submissions, and via mwscript purgeList.php. But nothing seems to evict it from the cache.

The logged-out view of that URL seems to be stuck at an old version, and using Minerva skin (T190384).

Logged out:
age: 52089
cache-control: s-maxage=86400, must-revalidate, max-age=0
content-length: 5993
content-type: text/html; charset=UTF-8
date: Fri, 03 Apr 2020 08:56:24 GMT
last-modified: Thu, 02 Apr 2020 23:40:01 GMT
server: labweb1001.wikimedia.org
server-timing: cache;desc="hit-front"
strict-transport-security: max-age=106384710; includeSubDomains; preload
vary: Accept-Encoding,Cookie,Authorization
x-cache: cp3060 hit, cp3062 hit/45
x-cache-status: hit-front

Logged in:
last-modified: Fri, 03 Apr 2020 23:27:21 GMT
x-cache: cp3052 miss, cp3050 pass
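
For anyone reproducing this kind of check, here is a small illustrative Go helper that fetches a URL and prints the cache-related response headers seen above. Header names match the dump; the program itself is not part of any Wikimedia tooling.

// Quick check of CDN cache headers for a URL, similar to the manual
// inspection above. Illustrative only.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	url := "https://wikitech.wikimedia.org/wiki/Incident_documentation"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	for _, h := range []string{"Age", "Last-Modified", "X-Cache", "X-Cache-Status", "Server-Timing"} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}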
Krinkle renamed this task from purgeList.php/HTCP purge doesn't seem to invalidate cache correctly to CDN cache not being purged (RESTBase/PCS, Wikitech, and purgeList.php). Apr 3 2020, 11:32 PM
Krinkle added subscribers: bearND, Pchelolo.
CDanis renamed this task from CDN cache not being purged (RESTBase/PCS, Wikitech, and purgeList.php) to esams cache_text cluster consistently backlogged on purge requests. Apr 4 2020, 3:52 PM

@CDanis @BBlack T169894 is likely the same (or similar) issue. https://en.wikipedia.org/wiki/User:GeoffreyT2000 says the user lives in California though...

Thanks. The reports from 2017 cannot be due to the same cause; however, the reports from other users starting March 15 this year are likely related.

Change 586111 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111

Change 586111 merged by CDanis:
[operations/puppet@production] update vhtcpd exporter to match reality

https://gerrit.wikimedia.org/r/586111

It isn't just esams that often has a backlog: looking at the past 10-20 minutes of monitoring data now available, codfw has some machines with 20 million entries of backlog, eqiad has some with 30 million entries, eqsin some with 40 million, and ulsfo some with 25 million. However, the other sites at least show a noticeable downward slope.

https://grafana.wikimedia.org/d/wBCQKHjWz/vhtcpd?orgId=1&var-datasource=esams%20prometheus%2Fops&fullscreen&panelId=2

Yes, I also think it's more than just esams, since the merged-in task I created (T249290) was about an experience on the east coast of the US. I very much doubt that my colleague there would have been served content by esams in Amsterdam. (OK, this is more anecdotal, and that particular purge delay has been resolved for this colleague by now.)

ema renamed this task from esams cache_text cluster consistently backlogged on purge requests to cache_text cluster consistently backlogged on purge requests. Apr 6 2020, 8:35 AM
ema moved this task from Backlog to Caching on the Traffic board.

Change 586390 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] disable rebound purges

https://gerrit.wikimedia.org/r/586390

We have discussed this during yesterday's Traffic meeting and the current plan to attack the issue is: (1) @BBlack tries to find the time to patch vhtcpd to support multiple TCP connections while (2) I work on T249583. Meanwhile, thanks @CDanis for working on this and for https://gerrit.wikimedia.org/r/586390.

I think this should have its priority increased to UBN. I'm receiving many reports from the community that purging doesn't work - even when the purge is MW-based. Since esams's case is worse, is it possible to...improve that :).

Change 586390 abandoned by CDanis:
reverse-proxy: Disable rebound purges

https://gerrit.wikimedia.org/r/586390

I think this should have its priority increased to UBN. I'm receiving many reports from the community that purging doesn't work - even when the purge is MW-based. Since esams's case is worse, is it possible to...improve that :).

The ones most likely to complain (editors with registered accounts) cannot see that their edits are not getting through, but there's a myriad of problems with logged-in users fixing things and logged-out readers not being able to see the fixed version (BLP issues and the like). Note also that even revision-deleted revisions remain visible (which is very much not ideal); I think this came up on enwiki VPT/ANI (I don't recall exactly).

The work on purged (T249583) is proceeding, and I think at this point I've fixed the dumbest bugs. The system is currently deployed on cp2029 (codfw) and cp3050 (esams). Long story kept short (for now): on both systems varnish-fe and ats-be are receiving and processing the same number of PURGEs, and there's no backlog in processing multicast HTCP packets.
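
For reference, both vhtcpd and purged consume purges as HTCP datagrams sent to a multicast group before forwarding PURGEs to the caches. Below is a minimal Go sketch of joining such a group and reading packets; the group address and port are assumptions for illustration, and the HTCP payload parsing is omitted.

// Minimal sketch of consuming multicast HTCP purge datagrams, which is what
// vhtcpd/purged do before forwarding PURGEs to the caches. The group address
// and port are placeholders; HTCP decoding is not shown.
package main

import (
	"fmt"
	"net"
)

func main() {
	group := &net.UDPAddr{IP: net.ParseIP("239.128.0.112"), Port: 4827}
	conn, err := net.ListenMulticastUDP("udp4", nil, group)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	conn.SetReadBuffer(4 << 20) // large kernel buffer to absorb bursts

	buf := make([]byte, 1500)
	for {
		n, src, err := conn.ReadFromUDP(buf)
		if err != nil {
			fmt.Println("read error:", err)
			continue
		}
		// A real daemon would parse the HTCP packet here and enqueue the
		// URL it carries for purging.
		fmt.Printf("received %d-byte HTCP packet from %s\n", n, src)
	}
}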

If things are stable overnight I'm gonna switch more cache nodes to purged tomorrow, and we should hopefully begin to see improvements.

Also, @Joe and I have been looking at some of the PURGE traffic today and observed that we can probably reduce it quite a bit: T250261

https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Edits_invisible_or_variably_visible_after_logging_out was the only VPT thread I know about. I just handled an OTRS ticket (Ticket#2020041510004397) about it too. Thank you to the developers and operations folk working on this.

Thank you @AntiCompositeNumber. During the EU night purged behaved well on cp3050:

Screenshot from 2020-04-16 08-50-04.png (1×2 px, 251 KB)

Also interesting is the PURGE traffic difference between cp3050 and cp3052:

Screenshot from 2020-04-16 08-50-55.png (1×2 px, 320 KB)

I am going to extend the deployment to the rest of text@esams now.

Change 589270 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: move text@esams to purged

https://gerrit.wikimedia.org/r/589270

Change 589270 merged by Ema:
[operations/puppet@production] cache: move text@esams to purged

https://gerrit.wikimedia.org/r/589270

Mentioned in SAL (#wikimedia-operations) [2020-04-16T09:17:04Z] <ema> text@esams: stop vhtcpd, start purged T249325

@Urbanecm, @AntiCompositeNumber: esams has now been running with the new system for the past few hours; the situation should have improved significantly for folks mapped there. Tomorrow I'll upgrade the other data centres too. Thanks!

Change 589540 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: use purged on cache_text

https://gerrit.wikimedia.org/r/589540

Change 589540 merged by Ema:
[operations/puppet@production] cache: use purged on cache_text

https://gerrit.wikimedia.org/r/589540

Mentioned in SAL (#wikimedia-operations) [2020-04-17T07:54:00Z] <ema> cache_text: puppet run to stop vhtcpd and start purged T249325

Thanks @ema!

ema claimed this task.

The cache_text cluster has now been running purged instead of vhtcpd for the past 10+ days. PURGE backlog is gone, closing.