Parsercache purging can create lag
Closed, Resolved, Public

Assigned To

Authored By

	• jcrespo
	Nov 6 2016, 2:07 PM

Description

Parsercaches are a big source of replication lag. When deleting entries, they run queries such as:

DELETE /* SqlBagOStuff::deleteObjectsExpiringBefore */ FROM pc141 WHERE (exptime >= '2016-10-31 12:23:30') AND (exptime < '20161106010001') AND keyname IN ('commonswiki:pcache:XXXXXX','commonswiki:pcache:XXXXXX,' + 1000 more items )

Screenshot from 2016-11-06 14:59:39.png (1×1 px, 84 KB)

This creates lag on the secondary datacenter. The problem is relatively new, because we didn't use to replicate the parsercache, but I think it threatens cross-dc and dc-failover reliability. The impact is low, because it is only a cache, but we should try to fix it for pure performance reasons.

This could be solved with ROW-based replication, but doing STATEMENT REPLACEs allows replication to keep running even after the hosts become inconsistent. Maybe we can handle it at the application layer? Or maybe we can purge more slowly, as purging is not critical? Or we could just purge at the infrastructure layer, per server, without involving replication or the application.
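
To illustrate the "purge more slowly" option, here is a minimal sketch of deleting expired rows in small batches with a pause between batches. It is not the purgeParserCache.php or SqlBagOStuff implementation: it uses an in-memory SQLite table instead of MySQL, and the table name `pc_demo`, the function `purge_expired`, and the `batch_size` parameter are all hypothetical names chosen for the example. Only the idea of a per-batch sleep in milliseconds is taken from this task.

```python
import sqlite3
import time


def purge_expired(conn, cutoff, batch_size=100, msleep=500, table="pc_demo"):
    """Delete rows with exptime < cutoff in small batches.

    Sleeping between batches keeps each replicated DELETE short and
    gives replicas time to catch up. `table` is a hypothetical name
    and is interpolated unescaped, so only pass trusted values.
    """
    total = 0
    while True:
        # Pick the next batch of expired keys.
        keys = [
            row[0]
            for row in conn.execute(
                f"SELECT keyname FROM {table} WHERE exptime < ? LIMIT ?",
                (cutoff, batch_size),
            )
        ]
        if not keys:
            break
        placeholders = ",".join("?" * len(keys))
        cur = conn.execute(
            f"DELETE FROM {table} WHERE exptime < ? AND keyname IN ({placeholders})",
            [cutoff, *keys],
        )
        conn.commit()
        total += cur.rowcount
        # Pause before the next batch (msleep is in milliseconds).
        time.sleep(msleep / 1000.0)
    return total
```

A larger `msleep` spreads the same number of deletes over a longer period, trading total purge time for lower peak write load on the replicas.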

Details

Subject	Repo	Branch	Lines +/-
Bump parser cache purging batch wait time	operations/puppet	production	+1 -1
Stagger parser cache purges to avoid lag	operations/puppet	production	+1 -1
Add --msleep option to purgeParserCache.php	mediawiki/core	master	+18 -8


Related Objects

Status	Assigned	Task
Declined	None	T3268 Database replication lag issues (tracking)
Duplicate	None	T108551 Database locked error while publishing article using CX
Resolved	aaron	T95501 Fix causes of replica lag and get it to under 5 seconds at peak
Resolved	aaron	T150124 Parsercache purging can create lag

Event Timeline

I think it copied the assignment; that was not intended.

The anomaly can be seen at:
https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=codfw%20prometheus%2Fops&var-group=parsercache&var-shard=All&var-role=All&from=1478387170388&to=1478450421558

• Gilles moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board. Nov 10 2016, 8:35 PM

• Gilles moved this task from Backlog: Maintenance, non-prioritized to To-do: Goals prioritized current Quarter on the Performance-Team board.

Change 320863 had a related patch set uploaded (by Aaron Schulz):
Add --msleep option to purgeParserCache.php

https://gerrit.wikimedia.org/r/320863

gerritbot added a project: Patch-For-Review. Nov 10 2016, 9:58 PM

Change 320928 had a related patch set uploaded (by Aaron Schulz):
Stagger parser cache purges to avoid lag

https://gerrit.wikimedia.org/r/320928

Change 320863 merged by jenkins-bot:
Add --msleep option to purgeParserCache.php

https://gerrit.wikimedia.org/r/320863

ReleaseTaggerBot added projects: MW-1.29-release-notes, MW-1.29-release (WMF-deploy-2016-11-29_(1.29.0-wmf.4)). Nov 16 2016, 7:00 PM

Change 320928 merged by Jcrespo:
Stagger parser cache purges to avoid lag

https://gerrit.wikimedia.org/r/320928

I would close this as resolved, and I will monitor the lag in the following months.

Krinkle closed this task as Resolved. Nov 21 2016, 9:57 PM

Krinkle assigned this task to aaron.

Krinkle moved this task from To-do: Goals prioritized current Quarter to Doing (old) on the Performance-Team board.

Krinkle moved this task from Tag to Doing on the Sustainability board.

Krinkle removed a project: Patch-For-Review.

These are the parser caches right now:

Screenshot from 2016-11-27 11:46:10.png (1×1 px, 90 KB)

$ sudo crontab -l -u www-data | grep -A1 parser_cache_purging
# Puppet Name: parser_cache_purging
0 1 * * 0 /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100 >/dev/null 2>&1

$ ps aux | grep purgeP
www-data 18418  0.0  0.0   4440   652 ?        Ss   01:00   0:00 /bin/sh -c /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100 >/dev/null 2>&1
www-data 18442  0.0  0.0  12404  1436 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100
www-data 18517  5.1  0.1 326236 48876 ?        S    01:00  30:14 php5 /srv/mediawiki-staging/multiversion/MWScript.php purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100

Change 323764 had a related patch set uploaded (by Aaron Schulz):
Bump parser cache purging batch wait time

https://gerrit.wikimedia.org/r/323764

gerritbot added a project: Patch-For-Review. Nov 27 2016, 9:32 PM

Change 323764 merged by Jcrespo:
Bump parser cache purging batch wait time

https://gerrit.wikimedia.org/r/323764

terbium:~$ sudo crontab -l -u www-data | grep -A1 parser_cache_purging
# Puppet Name: parser_cache_purging
0 1 * * 0 /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500 >/dev/null 2>&1

I have added the codfw parsercaches here for easier monitoring:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=9&fullscreen&var-dc=codfw%20prometheus%2Fops&from=now-30d&to=now

Should we close this again, or only stall it and close after a couple of weeks?

Let's tentatively close.

Krinkle mentioned this in T282761: purgeParserCache.php should not take over 24 hours for its daily run. May 13 2021, 3:18 AM

	F4864309: Screenshot from 2016-11-27 11:46:10.png
	Nov 27 2016, 10:46 AM

	F4864307: Screenshot from 2016-11-27 11:44:34.png
	Nov 27 2016, 10:46 AM

	F4700872: Screenshot from 2016-11-06 14:59:39.png
	Nov 6 2016, 2:07 PM
