
Parsercache purging can create lag
Closed, Resolved, Public

Description

Parsercaches are a major source of replication lag: when purging expired entries, they run queries such as:

DELETE /* SqlBagOStuff::deleteObjectsExpiringBefore */ FROM pc141 WHERE (exptime >= '2016-10-31 12:23:30') AND (exptime < '20161106010001') AND keyname IN ('commonswiki:pcache:XXXXXX','commonswiki:pcache:XXXXXX,' + 1000 more items )

This creates lag on the secondary datacenter. The problem is relatively new, because we did not previously replicate the parsercache, but I think it threatens cross-DC and DC-failover reliability. The impact is low, because this is only a cache, but we should try to fix it for performance reasons alone.

This could be solved with ROW-based replication, but using STATEMENT-based REPLACEs lets replication keep running even after the hosts have become inconsistent. Maybe we can handle it at the application layer? Or maybe we can purge more slowly, since purging is not critical? Or we could purge at the infrastructure layer, per server, without involving replication or the application.
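To illustrate the "purge more slowly" idea, here is a minimal sketch of a batched delete that pauses between batches. It is not the code of purgeParserCache.php or SqlBagOStuff: the function name `purge_expired` and the `batch_size` and `sleep_ms` parameters are hypothetical, and SQLite stands in for MySQL so the example is self-contained. Only the `keyname` and `exptime` columns and the `pc141` table name come from the query quoted above.

```python
import sqlite3
import time


def purge_expired(conn, table, cutoff, batch_size=100, sleep_ms=100):
    """Delete rows whose exptime is before `cutoff`, in small batches.

    Hypothetical helper: small batches keep each DELETE statement short,
    and the pause between batches gives replicas time to catch up.
    """
    total = 0
    while True:
        # Select a bounded set of expired keys for this batch.
        rows = conn.execute(
            f"SELECT keyname FROM {table} WHERE exptime < ? LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            break
        keys = [row[0] for row in rows]
        placeholders = ",".join("?" * len(keys))
        conn.execute(
            f"DELETE FROM {table} WHERE keyname IN ({placeholders})", keys
        )
        conn.commit()
        total += len(keys)
        # Throttle: wait before issuing the next batch.
        time.sleep(sleep_ms / 1000.0)
    return total


# Usage with an in-memory stand-in for a parsercache table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pc141 (keyname TEXT PRIMARY KEY, exptime TEXT)")
conn.executemany(
    "INSERT INTO pc141 VALUES (?, ?)",
    [
        (f"commonswiki:pcache:{i}", "20161101000000" if i % 2 else "20161201000000")
        for i in range(10)
    ],
)
deleted = purge_expired(conn, "pc141", "20161106010001", batch_size=2, sleep_ms=1)
```

The trade-off is a longer total purge time in exchange for smaller, spaced-out write statements on the replication stream.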

Event Timeline

jcrespo removed aaron as the assignee of this task. Nov 6 2016, 2:07 PM
jcrespo created this task.

I think it copied the assignment; that was not intended.

Change 320863 had a related patch set uploaded (by Aaron Schulz):
Add --msleep option to purgeParserCache.php

https://gerrit.wikimedia.org/r/320863

Change 320928 had a related patch set uploaded (by Aaron Schulz):
Stagger parser cache purges to avoid lag

https://gerrit.wikimedia.org/r/320928

Change 320863 merged by jenkins-bot:
Add --msleep option to purgeParserCache.php

https://gerrit.wikimedia.org/r/320863

Change 320928 merged by Jcrespo:
Stagger parser cache purges to avoid lag

https://gerrit.wikimedia.org/r/320928

I would close this as resolved, and I will monitor the lag in the following months.

Krinkle closed this task as Resolved. Nov 21 2016, 9:57 PM
Krinkle assigned this task to aaron.
Krinkle moved this task from This Quarter (FY1920Q1 Jul-Sep) to Doing on the Performance-Team board.
Krinkle moved this task from Backlog to Doing on the Availability board.
Krinkle removed a project: Patch-For-Review.
jcrespo reopened this task as Open. Nov 27 2016, 10:46 AM

These are the parser caches right now:


jcrespo added a comment. Nov 27 2016, 10:50 AM
$ sudo crontab -l -u www-data | grep -A1 parser_cache_purging
# Puppet Name: parser_cache_purging
0 1 * * 0 /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100 >/dev/null 2>&1

$ ps aux | grep purgeP
www-data 18418  0.0  0.0   4440   652 ?        Ss   01:00   0:00 /bin/sh -c /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100 >/dev/null 2>&1
www-data 18442  0.0  0.0  12404  1436 ?        S    01:00   0:00 /bin/bash /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100
www-data 18517  5.1  0.1 326236 48876 ?        S    01:00  30:14 php5 /srv/mediawiki-staging/multiversion/MWScript.php purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 100

Change 323764 had a related patch set uploaded (by Aaron Schulz):
Bump parser cache purging batch wait time

https://gerrit.wikimedia.org/r/323764

Change 323764 merged by Jcrespo:
Bump parser cache purging batch wait time

https://gerrit.wikimedia.org/r/323764

terbium:~$ sudo crontab -l -u www-data | grep -A1 parser_cache_purging
# Puppet Name: parser_cache_purging
0 1 * * 0 /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=2592000 --msleep 500 >/dev/null 2>&1
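
As a rough back-of-the-envelope check on the crontab values, the sketch below converts `--age` to days and estimates how much wall-clock time the pauses add. It assumes `--msleep` is applied once per batch, as the patch title suggests; the batch count of 10000 is purely hypothetical and not taken from this task.

```python
SECONDS_PER_DAY = 86400


def age_in_days(age_seconds):
    """Convert the --age value (seconds) to days."""
    return age_seconds / SECONDS_PER_DAY


def added_sleep_seconds(num_batches, msleep):
    """Total pause time, assuming one sleep of `msleep` ms per batch."""
    return num_batches * msleep / 1000.0


age_days = age_in_days(2592000)  # 30.0 days

# Hypothetical batch count, for illustration only.
hypothetical_batches = 10000
pause_at_100ms = added_sleep_seconds(hypothetical_batches, 100)  # 1000.0 s
pause_at_500ms = added_sleep_seconds(hypothetical_batches, 500)  # 5000.0 s
```

Going from 100 ms to 500 ms multiplies the total pause time by five, whatever the real batch count is.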

I have added the codfw parsercaches here for easier monitoring:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=9&fullscreen&var-dc=codfw%20prometheus%2Fops&from=now-30d&to=now

Should we close this again, or only stall it and close after a couple of weeks?

aaron closed this task as Resolved. Dec 16 2016, 10:52 PM

Let's tentatively close.