Make PageAssessments roll-out less intense on the databases
Closed, Resolved, Public

Description

As you can see in the screenshot below, when the assessment parser function was embedded in the master assessment template on English Wikivoyage, it caused a large spike in database inserts and some short-lived spikes in database lag (up to 5 seconds):

PageAssessments deployment.png (2×1 px, 623 KB)

This was deemed unacceptable by the DBAs, so we need to find a way to smooth out these spikes before we can roll out PageAssessments to English Wikipedia.

There are at least 3 possible solutions here:

  1. Slow down the automatic reparsing of pages when their caches are invalidated by template edits
  2. Slow down the processing of the refreshLinks job (by changing how it is handled by the job runner or changing how all low priority jobs are handled)
  3. Move PageAssessments database interactions back to their own job and make sure the job runner runs that job slowly (perhaps by creating a new 3rd class job designation).

This isn't really a new problem, but is the same sort of database interaction that arises from adding a category or link into a heavily used template, so I'm guessing that solutions #1 or #2 are going to be preferable. I'm looking for advice from the Performance-Team on the best way forward, and depending on the choice, help implementing that solution.
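Solutions #2 and #3 both come down to capping how fast jobs are started. The following is a minimal sketch of that idea only; it is not MediaWiki's job runner, and `run_throttled`, `max_per_second`, and the injectable `sleep`/`clock` parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Iterable


def run_throttled(jobs: Iterable[Callable[[], None]],
                  max_per_second: float,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> int:
    """Run jobs one at a time, starting at most max_per_second jobs per second.

    Hypothetical helper: `sleep` and `clock` are injectable so the pacing
    can be checked without really waiting.
    """
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    min_interval = 1.0 / max_per_second
    next_start = clock()  # earliest time the next job may start
    done = 0
    for job in jobs:
        now = clock()
        if now < next_start:
            # Too early: wait until the next allowed start slot.
            sleep(next_start - now)
        # Reserve the following slot before running the job.
        next_start = max(clock(), next_start) + min_interval
        job()
        done += 1
    return done
```

With a cap like this, a burst of reparse jobs is stretched over a longer window instead of hitting the database all at once; the trade-off is that the new table takes longer to fill.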

Event Timeline

kaldari triaged this task as Medium priority. Sep 12 2016, 10:57 PM
kaldari raised the priority of this task from Medium to High.
kaldari updated the task description.
kaldari added a project: Performance-Team.
kaldari added a subscriber: aaron.

Marking as high priority since this is blocking deployment, which in turn is blocking other work (like using PageAssessments data for CopyPatrol).

Note that some of the lag is caused by other things in LinksUpdate. I've made patches to GeoData and the file/category invalidation code used by LinksUpdate in core, both of which lacked batching and were sometimes doing 1-2k row updates at once.
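To illustrate what batching means here, the sketch below splits one large update into fixed-size chunks and pauses after each chunk so replicas can catch up. It is an assumption-laden illustration, not the actual patches: `update_in_batches`, `apply_batch`, and `wait_for_replicas` are hypothetical names, and the two callbacks stand in for whatever the real code does to write rows and wait for replication.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def update_in_batches(rows: Sequence[T],
                      batch_size: int,
                      apply_batch: Callable[[List[T]], None],
                      wait_for_replicas: Callable[[], None]) -> int:
    """Apply rows in small batches instead of one large write.

    apply_batch and wait_for_replicas are hypothetical callbacks supplied
    by the caller. Returns the number of batches written.
    """
    batches = 0
    for batch in chunked(rows, batch_size):
        apply_batch(batch)      # one small write instead of 1-2k rows at once
        wait_for_replicas()     # give replication a chance to catch up
        batches += 1
    return batches
```

For example, 2,000 rows with a batch size of 500 become four writes with a replication pause after each one.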

@aaron: Do you think that will be sufficient to smooth out the spikes or should we investigate any of the solutions mentioned in the description? How soon do you expect your changes to be live? We can do another test roll-out on English Wikivoyage to see what the impact is.

I have only one question, @kaldari. If pages are only added to the table on reparse, why do we have so many inserts at 21:57+?

The edit rate of enwikivoyage is very low: there have been only 75K edits in the last month, yet 50K pages have been inserted into page assessments. A random template edit would cause issues, but it also happens to perfectly fill in the new table? And the spike happens shortly after it is enabled. Too much of a coincidence.

Maybe the solution here is not "technical", but social: enable it silently, and announce it only after it has been enabled for some time, to prevent massive purges and give the table time to fill in naturally/slowly.

I have only one question, @kaldari. If pages are only added to the table on reparse, why do we have so many inserts at 21:57+?

It was added to a popular template (https://en.wikivoyage.org/w/index.php?title=Template%3AStbox&type=revision&diff=3046884&oldid=3035148), which then triggered jobs to reparse all pages that use that template.

@jcrespo: The parser function is designed specifically for a single template. Most wikis that have any sort of page assessment system have a single "master" assessment template. On English Wikivoyage, it's the Template:Stbox. On English Wikipedia it's the Template:WPBannerMeta. Deploying PageAssessments to these wikis, in and of itself, has no effect on the database. It's the act of inserting the new parser function into the master assessment template that causes the flood of database inserts (which is a 1 time event).

It would be possible for us to spread out this impact by adding the parser function to all of the individual WikiProject templates (for example, Template:WikiProject Medicine) separately over a period of several days rather than adding it to the master assessment template. There are over 1,000 WikiProject templates on English Wikipedia though (and new ones are created frequently), so this would be tedious and inefficient, but doable.
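The staggered approach described above could be planned with something like the following sketch. It only builds a day-by-day schedule; the function name `stagger`, the `per_day` parameter, and the start date are assumptions made for illustration, and nothing here edits any templates.

```python
from datetime import date, timedelta
from typing import Dict, List, Sequence


def stagger(templates: Sequence[str], per_day: int, start: date) -> Dict[date, List[str]]:
    """Assign templates to consecutive days, at most per_day templates each day."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    schedule: Dict[date, List[str]] = {}
    for index, name in enumerate(templates):
        # Integer division groups templates into blocks of per_day.
        day = start + timedelta(days=index // per_day)
        schedule.setdefault(day, []).append(name)
    return schedule
```

At 200 templates per day, 1,000 WikiProject templates would be spread over five days. Templates created after the plan is made would still need to be handled separately, which is part of why this route is tedious.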

If you want more context on how PageAssessments is supposed to work, you may want to read through https://meta.wikimedia.org/wiki/Community_Tech/PageAssessments.

It's the act of inserting the new parser function into the master assessment template that causes the flood of database inserts (which is a 1 time event).

And that is exactly what I am blocking, until the problem is solved. Bringing down a wiki because a new feature is rolled out is never a good idea. Blocking doesn't mean "we will not do it", it means "we will wait until we can do it properly". I bet that there is a much larger number of global usages for the template on enwiki than enwikivoyage, which means way worse problems!

Aaron suspects that the replication lag spikes are actually not caused by the PageAssessments code, but by other processes that also piggyback on the refreshLinks jobs.

Not being your code's fault doesn't mean we now allow it to be done without code or procedure changes. We fix the bug wherever it is or provide a workaround (you are already providing some suggestions, so thank you!), then we continue the deployment (that is all!). It doesn't have to be perfect; I will be happy with whatever you propose that avoids side effects. I do not think anybody here opposes this being deployed in the end, I just want to do it properly. :-)

which is a 1 time event

If we know it is going to happen and we can avoid it, we do it.

There are over 1,000 WikiProject templates on English Wikipedia though (and new ones are created frequently), so this would be tedious and inefficient, but doable.

This is probably not the way to go, but it certainly can be used as a testbed to check invalidatetitle behavior.

Now that there are some performance improvements in place for LinksUpdate, I'd like to try re-running the PageAssessments roll-out on English Wikivoyage and compare the graphs. Unfortunately, it seems that @jcrespo and I have no overlapping working hours, so I guess I'll just do the same thing that I did last time: monitor the DBs myself and take some screenshots of the graphs. If no one objects to this plan, I'll do this tomorrow at ~10am PST.

This is blocked by the stalled deployment train (T144644). The currently deployed 1.28.0-wmf.18 code doesn't have all of aaron's performance fixes.

@jcrespo, @aaron: I did an identical PageAssessments roll-out to English Wikivoyage this evening. As you can see from the charts below, the peak database insertion rate was slightly lower across the board (and basically in line with similar peaks seen throughout the day). More importantly, there were no replication lag spikes on any of the databases during the roll-out (although you can see several spikes from other events earlier in the day).

The roll-out event is the hump centered on 6am in the charts below. It lasted from 5:41am to 6:17am (approximately 36 minutes).

PageAssessments deployment 2.png (2×1 px, 669 KB)

I also monitored the s3 shard on grafana and saw no lag or other issues there.

@kaldari Extra load is not a problem, as long as it doesn't create lag or latency issues/errors. I do not see either.

I would suggest rolling it out silently to enwiki and testing it on a subset of templates first (not the parent template). Ok with that?

kaldari claimed this task.