
Test PageAssessments on English Wikivoyage
Closed, Resolved · Public · 3 Estimated Story Points

Description

This has already been tested on Test Wikipedia; let's test it on English Wikivoyage before we go to English Wikipedia. English Wikivoyage has a heavily used master assessment template, so it is an ideal test case (for both errors and performance). The template is https://en.wikivoyage.org/wiki/Template:Stbox. This is an invisible, back-end-only feature, so we should post a notice on the Travellers' pub, but I don't think we need to run a vote or anything.

The first step is to add the two PageAssessments tables to the en.wikivoyage database. Then we need to configure the jobrunners properly. Next, recruit someone from the Performance team to help us monitor the deployment. Deploy the config change to turn on the extension on English Wikivoyage. Add the parser function to the master assessment template. Wait until the jobs have finished running. Test the results by looking at the tables and querying the PageAssessments API.
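
For the last step, a quick spot-check through the Action API could look roughly like the sketch below. This is only an illustration: it assumes the extension's pageassessments query prop, and the page title is just an example.

  <?php
  // Spot-check PageAssessments data via the Action API once the jobs have run.
  // Assumes the extension exposes the "pageassessments" query prop; the page
  // title below is only an example.
  $url = 'https://en.wikivoyage.org/w/api.php?' . http_build_query( [
      'action' => 'query',
      'prop'   => 'pageassessments',
      'titles' => 'London',
      'format' => 'json',
  ] );
  $data = json_decode( file_get_contents( $url ), true );
  foreach ( $data['query']['pages'] as $page ) {
      foreach ( $page['pageassessments'] ?? [] as $project => $a ) {
          echo "{$page['title']}: {$project} class={$a['class']} importance={$a['importance']}\n";
      }
  }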

Event Timeline

DannyH set the point value for this task to 3.

New tables have been created for enwikivoyage. Discussion on wiki has received positive response. Let's deploy PageAssessments there next week.

@jcrespo: My current plan is to roll this out on English Wikivoyage next Tuesday morning at 10am. How does that sound to you?

Good. Ping me if I am around, just in case (I assume 10 am Pacific?).

Steps for deployment:

  1. In wmf-config/InitialiseSettings.php, add a new entry to the $wgConf->settings['wmgUsePageAssessments'] array: 'enwikivoyage' => true, // T142056 (see the sketch after this list)
  2. Commit your change and push it to Gerrit (referencing bug T142056)
  3. Go to https://wikitech.wikimedia.org/wiki/Deployments and add the commit as a deployment request during the Thursday morning SWAT window.
  4. Be in the wikimedia-operations IRC channel at the beginning of the SWAT window (18:00 UTC).
  5. After the SWAT deployer has merged the change, but before they sync it across the production servers, start running the fatalmonitor script on fluorine to keep an eye on fatal errors. You can also watch the chart at https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json. If you notice serious problems caused by the change, ask the SWAT deployer to roll back the change.
  6. There isn't really a good way to test this via the debug server (mw1099), but I'm sure they'll ask you to anyway. After it is synced to mw1099, load English Wikivoyage, turn on X-Wikimedia-Debug, and try doing a null edit to a page (open the edit interface and click save). If nothing goes wrong, tell them it is safe to sync the config change across the cluster.
  7. Once it is synced, turn off X-Wikimedia-Debug and try doing another null edit.
  8. If that works fine, try adding the parser function to a page, for example, your user talk page or user sandbox: {{#assessment:India|A|Low}} (but don't add it to any templates; we'll try that on Tuesday)
  9. Query the PageAssessments API to see if the data was successfully recorded in the database. (It may take a few minutes, since the write happens in a low-priority job.)
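
For reference, step 1 boils down to a one-line addition inside the existing wmgUsePageAssessments block of wmf-config/InitialiseSettings.php, roughly as sketched here. The neighbouring entries are illustrative; only the enwikivoyage line is the actual change.

  // wmf-config/InitialiseSettings.php (sketch; surrounding entries illustrative)
  'wmgUsePageAssessments' => [
      'default' => false,
      'testwiki' => true,
      'enwikivoyage' => true, // T142056
  ],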

Change 307927 had a related patch set uploaded (by Niharika29):
Test PageAssessments on English Wikivoyage

https://gerrit.wikimedia.org/r/307927

Change 307927 merged by jenkins-bot:
Test PageAssessments on English Wikivoyage

https://gerrit.wikimedia.org/r/307927

Mentioned in SAL [2016-09-01T18:11:28Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 04m 54s)

Mentioned in SAL [2016-09-01T18:19:10Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: REVERT because proxy down SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 03m 15s)

Mentioned in SAL [2016-09-01T18:33:35Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:307927|Test PageAssessments on English Wikivoyage (T142056)]] (duration: 02m 48s)

@jcrespo: Yes, 10am San Francisco time (Tuesday). The extension is already on English Wikivoyage, so this shouldn't take long. We'll just add the parser function to the master assessment template there and wait for all the jobs to populate the database while you monitor things. Mainly we're hoping to find out:

  • How badly does the job queue get swamped? (It should be no worse than a regular edit to a popular template.)
  • Does the flood of database inserts cause any database lag? (It shouldn't.)
  • How long does it take for the process to complete? (No idea on this one.)

This will help inform our later roll-out to English Wikipedia.

Roll-out delayed by T144841. Stay tuned...

Rescheduling for tomorrow (Wednesday) afternoon PST.

kaldari closed this task as Resolved. Sep 7 2016, 10:53 PM

Report for English Wikivoyage roll-out:

I monitored the job queue size via https://en.wikivoyage.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=jsonfm. It fluctuated constantly but never rose above 500 jobs.
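
That monitoring amounts to polling the endpoint above and reading query.statistics.jobs; a throwaway script along these lines would do it (the one-minute interval is an arbitrary choice, not what was actually used).

  <?php
  // Watch the job queue size on enwikivoyage by polling the siteinfo statistics
  // endpoint mentioned above. The polling interval is arbitrary.
  $url = 'https://en.wikivoyage.org/w/api.php?action=query&meta=siteinfo'
      . '&siprop=statistics&format=json';
  while ( true ) {
      $stats = json_decode( file_get_contents( $url ), true );
      echo date( 'H:i:s' ), '  jobs in queue: ', $stats['query']['statistics']['jobs'], "\n";
      sleep( 60 );
  }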

I monitored database lag at https://tendril.wikimedia.org/host/view/db1078.eqiad.wmnet/3306. The highest lag I saw was a 3 second spike soon after adding the parser function.

I monitored database insert volume at https://tendril.wikimedia.org/host/view/db1075.eqiad.wmnet/3306. The highest volume was about 30,000 inserts per 5 minutes (which is about 5x normal). In total about 50,000 rows were inserted into the assessments table and 353 rows were inserted into the projects table.

Job and database loads went back to normal levels within half an hour.

Unless this is investigated, I will block further deployments: 3 seconds of lag and 5x load is not tolerable.

@jcrespo: Here are snapshots of all the s3 databases (excluding second-level replicas). The edit to English Wikivoyage was made at 9:57pm UTC. You can see replication lag spikes on 4 of the 7 replicas: 1038 and 1037 hit 3 seconds, 1077 hit 4 seconds, and 1044 hit 5 seconds. The spikes were all short-lived, however. Based on feedback from @aaron, we decided to piggyback on the existing refreshLinks jobs, as these are already marked as low-priority by the production jobrunner configuration. I would love to get @ori's and @aaron's feedback on the graphs below and see if they have any suggestions for making further improvements. As the current behavior is likely to be pretty much identical to adding a link or category to a heavily used template (which would also cause a flood of refreshLinks jobs and database inserts), I'm wondering if the best approach is actually to tweak the jobrunner code for low-priority jobs.

[Attachment: PageAssessments deployment.png (623 KB)]

I can confirm that refreshLinks is indeed a source of issues, and I recently opened a ticket about the spiky nature of the job queue: T144382. I am not saying this is a problem with your code; there is a chance it is related to how the job queue works, but in any case either the original code should be fixed or we should work around it. I think @aaron will be the #1 person interested in avoiding lag, as it will interfere with cross-DC work.

I will investigate the binary log at 9:57pm and try to understand what happened then, and report back.

@jcrespo: One other thing you could help us with. According to the graphs there was also an increase in delete and update actions during the job flood, which I wasn't expecting. Our code does have the potential to spawn deletes and updates, but in this case (for the initial flood) I was only expecting inserts. While you're looking at the logs, can you see what exactly was being deleted or updated and show me a couple of examples?

At 2016-09-07 21:57:42 I see:

# at 282433499
#160907 21:57:42 server id 171966669  end_log_pos 282433537     GTID 0-171966669-3061496764
INSERT /* Revision::insertOn Kaldari */ (adding assessments per talk page)

So I see lots of

INSERT /* PageAssessmentsBody::insertProject 127.0.0.1 */  INTO `page_assessments_projects` 
INSERT /* PageAssessmentsBody::insertRecord 127.0.0.1 */  INTO `page_assessments`

I would like to review the code for it.

However, I also see very intensive:
UPDATE /* HTMLCacheUpdateJob::invalidateTitles */

I cannot say for sure, but it looks like *all* pages, or a huge number of them, were being reparsed, creating a lot of jobs for invalidation, refresh links, refresh geodata, etc.

Are you forcing an invalidation on all pages? Or maybe you are doing it indirectly (not on purpose?). In either case this will not be a good idea, and neither I nor Traffic will be happy about it for enwiki in its current state. I would highly suggest performing it at a much slower pace (enwiki is 300 times larger than enwikivoyage, and much more heavily read, so the issues will be >300 times worse).

> So I see lots of
>
> INSERT /* PageAssessmentsBody::insertProject 127.0.0.1 */  INTO `page_assessments_projects`
> INSERT /* PageAssessmentsBody::insertRecord 127.0.0.1 */  INTO `page_assessments`

Those are expected. Do you see any updateRecord or deleteRecord entries? Those would not be expected.

> I would like to review the code for it.

If you ignore the API code, all of the code currently in the PageAssessments extension just supports this parser function and handles storing the data in the database. After consulting with the Performance team, we rewrote a lot of it so that it piggybacks on the refreshLinks job instead of defining our own job. You can see the change for that here: https://gerrit.wikimedia.org/r/#/c/306314/
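
For a sense of the shape of that change, here is a very rough sketch of the general pattern rather than the extension's actual code: the parser function only stashes its data on the ParserOutput, and the rows are written when the links update (i.e. the existing refreshLinks job) runs, so no dedicated job type is needed. Hook and method names follow standard MediaWiki conventions; PageAssessmentsBody::doUpdates is a hypothetical placeholder for the logic that produces the insertRecord/insertProject queries seen in the binlog.

  <?php
  // Sketch only, not the extension's real code.

  // Register {{#assessment:project|class|importance}}.
  $wgHooks['ParserFirstCallInit'][] = function ( Parser $parser ) {
      $parser->setFunctionHook( 'assessment',
          function ( Parser $parser, $project = '', $class = '', $importance = '' ) {
              // Accumulate assessments on the ParserOutput instead of writing to the DB.
              $out = $parser->getOutput();
              $assessments = $out->getExtensionData( 'assessments' ) ?: [];
              $assessments[] = [ 'project' => $project, 'class' => $class, 'importance' => $importance ];
              $out->setExtensionData( 'assessments', $assessments );
              return ''; // the parser function renders nothing
          }
      );
  };

  // Persist the stashed data while the links update (refreshLinks job) runs.
  $wgHooks['LinksUpdateComplete'][] = function ( LinksUpdate $linksUpdate ) {
      $assessments = $linksUpdate->getParserOutput()->getExtensionData( 'assessments' );
      if ( $assessments ) {
          // Hypothetical placeholder: the real code inserts/updates rows in
          // page_assessments and page_assessments_projects here.
          PageAssessmentsBody::doUpdates( $linksUpdate->getTitle(), $assessments );
      }
  };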

> However, I also see very intensive:
> UPDATE /* HTMLCacheUpdateJob::invalidateTitles */
>
> I cannot say for sure, but it looks like *all* pages, or a huge number of them, were being reparsed, creating a lot of jobs for invalidation, refresh links, refresh geodata, etc.
>
> Are you forcing an invalidation on all pages? Or maybe you are doing it indirectly (not on purpose?). In either case this will not be a good idea, and neither I nor Traffic will be happy about it for enwiki in its current state. I would highly suggest performing it at a much slower pace (enwiki is 300 times larger than enwikivoyage, and much more heavily read, so the issues will be >300 times worse).

That's expected. I changed a template transcluded by thousands of pages as part of the rollout. This has nothing to do with the PageAssessments extension; it's just how templates work. This sort of mass cache invalidation happens all the time. For example, every time someone edits the Cite web template on English Wikipedia, it invalidates the cache of over 2 million pages.

@jcrespo: I've created a new task for smoothing out the database spikes (T145473) with 3 proposals for possible solutions. I'm hoping to get some advice from the performance team on the best way forward. In the meantime, if you could investigate the apparent increase in update and delete actions during the roll-out (which was not expected), that would be helpful.