
Enable WarmParsoidParserCache on all wikis
Open, Needs Triage, Public

Assigned To
Authored By
daniel
Feb 10 2023, 1:07 PM

Description

What?

Before we can switch VE to using Parsoid in MW (rather than going through RESTbase), we need to ensure that parsoid output for the page is present in the ParserCache.

Context

Currently, RESTbase is notified when a page needs to be re-parsed, and it then calls the page/html endpoint exposed by the Parsoid extension to get the updated HTML. These requests are routed to the Parsoid cluster, and when they are served, the HTML is also written to the parser cache.

In the future, we want to turn off caching in RESTbase. When we do this, we need another mechanism in place to ensure we have up-to-date parsoid renderings of each page in the parser cache. This can be done by setting WarmParsoidParserCache to true in $wgParsoidCacheConfig, which will cause a ParsoidCachePrewarmJob to be scheduled whenever a page needs to be re-rendered.

These jobs are currently only enabled on testwiki and mediawikiwiki, because we did not want to overload the JobRunner cluster. The purpose of this task is to track all the work needed to enable ParsoidCachePrewarmJobs on all wikis.
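For illustration, the switch itself is a single configuration key; a minimal sketch, assuming $wgParsoidCacheConfig is the usual associative settings array:

```php
// Minimal sketch: enabling this flag causes a ParsoidCachePrewarmJob to be
// scheduled whenever a page needs to be re-rendered (per the description above).
$wgParsoidCacheConfig['WarmParsoidParserCache'] = true;
```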

The request flow is as follows: mw -> eventgate-main -> kafka -> changeprop-jobqueue -> jobrunners
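On the MediaWiki end, that flow starts with an ordinary job push; the EventBus-backed job queue posts the job to eventgate-main, and everything after that is Kafka and changeprop-jobqueue. A rough sketch, with hypothetical parameter names (the real job spec is built inside MediaWiki core, not in configuration):

```php
use MediaWiki\MediaWikiServices;

// Illustration only: enqueue a parsoidCachePrewarm job for a page that needs
// re-rendering. 'pageId' and 'revId' are placeholder parameter names.
$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroup();
$jobQueueGroup->lazyPush( new JobSpecification(
	'parsoidCachePrewarm',
	[ 'pageId' => $pageId, 'revId' => $revId ]
) );
```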

How?

Per @Joe's suggestion:

  • Increase the memory limit for the jobrunner cluster in mediawiki-config
  • Move a few servers from the parsoid cluster to the jobrunner cluster (in puppet) [Not needed pre-emptively]
  • Enable the jobs for wikis in batches, with SRE assistance (see the sketch after this list). Possibly move more parsoid nodes to jobrunners if needed
  • For each batch, configure restbase to disable the caching for that wiki - basically, only send out the purges for its urls.
  • Rinse, repeat until nothing is left
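To make the batching concrete, here is a hedged sketch of how per-batch enablement could look in wmf-config, assuming a hard-coded list of database names (the actual patches grouped wikis as group 0, small, and medium wikis; the exact mechanism may differ):

```php
// Sketch only: enable Parsoid parser cache warming for one batch of wikis.
// $wgDBname identifies the current wiki; the batch list below is illustrative.
$parsoidWarmupBatch = [ 'testwiki', 'mediawikiwiki' ];

if ( in_array( $wgDBname, $parsoidWarmupBatch, true ) ) {
	$wgParsoidCacheConfig['WarmParsoidParserCache'] = true;
}
```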

Once all batches are done, the scope of this specific task is complete. The remainder is here just to complete the list.

  • Decide how we'll send the purges to the CDN for the restbase urls. My favourite option, if we don't want to set up a complete system now, would be to generate the purges when the warmup job is completed.

  • Stop sending requests to restbase from change-propagation for parsoid.
  • Move the requests for parsoid from all sources to bypass restbase (via the api gateway restbase compatibility layer)
  • Kill the restbase storage for parsoid

Alternative steps (elegant but riskier)
Basically, instead of points 1-4 of the previous procedure:

  • Make changeprop handle these jobs separately.
  • Either submit them to the parsoid cluster as jobs (this requires some SRE work), or (preferred) just call the URL for the rendering of the page directly (this probably requires some work on changeprop)
  • Once the parsercache is sufficiently warm, disable restbase caching for a batch of wikis, only keeping the purges emitted.
Dashboards to monitor

Event Timeline

I see a few ways to enable this job on all wikis, but fundamentally, the procedure that I think makes sense is as follows:

  1. Increase the memory limit for the jobrunner cluster in mediawiki-config
  2. Move a few servers from the parsoid cluster to the jobrunner cluster (in puppet)
  3. Enable the jobs for wikis in batches, with SRE assistance. Possibly move more parsoid nodes to jobrunners if needed
  4. For each batch, configure restbase to disable the caching for that wiki - basically only send out the purges for its urls.
  5. rinse, repeat until nothing is left
    • At this point, the scope of this specific task is complete. The remainder is here just to complete the list.
  6. Decide how we'll send the purges to the CDN for the restbase urls. My favourite option if we don't want to set up a complete system now would be to generate the purges when the warmup job is completed.
  7. Stop sending requests to restbase from change-propagation for parsoid.
  8. Move the requests for parsoid from all sources to bypass restbase (via the api gateway restbase compatibility layer)
  9. Kill the restbase storage for parsoid

There is an alternative route, which I think is more elegant but riskier. Basically, instead of points 1-4 of the previous procedure:

  1. Make changeprop handle these jobs separately. Either submit them to the parsoid cluster as jobs (this requires some SRE work), or (preferred) just call the URL for the rendering of the page directly (this probably requires some work on changeprop)
  2. Once the parsercache is sufficiently warm, disable restbase caching for a batch of wikis, only keeping the purges emitted.

Do we have an idea of how much load would be shifted from parsoid to jobrunner so I can try and evaluate how many hosts should be moved over?

Do we have an idea of how much load would be shifted from parsoid to jobrunner so I can try and evaluate how many hosts should be moved over?

Quite a lot: about 10k parses per minute.

The problem is: if we enable the jobs and also leave pre-generation in restbase enabled, then we don't know which of the two clusters is going to do the parse - whichever is hit first will parse, the other one will see a parser cache hit.

I don't think we have a way to disable pre-generation on parsoid per wiki right now. We would have to just "flip the switch" on everything. But we wouldn't want to do this unless we know that the jobs are working and we have enough capacity.

The jobrunner cluster in codfw is currently serving at most about 3k rps, so, back of the envelope, this would mean about a 5% increase in rps.
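A quick sanity check of that figure, using the numbers quoted above (about 10k parses per minute shifted onto a cluster serving roughly 3k requests per second):

\[
\frac{10\,000\ \text{parses/min}}{60\ \text{s/min}} \approx 167\ \text{parses/s},
\qquad
\frac{167}{3000\ \text{rps}} \approx 5.6\%
\]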

The p50 latency for jobrunner is around 3 minutes; given the current latencies for parsoid, I don't think this would meaningfully raise jobrunner's.

Do we have a way to enable the jobs per wiki, even with pre-generation being globally enabled? This could at least give us a way to check that the jobs are running correctly, even if we can't control what proportion of parses will be done via jobs.

jijiki updated the task description.

We've gone over the maths again with @akosiaris and the current provisioning for the jobrunner cluster should be able to handle the load transfer. We don't foresee needing to move servers pre-emptively from the parsoid cluster.

In any case, if we find it is getting overloaded, we can move servers from the parsoid cluster as needed.

We still have to dig into exactly how we can disable restbase caching for the wikis we enable the jobs for, turning restbase into a transparent reverse proxy and not a caching reverse proxy for them. @Eevans might be able to help with this. It won't solve the race between parsoid and jobrunner, but it will at least ensure we have the data in parsercache.

Last, we need to re-evaluate how the load on parsoid and jobrunner evolves with each iteration to make sure it stays within acceptable bounds.

Hi @daniel, as the Svc Ops team is figuring out what needs to be done, I would like to understand the priority of this task. The reason I am asking is that it looks like there is some groundwork to be done first, and then the implementation. Next week most of the team members are out for Easter. Is it possible to pick this up as soon as team members are back after Easter week?

Please let me know if you think otherwise.

Let me add one data point I just figured out. It doesn't look prudent to remove hosts from the parsoid cluster right now. Judging by the codfw parsoid cluster's CPU usage over the past 30 days, the cluster would suffer from a removal of capacity: CPU usage spikes up to ~75%, with a mean at ~35%. Furthermore, the trend over the last 90 days (across DCs) appears to be toward increasing CPU usage.

codfw parsoid CPU Usage 30d

image.png (1×2 px, 327 KB)

Total parsoid CPU usage 90d

image.png (1×2 px, 266 KB)

Do we have a way to enable the jobs per wiki, even with pre-generation being globally enabled? This could at least give us a way to check that the jobs are running correctly, even if we can't control what proportion of parses will be done via jobs.

Yes, we can enable the jobs per wiki.

Hi @daniel, as the Svc Ops team is figuring out what needs to be done, I would like to understand the priority of this task. The reason I am asking is that it looks like there is some groundwork to be done first, and then the implementation. Next week most of the team members are out for Easter. Is it possible to pick this up as soon as team members are back after Easter week?

Yes, sure. I'm out until the 17th myself. It would be good to get this done by the end of April.

Noting that I think (not sure, Daniel can confirm?) this is not going to be enabled for Commons and Wikidata, which are the biggest firehose of edits, so this is a much smaller amount of jobs than you think.

@Clement_Goubert, @daniel, and I had a short meeting today; we agreed to reword the task description to provide more context, while the actual work will happen after everyone is back from PTO.

[ ... ]

We still have to dig into exactly how we can disable restbase caching for the wikis we enable the jobs for, turning restbase into a transparent reverse proxy and not a caching reverse proxy for them. @Eevans might be able to help with this...

I don't know how to do this; AFAIK, anyone who ever had the necessary institutional knowledge for this has long ago "left the building".

I'm happy to help, but it would require the same sort of spelunking it would for anyone else, sorry. :(

It seems like the next step here is "Enable the jobs for wikis in batches, with SRE assistance. Possibly move more parsoid nodes to jobrunners if needed". If I understand the situation correctly, the JobRunner cluster should be able to handle the load for all rendering, but we should be careful anyway.

@jijiki Can you tell me a good batch of wikis to enable this on? Then I can make a config patch and deploy it this week.

Noting that I think (not sure, Daniel can confirm?) this is not going to be enabled for Commons and Wikidata, which are the biggest firehose of edits, so this is a much smaller amount of jobs than you think.

Yes, we should do this. Though we may have to revisit if we need the parsoid renderings for talk pages on these wikis.

Change 912929 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on group 0

https://gerrit.wikimedia.org/r/912929

After chatting with @daniel: either serviceops merges 912929 on Tuesday, if we feel confident enough, or alternatively we deploy it together this Thursday.

Change 912929 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on small wikis

https://gerrit.wikimedia.org/r/912929

Mentioned in SAL (#wikimedia-operations) [2023-05-08T10:44:45Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:20:17Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:21:41Z] <daniel@deploy1002> daniel: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:35:44Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] (duration: 15m 26s)

Change 918388 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Change 918388 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:30:01Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:31:34Z] <daniel@deploy1002> daniel: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:38:12Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] (duration: 08m 10s)

From IRC:

<_joe_> duesen, effie before we enable more jobs, I want us to take a hard look at the jobrunners cpus
<_joe_> it seems we're at 75% utilization, which is way too much

Looks like we need to throw more servers at the problem; even if it is not this specific job that is adding to utilisation, since we have the hardware to do so, we should. @daniel It is more likely that we will be able to add more servers next week; I will let you know.

Change 923426 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Change 923426 merged by Effie Mouzeli:

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors:

  • parse1015 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767551_parse1015.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed:

  • parse1013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767513_parse1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed:

  • parse1014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260856_jiji_3767538_parse1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:26:20Z] <effie> parse1013-parse1016 have been depooled and removed from the parsoid-php service - T329366

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed:

  • parse1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260859_jiji_3767559_parse1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:54:02Z] <effie> pool parse1013-parse1016 to the jobrunner cluster - T329366

Change 923588 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on some top wikis

https://gerrit.wikimedia.org/r/923588

Change 923588 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on frwiki

https://gerrit.wikimedia.org/r/923588

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:46:43Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:48:19Z] <daniel@deploy1002> daniel: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:55:53Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] (duration: 09m 09s)

Note to self:

12:51 <_joe_> so the changeprop change - as a quick pointer - you need to edit operations/deployment-charts:helmfile.d/services/changeprop-jobqueue/values.yaml
12:51 <_joe_> add a configuration for this job to high_traffic_jobs_config
12:51 <_joe_> I would suggest we start relatively low with the concurrency, something similar to cdnPurge, maybe

....

<_joe_> and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=74
12:54 <_joe_> tells me the concurrency is low enough right now
12:55 <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5 this is the mean backlog time