
Enable WarmParsoidParserCache on all wikis
Open, Needs Triage, Public

Assigned To
Authored By
daniel
Feb 10 2023, 1:07 PM

Description

What?

Before we can switch VE to using Parsoid in MW (rather than going through RESTbase), we need to ensure that parsoid output for the page is present in the ParserCache.

Context

Currently, RESTbase is notified when a page needs to be re-parsed, and it then calls the page/html endpoint exposed by the Parsoid extension to get the updated HTML. These requests are routed to the Parsoid cluster, and when they are served, the HTML is also written to the parser cache.

In the future, we want to turn off caching in RESTbase. When we do this, we need another mechanism in place to ensure we have up-to-date parsoid renderings of each page in the parser cache. This can be done by setting WarmParsoidParserCache to true in $wgParsoidCacheConfig, which will cause a ParsoidCachePrewarmJob to be scheduled whenever a page needs to be re-rendered.

These jobs are currently only enabled on testwiki and mediawikiwiki, because we did not want to overload the JobRunner cluster. The purpose of this task is to track all the work needed to enable ParsoidCachePrewarmJobs on all wikis.
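For illustration, the switch itself is a single configuration key; a minimal sketch, assuming $wgParsoidCacheConfig is the usual associative settings array:

```php
// Minimal sketch: enabling this flag causes a ParsoidCachePrewarmJob to be
// scheduled whenever a page needs to be re-rendered (per the description above).
$wgParsoidCacheConfig['WarmParsoidParserCache'] = true;
```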

The request flow is as follows: mw -> eventgate-main -> kafka -> changeprop-jobqueue -> jobrunners
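On the MediaWiki end, that flow starts with an ordinary job push; the EventBus-backed job queue posts the job to eventgate-main, and everything after that is Kafka and changeprop-jobqueue. A rough sketch, with hypothetical parameter names (the real job spec is built inside MediaWiki core, not in configuration):

```php
use MediaWiki\MediaWikiServices;

// Illustration only: enqueue a parsoidCachePrewarm job for a page that needs
// re-rendering. 'pageId' and 'revId' are placeholder parameter names.
$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroup();
$jobQueueGroup->lazyPush( new JobSpecification(
	'parsoidCachePrewarm',
	[ 'pageId' => $pageId, 'revId' => $revId ]
) );
```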

How?

Per @Joe's suggestion:

  • Increase the memory limit for the jobrunner cluster in mediawiki-config
  • Move a few servers from the parsoid cluster to the jobrunner cluster (in puppet) [Not needed pre-emptively]
  • Enable the jobs for wikis in batches, with SRE assistance (see the sketch after this list). Possibly move more parsoid nodes to jobrunners if needed
  • For each batch, configure restbase to disable the caching for that wiki - basically, only send out the purges for its urls.
  • Rinse, repeat until nothing is left
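To make the batching concrete, here is a hedged sketch of how per-batch enablement could look in wmf-config, assuming a hard-coded list of database names (the actual patches grouped wikis as group 0, small, and medium wikis; the exact mechanism may differ):

```php
// Sketch only: enable Parsoid parser cache warming for one batch of wikis.
// $wgDBname identifies the current wiki; the batch list below is illustrative.
$parsoidWarmupBatch = [ 'testwiki', 'mediawikiwiki' ];

if ( in_array( $wgDBname, $parsoidWarmupBatch, true ) ) {
	$wgParsoidCacheConfig['WarmParsoidParserCache'] = true;
}
```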

Once all batches are done, the scope of this specific task is complete. The remainder is here just to complete the list.

  • Decide how we'll send the purges to the CDN for the restbase urls. My favourite option, if we don't want to set up a complete system now, would be to generate the purges when the warmup job is completed.

  • Stop sending requests to restbase from change-propagation for parsoid.
  • Move the requests for parsoid from all sources to bypass restbase (via the api gateway restbase compatibility layer)
  • Kill the restbase storage for parsoid

Alternative steps (elegant but riskier)
Basically, instead of points 1-4 of the previous procedure:

  • Make changeprop handle these jobs separately.
  • Either submit them to the parsoid cluster as jobs (this requires some SRE work), or (preferred) just call the URL for the rendering of the page directly (this probably requires some work on changeprop)
  • Once the parsercache is sufficiently warm, disable restbase caching for a batch of wikis, only keeping the purges emitted.
Dashboards to monitor

Event Timeline

I see a few ways to enable this job on all wikis, but fundamentally, the procedure that I think makes sense is as follows:

  1. Increase the memory limit for the jobrunner cluster in mediawiki-config
  2. Move a few servers from the parsoid cluster to the jobrunner cluster (in puppet)
  3. Enable the jobs for wikis in batches, with SRE assistance. Possibly move more parsoid nodes to jobrunners if needed
  4. For each batch, configure restbase to disable the caching for that wiki - basically only send out the purges for its urls.
  5. rinse, repeat until nothing is left
    • At this point, the scope of this specific task is complete. The remainder is here just to complete the list.
  6. Decide how we'll send the purges to the CDN for the restbase urls. My favourite option if we don't want to set up a complete system now would be to generate the purges when the warmup job is completed.
  7. Stop sending requests to restbase from change-propagation for parsoid.
  8. Move the requests for parsoid from all sources to bypass restbase (via the api gateway restbase compatibility layer)
  9. Kill the restbase storage for parsoid

There is an alternative route, which I think is more elegant but riskier. Basically, instead of points 1-4 of the previous procedure:

  1. Make changeprop handle these jobs separately. Either submit them to the parsoid cluster as jobs (this requires some SRE work), or (preferred) just call the URL for the rendering of the page directly (this probably requires some work on changeprop)
  2. Once the parsercache is sufficiently warm, disable restbase caching for a batch of wikis, only keeping the purges emitted.

Do we have an idea of how much load would be shifted from parsoid to jobrunner so I can try and evaluate how many hosts should be moved over?

Do we have an idea of how much load would be shifted from parsoid to jobrunner so I can try and evaluate how many hosts should be moved over?

Quite a lot: about 10k parses per minute.

The problem is: if we enable the jobs and also leave pre-generation in restbase enabled, then we don't know which of the two clusters is going to do the parse - whichever is hit first will parse, the other one will see a parser cache hit.

I don't think we have a way to disable pre-generation on parsoid per wiki right now. We would have to just "flip the switch" on everything. But we wouldn't want to do this unless we know that the jobs are working and we have enough capacity.

The jobrunner cluster in codfw is currently serving at most about 3k rps, so, back of the envelope, this would mean about a 5% increase in rps.
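A quick sanity check of that figure, using the numbers quoted above (about 10k parses per minute shifted onto a cluster serving roughly 3k requests per second):

\[
\frac{10\,000\ \text{parses/min}}{60\ \text{s/min}} \approx 167\ \text{parses/s},
\qquad
\frac{167}{3000\ \text{rps}} \approx 5.6\%
\]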

The p50 latency for jobrunner is around 3 minutes; given the current latencies for parsoid, I don't think this would meaningfully raise jobrunner's.

Do we have a way to enable the jobs per wiki, even with pre-generation being globally enabled? This could at least give us a way to check that the jobs are running correctly, even if we can't control what proportion of parses will be done via jobs.

jijiki updated the task description.

We've gone over the maths again with @akosiaris and the current provisioning for the jobrunner cluster should be able to handle the load transfer. We don't foresee needing to move servers pre-emptively from the parsoid cluster.

In any case, if we find it is getting overloaded, we can move servers from the parsoid cluster as needed.

We still have to dig into exactly how we can disable restbase caching for the wikis we enable the jobs for, turning restbase into a transparent reverse proxy and not a caching reverse proxy for them. @Eevans might be able to help with this. It won't solve the race between parsoid and jobrunner, but it will at least ensure we have the data in parsercache.

Last, we need to re-evaluate how the load on parsoid and jobrunner evolves with each iteration to make sure it stays within acceptable bounds.

Hi @daniel, as the Svc Ops team is figuring out what needs to be done, I would like to understand the priority of this task. The reason I am asking is that it looks like there is some groundwork to be done first, and then the implementation. Next week most of the team members are out for Easter. Is it possible to pick this up as soon as team members are back after Easter week?

Please let me know if you think otherwise.

Let me add one data point I just figured out. It doesn't look prudent to remove hosts from the parsoid cluster right now. Judging by the codfw parsoid cluster's CPU usage over the past 30 days, the cluster would suffer from a removal of capacity: CPU usage spikes up to ~75%, with a mean at ~35%. Furthermore, the trend over the last 90 days (across DCs) appears to be toward increasing CPU usage.

codfw parsoid CPU Usage 30d

image.png (1×2 px, 327 KB)

Total parsoid CPU usage 90d

image.png (1×2 px, 266 KB)

Do we have a way to enable the jobs per wiki, even with pre-generation being globally enabled? This could at least give us a way to check that the jobs are running correctly, even if we can't control what proportion of parses will be done via jobs.

Yes, we can enable the jobs per wiki.

Hi @daniel, as the Svc Ops team is figuring out what needs to be done, I would like to understand the priority of this task. The reason I am asking is that it looks like there is some groundwork to be done first, and then the implementation. Next week most of the team members are out for Easter. Is it possible to pick this up as soon as team members are back after Easter week?

Yes, sure. I'm out until the 17th myself. It would be good to get this done by the end of April.

Noting that I think (not sure, Daniel can confirm?) this is not going to be enabled for Commons and Wikidata, which are the biggest firehose of edits, so this is a much smaller amount of jobs than you think.

@Clement_Goubert, @daniel, and I had a short meeting today; we agreed to reword the task description to provide more context, while the actual work will happen after everyone is back from PTO.

[ ... ]

We still have to dig into exactly how we can disable restbase caching for the wikis we enable the jobs for, turning restbase into a transparent reverse proxy and not a caching reverse proxy for them. @Eevans might be able to help with this...

I don't know how to do this; AFAIK, anyone who ever had the necessary institutional knowledge for this has long ago "left the building".

I'm happy to help, but it would require the same sort of spelunking it would for anyone else, sorry. :(

It seems like the next step here is "Enable the jobs for wikis in batches, with SRE assistance. Possibly move more parsoid nodes to jobrunners if needed". If I understand the situation correctly, the JobRunner cluster should be able to handle the load for all rendering, but we should be careful anyway.

@jijiki Can you tell me a good batch of wikis to enable this on? Then I can make a config patch and deploy it this week.

Noting that I think (not sure, Daniel can confirm?) this is not going to be enabled for Commons and Wikidata, which are the biggest firehose of edits, so this is a much smaller amount of jobs than you think.

Yes, we should do this. Though we may have to revisit if we need the parsoid renderings for talk pages on these wikis.

Change 912929 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on group 0

https://gerrit.wikimedia.org/r/912929

After chatting with @daniel: either serviceops merges 912929 on Tuesday, if we feel confident enough, or alternatively we deploy it together this Thursday.

Change 912929 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on small wikis

https://gerrit.wikimedia.org/r/912929

Mentioned in SAL (#wikimedia-operations) [2023-05-08T10:44:45Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:20:17Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:21:41Z] <daniel@deploy1002> daniel: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:35:44Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] (duration: 15m 26s)

Change 918388 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Change 918388 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:30:01Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:31:34Z] <daniel@deploy1002> daniel: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:38:12Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] (duration: 08m 10s)

From IRC:

<_joe_> duesen, effie before we enable more jobs, I want us to take a hard look at the jobrunners cpus
<_joe_> it seems we're at 75% utilization, which is way too much

Looks like we need to throw more servers at the problem; even if it is not this specific job that is adding to utilisation, since we have the hardware to do so, we should. @daniel It is more likely that we will be able to add more servers next week; I will let you know.

Change 923426 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Change 923426 merged by Effie Mouzeli:

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors:

  • parse1015 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767551_parse1015.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed:

  • parse1013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767513_parse1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed:

  • parse1014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260856_jiji_3767538_parse1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:26:20Z] <effie> parse1013-parse1016 have been depooled and removed from the parsoid-php service - T329366

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed:

  • parse1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260859_jiji_3767559_parse1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:54:02Z] <effie> pool parse1013-parse1016 to the jobrunner cluster - T329366

Change 923588 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on some top wikis

https://gerrit.wikimedia.org/r/923588

Change 923588 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on frwiki

https://gerrit.wikimedia.org/r/923588

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:46:43Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:48:19Z] <daniel@deploy1002> daniel: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:55:53Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] (duration: 09m 09s)

Note to self:

12:51 <_joe_> so the changeprop change - as a quick pointer - you need to edit operations/deployment-charts:helmfile.d/services/changeprop-jobqueue/values.yaml
12:51 <_joe_> add a configuration for this job to high_traffic_jobs_config
12:51 <_joe_> I would suggest we start relatively low with the concurrency, something similar to cdnPurge, maybe

....

<_joe_> and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=74
12:54 <_joe_> tells me the concurrency is low enough right now
12:55 <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5 this is the mean backlog time