
Enable WarmParsoidParserCache on all wikis
Closed, Resolved · Public

Assigned To
Authored By
daniel
Feb 10 2023, 1:07 PM

Description

What?

Before we can switch VisualEditor (VE) to using Parsoid in MediaWiki (rather than going through RESTBase), we need to ensure that Parsoid output for each page is present in the ParserCache.

Context

Currently, RESTBase is notified when a page needs to be re-parsed, and it then calls the page/html endpoint exposed by the Parsoid extension to get the updated HTML. These requests are currently routed to the Parsoid cluster. At this point, the HTML is written to the parser cache.

In the future, we want to turn off caching in RESTBase. When we do this, we need another mechanism in place to ensure we have up-to-date Parsoid renderings of each page in the ParserCache. This can be done by setting WarmParsoidParserCache to true in $wgParsoidCacheConfig, which causes a ParsoidCachePrewarmJob to be scheduled whenever a page needs to be re-rendered.

These jobs are currently only enabled on testwiki and mediawikiwiki, because we did not want to overload the JobRunner cluster. The purpose of this task is to track all the work needed to enable ParsoidCachePrewarmJobs on all wikis.

The request flow is as follows: mw -> eventgate-main -> kafka -> changeprop-jobqueue -> jobrunners
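
In configuration terms, the switch described above is a single flag. A minimal sketch, assuming the key name given in this description (the exact wmf-config layout may differ):

```php
// Minimal sketch, assuming the key name from the description above; in
// production this would live in wmf-config rather than LocalSettings.php.
$wgParsoidCacheConfig['WarmParsoidParserCache'] = true;

// With this set, a ParsoidCachePrewarmJob is scheduled whenever a page needs
// re-rendering; the job parses the page with Parsoid and writes the result
// to the ParserCache.
```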

How?

Per @Joe's suggestion:

  • Increase the memory limit for the jobrunner cluster in mediawiki-config
  • Move a few servers from the parsoid cluster to the jobrunner cluster (in puppet) [Not needed pre-emptively]
  • Enable the jobs for wikis in batches, with SRE assistance, possibly moving more parsoid nodes to the jobrunners if needed (an illustrative config sketch follows this list)
  • For each batch, configure RESTBase to disable the caching for that wiki - basically, only send out the purges for its URLs.
  • Rinse and repeat until nothing is left.
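
To illustrate the batched rollout in step 3, a hypothetical InitialiseSettings.php-style stanza could look like the following; the setting name and the dblist tag are placeholders for illustration, not the actual production configuration.

```php
// Hypothetical per-wiki rollout sketch; the setting name and keys are
// illustrative only, not copied from wmf-config.
'wmgWarmParsoidParserCache' => [
	'default' => false,       // jobs stay disabled until a wiki's batch is reached
	'testwiki' => true,       // initial pilot wikis, per the description above
	'mediawikiwiki' => true,
	'small' => true,          // a dblist tag could switch on a whole batch at once
],
```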

At this point, the scope of this specific task is complete. The remaining steps are listed here only for completeness.

  • Decide how we'll send the purges to the CDN for the RESTBase URLs. My favourite option, if we don't want to set up a complete system now, would be to generate the purges when the warm-up job completes.

  • Stop sending requests to restbase from change-propagation for parsoid.
  • Move the requests for parsoid from all sources to bypass restbase (via the api gateway restbase compatibility layer)
  • Kill the restbase storage for parsoid

Alternative steps (elegant but riskier)
Basically, instead of points 1-4 of the previous procedure:

  • Make changeprop handle these jobs separately.
  • Either submit them to the parsoid cluster as jobs (this requires some SRE work) or, preferably, just call the URL for the rendering of the page directly (this probably requires some work on changeprop); see the sketch below.
  • Once the ParserCache is sufficiently warm, disable RESTBase caching for a batch of wikis, keeping only the purges it emits.
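
For the "just call the URL" variant, the warm-up amounts to an ordinary GET for the page's Parsoid HTML, relying on the server to populate the ParserCache as a side effect of rendering. A rough sketch; the endpoint path and the cache-population side effect are assumptions for illustration, not something this task confirms.

```php
<?php
// Rough sketch of the "call the rendering URL directly" option. The REST path
// below is an assumed example; the production setup would use whatever
// endpoint changeprop is configured with.
$title = rawurlencode( 'Main_Page' );
$url   = "https://en.wikipedia.org/w/rest.php/v1/page/{$title}/html";

$ch = curl_init( $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_USERAGENT, 'parsoid-prewarm-sketch/0.1 (example only)' );
$html = curl_exec( $ch );

if ( $html === false ) {
	fwrite( STDERR, 'Request failed: ' . curl_error( $ch ) . "\n" );
} else {
	// The useful work happens server-side (rendering with Parsoid); the body
	// itself can be discarded once the request succeeds.
	echo strlen( $html ) . " bytes of Parsoid HTML fetched\n";
}
curl_close( $ch );
```
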
Dashboards to monitor


Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2023-05-08T10:44:45Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:20:17Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:21:41Z] <daniel@deploy1002> daniel: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-08T11:35:44Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:912929|Enable parser cache warming jobs for parsoid on small wikis (T329366)]] (duration: 15m 26s)

Change 918388 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Change 918388 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on medium wikis

https://gerrit.wikimedia.org/r/918388

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:30:01Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:31:34Z] <daniel@deploy1002> daniel: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-05-10T09:38:12Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:918388|Enable parser cache warming jobs for parsoid on medium wikis (T329366)]] (duration: 08m 10s)

From IRC:

<_joe_> duesen, effie before we enable more jobs, I want us to take a hard look at the jobrunners cpus
<_joe_> it seems we're at 75% utilization, which is way too much

Looks like we need to put more servers on the problem; even if it is not this specific job that is adding to the utilisation, since we have the hardware to do so, we should. @daniel It is more likely that we will be able to add more servers next week, I will let you know.

Change 923426 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Change 923426 merged by Effie Mouzeli:

[operations/puppet@production] conftool: Add more servers to the jobrunner problem

https://gerrit.wikimedia.org/r/923426

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors:

  • parse1015 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767551_parse1015.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed:

  • parse1013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767513_parse1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed:

  • parse1014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260856_jiji_3767538_parse1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:26:20Z] <effie> parse1013-parse1016 have been depooled and removed from the parsoid-php service - T329366

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed:

  • parse1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260859_jiji_3767559_parse1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-05-26T09:54:02Z] <effie> pool parse1013-parse1016 to the jobrunner cluster - T329366

Change 923588 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on some top wikis

https://gerrit.wikimedia.org/r/923588

Change 923588 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on frwiki

https://gerrit.wikimedia.org/r/923588

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:46:43Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:48:19Z] <daniel@deploy1002> daniel: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-01T07:55:53Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:923588|Enable parser cache warming jobs for parsoid on frwiki (T329366)]] (duration: 09m 09s)

Note to self:

12:51 <_joe_> so the changeprop change - as a quick pointer - you need to edit operations/deployment-charts:helmfile.d/services/changeprop-jobqueue/values.yaml
12:51 <_joe_> add a configuration for this job to high_traffic_jobs_config
12:51 <_joe_> I would suggest we start relatively low with the concurrency, something similar to cdnPurge, maybe

....

<_joe_> and https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=74
12:54 <_joe_> tells me the concurrency is low enough right now
12:55 <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=parsoidCachePrewarm&viewPanel=5 this is the mean backlog time

Change 927236 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on enwiki

https://gerrit.wikimedia.org/r/927236

daniel triaged this task as High priority. Jun 5 2023, 6:13 PM
daniel moved this task from Unsorted to Doing on the RESTBase Sunsetting board.

Change 927236 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable parser cache warming jobs for parsoid on enwiki

https://gerrit.wikimedia.org/r/927236

Mentioned in SAL (#wikimedia-operations) [2023-06-06T13:53:40Z] <oblivian@deploy1002> Started scap: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-06T13:55:05Z] <oblivian@deploy1002> oblivian and daniel: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-06T14:01:37Z] <oblivian@deploy1002> Finished scap: Backport for [[gerrit:927236|Enable parser cache warming jobs for parsoid on enwiki (T329366)]] (duration: 07m 57s)

Change 927758 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Enable cache warming jobs for parsoid per default.

https://gerrit.wikimedia.org/r/927758

Change 927758 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable cache warming jobs for parsoid per default.

https://gerrit.wikimedia.org/r/927758

Mentioned in SAL (#wikimedia-operations) [2023-06-07T13:37:20Z] <lucaswerkmeister-wmde@deploy1002> Started scap: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. (T329366)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-07T13:38:55Z] <lucaswerkmeister-wmde@deploy1002> daniel and lucaswerkmeister-wmde: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. (T329366)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-07T13:47:48Z] <lucaswerkmeister-wmde@deploy1002> Finished scap: Backport for [[gerrit:927758|Enable cache warming jobs for parsoid per default. (T329366)]] (duration: 10m 27s)

Following this deployment and backlog times growing, @Ladsgroup added a specific lane for parsoidCachePrewarm in changeprop-jobqueue
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928063/

That wasn't quite enough, so we bumped the concurrency to 45 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/928069


Thanks, I had originally suggested a max concurrency of 40, and indeed your change has solved the issue quite quickly

image.png (940×2 px, 87 KB)

I think we should monitor these timings during this week to make sure they don't explode at the current concurrency and possibly add an alert for when the backlog is higher than 5 minutes or so.

We ended up bumping to 60 https://gerrit.wikimedia.org/r/928120 because backlog started growing again.

For future reference, what would be the consequence of these jobs being held for more than 5 minutes?


It's not a hard number - but it's conceivable that prewarming is only useful if it happens reasonably close to an edit - otherwise we're probably not gaining much from this job, and certainly not now, given that we are also re-generating this ParserCache entry via RESTBase/changeprop calling the Parsoid cluster.

We need to move more servers from the parsoid cluster to the jobrunners:
Parsoid saturation last 24h:

20230609-parsoid-24h.png (500×1 px, 81 KB)

Jobrunner saturation, last 24h:
20230609-jobrunner-24h.png (500×1 px, 118 KB)

Change 929691 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] changeprop-jobqueue: Increase memory limits

https://gerrit.wikimedia.org/r/929691

I've noticed erratic and spiky max memory use in changeprop-jobqueue since 2023-06-12 15:08

image.png (1×1 px, 266 KB)

This is a critical piece of the infrastructure; we probably don't want weirdness even in the edge case of the pods consuming most of their memory, so I am uploading a change to bump that limit.

Change 929691 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop-jobqueue: Increase memory limits

https://gerrit.wikimedia.org/r/929691

Change 934337 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] conftool: Add more servers to jobrunner cluster

https://gerrit.wikimedia.org/r/934337

Mentioned in SAL (#wikimedia-operations) [2023-06-29T14:20:55Z] <claime> Depooling mw148[2-6].eqiad.wmnet from api_appserver to move them to jobrunners - T329366

Change 934337 merged by Clément Goubert:

[operations/puppet@production] conftool: Add more servers to jobrunner cluster

https://gerrit.wikimedia.org/r/934337

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1482.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1483.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1484.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1484.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1485.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host mw1486.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1484.eqiad.wmnet with OS buster executed with errors:

  • mw1484 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1482.eqiad.wmnet with OS buster completed:

  • mw1482 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306291446_cgoubert_1170366_mw1482.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1483.eqiad.wmnet with OS buster completed:

  • mw1483 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306291449_cgoubert_1170447_mw1483.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1485.eqiad.wmnet with OS buster completed:

  • mw1485 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306291451_cgoubert_1170512_mw1485.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1484.eqiad.wmnet with OS buster completed:

  • mw1484 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306291453_cgoubert_1170473_mw1484.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-06-29T15:30:59Z] <claime> Pooled mw148[2-6].eqiad.wmnet as jobrunners - T329366

5 servers moved from api_appserver to jobrunners:

image.png (500×1 px, 87 KB)

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host mw1486.eqiad.wmnet with OS buster completed:

  • mw1486 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306291454_cgoubert_1170538_mw1486.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Parsoid cache warming has been enabled everywhere for a couple of months now.