
Transfer a sane journal (subgraph:main) to wdqs2021 from wdqs2022
Closed, Resolved · Public

Description

wdqs2021 appears to have its updater consuming from codfw.rdf-streaming-updater.mutation, causing a high number of divergences: its journal was probably transferred from a journal loaded with the main graph, so the full-graph mutations it is applying do not match the data. The data in this journal is probably not salvageable, and a transfer from a sane journal should be done.
The service definition /lib/systemd/system/wdqs-updater.service does properly reference codfw.rdf-streaming-updater.mutation-main, so I'm not sure what happened.
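
One quick way to check which unit file systemd actually loads, and which topic it names (a sketch; the grep pattern just assumes the topic string appears literally in the unit file):

    # systemctl cat prints the path of the unit file in effect as a leading
    # "# /path/..." comment, followed by its contents.
    systemctl cat wdqs-updater.service | grep -E '^# /|mutation'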

AC:

Event Timeline

Gehel triaged this task as High priority.Sep 2 2024, 3:08 PM
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

Mentioned in SAL (#wikimedia-operations) [2024-09-03T16:09:29Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards

Mentioned in SAL (#wikimedia-operations) [2024-09-03T16:57:55Z] <bking@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards

Unfortunately wdqs2021 is still consuming from the wrong topic after the transfer.
Looking closer it appears that the service definition for the wdqs-updater is duplicated in two locations:

  • /etc/systemd/system/wdqs-updater.service containing the wrong topic codfw.rdf-streaming-updater.mutation
  • /lib/systemd/system/wdqs-updater.service with the right topic codfw.rdf-streaming-updater.mutation-main

The version in /etc takes precedence (systemd prefers units in /etc/systemd/system over /lib/systemd/system), which explains why the previous transfer did not work as expected. Why /etc/systemd/system/wdqs-updater.service is there at all is unknown to me.
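
A minimal way to confirm which copy is in effect (sketch):

    # FragmentPath reports the unit file systemd actually loaded; on this
    # host it should point at the stale /etc copy with the wrong topic.
    systemctl show -p FragmentPath wdqs-updater.service
    # Expected output on the broken host:
    # FragmentPath=/etc/systemd/system/wdqs-updater.service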

So we need to clean up this host (possibly re-image it) to make sure no other stale files are present, then re-run the transfer.
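
If the stray unit file turned out to be the only stale artifact, a lighter-weight cleanup than a reimage could look like this sketch (the reimage route below was chosen precisely because we can't be sure nothing else is stale):

    # Remove the /etc override so the /lib unit (with the -main topic)
    # takes effect again, then reload and restart the updater.
    sudo rm /etc/systemd/system/wdqs-updater.service
    sudo systemctl daemon-reload
    sudo systemctl restart wdqs-updater.service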

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2021.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202409111344_bking_876306_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console wdqs2021.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-09-11T15:50:21Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791

Mentioned in SAL (#wikimedia-operations) [2024-09-11T15:50:36Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2021.codfw.wmnet with reason: T373791

Mentioned in SAL (#wikimedia-operations) [2024-09-11T16:25:40Z] <bking@cumin2002> START - Cookbook sre.wdqs.data-transfer (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards

Mentioned in SAL (#wikimedia-operations) [2024-09-11T17:17:42Z] <bking@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (None, T373791) xfer wikidata_main from wdqs2022.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards

The reimage and data-transfer from wdqs2022 to wdqs2021 appear to have worked.

  • I've repooled the wdqs-main pool in codfw: sudo confctl --object-type discovery select 'dnsdisc=wdqs-main,name=codfw' set/pooled=true
  • I've looked at the dashboards linked above, and the triple divergences appear to be zero.
  • I'm not 100% sure what the Consumer Kafka dashboard is supposed to look like; a host-side sanity check is sketched below.
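
A host-side sanity check, independent of the dashboards (a sketch; the log grep assumes the updater logs its topic name):

    # Confirm the repool took effect (confctl 'get' prints current state).
    sudo confctl --object-type discovery select 'dnsdisc=wdqs-main,name=codfw' get
    # Confirm the unit in effect after the reimage and the topic it names;
    # only the /lib copy should exist now.
    systemctl show -p FragmentPath wdqs-updater.service
    grep mutation /lib/systemd/system/wdqs-updater.service
    # Recent updater logs should mention mutation-main, not the bare topic
    # (exact log wording is an assumption).
    journalctl -u wdqs-updater -n 200 | grep -i mutation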

Leaving in "Needs Review" so @dcausse can confirm it's working, or send it back to SRE if not.

Mentioned in SAL (#wikimedia-operations) [2024-09-13T16:55:41Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on wdqs[2021-2024].codfw.wmnet with reason: T373791

Mentioned in SAL (#wikimedia-operations) [2024-09-13T16:55:57Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wdqs[2021-2024].codfw.wmnet with reason: T373791