
Test backfilling for cirrus-streaming-updater
Closed, ResolvedPublic5 Estimated Story Points

Description

At today's Wednesday meeting, @EBernhardson, @dcausse and the rest of us discussed the next steps for the Search Update Pipeline.

We believe a backfill test is needed, so we have a better idea of how the Search Update Pipeline application, Flink, and our infrastructure (Kafka) behave during a backfill.

Things that need to be addressed by the test:

  • capacity on k8s and on all backends (swift, mw-api, kafka, elasticsearch)
  • functional correctness

Potential issues:

  • We don't have a target elasticsearch cluster
  • This will require enabling re-render for wikis under test (with load impact)

Creating this ticket to:

  • Devise a backfill test
  • Run the test
  • Collect the results
  • Address blocking issues
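Conceptually, a backfill replays a Kafka topic from an earlier offset through the pipeline, instead of consuming only from the committed offset onward. A minimal stdlib-only sketch of that idea (the real job uses Flink's Kafka source against real topics; this toy in-memory "topic" is purely illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ToyTopic:
    """In-memory stand-in for a single Kafka topic partition."""
    records: list = field(default_factory=list)
    committed_offset: int = 0  # where normal (live) consumption resumes

    def append(self, rec):
        self.records.append(rec)

    def consume_from(self, offset):
        """Replay records starting at `offset`; a backfill reads from 0."""
        return self.records[offset:]

topic = ToyTopic()
for i in range(5):
    topic.append(f"update-{i}")
topic.committed_offset = 3

live = topic.consume_from(topic.committed_offset)  # only the tail
backfill = topic.consume_from(0)                   # full replay
```

The point of the test is that the full replay pushes far more records through the pipeline per unit time than live consumption, which is what stresses capacity on k8s and the backends.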

Event Timeline

My rough notes on this subject are here. I'm still learning Flink and Kafka, so I will need some help creating the backfill test.

Gehel set the point value for this task to 5. Nov 13 2023, 4:43 PM
Gehel triaged this task as High priority. Nov 15 2023, 9:36 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
Gehel moved this task from Misc to Quarterly Goals on the Data-Platform-SRE board.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Apologies for the reimage spam; it's from an unrelated operation.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye completed:

  • cloudelastic1009 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161713_bking_2708107_cloudelastic1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Gehel subscribed.

Once we have a working deployment on Cloudelastic (T352335), we can just re-run a backfill operation there.

Another option we came up with was to backfill to a null Flink sink. This would allow measuring the capacity of the Flink pipeline by itself, separate from the ability of the chosen elasticsearch cluster to consume those updates.
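The null-sink idea can be sketched as follows: route the pipeline's output into a sink that discards every record, so the measured rate reflects only the pipeline's own processing cost. (In Flink this would be a discarding sink in the job graph; the stdlib timing loop below is a conceptual stand-in, not the real job.)

```python
import time

def null_sink(record):
    """Discard the record; the sink contributes ~zero cost."""
    pass

def measure_throughput(source, sink, transform=lambda r: r):
    """Push records through transform + sink, return records/second."""
    start = time.perf_counter()
    count = 0
    for record in source:
        sink(transform(record))
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

# With a null sink, the rate is an upper bound for what any real
# elasticsearch sink could achieve with the same pipeline.
rate = measure_throughput(range(100_000), null_sink)
```

Comparing this upper bound against the rate observed with the real sink attributes the bottleneck to either the pipeline or the target cluster.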

We probably want to run the backfill test once we have enabled all wikis, so that the load generated is maximized.

Still to be done:

  • HTTP timeout
  • Documentation
  • Validate elasticsearch ingestion throughput once we're on Cloudelastic
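The "HTTP timeout" item above can be illustrated with a minimal sketch (function names are hypothetical; the real pipeline configures timeouts in the Flink job's HTTP client, not via urllib): bound each request with a timeout and retry with capped exponential backoff, so a slow backend stalls the backfill for a bounded time rather than indefinitely.

```python
import socket
import urllib.error
import urllib.request

def backoff_schedule(retries, base=0.5, cap=8.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]

def fetch_with_timeout(url, timeout=3.0, retries=3):
    """Fetch a URL, giving up after `retries` timed-out attempts.

    Hypothetical helper for illustration only.
    """
    last_err = None
    for delay in backoff_schedule(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as err:
            last_err = err
            # A real job would sleep `delay` seconds before retrying.
    raise last_err
```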

One of the key findings of the backfill tests was a lower-than-expected throughput, see T353460. That was mainly caused by a bug inside the Flink consumer, but there is still an unexpectedly high rate of timeouts (from Envoy's perspective), see T354289.