Test categories performance under Ganeti
Closed, Resolved, Public

Description

Per pairing with @dcausse this morning, we think it would be wise to look into Ganeti as a potential option for the categories migration.

Creating this ticket to:

  • Investigate current wdqs-categories resource usage
  • (separate ticket) Assuming the resources are within Ganeti's capabilities, provision a VM with the appropriate resources
  • (separate ticket) Work through the Puppet code until we are confident we can deploy categories independently from the rest of the wdqs stack. Even if we don't end up migrating to Ganeti, this will force us to clean up and logically separate the Puppet code, which is an operational win in itself.

Event Timeline

bking changed the task status from Open to In Progress. Wed, Sep 25, 7:43 PM
bking triaged this task as Medium priority.

I have a few panels at the bottom of the WDQS dashboard which should help us estimate the resource needs for Categories.

This dashboard has data about the individual blazegraph instances’ performance. Based on the data, categories’ resource usages are as follows:

Change #1076841 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-categories: introduce VM for testing

https://gerrit.wikimedia.org/r/1076841

Change #1076841 merged by Bking:

[operations/puppet@production] wdqs-categories: introduce VM for testing

https://gerrit.wikimedia.org/r/1076841

Change #1077427 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-categories: use correct insetup role

https://gerrit.wikimedia.org/r/1077427

Change #1077427 merged by Bking:

[operations/puppet@production] wdqs-categories: use correct insetup role

https://gerrit.wikimedia.org/r/1077427

Mentioned in SAL (#wikimedia-operations) [2024-10-02T21:54:58Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs-categories1001.eqiad.wmnet with reason: T375687

Mentioned in SAL (#wikimedia-operations) [2024-10-02T21:55:13Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs-categories1001.eqiad.wmnet with reason: T375687

Change #1077777 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs.categories-reload: don't check host

https://gerrit.wikimedia.org/r/1077777

Change #1077777 merged by Bking:

[operations/cookbooks@master] wdqs.categories-reload: don't check host

https://gerrit.wikimedia.org/r/1077777

bking closed this task as Resolved. Edited, Tue, Oct 8, 6:47 PM
bking claimed this task.

I provisioned wdqs-categories1001 in T376079. After provisioning, I one-offed the host and loaded categories via /usr/local/bin/reloadCategories.sh wdqs. As demonstrated by this graph, the reload took ~2 hours. Post-reload, memory usage has been stable at ~10 GB. I think this evidence confirms that we can run categories in the Ganeti infrastructure if necessary. At this point, I'm ready to decom/destroy this VM and work on a migration in a future task.*

Moving to "Needs Review" so @dcausse has an opportunity to check out the VM, ask questions, run tests, etc. before we get rid of it.

* Where we migrate is still an open question; this shouldn't be taken to mean we are definitely going to Ganeti.

Thanks for running the test. I can't access the VM, but if the reload worked fine, this is a very good indication that this should be enough.