
Investigate performance differences between wdqs2022 and older hosts
Closed, Resolved · Public · 8 Estimated Story Points

Description

While working on T321605 and T331300, we noticed a performance discrepancy between wdqs2022 (newest active host) and prior hosts. The linked graph suggests that the older hosts are 20-40% faster at triples ingestion, a key metric for WDQS. Data import times are recorded in T241128.

Hardware differences are noted here. wdqs2022 is our first R450 in production, and it's also the first Bullseye host running the WDQS stack.

I also noticed that our CPU frequency governors are set to 'powersave' when they should probably be 'performance'. Per an IRC conversation in #wikimedia-sre, tickets T225713, T315398, and T328957 have some history and insights on past efforts to choose a CPU performance governor. Note that even the older hosts are using 'powersave', so this is probably not our root cause.
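
As a reference point, here is a minimal sketch of how the governor can be inspected and (temporarily) switched on a single host, assuming the standard cpufreq sysfs interface and the cpupower tool are available; any persistent change would of course go through Puppet rather than be done by hand:

  # Show the current governor for every CPU (normally all identical)
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

  # List the governors the driver actually offers on this host
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

  # Temporarily switch all CPUs to 'performance' (does not survive a reboot)
  sudo cpupower frequency-set -g performance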

Creating this ticket to:

  • Identify the root cause of performance differences

Event Timeline

The most notable difference I see is in the per-disk utilization on the cluster overview dashboard. During the backfilling period, all the other codfw hosts reported a max per-disk utilization of around 10%. On wdqs2022, half the disks were at 25% and the other half at 45-50%.

A general review of IO differences in the graphs between wdqs2022 and wdqs2009-2012 (a quick iostat sketch to reproduce these numbers follows the list):

  • Total IOPS and throughput are reduced on wdqs2022, at a rate roughly similar to the difference in ingestion rate. Overall IOPS are in the 1k-2k range, which shouldn't be significant.
  • Total throughput on all instances is not significant (low tens of MB/s); nothing should be throttling there.
  • There is a curious difference in the read-operation graphs between the other hosts and wdqs2022, but the rate is only 10MB/s, which shouldn't be significant.
  • iowait doesn't seem all that high and is comparable between instances, which suggests the application isn't waiting on IO very often.
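
To double-check the dashboard numbers directly on a host, a quick sketch using iostat from the sysstat package (an assumption about local tooling; the dashboards themselves are fed by the Prometheus node exporter):

  # Extended per-device stats every 10 seconds; the r/s, w/s, rMB/s, wMB/s
  # and %util columns correspond to the graphs discussed above
  iostat -x -m 10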

A very quick health check on the instances (the corresponding commands are sketched after this list):

  • all instances appear to be using 6Gb/s SATA connections
  • nothing obviously complaining in dmesg
  • all instances have the same readahead settings
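
For reference, those checks boil down to roughly the following, assuming util-linux and a reasonably recent kernel (device names and log noise will vary per host):

  # Negotiated SATA link speed per port (expect "SATA link up 6.0 Gbps")
  sudo dmesg | grep -i 'SATA link up'

  # Anything the kernel is unhappy about (I/O errors, resets, timeouts)
  sudo dmesg --level=err,warn | tail -n 50

  # Readahead setting (in kB) for every block device
  for dev in /sys/block/sd*; do
      echo "$(basename "$dev"): $(cat "$dev/queue/read_ahead_kb") kB readahead"
  done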

Overall, I don't have a clear sign of what's going on here. The IO demands of the application don't seem to be all that significant; healthy instances see 2k IOPS and 20MB/s per disk, yet the new host shows 50% utilization (vs 10% on healthy hosts) and reduced throughput. While suspicious, it's not clear to me that this is the source of our throughput reduction.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Notes from today's pairing session:

  • New hosts (R450 chassis) have more CPUs/threads (48, as opposed to the prior hosts' 32), but the R450 has a slower clock speed per CPU (2.1 GHz vs 2.5 GHz). This is a good trade-off in normal circumstances, because it allows the host to respond to more queries in parallel.

However, because the triples ingestion process is effectively single-threaded, it seems reasonable that the older chassis outperforms the R450. Regardless, we still want to verify that hardware is the root cause, so we're reimaging wdqs2021 (R450 chassis) back to Buster.
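
A quick way to confirm the core-count/clock trade-off on each chassis, using nothing beyond lscpu and /proc/cpuinfo:

  # Thread count, CPU model and rated max clock for the host being compared
  lscpu | grep -E '^CPU\(s\)|Model name|CPU max MHz'

  # Spot-check what the cores are actually running at right now
  grep 'cpu MHz' /proc/cpuinfo | head -n 4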

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster completed:

  • wdqs2021 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202219_bking_3030130_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Looking at triples ingestion during a data-transfer, we see the "added" metric is pretty much identical between wdqs2021 (Buster) and wdqs2022 (Bullseye).

Thus, I think we can safely conclude that the performance differences are NOT OS-related. As stated previously, we aren't too concerned about performance differences, so long as they are hardware-related. As such, I believe we can resolve this ticket.
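
For anyone wanting to compare the two hosts side by side outside of Grafana, a query against the Prometheus HTTP API along these lines would do it; note that the metric name and endpoint below are hypothetical placeholders, and the real ones come from the dashboard panel's definition:

  # NOTE: 'wdqs_triples_added_total' and the Prometheus URL are placeholders;
  # substitute whatever metric/endpoint the Grafana panel actually queries.
  curl -sG 'http://prometheus.example.org/api/v1/query' \
      --data-urlencode 'query=rate(wdqs_triples_added_total{instance=~"wdqs202[12].*"}[5m])'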

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Gehel claimed this task.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306292128_bking_1244684_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
bking reopened this task as In Progress. Nov 20 2023, 4:10 PM
bking claimed this task.
bking updated the task description.
bking added subscribers: dcausse, Gehel.

Reopening per today's IRC conversation. We really need this process to be faster, so we'll try enabling the performance governor and see what happens.

Gehel moved this task from Done to Ready for Work on the Data-Platform-SRE board.

The ticket description mentions wdqs2022, but the slowdown was also observed on wdqs1022 and wdqs1023, which both have 2.40 GHz CPUs. The disk utilization saturation that Erik observed in T336443#8845469 was also seen on these hosts during the reload.

Gehel changed the point value for this task from 5 to 8.

Created an Etherpad for brainstorming/test results/etc.

Loading only a few chunks can be done with loadData.sh -s and -e options (start and end).
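
For example, something along these lines; the -n (namespace) and -d (munged data directory) values here are illustrative and depend on the local setup, and only -s/-e come from the comment above:

  # Load only munged chunks 1 through 10 (-n and -d are assumptions about
  # the local namespace and data directory)
  ./loadData.sh -n wdq -d /srv/wdqs/munged -s 1 -e 10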

Gehel triaged this task as Medium priority. Dec 6 2023, 1:14 PM

Change #1020834 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: enable CPU performance governor for w[cd]qs

https://gerrit.wikimedia.org/r/1020834

Change #1020834 merged by Bking:

[operations/puppet@production] query_service: enable CPU performance governor for w[cd]qs

https://gerrit.wikimedia.org/r/1020834

I've made the following changes:

  • wdqs2022: changed the CPU performance governor
  • wdqs2023: set BIOS settings to performance/maximum performance under System Profile Settings; rebooted to apply the settings

Based on the Grafana dashboard, it seems the CPU performance governor change nearly doubled wdqs2022's triples ingestion rate, which helped clear the corrupted backlog as discussed in T362508. More details (and some questionable math) are in this Etherpad.

I'll start rolling out the performance governor changes to the rest of the WDQS hosts. Once that's done, we might want to explore this optimization for other Search Platform-owned hosts. I'll probably open a separate ticket for that.
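
One way to verify the rollout across the fleet afterwards would be a read-only Cumin run from a cluster management host; the host selector below is only an illustration, not the exact alias:

  # Check the active governor on every WDQS host (selector is illustrative)
  sudo cumin 'wdqs*' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'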

I've enabled the performance governor change on all WDQS hosts (although it looks like only the R450s are seeing major benefits). I also rolled back the system profile change on wdqs2023.

As such, I am closing out this ticket.