
Investigate performance differences between wdqs2022 and older hosts
Closed, Resolved · Public · 8 Estimated Story Points

Description

While working on T321605 and T331300, we noticed a performance discrepancy between wdqs2022 (newest active host) and prior hosts. The linked graph suggests that the older hosts are 20-40% faster at triples ingestion, a key metric for WDQS. Data import times are recorded in T241128.

Hardware differences are noted here. wdqs2022 is our first R450 in production, and it's also the first Bullseye host running the WDQS stack.

I also noticed that our CPU frequency governors are set to 'powersave' when they should probably be 'performance'. Per an IRC conversation in #wikimedia-sre, tickets T225713, T315398, and T328957 have some history and insights on past efforts to choose a CPU performance governor. Note that even the older hosts are using 'powersave', so this is probably not our root cause.
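
As a reference point, here is a minimal sketch of how the governor can be inspected and (temporarily) switched on a single host, assuming the standard cpufreq sysfs interface and the cpupower tool are available; any persistent change would of course go through Puppet rather than be done by hand:

  # Show the current governor for every CPU (normally all identical)
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

  # List the governors the driver actually offers on this host
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

  # Temporarily switch all CPUs to 'performance' (does not survive a reboot)
  sudo cpupower frequency-set -g performance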

Creating this ticket to:

  • Identify the root cause of performance differences

Event Timeline

The most notable difference I see is in the per-disk utilization on the cluster overview dashboard. During the backfilling period, all the other codfw hosts reported a max per-disk utilization of around 10%. On wdqs2022, half the disks were at 25% and the other half at 45-50%.

A general review of IO differences in the graphs between wdqs2022 and wdqs2009-2012 (a quick iostat sketch to reproduce these numbers follows the list):

  • Total IOPS and throughput are reduced on wdqs2022, at a rate roughly similar to the difference in ingestion rate. Overall IOPS are in the 1k-2k range, which shouldn't be significant.
  • Total throughput on all instances is not significant (low tens of MB/s); nothing should be throttling there.
  • There is a curious difference in the read-operation graphs between the other hosts and wdqs2022, but the rate is only 10MB/s, which shouldn't be significant.
  • iowait doesn't seem all that high and is comparable between instances, which suggests the application isn't waiting on IO very often.
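
To double-check the dashboard numbers directly on a host, a quick sketch using iostat from the sysstat package (an assumption about local tooling; the dashboards themselves are fed by the Prometheus node exporter):

  # Extended per-device stats every 10 seconds; the r/s, w/s, rMB/s, wMB/s
  # and %util columns correspond to the graphs discussed above
  iostat -x -m 10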

A very quick health check on the instances (the corresponding commands are sketched after this list):

  • all instances appear to be using 6Gb/s SATA connections
  • nothing obviously complaining in dmesg
  • all instances have the same readahead settings
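
For reference, those checks boil down to roughly the following, assuming util-linux and a reasonably recent kernel (device names and log noise will vary per host):

  # Negotiated SATA link speed per port (expect "SATA link up 6.0 Gbps")
  sudo dmesg | grep -i 'SATA link up'

  # Anything the kernel is unhappy about (I/O errors, resets, timeouts)
  sudo dmesg --level=err,warn | tail -n 50

  # Readahead setting (in kB) for every block device
  for dev in /sys/block/sd*; do
      echo "$(basename "$dev"): $(cat "$dev/queue/read_ahead_kb") kB readahead"
  done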

Overall, I don't have a clear sign of what's going on here. The IO demands of the application don't seem to be all that significant; healthy instances see 2k IOPS and 20MB/s per disk, yet the new host shows 50% utilization (vs 10% on healthy hosts) and reduced throughput. While suspicious, it's not clear to me that this is the source of our throughput reduction.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Notes from today's pairing session:

  • New hosts (R450 chassis) have more CPUs/threads (48, as opposed to the prior hosts' 32), but the R450 has a slower clock speed per CPU (2.1 GHz vs 2.5 GHz). This is a good trade-off in normal circumstances, because it allows the host to respond to more queries in parallel.

However, because the triples ingestion process is effectively single-threaded, it seems reasonable that the older chassis outperforms the R450. Regardless, we still want to verify that hardware is the root cause, so we're reimaging wdqs2021 (R450 chassis) back to Buster.
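
A quick way to confirm the core-count/clock trade-off on each chassis, using nothing beyond lscpu and /proc/cpuinfo:

  # Thread count, CPU model and rated max clock for the host being compared
  lscpu | grep -E '^CPU\(s\)|Model name|CPU max MHz'

  # Spot-check what the cores are actually running at right now
  grep 'cpu MHz' /proc/cpuinfo | head -n 4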

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster completed:

  • wdqs2021 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306202219_bking_3030130_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Looking at triples ingestion during a data-transfer, we see the "added" metric is pretty much identical between wdqs2021 (Buster) and wdqs2022 (Bullseye).

Thus, I think we can safely conclude that the performance differences are NOT OS-related. As stated previously, we aren't too concerned about performance differences, so long as they are hardware-related. As such, I believe we can resolve this ticket.
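
For anyone wanting to compare the two hosts side by side outside of Grafana, a query against the Prometheus HTTP API along these lines would do it; note that the metric name and endpoint below are hypothetical placeholders, and the real ones come from the dashboard panel's definition:

  # NOTE: 'wdqs_triples_added_total' and the Prometheus URL are placeholders;
  # substitute whatever metric/endpoint the Grafana panel actually queries.
  curl -sG 'http://prometheus.example.org/api/v1/query' \
      --data-urlencode 'query=rate(wdqs_triples_added_total{instance=~"wdqs202[12].*"}[5m])'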

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Gehel claimed this task.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306292128_bking_1244684_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
bking reopened this task as In Progress. Nov 20 2023, 4:10 PM
bking claimed this task.
bking updated the task description.
bking added subscribers: dcausse, Gehel.

Reopening per today's IRC conversation. We really need this process to be faster, so we'll try enabling the performance governor and see what happens.

Gehel moved this task from Done to Ready for Work on the Data-Platform-SRE board.

The ticket description mentions wdqs2022, but the slowdown was also observed on wdqs1022 and wdqs1023, which both have 2.40 GHz CPUs. The disk utilization saturation that Erik observed in T336443#8845469 was also seen on these hosts during the reload.

Gehel changed the point value for this task from 5 to 8.

Created an Etherpad for brainstorming/test results/etc.

Loading only a few chunks can be done with loadData.sh -s and -e options (start and end).
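
For example, something along these lines; the -n (namespace) and -d (munged data directory) values here are illustrative and depend on the local setup, and only -s/-e come from the comment above:

  # Load only munged chunks 1 through 10 (-n and -d are assumptions about
  # the local namespace and data directory)
  ./loadData.sh -n wdq -d /srv/wdqs/munged -s 1 -e 10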

Gehel triaged this task as Medium priority. Dec 6 2023, 1:14 PM

Change #1020834 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: enable CPU performance governor for w[cd]qs

https://gerrit.wikimedia.org/r/1020834

Change #1020834 merged by Bking:

[operations/puppet@production] query_service: enable CPU performance governor for w[cd]qs

https://gerrit.wikimedia.org/r/1020834

I've made the following changes:

  • wdqs2022: changed the CPU performance governor
  • wdqs2023: set BIOS settings to performance/maximum performance under System Profile Settings; rebooted to apply the settings

Based on the Grafana dashboard, it seems the CPU performance governor change nearly doubled wdqs2022's triples ingestion rate, which helped clear the corrupted backlog as discussed in T362508. More details (and some questionable math) are in this Etherpad.

I'll start rolling out the performance governor changes to the rest of the WDQS hosts. Once that's done, we might want to explore this optimization for other Search Platform-owned hosts. I'll probably open a separate ticket for that.
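
One way to verify the rollout across the fleet afterwards would be a read-only Cumin run from a cluster management host; the host selector below is only an illustration, not the exact alias:

  # Check the active governor on every WDQS host (selector is illustrative)
  sudo cumin 'wdqs*' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'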

I've enabled the performance governor change on all WDQS hosts (although it looks like only the R450s are seeing major benefits). I also rolled back the system profile change on wdqs2023.

As such, I am closing out this ticket.