
provision more machines for eqsin caches
Closed, Resolved (Public)

Description

Unlike most other edge sites, eqsin and ulsfo have 6 machines per cluster, not 8. ulsfo sees much less traffic, so this is fine there. However, eqsin has become a sizable fraction of our overall traffic at its daily peak: approximately as loaded as eqiad's peak, approximately 1/3 of esams' peak, and approximately 1/3 of global rps at eqsin's daily peak.

We should probably increase from 6->8 machines for upload and text in eqsin.
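As a rough sanity check on the 6->8 proposal, the sketch below computes how much the per-host share of a cluster's traffic drops when hosts are added. It assumes load is split evenly across hosts, which is an idealization; the function names are illustrative only.

```python
from fractions import Fraction


def per_host_share(hosts: int) -> Fraction:
    """Fraction of cluster traffic each host carries under an even split."""
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    return Fraction(1, hosts)


def relative_reduction(old_hosts: int, new_hosts: int) -> Fraction:
    """Relative drop in per-host share when growing the cluster."""
    old = per_host_share(old_hosts)
    new = per_host_share(new_hosts)
    return (old - new) / old


if __name__ == "__main__":
    # Going from 6 to 8 hosts: each host goes from 1/6 to 1/8 of the traffic.
    print(f"{float(relative_reduction(6, 8)):.0%}")  # prints 25%
```

Under that even-split assumption, each host would carry a quarter less traffic at peak after the expansion.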

Related Objects

| Status | Subtype | Assigned | Task |
| --- | --- | --- | --- |
| Open | | None | |
| Resolved | | BBlack | |
| Resolved | | RobH | |

Event Timeline

BBlack mentioned this in Unknown Object (Task). Feb 25 2021, 5:52 PM
BBlack added a subtask: Unknown Object (Task).
RobH closed subtask Unknown Object (Task) as Resolved. May 4 2021, 5:58 PM

Change 683026 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Puppetize cp501[3456]

https://gerrit.wikimedia.org/r/683026

Change 683026 merged by BBlack:

[operations/puppet@production] Puppetize cp501[3456]

https://gerrit.wikimedia.org/r/683026

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

['cp5013.eqsin.wmnet', 'cp5014.eqsin.wmnet', 'cp5015.eqsin.wmnet', 'cp5016.eqsin.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105042216_bblack_1599.log.

Completed auto-reimage of hosts:

['cp5015.eqsin.wmnet', 'cp5013.eqsin.wmnet', 'cp5014.eqsin.wmnet', 'cp5016.eqsin.wmnet']

and were ALL successful.
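The completed list is reported in a different order than the launch list, so a quick set comparison is one way to confirm that nothing was dropped. The sketch below expands the `cp501[3456]` shorthand from the patch title and compares it with the completed hosts. `expand_hosts` is a hypothetical helper written for this illustration, not part of wmf-auto-reimage or cumin, and it only handles a single bracket of literal characters (no ranges such as `[3-6]`).

```python
import re


def expand_hosts(pattern: str, domain: str) -> list[str]:
    """Expand one bracketed character set, e.g. 'cp501[3456]', into FQDNs."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket expression: treat the pattern as a single hostname.
        return [f"{pattern}.{domain}"]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{c}{suffix}.{domain}" for c in chars]


launched = expand_hosts("cp501[3456]", "eqsin.wmnet")

# Copied from the completion message above.
completed = [
    "cp5015.eqsin.wmnet",
    "cp5013.eqsin.wmnet",
    "cp5014.eqsin.wmnet",
    "cp5016.eqsin.wmnet",
]

# Hosts that were launched but never reported as completed.
missing = sorted(set(launched) - set(completed))
print(missing)  # prints []
```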

These are just about ready and running correct puppetization, but don't pool these yet. I think they may have some bad BIOS settings or something, at least related to power mgmt. cpufreq keeps attempting to reset the governor on every puppet run. Will check tomorrow.
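One way to see whether the governor is actually sticking between puppet runs is to read it straight from sysfs. The sketch below is a minimal check using the standard Linux cpufreq sysfs layout. The wanted governor (`performance`) is an assumption made for illustration, since the task does not say which governor puppet enforces, and the helper names are hypothetical.

```python
from pathlib import Path


def read_governors(sysfs_root: str = "/sys/devices/system/cpu") -> dict[str, str]:
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu = path.parent.parent.name
        governors[cpu] = path.read_text().strip()
    return governors


def mismatched(governors: dict[str, str], wanted: str) -> list[str]:
    """Return the CPUs whose governor differs from the wanted one."""
    return [cpu for cpu, gov in governors.items() if gov != wanted]


if __name__ == "__main__":
    # "performance" is an assumed target, not taken from the task.
    current = read_governors()
    print(mismatched(current, "performance"))
```

If the list is non-empty again right after a puppet run has set the governor, something below the OS (for example a firmware-controlled power profile) is a plausible suspect.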

I checked the BIOS/iDRAC settings on cp5013 against https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup (+ the one custom setting we use on these modern cps, which is to disable the unused onboard NICs), and ended up making these 3 changes to bring it into conformance:

| BIOS Section | Setting | Old Value | New Value |
| --- | --- | --- | --- |
| Integrated Devices | Embedded NIC1 and NIC2 | Enabled | Disabled (OS) |
| System Profile Settings | System Profile | Performance Per Watt (DAPC) | Performance Per Watt (OS) |
| Miscellaneous | Asset Tag | (Blank) | (Value from netbox) |

The cpufreq issue is gone and the extra eno[12] interfaces vanished as well, so I think we're good. Will check the other three shortly, assuming they'll all need the same changes.
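Since the other three hosts are expected to need the same changes, the comparison can be expressed as a small diff of actual versus expected settings. The sketch below is illustrative only: the dictionary keys reuse the setting labels from the table above rather than any real Dell or racadm attribute names, the asset tag is left out because its value comes from netbox, and how the actual values are collected from a host is not shown.

```python
# Expected values, taken from the "New Value" column of the table above.
EXPECTED = {
    "Embedded NIC1 and NIC2": "Disabled (OS)",
    "System Profile": "Performance Per Watt (OS)",
}


def settings_diff(actual: dict[str, str], expected: dict[str, str]) -> dict[str, tuple]:
    """Return {setting: (actual_value, expected_value)} for every mismatch."""
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


# The state cp5013 was found in, per the "Old Value" column.
before = {
    "Embedded NIC1 and NIC2": "Enabled",
    "System Profile": "Performance Per Watt (DAPC)",
}

for name, (have, want) in settings_diff(before, EXPECTED).items():
    print(f"{name}: {have} -> {want}")
```

An empty diff for a host means it already conforms for these two settings.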

The others were in the same state. All are fixed and rebooted now, icinga downtimes are removed, netbox status is set to Active, and confctl weights are set correctly, but the pooled attribute is still set to inactive.

Will begin pooling these into service today while eqsin is in its daily load valley.

BBlack claimed this task.

These are all pooled now and slowly filling their caches. Optimistically closing this task for now!

Change 691170 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Add missing cache::nodes for cp501[3456]

https://gerrit.wikimedia.org/r/691170

Change 691170 merged by BBlack:

[operations/puppet@production] Add missing cache::nodes for cp501[3456]

https://gerrit.wikimedia.org/r/691170