
Current codfw caches have wrong NVME format
Closed, Resolved · Public

Description

The late_command support for switching our NVME drives to 4K block size was only enabled for eqiad and esams, so the latest codfw caches didn't get it at install time. I will upload a patch to correct the regex for future reimages, but we also need to correct the existing nodes. Without the fix everything "works", but the drives use the default 512-byte format, which isn't as efficient/performant as native 4K (the same as the memory page size).
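
A quick way to tell which format a node currently has is to read the logical block size the kernel reports in sysfs. The following is only a minimal sketch, not part of the original fix: the function name `nvme_block_size` and its optional sysfs-root argument are hypothetical, and it assumes the namespace shows up as nvme0n1.

```shell
#!/bin/sh
# Report the logical block size the kernel sees for a block device.
# Usage: nvme_block_size [device-name] [sysfs-root]
# Both defaults are assumptions for illustration; the second argument
# exists only so the function can be pointed at a fake tree for testing.
nvme_block_size() {
    dev="${1:-nvme0n1}"
    sysfs="${2:-/sys}"
    f="$sysfs/block/$dev/queue/logical_block_size"
    if [ ! -r "$f" ]; then
        echo "cannot read $f" >&2
        return 1
    fi
    size=$(cat "$f")
    case "$size" in
        4096) echo "$dev: 4096 (4K format)" ;;
        512)  echo "$dev: 512 (default format, needs reformat)" ;;
        *)    echo "$dev: $size (unexpected)" ;;
    esac
}
```

Calling `nvme_block_size` with no arguments on a cache host would report on nvme0n1 under /sys.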

The conceptually-simplest fix is just to reimage them all, but it might be easier and faster to just take a downtime of ats-be on each of them and reformat the storage before restarting, using the same commands as the late_command stuff:

/usr/sbin/nvme format /dev/nvme0n1 -l 2
echo ';' | /usr/sbin/sfdisk /dev/nvme0n1
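
The per-host sequence described above could be sketched roughly as follows. This is an illustration under assumptions, not the procedure that was actually run: the function name `nvme_reformat_plan` is hypothetical, the task does not name the systemd unit behind ats-be (so it is a required argument), and stopping and starting via `systemctl` is assumed. The function only prints the commands for review; it does not execute them.

```shell
#!/bin/sh
# Print (not run) the per-host reformat steps so they can be reviewed first.
# Usage: nvme_reformat_plan <service> [device]
# <service> is whatever unit runs ats-be on the host; the task does not
# name it, so it is a required argument rather than a guessed default.
# Note: "nvme format" destroys all data on the namespace.
nvme_reformat_plan() {
    svc="${1:-}"
    dev="${2:-/dev/nvme0n1}"
    if [ -z "$svc" ]; then
        echo "usage: nvme_reformat_plan <service> [device]" >&2
        return 2
    fi
    # The two middle lines are the commands quoted in this task.
    cat <<EOF
systemctl stop $svc
/usr/sbin/nvme format $dev -l 2
echo ';' | /usr/sbin/sfdisk $dev
systemctl start $svc
EOF
}
```

The `-l 2` argument selects LBA format index 2. Which index corresponds to 4K depends on the drive model, so it is worth confirming with `nvme id-ns -H /dev/nvme0n1` before formatting a different model.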

Event Timeline

BBlack triaged this task as Low priority. · Mon, Jun 29, 3:55 PM
BBlack created this task.
Restricted Application added a project: Operations. · Mon, Jun 29, 3:55 PM
Restricted Application added a subscriber: Aklapper.

Change 608425 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] nvme formatting was missing for new codfw caches

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608425

BBlack updated the task description. · Mon, Jun 29, 3:57 PM
BBlack updated the task description.

We have scheduled a system reboot of these boxes. I'll sync that with the "re-format" of the NVMe devices.

BBlack updated the task description. · Mon, Jun 29, 4:46 PM

Mentioned in SAL (#wikimedia-operations) [2020-06-30T08:51:28Z] <vgutierrez> rolling restart of codfw cp nodes after "re-formatting" nvme devices - T256655

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2027-2028].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2029-2030].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2031-2032].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2033-2034].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2035-2036].codfw.wmnet
BBlack moved this task from Triage to Caching on the Traffic board. · Tue, Jun 30, 4:38 PM

Change 608425 merged by BBlack:
[operations/puppet@production] nvme formatting was missing for new codfw caches

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608425

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2037-2038].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2039-2040].codfw.wmnet

Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade

cp[2041-2042].codfw.wmnet
Vgutierrez closed this task as Resolved. · Wed, Jul 1, 9:44 AM
Vgutierrez claimed this task.