WDQS hosts low on /srv disk space
Open, High | Public | 8 Estimated Story Points

Description

Context

Following the data-transfer of the most recent wikidata.jnl, we've hit low enough disk space to trigger the warning threshold.

DISK WARNING - free space: /srv 45621 MB (4% inode=99%)
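
For anyone tallying these alerts across hosts, the check output above can be split into fields. This is only a sketch: the regular expression is written against the single line quoted here, and the field names are my own labels rather than anything defined by the monitoring check.

```python
import re

# Pattern written against the one alert line quoted in this task.
# The group names are illustrative labels, not official field names.
DISK_CHECK = re.compile(
    r"DISK (?P<state>\w+) - free space: (?P<mount>\S+) (?P<free_mb>\d+) MB "
    r"\((?P<free_pct>\d+)% inode=(?P<inode_pct>\d+)%\)"
)


def parse_disk_check(line):
    """Return the fields of a disk check line as a dict, or raise ValueError."""
    match = DISK_CHECK.search(line)
    if match is None:
        raise ValueError(f"unrecognised check output: {line!r}")
    fields = match.groupdict()
    return {
        "state": fields["state"],
        "mount": fields["mount"],
        "free_mb": int(fields["free_mb"]),
        "free_pct": int(fields["free_pct"]),
        "inode_pct": int(fields["inode_pct"]),  # inode figure as reported by the check
    }


alert = parse_disk_check("DISK WARNING - free space: /srv 45621 MB (4% inode=99%)")
print(f"{alert['mount']}: {alert['free_mb']} MB free ({alert['free_pct']}%)")
```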

Although Blazegraph needs relatively little free space for compaction compared to other datastores, the raw amount of space left gives us unacceptably little headroom for our journal file(s) to keep expanding.

We should take short-term action to address the lack of available disk space. We can double our existing space by migrating from raid10 to raid0. This will cost us redundancy, but it's an acceptable tradeoff in the short term. Medium-term, our newer instances will have more storage and in particular will have at least 4 expansion slots free each if we use the same spec we used for WCQS.
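
The "double our existing space" claim is simple arithmetic. A rough sketch, assuming four equally sized 800 GB drives (an illustrative layout) and ignoring partitioning and filesystem overhead:

```python
def raid10_capacity_gb(disk_sizes_gb):
    """Usable capacity of RAID10 keeping two copies of each block: half the raw space.

    Assumes equally sized disks; with unequal disks the real figure
    depends on how the array is laid out.
    """
    return sum(disk_sizes_gb) / 2


def raid0_capacity_gb(disk_sizes_gb):
    """Usable capacity of RAID0: all of the raw space, with no redundancy."""
    return sum(disk_sizes_gb)


disks = [800, 800, 800, 800]  # hypothetical 4 x 800 GB layout
print(raid10_capacity_gb(disks))  # 1600.0
print(raid0_capacity_gb(disks))   # 3200
```

The same doubling is what removes the redundancy: in RAID0 the loss of any one disk loses the whole array.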

Acceptance criteria

  • Migrated to raid0
    • Switch partman recipe to raid0
    • Re-image each server
  • Do a once-over of all the current servers, verifying which hosts this issue applies to (currently it looks like every server except possibly wdqs101[1-3])

Current Status

[EQIAD PUBLIC]
wdqs1004.eqiad.wmnet => (FAILED [x3], N/A)
wdqs1005.eqiad.wmnet => (DON'T_REIMAGE_TILL_LATER, NEW_JOURNAL)
wdqs1006.eqiad.wmnet => (SUCCESS, NEW_JOURNAL)
wdqs1007.eqiad.wmnet => (REIMAGING, NEW_JOURNAL)
wdqs1012.eqiad.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs1013.eqiad.wmnet => (SUCCESS, NEW_JOURNAL)

[EQIAD INTERNAL]
wdqs1003.eqiad.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs1008.eqiad.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs1011.eqiad.wmnet => (SUCCESS, NEW_JOURNAL)


[CODFW PUBLIC]
wdqs2001.codfw.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs2002.codfw.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs2003.codfw.wmnet => (NOT_REIMAGED, NEW_JOURNAL)
wdqs2004.codfw.wmnet => (REIMAGING, N/A)
wdqs2007.codfw.wmnet => (REIMAGED [but HW FAILURE], NEW_JOURNAL)

[CODFW INTERNAL]
wdqs2005.codfw.wmnet => (NOT_REIMAGED, OLD_JOURNAL)
wdqs2006.codfw.wmnet => (NOT_REIMAGED, OLD_JOURNAL)
wdqs2008.codfw.wmnet => (NOT_REIMAGED, NEW_JOURNAL)

[TEST]
wdqs1009.eqiad.wmnet => (DON'T_REIMAGE_TILL_LATER)
wdqs1010.eqiad.wmnet => (SUCCESS, NEW_JOURNAL)

Event Timeline

Change 682735 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add missing raid0 dependency

https://gerrit.wikimedia.org/r/682735

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104262209_ryankemper_19056_wdqs1006_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-26T22:11:29Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1006.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1006.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-04-27T01:21:49Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-27T01:29:36Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1006.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph --task-id T280382 on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-27T03:17:15Z] <ryankemper> T280382 wdqs1006 has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/md2 2.6T 998G 1.5T 40% /srv

RKemper triaged this task as High priority. Wed, Apr 28, 3:18 AM
RKemper updated the task description.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1013.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104280330_ryankemper_22524_wdqs1013_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104280330_ryankemper_22554_wdqs2007_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-28T03:32:06Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1013.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-28T03:32:17Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs2007.codfw.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs2007.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-04-28T04:08:21Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-28T04:14:39Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1013.eqiad.wmnet']

Of which those FAILED:

['wdqs1013.eqiad.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1013.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104282122_ryankemper_24267_wdqs1013_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1013.eqiad.wmnet']

Of which those FAILED:

['wdqs1013.eqiad.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1013.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104282122_ryankemper_24311_wdqs1013_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-28T21:24:13Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage (previous reimage timed out, instance appears to have rebooted)

Mentioned in SAL (#wikimedia-operations) [2021-04-28T21:32:16Z] <ryankemper> T280382 [WDQS] wdqs2007 ssh is unreachable; power cycling via racadm>>racadm serveraction powercycle

Mentioned in SAL (#wikimedia-operations) [2021-04-28T21:37:31Z] <ryankemper> T280382 wdqs2007 is reachable again; glancing at /srv/wdqs its wikidata.jnl is 839G when it should be 975G so I'll re-do the wikidata journal transfer

Mentioned in SAL (#wikimedia-operations) [2021-04-28T21:38:55Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1013.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-04-28T22:18:29Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-28T22:26:49Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-04-29T00:06:17Z] <ryankemper> T280382 wdqs1013.eqiad.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104290010_ryankemper_29015_wdqs1004_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-29T00:11:13Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1004.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1004.eqiad.wmnet']

Of which those FAILED:

['wdqs1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104290122_ryankemper_7968_wdqs1004_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-29T01:23:06Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1004.eqiad.wmnet']

Of which those FAILED:

['wdqs1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104291541_ryankemper_10990_wdqs1004_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-29T15:44:04Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage (trying reimaging this host one final time, if this fails again will need to do a deeper investigation into what's going wrong here)

Completed auto-reimage of hosts:

['wdqs1004.eqiad.wmnet']

Of which those FAILED:

['wdqs1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202104300350_ryankemper_4599_wdqs1010_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-04-30T03:50:57Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1010.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Looking into why wdqs1004 failed. I did a sudo install_console wdqs1004.eqiad.wmnet and saw the following in /var/log/syslog:

Apr 29 15:45:30 partman-auto-raid: mdadm: RUN_ARRAY failed: Unknown error 524
Apr 29 15:45:31 kernel: [   88.633426] md/raid0:md2: cannot assemble multi-zone RAID0 with default_layout setting
Apr 29 15:45:31 kernel: [   88.633428] md/raid0: please set raid0.default_layout to 1 or 2
Apr 29 15:45:31 kernel: [   88.633429] md: pers->run() failed ...
Apr 29 15:45:31 kernel: [   88.633463] md: md2 stopped.
Apr 29 15:45:31 partman-auto-raid: Error creating array /dev/md2
Apr 29 15:45:31 debconf: --> SET partman-auto-raid/error false
Apr 29 15:45:31 debconf: <-- 0 value set
Apr 29 15:45:31 debconf: --> INPUT critical partman-auto-raid/error
Apr 29 15:45:31 debconf: <-- 0 question will be asked
Apr 29 15:45:31 debconf: --> GO

This is the relevant partman line:

wdqs*) echo partman/standard.cfg partman/raid0.cfg partman/raid0-4dev.cfg ;; \
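
If more hosts fail to reimage, scanning the installer syslog for this specific kernel message is a quick way to tell the RAID0 layout problem apart from other failures. A minimal sketch, assuming the log lines have already been read into a list; the function name is hypothetical and the message text is copied from the syslog excerpt above:

```python
import re

# Kernel message seen when md refuses to assemble a multi-zone RAID0 array
# without an explicit layout (wording taken from the syslog excerpt).
LAYOUT_ERROR = re.compile(
    r"md/raid0:(?P<array>md\d+): cannot assemble multi-zone RAID0 "
    r"with default_layout setting"
)


def find_raid0_layout_errors(lines):
    """Return the names of md arrays that hit the default_layout error."""
    arrays = []
    for line in lines:
        match = LAYOUT_ERROR.search(line)
        if match:
            arrays.append(match.group("array"))
    return arrays


sample = [
    "Apr 29 15:45:30 partman-auto-raid: mdadm: RUN_ARRAY failed: Unknown error 524",
    "Apr 29 15:45:31 kernel: [   88.633426] md/raid0:md2: cannot assemble "
    "multi-zone RAID0 with default_layout setting",
    "Apr 29 15:45:31 kernel: [   88.633463] md: md2 stopped.",
]
print(find_raid0_layout_errors(sample))  # ['md2']
```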

Completed auto-reimage of hosts:

['wdqs1010.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:39:58Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105031918_ryankemper_19232_wdqs2007_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-03T19:21:30Z] <ryankemper> T280382 [WDQS] sudo confctl select 'name=wdqs1004.eqiad.wmnet' set/pooled=no (wdqs1004 failed re-image [not sure why yet] and won't let me ssh in to depool so using conftool instead)

Mentioned in SAL (#wikimedia-operations) [2021-05-03T19:24:57Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --without-lvs --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs2007.codfw.wmnet']

and were ALL successful.

Circling back on why wdqs1004 is failing to re-image:


wdqs1004 and wdqs1005 were procured in the same ticket

Haven't tried reimaging wdqs1005 yet but wdqs1004 has failed repeatedly

See https://netbox.wikimedia.org/dcim/devices/983/ <-> phabricator.wikimedia.org/T166780


Now onto the important part:

Per https://phabricator.wikimedia.org/T280382#7048070 the error message seems to indicate that disks are different sizes: https://blog.icod.de/2019/10/10/caution-kernel-5-3-4-and-raid0-default_layout/

Now comparing the disks of wdqs1005 (I can't ssh into wdqs1004 otherwise I'd check that host) against wdqs1006 (arbitrarily chosen), the difference is pretty clear:

wdqs1005 (and presumably wdqs1004) has 2 x 800GB and 2 x 960GB, whereas wdqs1006 has 4 x 800GB

So perhaps the raid0-4dev recipe implicitly assumes that all 4 disks are of equal size?

ryankemper@wdqs1005:~$ sudo lshw -class disk
  *-disk:0
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: DL42
       serial: PHDV723101N9800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=bf791b93-efaf-420b-a401-f423ed656ebf logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 1
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: DL42
       serial: PHDV722601JF800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=9cddcb56-7aaa-479b-acc0-5c1250bf6a49 logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: SSDSC2KB960G7R
       physical id: 2
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: DL58
       serial: PHYS742105KP960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=a5bac7c4-806b-4230-bb26-48ea508acff9 logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: SSDSC2KB960G7R
       physical id: 3
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: DL58
       serial: PHYS7411001G960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=38ce1fee-5b5d-4a13-af9a-6726f0d05811 logicalsectorsize=512 sectorsize=4096

versus

ryankemper@wdqs1006:~$ sudo lshw -class disk
PCI (sysfs)
  *-disk:0
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: DL43
       serial: BTDV735109UA800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=efa08189-6f41-4b39-8b54-e3228d5b596e logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 1
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: DL43
       serial: BTDV7351067U800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=c8113fd0-4657-494c-b79b-182200e523e6 logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 2
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: DL43
       serial: BTDV7351004L800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=52fde268-4c9c-478f-8a1b-40bb1c4dd24d logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: SSDSC2BB800G7R
       physical id: 3
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: DL43
       serial: BTDV735106CM800CGN
       size: 745GiB (800GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=5e0a118c-ff72-4926-9ec7-2577f197e98a logicalsectorsize=512 sectorsize=4096
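
Why the mixed disk sizes would produce a "multi-zone" array can be sketched as follows. md RAID0 stripes across all members until the smallest one is full, then keeps striping across the remaining larger members; each such region is a zone. The helper below is a simplified model that treats each whole disk as an array member (the real members are partitions, whose sizes are not shown in this task), so the numbers are illustrative only.

```python
def raid0_zones(member_sizes_gb):
    """Split RAID0 members into zones as (number_of_members, zone_capacity_gb).

    Zone 0 stripes across every member up to the smallest member's size;
    each later zone stripes across the members that still have space left.
    """
    zones = []
    consumed = 0  # space already used on every remaining member
    remaining = sorted(member_sizes_gb)
    while remaining:
        smallest = remaining[0]
        depth = smallest - consumed
        if depth > 0:
            zones.append((len(remaining), depth * len(remaining)))
        consumed = smallest
        # Drop members that are now full.
        remaining = [size for size in remaining if size > consumed]
    return zones


print(raid0_zones([800, 800, 800, 800]))  # [(4, 3200)]
print(raid0_zones([800, 800, 960, 960]))  # [(4, 3200), (2, 320)]
```

Under this model the wdqs1006-style layout yields a single zone, while the wdqs1005-style layout yields two, which matches the kernel refusing to assemble the array unless raid0.default_layout is set.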

Mentioned in SAL (#wikimedia-operations) [2021-05-03T20:37:30Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-03T20:56:49Z] <ryankemper> T280382 [WDQS] ryankemper@wdqs2001:~$ sudo run-puppet-agent --force

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:02:45Z] <ryankemper> T280382 wdqs1010.eqiad.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/md2 2.6T 975G 1.5T 39% /srv

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:06:07Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1011.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105032109_ryankemper_8659_wdqs1011_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:09:40Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1011.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-03T21:20:27Z] <ryankemper> T280382 [WDQS] ryankemper@puppetmaster1001:~$ sudo confctl select 'name=wdqs1011.eqiad.wmnet' set/pooled=no

Completed auto-reimage of hosts:

['wdqs1011.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:39:29Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:43:56Z] <ryankemper> T280382 [WDQS] racadm>>racadm serveraction powercycle on wdqs2007

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:45:23Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1011.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-05T01:49:56Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage (will likely fail due to underlying hw but we'll see)

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:50:55Z] <ryankemper> T280382 wdqs2007.codfw.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:51:39Z] <ryankemper> T280382 [WDQS] ryankemper@wdqs2007:~$ sudo depool (need to monitor host to see if it becomes ssh unreachable again or if it was a one-off; also high update lag)

Mentioned in SAL (#wikimedia-operations) [2021-05-05T03:54:11Z] <ryankemper> T280382 wdqs1011.eqiad.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105060338_ryankemper_28228_wdqs2004_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1007.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105060337_ryankemper_28220_wdqs1007_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-06T03:38:45Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs2004.codfw.wmnet on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T03:38:54Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1007.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1007.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2004.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-05-06T05:37:43Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T05:37:59Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T06:00:12Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T06:00:29Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2008.codfw.wmnet --dest wdqs2004.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T06:01:01Z] <ryankemper> T280382 sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1007.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:26:04Z] <ryankemper> T280382 wdqs1007.eqiad.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/md2 2.6T 998G 1.5T 40% /srv

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:26:13Z] <ryankemper> T280382 wdqs2004.codfw.wmnet has been re-imaged and had the appropriate wikidata/categories journal files transferred. df -h shows disk space is no longer an issue following the switch to raid0: /dev/md2 2.6T 998G 1.5T 40% /srv

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:26:21Z] <ryankemper> T280382 [WDQS] Pooled wdqs1007 and wdqs2004

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105061531_ryankemper_23130_wdqs2003_codfw_wmnet.log.

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs1012.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105061531_ryankemper_23304_wdqs1012_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:32:03Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs2003.codfw.wmnet on ryankemper@cumin1001 tmux session reimage

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:32:12Z] <ryankemper> T280382 sudo -i wmf-auto-reimage-host -p T280382 wdqs1012.eqiad.wmnet on ryankemper@cumin1001 tmux session reimage

Completed auto-reimage of hosts:

['wdqs1012.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wdqs2003.codfw.wmnet']

Of which those FAILED:

['wdqs2003.codfw.wmnet']

Script wmf-auto-reimage was launched by ryankemper on cumin1001.eqiad.wmnet for hosts:

wdqs2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105061911_ryankemper_18235_wdqs2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wdqs2003.codfw.wmnet']

Of which those FAILED:

['wdqs2003.codfw.wmnet']