
WDQS diskspace is low
Closed, ResolvedPublic

Description

On the WDQS cluster, we are at around 80% disk usage and need to expand the disk space fairly soon. The WDQS internal cluster is at about 40%, so it is fine for now.

/dev/mapper/wdqs1003--vg-data  686G  521G  131G  80% /srv
/dev/mapper/wdqs1004--vg-data  686G  519G  133G  80% /srv
/dev/mapper/wdqs1005--vg-data  686G  528G  124G  82% /srv
/dev/mapper/wdqs2001--vg-data  686G  506G  146G  78% /srv
/dev/mapper/wdqs2002--vg-data  686G  510G  141G  79% /srv
/dev/mapper/wdqs2003--vg-data  686G  508G  144G  79% /srv
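
For illustration, here is a minimal sketch of the kind of check that flags this condition. The 80% threshold and the `df` parsing are assumptions for the example, not the actual production alerting:

```python
import subprocess

THRESHOLD = 80  # percent; illustrative, not the real alerting threshold

def overfull_filesystems(threshold=THRESHOLD):
    """Return (device, use%, mountpoint) for filesystems at or above the threshold."""
    out = subprocess.run(
        ["df", "--output=source,pcent,target"],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = []
    for line in out.splitlines()[1:]:  # skip the header line
        device, pcent, target = line.split()
        use = int(pcent.rstrip("%"))
        if use >= threshold:
            flagged.append((device, use, target))
    return flagged

if __name__ == "__main__":
    for device, use, target in overfull_filesystems():
        print(f"WARNING: {device} mounted at {target} is {use}% full")
```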

Related Objects

Status     Assigned
Resolved   Gehel
Resolved   Gehel
Resolved   Cmjohnson
Resolved   Smalyshev
Resolved   Cmjohnson

Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev added a subscriber: Gehel.

We have a "sleeping" task to order new disks: T186526

Smalyshev added a subtask: Unknown Object (Task). · Jun 5 2018, 5:34 PM
RobH closed subtask Unknown Object (Task) as Resolved (edited). · Jul 2 2018, 9:13 PM
RobH added a subscriber: RobH.

> We have a "sleeping" task to order new disks: T186526

That task has been resolved; there are now 4 sub-tasks off it for ordering the SSDs for each of the hosts listed in this task. As they are procurement tasks, they are non-public (hence this update here).

faidon reopened subtask Unknown Object (Task) as Open. · Jul 11 2018, 1:24 PM
RobH closed subtask Unknown Object (Task) as Resolved. · Aug 6 2018, 6:06 PM

Please note all sub-tasks for additions to the wdqs cluster have been created and are linked off this task.

Next steps are coordination between @Gehel and the on-site engineers to add the disks. As the disks are hot-swap additions (not replacements), this shouldn't result in any downtime. However, it's always best to be logged in and watching the host when swapping hardware.

To avoid duplicating information on each of the child tasks, I'll add anything that is common to all of them here.

We'll take this occasion to reimage the systems, so that we can also validate that we have a working partman configuration with the new disks. Newer wdqs servers use the raid10-gpt-srv-lvm-ext4 recipe, and we should use the same. We will lose a bit of disk space, since the new disks are slightly larger than the old ones (960 GB vs 800 GB), but we are unlikely to need those extra gigabytes for a few years, and by then these systems will be out of warranty. So let's choose simplicity and coherence over maximizing disk space we're probably never going to need.
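
As a back-of-the-envelope check of that trade-off (the four-SSD mixed layout and the default near-2 mdadm arrangement are assumptions; the per-host drive count isn't stated in this task), RAID10 caps every member at the size of the smallest device, so the extra 160 GB on each larger disk simply goes unused:

```python
def raid10_usable_gb(disk_sizes_gb):
    """Usable capacity of an mdadm RAID10 (near-2) array: every member is
    capped at the smallest device, and half the members hold mirror copies."""
    if len(disk_sizes_gb) % 2:
        raise ValueError("RAID10 needs an even number of members")
    return min(disk_sizes_gb) * len(disk_sizes_gb) // 2

# Hypothetical mixed layout: two original 800 GB SSDs plus two new 960 GB SSDs.
print(raid10_usable_gb([800, 800, 960, 960]))  # 1600 GB; 2 x 160 GB is left unused
```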

Note that the data import after reimaging can be done by copying the data over from wdqs1010, which was reimported recently. The procedure is documented at https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_transfer_procedure.
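
For a rough idea of the shape of that transfer, here is a hedged sketch of the copy step. The journal path, the service unit name, and the use of rsync are assumptions for the example; the wikitech page above is the authoritative procedure:

```python
import subprocess

SOURCE_HOST = "wdqs1010.eqiad.wmnet"  # recently reimported host, per the comment above
JOURNAL = "/srv/wdqs/wikidata.jnl"    # assumed Blazegraph journal path; check the wikitech page

def copy_journal_from_source():
    """Illustrative only: run on the freshly reimaged destination host.
    Blazegraph must not be writing to the journal on either side while copying."""
    subprocess.run(["systemctl", "stop", "wdqs-blazegraph"], check=True)  # assumed unit name
    subprocess.run(
        ["rsync", "-a", "--progress", f"{SOURCE_HOST}:{JOURNAL}", JOURNAL],
        check=True,
    )
    subprocess.run(["systemctl", "start", "wdqs-blazegraph"], check=True)
```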

Change 455563 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] configured wdqs to use RAID10

https://gerrit.wikimedia.org/r/455563

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808271540_gehel_3833.log.

Change 455563 merged by Gehel:
[operations/puppet@production] configured wdqs to use RAID10

https://gerrit.wikimedia.org/r/455563

Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['wdqs2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201808271550_gehel_5518.log.

Completed auto-reimage of hosts:

['wdqs2003.codfw.wmnet']

and were ALL successful.

Gehel claimed this task.

New SSD in place, server reimaged and data reimported. We're all good!