
Maps[12]004 /srv disk space is critical
Closed, Resolved · Public

Description

Disk space usage on the maps masters has reached a critical level and something needs to be done about it quickly.

onimisionipe@maps2004:/srv$ df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            63G     0   63G   0% /dev
tmpfs                           13G  234M   13G   2% /run
/dev/md0                        46G  6.2G   38G  15% /
tmpfs                           63G  4.0K   63G   1% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           63G     0   63G   0% /sys/fs/cgroup
tmpfs                          1.0G     0  1.0G   0% /var/lib/nginx
/dev/mapper/maps2004--vg-data  1.4T  1.3T   39G  98% /srv
onimisionipe@maps1004:~$ df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            63G     0   63G   0% /dev
tmpfs                           13G  250M   13G   2% /run
/dev/md0                        92G   48G   39G  55% /
tmpfs                           63G  4.0K   63G   1% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           63G     0   63G   0% /sys/fs/cgroup
tmpfs                          1.0G     0  1.0G   0% /var/lib/nginx
/dev/mapper/maps1004--vg-data  1.7T  1.3T  307G  81% /srv
tmpfs                           13G     0   13G   0% /run/user/19981

This could be related to large OSM updates.
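
A quick way to narrow down what is actually filling /srv is to break the usage down by directory. This is only a sketch: the paths below (PostgreSQL and Cassandra data under /srv) are assumptions based on the usual maps host layout and may need adjusting.

# Breakdown of /srv, largest entries first.
sudo du -h --max-depth=2 /srv 2>/dev/null | sort -hr | head -20

# Size of the PostgreSQL data directory, if it lives here
# (large OSM updates can inflate both the tables and the WAL).
sudo du -sh /srv/postgresql* 2>/dev/null

# Size of the Cassandra data directory, if it lives here.
sudo du -sh /srv/cassandra* 2>/dev/null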

Event Timeline

Mathew.onipe triaged this task as Unbreak Now! priority. May 27 2019, 6:58 AM
Mathew.onipe added a project: SRE.

Previous instance of a similar problem: T194966

Note that we've reimaged the servers since then, and we might have lost some configuration in the process.

Mentioned in SAL (#wikimedia-operations) [2019-05-27T07:05:31Z] <gehel> running nodetool repair on maps2004 -T224395

Change 512639 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: all maps servers use RAID10

https://gerrit.wikimedia.org/r/512639

For whatever reason, only maps1004 was reimaged to RAID10 (instead of RAID1) when the new disks were added, so we have 2 unused disks in each of the other servers. Note that since we have disks of different sizes, RAID10 still wastes quite a bit of space; we should probably build a RAID1 array over each pair of physical disks and use LVM to spread the partition across those 2 RAID1 arrays.
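
The actual change is a puppet/partman recipe (512639 below), but for illustration, the layout suggested above (one RAID1 per pair of same-sized disks, with LVM spanning both) would look roughly like this when done by hand. Device names (/dev/sd[a-d]1) and the volume group name are assumptions, not the real hardware or naming.

# One RAID1 array per pair of same-sized disks.
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1

# A single volume group spanning both arrays, so the data LV can use
# the full capacity even though the two disk pairs differ in size.
pvcreate /dev/md1 /dev/md2
vgcreate maps-vg /dev/md1 /dev/md2
lvcreate -l 100%FREE -n data maps-vg

# Filesystem and mount for /srv.
mkfs.ext4 /dev/maps-vg/data
mount /dev/maps-vg/data /srv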

Change 512639 merged by Gehel:
[operations/puppet@production] maps: all maps servers use RAID10

https://gerrit.wikimedia.org/r/512639

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201905270918_gehel_16550.log.

Completed auto-reimage of hosts:

['maps2004.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-05-27T11:33:33Z] <onimisionipe> starting osm initial import on maps2004 - T224395

Mathew.onipe lowered the priority of this task from Unbreak Now! to High. May 27 2019, 5:10 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-30T08:50:12Z] <onimisionipe> maps2001 postgres initialization - T224395

Mentioned in SAL (#wikimedia-operations) [2019-05-30T15:06:16Z] <onimisionipe> pooled maps2004 - osm import is complete - T224395

While trying to initialize maps2001, it ran into the same disk space issues as maps2004. I will hold off on the others for now; we will have to reimage the remaining hosts (maps200[1-3]) before we can proceed.
So maps2001 is out of the codfw cluster, but we still have maps2004 and maps200[23].

Mentioned in SAL (#wikimedia-operations) [2019-05-31T14:32:13Z] <onimisionipe> depool maps2004 (again) - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-03T09:48:52Z] <onimisionipe> depooled maps codfw due to lag and disk issues - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-03T16:40:57Z] <onimisionipe> started osm-import on maps2004 - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-05T04:09:43Z] <onimisionipe> starting postgres slave init on maps2001 - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-05T08:42:06Z] <onimisionipe> removing maps2001 from cassandra cluster. It is going to be reimaged - T224395

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906050845_gehel_31234.log.

Completed auto-reimage of hosts:

['maps2001.codfw.wmnet']

and were ALL successful.

Change 514479 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] maps: enable osm replication cron

https://gerrit.wikimedia.org/r/514479

Change 514479 merged by Gehel:
[operations/puppet@production] maps: enable osm replication cron

https://gerrit.wikimedia.org/r/514479

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906061541_gehel_14064.log.

Completed auto-reimage of hosts:

['maps2002.codfw.wmnet']

Of which those FAILED:

['maps2002.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906061709_gehel_31620.log.

Completed auto-reimage of hosts:

['maps2002.codfw.wmnet']

Of which those FAILED:

['maps2002.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906061711_gehel_32108.log.

Completed auto-reimage of hosts:

['maps2002.codfw.wmnet']

Of which those FAILED:

['maps2002.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906070957_gehel_3439.log.

Completed auto-reimage of hosts:

['maps2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['maps2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906110818_gehel_7880.log.

Completed auto-reimage of hosts:

['maps2003.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-12T08:41:40Z] <onimisionipe> pool map2003. reimage and setup is complete - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-17T08:52:55Z] <onimisionipe> remove maps1001 from cassandra cluster - T224395

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['maps1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906171404_gehel_104965.log.

Completed auto-reimage of hosts:

['maps1001.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T06:56:49Z] <onimisionipe> pooling maps1001 - reimage is complete - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-19T04:21:34Z] <onimisionipe> depooling maps1002 for reimaging into new partition scheme - T224395

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['maps1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906190828_gehel_259837.log.

Completed auto-reimage of hosts:

['maps1002.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T15:37:35Z] <onimisionipe> pooled maps1002 - postgres init is complete and successfully joined to its cluster - T224395

Mentioned in SAL (#wikimedia-operations) [2019-06-20T04:00:32Z] <onimisionipe> depooling maps1003 for reimage into new partition scheme - T224395

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['maps1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906200757_gehel_18929.log.

Completed auto-reimage of hosts:

['maps1003.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-06-20T23:01:53Z] <onimisionipe> pool maps1003 - node is ready to receive requests - T224395

All nodes have been reimaged; we're good for the moment.