reimage maps-test* servers
Closed, Resolved · Public

Description

Disks have filled up on the maps-test* servers (see T146848). This has most probably resulted in data corruption for Cassandra and/or PostgreSQL. These servers need to be reimaged anyway to align them with the other maps servers and the improved automation that has been put in place on those.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2016-10-03T13:09:17Z] <gehel> shutting down services on maps-test* servers prior to reimage -T147194

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610031410_gehel_29303.log.

Completed auto-reimage of hosts:

['maps-test2001.codfw.wmnet']

Those hosts were successful:

['maps-test2001.codfw.wmnet']

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610051449_gehel_18275.log.

Completed auto-reimage of hosts:

['maps-test2002.codfw.wmnet']

Those hosts were successful:

['maps-test2002.codfw.wmnet']
Gehel triaged this task as High priority. Oct 5 2016, 8:43 PM

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610070855_gehel_10452.log.

Completed auto-reimage of hosts:

['maps-test2003.codfw.wmnet']

Those hosts were successful:

[]

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610070920_gehel_14871.log.

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

Those hosts were successful:

[]

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610071625_gehel_13919.log.

Completed auto-reimage of hosts:

['maps-test2001.codfw.wmnet']

Those hosts were successful:

['maps-test2001.codfw.wmnet']

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610101311_gehel_19322.log.

Completed auto-reimage of hosts:

['maps-test2002.codfw.wmnet']

Those hosts were successful:

[]

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610101403_gehel_27871.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610101403_gehel_27863.log.

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

Those hosts were successful:

[]

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610101416_gehel_29045.log.

Completed auto-reimage of hosts:

['maps-test2003.codfw.wmnet']

Those hosts were successful:

['maps-test2003.codfw.wmnet']

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

Those hosts were successful:

['maps-test2004.codfw.wmnet']

Change 315271 had a related patch set uploaded (by Gehel):
Maps - cleanup postgres user creation

https://gerrit.wikimedia.org/r/315271

Change 315959 had a related patch set uploaded (by Gehel):
maps - adding dummy passwords for postgresql monitoring and replication

https://gerrit.wikimedia.org/r/315959

Change 315959 merged by Gehel:
maps - adding dummy passwords for postgresql monitoring and replication

https://gerrit.wikimedia.org/r/315959

Change 316536 had a related patch set uploaded (by Gehel):
maps - change structure of slaves according to https://gerrit.wikimedia.org/r/#/c/315271/

https://gerrit.wikimedia.org/r/316536

Change 316549 had a related patch set uploaded (by Gehel):
maps - adding dummy monitoring password for postgresql

https://gerrit.wikimedia.org/r/316549

Change 316549 merged by Gehel:
maps - adding dummy monitoring password for postgresql

https://gerrit.wikimedia.org/r/316549

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211601_gehel_23846.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211601_gehel_24043.log.

Completed auto-reimage of hosts:

['maps-test2001.codfw.wmnet']

Of which those FAILED:

set(['maps-test2001.codfw.wmnet'])

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211642_gehel_529.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211642_gehel_662.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211642_gehel_723.log.

Completed auto-reimage of hosts:

['maps-test2002.codfw.wmnet']

Of which those FAILED:

set(['maps-test2002.codfw.wmnet'])

Completed auto-reimage of hosts:

['maps-test2003.codfw.wmnet']

Of which those FAILED:

set(['maps-test2003.codfw.wmnet'])

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

Of which those FAILED:

set(['maps-test2004.codfw.wmnet'])

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211709_gehel_7440.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211713_gehel_8323.log.

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['maps-test2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610211720_gehel_9640.log.

Completed auto-reimage of hosts:

['maps-test2003.codfw.wmnet']

Of which those FAILED:

set(['maps-test2003.codfw.wmnet'])

Completed auto-reimage of hosts:

['maps-test2002.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['maps-test2004.codfw.wmnet']

Of which those FAILED:

set(['maps-test2004.codfw.wmnet'])

Change 315271 merged by Gehel:
Maps - cleanup postgres user creation

https://gerrit.wikimedia.org/r/315271

Change 318282 had a related patch set uploaded (by Gehel):
maps / postgresql: new configuration format for slaves

https://gerrit.wikimedia.org/r/318282

Change 318282 merged by Gehel:
maps / postgresql: new configuration format for slaves

https://gerrit.wikimedia.org/r/318282

Change 318283 had a related patch set uploaded (by Gehel):
maps / postgresql: corrected hiera key for replication password

https://gerrit.wikimedia.org/r/318283

Change 318283 merged by Gehel:
maps / postgresql: corrected hiera key for replication password

https://gerrit.wikimedia.org/r/318283

Change 318286 had a related patch set uploaded (by Gehel):
maps / postgresql: corrected hiera key for replication password

https://gerrit.wikimedia.org/r/318286

Change 318286 merged by Gehel:
maps / postgresql: corrected hiera key for replication password

https://gerrit.wikimedia.org/r/318286

Mentioned in SAL (#wikimedia-operations) [2016-10-27T13:33:39Z] <gehel> postgres replication checks in error after deployment of https://gerrit.wikimedia.org/r/#/c/315271/ (T147194) - replication is working, only check is failing - icinga is silenced

Mentioned in SAL (#wikimedia-operations) [2016-10-27T13:34:05Z] <gehel> maps / postgres replication checks in error after deployment of https://gerrit.wikimedia.org/r/#/c/315271/ (T147194) - replication is working, only check is failing - icinga is silenced

Change 318511 had a related patch set uploaded (by Gehel):
maps / postgresql: use replication user for monitoring

https://gerrit.wikimedia.org/r/318511

Change 318511 merged by Gehel:
maps / postgresql: use replication user for monitoring

https://gerrit.wikimedia.org/r/318511

The re-image is complete. Initial tile generation is in progress and working fine, but we are going to switch it to Cassandra. I am marking this task as resolved.