All wtp and parse servers have a bad partition scheme.
Closed, Resolved, Public

Description

I have depooled wtp1025 since its root partition was full, causing some mw errors etc.

The rest of the wtp nodes are at around 70-75% usage; the only outlier seems to be wtp1025.

The worst dirs on 1025 are:

1.2G	/tmp
2.1G	/var
4.8G	/usr
31G	/srv

elukey@wtp1025:/srv/mediawiki$ sudo du -hs * | sort -h |tail
340K	dblists
2.1M	docroot
2.1M	portals
8.0M	wmf-config
50M	static
126M	fonts
1.8G	php-1.35.0-wmf.24
6.6G	php-1.35.0-wmf.40
10G	php-1.36.0-wmf.1
11G	php-1.35.0-wmf.41

elukey@wtp1025:/srv/mediawiki/php-1.36.0-wmf.1$ sudo du -hs * | sort -h |tail
1.2M	HISTORY
4.5M	maintenance
18M	skins
18M	tests
21M	resources
30M	includes
68M	languages
410M	vendor
548M	extensions
8.9G	cache

The reason for all this is that the partition scheme for the wtp servers is wrong: it doesn't put /srv on its own partition backed by a logical volume, so everything under /srv fills the root partition. We need to reimage all of the wtp servers with a good partition scheme.
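
One way to tell whether a host already has the corrected layout is to look for /srv as its own filesystem on a device-mapper (LVM) volume. The following is a minimal sketch under that assumption, not the check used in this task. The function name `srv_on_lvm` is hypothetical, and it reads `df -P` output on stdin so it can be tried against captured output as well as on a live host.

```sh
# Hypothetical helper: succeeds if `df -P` output on stdin shows /srv
# mounted as its own filesystem from a /dev/mapper (LVM) device.
srv_on_lvm() {
    awk 'NR > 1 && $6 == "/srv" && $1 ~ /^\/dev\/mapper\// { found = 1 }
         END { exit found ? 0 : 1 }'
}

# Usage on a live host:
#   df -P | srv_on_lvm && echo "/srv is a separate logical volume"
```

Using `df -P` keeps each filesystem on a single line with fixed columns, which makes the mount point (column 6) safe to match exactly.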

Event Timeline

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1039.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281338_jayme_16334_wtp1039_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1040.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281340_jayme_17993_wtp1040_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wtp1037.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1041.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281355_jayme_30999_wtp1041_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wtp1038.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1040.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1042.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281455_jayme_22035_wtp1042_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1043.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281458_jayme_24772_wtp1043_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wtp1039.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1041.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1044.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281530_jayme_23290_wtp1044_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp1045.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007281531_jayme_24828_wtp1045_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wtp1043.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1042.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1045.eqiad.wmnet']

and were ALL successful.

All hosts but wtp104[6-8].eqiad.wmnet completed.

Unfortunately wtp1044.eqiad.wmnet is still waiting for the puppet run after reboot. Checking back on that later.

Completed auto-reimage of hosts:

['wtp1044.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2020-07-28T17:41:17Z] <volans> run apt-get clean on wtp[1046,1048].eqiad.wmnet and wtp2001.codfw.wmnet to free ~2GB as they were 100% - T258775

wtp1044.eqiad.wmnet complete and repooled.

Change 616920 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: use correct partman recipe for parse*

https://gerrit.wikimedia.org/r/616920

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp1046.eqiad.wmmet

The log can be found in /var/log/wmf-auto-reimage/202007282305_dzahn_12630_wtp1046_eqiad_wmmet.log.

Completed auto-reimage of hosts:

['wtp1046.eqiad.wmmet']

Of which those FAILED:

['wtp1046.eqiad.wmmet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp1046.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007282306_dzahn_13944_wtp1046_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp1047.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007282315_dzahn_21134_wtp1047_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp1048.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007282352_dzahn_20783_wtp1048_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wtp1046.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1047.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1048.eqiad.wmnet']

and were ALL successful.

wtp1046, wtp1047, wtp1048 completed and repooled

On wtp1047 I had to manually delete deployment-cache due to a race condition on the first puppet run.

Thanks!

Script wmf-auto-reimage was launched by jayme on cumin1001.eqiad.wmnet for hosts:

wtp2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007290842_jayme_19895_wtp2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2001.codfw.wmnet']

and were ALL successful.

@Dzahn I did wtp2001.codfw.wmnet as that was pretty full as well.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007291532_dzahn_8935_wtp2002_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007291533_dzahn_10043_wtp2003_codfw_wmnet.log.

Change 616920 merged by Dzahn:
[operations/puppet@production] installserver: use correct partman recipe for parse*

https://gerrit.wikimedia.org/r/616920

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2001.codfw.wmnet', 'parse2002.codfw.wmnet', 'parse2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291600_dzahn_1939.log.

Thank you. Taking over with wtp2002,2003 and parse2001,2002,2003 and counting up.

Changed the partman recipe for parse* to the same one you put on wtp*.
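
The parse2* reimages in this task were launched in groups of three hosts, counting up. As a rough illustration of that batching only, the sketch below prints the host names grouped three per line. The function name `parse_batches` is hypothetical, and it deliberately does not call wmf-auto-reimage, since the script's arguments are not shown in this task.

```sh
# Hypothetical helper: print parse2001..parse2020 in batches of three,
# one batch per line. The range and batch size mirror the runs logged here.
parse_batches() {
    seq -f 'parse%g.codfw.wmnet' 2001 2020 | xargs -n 3
}

# Usage:
#   parse_batches
```

With 20 hosts this yields seven lines, the last one holding only parse2019 and parse2020.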

Completed auto-reimage of hosts:

['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2001.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2004.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2006.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291627_dzahn_29722.log.

Completed auto-reimage of hosts:

['parse2006.codfw.wmnet', 'parse2005.codfw.wmnet', 'parse2004.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2007.codfw.wmnet', 'parse2008.codfw.wmnet', 'parse2009.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291701_dzahn_30187.log.

Completed auto-reimage of hosts:

['parse2008.codfw.wmnet', 'parse2007.codfw.wmnet', 'parse2009.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2010.codfw.wmnet', 'parse2011.codfw.wmnet', 'parse2012.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291746_dzahn_11362.log.

Completed auto-reimage of hosts:

['wtp2003.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2004.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007291751_dzahn_15752_wtp2004_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2005.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007291756_dzahn_20103_wtp2005_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2005.codfw.wmnet']

Of which those FAILED:

['wtp2005.codfw.wmnet']

Note wtp2005 is missing because of T257903.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2006.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007291757_dzahn_21341_wtp2006_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2010.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet', 'parse2015.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291827_dzahn_17101.log.

Completed auto-reimage of hosts:

['parse2013.codfw.wmnet', 'parse2014.codfw.wmnet']

Of which those FAILED:

['parse2015.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2016.codfw.wmnet', 'parse2017.codfw.wmnet', 'parse2018.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291901_dzahn_17125.log.

Completed auto-reimage of hosts:

['parse2018.codfw.wmnet', 'parse2016.codfw.wmnet', 'parse2017.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007291926_dzahn_10430.log.

Completed auto-reimage of hosts:

['parse2019.codfw.wmnet', 'parse2020.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2004.codfw.wmnet']

and were ALL successful.

all parse2* hosts done:

[cumin1001:~] $ sudo cumin parse2* 'df -h | grep srv | cut -d " " -f12'
..
20 hosts will be targeted:

1%
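
The `cut -d " " -f12` in the command above depends on how many spaces `df -h` uses for padding, so the field number can shift with device name or size width. A possibly more robust variant, sketched here as an assumption rather than what was actually run, selects the row by mount point and prints the usage column. The function name `srv_use_pct` is hypothetical.

```sh
# Hypothetical helper: print the Use% of /srv from `df -P` output on stdin.
# Column 5 is the capacity and column 6 the mount point in POSIX df output.
srv_use_pct() {
    awk 'NR > 1 && $6 == "/srv" { print $5 }'
}

# Usage on a live host:
#   df -P | srv_use_pct
```

It prints nothing when /srv is not a separate filesystem, which also makes hosts with the old layout easy to spot.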

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2007.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007292019_dzahn_30753_wtp2007_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2006.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2008.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007292034_dzahn_11016_wtp2008_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2007.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007292245_dzahn_32551_wtp2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2008.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2010.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007292341_dzahn_17824_wtp2010_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2009.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2010.codfw.wmnet']

and were ALL successful.

wtp2002 through wtp2010 done and repooled.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2011.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007301558_dzahn_30276_wtp2011_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2012.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007301559_dzahn_31216_wtp2012_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2011.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2013.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007301924_dzahn_32723_wtp2013_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2012.codfw.wmnet']

Of which those FAILED:

['wtp2012.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2014.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007302045_dzahn_11722_wtp2014_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2013.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2015.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007302144_dzahn_20146_wtp2015_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2014.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2015.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2016.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007311629_dzahn_20334_wtp2016_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2017.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007311631_dzahn_20487_wtp2017_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2017.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2016.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2018.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007311858_dzahn_10880_wtp2018_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2019.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007311921_dzahn_14318_wtp2019_codfw_wmnet.log.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2020.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007311927_dzahn_14796_wtp2020_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2018.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2019.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp2020.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2020-07-31T22:03:22Z] <mutante> wtp2019 - parsoid could not start after reimaging - was missing /etc/parsoid/config.yaml which is a symbolic link deep onto /srv/deployment/parsoid/deploy-cache/.. like in some other cases before manually deleted deploy-cache dir and ran puppet again .. T258775
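
If the problem described in the SAL entry above is that the link exists but its target under deploy-cache does not, a quick test for a dangling symlink can confirm it before deleting anything. This is a minimal sketch under that assumption; the function name `is_dangling_symlink` is hypothetical.

```sh
# Hypothetical helper: succeeds if the path is a symlink whose target is missing.
# -L tests the link itself; -e follows it and fails when the target is absent.
is_dangling_symlink() {
    [ -L "$1" ] && [ ! -e "$1" ]
}

# Usage on an affected host:
#   is_dangling_symlink /etc/parsoid/config.yaml && echo "config.yaml link is broken"
```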

All wtp* and parse* servers have been reimaged.

With the exception of wtp2019, they have also been tested with httpbb, have the parsoid service running, are repooled, and look fine in monitoring.

wtp2019 has an issue with starting the parsoid service.

For some other servers I ran into race conditions that could be fixed by manually restarting php-fpm or deleting the deploy-cache dir and letting puppet re-clone.

wtp2019 still needs to be fixed.

[cumin1001:~] $ sudo cumin wtp* 'df -h | grep mapper | cut -d "/" -f1,2'
43 hosts will be targeted:
wtp[2001-2004,2006-2020].codfw.wmnet,wtp[1025-1048].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(43) wtp[2001-2004,2006-2020].codfw.wmnet,wtp[1025-1048].eqiad.wmnet
----- OUTPUT of 'df -h | grep map...cut -d "/" -f1,2' -----
/dev
/dev


[cumin1001:~] $ sudo cumin parse* 'df -h | grep mapper | cut -d "/" -f1,2'
20 hosts will be targeted:
parse[2001-2020].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(20) parse[2001-2020].codfw.wmnet
----- OUTPUT of 'df -h | grep map...cut -d "/" -f1,2' -----
/dev
/dev
================
PASS |███████████

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

wtp2019.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008060033_dzahn_14511_wtp2019_codfw_wmnet.log.

Completed auto-reimage of hosts:

['wtp2019.codfw.wmnet']

and were ALL successful.