
mw1228 reporting readonly file system
Closed, Resolved, Public

Description

reedy@tin:/srv/mediawiki-staging$ scap-purge-l10n-cache --version=php-1.27.0-wmf.7
15:41:28 sudo -u mwdeploy -n -- /bin/rm --recursive --force /srv/mediawiki/php-1.27.0-wmf.7/cache/l10n/* on mw1228.eqiad.wmnet returned [1]: /bin/rm: cannot remove ‘/srv/mediawiki/php-1.27.0-wmf.7/cache/l10n/l10n_cache-ab.cdb’: Read-only file system

Event Timeline

Reedy raised the priority of this task to Medium.
Reedy updated the task description.
Reedy added a project: SRE.
Reedy subscribed.
The authenticity of host 'mw1228.eqiad.wmnet (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:rdCU2vs6Jctc96R4kDXnIdrhl0DaKizk8ctz0vTcr2M.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'mw1228.eqiad.wmnet' (ECDSA) to the list of known hosts.
packet_write_wait: Connection to UNKNOWN: Broken pipe
[14:50:59] <icinga-wm> PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:52:06] --> govg (~govg@unaffiliated/govg) has joined #wikimedia-operations
[14:52:19] <-- MGChecker (~MGChecker@p4FE945BD.dip0.t-ipconnect.de) has quit (Read error: Connection reset by peer)
[14:52:37] --> MGChecker (~MGChecker@p4FE945BD.dip0.t-ipconnect.de) has joined #wikimedia-operations
[14:52:58] <icinga-wm> PROBLEM - puppet last run on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:54:39] <icinga-wm> RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 65334 bytes in 0.128 second response time
[14:54:40] <icinga-wm> RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures
[14:58:49] <icinga-wm> PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:59:40] <icinga-wm> PROBLEM - nutcracker process on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:40] <icinga-wm> PROBLEM - RAID on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:59:59] <icinga-wm> PROBLEM - DPKG on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:00] <icinga-wm> PROBLEM - HHVM processes on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:00] <icinga-wm> PROBLEM - salt-minion processes on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:10] <icinga-wm> PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:40] <icinga-wm> PROBLEM - configured eth on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:49] <icinga-wm> PROBLEM - puppet last run on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:00:58] <icinga-wm> PROBLEM - Disk space on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:08] <icinga-wm> PROBLEM - nutcracker port on mw1228 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:01:19] <icinga-wm> PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:01:38] <icinga-wm> RECOVERY - RAID on mw1228 is OK: OK: no RAID installed
[15:01:38] <icinga-wm> RECOVERY - nutcracker process on mw1228 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[15:01:41] <-- MGChecker (~MGChecker@p4FE945BD.dip0.t-ipconnect.de) has quit (Read error: Connection reset by peer)
[15:01:53] --> MGChecker (~MGChecker@p4FE945BD.dip0.t-ipconnect.de) has joined #wikimedia-operations
[15:01:59] <icinga-wm> RECOVERY - HHVM processes on mw1228 is OK: PROCS OK: 6 processes with command name hhvm
[15:01:59] <icinga-wm> RECOVERY - DPKG on mw1228 is OK: All packages OK
[15:01:59] <icinga-wm> RECOVERY - salt-minion processes on mw1228 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:02:10] <icinga-wm> RECOVERY - Check size of conntrack table on mw1228 is OK: OK: nf_conntrack is 0 % full
[15:02:38] <icinga-wm> RECOVERY - configured eth on mw1228 is OK: OK - interfaces up
[15:02:40] <icinga-wm> RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 37 minutes ago with 0 failures
[15:02:50] <icinga-wm> RECOVERY - Disk space on mw1228 is OK: DISK OK
[15:02:59] <icinga-wm> RECOVERY - nutcracker port on mw1228 is OK: TCP OK - 0.000 second response time on port 11212

godog has now depooled it. Still needs investigation.

Change 260251 had a related patch set uploaded (by Filippo Giunchedi):
scap: mw1228 reported ro fs

https://gerrit.wikimedia.org/r/260251

Change 260251 merged by Filippo Giunchedi:
scap: mw1228 reported ro fs

https://gerrit.wikimedia.org/r/260251

Dzahn subscribed.

when trying to ssh to it:

packet_write_wait: Connection to UNKNOWN: Broken pipe

when trying console login:

mw1228 login: root
[34428496.513327] end_request: I/O error, dev sda, sector 197411136
[34428496.520224] end_request: I/O error, dev sda, sector 197411136
[34428506.537227] end_request: I/O error, dev sda, sector 520476784
[34428506.547770] end_request: I/O error, dev sda, sector 109442248
[34428506.554614] end_request: I/O error, dev sda, sector 109442248
[34428554.395529] end_request: I/O error, dev sda, sector 520476784
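To see at a glance how many distinct sectors are failing, the console output above can be summarized with a small script. This is only a rough sketch: the function name `summarize_io_errors` and the regular expression are assumptions written for this paste, not part of any existing tooling, and the pattern only covers the `end_request: I/O error` line format shown here.

```python
import re
from collections import defaultdict

# Matches kernel console lines of the form seen in this task, e.g.
# [34428496.513327] end_request: I/O error, dev sda, sector 197411136
IO_ERROR_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+end_request: I/O error, "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def summarize_io_errors(lines):
    """Return {device: sorted list of distinct failing sectors}."""
    sectors = defaultdict(set)
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            sectors[match.group("dev")].add(int(match.group("sector")))
    return {dev: sorted(found) for dev, found in sectors.items()}

# The console lines pasted in this task.
CONSOLE_OUTPUT = """\
[34428496.513327] end_request: I/O error, dev sda, sector 197411136
[34428496.520224] end_request: I/O error, dev sda, sector 197411136
[34428506.537227] end_request: I/O error, dev sda, sector 520476784
[34428506.547770] end_request: I/O error, dev sda, sector 109442248
[34428506.554614] end_request: I/O error, dev sda, sector 109442248
[34428554.395529] end_request: I/O error, dev sda, sector 520476784
"""

if __name__ == "__main__":
    for dev, found in summarize_io_errors(CONSOLE_OUTPUT.splitlines()).items():
        print(f"{dev}: {len(found)} distinct failing sectors: {found}")
```

Errors spread over several widely separated sectors of the same device are consistent with a failing disk rather than a single bad block, which fits the disk replacement request that follows.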

adding ops-eqiad for a disk replacement

Cmjohnson raised the priority of this task from Medium to High. Jan 20 2016, 3:39 PM

Congratulations: Work Order SR923370958 was successfully submitted.

Disk replaced.
Return shipping information:
USPS 9202 3946 5301 2430 6122 60
FEDEX 9611918 2393026 52103949

Cmjohnson added subscribers: Joe, Cmjohnson.

new OS installed... assigning to @Joe to add back to the cluster

Server repooled and taking traffic correctly.

http://config-master.wikimedia.org/conftool/eqiad/api

{ 'host': 'mw1228.eqiad.wmnet', 'weight':10, 'enabled': True }

The default weight in conftool-data/services/mediawiki.yaml is 10, but other hosts have a weight of 15 or 20, so the last step is to adjust the value. Checking.

mw1228.eqiad.wmnet: weight changed 10 => 20
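
The weight check described above could be automated roughly as follows. This is a sketch under assumptions, not the conftool API: it assumes the pool listing is available as text with one Python-style dict per line (as in the entry quoted above), and the helper names, the `mw-example` host and the choice of 15 and 20 as accepted weights are illustrative only.

```python
import ast

def parse_pool_entries(text):
    """Parse lines that look like Python dict literals into dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("{"):
            # literal_eval only evaluates literals, so it is safe on untrusted text.
            entries.append(ast.literal_eval(line))
    return entries

def hosts_with_unexpected_weight(entries, expected_weights):
    """Return (host, weight) for enabled hosts whose weight is not expected."""
    return [
        (entry["host"], entry["weight"])
        for entry in entries
        if entry.get("enabled") and entry["weight"] not in expected_weights
    ]

# First line is the entry quoted in this task; the second host is hypothetical.
POOL_LISTING = """\
{ 'host': 'mw1228.eqiad.wmnet', 'weight':10, 'enabled': True }
{ 'host': 'mw-example.eqiad.wmnet', 'weight':20, 'enabled': True }
"""

if __name__ == "__main__":
    entries = parse_pool_entries(POOL_LISTING)
    for host, weight in hosts_with_unexpected_weight(entries, {15, 20}):
        print(f"{host}: weight {weight} is still at the default")
```

Run against the listing as it was before the change, this flags only mw1228 with its default weight of 10.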