Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030)
Closed, Resolved, Public
Assigned To

Authored By

	ops-monitoring-bot
	Feb 19 2018, 1:53 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db2030. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 6
	Number of Spans: 2
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 6

			PD: 4 Information
			Enclosure Device ID: 32
			Slot Number: 4
			Drive's position: DiskGroup: 0, Span: 0, Arm: 4
			Media Error Count: 29
			Other Error Count: 0
			Predictive Failure Count: =====> 290 <=====
			Last Predictive Failure Event Seq Number: 5652

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 49C (120.20 F)

			PD: 5 Information
			Enclosure Device ID: 32
			Slot Number: 5
			Drive's position: DiskGroup: 0, Span: 0, Arm: 5
			Media Error Count: 3
			Other Error Count: 0
			Predictive Failure Count: =====> 57 <=====
			Last Predictive Failure Event Seq Number: 5285

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 50C (122.00 F)

=== RaidStatus completed
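
For triage, the snapshot above can be reduced to a per-drive verdict. A minimal sketch, assuming the text layout shown in the snapshot (an excerpt is inlined as `SNAPSHOT`); `parse_drives` and `classify` are hypothetical helper names, not part of the Nagios plugin:

```python
import re

# Excerpt of the snapshot above, with only the fields used below.
SNAPSHOT = """\
PD: 4 Information
Slot Number: 4
Media Error Count: 29
Predictive Failure Count: =====> 290 <=====
Firmware state: Online, Spun Up

PD: 5 Information
Slot Number: 5
Media Error Count: 3
Predictive Failure Count: =====> 57 <=====
Firmware state: =====> Failed <=====
"""


def parse_drives(text):
    """Return one dict per 'PD: N Information' block."""
    drives = []
    current = None
    for raw in text.splitlines():
        # Drop the '=====>' / '<=====' highlight markers before parsing.
        line = raw.replace("=====>", "").replace("<=====", "").strip()
        header = re.match(r"PD: (\d+) Information$", line)
        if header:
            current = {"pd": int(header.group(1))}
            drives.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = " ".join(value.split())
    return drives


def classify(drive):
    """Label a drive 'failed', 'at-risk' (predictive failures), or 'ok'."""
    if not drive.get("Firmware state", "").startswith("Online"):
        return "failed"
    if int(drive.get("Predictive Failure Count", "0")) > 0:
        return "at-risk"
    return "ok"


report = {d["pd"]: classify(d) for d in parse_drives(SNAPSHOT)}
print(report)
```

Note that PD 4 is still online but reports more predictive failures than the PD 5 that already failed, so both drives in the same span are suspect.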

Details

Subject	Repo	Branch	Lines +/-
dbproxy1005: Replace db2030 with db2037	operations/puppet	production	+2 -2
Change m5-slave to db2037	operations/dns	master	+1 -1
site.pp: Remove db2037 from s4	operations/puppet	production	+1 -1
m5.hosts: Add db2037	operations/software	master	+1 -2
mariadb: Move db2037 from s4 role to m5	operations/puppet	production	+7 -11
db2037: Disable notifications	operations/puppet	production	+1 -0
install_server: Allow db2037 to reinstall	operations/puppet	production	+1 -3
mariadb: Remove db2037 and db2044 for mediawiki	operations/mediawiki-config	master	+0 -16
mariadb: Depool db2037 and db2044	operations/mediawiki-config	master	+2 -2

Related Objects

Status	Assigned	Task
Resolved	Dzahn	T165620 Upload gerrit package to stretch apt.wm.org repo
Resolved	Dzahn	T168562 Reimage gerrit2001 as stretch
Resolved	Dzahn	T176532 Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working
Resolved	jcrespo	T177208 Provide dedicated database resources for wikidata
Resolved	None	T170662 Productionize 22 new codfw database servers
Resolved	None	T176243 Decommission database hosts <= db2031 (tracking)
Open	None	T157702 Followup for TLS MariaDB server roll-out
Resolved	jcrespo	T183470 Setup newer machines and replace all old misc (m*) and x1 codfw machines
Resolved	Marostegui	T187722 Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030)
Resolved	RobH	T187768 Decommission db2030

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Feb 19 2018, 1:53 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Feb 19 2018, 1:53 PM

Volans added a project: DBA. Feb 19 2018, 3:34 PM

@Papaul if you got spares, could you replace it?
Thanks!

Marostegui moved this task from Triage to In progress on the DBA board. Feb 19 2018, 3:38 PM

@Marostegui disk replacement complete

Papaul reassigned this task from Papaul to Marostegui. Feb 19 2018, 6:47 PM

This host has completely failed. As per my chat with @Papaul, there were two disks blinking as failed on the server chassis; it looks like the replaced one had not yet been detected by megacli, and thus the RAID failed completely.

root@db2030:~# dmesg
-bash: /bin/dmesg: Input/output error

The HW logs show the following disk events:

/admin1/system1/logs1/log1-> show record4
/admin1/system1/logs1/log1/record4
	properties
		CreationTimestamp = 20180219133157.000000+000
		ElementName = System Event Log Entry
		RecordData = *4*Fault detected on Drive 5.
		RecordFormat = *string Severity*string Description
		RecordID = 4
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record5
/admin1/system1/logs1/log1/record5
	properties
		CreationTimestamp = 20180219184540.000000+000
		ElementName = System Event Log Entry
		RecordData = *4*Drive 4 is removed.
		RecordFormat = *string Severity*string Description
		RecordID = 5
	associations
	targets
	verbs
		cd
		show
		help
		version
/admin1/system1/logs1/log1-> show record6
/admin1/system1/logs1/log1/record6
	properties
		CreationTimestamp = 20180219184545.000000+000
		ElementName = System Event Log Entry
		RecordData = *2*Drive 4 is installed.
		RecordFormat = *string Severity*string Description
		RecordID = 6
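
To read the three records above as a timeline, the `CreationTimestamp` values can be parsed. A rough sketch, assuming the CIM-style layout `yyyymmddHHMMSS.ffffff` followed by a sign and a three-digit UTC offset in minutes; `parse_cim_timestamp` is a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# (CreationTimestamp, description) pairs copied from records 4-6 above.
RECORDS = [
    ("20180219133157.000000+000", "Fault detected on Drive 5."),
    ("20180219184540.000000+000", "Drive 4 is removed."),
    ("20180219184545.000000+000", "Drive 4 is installed."),
]


def parse_cim_timestamp(value):
    """Parse 'yyyymmddHHMMSS.ffffff+UUU' (offset assumed to be in minutes)."""
    stamp, sign, offset = value[:21], value[21], value[22:]
    minutes = int(offset) * (1 if sign == "+" else -1)
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    return naive.replace(tzinfo=timezone(timedelta(minutes=minutes)))


events = [(parse_cim_timestamp(ts), text) for ts, text in RECORDS]
for when, text in events:
    print(when.isoformat(), text)

# Time between the Drive 5 fault and the Drive 4 removal.
gap = events[1][0] - events[0][0]
print("fault -> removal:", gap)
```

The log records a fault on Drive 5, while the removal and reinstall about five hours later are logged for Drive 4.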

This host is going to be decommissioned anyway, so it is probably not worth doing any work to get it fixed or reimaged. It is also a passive (unused) m5 slave.
Better to take another host (initially from the list in T170662) and reimport the data.

jcrespo subscribed. Feb 19 2018, 7:16 PM

Also, we could replace two of the 160GB hosts in s6 with one large server and use one of those 160GB hosts for this replacement.

Marostegui renamed this task from Degraded RAID on db2030 to Replace db2030 from m5 with another host (WAS: Degraded RAID on db2030). Feb 20 2018, 6:38 AM

My proposal is:

Replace db2067 and db2060 (160GB) in s6 with a large host.
This is the current status of s6

's6' => [
    'db2039'      => 0,   # C6 2.9TB 160GB, master
    'db2046'      => 400, # C6 2.9TB 160GB
    'db2053'      => 100, # D6 2.9TB 160GB, dump (inactive), vslow
    'db2060'      => 100, # D6 3.3TB 160GB, api
    'db2067'      => 400, # D6 3.3TB 160GB
    'db2076'      => 400, # B1 3.3TB 512GB
    'db2087:3316' => 1, # C1 3.3TB 512GB # rc, log: s6 and s7
    'db2089:3316' => 1, # A3 3.3TB 512GB # rc, log: s6 and s5(s8)

My proposal

                'db2039'      => 0,   # C6 2.9TB 160GB, master
                'db2046'      => 400, # C6 2.9TB 160GB
                'db2053'      => 100, # D6 2.9TB 160GB, dump (inactive), vslow
-               'db2060'      => 100, # D6 3.3TB 160GB, api
-               'db2067'      => 400, # D6 3.3TB 160GB
-               'db2076'      => 400, # B1 3.3TB 512GB
+               'dbXXXX'      => 400, # XX 3.3TB 512GB, api
+               'db2076'      => 500, # B1 3.3TB 512GB
                'db2087:3316' => 1, # C1 3.3TB 512GB # rc, log: s6 and s7
                'db2089:3316' => 1, # A3 3.3TB 512GB # rc, log: s6 and s5(s8)

Use db2060 for m5
Use db2067 for m2

I would like to focus on s4. s4 needs better hardware than s6, then do:

		'db2051'	  => 0,   # B8 2.9TB 160GB, master
-		'db2037'	  => 50,  # C6 2.9TB 160GB, rc, log
-		'db2044'	  => 50,  # C6 2.9TB 160GB, rc, log
		'db2058'	  => 50,  # D6 3.3TB 160GB, dump (inactive), vslow
		'db2065'	  => 200, # D6 3.3TB 160GB, api
		'db2073'	  => 400, # C6 3.3TB 512GB # Compressed InnoDB
+		'db2XXX'	  => 400, # XX 3.3TB 512GB
		'db2084:3314' => 1, # D6 3.3TB 512GB # rc, log: s4 and s5
		'db2091:3314' => 1, # A8 3.3TB 512GB # rc, log: s2 and s4

In T187722#3984885, @jcrespo wrote:

I would like to focus on s4. s4 needs better hardware than s6, then do:

I was doubting between s4 and s6 too - so this looks good to me too :-)

Mentioned in SAL (#wikimedia-operations) [2018-02-20T09:16:35Z] <marostegui> Data checks for db2037 before removing it from s4 - T187722

We also have backups from the host that crashed itself on dbstore2001, I think, we could use them to reconstruct m5 without touching the master to, at the same time, validate the backups and help accomplish the goal. It is ok if you prefer to do it faster with other methods.

In T187722#3984891, @Marostegui wrote:

In T187722#3984885, @jcrespo wrote:

I would like to focus on s4. s4 needs better hardware than s6, then do:

I was doubting between s4 and s6 too - so this looks good to me too :-)

My rationale is that the only sections that have more than one large server in codfw are s1 and s8, so s4 should come next.

In T187722#3984897, @jcrespo wrote:

We also have backups from the host that crashed itself on dbstore2001, I think, we could use them to reconstruct m5 without touching the master to, at the same time, validate the backups and help accomplish the goal. It is ok if you prefer to do it faster with other methods.

Yeah, I was planning on using dbstore2001 in the first instance precisely for that, to see if we are doing things right :-)

Change 412871 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2037 and db2044

https://gerrit.wikimedia.org/r/412871

gerritbot added a project: Patch-For-Review. Feb 20 2018, 10:43 AM

Change 412872 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Remove db2037 and db2044 for mediawiki

https://gerrit.wikimedia.org/r/412872

Change 412871 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Depool db2037 and db2044

https://gerrit.wikimedia.org/r/412871

Change 412872 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Remove db2037 and db2044 for mediawiki

https://gerrit.wikimedia.org/r/412872

Aside from m5, what is the other host for? x1 or m2?

So I think db2037 -> m5
db2044 -> x1/m2 no?

m2, sorry

Change 412876 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow db2037 to reinstall

https://gerrit.wikimedia.org/r/412876

Change 412876 merged by Marostegui:
[operations/puppet@production] install_server: Allow db2037 to reinstall

https://gerrit.wikimedia.org/r/412876

Change 412878 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2037: Disable notifications

https://gerrit.wikimedia.org/r/412878

Change 412878 merged by Marostegui:
[operations/puppet@production] db2037: Disable notifications

https://gerrit.wikimedia.org/r/412878

jcrespo added a parent task: T183470: Setup newer machines and replace all old misc (m*) and x1 codfw machines. Feb 20 2018, 8:01 PM

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

db2037.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201802210623_marostegui_3847_db2037_codfw_wmnet.log.

Change 413103 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db2037 from s4 role to m5

https://gerrit.wikimedia.org/r/413103

Completed auto-reimage of hosts:

['db2037.codfw.wmnet']

and were ALL successful.

Change 413103 merged by Marostegui:
[operations/puppet@production] mariadb: Move db2037 from s4 role to m5

https://gerrit.wikimedia.org/r/413103

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

db2037.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201802210658_marostegui_30995_db2037_codfw_wmnet.log.

I am trying to check what's wrong with db2037, as it is showing:

[   52.315934] blk_update_request: critical medium error, dev sda, sector 8395392
[   52.401753] sd 0:1:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   52.401776] sd 0:1:0:0: [sda] tag#1 Sense Key : Medium Error [current]
[   52.401778] sd 0:1:0:0: [sda] tag#1 Add. Sense: Unrecovered read error
[   52.401781] sd 0:1:0:0: [sda] tag#1 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 88 00 00 00 08 00 00
[   52.401783] blk_update_request: critical medium error, dev sda, sector 8395400
[   52.434405] sd 0:1:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   52.434407] sd 0:1:0:0: [sda] tag#2 Sense Key : Medium Error [current]
[   52.434408] sd 0:1:0:0: [sda] tag#2 Add. Sense: Unrecovered read error
[   52.434409] sd 0:1:0:0: [sda] tag#2 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 90 00 00 00 08 00 00
[   52.434410] blk_update_request: critical medium error, dev sda, sector 8395408
[   52.467099] sd 0:1:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   52.467101] sd 0:1:0:0: [sda] tag#3 Sense Key : Medium Error [current]
[   52.467103] sd 0:1:0:0: [sda] tag#3 Add. Sense: Unrecovered read error
[   52.467105] sd 0:1:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 98 00 00 00 08 00 00
[   52.467106] blk_update_request: critical medium error, dev sda, sector 8395416
[   52.500450] sd 0:1:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   52.500451] sd 0:1:0:0: [sda] tag#4 Sense Key : Medium Error [current]
[   52.500453] sd 0:1:0:0: [sda] tag#4 Add. Sense: Unrecovered read error
[   52.500455] sd 0:1:0:0: [sda] tag#4 CDB: Read(16) 88 00 00 00 00 00 00 80 1a a0 00 00 00 08 00 00
[   52.500457] blk_update_request: critical medium error, dev sda, sector 8395424
[   52.533487] sd 0:1:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   52.533489] sd 0:1:0:0: [sda] tag#5 Sense Key : Medium Error [current]
[   52.533491] sd 0:1:0:0: [sda] tag#5 Add. Sense: Unrecovered read error
[   52.533820] sd 0:1:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 00 00 80 1a a8 00 00 00 08 00 00
[   52.533821] blk_update_request: critical medium error, dev sda, sector 8395432
[   53.263857] sd 0:1:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   53.263879] sd 0:1:0:0: [sda] tag#6 Sense Key : Medium Error [current]
[   53.263883] sd 0:1:0:0: [sda] tag#6 Add. Sense: Unrecovered read error
[   53.263886] sd 0:1:0:0: [sda] tag#6 CDB: Read(16) 88 00 00 00 00 00 00 80 1a b0 00 00 00 08 00 00
[   53.263888] blk_update_request: critical medium error, dev sda, sector 8395440
[   54.012267] sd 0:1:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   54.012290] sd 0:1:0:0: [sda] tag#7 Sense Key : Medium Error [current]
[   54.012295] sd 0:1:0:0: [sda] tag#7 Add. Sense: Unrecovered read error
[   54.012297] sd 0:1:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 00 00 80 1a b8 00 00 00 08 00 00
[   54.012299] blk_update_request: critical medium error, dev sda, sector 8395448
[   55.007060] EXT4-fs error (device sda1): __ext4_get_inode_loc:4355: inode #269794: block 1049054: comm ferm: unable to read itable block
[   57.689290] scsi_io_completion: 18 callbacks suppressed
[   57.689317] sd 0:1:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.689319] sd 0:1:0:0: [sda] tag#10 Sense Key : Medium Error [current]
[   57.689322] sd 0:1:0:0: [sda] tag#10 Add. Sense: Unrecovered read error
[   57.689324] sd 0:1:0:0: [sda] tag#10 CDB: Read(16) 88 00 00 00 00 00 00 80 1a d0 00 00 00 08 00 00
[   57.689326] blk_update_request: 18 callbacks suppressed
[   57.689328] blk_update_request: critical medium error, dev sda, sector 8395472
[   57.722338] sd 0:1:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.722339] sd 0:1:0:0: [sda] tag#11 Sense Key : Medium Error [current]
[   57.722340] sd 0:1:0:0: [sda] tag#11 Add. Sense: Unrecovered read error
[   57.722341] sd 0:1:0:0: [sda] tag#11 CDB: Read(16) 88 00 00 00 00 00 00 80 1a d8 00 00 00 08 00 00
[   57.722344] blk_update_request: critical medium error, dev sda, sector 8395480
[   57.754940] sd 0:1:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.754941] sd 0:1:0:0: [sda] tag#12 Sense Key : Medium Error [current]
[   57.754943] sd 0:1:0:0: [sda] tag#12 Add. Sense: Unrecovered read error
[   57.754944] sd 0:1:0:0: [sda] tag#12 CDB: Read(16) 88 00 00 00 00 00 00 80 1a e0 00 00 00 08 00 00
[   57.754945] blk_update_request: critical medium error, dev sda, sector 8395488
[   57.787731] sd 0:1:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.787732] sd 0:1:0:0: [sda] tag#13 Sense Key : Medium Error [current]
[   57.787733] sd 0:1:0:0: [sda] tag#13 Add. Sense: Unrecovered read error
[   57.787735] sd 0:1:0:0: [sda] tag#13 CDB: Read(16) 88 00 00 00 00 00 00 80 1a e8 00 00 00 08 00 00
[   57.787736] blk_update_request: critical medium error, dev sda, sector 8395496
[   57.830103] sd 0:1:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.830128] sd 0:1:0:0: [sda] tag#14 Sense Key : Medium Error [current]
[   57.830130] sd 0:1:0:0: [sda] tag#14 Add. Sense: Unrecovered read error
[   57.830132] sd 0:1:0:0: [sda] tag#14 CDB: Read(16) 88 00 00 00 00 00 00 80 1a f8 00 00 00 08 00 00
[   57.830137] blk_update_request: critical medium error, dev sda, sector 8395512
[   57.885981] sd 0:1:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   57.886007] sd 0:1:0:0: [sda] tag#15 Sense Key : Medium Error [current]
[   57.886009] sd 0:1:0:0: [sda] tag#15 Add. Sense: Unrecovered read error
[   57.886012] sd 0:1:0:0: [sda] tag#15 CDB: Read(16) 88 00 00 00 00 00 00 80 1a f0 00 00 00 08 00 00
[   57.886014] blk_update_request: critical medium error, dev sda, sector 8395504
[   57.922666] EXT4-fs error (device sda1): __ext4_get_inode_loc:4355: inode #269794: block 1049054: comm ferm: unable to read itable block
[   58.055599] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   58.055620] sd 0:1:0:0: [sda] tag#0 Sense Key : Medium Error [current]
[   58.055622] sd 0:1:0:0: [sda] tag#0 Add. Sense: Unrecovered read error
[   58.055625] sd 0:1:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 80 00 00 00 08 00 00
[   58.055627] blk_update_request: critical medium error, dev sda, sector 8395392
[   58.857109] sd 0:1:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   58.857133] sd 0:1:0:0: [sda] tag#1 Sense Key : Medium Error [current]
[   58.857135] sd 0:1:0:0: [sda] tag#1 Add. Sense: Unrecovered read error
[   58.857137] sd 0:1:0:0: [sda] tag#1 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 88 00 00 00 08 00 00
[   58.857139] blk_update_request: critical medium error, dev sda, sector 8395400
[   59.608267] sd 0:1:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   59.608289] sd 0:1:0:0: [sda] tag#2 Sense Key : Medium Error [current]
[   59.608291] sd 0:1:0:0: [sda] tag#2 Add. Sense: Unrecovered read error
[   59.608294] sd 0:1:0:0: [sda] tag#2 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 90 00 00 00 08 00 00
[   59.608295] blk_update_request: critical medium error, dev sda, sector 8395408
[   59.643033] sd 0:1:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   59.643035] sd 0:1:0:0: [sda] tag#3 Sense Key : Medium Error [current]
[   59.643036] sd 0:1:0:0: [sda] tag#3 Add. Sense: Unrecovered read error
[   59.643037] sd 0:1:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 00 00 80 1a 98 00 00 00 08 00 00
[   59.643038] blk_update_request: critical medium error, dev sda, sector 8395416
[   61.482618] EXT4-fs error (device sda1): __ext4_get_inode_loc:4355: inode #269794: block 1049054: comm ferm: unable to read itable block
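
The CDB bytes in the log above can be cross-checked against the reported sectors. A minimal sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13); `decode_read16` is a name made up for illustration:

```python
def decode_read16(cdb_hex):
    """Return (lba, block_count) from a READ(16) CDB given as hex pairs."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between pairs is accepted
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")        # logical block address
    blocks = int.from_bytes(cdb[10:14], "big")    # transfer length in blocks
    return lba, blocks


# Two CDBs copied from the log (tag#0 and tag#1).
for cdb in (
    "88 00 00 00 00 00 00 80 1a 80 00 00 00 08 00 00",
    "88 00 00 00 00 00 00 80 1a 88 00 00 00 08 00 00",
):
    lba, blocks = decode_read16(cdb)
    print(f"LBA {lba}, {blocks} blocks")
```

The decoded LBAs match the `sector` values in the adjacent `blk_update_request` lines, and each failed read covers 8 blocks (4 KiB if the logical block size is 512 bytes, which the log does not state). The failures are retries over one small contiguous region rather than errors scattered across the disk.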

But the RAID is still OPTIMAL, so...

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

     physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
     physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
     physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
     physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
     physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
     physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
     physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
     physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
     physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
     physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
     physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
     physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

But the host is in read-only mode...

root@db2037:~# touch test
touch: cannot touch ‘test’: Read-only file system
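
A read-only remount like this can also be confirmed without writing anything, by looking at the mount options. A small sketch that parses `/proc/mounts`-style text; the `SAMPLE` content is hypothetical (it was not captured from db2037), and the `errors=remount-ro` option in it is only an assumption about how the root filesystem might be mounted:

```python
def readonly_mounts(mounts_text):
    """Return mount points whose option list contains 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mountpoint, _fstype, options = fields[:4]
        if "ro" in options.split(","):
            result.append(mountpoint)
    return result


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 ro,relatime,errors=remount-ro,data=ordered 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

print(readonly_mounts(SAMPLE))
```

On a live host the same function could be fed the contents of `/proc/mounts`.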

Completed auto-reimage of hosts:

['db2037.codfw.wmnet']

Of which those FAILED:

['db2037.codfw.wmnet']

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

db2037.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201802210737_marostegui_8297_db2037_codfw_wmnet.log.

And ILO isn't working any more, so PXE boot cannot be set.

root@neodymium:~# ipmitool -I lanplus -H db2037.codfw.wmnet -U root -E chassis bootdev pxe
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session
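
On the "Unable to read password from environment" line: with `-E`, ipmitool takes the password from the `IPMI_PASSWORD` environment variable and falls back to prompting when it is unset. A sketch of wrapping the same command from a script; `build_ipmi_command` is a hypothetical helper, and this only avoids the prompt, it does not address the RMCP+ session error:

```python
import os


def build_ipmi_command(host, user, password):
    """Return (argv, env) for the 'chassis bootdev pxe' call shown above."""
    argv = [
        "ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
        "chassis", "bootdev", "pxe",
    ]
    # -E makes ipmitool read the password from IPMI_PASSWORD.
    env = dict(os.environ, IPMI_PASSWORD=password)
    return argv, env


argv, env = build_ipmi_command("db2037.codfw.wmnet", "root", "not-a-real-password")
print(" ".join(argv))
```

The pair can then be passed to `subprocess.run(argv, env=env, check=True)`, which keeps the password out of the process arguments.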

I am trying to set it manually via VSP, but I cannot: by the time the serial console starts, the boot process has already gone past that point.

@Papaul can you help us here with the following steps?

check if there is any disk blinking as broken on this host?
try to power drain this host and see if this makes ILO come back to life?

Thanks a lot

The HW logs show nothing, by the way.

I have managed to get the system up after fixing a few inodes:

root@db2037:~# touch test
root@db2037:~#

However I am going to force another reimage to see what happens.

Completed auto-reimage of hosts:

['db2037.codfw.wmnet']

Of which those FAILED:

['db2037.codfw.wmnet']

I am unable to reimage the server due to the PXE issue I described at T187722#3988292.
The system looks fine, so for now let's check what I mentioned at T187722#3988292, which is:

@Papaul can you help us here with the following steps?

check if there is any disk blinking as broken on this host?
try to power drain this host and see if this makes ILO come back to life?

Thanks a lot

Change 413120 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] m5.hosts: Add db2037

https://gerrit.wikimedia.org/r/413120

Change 413120 merged by jenkins-bot:
[operations/software@master] m5.hosts: Add db2037

https://gerrit.wikimedia.org/r/413120

Change 413128 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: Remove db2037 from s4

https://gerrit.wikimedia.org/r/413128

Change 413128 merged by Marostegui:
[operations/puppet@production] site.pp: Remove db2037 from s4

https://gerrit.wikimedia.org/r/413128

m5 is now replicating on db2037.
I will leave notifications disabled until we do the tests with @Papaul when he has a moment.

Change 413129 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1005: Replace db2030 with db2037

https://gerrit.wikimedia.org/r/413129

Change 413130 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] Change m5-slave to db2037

https://gerrit.wikimedia.org/r/413130

Change 413130 merged by Jcrespo:
[operations/dns@master] Change m5-slave to db2037

https://gerrit.wikimedia.org/r/413130

Change 413129 merged by Marostegui:
[operations/puppet@production] dbproxy1005: Replace db2030 with db2037

https://gerrit.wikimedia.org/r/413129

Mentioned in SAL (#wikimedia-operations) [2018-02-21T10:33:11Z] <marostegui> Reload haproxy on dbproxy1005 - T187722

Marostegui mentioned this in T187768: Decommission db2030. Feb 21 2018, 10:37 AM

@Marostegui Disk in slot 5 is blinking.

db2030 or db2037? (We are no longer interested in db2030; we are going to retire it.)

No blinking disk on db2037; all looks good. And ILO is working on my end.

1- Power drain
2- Update all firmware on the system.

1- The ipmi is still not working :-(
2- Thanks!

Also, thanks for checking the disks. The system booted up fine, which is good.
I will close this task now, open a new one for the IPMI issue, and attach it to T169360.

Thanks!

Marostegui closed this task as Resolved. Feb 22 2018, 5:08 PM