
Rack and Setup ms-be1028-ms-1039
Closed, Resolved (Public)

Description

@fgiunchedi 12 new systems arrived. Please let me know how you would like these racked and whether you prefer any particular racks.

Event Timeline

Racking will be 3x systems per row, 10G where possible or 1G where we can't do 10G (see also T148647).

Cmjohnson renamed this task from Rack and Setup ms-be1028-ms-1033 to Rack and Setup ms-be1028-ms-1039. (Mar 16 2017, 5:24 PM)
Cmjohnson updated the task description.

Going to go with 3 in 10G rack A5, C8 and D8. The remaining 3 will go to row B3, B5, C5

Reporting from IRC:

17:51  <godog> cmjohnson1: it is easier to think about it if they are more spread equally among rows, 3x per row should do it, it matters less 10G vs 1G in this case
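For illustration only, the "3x per row" spread discussed above can be sketched as a round-robin assignment. The row names A to D are inferred from the racks mentioned in this task, and the resulting host-to-row mapping is hypothetical; it is not the placement that was actually used.

```python
from collections import defaultdict


def spread_across_rows(hosts, rows):
    """Assign hosts to rows round-robin so every row gets an equal share."""
    placement = defaultdict(list)
    for index, host in enumerate(hosts):
        placement[rows[index % len(rows)]].append(host)
    return dict(placement)


# 12 hosts, ms-be1028 through ms-be1039 (range taken from the task title).
hosts = [f"ms-be{number}" for number in range(1028, 1040)]
# Assumed row names; the real rack slots were chosen on site.
rows = ["A", "B", "C", "D"]

placement = spread_across_rows(hosts, rows)
for row, members in sorted(placement.items()):
    print(row, members)
```

With 12 hosts and 4 rows this yields 3 hosts per row, which is the property godog asks for; whether a given slot is 10G or 1G is deliberately ignored here.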

Change 343660 had a related patch set uploaded (by Cmjohnson):
[operations/dns] Adding dns entries for new ms-be servers T160640

https://gerrit.wikimedia.org/r/343660

Change 343660 merged by Cmjohnson:
[operations/dns] Adding dns entries for new ms-be servers T160640

https://gerrit.wikimedia.org/r/343660

@Cmjohnson is there an ETA to have the servers with OS installed online? thanks!

@fgiunchedi All the servers are racked and cabled, and for the most part the ILO is set up. On-site work still needed: the last few ILO configs and the RAID setup. I still need to update the switch, dhcpd, netboot.cfg and racktables. This should all be completed tomorrow (Thursday).

Change 344646 had a related patch set uploaded (by Cmjohnson):
[operations/puppet@production] T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39

https://gerrit.wikimedia.org/r/344646

Change 344646 merged by Cmjohnson:
[operations/puppet@production] T160640 Adding dhcpd entries and netboot.cfg for new swift servers ms-be1028-39

https://gerrit.wikimedia.org/r/344646

Change 344663 had a related patch set uploaded (by Cmjohnson):
[operations/dns@master] T160640 Adding dns entries for production new swift servers ms-be1028-1036

https://gerrit.wikimedia.org/r/344663

Change 344663 merged by Cmjohnson:
[operations/dns@master] T160640 Adding dns entries for production new swift servers ms-be1028-1036

https://gerrit.wikimedia.org/r/344663

@Cmjohnson thanks!

I've fixed the RAID (cf. https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen9#ms-be_RAID0_config) on the machines that were able to get to debian-installer via PXE, and these are now reinstalling:

=== ms-be1028
=== ms-be1029
=== ms-be1030
=== ms-be1035
=== ms-be1037
=== ms-be1038

These don't seem to PXE boot into debian-installer successfully; could you take a look at why that is?

=== ms-be1031
=== ms-be1032
=== ms-be1033
=== ms-be1034

For these I don't seem to be able to reach the console:

=== ms-be1036
=== ms-be1039

Change 345290 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] swift: add ms-be1028 -> ms-be1039

https://gerrit.wikimedia.org/r/345290

Change 345290 merged by Filippo Giunchedi:
[operations/puppet@production] swift: add ms-be1028 -> ms-be1039

https://gerrit.wikimedia.org/r/345290

Updated the MAC addresses for 1031-34. The console issue with 1036 was a cable that was not in the correct port. For 1039, I fat-fingered the mgmt IP address during setup.

1031's port was also a member of the labs-instances vlan. I removed the port from there and disabled/enabled the port, and now 1031 can PXE boot.

1031 / 1032 / 1033 still had their 10G interfaces enabled, and thus the 1G interfaces would show up starting from eth2. I've disabled the 10G interfaces on all three and the machines are now installing.
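A rough sketch of why the 1G ports moved to eth2, under a simplified model: enabled NICs get ethN names in enumeration order, and the two 10G ports are enumerated before the 1G ports. The port counts and labels below are assumptions for illustration, not read from the hardware.

```python
def name_interfaces(nics):
    """Give ethN names, in enumeration order, to the NICs that are enabled."""
    enabled = [nic for nic in nics if nic["enabled"]]
    return {f"eth{index}": nic["label"] for index, nic in enumerate(enabled)}


def build_nics(ten_gig_enabled):
    """Hypothetical layout: two 10G ports first, then four 1G ports."""
    ten_gig = [{"label": f"10G-port{n}", "enabled": ten_gig_enabled} for n in (1, 2)]
    one_gig = [{"label": f"1G-port{n}", "enabled": True} for n in (1, 2, 3, 4)]
    return ten_gig + one_gig


with_10g = name_interfaces(build_nics(ten_gig_enabled=True))
without_10g = name_interfaces(build_nics(ten_gig_enabled=False))

print("10G enabled: ", with_10g)
print("10G disabled:", without_10g)
```

In this model the first 1G port is eth2 while the 10G ports are enabled and eth0 once they are disabled, which would explain an installer that expects the 1G port on eth0 failing on these three hosts.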

Mentioned in SAL (#wikimedia-operations) [2017-03-30T14:38:29Z] <godog> run stress test (w/ bonnie) on new swift hw - T160640

Mentioned in SAL (#wikimedia-operations) [2017-03-30T18:24:28Z] <godog> swift eqiad-prod add ms-be1028 -> ms-be1039 - T160640

Change 345816 had a related patch set uploaded (by Filippo Giunchedi):
[operations/puppet@production] swift: increase max_connections for object server rsync

https://gerrit.wikimedia.org/r/345816

Change 345816 merged by Filippo Giunchedi:
[operations/puppet@production] swift: increase max_connections for object server rsync

https://gerrit.wikimedia.org/r/345816

Just powercycled ms-be1016 that was stuck in console (pingable but no ssh available):

[11674384.225319] BUG: soft lockup - CPU#12 stuck for 22s! [migration/12:149] in ms-be1016's console

When I powercycled I saw:

error: diskfilter writes are not supported.

Press any key to continue...

@elukey yes, it's a known problem. I have the new part but @fgiunchedi is out this week. We'll take care of it next week. The task for ms-be1016 is https://phabricator.wikimedia.org/T150206

Mentioned in SAL (#wikimedia-operations) [2017-04-24T13:49:53Z] <godog> swift eqiad-prod: more weight on ms-be1028 -> ms-be1039 - T160640

Mentioned in SAL (#wikimedia-operations) [2017-05-08T08:25:11Z] <godog> swift eqiad-prod: ms-be1028/ms-be1039 container/account full weight - T160640

Mentioned in SAL (#wikimedia-operations) [2017-05-08T09:30:24Z] <godog> swift eqiad-prod: ms-be1028/ms-be1039 object weight 2000 - T160640

Mentioned in SAL (#wikimedia-operations) [2017-05-15T08:29:46Z] <godog> swift eqiad-prod: ms-be1028/ms-be1039 object weight 3000 - T160640

Mentioned in SAL (#wikimedia-operations) [2017-05-23T09:49:24Z] <godog> swift eqiad-prod: ms-be1028/ms-be1039 object weight 3500 - T160640

All hosts are at weight 4000 and in service; the decom task for the corresponding old hw is T166489.
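As a back-of-the-envelope illustration of the gradual ramp logged above (object weight 2000, 3000, 3500, then 4000), the share of data a group of devices is asked to hold is roughly its weight divided by the total ring weight. The sketch below assumes one device per new host and a made-up total weight for the existing hardware; both numbers are placeholders, not values from the eqiad-prod ring.

```python
def new_hardware_share(step_weight, new_devices, existing_total_weight):
    """Approximate fraction of the ring assigned to the new devices at one step."""
    new_total = step_weight * new_devices
    return new_total / (new_total + existing_total_weight)


# Weight steps taken from the SAL entries in this task.
weight_steps = [2000, 3000, 3500, 4000]
# Hypothetical values for illustration only.
new_devices = 12
existing_total_weight = 48000

shares = [
    new_hardware_share(weight, new_devices, existing_total_weight)
    for weight in weight_steps
]
for weight, share in zip(weight_steps, shares):
    print(f"weight {weight}: new hardware holds about {share:.1%} of the ring")
```

Each step moves only part of the data at a time, so replication traffic stays bounded between increases.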