Reimage labstore1001 and labstore1002 for DRBD storage setup
Closed, Resolved · Public

Description

We are coming up on a month of the "old" labstore setup being dormant. We have monitoring and backups in place for the new setup. I am making this task to outline the next phase: preparing labstore1001/1002 for a setup similar to the existing labstore1004/1005, and closing the outstanding legacy tasks related to the old setup.

Related hardware:

Currently:

Shelves are physically attached to labstore1001 only (even though the naming in racktables confusingly suggests otherwise). Both labstore1001 and labstore1002 have H800 RAID cards.

Proposal:

Labstore1001 - attached to labstore1001-array1 and labstore1001-array2
Labstore1002 - attached to labstore1002-array1 and labstore1002-array2
labstore1001-array3 - spare for now
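
As a quick sanity check on the proposal above, here is a minimal sketch that encodes the shelf assignment as data and verifies that no shelf is attached to two hosts. The host and shelf names are taken from this task; the `check_layout` helper and the two-shelves-per-host rule are illustrative assumptions, not part of any existing tooling.

```python
# Proposed shelf-to-host attachment, copied from the proposal above.
PROPOSED_LAYOUT = {
    "labstore1001": ["labstore1001-array1", "labstore1001-array2"],
    "labstore1002": ["labstore1002-array1", "labstore1002-array2"],
}
SPARE_SHELVES = ["labstore1001-array3"]


def check_layout(layout, spares, shelves_per_host=2):
    """Hypothetical helper: return a list of problems (empty if consistent)."""
    problems = []
    seen = {}  # shelf name -> host it is attached to
    for host, shelves in layout.items():
        if len(shelves) != shelves_per_host:
            problems.append(
                f"{host} has {len(shelves)} shelves, expected {shelves_per_host}"
            )
        for shelf in shelves:
            if shelf in seen:
                problems.append(
                    f"{shelf} is attached to both {seen[shelf]} and {host}"
                )
            seen[shelf] = host
    for shelf in spares:
        if shelf in seen:
            problems.append(
                f"{shelf} is listed as spare but attached to {seen[shelf]}"
            )
    return problems


print(check_layout(PROPOSED_LAYOUT, SPARE_SHELVES))
```

An empty list means every shelf has exactly one owner and the spare is not double-booked.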

Procedure:

  • Move labstore1002 to another rack (currently both are in C3)
  • Adjust shelf arrangement to match the above proposal
  • Set up disks in BIOS for the new shelf arrangement. Propose RAID10.
  • Reinstall OS on both servers (probably need a recipe applied)
  • Direct-cable eth1 between the two servers
  • Puppetize a redundant storage setup that we can move scratch and maps to (could we squeeze out enough here to move dumps to it for migration purposes / maintenance?)
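
For the capacity question in the last step, a rough sketch of the RAID10 arithmetic may help: RAID10 mirrors every disk, so usable space is half the raw space. The disk count and disk size below are hypothetical placeholders, not the actual shelf specs, which this task does not state.

```python
def raid10_usable_tb(disk_count, disk_size_tb):
    """Usable capacity of one RAID10 set: half the raw capacity (mirrored pairs)."""
    if disk_count < 4 or disk_count % 2 != 0:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    return (disk_count // 2) * disk_size_tb


# Hypothetical figures: 12 disks of 2 TB per shelf, two shelves per host.
per_shelf_tb = raid10_usable_tb(12, 2.0)
per_host_tb = per_shelf_tb * 2
print(f"usable per shelf: {per_shelf_tb} TB, per host: {per_host_tb} TB")
```

Note that with DRBD replicating between the two hosts, the second host holds a copy rather than extra space, so the per-host figure is the ceiling for the replicated volumes.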

Event Timeline

chasemp created this task. Feb 15 2017, 3:05 PM
Restricted Application added a subscriber: Aklapper. · Feb 15 2017, 3:05 PM

Note from T98183 (labstore1002 issues while trying to reboot): it is possible the H800 card in labstore1002 has an issue. We have some extra cards on hand IIRC, but my guess is that this was a red herring at the time and that the card itself is fine, considering the surrounding factors.

greg added a subscriber: greg. Feb 15 2017, 7:01 PM

(those tasks above that this task was mentioned in were all(?) in #wikimedia-incident as a follow-up/action item, should this one be as well?)

I have no objection :) but the real resolution was moving off of this setup entirely so it probably doesn't make sense to keep this task itself carrying the tag. But also, sure?

Change 339577 had a related patch set uploaded (by Madhuvishy):
labstore: Cleanup old/unused labstore1001 nfs related puppet files

https://gerrit.wikimedia.org/r/339577

Change 339577 merged by Madhuvishy:
[operations/puppet] labstore: Cleanup old/unused labstore1001 nfs related puppet files

https://gerrit.wikimedia.org/r/339577

Change 366977 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] install_server: Add new partman recipe for labstore100[1-2]

https://gerrit.wikimedia.org/r/366977

Change 366977 merged by Madhuvishy:
[operations/puppet@production] install_server: Add new partman recipe for labstore100[1-2]

https://gerrit.wikimedia.org/r/366977

Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts:

['labstore1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707212336_madhuvishy_17526.log.

Completed auto-reimage of hosts:

['labstore1001.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts:

['labstore1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707220011_madhuvishy_28252.log.

Completed auto-reimage of hosts:

['labstore1002.eqiad.wmnet']

and were ALL successful.
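
When cross-referencing the two reimage runs above, the log file names can be split into their parts. The sketch below assumes the leading 12 digits are a `YYYYMMDDHHMM` start time and the middle field is the user; that reading is inferred only from the two file names in this task, and the meaning of the trailing number is not stated here, so it is kept as an opaque `suffix`. The `parse_reimage_log` helper is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern: <YYYYMMDDHHMM>_<user>_<number>.log
LOG_NAME = re.compile(r"(?P<stamp>\d{12})_(?P<user>[^_/]+)_(?P<suffix>\d+)\.log$")


def parse_reimage_log(path):
    """Hypothetical helper: split a wmf-auto-reimage log path into its fields."""
    match = LOG_NAME.search(path)
    if match is None:
        raise ValueError(f"unrecognized log name: {path}")
    return {
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        "suffix": int(match["suffix"]),  # meaning not documented in this task
    }


for path in (
    "/var/log/wmf-auto-reimage/201707212336_madhuvishy_17526.log",
    "/var/log/wmf-auto-reimage/201707220011_madhuvishy_28252.log",
):
    print(parse_reimage_log(path))
```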

bd808 moved this task from Backlog to Shared Storage on the Data-Services board. Jul 24 2017, 12:40 AM

These have been reimaged with jessie, but I'm wondering if it would be better to use stretch from the start; stretch doesn't have the NFS kernel/userspace mismatch we have in jessie (a backported kernel with the standard jessie NFS userspace).

Dzahn added a subscriber: Dzahn. Nov 22 2017, 3:50 AM

@chasemp Should we reinstall these with stretch? I noticed them in site.pp with a comment leading to this ticket.

madhuvishy closed this task as Resolved. Jan 5 2018, 6:52 PM

We'll do the upgrade to stretch for all labstore servers as a separate step, after testing stretch for NFS. We would like to keep the OS versions the same across all the labstores to minimize operational overhead. I'm opening a separate task, T184290, for upgrading labstore* to stretch, and resolving this one for now.