
Plug in ex-graphite2001 SSDs to recover coal data
Closed, ResolvedPublic

Description

We'll need to plug in the SSDs that were in graphite2001 and got replaced with newer/bigger SSDs (the new SSDs were received in T157153).
Papaul, assuming the old SSDs are not wiped yet, can we connect the four SSDs to a spare system and boot it? It should work as-is, I think; none of those SSDs in codfw have been found faulty yet.

A potential issue is that when the new spare system is booted up, it will attempt to use the IP of the existing graphite2001 system, so the checklist below should be followed. Since the old system (graphite2001) was in row B and the spare system is in row C, we should be fairly safe from an IP mismatch. The network port will be disabled, and the row switch stack the spare system is on cannot route row B IP addresses.

  • - select spare system WMF6406 for use, as it has SFF bays.
  • - install extra drive trays into WMF6406; you may have to use spare onsite trays, or steal them from another spare system (like WMF6407, but please note where they come from, since taking trays from the other spare system will tie up both systems).
  • - install all 4 SSDs into the spare system.
  • - determine whether the SSD installation was correct by attempting to boot - this will take a few attempts, and the system will NOT have a network connection enabled, just the mgmt connection.

Basically we need to try to set up the old SSDs in a spare system and boot them. Once the system boots and the RAID is assembled, we can then attempt to copy over the data.
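
A minimal sketch of how that verification could look from a root shell over the mgmt serial console, assuming the array is software RAID managed by mdadm (a hardware controller would need its own utility instead); the md device name is illustrative:

# confirm all four SSDs are detected
lsblk -o NAME,SIZE,MODEL,SERIAL

# check software RAID assembly/rebuild status
cat /proc/mdstat
mdadm --detail /dev/md0    # /dev/md0 is a placeholder for the actual md device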

Once SSDs are working, the following steps must be taken:

The rest of the checklist wasn't done, since we never enabled networking or let this come fully online. Instead we copied the data via a USB memory stick and the serial console. Since the system was never fully online, it has now been reclaimed as part of T162900.

Event Timeline

Once SSDs are working, the following steps must be taken:

  • - add new DNS entries for the temp use of graphite2003 (this spare system with the old SSDs)
  • - update the network config of the system via a root session on the mgmt serial console to use the new IP address (a sketch follows this list)
  • - update and enable the network port for the system (vlan, description, etc.)
  • - once system is online, hand this task to @fgiunchedi to copy data
  • - once data copy is complete, the system will need to be decommissioned fully to go back to spares. At that point, please hand this task back to @RobH.
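
A minimal sketch of the re-addressing step flagged above, assuming a Debian-style /etc/network/interfaces edited over the mgmt serial console; the interface name, address, netmask, and gateway are placeholders, not the real row C values:

# /etc/network/interfaces (placeholder values)
auto eth0
iface eth0 inet static
    address 10.192.32.100
    netmask 255.255.252.0
    gateway 10.192.32.1

# apply the change
ifdown eth0 && ifup eth0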

decom steps

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (include role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed (see the sketch after these steps)

END NON-INTERRUPTIBLE STEPS
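
A sketch of the puppet/salt cleanup step above, run from the puppetmaster and the salt master respectively; the FQDN is a placeholder for whatever hostname the temp system ends up with:

# on the puppetmaster: remove the node (revokes its cert)
puppet node clean graphite2003.codfw.wmnet
# mark the node deactivated in puppetdb
puppet node deactivate graphite2003.codfw.wmnet

# on the salt master: delete the minion's key
salt-key -d graphite2003.codfw.wmnet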

  • - SSDs wiped (by onsite; a sketch follows this list)
  • - HDDs added back and system returned to pre-SSD installation configuration.
  • - switch port configuration reverted back to spare pool status.
  • - mgmt dns entries for hostname removed, asset tag entries left in place.
  • - system added back to spares tracking
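
A sketch of one way the onsite SSD wipe could be done from a live OS, purely as an illustration (this is not the documented onsite procedure); the device name is a placeholder:

# fast wipe via TRIM/discard, preferred for SSDs when supported
blkdiscard /dev/sdb

# fallback: single-pass overwrite if discard isn't available
shred -v -n 1 /dev/sdb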

So I can log in to the mgmt of the system (WMF6406) via serial and see that it is rebuilding the RAID.

@Papaul, did the system have enough drive trays, or did you have to steal from another spare system?

Change 345177 had a related patch set uploaded (by RobH):
[operations/dns@master] adding temp host graphite2003

https://gerrit.wikimedia.org/r/345177

Change 345177 merged by RobH:
[operations/dns@master] adding temp host graphite2003

https://gerrit.wikimedia.org/r/345177

Ok, the networking on this is being shitty. @Papaul: Can you just plug a usb stick into this system? I'll format it and copy the coal data over. Once I do that, you can move it from this host to graphite2001, and @fgiunchedi can copy the data.

Please plug in a usb stick and assign back to me. (The data on the stick may be lost, so please don't use one that cannot be formatted.)

Thanks. We'll also need to devise a way to merge the data carefully, but let's first move the data to a different location on the new drive. (The time series databases are stored per metric, not per time slice, and we don't want to overwrite the recent data either.)

Agreed, we've used https://github.com/graphite-project/carbonate in the past with success to merge whisper files and it seems to do what it says on the tin.
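
If memory serves, carbonate includes a whisper-fill tool (the exact command name can vary with packaging) that copies datapoints from a source whisper file into the destination only where the destination has no data, which matches the "don't overwrite recent data" requirement above. A rough sketch, with placeholder paths for the restored copy and the live coal files:

# backfill each restored metric into its live counterpart without clobbering newer points
for f in /srv/coal-restored/coal/*.wsp; do
    whisper-fill "$f" "/var/lib/coal/$(basename "$f")"
done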

Papaul triaged this task as Medium priority. Mar 29 2017, 3:09 PM
Papaul subscribed.

Flash Drive in place

So this system is booted to OS with NO networking.

The usb stick is mounted as /mnt/sde/

All the data can be copied over, but the system must be accessed via serial console.
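
For reference, one way to reach that serial console, assuming the mgmt interface supports IPMI serial-over-LAN; the mgmt hostname and user here are placeholders, not the real ones:

ipmitool -I lanplus -H wmf6406.mgmt.codfw.wmnet -U root sol activate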

Assigning to @fgiunchedi for feedback. Please comment with the data needing copy, or just access and copy it to usb. Then this can be assigned to @Papaul to plug the usb stick in to the current graphite box for data migration.

Thanks @RobH! I've archived and copied /var/lib/coal to /mnt/sde and unmounted the usb drive. @Papaul you can plug it into graphite2001 at any time.
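
Roughly what that looks like on the donor host, as an approximation of the actual commands (the usb partition name is an assumption):

# mount the stick, archive the coal data, checksum it, and unmount
mount /dev/sde1 /mnt/sde
tar -czf /mnt/sde/coal.tar.gz -C /var/lib coal
md5sum /mnt/sde/coal.tar.gz
umount /mnt/sde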

root@graphite2001:~# md5sum /mnt/sde/coal.tar.gz 
aa1b03a318cded59cc8d9502c3997d4e  /mnt/sde/coal.tar.gz
root@graphite2001:~# ls -la !$
ls -la /mnt/sde/coal.tar.gz
-rwxr-xr-x 1 root root 38554741 Apr 10 04:07 /mnt/sde/coal.tar.gz

@fgiunchedi the flash drive is now in graphite2001

Thanks @Papaul! I've copied the coal data off the usb drive; you can unplug it. I suppose once that's done this task goes back to @RobH for advice/completion.
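
On the receiving side, a sketch of verifying and unpacking into a staging directory (the staging path is a placeholder) rather than straight over the live coal data, per the merge discussion above:

md5sum /mnt/sde/coal.tar.gz        # should match aa1b03a318cded59cc8d9502c3997d4e
mkdir -p /srv/coal-restored
tar -xzf /mnt/sde/coal.tar.gz -C /srv/coal-restored    # yields /srv/coal-restored/coal/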

@RobH the flash drive has been removed from graphite2001

RobH updated the task description.

@Krinkle the restored files are on graphite2001. I've opened T163194: Backfill restored coal whisper files with current data to follow up on the actual backfill. Note I won't have time to work on it this week, though if you want to take a stab at it, all the files should be readable.