
Plug in ex-graphite2001 SSDs to recover coal data
Closed, ResolvedPublic

Description

We'll need to plug in the SSDs that were in graphite2001 and got replaced with newer/bigger SSDs (the new SSDs were received in T157153).
Papaul, assuming the old SSDs are not wiped yet, can we connect the four SSDs to a spare system and boot it? It should work as-is, I think; none of those SSDs in codfw have been found faulty yet.

A potential issue is that when the new spare system is booted up, it will attempt to use the IP of the existing graphite2001 system, so the checklist below should be followed. Since the old system (graphite2001) was in row B and the spare system is in row C, we should be fairly safe from an IP mismatch. The network port will be disabled, and the row switch stack the spare system is on cannot route row B IP addresses.

  • - select spare system WMF6406 for use, as it has SFF bays.
  • - install extra drive trays into WMF6406; you may have to use spare onsite trays, or steal them from another spare system (like WMF6407, but please note where they come from, since taking trays from the other spare system will tie up both systems).
  • - install all 4 SSDs into the spare system.
  • - determine whether the SSD installation was correct by attempting to boot - this will take a few attempts, and the system will NOT have a network connection enabled, just the mgmt connection.

Basically we need to try to set up the old SSDs in a spare system and boot them. Once the system boots and the RAID is assembled, we can then attempt to copy over the data.
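
A minimal sketch of how that verification could look from a root shell over the mgmt serial console, assuming the array is software RAID managed by mdadm (a hardware controller would need its own utility instead); the md device name is illustrative:

# confirm all four SSDs are detected
lsblk -o NAME,SIZE,MODEL,SERIAL

# check software RAID assembly/rebuild status
cat /proc/mdstat
mdadm --detail /dev/md0    # /dev/md0 is a placeholder for the actual md device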

Once SSDs are working, the following steps must be taken:

The rest of the checklist wasn't done, since we never enabled networking or let this come fully online. Instead we copied the data via a USB memory stick and the serial console. Since the system was never fully online, it has now been reclaimed as part of T162900.

Event Timeline

Once SSDs are working, the following steps must be taken:

  • - add new DNS entries for the temp use of graphite2003 (this spare system with the old SSDs)
  • - update the network config of the system via a root session on the mgmt serial console to use the new IP address (a sketch follows this list)
  • - update and enable the network port for the system (vlan, description, etc.)
  • - once system is online, hand this task to @fgiunchedi to copy data
  • - once data copy is complete, the system will need to be decommissioned fully to go back to spares. At that point, please hand this task back to @RobH.
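
A minimal sketch of the re-addressing step flagged above, assuming a Debian-style /etc/network/interfaces edited over the mgmt serial console; the interface name, address, netmask, and gateway are placeholders, not the real row C values:

# /etc/network/interfaces (placeholder values)
auto eth0
iface eth0 inet static
    address 10.192.32.100
    netmask 255.255.252.0
    gateway 10.192.32.1

# apply the change
ifdown eth0 && ifup eth0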

decom steps

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove from site.pp (replace with role::spare if the system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (include role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed (see the sketch after these steps)

END NON-INTERRUPTIBLE STEPS
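
A sketch of the puppet/salt cleanup step above, run from the puppetmaster and the salt master respectively; the FQDN is a placeholder for whatever hostname the temp system ends up with:

# on the puppetmaster: remove the node (revokes its cert)
puppet node clean graphite2003.codfw.wmnet
# mark the node deactivated in puppetdb
puppet node deactivate graphite2003.codfw.wmnet

# on the salt master: delete the minion's key
salt-key -d graphite2003.codfw.wmnet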

  • - SSDs wiped (by onsite; a sketch follows this list)
  • - HDDs added back and system returned to pre-SSD installation configuration.
  • - switch port configuration reverted back to spare pool status.
  • - mgmt dns entries for hostname removed, asset tag entries left in place.
  • - system added back to spares tracking
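
A sketch of one way the onsite SSD wipe could be done from a live OS, purely as an illustration (this is not the documented onsite procedure); the device name is a placeholder:

# fast wipe via TRIM/discard, preferred for SSDs when supported
blkdiscard /dev/sdb

# fallback: single-pass overwrite if discard isn't available
shred -v -n 1 /dev/sdb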

So I can log in to the mgmt of the system (WMF6406) via serial and see that it is rebuilding the RAID.

@Papaul, did the system have enough drive trays, or did you have to steal from another spare system?

Change 345177 had a related patch set uploaded (by RobH):
[operations/dns@master] adding temp host graphite2003

https://gerrit.wikimedia.org/r/345177

Change 345177 merged by RobH:
[operations/dns@master] adding temp host graphite2003

https://gerrit.wikimedia.org/r/345177

Ok, the networking on this is being shitty. @Papaul: Can you just plug a usb stick into this system? I'll format it and copy the coal data over. Once I do that, you can move it from this host to graphite2001, and @fgiunchedi can copy the data.

Please plug in a usb stick and assign back to me. (The data on the stick may be lost, so please don't use one that cannot be formatted.)

Thanks. We'll also need to devise a way to merge the data carefully, but let's first move the data to a different location on the new drive. (The time series databases are stored per metric, not per time slice, and we don't want to overwrite the recent data either.)

Agreed, we've used https://github.com/graphite-project/carbonate in the past with success to merge whisper files and it seems to do what it says on the tin.
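
If memory serves, carbonate includes a whisper-fill tool (the exact command name can vary with packaging) that copies datapoints from a source whisper file into the destination only where the destination has no data, which matches the "don't overwrite recent data" requirement above. A rough sketch, with placeholder paths for the restored copy and the live coal files:

# backfill each restored metric into its live counterpart without clobbering newer points
for f in /srv/coal-restored/coal/*.wsp; do
    whisper-fill "$f" "/var/lib/coal/$(basename "$f")"
done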

Papaul triaged this task as Medium priority. Mar 29 2017, 3:09 PM
Papaul subscribed.

Flash Drive in place

So this system is booted to OS with NO networking.

The usb stick is mounted as /mnt/sde/

All the data can be copied over, but the system must be accessed via serial console.
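
For reference, one way to reach that serial console, assuming the mgmt interface supports IPMI serial-over-LAN; the mgmt hostname and user here are placeholders, not the real ones:

ipmitool -I lanplus -H wmf6406.mgmt.codfw.wmnet -U root sol activate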

Assigning to @fgiunchedi for feedback. Please comment with the data needing copy, or just access and copy it to usb. Then this can be assigned to @Papaul to plug the usb stick in to the current graphite box for data migration.

Thanks @RobH! I've archived and copied /var/lib/coal to /mnt/sde and unmounted the usb drive. @Papaul you can plug it into graphite2001 at any time.
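
Roughly what that looks like on the donor host, as an approximation of the actual commands (the usb partition name is an assumption):

# mount the stick, archive the coal data, checksum it, and unmount
mount /dev/sde1 /mnt/sde
tar -czf /mnt/sde/coal.tar.gz -C /var/lib coal
md5sum /mnt/sde/coal.tar.gz
umount /mnt/sde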

root@graphite2001:~# md5sum /mnt/sde/coal.tar.gz 
aa1b03a318cded59cc8d9502c3997d4e  /mnt/sde/coal.tar.gz
root@graphite2001:~# ls -la !$
ls -la /mnt/sde/coal.tar.gz
-rwxr-xr-x 1 root root 38554741 Apr 10 04:07 /mnt/sde/coal.tar.gz

@fgiunchedi the flash drive is now in graphite2001

Thanks @Papaul! I've copied the coal data off the usb drive; you can unplug it. I suppose once that's done this task goes back to @RobH for advice/completion.
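
On the receiving side, a sketch of verifying and unpacking into a staging directory (the staging path is a placeholder) rather than straight over the live coal data, per the merge discussion above:

md5sum /mnt/sde/coal.tar.gz        # should match aa1b03a318cded59cc8d9502c3997d4e
mkdir -p /srv/coal-restored
tar -xzf /mnt/sde/coal.tar.gz -C /srv/coal-restored    # yields /srv/coal-restored/coal/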

@RobH the flash drive has been removed from graphite2001

RobH updated the task description.

@Krinkle the restored files are on graphite2001. I've opened T163194: Backfill restored coal whisper files with current data to follow up on the actual backfill. Note I won't have time to work on it this week, though if you want to take a stab at it, all the files should be readable.