
reinstall logstash1001-1003
Closed, Resolved · Public

Description

This is the tracking task for the reinstallation of logstash1001-1003. Once the new logstash1004-1006 are fully online, we should be able to start reinstalling these (so the OS has RAID1 like the new hosts) and getting them off precise and onto jessie.

This task is initially being assigned to @bd808, since he'll be fully implementing the service on logstash1004-1006. Once we are ready to start reinstalling the older systems, he can assign this back to me with details.

Related Objects

Event Timeline

RobH created this task. · Apr 29 2015, 4:41 PM
RobH updated the task description. (Show Details)
RobH raised the priority of this task to Normal.
RobH assigned this task to bd808.
RobH added subscribers: RobH, bd808.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 29 2015, 4:41 PM
RobH added a subscriber: Cmjohnson. · Apr 29 2015, 4:42 PM

Details needed for reinstall:

  • rolling reinstallations (where only one is offline) or batch?
  • we discussed in irc removing the larger hard disks, since these hosts will no longer keep the Elasticsearch data. Are dual 500GB disks then sufficient for this? (If so, we'll create a sub-task for the onsite tech (@Cmjohnson) in ops-eqiad to swap the disks out AFTER wiping them. Keep in mind, this wipe will require the host to be depooled for 24 hours; we don't want to put unwiped disks back on a shelf as spares. If the 24-hour downtime isn't acceptable, we can discuss moving these disks into a spare box for wiping, but that is labor-intensive for on-site and not preferred.)
  • if it's just dual 500GB, we'll use the standard raid1/lvm formatting. (This formatting normally creates a /srv partition for all our larger data items; would that work for you for the new roles of these systems?)
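As a side note on the raid1/lvm layout discussed above: after a reinstall, the health of the software RAID1 arrays can be checked by parsing /proc/mdstat. A minimal sketch follows; the sample text is illustrative only, not captured from these hosts.

```python
import re

def raid1_degraded(mdstat_text):
    """Return the names of md arrays with a missing or failed member.

    Parses the '[UU]' style status flags from /proc/mdstat output;
    an underscore marks an absent member (e.g. '[U_]').
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r'^(md\d+)\s*:', line)
        if m:
            current = m.group(1)
        flags = re.search(r'\[(U|_)+\]', line)
        if flags and current and '_' in flags.group(0):
            degraded.append(current)
    return degraded

# Illustrative mdstat snippet: md0 healthy, md1 running on one disk.
sample = """\
md0 : active raid1 sda1[0] sdb1[1]
      487731200 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0]
      975296 blocks super 1.2 [2/1] [U_]
"""
print(raid1_degraded(sample))  # ['md1']
```

On a live host the input would come from `open('/proc/mdstat').read()` instead of the sample string.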
bd808 added a comment. · May 4 2015, 7:14 PM

When we are ready to do this we should also move two of the three boxes to new racks. Today all three are in the same rack behind the same switch, which can lead to catastrophic downtime rather than graceful service degradation in the face of certain hardware failures.

faidon added a subscriber: faidon. · May 4 2015, 7:15 PM

(preferably, different rows too, for even more protection)

Restricted Application added a subscriber: Matanya. · View Herald Transcript · Jul 7 2015, 9:50 PM
bd808 added a subscriber: Gage. (Edited) · Jul 9 2015, 3:53 PM

Once I complete T105101: Upgrade Logstash Elasticsearch cluster to 1.6.0, I think we will be ready to start rebuilding logstash100[1-3]. I think this is the list of things we want to do as part of this task:

  • T98042: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie
  • Remove the 2 Seagate ST3000DM001-1CH1 3TB storage disks we added to each host so the disks can go into the spare pool
  • Relocate 2 of the 3 boxes to new racks and rows so that the failure of a single switch doesn't take out all logstash event ingestion and kibana frontends
  • Reimage hosts using jessie base image
  • Apply logstash, kibana and logstash::apifeatureusage Puppet roles
  • Profit!

Each of the three hosts is a SPOF for some log event ingestion.

  • logstash1001: HHVM, Apache2, CX, SCA
  • logstash1002: OCG, Hadoop, IPSEC
  • logstash1003: Parsoid

Additionally, MediaWiki is configured to randomly select a server from the pool for each individual request's log events. This combination means that any downtime will result in some log event loss, and the longer each node is down, the more events will be lost. Coordinating rotating all of these services to alternate hosts would be possible, but techops will need to help determine whether that much coordination is necessary or whether we can instead accept some amount of log event loss during the upgrade.
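The per-request random selection described above implies an expected loss rate of roughly 1/N when one of N hosts is offline. A small simulation sketch (hypothetical host list; the real MediaWiki configuration is not shown here):

```python
import random

# Hypothetical pool mirroring the per-request random selection
# described above.
LOGSTASH_POOL = ["logstash1001", "logstash1002", "logstash1003"]

def pick_log_host(pool, down=frozenset()):
    """Pick a destination at random; an event sent to a down host is lost."""
    host = random.choice(pool)
    return None if host in down else host

random.seed(42)  # fixed seed so the simulation is repeatable
sent = [pick_log_host(LOGSTASH_POOL, down={"logstash1002"})
        for _ in range(30000)]
loss_rate = sent.count(None) / len(sent)
print(loss_rate)  # close to 1/3 with one of three hosts offline
```

This is why taking hosts down one at a time loses about a third of events for the duration, rather than all of them.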

bd808 reassigned this task from bd808 to RobH. · Jul 11 2015, 9:58 PM

Assigning back to @RobH so he can coordinate the next steps based on the rough outline in T97545#1441645.

RobH added a comment. · Jul 14 2015, 6:27 PM

It appears that we are now at the stage of relocating two of the three systems into different racks, correct?

We'll have @Cmjohnson then remove the larger capacity disks and relocate two of the three into different racks.

With the above, it seems safe to proceed to take these systems offline for these changes. Please advise if that isn't so. (I've also pinged @bd808 in irc to confirm this.)

> It appears that we are now at the stage of relocating two of the three systems into different racks, correct?
>
> We'll have @Cmjohnson then remove the larger capacity disks and relocate two of the three into different racks.
>
> With the above, it seems safe to proceed to take these systems offline for these changes. Please advise if that isn't so. (I've also pinged @bd808 in irc to confirm this.)

Yes, this should be safe to do at any time. There will be some loss of log event data while the systems are offline but that shouldn't be the end of the world. Logstash100{1,2,3} currently hold none of the actual stored log data and instead only provide log ingestion via Logstash and the Kibana frontend at https://logstash.wikimedia.org/.

If they are reimaged before T98042 is done then we will need to manually copy over and install the debs to get them back online and processing logs.

I will be traveling on 2015-07-20 and probably not online until the SF afternoon on 2015-07-21 so it would be great if they were not reimaged before 2015-07-22 unless someone else can be sure to babysit the process of getting them back to work.

RobH reassigned this task from RobH to Cmjohnson. · Jul 22 2015, 7:21 PM

I'm going to re-assign this from myself to @Cmjohnson, as the remaining work is onsite disk swaps and server relocations, followed by reinstallation.

Chris: The next steps are outlined above in the last two updates.

The racks are prepped for the move and the switches have been updated.
DNS patch is ready to merge: https://gerrit.wikimedia.org/r/#/c/226722/

Just need to do the physical removal of disks and move each server.

logstash1001 is going to row A4 ge-4/0/13
logstash1003 is going to row D3 ge-3/0/16

Change 227245 had a related patch set uploaded (by Cmjohnson):
Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545

https://gerrit.wikimedia.org/r/227245

Change 227245 merged by Cmjohnson:
Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545

https://gerrit.wikimedia.org/r/227245

Change 227258 had a related patch set uploaded (by BryanDavis):
logstash: change ip address for logstash1001 and logstash1003

https://gerrit.wikimedia.org/r/227258

Change 227258 merged by jenkins-bot:
logstash: change ip address for logstash1001 and logstash1003

https://gerrit.wikimedia.org/r/227258

Cmjohnson reassigned this task from Cmjohnson to bd808. · Jul 28 2015, 3:50 PM

The on-site portion of this task has been completed. Assigning to Bryan to complete and resolve.

bd808 moved this task from Backlog to Archive on the Wikimedia-Logstash board. · Jul 28 2015, 4:50 PM
bd808 closed this task as Resolved. · Jul 28 2015, 4:52 PM

All 3 hosts are up and running jessie with elasticsearch 1.7.0 and logstash 1.4.2.
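A final version check like the one behind this comment can be done against the Elasticsearch root HTTP endpoint. A minimal sketch of parsing that response; the sample JSON below is illustrative, not an actual capture from these hosts.

```python
import json

def es_version(info_json):
    """Extract the Elasticsearch version number from the root
    endpoint's JSON response (e.g. `curl http://localhost:9200/`)."""
    return json.loads(info_json)["version"]["number"]

# Illustrative response body for this cluster era.
sample = '{"status": 200, "name": "logstash1001", "version": {"number": "1.7.0"}}'
print(es_version(sample))  # 1.7.0
```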