
reinstall logstash1001-1003
Closed, Resolved · Public

Description

This is the tracking task for the reinstallation of logstash1001-1003. Once the new logstash1004-1006 are fully online, we should be able to start reinstalling these (so the OS has RAID1 like the new hosts) and getting them off precise and onto jessie.

This task is initially being assigned to @bd808, since he'll be fully implementing the service on logstash1004-1006. Once we are ready to start reinstalling the older systems, he can assign this back to me with details.

Related Objects

Event Timeline

RobH created this task. · Apr 29 2015, 4:41 PM
RobH updated the task description. (Show Details)
RobH raised the priority of this task to Normal.
RobH assigned this task to bd808.
RobH added subscribers: RobH, bd808.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 29 2015, 4:41 PM
RobH added a subscriber: Cmjohnson. · Apr 29 2015, 4:42 PM

Details needed for reinstall:

  • rolling reinstallations (where only one is offline) or batch?
  • we discussed in irc removing the larger hard disks, since these hosts will no longer keep the Elasticsearch data. Are dual 500GB disks then sufficient for this? (If so, we'll create a sub-task for the onsite tech (@Cmjohnson) in ops-eqiad to swap the disks out AFTER wiping them. Keep in mind, this wipe will require the host to be depooled for 24 hours; we don't want to put unwiped disks back on a shelf as spares. If the 24-hour downtime isn't acceptable, we can discuss moving these disks into a spare box for wiping, but that is labor-intensive for on-site and not preferred.)
  • if it's just dual 500GB, we'll use the standard raid1/lvm formatting. (This formatting normally creates a /srv partition for all our larger data items; would that work for you for the new roles of these systems?)
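As a side note on the raid1/lvm layout discussed above: after a reinstall, the health of the software RAID1 arrays can be checked by parsing /proc/mdstat. A minimal sketch follows; the sample text is illustrative only, not captured from these hosts.

```python
import re

def raid1_degraded(mdstat_text):
    """Return the names of md arrays with a missing or failed member.

    Parses the '[UU]' style status flags from /proc/mdstat output;
    an underscore marks an absent member (e.g. '[U_]').
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r'^(md\d+)\s*:', line)
        if m:
            current = m.group(1)
        flags = re.search(r'\[(U|_)+\]', line)
        if flags and current and '_' in flags.group(0):
            degraded.append(current)
    return degraded

# Illustrative mdstat snippet: md0 healthy, md1 running on one disk.
sample = """\
md0 : active raid1 sda1[0] sdb1[1]
      487731200 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0]
      975296 blocks super 1.2 [2/1] [U_]
"""
print(raid1_degraded(sample))  # ['md1']
```

On a live host the input would come from `open('/proc/mdstat').read()` instead of the sample string.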
bd808 added a comment. · May 4 2015, 7:14 PM

When we are ready to do this we should also move two of the three boxes to new racks. Today all three are in the same rack behind the same switch, which can lead to catastrophic downtime rather than graceful service degradation in the face of certain hardware failures.

faidon added a subscriber: faidon. · May 4 2015, 7:15 PM

(preferably, different rows too, for even more protection)

Restricted Application added a subscriber: Matanya. · View Herald Transcript · Jul 7 2015, 9:50 PM
bd808 added a subscriber: Gage. (Edited) · Jul 9 2015, 3:53 PM

Once I complete T105101: Upgrade Logstash Elasticsearch cluster to 1.6.0, I think we will be ready to start rebuilding logstash100[1-3]. I think this is the list of things we want to do as part of this task:

  • T98042: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie
  • Remove the 2 Seagate ST3000DM001-1CH1 3TB storage disks we added to each host so the disks can go into the spare pool
  • Relocate 2 of the 3 boxes to new racks and rows so that the failure of a single switch doesn't take out all logstash event ingestion and kibana frontends
  • Reimage hosts using jessie base image
  • Apply logstash, kibana and logstash::apifeatureusage Puppet roles
  • Profit!

Each of the three hosts is a SPOF for some log event ingestion.

  • logstash1001: HHVM, Apache2, CX, SCA
  • logstash1002: OCG, Hadoop, IPSEC
  • logstash1003: Parsoid

Additionally, MediaWiki is configured to randomly select a server from the pool for each individual request's log events. This combination means that any downtime will result in some log event loss, and the longer each node is down, the more events will be lost. Coordinating rotating all of these services to alternate hosts would be possible, but techops will need to help determine whether that much coordination is necessary or whether we can instead accept some amount of log event loss during the upgrade.
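The per-request random selection described above implies an expected loss rate of roughly 1/N when one of N hosts is offline. A small simulation sketch (hypothetical host list; the real MediaWiki configuration is not shown here):

```python
import random

# Hypothetical pool mirroring the per-request random selection
# described above.
LOGSTASH_POOL = ["logstash1001", "logstash1002", "logstash1003"]

def pick_log_host(pool, down=frozenset()):
    """Pick a destination at random; an event sent to a down host is lost."""
    host = random.choice(pool)
    return None if host in down else host

random.seed(42)  # fixed seed so the simulation is repeatable
sent = [pick_log_host(LOGSTASH_POOL, down={"logstash1002"})
        for _ in range(30000)]
loss_rate = sent.count(None) / len(sent)
print(loss_rate)  # close to 1/3 with one of three hosts offline
```

This is why taking hosts down one at a time loses about a third of events for the duration, rather than all of them.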

bd808 reassigned this task from bd808 to RobH. · Jul 11 2015, 9:58 PM

Assigning back to @RobH so he can coordinate the next steps based on the rough outline in T97545#1441645.

RobH added a comment. · Jul 14 2015, 6:27 PM

It appears that we are now at the stage of relocating two of the three systems into different racks, correct?

We'll have @Cmjohnson then remove the larger capacity disks and relocate two of the three into different racks.

With the above, it seems safe to proceed to take these systems offline for these changes. Please advise if that isn't so. (I've also pinged @bd808 in irc to confirm this.)

> It appears that we are now at the stage of relocating two of the three systems into different racks, correct?
>
> We'll have @Cmjohnson then remove the larger capacity disks and relocate two of the three into different racks.
>
> With the above, it seems safe to proceed to take these systems offline for these changes. Please advise if that isn't so. (I've also pinged @bd808 in irc to confirm this.)

Yes, this should be safe to do at any time. There will be some loss of log event data while the systems are offline but that shouldn't be the end of the world. Logstash100{1,2,3} currently hold none of the actual stored log data and instead only provide log ingestion via Logstash and the Kibana frontend at https://logstash.wikimedia.org/.

If they are reimaged before T98042 is done then we will need to manually copy over and install the debs to get them back online and processing logs.

I will be traveling on 2015-07-20 and probably not online until the SF afternoon on 2015-07-21 so it would be great if they were not reimaged before 2015-07-22 unless someone else can be sure to babysit the process of getting them back to work.

RobH reassigned this task from RobH to Cmjohnson. · Jul 22 2015, 7:21 PM

I'm going to re-assign this from myself to @Cmjohnson, as the remaining work is onsite disk swaps and server relocations, followed by reinstallation.

Chris: The next steps are outlined above in the last two updates.

The racks are prepped for the move and the switches have been updated.
DNS patch is ready to merge: https://gerrit.wikimedia.org/r/#/c/226722/

Just need to do the physical removal of disks and move each server.

logstash1001 is going to row A4 ge-4/0/13
logstash1003 is going to row D3 ge-3/0/16

Change 227245 had a related patch set uploaded (by Cmjohnson):
Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545

https://gerrit.wikimedia.org/r/227245

Change 227245 merged by Cmjohnson:
Updating dhcp file for logstash1001-1003 to use jessie installer per phab task https://phabricator.wikimedia.org/T97545

https://gerrit.wikimedia.org/r/227245

Change 227258 had a related patch set uploaded (by BryanDavis):
logstash: change ip address for logstash1001 and logstash1003

https://gerrit.wikimedia.org/r/227258

Change 227258 merged by jenkins-bot:
logstash: change ip address for logstash1001 and logstash1003

https://gerrit.wikimedia.org/r/227258

Cmjohnson reassigned this task from Cmjohnson to bd808. · Jul 28 2015, 3:50 PM

The on-site portion of this task has been completed. Assigning to Bryan to complete and resolve.

bd808 moved this task from Backlog to Archive on the Wikimedia-Logstash board. · Jul 28 2015, 4:50 PM
bd808 closed this task as Resolved. · Jul 28 2015, 4:52 PM

All 3 hosts are up and running jessie with elasticsearch 1.7.0 and logstash 1.4.2.
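A final version check like the one behind this comment can be done against the Elasticsearch root HTTP endpoint. A minimal sketch of parsing that response; the sample JSON below is illustrative, not an actual capture from these hosts.

```python
import json

def es_version(info_json):
    """Extract the Elasticsearch version number from the root
    endpoint's JSON response (e.g. `curl http://localhost:9200/`)."""
    return json.loads(info_json)["version"]["number"]

# Illustrative response body for this cluster era.
sample = '{"status": 200, "name": "logstash1001", "version": {"number": "1.7.0"}}'
print(es_version(sample))  # 1.7.0
```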