all log producers need to use the logstash LVS endpoint
Closed, Resolved | Public

Description

Since we are replacing logstash100[1-3] with logstash100[7-9], all log producers that currently access one of the logstash nodes directly need to be reconfigured. Using the LVS endpoint (logstash.svc.eqiad.wmnet) is the obvious solution.
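As a rough illustration of the audit this implies, here is a minimal Python sketch that flags lines of a config file that still name one of the old nodes and rewrites them to the LVS endpoint. The function names and the sample keys are hypothetical, and the assumption that old nodes appear as `logstash1001` to `logstash1003` with an optional domain suffix is mine; the actual changes in this task were made by hand in puppet.

```python
import re

# Assumption: old nodes are written as logstash1001..logstash1003,
# optionally followed by a domain suffix (for example ".eqiad.wmnet").
OLD_NODE = re.compile(r"\blogstash100[1-3](?:\.[a-z0-9.-]+)?\b")

# The LVS endpoint named in the task description.
LVS_ENDPOINT = "logstash.svc.eqiad.wmnet"


def find_direct_references(text):
    """Return (line_number, line) pairs that mention an old logstash node."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if OLD_NODE.search(line)
    ]


def use_lvs_endpoint(text):
    """Replace every direct reference to an old node with the LVS endpoint."""
    return OLD_NODE.sub(LVS_ENDPOINT, text)


if __name__ == "__main__":
    # Hypothetical config snippet; the key names are made up for the example.
    sample = (
        "logstash_host: logstash1001.eqiad.wmnet\n"
        "other_host: logstash1008.eqiad.wmnet\n"
    )
    for number, line in find_direct_references(sample):
        print(f"line {number}: {line}")
    print(use_lvs_endpoint(sample))
```

The new nodes (logstash100[7-9]) are deliberately left untouched by the pattern, since only direct references to the nodes being replaced need to change.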

Event Timeline

It seems striker, maps and aqs need fixing.

Change 376500 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] service: Use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/376500

Mentioned in SAL (#wikimedia-operations) [2017-09-27T12:47:34Z] <akosiaris> T175242 disable puppet across aqs kafka maps maps-test ores restbase restbase-dev sca scb wtp clusters for merging https://gerrit.wikimedia.org/r/#/c/376500/

Change 376500 merged by Alexandros Kosiaris:
[operations/puppet@production] service: Use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/376500

Mentioned in SAL (#wikimedia-operations) [2017-09-27T12:54:53Z] <akosiaris> T175242 enabled puppet in aqs kafka maps maps-test selected hosts and ran puppet manually.

Mentioned in SAL (#wikimedia-operations) [2017-09-27T13:01:10Z] <akosiaris> T175242 tilerator and tileratorui need manually restart

Mentioned in SAL (#wikimedia-operations) [2017-09-27T13:15:52Z] <akosiaris> T175242 restbase requires manual restart

Mentioned in SAL (#wikimedia-operations) [2017-09-27T13:25:22Z] <akosiaris> T175242 parsoid requires manual restart

Mentioned in SAL (#wikimedia-operations) [2017-09-27T13:31:33Z] <akosiaris> T175242 eventstreams requires manual restart

Mentioned in SAL (#wikimedia-operations) [2017-09-27T13:43:55Z] <akosiaris> T175242 re-enable puppet across aqs kafka maps maps-test ores restbase restbase-dev sca scb wtp clusters for merging https://gerrit.wikimedia.org/r/#/c/376500/. Run puppet as well in a batched execution

Mentioned in SAL (#wikimedia-operations) [2017-09-27T14:03:48Z] <akosiaris> T175242 restart tilerator, tileratorui, restbase across the fleet to pick up the change in a rolling restart manner with a batch size of 2

Mentioned in SAL (#wikimedia-operations) [2017-09-27T14:16:30Z] <akosiaris> T175242 restart parsoid across the fleet to pick up the change in a rolling restart manner with a batch size of 5

Mentioned in SAL (#wikimedia-operations) [2017-09-27T14:17:37Z] <akosiaris> T175242 restart eventstreams across the fleet to pick up the change in a rolling restart manner with a batch size of 2
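For readers unfamiliar with the "rolling restart with a batch size of N" wording in the entries above, the idea can be sketched as follows. This is not the tooling that was actually used; `rolling_restart`, `batches`, and the `restart` callback are hypothetical names, and a real run would normally wait for each batch to become healthy before moving on.

```python
from typing import Callable, Iterable, Iterator, List


def batches(hosts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    batch: List[str] = []
    for host in hosts:
        batch.append(host)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


def rolling_restart(
    hosts: Iterable[str], size: int, restart: Callable[[str], None]
) -> int:
    """Restart hosts one batch at a time and return the number of batches.

    `restart` is a caller-supplied (hypothetical) function that restarts the
    service on a single host. Hosts within a batch are handled one after
    another here; real tooling would typically restart them in parallel.
    """
    count = 0
    for batch in batches(hosts, size):
        for host in batch:
            restart(host)
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder host names, not real fleet members.
    fleet = ["host-a", "host-b", "host-c", "host-d", "host-e"]
    rolling_restart(fleet, 2, lambda host: print(f"restarting {host}"))
```

A small batch size keeps most of the fleet serving traffic while each group picks up the new logging configuration.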

Change 380991 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/380991

Change 380992 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] aqs: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380992

Change 380993 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] striker: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380993

Change 380994 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] mediawiki: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380994

Change 380995 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] ocg: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380995

Change 380991 merged by Gehel:
[operations/puppet@production] elasticsearch: use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/380991

Mentioned in SAL (#wikimedia-operations) [2017-09-28T10:30:45Z] <gehel> restart elasticsearch on relforge to validate new logging config - T175242

Change 380992 merged by Elukey:
[operations/puppet@production] aqs: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380992

Change 380994 merged by Gehel:
[operations/puppet@production] mediawiki: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380994

Mentioned in SAL (#wikimedia-operations) [2017-10-04T12:00:51Z] <gehel> mediawiki now uses the LVS endpoint for logstash - T175242

Correction, https://gerrit.wikimedia.org/r/380994 is actually a noop, cleaning up a default that is overwritten more globally in hieradata/common.yaml.

Change 380993 merged by Gehel:
[operations/puppet@production] striker: switch to LVS endpoint for logstash

https://gerrit.wikimedia.org/r/380993

Change 383097 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] logstash: all log producers need to use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383097

Change 383098 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] maps: all log producers need to use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383098

Change 383098 merged by Gehel:
[operations/puppet@production] maps: all log producers need to use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383098

Change 383146 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] logstash: update logstash_syslog common hiera parameter to point to LVS.

https://gerrit.wikimedia.org/r/383146

Change 383147 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] [test] mediawiki: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/383147

Change 383147 merged by Gehel:
[operations/puppet@production] [test] mediawiki: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/383147

Change 383355 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/mediawiki-config@master] use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383355

Change 383097 merged by Gehel:
[operations/puppet@production] logstash: all log producers need to use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383097

Change 380995 abandoned by Gehel:
ocg: switch to LVS endpoint for logstash

Reason:
OCG is being decommed

https://gerrit.wikimedia.org/r/380995

Change 383355 merged by jenkins-bot:
[operations/mediawiki-config@master] use the logstash LVS endpoint

https://gerrit.wikimedia.org/r/383355

Mentioned in SAL (#wikimedia-operations) [2017-11-02T13:11:24Z] <zfilipin@tin> Synchronized wmf-config/ProductionServices.php: SWAT: [[gerrit:383355|use the logstash LVS endpoint (T175242)]] (duration: 00m 51s)

Change 388052 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] cassandra: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/388052

Change 383146 merged by Gehel:
[operations/puppet@production] logstash: update logstash_syslog common hiera parameter to point to LVS.

https://gerrit.wikimedia.org/r/383146

Change 388426 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] udp2log: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/388426

Change 388052 merged by Gehel:
[operations/puppet@production] cassandra: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/388052

A short tcpdump session indicates that the only log producers still using logstash100[123] are udp2log and elasticsearch. The Elasticsearch restart is in progress; for udp2log, https://gerrit.wikimedia.org/r/#/c/388426/ still needs to be merged. Another check will be needed before actually decommissioning those servers.
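A check like this can be summarized with a short script. The sketch below counts packets per sender from captured text, assuming the usual one-line IPv4 output of `tcpdump -n` (`IP src.port > dst.port:`). The function name, addresses, and ports are made up for illustration (the addresses come from the documentation range 192.0.2.0/24); the actual session was inspected by hand.

```python
import re
from collections import Counter

# Assumption: lines look like the one-line IPv4 output of `tcpdump -n`, e.g.
#   14:03:48.000001 IP 192.0.2.10.41234 > 192.0.2.1.10514: UDP, length 212
PACKET = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+):"
)


def senders_by_destination(lines, destinations):
    """Count packets per (source, destination) pair for the given destinations.

    `destinations` is a set of IP addresses of the servers being retired.
    Lines that do not match the expected format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = PACKET.search(line)
        if match and match.group("dst") in destinations:
            counts[(match.group("src"), match.group("dst"))] += 1
    return counts


if __name__ == "__main__":
    # Invented capture lines; 192.0.2.1 stands in for an old logstash node.
    capture = [
        "14:03:48.000001 IP 192.0.2.10.41234 > 192.0.2.1.10514: UDP, length 212",
        "14:03:48.000002 IP 192.0.2.11.51000 > 192.0.2.1.10514: UDP, length 80",
        "14:03:49.000003 IP 192.0.2.10.41234 > 192.0.2.1.10514: UDP, length 90",
    ]
    for (src, dst), n in senders_by_destination(capture, {"192.0.2.1"}).items():
        print(f"{src} -> {dst}: {n} packets")
```

An empty result over a long enough capture is the signal that no producer is still talking to the old nodes directly.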

Change 388426 merged by Gehel:
[operations/puppet@production] udp2log: use LVS endpoint for logstash

https://gerrit.wikimedia.org/r/388426

Mentioned in SAL (#wikimedia-operations) [2017-12-04T14:37:15Z] <gehel@tin> Started deploy [kartotherian/deploy@e166d87]: dummy kartotherian deployment to test udp2log config change - T175242

Mentioned in SAL (#wikimedia-operations) [2017-12-04T14:37:25Z] <gehel@tin> Finished deploy [kartotherian/deploy@e166d87]: dummy kartotherian deployment to test udp2log config change - T175242 (duration: 00m 03s)

All references to logstash100[123] have been removed from puppet. I'll still check that no traffic is reaching those servers (we might have something outside of puppet) and then start decommissioning them.

Monitoring traffic for a few hours on logstash100[123] shows that nothing is coming into any of the logstash ports. Thanks to everyone who helped move this forward!