setup/install eventlog1002.eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the setup and implementation of eventlog1002 (WMF4751). Once online, it will replace eventlog1001, which will be decommissioned.

Event Timeline

RobH triaged this task as Medium priority.

Change 406127 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] setting dns entries for eventlog1002

https://gerrit.wikimedia.org/r/406127

Change 406127 merged by RobH:
[operations/dns@master] setting dns entries for eventlog1002

https://gerrit.wikimedia.org/r/406127

@Ottomata: eventlog1001 is trusty. Can eventlog1002 be stretch or does it need to be an older distro? Please advise and assign back to me, thanks!

Change 406129 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] eventlog1002 install params

https://gerrit.wikimedia.org/r/406129

Change 406129 merged by RobH:
[operations/puppet@production] eventlog1002 install params

https://gerrit.wikimedia.org/r/406129

faidon renamed this task from setup/install evenlog1002.eqiad.wmnet to setup/install eventlog1002.eqiad.wmnet. Jan 25 2018, 6:30 PM
RobH updated the task description.

@RobH, sorry I missed your ping on this, Trusty please! :)

Trusty has about a year left of upstream support, and likely less for our own purposes. Any reason not to switch to something more recent while we're at it?

We are holding for Kubernetes! :) When it is ready, we will move the many individual processes (which are managed and monitored via custom upstart-based eventloggingctl scripts written by Ori) to k8s. We'd prefer not to do a bunch of work to change what we have now, only to do it again (hopefully next FY) when we move to k8s.

I had a look at both modules/eventlogging/files/eventloggingctl and modules/eventlogging/templates/upstart/*. They all seemed fairly easy to reimplement with systemd (with or without templates; for the former, a good reference would be e.g. the Tor package's units). It all feels to me like less than a day's effort, unless I've gravely misunderstood how this all works and am underestimating it.

IMHO, it would be better to make this effort now, rather than being pressured to do this Kubernetes migration in less than a year's time. Such an interdependency has the potential to become one of our last trusty stragglers, so I'd like to untangle it and get it out of the way, unless you feel strongly against this :)

I actually tried to move to systemd a couple of years ago. I don't remember the exact details, but there were some serious difficulties in automatically registering groups of processes to be managed together, even with templates. Wildcards (*) sort of worked, but not in all cases.

I actually tried to move to systemd a couple of years ago.

But T114199 was for jessie, the first Debian release with systemd by default. A lot has happened since then (a full two years of activity in a fairly lively upstream project), so we should really revisit this with stretch.
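One concrete point on wildcards, which may or may not be the problem hit back then: systemctl glob patterns only match units that are currently loaded in memory, so a pattern like 'eventlogging-processor@*' can skip instances that are stopped or were never started. A rough sketch of one way around that is to build the instance list from configuration and name every unit explicitly. Everything below is an assumption for illustration: the unit template name, the process kinds, and the instance names are hypothetical and not taken from the eventlogging module.

```python
import shlex

# Hypothetical naming scheme for systemd template instances; the real
# unit names are not specified anywhere in this task.
TEMPLATE = "eventlogging-{kind}@{name}.service"


def instance_units(configs):
    """Turn {kind: [instance names]} into a sorted list of explicit unit names."""
    return sorted(
        TEMPLATE.format(kind=kind, name=name)
        for kind, names in configs.items()
        for name in names
    )


def systemctl_command(action, units):
    """Build one systemctl invocation that names every instance explicitly,
    instead of relying on a glob that only matches loaded units."""
    if not units:
        raise ValueError("no units to act on")
    return ["systemctl", action, *units]


if __name__ == "__main__":
    # Hypothetical instances, e.g. derived from one config file per process.
    configs = {
        "processor": ["client-side-00", "client-side-01"],
        "forwarder": ["legacy-zmq"],
    }
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(systemctl_command("restart", instance_units(configs))))
```

The command is printed rather than executed, so this can be tried anywhere. A grouping unit that the instances declare PartOf= on would be another option to evaluate in the research spike.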

Install is blocked by a network issue; the onsite work is detailed on T186252.

Change 407677 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] fixing eventlog1002 entry

https://gerrit.wikimedia.org/r/407677

Change 407677 merged by RobH:
[operations/dns@master] fixing eventlog1002 entry

https://gerrit.wikimedia.org/r/407677

RobH added a subscriber: elukey.

So due to both Faidon's and Moritz's comments, I've gone ahead and installed with stretch. If it needs to be reimaged to fall back to an older distro, then dhcp will have to be updated and the reimage run again.

The initial puppet run is currently in progress. I'm escalating this to @elukey or @Ottomata (both have been involved in the setup of this host, so I'm not sure who is ideal).

I had a look at both modules/eventlogging/files/eventloggingctl and modules/eventlogging/templates/upstart/*. They all seemed fairly easy to reimplement with systemd (with or without templates; for the former, a good reference would be e.g. the Tor package's units). It all feels to me like less than a day's effort, unless I've gravely misunderstood how this all works and am underestimating it.

We could do a research spike of one day and attempt to get something working in labs, and then discuss whether it is worth keeping trusty/upstart or not?

All the work to move eventlogging to systemd is going to be tracked in https://phabricator.wikimedia.org/T114199; let's use this task only for eventlog1002's productionization.

As a preparation step for the migration, I discovered that eventlog1001 seems to receive udp traffic from mwlog* hosts, which it parses to generate mw.errors.* metrics. I opened a task to remove all this code since it seems deprecated, but it might take a bit of time: https://phabricator.wikimedia.org/T188749

I also pinged folks in https://gerrit.wikimedia.org/r/#/c/415218/; it would be great not to create a ZMQ forwarder on eventlog1002.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['eventlog1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201803081249_elukey_9444.log.

Completed auto-reimage of hosts:

['eventlog1002.eqiad.wmnet']

and were ALL successful.

elukey updated the task description.

Change 418714 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Fix eventlog1002's ipv6 address

https://gerrit.wikimedia.org/r/418714

Change 418714 merged by Elukey:
[operations/dns@master] Fix eventlog1002's ipv6 address

https://gerrit.wikimedia.org/r/418714