| Status | Assigned | Task |
|---|---|---|
| Resolved | Milimetric | T130247 Operational improvements and maintenance in EventLogging in Q4 {oryx} |
| Resolved | hashar | T97402 Make services manageable by systemd |
| Resolved | elukey | T114199 Upgrade eventlogging servers to Stretch |
| Resolved | elukey | T188749 Deprecation of mw.errors.* metrics |
| Resolved | Ottomata | T185667 setup/install eventlog1002.eqiad.wmnet |
| Resolved | Cmjohnson | T185668 apply hostname labels to eventlog1001/WMF4751 |
| Resolved | Cmjohnson | T186252 check eventlog1002 production network cable |
| | | Restricted Task |
| Resolved | Cmjohnson | T189566 Decommission eventlog1001 |
Event Timeline
decided not to do this for now: systemd too complicated for event logging, didn't fit with the current setup
Could you or someone else elaborate a little bit on what's the new plan? A migration to systemd (& jessie) eventually is inevitable and a migration away from precise is needed ASAP, so I'm not sure why this was closed as declined. Thanks in advance :)
eventlog1001 is Trusty, not Precise, so we didn't think it was urgent.
Closing this doesn't mean we won't do it, it just means we aren't letting it take up any headspace. I suppose we could keep it open on the backlog...
In short, there was no easy way to dynamically manage all, or groups of, eventlogging processes in the same way that is done via upstart and Ori's eventloggingctl script. The upstart-managed processes can listen to 'events' and respond appropriately. From my attempt to do this (I spent almost a week on it), systemd didn't have a good way to group processes. I could address them via wildcards, which worked for some actions but not all, and I could declare dependencies between processes... but this didn't work well either.
I tried many variations on Filippo's PartOf/WantedBy suggestion, but it didn't quite work. For example (I'm writing from memory here), I tried a dummy service called 'eventlogging' with all the different processes PartOf eventlogging. After running service eventlogging stop the services would stop, but then they would no longer be associated with eventlogging afterwards. A subsequent service eventlogging status would only show the dummy service and none of the real ones. I suppose systemd forgets them somehow? (Although, looking at Filippo's comment again, I don't remember trying this RemainAfterExit=true thing...hm.).
Thanks :) Not urgent, but needs to happen at some point regardless (upstart is pretty dead, even in Ubuntu), so keeping this open sounds like a plan.
Did you try @ units as well? (e.g. eventlogging@.service)?
Templated units, right? Ja I tried that. IIRC, that doesn't help much with grouping of services, just with DRYing them.
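For reference, the layout being discussed (a dummy umbrella unit plus a templated unit whose instances are PartOf it) can be sketched roughly as below. This is only an illustration, not the configuration that was actually tried: the unit names, the daemon path, and the scratch directory are all hypothetical, and the script only writes the files out so they can be inspected.

```shell
#!/bin/sh
# Sketch only: writes two hypothetical unit files into a scratch directory
# so the layout can be inspected without touching /etc/systemd/system.
set -eu

UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"

# Dummy umbrella unit: runs /bin/true and stays "active (exited)" thanks to
# RemainAfterExit, so a later stop/restart can be propagated to its members.
cat > "$UNIT_DIR/eventlogging.service" <<'EOF'
[Unit]
Description=EventLogging umbrella unit (hypothetical)

[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

# Template unit, one instance per daemon (instance names are hypothetical),
# e.g. eventlogging-daemon@processor.service.
cat > "$UNIT_DIR/eventlogging-daemon@.service" <<'EOF'
[Unit]
Description=EventLogging daemon (instance %i, hypothetical)
# Stop/restart of the umbrella unit is propagated to every instance.
PartOf=eventlogging.service

[Service]
# Hypothetical command line; the real daemons are started differently.
ExecStart=/usr/local/bin/eventlogging-daemon %i
Restart=on-failure

[Install]
# Enabling an instance links it into eventlogging.service.wants/,
# so starting the umbrella unit also starts the instance.
WantedBy=eventlogging.service
EOF

echo "unit files written to $UNIT_DIR"
```

With something like this, PartOf covers stop and restart, while the WantedBy link created at enable time covers start. Neither makes `systemctl status eventlogging` list the instances, which matches what was observed above.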
So restarting this work to see how we can proceed to move Eventlogging to systemd. I'd start from the last comment from Andrew, related to daemon grouping:
This seems to be what happens as well for thumbor:
elukey@thumbor1001:~$ sudo systemctl status thumbor-instances
● thumbor-instances.service - thumbor instances
Loaded: loaded (/lib/systemd/system/thumbor-instances.service; static)
Active: active (exited) since Wed 2018-02-21 08:47:10 UTC; 1 day 2h ago
Process: 34203 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 34203 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/thumbor-instances.service
Feb 21 08:47:10 thumbor1001 systemd[1]: Stopping thumbor instances...
Feb 21 08:47:10 thumbor1001 systemd[1]: Starting thumbor instances...
Feb 21 08:47:10 thumbor1001 systemd[1]: Started thumbor instances.
elukey@thumbor1001:~$ sudo systemctl status thumbor@8801.service
● thumbor@8801.service - Thumbor image manipulation service (instance 8801)
Loaded: loaded (/lib/systemd/system/thumbor@.service; enabled)
Active: active (running) since Wed 2018-02-21 08:47:11 UTC; 1 day 2h ago
Main PID: 34392 (firejail)
CGroup: /system.slice/system-thumbor.slice/thumbor@8801.service
├─34392 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8801 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8801 ...
├─34402 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8801 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8801 ...
└─34570 /usr/bin/python /usr/bin/thumbor --port 8801 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/
etc.
So status doesn't show all the thumbor instance units, but I am pretty sure that systemctl start|stop thumbor-instances works as expected. The eventloggingctl script for upstart uses a "hack" to show the status of the eventlogging daemons:
case "$command" in
status)
initctl list | grep -Po '(?<=eventlogging/)(?!init).*' | sort -k5 \
| sed 's/, process//' | column -ts'( )' \
| perl -pe 'END { exit $status } $status=1 if /stop\/waiting/;'
;;
Not sure if I am missing something, but I'd proceed this way:
- create a dummy eventlogging service, using PartOf/WantedBy and @units for each EL daemon.
- adapt eventloggingctl to show what we want (if needed), even if with systemctl we should have everything that we need.
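To make the second point concrete, here is a minimal sketch of what a systemd-era status helper could look like. It is an assumption, not the eventloggingctl that was later merged: the function name and the `eventlogging-*` unit glob are made up for illustration. Like the upstart hack, it prints one line per daemon and exits non-zero if any of them is not running.

```shell
#!/bin/sh
# Sketch only: a systemd counterpart of the upstart status hack.
# Reads "UNIT LOAD ACTIVE SUB DESCRIPTION" rows on stdin, prints one
# line per unit, and returns non-zero if any unit is not active/running.
summarize_units() {
    awk '
        { printf "%-45s %s/%s\n", $1, $3, $4 }
        $3 != "active" || $4 != "running" { bad = 1 }
        END { exit bad }
    '
}

# Intended usage on a real host (the unit glob is an assumption):
#   systemctl list-units --all --plain --no-legend 'eventlogging-*' | summarize_units
```

Keeping the parsing in a function that reads stdin means it can be tried out with canned input on a machine that has no eventlogging units at all.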
Change 413362 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] [WIP] eventlogging: add systemd support
@elukey, the error you were getting in deployment prep was caused by https://github.com/dpkp/kafka-python/pull/828, which breaks kafka-python with newer kafka broker versions. https://gerrit.wikimedia.org/r/#/c/415378/ updates to 1.4.1. I've built and included this in apt for jessie and stretch. I'm going to assume we won't need it for trusty, since it looks like we will be able to move to eventlog1002 with systemd before we migrate eventlogging to jumbo in T183297
Change 413362 merged by Elukey:
[operations/puppet@production] eventlogging: add systemd support
Change 416389 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::tcpircbot: remove eventlog1001 references
Some things that would be nice to complete before the migration:
- https://gerrit.wikimedia.org/r/#/c/415887/ - remove any dependency on mwlog and (unencrypted) UDP traffic. The mw.errors.* metrics do not seem to be used anymore.
- https://gerrit.wikimedia.org/r/#/c/416389/ - references to eventlog1001 in tcpircbot's firewall rules. As far as I can see this is an old leftover from vanadium, not really needed anymore.
- https://gerrit.wikimedia.org/r/#/c/415218/ - coal migration to a Kafka consumer (should happen this week). That would free us from deploying the zmq-forwarder, and we would no longer need any coordination with Performance when migrating to eventlog1002.
Change 416405 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: add eventloggingctl script for systemd
Change 416405 merged by Elukey:
[operations/puppet@production] eventlogging: add eventloggingctl script for systemd
Change 416389 merged by Elukey:
[operations/puppet@production] profile::tcpircbot: remove eventlog1001 references
Change 416471 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: remove zmq-forwarder
Change 417240 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::eventlogging:analytics: include zmq config only on eventlog1001
Change 417240 merged by Elukey:
[operations/puppet@production] role::eventlogging:analytics: include zmq config only on eventlog1001
Change 417242 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Apply role::eventlogging::analytics to eventlog1002
Change 417244 had a related patch set uploaded (by Elukey; owner: Elukey):
[eventlogging/scap/analytics@master] Update beta and prod endpoints for systemd migration.
Change 417244 merged by Elukey:
[eventlogging/scap/analytics@master] Update beta and prod endpoints for systemd migration.
Change 417242 merged by Elukey:
[operations/puppet@production] Apply role::eventlogging::analytics to eventlog1002
So current situation:
- on eventlog1002 all daemons but zmq-forwarder are running fine (stretch/systemd)
- on eventlog1001 the zmq-forwarder is still running until coal is migrated to Kafka
Change 417317 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: fix the nagios check when using systemd
Change 417317 merged by Elukey:
[operations/puppet@production] eventlogging: fix the nagios check when using systemd
Change 417322 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::rsync::eventlogging: change rsync target to eventlog1002
Change 417322 merged by Elukey:
[operations/puppet@production] statistics::rsync::eventlogging: change rsync target to eventlog1002
Change 417978 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Reduce references and roles for eventlog1001
Change 418953 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: reduce eventlog1001's scope
Change 418953 merged by Elukey:
[operations/puppet@production] eventlogging: reduce eventlog1001's scope
Updates:
- eventlog1001 runs a special puppet role called eventlogging::analytics::legacy that enforces only the presence of the zmq-forwarder. This host will need to stay up until https://gerrit.wikimedia.org/r/#/c/415218/ is merged; after that we'll proceed with the decom (T189566).
- eventlog1002 is running fine, and the rsyncs from stat1005 are backing up the files in /srv/log/eventlogging daily as expected. There were some bits to fix (IPv6 AAAA record, analytics VLAN rules, etc.) but the important ones are now fixed.
Change 422135 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: move alarms from graphite to prometheus
Change 422135 merged by Elukey:
[operations/puppet@production] eventlogging: move alarms from graphite to prometheus