Upgrade eventlogging servers to Stretch
Closed, Resolved · Public · 21 Estimate Story Points

Description

Eventlogging no longer works well on Precise because of unsatisfied dependencies for python-pykafka, and it no longer installs properly on Trusty because of unsatisfied dependencies for python-etcd.

See also:
T109567
T112688

We should just upgrade eventlogging hosts to Jessie.

Event Timeline

Ottomata raised the priority of this task from to Medium.
Ottomata updated the task description. (Show Details)
Ottomata added subscribers: Ottomata, Krinkle.
Restricted Application added a subscriber: Aklapper. Sep 30 2015, 12:09 AM
Jay8g added a subscriber: Jay8g.Oct 11 2015, 12:54 AM
Nuria set Security to None.
Nuria added a subscriber: Nuria.
Milimetric closed this task as Declined.Jun 2 2016, 4:52 PM
Milimetric added a subscriber: Milimetric.

Decided not to do this for now: systemd is too complicated for eventlogging and didn't fit with the current setup.

faidon reopened this task as Open.Jun 6 2016, 12:43 PM
faidon added a subscriber: faidon.

Could you or someone else elaborate a little bit on what's the new plan? A migration to systemd (& jessie) eventually is inevitable and a migration away from precise is needed ASAP, so I'm not sure why this was closed as declined. Thanks in advance :)

eventlog1001 is Trusty, not Precise, so we didn't think it was urgent.

Closing this doesn't mean we won't do it, it just means we aren't letting it take up any headspace. I suppose we could keep it open on the backlog...

In short, there was no easy way to dynamically manage all or groups of eventlogging processes in the same way that is done via upstart and Ori's eventloggingctl script. The upstart-managed processes can listen to 'events' and respond appropriately. From my attempt to do this (I spent almost a week on it), systemd didn't have a good way to group processes. I could address them via wildcards, which worked for some actions but not all, and I could declare dependencies between processes... but this didn't work well either.

I tried many variations on Filippo's PartOf/WantedBy suggestion, but it didn't quite work. For example (I'm writing from memory here), I tried a dummy service called 'eventlogging' with all the different processes PartOf eventlogging. After running service eventlogging stop the services would stop, but then they would no longer be associated with eventlogging afterwards. A subsequent service eventlogging status would only show the dummy service and none of the real ones. I suppose systemd forgets them somehow? (Although, looking at Filippo's comment again, I don't remember trying this RemainAfterExit=true thing...hm.).

faidon added a comment.Jun 6 2016, 2:57 PM

Thanks :) Not urgent, but needs to happen at some point regardless (upstart is pretty dead, even in Ubuntu), so keeping this open sounds like a plan.

Did you try @ units as well? (e.g. eventlogging@.service)?

Templated units, right? Ja I tried that. IIRC, that doesn't help much with grouping of services, just with DRYing them.

Krinkle removed a subscriber: Krinkle.Jul 4 2016, 5:46 PM
elukey added a subscriber: elukey.
elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board.Jul 18 2017, 9:43 AM
mforns moved this task from Dashiki to Backlog (Later) on the Analytics board.Jul 31 2017, 3:40 PM
Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board.Jan 3 2018, 10:48 PM

So restarting this work to see how we can proceed to move Eventlogging to systemd. I'd start from the last comment from Andrew, related to daemon grouping:

I tried many variations on Filippo's PartOf/WantedBy suggestion, but it didn't quite work. For example (I'm writing from memory here), I tried a dummy service called 'eventlogging' with all the different processes PartOf eventlogging. After running service eventlogging stop the services would stop, but then they would no longer be associated with eventlogging afterwards. A subsequent service eventlogging status would only show the dummy service and none of the real ones. I suppose systemd forgets them somehow? (Although, looking at Filippo's comment again, I don't remember trying this RemainAfterExit=true thing...hm.).

This seems to be what happens as well for thumbor:

elukey@thumbor1001:~$ sudo systemctl status thumbor-instances
● thumbor-instances.service - thumbor instances
   Loaded: loaded (/lib/systemd/system/thumbor-instances.service; static)
   Active: active (exited) since Wed 2018-02-21 08:47:10 UTC; 1 day 2h ago
  Process: 34203 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 34203 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/thumbor-instances.service

Feb 21 08:47:10 thumbor1001 systemd[1]: Stopping thumbor instances...
Feb 21 08:47:10 thumbor1001 systemd[1]: Starting thumbor instances...
Feb 21 08:47:10 thumbor1001 systemd[1]: Started thumbor instances.

elukey@thumbor1001:~$ sudo systemctl status thumbor@8801.service
● thumbor@8801.service - Thumbor image manipulation service (instance 8801)
   Loaded: loaded (/lib/systemd/system/thumbor@.service; enabled)
   Active: active (running) since Wed 2018-02-21 08:47:11 UTC; 1 day 2h ago
 Main PID: 34392 (firejail)
   CGroup: /system.slice/system-thumbor.slice/thumbor@8801.service
           ├─34392 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8801 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8801 ...
           ├─34402 /usr/bin/firejail --profile=/etc/firejail/thumbor.profile --env=TMPDIR=/srv/thumbor/tmp/thumbor@8801 --env=MAGICK_TEMPORARY_PATH=/srv/thumbor/tmp/thumbor@8801 ...
           └─34570 /usr/bin/python /usr/bin/thumbor --port 8801 --ip 127.0.0.1 --keyfile /etc/thumbor.key --conf /etc/thumbor.d/

etc..

So status doesn't show all the thumbor instance units, but I am pretty sure that systemctl start|stop thumbor-instances works as expected. The eventloggingctl script for upstart uses a "hack" to show the status of the eventlogging daemons:

case "$command" in
    status)
        initctl list | grep -Po '(?<=eventlogging/)(?!init).*' | sort -k5 \
                | sed 's/, process//' | column -ts'( )' \
                | perl -pe 'END { exit $status } $status=1 if /stop\/waiting/;'
        ;;
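
With systemd, a similar summary could probably be derived from systemctl list-units instead of parsing initctl list. A rough sketch, assuming the daemons' unit names all start with eventlogging- (a hypothetical naming scheme, not confirmed anywhere in this task); the function names are also made up for illustration:

```shell
#!/bin/sh
# Sketch only: the 'eventlogging-*' unit name pattern is an assumption.

# Reads "UNIT LOAD ACTIVE SUB DESCRIPTION..." lines on stdin, prints
# "unit active/sub" for each one, and returns 1 if any unit is not active,
# mirroring the non-zero exit status of the upstart version.
summarize_units() {
    awk '
        BEGIN   { bad = 0 }
        NF >= 4 { printf "%s %s/%s\n", $1, $3, $4; if ($3 != "active") bad = 1 }
        END     { exit bad }
    '
}

# Lists every matching unit (including inactive ones) without the header,
# footer, or status bullets, then summarizes them.
eventlogging_status() {
    systemctl list-units --all --plain --no-legend 'eventlogging-*' | summarize_units
}
```

Keeping the parsing in summarize_units, separate from the systemctl call, means the summary logic can be exercised with canned input on a host that has no eventlogging units.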

Not sure if I am missing something but I'd proceed in this way:

  1. create a dummy eventlogging service, using PartOf/WantedBy and @units for each EL daemon.
  2. adapt eventloggingctl to show what we want (if needed), although systemctl should already give us everything that we need.
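
For reference, step 1 could look roughly like the sketch below. The unit names (eventlogging.service, eventlogging-processor@.service), the ExecStart path and the config layout are all hypothetical and are not taken from the puppet change; the dummy unit simply follows the same /bin/true pattern visible in the thumbor-instances output above. The script writes the unit files into a scratch directory rather than /etc/systemd/system so they can be inspected safely.

```shell
#!/bin/sh
# Sketch only: unit names, the ExecStart path and the config layout are
# hypothetical. Writes into the directory given as $1, not into /etc.
write_eventlogging_units() {
    dest="$1"
    mkdir -p "$dest" || return 1

    # Dummy "umbrella" unit: runs /bin/true and stays active (exited), so
    # stop/restart requests on it propagate to units that are PartOf= it.
    cat > "$dest/eventlogging.service" <<'EOF'
[Unit]
Description=Eventlogging umbrella unit (hypothetical)

[Service]
Type=oneshot
ExecStart=/bin/true
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOF

    # Template unit: one instance per daemon, e.g. eventlogging-processor@foo.
    # PartOf= covers stop/restart; WantedBy= (once the instance is enabled)
    # makes starting the umbrella unit pull the instance in.
    cat > "$dest/eventlogging-processor@.service" <<'EOF'
[Unit]
Description=Eventlogging processor (instance %i)
PartOf=eventlogging.service
After=eventlogging.service

[Service]
ExecStart=/usr/local/bin/eventlogging-processor-hypothetical @/etc/eventlogging.d/processors/%i
Restart=on-failure

[Install]
WantedBy=eventlogging.service
EOF
}
```

Calling write_eventlogging_units /tmp/el-units would produce both files for review. Note that PartOf= only propagates stop and restart, which is why the WantedBy= half is needed for start.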
elukey edited projects, added Analytics-Kanban; removed Analytics.Feb 22 2018, 5:04 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 413362 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] [WIP] eventlogging: add systemd support

https://gerrit.wikimedia.org/r/413362

Ottomata renamed this task from Upgrade eventlogging servers to Jessie to Upgrade eventlogging servers to Stretch.Feb 28 2018, 8:19 PM

@elukey, the error you were getting in deployment prep was caused by https://github.com/dpkp/kafka-python/pull/828, which breaks kafka-python with newer kafka broker versions. https://gerrit.wikimedia.org/r/#/c/415378/ updates to 1.4.1. I've built and included this in apt for jessie and stretch. I'm going to assume we won't need it for trusty, since it looks like we will be able to move to eventlog1002 with systemd before we migrate eventlogging to jumbo in T183297

Ottomata assigned this task to elukey.Feb 28 2018, 8:22 PM

Change 413362 merged by Elukey:
[operations/puppet@production] eventlogging: add systemd support

https://gerrit.wikimedia.org/r/413362

Change 416389 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::tcpircbot: remove eventlog1001 references

https://gerrit.wikimedia.org/r/416389

elukey added a comment.Mar 5 2018, 8:35 AM

Some things that would be nice to complete before the migration:

  • https://gerrit.wikimedia.org/r/#/c/415218/ - coal migration to a Kafka consumer (should happen this week), which would free us from deploying the zmq-forwarder and would also remove the need for any coordination with Performance when migrating to eventlog1002.

Change 416405 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: add eventloggingctl script for systemd

https://gerrit.wikimedia.org/r/416405

Change 416405 merged by Elukey:
[operations/puppet@production] eventlogging: add eventloggingctl script for systemd

https://gerrit.wikimedia.org/r/416405

Change 416389 merged by Elukey:
[operations/puppet@production] profile::tcpircbot: remove eventlog1001 references

https://gerrit.wikimedia.org/r/416389

Change 416471 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: remove zmq-forwarder

https://gerrit.wikimedia.org/r/416471

Change 417240 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::eventlogging:analytics: include zmq config only on eventlog1001

https://gerrit.wikimedia.org/r/417240

Change 417240 merged by Elukey:
[operations/puppet@production] role::eventlogging:analytics: include zmq config only on eventlog1001

https://gerrit.wikimedia.org/r/417240

Change 417242 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Apply role::eventlogging::analytics to eventlog1002

https://gerrit.wikimedia.org/r/417242

Change 417244 had a related patch set uploaded (by Elukey; owner: Elukey):
[eventlogging/scap/analytics@master] Update beta and prod endpoints for systemd migration.

https://gerrit.wikimedia.org/r/417244

Change 417244 merged by Elukey:
[eventlogging/scap/analytics@master] Update beta and prod endpoints for systemd migration.

https://gerrit.wikimedia.org/r/417244

Change 417242 merged by Elukey:
[operations/puppet@production] Apply role::eventlogging::analytics to eventlog1002

https://gerrit.wikimedia.org/r/417242

elukey added a comment. Edited Mar 8 2018, 2:32 PM

So current situation:

  1. on eventlog1002 all daemons but zmq-forwarder are running fine (stretch/systemd)
  2. on eventlog1001 the zmq-forwarder is still running until coal is migrated to Kafka

Change 417317 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: fix the nagios check when using systemd

https://gerrit.wikimedia.org/r/417317

Change 417317 merged by Elukey:
[operations/puppet@production] eventlogging: fix the nagios check when using systemd

https://gerrit.wikimedia.org/r/417317

Change 417322 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::rsync::eventlogging: change rsync target to eventlog1002

https://gerrit.wikimedia.org/r/417322

Change 417322 merged by Elukey:
[operations/puppet@production] statistics::rsync::eventlogging: change rsync target to eventlog1002

https://gerrit.wikimedia.org/r/417322

Change 417978 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Reduce references and roles for eventlog1001

https://gerrit.wikimedia.org/r/417978

Change 417978 abandoned by Elukey:
Reduce references and roles for eventlog1001

https://gerrit.wikimedia.org/r/417978

elukey added a subtask: Restricted Task.Mar 11 2018, 8:26 AM

Change 416471 abandoned by Elukey:
eventlogging: remove zmq-forwarder

https://gerrit.wikimedia.org/r/416471

Change 418953 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: reduce eventlog1001's scope

https://gerrit.wikimedia.org/r/418953

Change 418953 merged by Elukey:
[operations/puppet@production] eventlogging: reduce eventlog1001's scope

https://gerrit.wikimedia.org/r/418953

Updates:

  • eventlog1001 runs a special puppet role called eventlogging::analytics::legacy that enforces only the presence of the zmq-forwarder. This host will need to be up until https://gerrit.wikimedia.org/r/#/c/415218/ is merged, after that we'll proceed with decom (T189566)
  • eventlog1002 is running fine and rsyncs from stat1005 are backing up /srv/log/eventlogging's files daily as expected. There were some bits to fix (ipv6 AAAA record, analytics vlan rules, etc.) but the important ones are now fixed.
elukey set the point value for this task to 21.Mar 13 2018, 10:01 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.
elukey moved this task from In Progress to Done on the User-Elukey board.Mar 15 2018, 5:04 PM
Nuria closed this task as Resolved.Mar 26 2018, 9:27 PM
Nuria closed subtask Restricted Task as Resolved.

Change 422135 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: move alarms from graphite to prometheus

https://gerrit.wikimedia.org/r/422135

Change 422135 merged by Elukey:
[operations/puppet@production] eventlogging: move alarms from graphite to prometheus

https://gerrit.wikimedia.org/r/422135