Migrate eventlog1002 to buster
Closed, Resolved (Public)

Description

T238230 (Decommission EventLogging backend components by migrating to MEP) will not be completed by the end of Q4, so we should upgrade the server running the backend eventlogging-processor (currently eventlog1002) to Buster. Many streams have already been migrated to EventGate, so we should be able to spin up a new Buster Ganeti VM, run eventlogging-processor there, and then decom eventlog1002.

Event Timeline

Just to clarify - should eventlog1002 be upgraded to buster and then decommissioned as part of this task or is decommissioning work part of another task? Should the new eventlog VM (eventlog1003 I guess :)) be kept in place for the foreseeable future until we decide to decommission all eventlogging components?

@hnowlan in theory this could be the perfect scenario:

  1. We create eventlog1003 on Ganeti (sizing the VM appropriately) using Buster and Python 3.7 (shipped with it), and we run it in parallel with eventlog1002.
  2. After a little time running both, when we are confident that no corner cases need to be fixed with Python 3.7, we stop eventlogging on 1002.
  3. We decom eventlog1002 and return it to DCops (without any upgrade).

On paper Eventlogging is a stateless app now, since it works only on Kafka topics, so in theory we could even think about Kubernetes. In practice it is surely quicker to spin up a VM and re-use the current puppet machinery, to avoid investing too much time on something that we hope to deprecate asap in favor of eventgate-analytics.
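
To illustrate the "stateless" point above, here is a minimal sketch. It assumes a hypothetical processor that parses raw messages from one topic and emits valid events to another; the function names, the `schema` check, and the sample messages are illustrative and not taken from the eventlogging codebase. Because each message is handled independently, one host or two hosts splitting the input should produce the same output, which is what would make a parallel run and a later switchover straightforward.

```python
import json
from typing import List, Optional


def process(raw: str) -> Optional[dict]:
    """Parse one raw message; return None if it is invalid.

    Hypothetical validation: the event must be a JSON object with a
    "schema" key. No state is kept between calls.
    """
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or "schema" not in event:
        return None
    return event


def run_host(messages: List[str]) -> List[dict]:
    """Simulate one host processing the messages assigned to it."""
    return [e for e in (process(m) for m in messages) if e is not None]


# Hypothetical raw topic contents (two valid events, two invalid ones).
raw_topic = [
    '{"schema": "A", "v": 1}',
    "not json",
    '{"schema": "B", "v": 2}',
    '{"v": 3}',
]

# One host handling everything vs. two hosts splitting the input.
single = run_host(raw_topic)
split = run_host(raw_topic[:2]) + run_host(raw_topic[2:])
assert single == split
```

The real processors consume from and produce to Kafka; the lists here only stand in for topic contents.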

Lemme know your thoughts :)

Sounds good to me! I don't think there's much point in exploring Kubernetes as opposed to using a VM if our medium-term plan is to get rid of the system altogether.

Based on the graphs it looks like we might be okay with a VM with 4 GB of memory and maybe 4 vCPUs to start with? We can tune CPUs upwards if it looks overloaded. It doesn't seem like it'll need a lot of disk.

+1

What is the current status of eventlog1003?
It's reported by a cumin check that ensures that all hosts matching the alias A:all are part of one of the datacenters, and eventlog1003 is not part of the alias for A:eqiad.
AFAICT it's in PuppetDB but not assigned to any role in site.pp. See also https://puppetboard.wikimedia.org/node/eventlog1003.eqiad.wmnet
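
For readers unfamiliar with the check, its logic can be sketched roughly as a set difference. This is only an illustration under stated assumptions: the real check queries cumin aliases, whereas the sets below are hypothetical membership data and `hosts_outside_datacenters` is an invented helper name. The `example*` hostnames are placeholders.

```python
from typing import Dict, Set


def hosts_outside_datacenters(
    all_hosts: Set[str], dc_aliases: Dict[str, Set[str]]
) -> Set[str]:
    """Return hosts that match A:all but belong to no datacenter alias."""
    covered = set().union(*dc_aliases.values())
    return all_hosts - covered


# Hypothetical alias membership, mirroring the situation described above.
all_hosts = {
    "eventlog1002.eqiad.wmnet",
    "eventlog1003.eqiad.wmnet",
    "example1001.eqiad.wmnet",
    "example2001.codfw.wmnet",
}
dc_aliases = {
    "A:eqiad": {"eventlog1002.eqiad.wmnet", "example1001.eqiad.wmnet"},
    "A:codfw": {"example2001.codfw.wmnet"},
}

# Hosts the check would report.
orphans = hosts_outside_datacenters(all_hosts, dc_aliases)
assert orphans == {"eventlog1003.eqiad.wmnet"}
```

In this sketch a host in PuppetDB without a role would presumably match A:all but not the datacenter alias, which is consistent with what the check reported.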

Just changed role to insetup, apologies for the noise. https://gerrit.wikimedia.org/r/681704

Change 682573 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] site: set eventlog1003 role to eventlogging

https://gerrit.wikimedia.org/r/682573

Change 682573 merged by Hnowlan:

[operations/puppet@production] site: eventlog1003 role to eventlogging, allow access to kafka

https://gerrit.wikimedia.org/r/682573

eventlog1003 is now handling all eventlogging jobs. Utilisation looks okay so far - it was initially looking a little busy with 4 CPUs but is better at 6.

Change 683859 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[eventlogging/scap/analytics@master] Deploy to new eventlog hosts

https://gerrit.wikimedia.org/r/683859

Change 683859 merged by Hnowlan:

[eventlogging/scap/analytics@master] Deploy to new eventlog hosts

https://gerrit.wikimedia.org/r/683859

Do we plan on decommissioning or reclaiming the hardware for the old eventlog1002?

Yes let's fully decommission eventlog1002 once we are ok with 1003 :)

I think I'm okay with 1003 so far - it's been running all processors since 15:00 on the 29th of April and it seems to be coping fine, with no aberrations in the eventlogging graphs. If you're cool with it, I can start the decom asap.

Change 685757 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/dns@master] wmnet: correct eventlogging CNAME

https://gerrit.wikimedia.org/r/685757

Change 685757 merged by Hnowlan:

[operations/dns@master] wmnet: correct eventlogging CNAME

https://gerrit.wikimedia.org/r/685757