Move mjolnir kafka daemon from ES to search-loader VMs
Closed, Resolved (Public)

Description

In T258189, search-loader[12]001 VMs were created to host the following profiles (maybe more):

  • profile::mjolnir::kafka_bulk_daemon
  • profile::mjolnir::kafka_msearch_daemon

The final goal for Analytics is to whitelist only search-loader* to pull data from Kafka Jumbo (via Ferm rules), rather than from all ES hosts.
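The allow-list idea above can be sketched in a few lines of Python. This is not Ferm syntax and not the actual rule set; it only illustrates the intended match logic, reusing the `search-loader[12]001` pattern from this description. The `elastic1001` host name and the `is_allowed` helper are hypothetical, introduced purely for illustration.

```python
from fnmatch import fnmatchcase

# Pattern copied from the task description. The host names checked below are
# hypothetical examples, not an inventory of real hosts.
ALLOWED_PATTERNS = ["search-loader[12]001"]


def is_allowed(hostname: str) -> bool:
    """Return True if the short host name matches any allowed pattern."""
    # Compare only the part before the first dot, so that a fully
    # qualified name is handled the same way as a short one.
    short_name = hostname.split(".", 1)[0]
    return any(fnmatchcase(short_name, pattern) for pattern in ALLOWED_PATTERNS)


for host in ["search-loader1001", "search-loader2001", "elastic1001"]:
    print(host, "allow" if is_allowed(host) else "deny")
```

Under these assumptions only the two search-loader names are accepted; any ES host would have to be listed explicitly to keep its access.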

Event Timeline

elukey renamed this task from "Move" to "Move mjolnir kafka daemon from ES to search-loader VMs". Jul 17 2020, 9:28 AM

@EBernhardson @RKemper if you have time this week or next, do you think that we could prioritize this task? I am asking since I'd love to add ferm rules to Kafka Jumbo asap, so that I can test them carefully on one node first, but I don't want to break anything for anybody in the process :)

If you are already packed with things to do don't worry, I'll wait!

Change 616101 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move mjolnir's daemons to search-loader hosts

https://gerrit.wikimedia.org/r/616101

While working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/616101/, I realized that mjolnir is also used on relforge, and with different settings, so in theory we'd need a multi-instance daemon to run on search-loader hosts? We cannot really share the configs in hiera afaics, @EBernhardson thoughts?

@elukey In response to your earlier ping in this thread - yup, let's work to get this moving next week

The main blocker at the moment seems to be the fact that mjolnir runs in two places:

  • role::elasticsearch::cirrus, that includes profile::mjolnir::kafka_msearch_daemon and profile::mjolnir::kafka_bulk_daemon
  • role::elasticsearch::relforge that includes profile::mjolnir::kafka_msearch_daemon

After a chat on the discovery chan on IRC it is not clear whether the latter really needs mjolnir, but if so we'd have a problem, since the current puppetization doesn't allow multiple instances of mjolnir to run on the same host (that would be either of the search-loader VMs). So I see two possible ways to move forward:

  1. We don't need mjolnir on relforge now, so we can remove it and re-add it if needed once the cluster runs inside the Analytics VLAN (since all the hosts inside will be able to query Kafka Jumbo without restrictions). This would allow us to review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/616101/ if good.
  2. We need mjolnir on relforge, and we have a plan to deploy it on multiple clusters in the future. In this case we probably need to adjust the mjolnir puppet code to use systemd multi-instance units (basically a unit template with placeholders/variables) so that mjolnir can run as multiple separate instances. We do it in a lot of places, so it shouldn't be super hard to do.
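To make option 2 more concrete, here is a minimal Python sketch of how a systemd template unit expands into per-instance units. systemd names template units `name@.service`, instantiates them as `name@instance.service`, and substitutes the `%i` specifier with the instance name (in its escaped form, which is identical for simple names like the ones used here). Everything else is an assumption for illustration only: the unit name, the `example-mjolnir-daemon` binary, the config path layout, and the `cirrus`/`relforge` instance names are hypothetical and do not reflect the real mjolnir puppetization.

```python
# Hypothetical template unit. The unit name, binary path, and config layout
# are illustrative assumptions, not the real mjolnir puppet code.
TEMPLATE_UNIT_NAME = "mjolnir-kafka-msearch-daemon@.service"
TEMPLATE_BODY = """\
[Unit]
Description=mjolnir kafka msearch daemon (%i)

[Service]
ExecStart=/usr/bin/example-mjolnir-daemon --config /etc/mjolnir/%i.yaml
"""


def instance_unit_name(template_name: str, instance: str) -> str:
    """Build 'name@instance.service' from a 'name@.service' template name."""
    prefix, suffix = template_name.split("@", 1)
    return f"{prefix}@{instance}{suffix}"


def render(body: str, instance: str) -> str:
    """Mimic systemd's %i substitution for simple instance names."""
    return body.replace("%i", instance)


# One template, two independent instances on the same host.
for instance in ["cirrus", "relforge"]:
    print(instance_unit_name(TEMPLATE_UNIT_NAME, instance))
    print(render(TEMPLATE_BODY, instance))
```

The point of the design is that a single template file yields one independently managed service per instance, each reading its own configuration, so two differently configured daemons can share a host.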

We can kill the relforge installation of the daemons. The msearch daemon lets us run search queries, but as mentioned, the plan is for the new relforge instances to live in the analytics network, where we can do this directly without an intermediary like the mjolnir kafka_msearch_daemon.

Change 616101 merged by Elukey:
[operations/puppet@production] Move mjolnir's daemons to search-loader hosts

https://gerrit.wikimedia.org/r/616101

Change 618364 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/MjoLniR@master] Parameterize elastic endpoint for msearch daemon

https://gerrit.wikimedia.org/r/618364

Change 618365 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::search::loader: add URL of elastic search endpoing

https://gerrit.wikimedia.org/r/618365

Change 618365 merged by Elukey:
[operations/puppet@production] role::search::loader: add URL of elastic search endpoing

https://gerrit.wikimedia.org/r/618365

Change 618382 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/puppet@production] Add search-loader dsh group

https://gerrit.wikimedia.org/r/618382

Change 618364 merged by jenkins-bot:
[search/MjoLniR@master] Support daemons working against remote elasticsearch

https://gerrit.wikimedia.org/r/618364

Change 618383 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/MjoLniR/deploy@master] Deploy mjolnir to search-loader hosts

https://gerrit.wikimedia.org/r/618383

Change 618383 merged by Ebernhardson:
[search/MjoLniR/deploy@master] Move mjolnir daemons from cirrus hosts to dedicated instances

https://gerrit.wikimedia.org/r/618383

Change 618391 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/puppet@production] mjolnir: Provide primary cirrus cluster url to msearch daemon

https://gerrit.wikimedia.org/r/618391

Change 618382 abandoned by Ebernhardson:
[operations/puppet@production] Add search-loader dsh group

Reason:
unnecessary per CR

https://gerrit.wikimedia.org/r/618382

Change 618391 merged by Elukey:
[operations/puppet@production] mjolnir: Provide primary cirrus cluster url to msearch daemon

https://gerrit.wikimedia.org/r/618391

Change 618493 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::ops: change mjolnir's target classes

https://gerrit.wikimedia.org/r/618493

Change 618493 merged by Elukey:
[operations/puppet@production] profile::prometheus::ops: change mjolnir's target classes

https://gerrit.wikimedia.org/r/618493

elukey added a parent task: Restricted Task. Aug 5 2020, 10:09 AM

Remaining things to do:

  1. evaluate correctness and performance of mjolnir on search-loader VMs (currently in progress - @EBernhardson).
  2. clean up puppet to remove mjolnir's profiles from relforge/cirrus roles (will be done at the end if we decide not to roll back the current settings).

Any updates on this from the mjolnir work?

elukey changed the task status from Open to Stalled. Sep 4 2020, 6:18 AM

We are currently blocked on T260305. Next week we should be able to deploy a new puppet patch for multi-instance mjolnir, and we'll see how it goes. There are still some problems (including performance) that need to be fixed before calling this migration done (that is, before reaching the point where rolling back to ES is no longer possible).

The daemons are moved. A few followups might be required elsewhere, but this task should be complete.

elukey claimed this task.

Closing it then, thanks a lot!