
mjolnir-kafka-msearch-daemon dropping produced messages after move to search-loader[12]001
Open, Needs Triage · Public · 8 Estimated Story Points

Description

As a deployed service I need to continue working after moving between hosts so I can continue providing live elasticsearch data to ML training processes.

mjolnir daemons (two separate but related daemons) were moved from the elasticsearch hosts (35 per dc) to a single dedicated vm per dc. Since the move, the mjolnir-kafka-*msearch*-daemon has been dropping messages, and the downstream datasets that depend on the messages produced by this daemon have been incomplete.

Possible causes:

  • Most obvious guess is from messages logged when the daemon is busy: Failed to send a message to the broker: KafkaTimeoutError: Batch for TopicPartition(topic='mjolnir.msearch-prod-response', partition=23) containing 1 record(s) expired: 30 seconds have passed since last append
  • While experimenting with parallelism, the internal thread pool was increased from 2 (when the daemon was deployed to all elasticsearch servers; with 25 kafka partitions that gave ~50 threads across the cluster making parallel requests) to 10 on a single instance, and then 25. 10 gave some increase in throughput; 25 has the same throughput as 10 but higher resource utilization. Almost certainly this should be scaled back.
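The thread-pool parallelism discussed above boils down to a small pool of workers doing IO-bound requests. A minimal sketch, where handle_record is only a stand-in for the real per-record HTTP call (names and record contents are illustrative, not MjoLniR code):

```python
from concurrent.futures import ThreadPoolExecutor

def handle_record(record):
    # Stand-in for the real per-record work: one _msearch HTTP call,
    # typically 100-200ms of IO wait per record.
    return len(record)

records = ["query-1", "query-2", "query-3"]

# A small pool, in line with the note above about scaling back from 25.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(handle_record, records))
```

Because the work is almost entirely IO wait, a handful of threads per daemon goes a long way; more threads mostly adds GIL contention and resource usage rather than throughput.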

What does the daemon do:

  • These daemons run in the production network and perform various actions in response to events produced in the analytics network.
  • Two parts of the ML training pipeline need to perform requests against the production search indices, but they live in separate networks and cannot directly talk to the elasticsearch clusters. They typically run once a week, but can be manually triggered.
  • The msearch daemon monitors metrics in the elasticsearch cluster related to cirrussearch, and only subscribes to the appropriate kafka topic if the cluster it's pointed at is idle. Effectively that means only the msearch daemon in codfw does anything under normal operations.
  • Analytics side generates fully formed queries for the elasticsearch _msearch API endpoint and puts them into separate messages in a kafka topic.
  • Daemons on the production side read the events, perform the relevant http request, and put the response in a second topic.
  • Mostly this means the daemon sits around and waits on IO. For each message received from kafka it makes an http request and waits 100-200ms for elasticsearch to respond. Per-record processing within the daemon is minimal.
  • Analytics side reads back the results. There are some complications in the daemon that help the analytics side figure out where to stop reading, but they shouldn't be relevant to this task.
  • There are two separate parts of the training process that cause the msearch daemon to run. The first one runs typical search queries and has small responses, the second stage collects feature vectors from the search clusters and has responses from 100-800kB per record. Connections to elasticsearch and messages sent to kafka should all be compressed as this content is highly duplicative.
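The per-message flow above (read a fully formed query, POST it to _msearch) can be sketched as follows. This is an illustrative helper, not the actual MjoLniR code; the index and query contents are made up:

```python
import json

def build_msearch_payload(pairs):
    """Build an Elasticsearch _msearch body: alternating NDJSON lines
    of (header, query) pairs, with the trailing newline the API requires."""
    lines = []
    for header, query in pairs:
        lines.append(json.dumps(header))
        lines.append(json.dumps(query))
    return "\n".join(lines) + "\n"

payload = build_msearch_payload([
    ({"index": "enwiki_content"},
     {"query": {"match": {"title": "kafka"}}, "size": 5}),
])
# The daemon would POST a body like this to <cluster>/_msearch with
# Content-Type: application/x-ndjson, compression enabled, and then
# produce the response to the second topic.
```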

Related:

  • grafana dashboard for msearch
  • Search host:search-loader2001 in logstash for related logs
  • If my memory is correct the reason we send 1 message per produce request in the msearch is because kafka had issues in the past due to requests being too large. May or may not still be relevant.
  • Reproducing timeout issues on a local instance is going to be difficult at best. The easiest way to reproduce on the production instance would be to directly produce messages to the topic; likely a few MB of real previous messages can be put in a file and re-produced with the kafkacat utility. I've recorded 10k requests from this week's run at stat1006.eqiad.wmnet:/home/ebernhardson/mjolnir.msearch-prod-request.20200812.ndjson
  • Testing on the msearch daemon in production isn't particularly risky: it's hardcoded to only hit the /_msearch endpoint, which is read-only, and the daemon only operates against the idle cluster. If both clusters are somehow busy, it simply doesn't run.
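The replay described above might look roughly like this; the broker address is a placeholder, and kafkacat's -P/-l flags produce the file line by line, one message per line:

```shell
# Re-produce previously captured requests into the request topic.
# Broker address is a placeholder; the .ndjson file is the capture
# mentioned above (one fully formed request per line).
kafkacat -P \
  -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t mjolnir.msearch-prod-request \
  -l mjolnir.msearch-prod-request.20200812.ndjson
```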

Event Timeline

EBernhardson updated the task description.

Adding also another note. The configuration of the msearch response topic is the following:

Topic:mjolnir.msearch-prod-response     PartitionCount:35       ReplicationFactor:3     Configs:retention.ms=86400000

35 partitions probably made sense when the whole ES cluster was consuming/producing, but not now that we have the search-loader nodes. Kafka, as far as I know, doesn't support decreasing the number of partitions for a given topic, so one solution (if the data is droppable) is to delete and re-create the topic with fewer partitions.
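If deleting and re-creating the topic is acceptable, the classic kafka-topics invocation would look roughly like the sketch below; the zookeeper address and the new partition count are placeholders, and the retention config is taken from the topic description above:

```shell
# All data in the topic is lost; only do this if it is droppable.
kafka-topics --zookeeper conf1004.eqiad.wmnet:2181 \
  --delete --topic mjolnir.msearch-prod-response
kafka-topics --zookeeper conf1004.eqiad.wmnet:2181 \
  --create --topic mjolnir.msearch-prod-response \
  --partitions 5 --replication-factor 3 \
  --config retention.ms=86400000
```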

The error

Failed to send a message to the broker: KafkaTimeoutError: Batch for TopicPartition(topic='mjolnir.msearch-prod-response', partition=23) containing 1 record(s) expired: 30 seconds have passed since last append

should indicate, IIRC/IIUC, that mjolnir queues messages to produce to the topic in its local buffer for more than 30s, and this causes data to not be sent. Given the exception, I suppose that mjolnir uses kafka-python, which in turn doesn't use librdkafka underneath (librdkafka has a transparent retry mechanism for these cases). kafka-python can retry (https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html), but retries are disabled by default to prevent duplicates.
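For reference, the knobs involved here are real kafka-python KafkaProducer parameters; the values below are illustrative guesses, not a tested recommendation:

```python
# Illustrative kafka-python producer settings relevant to the batch-expiry
# error; all keys are real KafkaProducer parameters, all values are guesses.
producer_config = {
    "bootstrap_servers": ["localhost:9092"],  # placeholder broker
    "retries": 3,                   # default is 0, to avoid duplicates
    "request_timeout_ms": 120_000,  # default 30_000 is the "30 seconds" above
    "compression_type": "gzip",     # responses are highly duplicative
    "linger_ms": 50,                # let batches fill before sending
}
# Would be used as: producer = kafka.KafkaProducer(**producer_config)
```

Note that raising retries trades the duplicate-avoidance default for delivery; whether duplicates are acceptable depends on how the analytics side consumes the response topic.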

My suspicion is that mjolnir on search-loader cannot keep up producing to 35 partitions, given that it probably round-robins across all of them.

Other bits I forgot:

repository: search/MjoLniR
deploy repository: search/MjoLniR-deploy

deploy process:

  1. Commit to the primary repo
  2. Bump the submodule in the deploy repository to match the primary repo
  3. scap deploy .. from deployment.eqiad.wmnet in /srv/deployment/search/mjolnir/deploy
  4. sudo systemctl restart mjolnir-kafka-msearch-daemon on the host (ex: search-loader2001.codfw.wmnet)

CBogen set the point value for this task to 8. (Aug 17 2020, 5:12 PM)

A quick test showed that reducing the number of partitions does not fix the issue. Our current suspicion is that the current parallelism (threads, limited by the Python GIL) is insufficient. Ideas to fix:

  • limit workers per service to 2
  • launch more msearch daemons per loader (similar to the carbon-cache configuration)
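One common way to run several daemons per loader is a systemd template unit. The sketch below is illustrative only; the unit name, ExecStart path, and arguments are placeholders, and the real unit is managed by puppet:

```ini
# /lib/systemd/system/mjolnir-kafka-msearch-daemon@.service (sketch)
[Unit]
Description=mjolnir msearch daemon, instance %i
After=network.target

[Service]
# Placeholder command; the real invocation comes from the deploy repo.
ExecStart=/srv/deployment/search/mjolnir/deploy/venv/bin/mjolnir kafka_msearch_daemon
Restart=always

[Install]
WantedBy=multi-user.target
```

Each instance (e.g. mjolnir-kafka-msearch-daemon@0, @1, ...) is then a separate Python process with its own GIL, joining the same kafka consumer group so partitions are split among them.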

Change 621988 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/puppet@production] Multiple instances of msearch_daemon

https://gerrit.wikimedia.org/r/621988

Change 622355 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/puppet@production] Remove unnecessary daemon definitions

https://gerrit.wikimedia.org/r/622355

Change 621988 merged by Elukey:
[operations/puppet@production] Multiple instances of msearch_daemon

https://gerrit.wikimedia.org/r/621988

Mentioned in SAL (#wikimedia-operations) [2020-09-08T18:22:35Z] <elukey> rm /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml on prometheus100[3,4] as cleanup after https://gerrit.wikimedia.org/r/621988 - T260305

Change 625952 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mjolnir: fix syslog identifier in the msearch systemd unit template

https://gerrit.wikimedia.org/r/625952

Change 625952 merged by Elukey:
[operations/puppet@production] mjolnir: fix syslog identifier in the msearch systemd unit template

https://gerrit.wikimedia.org/r/625952

Change 625954 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mjolnir: fix syslog identifier of the msearch instances

https://gerrit.wikimedia.org/r/625954

Change 625954 merged by Elukey:
[operations/puppet@production] mjolnir: fix syslog identifier of the msearch instances

https://gerrit.wikimedia.org/r/625954

Change 626055 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mjolnir::kafka_msearch_daemon: clean up old code

https://gerrit.wikimedia.org/r/626055

Change 626055 merged by Elukey:
[operations/puppet@production] profile::mjolnir::kafka_msearch_daemon: clean up old code

https://gerrit.wikimedia.org/r/626055

Change 626056 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::elasticsearch::relforge: add missing hiera config for mjolnir

https://gerrit.wikimedia.org/r/626056

Change 626056 merged by Elukey:
[operations/puppet@production] role::elasticsearch::relforge: add missing hiera config for mjolnir

https://gerrit.wikimedia.org/r/626056

Change 626058 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mjolnir::kafka_msearch_daemon_instance: fix enable value

https://gerrit.wikimedia.org/r/626058

Change 626058 merged by Elukey:
[operations/puppet@production] profile::mjolnir::kafka_msearch_daemon_instance: fix enable value

https://gerrit.wikimedia.org/r/626058

Change 626060 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::elasticsearch::cirrus: add missing mjolnir hiera value

https://gerrit.wikimedia.org/r/626060

Change 626060 merged by Elukey:
[operations/puppet@production] role::elasticsearch::cirrus: add missing mjolnir hiera value

https://gerrit.wikimedia.org/r/626060

Change 626105 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[operations/puppet@production] Bump msearch daemon parallelism

https://gerrit.wikimedia.org/r/626105

Change 626105 merged by Elukey:
[operations/puppet@production] mjolnir: Bump msearch daemon parallelism

https://gerrit.wikimedia.org/r/626105

I noticed kafka-python==1.4.3 in the venv; can we try to upgrade it to 1.4.7? https://pypi.org/project/kafka-python/#history

Change 622355 abandoned by ZPapierski:
[operations/puppet@production] Remove unnecessary daemon definitions

Reason:
Already removed

https://gerrit.wikimedia.org/r/622355

I noticed kafka-python==1.4.3 in the venv; can we try to upgrade it to 1.4.7? https://pypi.org/project/kafka-python/#history

Erik quickly checked and it seems that the latest version, 2.0.1, could be a better candidate (the major version bump is related to dropping old consumer code).

The next steps that I see are:

  1. Try kafka-python 2.0.1 to see if the consumer errors get fixed (if any; I still haven't confirmed whether the problem is there or not).
  2. Test performance after T262385 and decide if the VMs are now OK as a replacement for the previous setup (to unblock Analytics, which needs to deploy iptables/Ferm rules to Kafka Jumbo).

Try kafka-python 2.0.1 to see if the consumer errors get fixed (if any; I still haven't confirmed whether the problem is there or not).

Not seeing a repeat of the consumer errors from before. We should still upgrade, but it doesn't look to be needed before closing this task.

Test performance after T262385 and decide if the VMs are now OK as a replacement for the previous setup (to unblock Analytics, which needs to deploy iptables/Ferm rules to Kafka Jumbo).

Metrics for this week's run look acceptable. It's currently about half the throughput of the previous deployment, but that isn't enough of a change to cause issues, in my estimation.

Thanks a lot for all the work on this, really appreciated. As far as I can see, the new search-loader environment seems stable and working, even if not as performant as the old setup on the ES cluster. If needed we can try to increase the number of vcores/RAM on the loader hosts, and possibly (long term) think about bare-metal hardware for the next fiscal year.