Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage
Closed, ResolvedPublic

Assigned To

Authored By

	Gehel
	Mar 6 2019, 10:17 AM

Description

The current flow for ApiFeatureUsage is that usage logs are collected via logstash, which has an output to the cirrus elasticsearch clusters. This causes multiple issues:

synchronous flow: if the cirrus cluster is down for maintenance (or has crashed), the logstash pipeline will stall (see T176335)
strong coupling: logstash and the cirrus cluster need to run compatible versions of logstash / elasticsearch, which can be problematic during upgrades

We should rework this data flow, probably using kafka, which would take care of both those issues.

Related Objects

Status	Assigned	Task
Resolved	herron	T281266 Decommission old ELK5 Logstash cluster
Resolved	herron	T297239 Move logstash api-feature-usage output away from v5 cluster
Resolved	EBernhardson	T217742 Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage

Event Timeline

Gehel created this task. Mar 6 2019, 10:17 AM

Restricted Application added a subscriber: Aklapper. Mar 6 2019, 10:17 AM

Gehel updated the task description. Mar 6 2019, 10:23 AM

Gehel mentioned this in T176335: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable. Mar 6 2019, 10:27 AM

Anomie subscribed. Mar 6 2019, 2:06 PM

A couple random thoughts:

This could potentially be a part of T185233?
Another potentially related component could be mjolnir-bulk-daemon which is used today to take bulk updates from kafka and update the wiki indices
Alternatively, ApiFeatureUsage could have its own logstash instance running in Ganeti, which would reduce the coupling between the logging services and the cirrus services but would complicate puppet, since it would have to handle multiple versions.

EBernhardson triaged this task as Medium priority. Mar 7 2019, 6:18 PM

EBernhardson moved this task from needs triage to Ops / SRE on the Discovery-Search board. Mar 7 2019, 6:47 PM

Gehel mentioned this in T234854: Upgrade ELK Stack to version 7. Oct 24 2019, 1:32 PM

@Mstyles

@Gehel, @EBernhardson mentioned that our new Elasticsearch cluster version doesn't have the same issue with data replication when upgrading the cluster, which means that stopping writes might be less important in the next upgrade.

I think specifically the updates are around this ticket, https://phabricator.wikimedia.org/T235833

Mstyles claimed this task. Dec 18 2019, 9:29 PM

Tabling this for now as it's not urgent

Gehel added a subtask: T241791: (Need by: 2020-04-02) rack/setup/install relforge100[34]. Jul 24 2020, 8:34 AM

Mstyles removed Mstyles as the assignee of this task. Aug 6 2020, 7:37 PM

Gehel removed a subtask: T241791: (Need by: 2020-04-02) rack/setup/install relforge100[34]. Aug 25 2020, 7:09 PM

Adding kafka between logstash and cirrus search seems to be the easy solution that solves our biggest concerns.

We discussed this and we think sticking kafka between logstash and elasticsearch will address the synchronous flow issue and give us a good value-to-effort tradeoff.

In the future we could look into cutting the logstash dependency entirely by having the analytics cluster parse web request logs to achieve the same effect, but for now let's have the scope of this ticket be just sticking kafka in there.
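To illustrate why a buffer in the middle removes the stall, here is a minimal sketch in plain Python. It is not the actual logstash or kafka configuration: `BufferedLog`, `Sink`, and `Consumer` are hypothetical in-memory stand-ins for a kafka topic, the cirrus elasticsearch cluster, and whatever process would read from kafka and write to elasticsearch.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedLog:
    """Append-only buffer; hypothetical stand-in for a kafka topic partition."""
    records: list = field(default_factory=list)

    def append(self, record):
        # The producer side (logstash in this ticket) only ever appends,
        # so it never waits on the downstream sink.
        self.records.append(record)
        return len(self.records) - 1  # offset of the appended record


@dataclass
class Sink:
    """Hypothetical stand-in for the cirrus elasticsearch cluster."""
    available: bool = True
    indexed: list = field(default_factory=list)

    def write(self, record):
        if not self.available:
            raise ConnectionError("sink unavailable")
        self.indexed.append(record)


class Consumer:
    """Reads from the buffer at its own pace and tracks its own offset."""

    def __init__(self, log, sink):
        self.log = log
        self.sink = sink
        self.offset = 0

    def poll(self):
        """Deliver as many pending records as possible; return the count."""
        delivered = 0
        while self.offset < len(self.log.records):
            try:
                self.sink.write(self.log.records[self.offset])
            except ConnectionError:
                break  # keep the offset; retry on the next poll
            self.offset += 1
            delivered += 1
        return delivered


def demo():
    log, sink = BufferedLog(), Sink()
    consumer = Consumer(log, sink)

    log.append({"feature": "a"})
    consumer.poll()

    sink.available = False         # cluster down for maintenance
    log.append({"feature": "b"})   # producer keeps going regardless
    log.append({"feature": "c"})
    stalled = consumer.poll()      # nothing delivered while the sink is down

    sink.available = True          # cluster back
    caught_up = consumer.poll()    # backlog drains from the saved offset
    return stalled, caught_up, len(sink.indexed)
```

The point of the sketch is only that the producer's `append` never depends on the sink being up; the backlog is replayed once the sink returns.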

fgiunchedi subscribed. Sep 29 2020, 12:16 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:02 PM

colewhite subscribed. Apr 15 2021, 3:36 PM

herron mentioned this in T297239: Move logstash api-feature-usage output away from v5 cluster. Dec 7 2021, 9:56 PM

herron added a parent task: T297239: Move logstash api-feature-usage output away from v5 cluster.

I think this was accomplished in T297239 by moving apifeatureusage logstash to a host separate from the main logging pipeline. Do you agree?

In T217742#8115790, @colewhite wrote:

I think this was accomplished in T297239 by moving apifeatureusage logstash to a host separate from the main logging pipeline. Do you agree?

Of the two items listed in the ticket, this resolves the synchronous flow issue but doesn't do anything about the strong coupling. Elastic has been quite good about not changing the bulk update APIs, though, and the strong coupling doesn't seem to be as big an issue as it has been in the past. Can probably call this complete.
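For context on the bulk API mentioned above: the request body is newline-delimited JSON, one action line followed by one document line, with a trailing newline. A rough sketch of building such a body follows. The index name and document fields are made up for illustration and are not the real apifeatureusage schema, and depending on the Elasticsearch version the action metadata may need additional fields such as `_type`.

```python
import json


def build_bulk_body(index_name, docs):
    """Build an NDJSON body for the elasticsearch _bulk endpoint.

    index_name and the shape of docs are assumptions for illustration only.
    """
    lines = []
    for doc in docs:
        # Action line: tells elasticsearch what to do with the next line.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical usage; the field names are invented.
body = build_bulk_body(
    "example-index",
    [{"feature": "some-deprecated-param", "agent": "ExampleBot/1.0"}],
)
```

Because this format is about all that logstash and the cluster need to agree on for writes, a stable bulk API keeps the version coupling fairly loose in practice.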
