Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage
Closed, Resolved, Public

Description

The current flow for ApiFeatureUsage is that usage logs are collected via logstash, which has an output to the cirrus elasticsearch clusters. This causes multiple issues:

  • synchronous flow: if the cirrus cluster is down for maintenance (or has crashed), the logstash pipeline will stall (see T176335)
  • strong coupling: logstash and the cirrus cluster need to run compatible versions of logstash / elasticsearch, which can be problematic during upgrades

We should rework this data flow, probably using kafka, which would take care of both those issues.
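To illustrate why a buffer between the two systems addresses the synchronous-flow issue, here is a minimal in-memory sketch. It does not use Kafka or logstash at all: `Sink`, `ship_synchronously`, and `BufferedPipeline` are hypothetical names made up for this illustration, and a `deque` stands in for a Kafka topic. It only shows the general idea that a producer writing to a buffer keeps working while the downstream cluster is unavailable.

```python
from collections import deque


class Sink:
    """Hypothetical stand-in for the cirrus elasticsearch cluster."""

    def __init__(self):
        self.up = True
        self.indexed = []

    def index(self, event):
        if not self.up:
            raise ConnectionError("sink is down")
        self.indexed.append(event)


def ship_synchronously(events, sink):
    """Direct flow: the producer stops at the first failed write."""
    shipped = 0
    for event in events:
        try:
            sink.index(event)
        except ConnectionError:
            break  # the pipeline is stalled until the sink comes back
        shipped += 1
    return shipped


class BufferedPipeline:
    """Buffered flow: a deque plays the role of a Kafka topic (assumption)."""

    def __init__(self, sink):
        self.sink = sink
        self.buffer = deque()

    def produce(self, event):
        # The producer only appends to the buffer and never touches the sink.
        self.buffer.append(event)

    def drain(self):
        """Consumer side: deliver buffered events until the sink fails."""
        delivered = 0
        while self.buffer:
            try:
                self.sink.index(self.buffer[0])
            except ConnectionError:
                break  # keep the event buffered and retry on the next drain
            self.buffer.popleft()
            delivered += 1
        return delivered


# Illustrative run: events are accepted during an outage, delivered afterwards.
sink = Sink()
sink.up = False
pipeline = BufferedPipeline(sink)
for i in range(3):
    pipeline.produce({"feature": f"example-{i}"})
pipeline.drain()   # nothing delivered, nothing lost
sink.up = True
pipeline.drain()   # backlog is delivered in order
```

In a real deployment the buffer would need to be durable and sized for the longest expected maintenance window, which is what Kafka would provide here.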

Event Timeline

A couple random thoughts:

  • This could potentially be a part of T185233?
  • Another potentially related component could be mjolnir-bulk-daemon, which is used today to take bulk updates from kafka and update the wiki indices
  • Alternatively, ApiFeatureUsage could have its own logstash instance running in Ganeti, which would reduce the coupling between the logging services and the cirrus services, but would add complications to puppet to handle multiple versions.

@Gehel, @EBernhardson mentioned that our new Elasticsearch cluster version doesn't have the same issue with data replication when upgrading the cluster, which means that stopping writes might be less important in the next upgrade.

I think specifically the updates are around this ticket, https://phabricator.wikimedia.org/T235833

Tabling this for now, as it's not urgent.

Adding kafka between logstash and cirrus search seems to be the easy solution that solves our biggest concerns.

We discussed this and we think sticking kafka between logstash and elasticsearch will address the synchronous flow issue and thus give us a good value:effort tradeoff.

In the future we could look into cutting the logstash dependency entirely by having the analytics cluster parse web request logs to achieve the same effect, but for now let's have the scope of this ticket be just sticking kafka in there.

I think this was accomplished in T297239 by moving apifeatureusage logstash to a host separate from the main logging pipeline. Do you agree?

EBernhardson claimed this task.

Of the two items listed in the ticket, this resolves the synchronous flow issue but doesn't do anything about the strong coupling. Elastic has been quite good about not changing the bulk update APIs, though, so the strong coupling doesn't seem to be as big an issue as it has been in the past. Can probably call this complete.
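For context on why the coupling is narrow: a writer mostly needs to agree with the cluster on the bulk request body, which is newline-delimited JSON with an action line followed by a document line. A minimal sketch of building such a body follows; `build_bulk_body`, the index name, and the document fields are hypothetical and not taken from the actual ApiFeatureUsage setup.

```python
import json


def build_bulk_body(index, docs):
    """Build a newline-delimited JSON body for Elasticsearch's _bulk endpoint.

    Each document becomes two lines: an "index" action line, then the
    document source. Every line, including the last, ends with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)


# Hypothetical index name and fields, for illustration only.
body = build_bulk_body(
    "apifeatureusage-example",
    [{"feature": "example-feature", "agent": "example-agent"}],
)
print(body, end="")
```

As long as this request shape stays stable across versions, the sender and the cluster can be upgraded more independently.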