
logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Closed, Resolved · Public

Description

A cold restart of the cirrus elasticsearch eqiad cluster was done today, and had a few issues. During that time, logs collected by logstash dropped to almost 0.

The obvious link between the logstash and cirrus elasticsearch clusters is the API feature usage logging, which is sent to the cirrus cluster.

It seems strange that API feature logging would affect all logs. Maybe connections are timing out and consuming all logstash resources?

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-09-20_Logstash

Event Timeline

Restricted Application added a subscriber: Aklapper.
mobrovac subscribed.

This is currently occurring on RESTBase and Parsoid hosts and SCB, impacting most of the Node.JS services, leaving them without logs in logstash.

FTR, all of the aforementioned services use logstash1001 directly. That ought to change soon(TM) with T175242: all log producers need to use the logstash LVS endpoint.

debt triaged this task as High priority. Sep 20 2017, 6:15 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage. Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

It might be possible to tune the elasticsearch output plugin to be more robust. The [[ https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-resurrect_delay | resurrect_delay ]] option is something we might want to look at.

> API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage.

That is correct. More specifically, API feature usage log messages are both sent to the logstash cluster (see them in kibana) and cloned and sanitized of most private information to be sent to the cirrus cluster for access by that special page. That way the wiki code doesn't access the same ES cluster that holds the logs containing all sorts of private data.

> Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

Please do fix that, I don't know how.

>> Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

> Please do fix that, I don't know how.

I think we just need to send the data to the search cluster in codfw as well (I don't know how easy it is to add a new output).
Then, when switching search traffic to codfw for maintenance purposes, we can also direct ApiFeatureUsage to use the search cluster in codfw:

if ( $wmgUseApiFeatureUsage ) {
        wfLoadExtension( 'ApiFeatureUsage' );
        $wgApiFeatureUsageQueryEngineConf = [
                'class' => 'ApiFeatureUsageQueryEngineElastica',
                'serverList' => $wmfLocalServices['search'], // Switch to $wmfAllServices['codfw']['search']
        ];
}
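
A rough sketch of how that switch could be made explicit rather than a comment. This is an illustration only: the helper name `pickApiFeatureUsageServers`, its `$overrideDc` parameter and the host names are hypothetical, and it assumes the service map is keyed by datacenter and then by service, as `$wmfAllServices['codfw']['search']` above suggests.

```php
<?php
// Sketch only: the helper name, the $overrideDc parameter and the host names
// below are hypothetical and are not part of the existing configuration.

/**
 * Return the search server list ApiFeatureUsage should query.
 *
 * @param array $allServices Per-datacenter service map, shaped like $wmfAllServices
 * @param string $localDc Datacenter the current host runs in
 * @param string|null $overrideDc Datacenter to force during maintenance, or null
 * @return array Server list for the chosen datacenter
 */
function pickApiFeatureUsageServers( array $allServices, string $localDc, ?string $overrideDc = null ): array {
    // Fall back to the local datacenter when no override is given.
    $dc = $overrideDc ?? $localDc;
    if ( !isset( $allServices[$dc]['search'] ) ) {
        throw new InvalidArgumentException( "No search cluster configured for '$dc'" );
    }
    return $allServices[$dc]['search'];
}

// Example data with made-up host names.
$allServices = [
    'eqiad' => [ 'search' => [ 'search.eqiad.example' ] ],
    'codfw' => [ 'search' => [ 'search.codfw.example' ] ],
];

// Normal operation: query the local (eqiad) cluster.
$normal = pickApiFeatureUsageServers( $allServices, 'eqiad' );
// During an eqiad maintenance window: query codfw instead.
$maintenance = pickApiFeatureUsageServers( $allServices, 'eqiad', 'codfw' );
```

The returned list would be what gets assigned to `'serverList'`. This only helps once the API feature usage logs are also written to the codfw cluster; otherwise the special page would query a cluster that has no data.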

It seems to be possible to add a second output to codfw; this requires some minor refactoring of the underlying puppet code. I'll create a subtask for this. Let's keep this task for the main issue of losing logs.

debt subscribed.

This is fairly tricky and there is no obvious solution right now. We might want to wait for the next version of Logstash.

debt lowered the priority of this task from High to Medium. Oct 24 2017, 5:18 PM

Removing the Discovery team from this; it should now be handled by the observability team. The immediate issue of ApiFeatureUsage not being duplicated to codfw has been addressed in T176430. The stability of the logstash pipeline still needs to be addressed.

Note that part of the solution will be addressed in T217742 (rework the data flow for ApiFeatureUsage), but we might want to do additional work on the general stability of the logstash pipelines.

colewhite claimed this task.
colewhite subscribed.

This was resolved in T297239 - the main logging transformation pipeline no longer forwards logs to the search cluster.