logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Open, Normal, Public

Description

A cold restart of the cirrus elasticsearch eqiad cluster was done today, and had a few issues. During that time, logs collected by logstash dropped to almost 0.

The obvious link between those 2 elasticsearch clusters is the apifeature logging, which is sent to the cirrus cluster.

It seems strange that API feature usage logging would affect all logs. Maybe connections are timing out and consuming all logstash resources?

Event Timeline

Gehel created this task.Sep 20 2017, 4:54 PM
Restricted Application added projects: Discovery, Discovery-Search.Sep 20 2017, 4:54 PM
Restricted Application added a subscriber: Aklapper.
mobrovac added a subscriber: mobrovac.

This is currently occurring on the RESTBase, Parsoid and SCB hosts. It impacts most of the Node.js services, leaving them without logs in logstash.

FTR, all of the aforementioned services use logstash1001 directly. That ought to change soon(TM) with T175242: all log producers need to use the logstash LVS endpoint.

debt triaged this task as High priority.Sep 20 2017, 6:15 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel added a comment.Sep 21 2017, 1:19 PM

API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage. Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

Gehel added a comment.Sep 21 2017, 1:24 PM

It might be possible to tune the elasticsearch output plugin to be more robust. The resurrect_delay option is something we might want to look at.
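
As a rough illustration of the kind of tuning meant here, the fragment below sets a few connection-related options that the Logstash elasticsearch output plugin exposes (timeout, resurrect_delay, retry_initial_interval, retry_max_interval). This is only a sketch: the host name, index name and all values are hypothetical placeholders and are not taken from the actual Wikimedia configuration.

```
output {
  elasticsearch {
    # Hypothetical endpoint and index, for illustration only.
    hosts => ["search.eqiad.example.invalid:9200"]
    index => "apifeatureusage-%{+YYYY.MM.dd}"

    # Give up on a slow request sooner, so a sick cluster ties up
    # output workers for less time (seconds).
    timeout => 10

    # How often (seconds) to check whether a host marked dead is back.
    resurrect_delay => 5

    # Bounds (seconds) for the backoff between retries of failed bulk requests.
    retry_initial_interval => 2
    retry_max_interval => 64
  }
}
```

Whether shorter timeouts would actually help depends on why the pipeline stalls. If the output keeps retrying indefinitely while the cluster is down, these settings change how quickly it notices recovery, not whether other outputs are blocked in the meantime.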

Anomie added a subscriber: Anomie.Sep 21 2017, 4:06 PM

API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage.

That is correct. More specifically, API feature usage log messages are both sent to the logstash cluster (see them in kibana) and cloned and sanitized of most private information to be sent to the cirrus cluster for access by that special page. That way the wiki code doesn't access the same ES cluster that holds the logs containing all sorts of private data.

Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

Please do fix that, I don't know how.

Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

Please do fix that, I don't know how.

I think we just need to send the data to the search cluster in codfw as well (I don't know how easy it is to add a new output).
Then, when switching search traffic to codfw for maintenance purposes, we can also direct ApiFeatureUsage to use the search cluster in codfw:

if ( $wmgUseApiFeatureUsage ) {
        wfLoadExtension( 'ApiFeatureUsage' );
        $wgApiFeatureUsageQueryEngineConf = [
                'class' => 'ApiFeatureUsageQueryEngineElastica',
                'serverList' => $wmfLocalServices['search'], // Switch to $wmfAllServices['codfw']['search']
        ];
}
Gehel added a comment.Sep 21 2017, 5:22 PM

It seems to be possible to add a second output to codfw, but this requires some minor refactoring of the underlying puppet code. I'll create a subtask for this. Let's keep this task for the main issue of losing logs.
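
For readers unfamiliar with Logstash, a second output of the kind described above could look roughly like the fragment below. It is a sketch under assumptions: the type value, host names and index name are hypothetical, and the real configuration is generated from puppet and may be structured quite differently.

```
output {
  # Hypothetical event type for the sanitized API feature usage copies.
  if [type] == "api-feature-usage-sanitized" {
    # Existing output: search cluster in eqiad (placeholder host).
    elasticsearch {
      hosts => ["search.eqiad.example.invalid:9200"]
      index => "apifeatureusage-%{+YYYY.MM.dd}"
    }
    # Additional output: search cluster in codfw (placeholder host).
    elasticsearch {
      hosts => ["search.codfw.example.invalid:9200"]
      index => "apifeatureusage-%{+YYYY.MM.dd}"
    }
  }
}
```

Note that adding an output this way duplicates the data but does not isolate the outputs from each other. As far as I understand, outputs defined in the same pipeline share its workers, so an unavailable cluster behind either output may still stall the rest, which is the main issue tracked here.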

debt added a subscriber: debt.

This is fairly tricky and there is no obvious solution right now. We might want to wait for the next version of Logstash.

debt lowered the priority of this task from High to Normal.Oct 24 2017, 5:18 PM
fgiunchedi moved this task from Backlog to Up next on the Wikimedia-Logstash board.Aug 6 2018, 1:07 PM
debt moved this task from Tech Debt/Misc to elastic / cirrus on the Discovery-Search board.
debt added a project: Discovery-Search.

Removing the Discovery team from this. It should now be handled by the observability team. The immediate issue of ApiFeatureUsage not being duplicated to codfw has been addressed in T176430. The stability of the logstash pipeline still needs to be addressed.

Note that part of the solution will be addressed in T217742 (rework the data flow for ApiFeatureUsage), but we might want to do additional work on the general stability of the logstash pipelines.

herron added a subscriber: herron.May 17 2019, 3:30 PM
fgiunchedi moved this task from Backlog to Up next on the observability board.Aug 19 2019, 2:57 PM