logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Closed, Resolved, Public

Assigned To

Authored By

	Gehel
	Sep 20 2017, 4:54 PM

Description

A cold restart of the cirrus elasticsearch eqiad cluster was done today, and had a few issues. During that time, logs collected by logstash dropped to almost 0.

The obvious link between those two Elasticsearch clusters is the API feature logging, which is sent to the cirrus cluster.

It seems strange that API feature logging would affect all logs. Maybe connections are timing out and consuming all Logstash resources?

Related Objects

Status	Assigned	Task
Resolved	herron	T281266 Decommission old ELK5 Logstash cluster
Resolved	herron	T297239 Move logstash api-feature-usage output away from v5 cluster
Resolved	colewhite	T176335 logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Resolved	dcausse	T176430 api feature logs should be sent to both eqiad and codfw clusters

Event Timeline

Gehel created this task. Sep 20 2017, 4:54 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Sep 20 2017, 4:54 PM

Restricted Application added a subscriber: Aklapper.

This is currently occurring on the RESTBase, Parsoid, and SCB hosts, impacting most of the Node.js services and leaving them without logs in logstash.

FTR, all of the aforementioned services use logstash1001 directly. That ought to change soon(TM) with T175242: all log producers need to use the logstash LVS endpoint.

debt triaged this task as High priority. Sep 20 2017, 6:15 PM

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage. Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

It might be possible to tune the elasticsearch output plugin to be more robust. The [[ https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-resurrect_delay | resurrect_delay ]] option is something we might want to look at.
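For reference, a minimal sketch of where such tuning would go in an elasticsearch output block. The conditional, host name, and index pattern are hypothetical placeholders, not the production configuration. The option names come from the elasticsearch output plugin documentation, and the values shown are illustrative. Whether any of this would keep one unavailable cluster from stalling the rest of the pipeline is an untested assumption, since the plugin keeps retrying failed requests.

```
output {
  # Hypothetical condition, host, and index; for illustration only.
  if [type] == "example-sanitized" {
    elasticsearch {
      hosts => ["search.example.internal:9200"]
      index => "example-%{+YYYY.MM.dd}"

      # Seconds to wait on network operations and requests before
      # timing out and retrying (plugin default: 60).
      timeout => 10

      # Seconds between checks of whether a host marked as down
      # has come back (plugin default: 5).
      resurrect_delay => 5

      # Back-off between retries of failed bulk requests, in seconds:
      # starts at the initial interval and doubles up to the maximum
      # (plugin defaults: 2 and 64).
      retry_initial_interval => 2
      retry_max_interval => 64
    }
  }
}
```

A shorter timeout would only make a dead host get noticed sooner; the events destined for it would still be retried rather than dropped.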

Anomie subscribed. Sep 21 2017, 4:06 PM

In T176335#3624317, @Gehel wrote:
> API Feature logs are sent to the cirrus cluster, presumably for consumption by https://en.wikipedia.org/wiki/Special:ApiFeatureUsage.

That is correct. More specifically, API feature usage log messages are both sent to the logstash cluster (see them in kibana) and cloned and sanitized of most private information to be sent to the cirrus cluster for access by that special page. That way the wiki code doesn't access the same ES cluster that holds the logs containing all sorts of private data.

> Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.

Please do fix that, I don't know how.

In T176335#3624891, @Anomie wrote:
> In T176335#3624317, @Gehel wrote:
>> Note that they are sent only to the eqiad cluster, which means that this page probably breaks when elasticsearch eqiad is down.
>
> Please do fix that, I don't know how.

I think we just need to send the data to the search cluster in codfw (I don't know how easy it is to add a new output).
Then, when switching search traffic to codfw for maintenance purposes, we can also direct ApiFeatureUsage to use the search cluster in codfw:

if ( $wmgUseApiFeatureUsage ) {
        wfLoadExtension( 'ApiFeatureUsage' );
        $wgApiFeatureUsageQueryEngineConf = [
                'class' => 'ApiFeatureUsageQueryEngineElastica',
                'serverList' => $wmfLocalServices['search'], // Switch to $wmfAllServices['codfw']['search']
        ];
}

It seems to be possible to add a second output to codfw; this requires some minor refactoring of the underlying Puppet code. I'll create a subtask for this. Let's keep this task for the main issue of losing logs.

Gehel mentioned this in T176430: api feature logs should be sent to both eqiad and codfw clusters. Sep 21 2017, 5:24 PM

Gehel created subtask T176430: api feature logs should be sent to both eqiad and codfw clusters.

Anomie reopened subtask T176430: api feature logs should be sent to both eqiad and codfw clusters as Open. Sep 21 2017, 7:52 PM

debt closed subtask T176430: api feature logs should be sent to both eqiad and codfw clusters as Resolved. Oct 2 2017, 2:13 PM

This is fairly tricky, and there is no obvious solution right now. We might want to wait for the next version of Logstash.

debt moved this task from needs triage to This Quarter on the Discovery-Search board. Oct 19 2017, 5:06 PM

debt lowered the priority of this task from High to Medium. Oct 24 2017, 5:18 PM

debt moved this task from This Quarter to Tech Debt/Misc on the Discovery-Search board. Oct 24 2017, 5:26 PM

fgiunchedi moved this task from Backlog to Up next on the Wikimedia-Logstash board. Aug 6 2018, 1:07 PM

mobrovac added a project: Platform Team Legacy (Watching / External). Dec 20 2018, 12:03 PM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:28 PM

debt removed a project: Discovery-Search. Jan 29 2019, 6:37 PM

debt moved this task from Tech Debt/Misc to elastic / cirrus on the Discovery-Search board.

debt added a project: Discovery-Search.

Gehel mentioned this in T217742: Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage. Mar 6 2019, 10:23 AM

Removing the Discovery team from this; it should now be handled by the observability team. The immediate issue of ApiFeatureUsage not being duplicated to codfw has been addressed in T176430. The stability of the logstash pipeline still needs to be addressed.

Note that part of the solution will be addressed in T217742 (rework the data flow for ApiFeatureUsage), but we might want to do additional work on the general stability of the logstash pipelines.

fgiunchedi mentioned this in T223483: Logstash stops processing messages if a single output becomes blocked. May 17 2019, 3:29 PM

herron subscribed. May 17 2019, 3:30 PM

fgiunchedi added a project: observability. Aug 19 2019, 2:31 PM

fgiunchedi moved this task from Inbox to Up next on the observability board. Aug 19 2019, 2:57 PM

lmata moved this task from Up next to Backlog on the observability board. Sep 21 2020, 8:26 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:02 PM

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:21 AM

Maintenance_bot added a project: observability. Jul 12 2021, 2:49 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Logging; removed SRE Observability. Aug 9 2021, 2:47 AM

Maintenance_bot edited projects, added SRE Observability; removed Observability-Logging. Aug 9 2021, 3:48 AM

Krinkle edited projects, added Sustainability (Incident Followup); removed Platform Team Legacy (Watching / External), Services (watching). Sep 28 2021, 11:53 PM

Krinkle updated the task description.

herron mentioned this in T297239: Move logstash api-feature-usage output away from v5 cluster. Dec 7 2021, 9:56 PM

herron added a parent task: T297239: Move logstash api-feature-usage output away from v5 cluster.

lmata edited projects, added Observability-Logging; removed SRE Observability. Jan 17 2022, 11:11 PM

This was resolved in T297239 - the main logging transformation pipeline no longer forwards logs to the search cluster.
