Move logstash api-feature-usage output away from v5 cluster
Closed, Resolved · Public

Description

Before turning off the logstash v5 hosts we will need to find a new home for the api-feature-usage elasticsearch outputs that are currently serviced by the v5 clusters.

Event Timeline

herron triaged this task as Medium priority. Dec 7 2021, 9:15 PM
herron created this task.

Change 744862 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move api-feature-usage outputs to elk7 cluster

https://gerrit.wikimedia.org/r/744862

Cwhite wrote at 4:43 PM in https://gerrit.wikimedia.org/r/c/operations/puppet/+/744862/:
I would rather we not move api-feature-usage into the elk7 cluster for several reasons:

  1. We've wanted to move it off for a while: https://phabricator.wikimedia.org/T217742
  2. Cirrus unavailability causes an outage for us in this configuration: https://phabricator.wikimedia.org/T176335
  3. It requires we open up the firewall between the collectors and the cirrus clusters.
  4. We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimedia.org/T288620

> We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimedia.org/T288620

If I'm understanding correctly, this would mean ingesting/filtering the mediawiki log topics from kafka-logging using the "gelf_relay" logstash instances on the cirrus elastic hosts? Maybe that's ok, but it does introduce a bit of feature creep beyond the reason the instances were deployed in T288620. There's some extra overhead to plan for: JVM sizing, etc.

Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster using pipeline configurations. Essentially, instead of cloning the log as api-feature-usage-sanitized we could output it to a secondary pipeline, which in turn would output to search.svc.

Personally I'd be inclined to explore using multiple pipelines; assuming it pans out, we could repurpose and expand on that approach for other uses in the future.

FWIW we do appear to have connectivity between the v7 collectors and search.svc today
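For illustration, a minimal sketch of what the multiple-pipelines approach could look like, using Logstash's pipeline-to-pipeline communication (two pipelines defined in pipelines.yml). The routing condition, pipeline address, and the search.svc endpoint/port are assumptions, not the deployed config:

```
# main pipeline output: instead of cloning the event as
# api-feature-usage-sanitized, hand it to the secondary pipeline
output {
  if [type] == "api-feature-usage-sanitized" {   # illustrative condition
    pipeline { send_to => ["apifeatureusage"] }
  }
}

# apifeatureusage pipeline: consume from the internal address
# and ship to the search cluster
input {
  pipeline { address => "apifeatureusage" }
}
output {
  elasticsearch {
    hosts => ["https://search.svc.eqiad.wmnet:9243"]  # endpoint is an assumption
    index => "apifeatureusage-%{+YYYY.MM.dd}"
  }
}
```

This keeps the api-feature-usage output isolated in its own pipeline (own queue, own workers) without standing up a second Logstash cluster.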

> Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster using pipeline configurations. Essentially, instead of cloning the log as api-feature-usage-sanitized we could output it to a secondary pipeline, which in turn would output to search.svc.
>
> Personally I'd be inclined to explore using multiple pipelines; assuming it pans out, we could repurpose and expand on that approach for other uses in the future.

I think it's reasonable for us to provide an interface to the data (preferably Kafka), but I also think it's beyond the scope of Observability to maintain the api-feature-usage pipeline. The main problem is that we cannot expect Search and CPT to be on the same version (or even the same software) as we are. This enmeshed situation may become more difficult when Search decides what solution they want to adopt in response to the SSPL license change. On the other side, the ApiFeatureUsage extension is unmaintained and will likely be slow to adopt changes.

If the api-feature-usage pipeline were external to the logging pipeline, the dependencies would be decoupled and all would be free to upgrade independently (unless Kafka does something weird).

> FWIW we do appear to have connectivity between the v7 collectors and search.svc today

I thought this was cleaned up with the move to elk7... bummer.

> In T297239#7554905, @herron wrote:
> If I'm understanding correctly, this would mean ingesting/filtering the mediawiki log topics from kafka-logging using the "gelf_relay" logstash instances on the cirrus elastic hosts? Maybe that's ok, but it does introduce a bit of feature creep beyond the reason the instances were deployed in T288620. There's some extra overhead to plan for: JVM sizing, etc.

There are 71 hosts with role::elasticsearch::cirrus, which would make for much larger consumer groups than are needed. Maybe there's a subset of cirrus hosts that would be better suited to handle additional services like hosting api-feature-usage processing? We only need a few hosts per site.

@lmata @fgiunchedi @colewhite and I discussed this at today's o11y team meeting. To summarize, we have at least two actionable options on the table to move this forward:

  1. Move api-feature-usage into the elk7 cluster as a separate logstash pipeline -- However, this is problematic because it re-introduces constraints on our ability to upgrade the logstash cluster, and deploys a curator job with elevated privileges towards the search cluster onto the logstash hosts.
  2. Deploy a new role for api-feature-usage that decouples this function from the logstash hosts -- In theory this could be deployed to any host, but for the near term we would likely roll out a set of small buster or bullseye apifeatureusage VMs to host it.

Given our limited knowledge of the higher level roadmap/ownership for this service, and our need to migrate away from the legacy logstash cluster, option 2 is preferable. I'll plan to proceed in that direction in the near future, unless a better approach presents itself.

I tested Logstash 7.10 writing api feature usage logs to an ES 6 instance in cloud. Somewhere in the pipeline, the api feature usage logs get assigned two types, which makes ES 6 reject them:

[2021-12-14T20:56:19,352][DEBUG][o.e.a.b.TransportShardBulkAction] [4Js0n17] [apifeatureusage-2021.12.14][0] failed to execute bulk item (index) index {[apifeatureusage-2021.12.14][doc][_Ly7un0B49-t0j9CVvuQ], source[{"agent":"ChangePropagation/WMF","feature":"https-expected","@timestamp":"2021-12-14T20:56:19.118Z","type":"api-feature-usage-sanitized","@version":1}]}
java.lang.IllegalArgumentException: Rejecting mapping update to [apifeatureusage-2021.12.14] as the final mapping would have more than 1 type: [doc, api-feature-usage-sanitized]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:451) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:399) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:331) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

Explicitly defining document_type as api-feature-usage-sanitized in the elasticsearch output plugin seemed to rectify this issue.
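For reference, the relevant bit of the elasticsearch output plugin config (the endpoint and index name here are illustrative assumptions; document_type is the plugin option named above):

```
output {
  elasticsearch {
    hosts => ["https://search.svc.eqiad.wmnet:9243"]   # illustrative endpoint
    index => "apifeatureusage-%{+YYYY.MM.dd}"
    # pin the document type so ES 6 sees a single mapping type per index,
    # instead of inheriting a second type from the event's "type" field
    document_type => "api-feature-usage-sanitized"
  }
}
```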

Change 747634 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add optional document_type parameter to es output config

https://gerrit.wikimedia.org/r/747634

Change 747635 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] role: add apifeatureusage role

https://gerrit.wikimedia.org/r/747635

Change 747636 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: clean up legacy apifeatureusage config

https://gerrit.wikimedia.org/r/747636

Change 744862 abandoned by Herron:

[operations/puppet@production] logstash: move api-feature-usage outputs to elk7 cluster

Reason:

going with another approach, see T297239

https://gerrit.wikimedia.org/r/744862

Change 752211 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] assign role::apifeatureusage::logstash to apifeatureusages[12]001 hosts

https://gerrit.wikimedia.org/r/752211

Change 747635 merged by Cwhite:

[operations/puppet@production] role: add apifeatureusage role

https://gerrit.wikimedia.org/r/747635

Change 754007 had a related patch set uploaded (by Herron; author: Herron):

[labs/private@master] profile::apifeatureusage::logstash: add placeholder secrets for PCC

https://gerrit.wikimedia.org/r/754007

Change 754007 merged by Herron:

[labs/private@master] profile::apifeatureusage::logstash: add placeholder secrets for PCC

https://gerrit.wikimedia.org/r/754007

Planning to move the apifeatureusage pipeline over to the new hosts next week with these switchover steps:

note: we'll keep the same consumer group names as used today, so the new apifeatureusage hosts will pick up where the elk5 hosts left off
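Illustrative kafka input settings for the offset handover; the broker, topic, and group name below are placeholders, not the production values:

```
input {
  kafka {
    bootstrap_servers => "kafka-logging1001.eqiad.wmnet:9093"  # placeholder broker
    topics => ["udp_localhost-info"]                           # placeholder topic
    # reuse the existing consumer group name so the new hosts resume
    # from the committed offsets the elk5 hosts leave behind
    group_id => "apifeatureusage-logstash"                     # placeholder group name
  }
}
```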

Change 747634 merged by Cwhite:

[operations/puppet@production] logstash: add optional document_type parameter to es output config

https://gerrit.wikimedia.org/r/747634

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:58:07Z] <herron> beginning logstash apifeatureusage switchover T297239

Change 752211 merged by Herron:

[operations/puppet@production] assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts

https://gerrit.wikimedia.org/r/752211

Change 755456 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: set logstash-json-tcp monitoring to non-critical

https://gerrit.wikimedia.org/r/755456

apifeatureusage[12]001 are now live, but puppet is currently disabled on these hosts as a couple of small manual fixes had to be put in place to bring the pipeline up:

  • need to add + to the date format in the output template names
  • need to specify ssl_endpoint_identification_algorithm on the kafka inputs

I'll work on persisting these configs.
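A sketch of the two fixes, with values assumed from the description above:

```
input {
  kafka {
    security_protocol => "SSL"
    # explicitly set the endpoint identification algorithm on the kafka input;
    # an empty string disables hostname verification (the exact value needed
    # here is an assumption)
    ssl_endpoint_identification_algorithm => ""
  }
}
output {
  elasticsearch {
    # the leading "+" tells Logstash to expand a date-format sprintf
    # rather than look up a literal event field named "YYYY.MM.dd"
    index => "apifeatureusage-%{+YYYY.MM.dd}"
  }
}
```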

Change 755467 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755456 merged by Herron:

[operations/puppet@production] logstash: set logstash-json-tcp monitoring to non-critical

https://gerrit.wikimedia.org/r/755456

Change 755468 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::apifeatureusage::logstash: update ssl identification and index name

https://gerrit.wikimedia.org/r/755468

Change 755468 merged by Herron:

[operations/puppet@production] profile::apifeatureusage::logstash: update ssl identification and index name

https://gerrit.wikimedia.org/r/755468

Change 755467 merged by Herron:

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

herron claimed this task.

> apifeatureusage[12]001 are now live, but puppet is currently disabled on these hosts as a couple of small manual fixes had to be put in place to bring the pipeline up:
>
>   • need to add + to the date format in the output template names
>   • need to specify ssl_endpoint_identification_algorithm on the kafka inputs
>
> I'll work on persisting these configs.

These changes are now puppetized. There is some lvs/monitoring cleanup needed due to moving the old logstash hosts to role::spare::system, but we'll track that in T281266. Resolving!

Change 756053 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] elasticsearch: write curator logs to stdout

https://gerrit.wikimedia.org/r/756053

Change 756053 merged by Cwhite:

[operations/puppet@production] elasticsearch: write curator logs to stdout

https://gerrit.wikimedia.org/r/756053

Change 757955 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: disable gc logging

https://gerrit.wikimedia.org/r/757955

Change 757955 merged by Cwhite:

[operations/puppet@production] apifeatureusage: disable gc logging

https://gerrit.wikimedia.org/r/757955

Change 758533 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move safepoint logging flag inside gc_log gate

https://gerrit.wikimedia.org/r/758533

Change 758533 merged by Cwhite:

[operations/puppet@production] logstash: move safepoint logging flag inside gc_log gate

https://gerrit.wikimedia.org/r/758533

Change 758970 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: increase logstash heap memory to 2G

https://gerrit.wikimedia.org/r/758970

Change 758970 merged by Cwhite:

[operations/puppet@production] apifeatureusage: increase logstash heap memory to 2G

https://gerrit.wikimedia.org/r/758970

Change 747636 abandoned by Cwhite:

[operations/puppet@production] apifeatureusage: clean up legacy apifeatureusage config

Reason:

this was cleaned up in other patches

https://gerrit.wikimedia.org/r/747636