Move logstash api-feature-usage output away from v5 cluster
Closed, Resolved · Public

Description

Before turning off the logstash v5 hosts we will need to find a new home for the api-feature-usage elasticsearch outputs that are currently serviced by the v5 clusters.

Event Timeline

herron triaged this task as Medium priority. Dec 7 2021, 9:15 PM
herron created this task.

Change 744862 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move api-feature-usage outputs to elk7 cluster

https://gerrit.wikimedia.org/r/744862

Cwhite wrote at 4:43 PM in https://gerrit.wikimedia.org/r/c/operations/puppet/+/744862/:
I would rather we not move api-feature-usage into the elk7 cluster for several reasons:

  1. We've wanted to move it off for a while: https://phabricator.wikimedia.org/T217742
  2. Cirrus unavailability causes an outage for us in this configuration: https://phabricator.wikimedia.org/T176335
  3. It requires we open up the firewall between the collectors and the cirrus clusters.
  4. We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimedia.org/T288620

> We have an ingester installed with appropriate access on the cirrus cluster which can do this via the work from https://phabricator.wikimedia.org/T288620

If I'm understanding correctly, this would mean ingesting/filtering the mediawiki log topics from kafka-logging using the "gelf_relay" logstash instances on the cirrus elastic hosts? Maybe that's ok, but it does introduce a bit of feature creep beyond the reason the instances were deployed in T288620. There's some extra overhead to plan for: JVM sizing, etc.

Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster using pipeline configurations. Essentially, instead of cloning the log as api-feature-usage-sanitized we could output it to a secondary pipeline, which in turn would output to search.svc.

Personally I'd be inclined to explore using multiple pipelines; assuming it pans out, we could repurpose and expand on that approach for other uses in the future.

FWIW we do appear to have connectivity between the v7 collectors and search.svc today
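For illustration, a minimal sketch of what the multiple-pipelines approach could look like, using Logstash's pipeline-to-pipeline communication (two pipelines defined in pipelines.yml). The routing condition, pipeline address, and the search.svc endpoint/port are assumptions, not the deployed config:

```
# main pipeline output: instead of cloning the event as
# api-feature-usage-sanitized, hand it to the secondary pipeline
output {
  if [type] == "api-feature-usage-sanitized" {   # illustrative condition
    pipeline { send_to => ["apifeatureusage"] }
  }
}

# apifeatureusage pipeline: consume from the internal address
# and ship to the search cluster
input {
  pipeline { address => "apifeatureusage" }
}
output {
  elasticsearch {
    hosts => ["https://search.svc.eqiad.wmnet:9243"]  # endpoint is an assumption
    index => "apifeatureusage-%{+YYYY.MM.dd}"
  }
}
```

This keeps the api-feature-usage output isolated in its own pipeline (own queue, own workers) without standing up a second Logstash cluster.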

> Also worth considering is the option of addressing points 1 & 2 within a single logstash cluster using pipeline configurations. Essentially, instead of cloning the log as api-feature-usage-sanitized we could output it to a secondary pipeline, which in turn would output to search.svc.
>
> Personally I'd be inclined to explore using multiple pipelines; assuming it pans out, we could repurpose and expand on that approach for other uses in the future.

I think it's reasonable for us to provide an interface to the data (preferably Kafka), but I also think it's beyond the scope of Observability to maintain the api-feature-usage pipeline. The main problem is that we cannot expect Search and CPT to be on the same version (or even the same software) as we are. This enmeshed situation may become more difficult when Search decides what solution they want to adopt in response to the SSPL license change. On the other side, the ApiFeatureUsage extension is unmaintained and will likely be slow to adopt changes.

If the api-feature-usage pipeline were external to the logging pipeline, the dependencies would be decoupled and all would be free to upgrade independently (unless Kafka does something weird).

> FWIW we do appear to have connectivity between the v7 collectors and search.svc today

I thought this was cleaned up with the move to elk7... bummer.

> In T297239#7554905, @herron wrote:
> If I'm understanding correctly, this would mean ingesting/filtering the mediawiki log topics from kafka-logging using the "gelf_relay" logstash instances on the cirrus elastic hosts? Maybe that's ok, but it does introduce a bit of feature creep beyond the reason the instances were deployed in T288620. There's some extra overhead to plan for: JVM sizing, etc.

There are 71 hosts with role::elasticsearch::cirrus, which would make for much larger consumer groups than are needed. Maybe there's a subset of cirrus hosts that would be better suited to handle additional services like hosting api-feature-usage processing? We only need a few hosts per site.

@lmata @fgiunchedi @colewhite and I discussed this at today's o11y team meeting. To summarize, we have at least two actionable options on the table to move this forward:

  1. Move api-feature-usage into the elk7 cluster as a separate logstash pipeline -- However, this is problematic because it re-introduces constraints on our ability to upgrade the logstash cluster, and deploys a curator job with elevated privileges towards the search cluster onto the logstash hosts.
  2. Deploy a new role for api-feature-usage that decouples this function from the logstash hosts -- In theory this could be deployed to any host, but for the near term we would likely roll out a set of small buster or bullseye apifeatureusage VMs to host it.

Given our limited knowledge of the higher level roadmap/ownership for this service, and our need to migrate away from the legacy logstash cluster, option 2 is preferable. I'll plan to proceed in that direction in the near future, unless a better approach presents itself.

I tested Logstash 7.10 writing api feature usage logs to an ES 6 instance in cloud. Somewhere in the pipeline, the api feature usage logs get assigned two types, which makes ES 6 reject them:

[2021-12-14T20:56:19,352][DEBUG][o.e.a.b.TransportShardBulkAction] [4Js0n17] [apifeatureusage-2021.12.14][0] failed to execute bulk item (index) index {[apifeatureusage-2021.12.14][doc][_Ly7un0B49-t0j9CVvuQ], source[{"agent":"ChangePropagation/WMF","feature":"https-expected","@timestamp":"2021-12-14T20:56:19.118Z","type":"api-feature-usage-sanitized","@version":1}]}
java.lang.IllegalArgumentException: Rejecting mapping update to [apifeatureusage-2021.12.14] as the final mapping would have more than 1 type: [doc, api-feature-usage-sanitized]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:451) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.internalMerge(MapperService.java:399) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:331) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:313) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

Explicitly defining document_type as api-feature-usage-sanitized in the elasticsearch output plugin seemed to rectify this issue.
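For reference, the relevant bit of the elasticsearch output plugin config (the endpoint and index name here are illustrative assumptions; document_type is the plugin option named above):

```
output {
  elasticsearch {
    hosts => ["https://search.svc.eqiad.wmnet:9243"]   # illustrative endpoint
    index => "apifeatureusage-%{+YYYY.MM.dd}"
    # pin the document type so ES 6 sees a single mapping type per index,
    # instead of inheriting a second type from the event's "type" field
    document_type => "api-feature-usage-sanitized"
  }
}
```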

Change 747634 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: add optional document_type parameter to es output config

https://gerrit.wikimedia.org/r/747634

Change 747635 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] role: add apifeatureusage role

https://gerrit.wikimedia.org/r/747635

Change 747636 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: clean up legacy apifeatureusage config

https://gerrit.wikimedia.org/r/747636

Change 744862 abandoned by Herron:

[operations/puppet@production] logstash: move api-feature-usage outputs to elk7 cluster

Reason:

going with another approach, see T297239

https://gerrit.wikimedia.org/r/744862

Change 752211 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] assign role::apifeatureusage::logstash to apifeatureusages[12]001 hosts

https://gerrit.wikimedia.org/r/752211

Change 747635 merged by Cwhite:

[operations/puppet@production] role: add apifeatureusage role

https://gerrit.wikimedia.org/r/747635

Change 754007 had a related patch set uploaded (by Herron; author: Herron):

[labs/private@master] profile::apifeatureusage::logstash: add placeholder secrets for PCC

https://gerrit.wikimedia.org/r/754007

Change 754007 merged by Herron:

[labs/private@master] profile::apifeatureusage::logstash: add placeholder secrets for PCC

https://gerrit.wikimedia.org/r/754007

Planning to move the apifeatureusage pipeline over to the new hosts next week with these switchover steps:

note: we'll keep the same consumer group names as used today, so the new apifeatureusage hosts will pick up where the elk5 hosts left off
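Illustrative kafka input settings for the offset handover; the broker, topic, and group name below are placeholders, not the production values:

```
input {
  kafka {
    bootstrap_servers => "kafka-logging1001.eqiad.wmnet:9093"  # placeholder broker
    topics => ["udp_localhost-info"]                           # placeholder topic
    # reuse the existing consumer group name so the new hosts resume
    # from the committed offsets the elk5 hosts leave behind
    group_id => "apifeatureusage-logstash"                     # placeholder group name
  }
}
```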

Change 747634 merged by Cwhite:

[operations/puppet@production] logstash: add optional document_type parameter to es output config

https://gerrit.wikimedia.org/r/747634

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:58:07Z] <herron> beginning logstash apifeatureusage switchover T297239

Change 752211 merged by Herron:

[operations/puppet@production] assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts

https://gerrit.wikimedia.org/r/752211

Change 755456 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: set logstash-json-tcp monitoring to non-critical

https://gerrit.wikimedia.org/r/755456

apifeatureusage[12]001 are now live, but puppet is currently disabled on these hosts as a couple of small manual fixes had to be put in place to bring the pipeline up:

  • need to add + to the date format in the output template names
  • need to specify ssl_endpoint_identification_algorithm on the kafka inputs

I'll work on persisting these configs.
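A sketch of the two fixes, with values assumed from the description above:

```
input {
  kafka {
    security_protocol => "SSL"
    # explicitly set the endpoint identification algorithm on the kafka input;
    # an empty string disables hostname verification (the exact value needed
    # here is an assumption)
    ssl_endpoint_identification_algorithm => ""
  }
}
output {
  elasticsearch {
    # the leading "+" tells Logstash to expand a date-format sprintf
    # rather than look up a literal event field named "YYYY.MM.dd"
    index => "apifeatureusage-%{+YYYY.MM.dd}"
  }
}
```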

Change 755467 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755456 merged by Herron:

[operations/puppet@production] logstash: set logstash-json-tcp monitoring to non-critical

https://gerrit.wikimedia.org/r/755456

Change 755468 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::apifeatureusage::logstash: update ssl identification and index name

https://gerrit.wikimedia.org/r/755468

Change 755468 merged by Herron:

[operations/puppet@production] profile::apifeatureusage::logstash: update ssl identification and index name

https://gerrit.wikimedia.org/r/755468

Change 755467 merged by Herron:

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

herron claimed this task.

> apifeatureusage[12]001 are now live, but puppet is currently disabled on these hosts as a couple of small manual fixes had to be put in place to bring the pipeline up:
>
>   • need to add + to the date format in the output template names
>   • need to specify ssl_endpoint_identification_algorithm on the kafka inputs
>
> I'll work on persisting these configs.

These changes are now puppetized. There is some lvs/monitoring cleanup needed due to moving the old logstash hosts to role::spare::system, but we'll track that in T281266. Resolving!

Change 756053 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] elasticsearch: write curator logs to stdout

https://gerrit.wikimedia.org/r/756053

Change 756053 merged by Cwhite:

[operations/puppet@production] elasticsearch: write curator logs to stdout

https://gerrit.wikimedia.org/r/756053

Change 757955 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: disable gc logging

https://gerrit.wikimedia.org/r/757955

Change 757955 merged by Cwhite:

[operations/puppet@production] apifeatureusage: disable gc logging

https://gerrit.wikimedia.org/r/757955

Change 758533 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move safepoint logging flag inside gc_log gate

https://gerrit.wikimedia.org/r/758533

Change 758533 merged by Cwhite:

[operations/puppet@production] logstash: move safepoint logging flag inside gc_log gate

https://gerrit.wikimedia.org/r/758533

Change 758970 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: increase logstash heap memory to 2G

https://gerrit.wikimedia.org/r/758970

Change 758970 merged by Cwhite:

[operations/puppet@production] apifeatureusage: increase logstash heap memory to 2G

https://gerrit.wikimedia.org/r/758970

Change 747636 abandoned by Cwhite:

[operations/puppet@production] apifeatureusage: clean up legacy apifeatureusage config

Reason:

this was cleaned up in other patches

https://gerrit.wikimedia.org/r/747636