
Job queue for writes to cloudelastic falling behind
Closed, Resolved · Public

Description

Writes to cloudelastic aren't all making it through; instead the backlog is constantly growing and occasionally resetting: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now

We suspect this is directly related to the job queue rather than cloudelastic itself: the write thread pools on cloudelastic are mostly idle, and we have prior experience with cpjobqueue not running enough concurrent jobs to keep up with writes to the cluster (T300914).

Implement some method that allows the writes to complete as expected.

AC:

  • All expected writes make it to cloudelastic in a timely manner

Event Timeline

The general idea is to add a new parameter to the ElasticaWrite job, jobqueue_partition, that includes both the cluster name and an integer partition number derived as a random number modulo num_partitions. cpjobqueue should then be configured to partition by jobqueue_partition rather than the existing cluster value.
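
As a rough illustration of the derivation described above (a minimal Python sketch of the logic only; the real change lives in the PHP ElasticaWrite job, and the function and parameter names here are placeholders):

import random

def jobqueue_partition(cluster: str, num_partitions: int) -> str:
    # Combine the target cluster name with a random partition number
    # (random % num_partitions) so that jobs for a single cluster are
    # spread across several Kafka partitions instead of just one.
    return f"{cluster}-{random.randrange(num_partitions)}"

# Example: something like "cloudelastic-4"; cpjobqueue partitioning on this
# key lets it run more ElasticaWrite jobs for the same cluster concurrently.
print(jobqueue_partition("cloudelastic", 6))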

EBernhardson triaged this task as High priority.

Change 819751 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

Change 819752 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

Change 819751 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

This will require increasing the partition counts in Kafka for the appropriate topics. Today they should have 3 partitions; we want to change that to 6. The topics live in the main-eqiad and main-codfw Kafka clusters.

The topics:

  • codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite
  • eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite

Change 820190 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T17:55:39Z] <ottomata> increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426

kafka topics --alter --topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6
kafka topics --alter --topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6

Ran in both main-eqiad and main-codfw
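
To double-check that the partition increase took effect, a quick read of the topic metadata could look like the sketch below (illustrative only; the confluent_kafka client and the broker address are assumptions, not part of the commands above):

from confluent_kafka.admin import AdminClient

# Broker address is a placeholder; point it at the relevant main-eqiad/main-codfw broker.
admin = AdminClient({"bootstrap.servers": "kafka-main1001.eqiad.wmnet:9092"})
metadata = admin.list_topics(timeout=10)

topic = "eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"
# Expect 6 partitions after the --alter above.
print(topic, "->", len(metadata.topics[topic].partitions), "partitions")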

Change 820190 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:28:15Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: 9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 23s)

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:31:29Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: 70a18f5846111a0dfe8ba473daf384cbb8e88804: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 13s)

Change 819752 merged by jenkins-bot:

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

While the deployment-charts patch was merged, when SRE went to deploy it the systems reported no changes to the deployment. It's unclear what the necessary next step is to get the cpjobqueue configuration updated.

Change 820536 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820536 merged by Ryan Kemper:

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820553 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

Change 820553 merged by jenkins-bot:

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

bking added subscribers: bking, RKemper.

Change 821800 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] changeprop-jobqueue: further reduce memory

https://gerrit.wikimedia.org/r/821800

Change 821800 merged by Bking:

[operations/deployment-charts@master] changeprop-jobqueue: further reduce memory

https://gerrit.wikimedia.org/r/821800

The JobQueue Job Grafana dashboard shows that concurrency jumped from ~16 to ~25 around the same time the above patches were merged (~Aug 9 at 23:00 UTC). The backlog appears to be staying low. The Saneitizer fix rate on cloudelastic is still high, but we suspect that is related to jobs that were dropped when the backlog grew beyond retention. We expect the Saneitizer fix rate to return to the same level as eqiad/codfw within ~2 weeks of the deployment (around Aug 23 or so). With respect to this ticket, that means the Saneitizer is pushing additional jobs beyond what we normally see from page edits, and the job queue is keeping up with that additional load.

The fix rate on cloudelastic has mostly reverted to the expected low levels seen on the other clusters, but not entirely. While we don't record exactly which wikis fixes are applied to, we know the checks run and metrics are recorded in roughly alphabetical order of wiki. Fixes are being applied towards the beginning and end of each Saneitizer cycle, suggesting commonswiki (or perhaps enwiki) and wikidatawiki are still having fixes applied.

Overall it looks like things are working well, and the background processes are cleaning up the window where updates were lost in the expected fashion.