
Job queue for writes to cloudelastic falling behind
Closed, ResolvedPublic

Description

Writes to cloudelastic aren't all making it through; instead, the queue backlog is constantly growing and occasionally resetting: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now

We suspect this is directly related to the job queue rather than to cloudelastic itself: the write thread pools on cloudelastic are mostly idle, and we have previously seen cpjobqueue fail to run enough concurrent jobs to keep up with writes to the cluster (T300914).

Implement some method that allows the writes to complete as expected.

AC:

  • All expected writes make it to cloudelastic in a timely manner

Event Timeline

The general idea is to add a new parameter to the ElasticaWrite job, jobqueue_partition, that includes both the cluster name and an integer partition number derived as a random number modulo num_partitions. cpjobqueue should then be configured to partition on jobqueue_partition rather than on the existing cluster value.
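As a rough illustration of the partitioning key (Python for readability; the real change is in the CirrusSearch PHP extension, and the helper and parameter names here are hypothetical):

import random

def build_jobqueue_partition(cluster: str, partitions_per_cluster: int) -> str:
    # Combine the target cluster with a random sub-partition number,
    # e.g. "cloudelastic-1", so cpjobqueue can spread a single cluster's
    # writes across several Kafka partitions instead of just one.
    return f"{cluster}-{random.randrange(partitions_per_cluster)}"

# e.g. two sub-partitions per cluster
print(build_jobqueue_partition("cloudelastic", 2))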

EBernhardson triaged this task as High priority.

Change 819751 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

Change 819752 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

Change 819751 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

This will require increasing the partition counts in Kafka for the appropriate topics. Today they should have 3 partitions; we want to change that to 6. The topics live in the main-eqiad and main-codfw Kafka clusters (a rough sketch of why the partition count matters follows the topic list below).

The topics:

  • codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite
  • eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite
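For context on why the partition count matters: a hash-mod style partitioner (as used by Kafka's default producer; cpjobqueue's exact partitioner may differ) can only spread distinct keys across as many partitions as the topic has, so the new key space needs 6 partitions to be useful. A simplified sketch, with crc32 standing in for the real hash and a hypothetical key format of three clusters times two sub-partitions each:

import zlib

def partition_for(key: str, partition_count: int) -> int:
    # Map a partitioning key to a topic partition, hash-mod style.
    return zlib.crc32(key.encode()) % partition_count

# Hypothetical keys; the real cluster names and sub-partition count may differ.
keys = [f"{c}-{i}" for c in ("eqiad", "codfw", "cloudelastic") for i in range(2)]
for count in (3, 6):
    used = {partition_for(k, count) for k in keys}
    print(f"{count} partitions -> {len(used)} distinct partitions used")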

Change 820190 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T17:55:39Z] <ottomata> increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426

kafka topics --alter --topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6
kafka topics --alter --topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6

Ran in both main-eqiad and main-codfw

Change 820190 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:28:15Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: rECIR9961e9bc8f58: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 23s)

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:31:29Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: rECIR70a18f584611: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 13s)

Change 819752 merged by jenkins-bot:

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

Although the deployment-charts patch was merged, when SRE went to deploy it the system reported no changes to the deployment. It's unclear what the necessary next step is to get the cpjobqueue configuration updated.

Change 820536 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820536 merged by Ryan Kemper:

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820553 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

Change 820553 merged by jenkins-bot:

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

bking added subscribers: bking, RKemper.

Change 821800 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] changeprop-jobqueue: further reduce memory

https://gerrit.wikimedia.org/r/821800

Change 821800 merged by Bking:

[operations/deployment-charts@master] changeprop-jobqueue: further reduce memory

https://gerrit.wikimedia.org/r/821800

We can see in the JobQueue Job Grafana dashboard that concurrency jumped from ~16 to ~25 around the same time the above patches were merged (~Aug 9 at 23:00 UTC). The backlog appears to be staying low. The Saneitizer fix rate on cloudelastic is still high, but we suspect that is related to jobs that were dropped when the backlog grew beyond retention. We expect the Saneitizer fix rate to return to the same level as eqiad/codfw within ~2 weeks of the deployment (around Aug 23 or so). With respect to this ticket, that means the Saneitizer is pushing additional jobs beyond what we normally see from page edits, and the job queue is keeping up with that additional load.

The fix rate on cloudelastic has mostly reverted to the expected low levels seen on other clusters, but not entirely. While we don't record exactly which wikis fixes are applied to, we know the checks are applied and metrics are recorded in roughly alphabetical order. Fixes are applied towards the beginning and end of each Saneitizer cycle, suggesting commonswiki (or maybe enwiki) and wikidatawiki are still having fixes applied.

Overall it looks like things are working well, and the background processes are cleaning up the period where updates were lost, as expected.