Job queue for writes to cloudelastic falling behind
Open, High, Public

Description

Not all writes to cloudelastic are making it through; instead, the queue grows continuously and occasionally resets: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now

We suspect this is directly related to jobqueue, rather than cloudelastic itself, as the thread pools for writing in cloudelastic are mostly idle and we have previous experience with cpjobqueue not managing to run enough concurrent jobs to send writes to the cluster (T300914).

Implement some method that allows the writes to complete as expected.

AC:

  • All expected writes make it to cloudelastic in a timely manner

Event Timeline

The general idea is to add a new parameter to the ElasticaWrite job, jobqueue_partition, that includes both the cluster name and an integer partition number derived as random number % num_partitions. cpjobqueue should then be configured to partition by jobqueue_partition rather than by the existing cluster value.
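A minimal sketch of how such a key could be derived, for illustration only. This is not the code in the CirrusSearch patch: the function name, the "cluster-N" key format, and the value used for num_partitions are all assumptions made for the example.

```python
import random
from typing import Optional


def jobqueue_partition(
    cluster: str,
    num_partitions: int,
    rng: Optional[random.Random] = None,
) -> str:
    """Build a partitioning key from a cluster name and a random partition number.

    The "<cluster>-<n>" format is hypothetical; the task only says the key
    includes the cluster name and an integer derived via
    random number % num_partitions.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if rng is None:
        rng = random.Random()
    # Draw a random non-negative integer and reduce it modulo num_partitions,
    # so writes for one cluster are spread over several keys instead of one.
    partition = rng.randrange(2**31) % num_partitions
    return f"{cluster}-{partition}"


if __name__ == "__main__":
    # num_partitions=3 is an arbitrary value chosen for this example.
    print(jobqueue_partition("cloudelastic", 3))
```

Because the key is no longer the bare cluster name, jobs destined for a single cluster can land on several partitions, which is what lets more of them run concurrently.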

EBernhardson triaged this task as High priority.

Change 819751 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

Change 819752 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

Change 819751 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/819751

This will require increasing the partition counts in kafka for the appropriate topics. Today they should have 3 partitions; we want to increase that to 6. The topics live in the main-eqiad and main-codfw kafka clusters.

The topics:

  • codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite
  • eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite

Change 820190 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T17:55:39Z] <ottomata> increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426

kafka topics --alter --topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6
kafka topics --alter --topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 6

Ran in both main-eqiad and main-codfw

Change 820190 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.22] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820190

Change 820191 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.23] Add explicit partitioning key to ElasticaWrite

https://gerrit.wikimedia.org/r/820191

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:28:15Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.22/extensions/CirrusSearch/: 9961e9bc8f5873f8ddc8a11108de0a7bfcb14ae6: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 23s)

Mentioned in SAL (#wikimedia-operations) [2022-08-03T20:31:29Z] <urbanecm@deploy1002> Synchronized php-1.39.0-wmf.23/extensions/CirrusSearch/: 70a18f5846111a0dfe8ba473daf384cbb8e88804: Add explicit partitioning key to ElasticaWrite (T314426) (duration: 03m 13s)

Change 819752 merged by jenkins-bot:

[operations/deployment-charts@master] Change CirrusSearchElasticaWrite partitioning key

https://gerrit.wikimedia.org/r/819752

The deployment-charts patch was merged, but when SRE went to deploy it the systems reported no change to the deployment. It is unclear what next step is needed to get the cpjobqueue configuration updated.

Change 820536 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820536 merged by Ryan Kemper:

[operations/deployment-charts@master] CirrusSearch: bump changeprop version

https://gerrit.wikimedia.org/r/820536

Change 820553 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

Change 820553 merged by jenkins-bot:

[operations/deployment-charts@master] Cirrus: bump changeprop-jobqueue vers

https://gerrit.wikimedia.org/r/820553

bking added subscribers: bking, RKemper.