Partition htmlCacheUpdate job topic
Closed, Resolved · Public

Description

On 03/25, between roughly 11:21 and 11:33, htmlCacheUpdate job concurrency increased from 2.5 to 7.5 jobs. This was caused by an edit on enwiki to the page https://en.wikipedia.org/wiki/Module:Language/data/iana_scripts, which triggered a long sequence of recursive updates. Given that the batchSize for the htmlCacheUpdate job is 300, this overloaded MySQL replication.

In order to avoid that, we need to decrease the htmlCacheUpdate job concurrency. However, it would be better to also partition the htmlCacheUpdate topic according to MySQL replicas, just as we do for the refreshLinks job. Given that we have 8 partitions and the current overall concurrency is 10, 2 concurrent htmlCacheUpdate jobs per partition should be enough.
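To illustrate the idea in the description, here is a minimal Python sketch of routing jobs to partitions by database section. The wiki-to-section entries, the default section, and the helper names are all hypothetical and only for illustration; the actual partitioning rule lives in the change-propagation config and is not shown in this task.

```python
# Hypothetical mapping from wiki database name to MariaDB section.
# These entries are illustrative assumptions, not the production mapping.
SECTION_OF_WIKI = {
    "enwiki": "s1",
    "commonswiki": "s4",
    "wikidatawiki": "s8",
}
DEFAULT_SECTION = "s3"  # assumption: wikis not listed share one default section

# One topic partition per section, so each group of replicas gets its own queue.
SECTIONS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
PARTITION_OF_SECTION = {name: index for index, name in enumerate(SECTIONS)}

PER_PARTITION_CONCURRENCY = 2  # the value proposed in the description


def partition_for(wiki: str) -> int:
    """Return the partition index a job for this wiki would be routed to."""
    section = SECTION_OF_WIKI.get(wiki, DEFAULT_SECTION)
    return PARTITION_OF_SECTION[section]


def max_total_concurrency() -> int:
    """Upper bound on concurrent jobs across all partitions."""
    return len(SECTIONS) * PER_PARTITION_CONCURRENCY
```

Under these assumptions the overall ceiling is 8 × 2 = 16 concurrent jobs, but a burst of recursive updates on a single wiki is limited to the 2 slots of its own partition rather than taking the whole pool.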

Prior to deploying the change, the existing topics for htmlCacheUpdate must be edited manually to add partitions.

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper.

Can do. @Pchelolo, do you want the {eqiad,codfw}.mediawiki.job.htmlCacheUpdate topics bumped to 8 partitions?

@Ottomata yes, but not just yet, we still need to prepare the patches etc.

Oo, actually we should get @herron to do this :)

Change 499193 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition htmlCacheUpdate job according to MariaDB partitioning.

https://gerrit.wikimedia.org/r/499193

Actually, the existing topics need to be left alone, but 2 new topics with 8 partitions each need to be created:

  • eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate
  • codfw.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate

Also, as part of this change I want to rename the partitioned topics for the refreshLinks job, so the following topics with 8 partitions should be created as well:

  • eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks
  • codfw.cpjobqueue.partitioned.mediawiki.job.refreshLinks

Change 499209 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition htmlCacheUpdate job according to MariaDB partitioning.

https://gerrit.wikimedia.org/r/499209

Done in main-eqiad and main-codfw:

[@kafka1001:/home/otto] $ kafka topics --describe | grep -E '^Topic:.*cpjobqueue\.partitioned'
Topic:codfw.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:codfw.cpjobqueue.partitioned.mediawiki.job.refreshLinks	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks	PartitionCount:8	ReplicationFactor:3	Configs:
[@kafka2001:/home/otto] $ kafka topics --describe | grep -E '^Topic:.*cpjobqueue\.partitioned'
Topic:codfw.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:codfw.cpjobqueue.partitioned.mediawiki.job.refreshLinks	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	PartitionCount:8	ReplicationFactor:3	Configs:
Topic:eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks	PartitionCount:8	ReplicationFactor:3	Configs:
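A check like the one above can also be scripted. The following is a rough Python sketch that parses describe output in the format pasted here and reports any topic whose partition count differs from the expected 8. The function names are hypothetical, and the sketch assumes the whitespace-separated `Topic:` / `PartitionCount:` / `ReplicationFactor:` line layout shown in this comment.

```python
import re
from typing import Dict, List, Tuple

# Matches summary lines of the form:
# Topic:<name> PartitionCount:<n> ReplicationFactor:<n> Configs:
LINE_RE = re.compile(
    r"^Topic:(\S+)\s+PartitionCount:(\d+)\s+ReplicationFactor:(\d+)"
)


def parse_describe(output: str) -> Dict[str, Tuple[int, int]]:
    """Map topic name to (partition count, replication factor)."""
    topics = {}
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            topics[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return topics


def topics_with_unexpected_partitions(output: str, expected: int = 8) -> List[str]:
    """Return the sorted names of topics whose partition count is not `expected`."""
    parsed = parse_describe(output)
    return sorted(name for name, (count, _) in parsed.items() if count != expected)
```

An empty list from `topics_with_unexpected_partitions` on the output of both clusters would mean all listed topics have the expected number of partitions.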

\o/ Thanks @Ottomata . For posterity, the plan is to go forward with this tomorrow, 2019-03-28.

Change 499193 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition htmlCacheUpdate job according to MariaDB partitioning.

https://gerrit.wikimedia.org/r/499193

Mentioned in SAL (#wikimedia-operations) [2019-03-28T13:12:41Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@17285f8]: Partition htmlCacheUpdate topic, step 1 T219159

Mentioned in SAL (#wikimedia-operations) [2019-03-28T13:14:26Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@17285f8]: Partition htmlCacheUpdate topic, step 1 T219159 (duration: 01m 46s)

Change 499765 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Explicitly exclude htmlCacheUpdate from regex rule.

https://gerrit.wikimedia.org/r/499765

Change 499765 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Explicitly exclude htmlCacheUpdate from regex rule.

https://gerrit.wikimedia.org/r/499765

Mentioned in SAL (#wikimedia-operations) [2019-03-28T13:21:25Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@c120b38]: Partition htmlCacheUpdate topic, explicitly exclude htmlCacheUpdate T219159

Mentioned in SAL (#wikimedia-operations) [2019-03-28T13:22:13Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@c120b38]: Partition htmlCacheUpdate topic, explicitly exclude htmlCacheUpdate T219159 (duration: 00m 48s)

Change 499209 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition htmlCacheUpdate job according to MariaDB partitioning.

https://gerrit.wikimedia.org/r/499209

Mentioned in SAL (#wikimedia-operations) [2019-03-28T14:31:12Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@3a8a889]: Partition htmlCacheUpdate topic, step 2 T219159

Mentioned in SAL (#wikimedia-operations) [2019-03-28T14:32:05Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@3a8a889]: Partition htmlCacheUpdate topic, step 2 T219159 (duration: 00m 53s)

Change 499782 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Remove temporary auto.offset.reset override and increase htmlCacheUpdate concurr.

https://gerrit.wikimedia.org/r/499782

Change 499782 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Remove temporary auto.offset.reset override and increase htmlCacheUpdate concurr.

https://gerrit.wikimedia.org/r/499782

Mentioned in SAL (#wikimedia-operations) [2019-03-28T14:45:11Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@4deeb04]: Partition htmlCacheUpdate topic, final cleanup stage T219159

Mentioned in SAL (#wikimedia-operations) [2019-03-28T14:46:03Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@4deeb04]: Partition htmlCacheUpdate topic, final cleanup stage T219159 (duration: 00m 52s)

We have deployed the partitioner for the htmlCacheUpdate job and it's now running in production. We created some lag in the process, but it should clear out soon.