
Partition CirrusSearch mediawiki jobs by cluster
Closed, Resolved (Public)

Description

The recent deployment of cloudelastic, the third elasticsearch cluster we write to, has made our MediaWiki job response times for writes incredibly erratic. Additionally, cloudelastic isn't nearly as powerful (~1/10 the size) and can't always keep up with the full update rate. To support this use case we want to partition these jobs so that each cluster can be written to independently.

The overall goal is to allow cloudelastic to fall behind and catch back up at its own pace, independent of the primary clusters. Any slowdown with cloudelastic needs to have little, if any, impact on writes to the primary clusters.
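A minimal sketch of the intended behavior (Python, with assumed cluster names; the real implementation is MediaWiki's job queue and cpjobqueue, not in-process queues):

```python
from collections import deque

# One independent queue per cluster, instead of a single job that
# writes to all clusters in lockstep.
CLUSTERS = ["eqiad", "codfw", "cloudelastic"]  # assumed cluster names
queues = {cluster: deque() for cluster in CLUSTERS}

def enqueue_write(doc_id):
    """Enqueue one write job per cluster; each queue drains independently."""
    for cluster in CLUSTERS:
        queues[cluster].append(doc_id)

def drain(cluster, batch_size):
    """Process up to batch_size jobs for one cluster. A slow cluster
    (e.g. cloudelastic) simply falls behind without blocking the others."""
    done = []
    for _ in range(min(batch_size, len(queues[cluster]))):
        done.append(queues[cluster].popleft())
    return done

for doc in ["page:1", "page:2", "page:3"]:
    enqueue_write(doc)

# The primary clusters keep up...
drain("eqiad", 3)
drain("codfw", 3)
# ...while cloudelastic processes at its own pace and retains a backlog.
drain("cloudelastic", 1)
```

The key property is that a backlog in one queue never appears in another: cloudelastic's lag is its own problem to work off.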

Event Timeline

bd808 renamed this task from Partition CirrusSerch mediawiki jobs by cluster to Partition CirrusSearch mediawiki jobs by cluster.Aug 16 2019, 9:08 AM
bd808 updated the task description.

@Pchelolo is there an action for Core Platform on this?

@kchapman yes. After the search team makes the jobs ready to be partitioned.

We need to rework our updater a little bit to share some expensive work ahead of the partitioned jobs, but pull the ContentHandler data per partition. It shouldn't be that much work, but it needs to be done on our end so that the cirrusSearchElasticaWrite job can be partitioned.

Change 548932 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Enqueue a job per cluster to write to

https://gerrit.wikimedia.org/r/548932

Change 548932 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enqueue a job per cluster to write to

https://gerrit.wikimedia.org/r/548932

Change 551895 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition CirrusSearchElasticaWrite jobs

https://gerrit.wikimedia.org/r/551895

@Pchelolo This patch will roll out with the next train (early Dec?). What will happen at that point is:

  • The current CirrusSearchLinksUpdate jobs will no longer write to elasticsearch; instead they will enqueue a document update per cluster as CirrusSearchElasticaWrite jobs
  • The total number of CirrusSearchLinksUpdate(Prioritized) jobs should stay consistent with what we have now, but they should run much more quickly since they don't have to fetch complete page content or wait for elasticsearch to confirm a write. They currently run with a latency of ~700ms and a concurrency of 100-200. I would expect this latency to drop dramatically, probably under 100ms. Likely no immediate change needs to happen here, but after deployment we can check stats and drop the configured concurrency limits.
  • There will now be approximately 3x as many ElasticaWrite jobs as there were CirrusSearchLinksUpdate jobs. A ballpark estimate for latency is 300ms: basically the current 700ms divided by three and rounded up a bit. We almost certainly need to increase concurrency here; using the current links update level (300) is almost certainly safe, and we can adjust from there.
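The split described above can be sketched roughly as follows (a Python illustration with hypothetical function names and stand-in I/O; the real logic lives in CirrusSearch's PHP updater):

```python
CLUSTERS = ("eqiad", "codfw", "cloudelastic")  # assumed cluster names

def cirrus_links_update(page_id, enqueue):
    # Cheap, shared part: runs once per edit and no longer talks to
    # elasticsearch at all, which is why its latency should drop sharply.
    for cluster in CLUSTERS:
        enqueue({
            "type": "cirrusSearchElasticaWrite",
            "cluster": cluster,      # the partition key
            "page_id": page_id,
        })

def cirrus_elastica_write(job, fetch_content, write_to_cluster):
    # Per-partition part: pulls the ContentHandler data and performs the
    # actual write against exactly one cluster.
    doc = fetch_content(job["page_id"])
    write_to_cluster(job["cluster"], doc)

# Demo with stand-in I/O callbacks:
jobs = []
cirrus_links_update(42, jobs.append)

writes = []
for job in jobs:
    cirrus_elastica_write(
        job,
        fetch_content=lambda pid: {"page_id": pid},           # stand-in
        write_to_cluster=lambda c, d: writes.append((c, d)),  # stand-in
    )
```

This is also where the 3x job count comes from: one LinksUpdate now fans out into one ElasticaWrite per cluster.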

Summary:

  • We need to increase concurrency on CirrusSearchElasticaWrite to 300 prior to the next train
  • We will want to decrease CirrusSearchLinksUpdate concurrency post-deploy
  • We need to partition CirrusSearchElasticaWrite job queues by a job parameter.

I've put up a patch to this effect.
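The third point, partitioning by a job parameter, amounts to routing each job to a per-cluster queue keyed on that parameter. A minimal sketch (the topic naming and parameter name are illustrative assumptions, not the exact cpjobqueue configuration):

```python
def partition_topic(job,
                    base_topic="mediawiki.job.cirrusSearchElasticaWrite",
                    partitions=("eqiad", "codfw", "cloudelastic")):
    # Route a CirrusSearchElasticaWrite job to a per-cluster topic based
    # on a job parameter, so each cluster's consumer runs independently.
    cluster = job["params"]["cluster"]
    if cluster not in partitions:
        raise ValueError("unknown cluster: %s" % cluster)
    return "%s.%s" % (base_topic, cluster)

topic = partition_topic({"params": {"cluster": "cloudelastic"}})
```

Each resulting topic can then get its own concurrency limit, which is what lets cloudelastic lag without holding back the primaries.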

Change 551895 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Partition CirrusSearchElasticaWrite jobs

https://gerrit.wikimedia.org/r/551895

Mentioned in SAL (#wikimedia-operations) [2019-11-26T20:05:13Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@2b713d6]: Partition CirrusSearchElasticaWrite jobs T230495

Mentioned in SAL (#wikimedia-operations) [2019-11-26T20:06:36Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@2b713d6]: Partition CirrusSearchElasticaWrite jobs T230495 (duration: 01m 23s)

Change 553179 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Followup on I1ba56c242e6b37c11572fa62a9b6b0fc1635861d

https://gerrit.wikimedia.org/r/553179

Change 553179 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Followup on I1ba56c242e6b37c11572fa62a9b6b0fc1635861d

https://gerrit.wikimedia.org/r/553179

Mentioned in SAL (#wikimedia-operations) [2019-11-26T20:24:26Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@c282e86]: Followup on T230495

Mentioned in SAL (#wikimedia-operations) [2019-11-26T20:25:26Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@c282e86]: Followup on T230495 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-12-02T17:28:33Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@deafe56]: Followup on cirrusSearchElasticWrite partitioning T230495

Mentioned in SAL (#wikimedia-operations) [2019-12-02T17:29:47Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@deafe56]: Followup on cirrusSearchElasticWrite partitioning T230495 (duration: 01m 14s)

It looks like, after the last deploy, the jobs are being partitioned by cluster correctly.

Thanks! I'll keep an eye on things and see how this goes as the train rolls forward this week and we shift all the updates into these partitioned jobs.

Looks like there's been quite an increase in the insertion rate of cirrusSearchElasticaWrite jobs since wmf.8 rolled out. Is this expected?

[Graph: cirrusSearchElasticaWrite job insertion rate]

@Mholloway yes, it is expected; previously this topic was only used to replay failed updates to elasticsearch.
As Erik mentioned in a previous comment:

There will now be approximately 3x as many ElasticaWrite jobs as there were CirrusSearchLinksUpdate jobs. A ballpark estimate for latency is 300ms: basically the current 700ms divided by three and rounded up a bit. We almost certainly need to increase concurrency here; using the current links update level (300) is almost certainly safe, and we can adjust from there.