
Batch updates create slave lag on s3 over WAN
Closed, Resolved · Public

Description

When updating slow special pages from terbium, such as Listredirects, certain rows of the querycache table are deleted and then inserted.

Under normal circumstances, those updates do not create a problem. However, I believe a combination of factors can make them cause slave lag:

  • Special pages on s3 are updated, which means hundreds of updates per wiki, regardless of wiki size, multiplied by the hundreds of wikis on that shard. No other shard has >800 wikis.
  • WAN replication has higher latency than same-datacenter replication
  • Other writes are happening at the same time, such as updating pagelinks or wbc_entity_usage
  • ROW-based replication is used
  • The replication topology is not very flat (there are now 4 tiers, which is not desirable)

Given that special page updates are not time-sensitive, I would like to:
a) Introduce pauses between wiki updates or, better, check that the updates have been applied to >50% of the slaves (including remote slaves) before continuing
b) Make the updates non-transactional, splitting the filling of those tables into several smaller transactions
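
As a rough illustration of how (a) and (b) could fit together, here is a minimal Python sketch. It is not MediaWiki code: `write_in_batches`, `write_batch` and `get_lags` are hypothetical names, and the sketch assumes the caller can supply a per-slave lag reading in seconds (for example derived from `Seconds_Behind_Master`), with `None` meaning the lag is unknown.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


def enough_replicas_caught_up(
    lags: Iterable[Optional[float]], max_lag: float, quorum: float = 0.5
) -> bool:
    """Return True when more than `quorum` of the slaves report lag <= max_lag.

    A lag of None (unknown, e.g. replication stopped) counts as not caught up.
    """
    lags = list(lags)
    if not lags:
        return True  # nothing to wait for
    ok = sum(1 for lag in lags if lag is not None and lag <= max_lag)
    return ok / len(lags) > quorum


def write_in_batches(
    rows: Sequence[T],
    write_batch: Callable[[List[T]], None],           # hypothetical: one short transaction per call
    get_lags: Callable[[], Iterable[Optional[float]]],  # hypothetical: lag per slave, remote ones included
    batch_size: int = 100,
    max_lag: float = 1.0,
    poll_interval: float = 0.5,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write `rows` in small batches, pausing until most slaves have caught up.

    Returns the number of batches written.
    """
    batches = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)
        batches += 1
        polls = 0
        while not enough_replicas_caught_up(get_lags(), max_lag):
            polls += 1
            if polls >= max_polls:
                raise TimeoutError("slaves did not catch up in time")
            sleep(poll_interval)
    return batches
```

The batch size, lag threshold and polling interval are placeholder values; the real ones would depend on the WAN latency and on how much lag the remote slaves can tolerate.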

Event Timeline

jcrespo created this task. Dec 25 2015, 11:04 AM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added a subscriber: jcrespo.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 25 2015, 11:04 AM
jcrespo added a project: Operations. Edited · Dec 25 2015, 1:22 PM

I cannot say for sure whether it is the special page updates or the wbc_entity_usage updates, but it is one of the two.

I see lots of:

UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch 127.0.0.1 */  `wbc_entity_usage` SET eu_touched = '20151225132251' WHERE eu_row_id IN ('613395','476260','476261','613397','523272','476258','476259','525131','476254','394381','476252','476253','543080')

Setting db2018 to MIXED binlog format temporarily to see if that helps.

jcrespo renamed this task from Batch update of special pages creates slave lag on s3 over WAN to Batch updated create slave lag on s3 over WAN. Dec 25 2015, 1:27 PM
jcrespo set Security to None.
jcrespo renamed this task from Batch updated create slave lag on s3 over WAN to Batch updates create slave lag on s3 over WAN. Dec 25 2015, 1:32 PM
jcrespo added a subscriber: hoo. Jan 6 2016, 8:41 PM
jcrespo closed this task as Resolved. Feb 5 2016, 1:24 PM
jcrespo claimed this task.

MIXED fixed the ongoing issue as a workaround, but the root causes are still there and have to be fixed: pagelinks and/or wbc_entity_usage write activity.