
[RFC] improve parsercache replication, sharding and HA
Open, MediumPublic

Description

@aaron, @ori, thank you for your work on the emergency parsercache key implementation. I want to track the pending tasks related to it here, starting with some discussion:

  • Should we slowly change the keys to something more reasonable (e.g., the names of the shards: pc1, pc2, pc3), changing one key per server pair until all old keys have expired? Would changing one key at a time also affect the sharding function for the others?
  • Should we implement a more deterministic sharding function? As far as I know, the target server currently depends on the key and the number of pooled servers, which means that during maintenance (it is very typical to depool one server at a time) keys get redistributed more or less randomly among the remaining servers. It could instead be that only the keys destined for the depooled server get redirected, while keys already placed by the previous function keep going to the right servers; for example, by keeping the key-to-shard mapping fixed but pointing the affected shard to a NULL server. Maybe the rule should be to fail over cross-datacenter? Should we buy two servers per "shard" and datacenter so the mapping always stays the same? (See the sketch after this list.)
  • As a more long-term question, how should parsercache be handled for active-active? Is that something the parsercache architecture should know about, or should we resolve it at the MediaWiki "routing" layer?
  • Could we have a hot-swap mechanism (we now have a spare host), or something else entirely, that allows automatic failure detection and recovery and works for the parsercache model? (This also depends on how the previous question is handled.)
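
A minimal sketch of the sharding concern in the second bullet, in purely illustrative PHP (the function names, server names and hashing choice below are assumptions, not the actual MediaWiki code):

// With sharding that depends on the number of pooled servers, depooling one
// server silently moves most keys to a different server:
function shardByPooledServers( string $key, array $pooled ): string {
    return $pooled[ abs( crc32( $key ) ) % count( $pooled ) ];
}
// shardByPooledServers( $k, [ 'server1', 'server2', 'server3' ] ) differs from
// shardByPooledServers( $k, [ 'server1', 'server3' ] ) for roughly 2/3 of keys.

// With a fixed key-to-shard map, a depooled server only affects its own shard;
// null here would mean "treat every request for this shard as a miss".
$shardMap = [ 'pc1' => 'server1', 'pc2' => 'server2', 'pc3' => null ];
function shardByFixedMap( string $key, array $shardMap ): ?string {
    $shards = array_keys( $shardMap );
    return $shardMap[ $shards[ abs( crc32( $key ) ) % count( $shards ) ] ];
}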

Details

Related Gerrit Patches:

Event Timeline

jcrespo created this task.Apr 25 2016, 9:54 AM
Restricted Application added a subscriber: Aklapper.Apr 25 2016, 9:54 AM
jcrespo moved this task from Triage to Backlog on the DBA board.Apr 25 2016, 9:54 AM
Joe added a subscriber: Joe.May 2 2016, 4:16 PM
jcrespo added a project: Operations.

One option would be to shift sharding to the same model as External Storage: shard by a static key value and separate the contents into different tables, so that they can be moved anywhere while keeping the subset of keys per table fixed.

'pc1' => 'server1',
'pc2' => 'server1',
'pc3' => 'server3',

{pc1-001, pc1-002, pc2-001, pc2-002, ...}

That way there is something constant about which keys live on a server that is down or under maintenance, and its contents can later be cleaned up, migrated or replicated.
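
A sketch of how a key might resolve to a table and server under this external-storage-like model, again in purely illustrative PHP (the table count, helper name and example key are assumptions; the real WMF layout is similar in spirit but not identical):

// The logical shard (pc1/pc2/pc3) and the table number are both derived from
// the key alone, so a shard's physical server can change in the map without
// redistributing any keys.
$servers = [ 'pc1' => 'server1', 'pc2' => 'server1', 'pc3' => 'server3' ];
$tablesPerShard = 256; // assumed table count per logical shard

function locateKey( string $key, array $servers, int $tablesPerShard ): array {
    $shards = array_keys( $servers );
    $hash   = abs( crc32( $key ) );
    $shard  = $shards[ $hash % count( $shards ) ];
    $table  = sprintf( '%s-%03d', $shard, $hash % $tablesPerShard ); // e.g. "pc2-013"
    return [ $servers[ $shard ], $table ];
}

list( $server, $table ) = locateKey( 'enwiki:pcache:example-key', $servers, $tablesPerShard );
// -> run the SELECT/REPLACE for this key against table $table on host $server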

Another option worth considering would be to store the parser cache in Cassandra, and leverage its multi-DC and sharding functionality.

aaron added a comment.May 4 2016, 4:22 PM

Another option worth considering would be to store the parser cache in Cassandra, and leverage its multi-DC and sharding functionality.

I was tempted to suggest this already per NIH.

jcrespo added a comment.EditedMay 5 2016, 8:13 AM

I would not consider only Cassandra (it has had reliability issues), but yes, something outside of MySQL.

chasemp triaged this task as Medium priority.May 5 2016, 8:45 PM

All of this is good, but realistically I would be happy with a couple of fixes for now:

  1. Server connection failures should not be fatal, and servers should be "depooled" the same way they are when they have replication lag
  2. Query failures on parsercache, for both reads and writes, should not be fatal (read-only mode, killed queries, out-of-band changes, or any other issue)
  3. Move parsercache replication to statement-based, set all servers to read-write mode, accept eventual (loose) consistency, and turn misses into INSERT IGNORE/REPLACE writes (see the sketch after this list)
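
A minimal sketch of the write path item 3 implies, using plain PDO against an assumed objectcache-style table (the table name, key, columns and TTL are placeholders; this is not the actual SqlBagOStuff code):

// On a cache miss, write with REPLACE so the statement is idempotent and safe
// to apply on both datacenters under STATEMENT-based replication.
$pdo = new PDO( 'mysql:host=server1;dbname=parsercache', 'wikiuser', 'secret' );
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

$serializedParserOutput = '...serialized ParserOutput blob...'; // placeholder
$write = $pdo->prepare(
    'REPLACE INTO pc001 (keyname, value, exptime) VALUES (:key, :value, :exptime)'
);
try {
    $write->execute( [
        ':key'     => 'enwiki:pcache:example-key',
        ':value'   => $serializedParserOutput,
        ':exptime' => gmdate( 'YmdHis', time() + 30 * 86400 ), // assumed 30-day TTL
    ] );
} catch ( PDOException $e ) {
    // Per items 1 and 2: a failed cache write is logged, never fatal.
    trigger_error( 'parsercache write failed: ' . $e->getMessage(), E_USER_WARNING );
}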

All of this holds regardless of whether we replace the storage backend. I can start working on this because "it should be just a couple of line fixes", but we should first agree that this is a good model (there is a danger in transparent failure). Some of it may already be true; if so, we need to tune the logging accordingly. This is also tightly associated with T132317.

aaron added a comment.May 9 2016, 11:45 PM

Read/write exceptions are already supposed to be caught in handleReadError()/handleWriteError(). Are there backtraces of exceptions that incorrectly bubbled up anyway?

Writes generate errors at a rate of 1,000-10,000 per second; I am concerned about the logging infrastructure, specifically this:

https://logstash.wikimedia.org/#dashboard/temp/AVSZZJTzctfB5Pje7jwt (the period since one datacenter was marked as read-only and traffic was directed to the other datacenter)

I see that REPLACE is already being run, which means I could set both sites replicating AND in read-write mode, either by switching to STATEMENT-based replication or by setting the replicas to IDEMPOTENT exec mode (so they would not be guaranteed to have 100% identical contents). OK with that? But I am still worried about having a SPOF for each individual key, especially given that those servers do not have redundant disks.
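
For reference, the two alternatives correspond roughly to the following MySQL settings, shown here as a purely illustrative PDO snippet (host and credentials are placeholders; in practice this would be applied through configuration management rather than a script):

// Two ways to let both datacenters apply each other's cache writes.
// Only one of them would actually be chosen.
$admin = new PDO( 'mysql:host=server1', 'admin', 'secret' );
$admin->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

// Option A: replicate the literal REPLACE statements, which are idempotent.
$admin->exec( "SET GLOBAL binlog_format = 'STATEMENT'" );

// Option B: keep row-based images, but make the replica silently resolve
// duplicate-key and missing-row conflicts when applying them.
$admin->exec( "SET GLOBAL slave_exec_mode = 'IDEMPOTENT'" );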

Change 288362 had a related patch set uploaded (by Jcrespo):
Change binlog format for parsercaches to STATEMENT

https://gerrit.wikimedia.org/r/288362

Writes generate errors at a rate of 1,000-10,000 per second; I am concerned about the logging infrastructure, specifically this:
https://logstash.wikimedia.org/#dashboard/temp/AVSZZJTzctfB5Pje7jwt (the period since one datacenter was marked as read-only and traffic was directed to the other datacenter)
I see that REPLACE is already being run, which means I could set both sites replicating AND in read-write mode, either by switching to STATEMENT-based replication or by setting the replicas to IDEMPOTENT exec mode (so they would not be guaranteed to have 100% identical contents). OK with that? But I am still worried about having a SPOF for each individual key, especially given that those servers do not have redundant disks.

Sounds OK to me: MediaWiki validates cache entries against the main DBs on get(), so ignored conflicts don't matter, since no strict chronology model is needed. Either read-write method seems fine to me.

Change 288362 merged by Jcrespo:
Change binlog format for parsercaches to STATEMENT

https://gerrit.wikimedia.org/r/288362

I've just applied the STATEMENT/REPLACE option. This basically avoids all the issues caused by the passive datacenter being in read-only mode. Both disk-based caches are now in read-write mode, which means the passive datacenter can overwrite the active one's cache (misses get inserted on the local datacenter). As aaron mentions, I also believe it will not create issues, and it both simplifies the failover process and enables multi-DC.

I will document the current model with T133337.

The only thing I have pending here is the sharding key, but that may take some time and we may want to solve it "properly" (with a different data storage). I would like to run a "what would happen if one server stopped responding" test. Even if it is handled "automatically", I am not sure it would be very transparent, because of the timeout plus a large number of misses (i.e. a SPOF). Since these are caches, I decided not to have disk redundancy, and that is an issue for availability.

@jcrespo What's the current status of this task? Is this intended as a general request for comments on how to handle various operational questions, or should this also be handled by TechCom? The first two points in this task seem like changes/questions that can be handled within configuration changes and wouldn't need/require TechCom involvement. The third one is a bigger question regarding active-active that perhaps would make sense to have a TechCom-RFC about.

See my latest comments on: T167784#3961866

The third one is a bigger question regarding active-active that perhaps would make sense to have a TechCom-RFC about.

Funnily enough, the third question is already resolved: we already can and do have a write-anywhere parsercache. MySQL, contrary to popular belief, can run in a master-master configuration; the problem is getting consistency with complex table dependencies. With a NoSQL-like workload, such as the parsercaches, this is already a reality (writes on codfw appear on eqiad and the other way around). It is disabled right now because that allows for easier testing and maintenance, but we have been running like that for quite some time (over a year).

We can have a conversation, but I do not think it is a MediaWiki conversation so much as a WMF one, about whether SRE should be given more control over things like sharding and failover for a better operational experience; I believe most of that would not involve code/app changes.

@jcrespo Thanks, I'll untag our team for now then. Let me know if there's anything we can do.

A bit of a recap on the original questions:

  • Parsercache keys were renamed to pc1, pc2, pc3 in T210725
  • Parsercaches are write-write and (can) send their results to the other datacenter. At the moment, replication in the codfw -> eqiad direction is disabled until both DCs are active-active for reads
  • A redundant host was purchased for each datacenter for maintenance, but at the moment the switch to it is not automatic. It would be nice to have an automatic method to switch, either in the MediaWiki load balancer or through a proxy, so it doesn't need manual interaction (a rough sketch follows this list)
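
A rough sketch of what such an automatic switch could look like at the load-balancer level, with hypothetical helper names and a single shared spare host (illustrative only; an actual implementation could equally live in a proxy layer):

// Each shard has a primary and a spare; if the primary does not answer a
// cheap health check, reads and writes go to the spare while the
// key-to-shard mapping stays untouched.
$shards = [
    'pc1' => [ 'primary' => 'server1', 'spare' => 'server4' ],
    'pc2' => [ 'primary' => 'server2', 'spare' => 'server4' ],
    'pc3' => [ 'primary' => 'server3', 'spare' => 'server4' ],
];

function isReachable( string $host, int $timeoutSeconds = 1 ): bool {
    $conn = @fsockopen( $host, 3306, $errno, $errstr, $timeoutSeconds );
    if ( $conn === false ) {
        return false;
    }
    fclose( $conn );
    return true;
}

function serverForShard( string $shard, array $shards ): string {
    $primary = $shards[$shard]['primary'];
    return isReachable( $primary ) ? $primary : $shards[$shard]['spare'];
}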

Three additional items/proposals regarding purging (a minimal purge sketch follows the list):

  • Smarter purging: maybe something priority-queue based, while still respecting TTLs; I am not sure about the details
  • Review purging so it works well for passive hosts and for our host failover scenario (which is memcached-like: the other hosts take over the keys of the failed one)
  • Potentially enabling purging on both datacenters, as this service is write-write, active-active
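
A minimal sketch of a TTL-respecting purge pass over the assumed table layout used above (table names, batch size and credentials are placeholders; in production this is roughly the job of MediaWiki's purgeParserCache.php maintenance script):

// Delete expired rows in small batches so the purge can run on either
// datacenter, or on a passive/spare host, without holding long locks.
$pdo = new PDO( 'mysql:host=server1;dbname=parsercache', 'wikiadmin', 'secret' );
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

$batchSize = 1000;               // placeholder batch size
$cutoff    = gmdate( 'YmdHis' ); // "now", in the same format the writes use
foreach ( [ 'pc001', 'pc002', 'pc003' ] as $table ) { // placeholder table names
    do {
        // Note: DELETE ... LIMIT is not deterministic under STATEMENT
        // replication; a production purge would iterate by key range instead.
        $deleted = $pdo->exec( sprintf(
            'DELETE FROM %s WHERE exptime < %s LIMIT %d',
            $table, $pdo->quote( $cutoff ), $batchSize
        ) );
        sleep( 1 ); // be gentle with replication lag
    } while ( $deleted === $batchSize );
}
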
jcrespo updated the task description. (Show Details)Oct 9 2019, 11:10 AM
jcrespo renamed this task from [RFC] improve parsercache replication and sharding handling to [RFC] improve parsercache replication, sharding and HA.Oct 9 2019, 11:13 AM
jcrespo updated the task description. (Show Details)