Maniphest T119626

Eliminate SPOF at the main database infrastructure
Open, Low, Public

Assigned To

None

Authored By

	• jcrespo
	Nov 25 2015, 2:42 PM

Description

Philosophy:

Never perform "failovers"
No passive/unused hardware
No SPOF for service problems
All servers can be stopped at any time for maintenance
Replication channel is also not a SPOF
Independence from specific technologies/vendors

  Master-Master active-passive replication                       -> Regular replication or semisync
            with backup channel                                  => Synchronous replication (maybe)
   ---------------------------------------                        *> Client connection
   |         -------------------         |
   |         |                 |         |
   v         v                 v         v
master    [RO]slave1      [RO]master  [RO]slave1 (galera, GTID or
 eqiad <=> eqiad            codfw  <=> codfw      binlog servers)   Application servers
   |\       /|                 |\       /|                          ------------------
   |  \   /  |   semisync      |  \   /  |                          |   Mediawiki    |
   |    X    |  replication    |    X    |                          |       *        |
   |  /   \  |                 |  /   \  |                          |       *        |
   |/       \|                 |/       \|                          |       v        |
   v         v                 v         v                          |     Proxy      |
 slave2    slave3 ...        slave2    slave3 ... <********************(synchronized <============== etcd configuration
 eqiad     eqiad             codfw     codfw                        |   fleet-wide)  |
                                                                    ------------------

Related Objects

Status	Subtype	Assigned	Task
Open		None	T119626 Eliminate SPOF at the main database infrastructure
Resolved		• jcrespo	T119642 Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases)
Resolved		• jcrespo	T133385 Implement GTID replication on MariaDB 10 servers
Resolved		• jcrespo	T111992 Physical location SPOF because of database server distribution on a single rack (D1)
Resolved		• jcrespo	T133398 Install, configure and provision recently arrived db core machines
Resolved		• Cmjohnson	T135253 Rack and set up 16 db's db1079-1094
Resolved		• Marostegui	T141547 Setup automatic failover for misc database servers
			Restricted Task
Resolved	PRODUCTION ERROR	• mmodell	T190960 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query"
Declined		None	T156475 Investigate spike in 500s during asw-c2-eqiad replacement
Resolved		tstarling	T198049 Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC
Resolved		• mobrovac	T202107 Job queue should not overload the DB servers when there is replication lag

Event Timeline

• jcrespo created this task. Nov 25 2015, 2:42 PM

• jcrespo raised the priority of this task from to Needs Triage.

• jcrespo updated the task description.

• jcrespo added projects: SRE, DBA.

• jcrespo subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 25 2015, 2:42 PM

• jcrespo renamed this task from [EPIC] Eliminate SPOF at the main database infrastructure to Eliminate SPOF at the main database infrastructure. Nov 25 2015, 6:06 PM

• jcrespo added a project: Epic.

• jcrespo set Security to None.

• jcrespo added a subtask: T119642: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases). Nov 25 2015, 6:17 PM

Some notes on Galera that affect MediaWiki (some bits from https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/):

GET_LOCK() is used in a few places. In Galera, the lock is local to the MariaDB server the client talks to, so something would have to change there.
When we write to multiple DBs, it's a best-effort (BEGIN, BEGIN)...(COMMIT, COMMIT) in commitMasterChanges(). This works well with S2PL (as InnoDB and SQLite use), since rollbacks are very rare (e.g. connection loss after the first COMMIT, or an internal commit failure on a node). Galera introduces optimistic transaction failures (e.g. when conflicting changes happen on other nodes and are first in the Group Broadcast queue). This might increase the chances of (COMMIT, ROLLBACK). Some of this can probably be improved by making sure centralauth updates are committed first, or by having some other secondary updates use the jobqueue (flowing from the main wiki DB to the foreign one, for example). I could add logging code to see which multi-DB transactions we are actually doing and which of them can be made more robust. This is partly a pre-existing problem for MediaWiki, since COMMIT can certainly fail, and it also applies to Postgres (which we claim to support) when using SERIALIZABLE or REPEATABLE-READ (which is really SNAPSHOT, having first-committer-wins).
LOCK IN SHARE MODE has some bugs (https://github.com/codership/galera/issues/336#issuecomment-136635018) (mentioned at https://aphyr.com/posts/327-call-me-maybe-mariadb-galera-cluster)
Any tables without PKs will probably block this (T17441: Some tables lack unique or primary keys, may allow confusing duplicate data)
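
To make the (COMMIT, ROLLBACK) concern concrete, here is a minimal sketch in Python with in-memory SQLite handles. It is not MediaWiki's commitMasterChanges(); the `Db` class, the `fail_on_commit` switch and the `commit_all()` helper are hypothetical names used only to simulate a commit-time failure on one of two databases and to show why commit order matters.

```python
import sqlite3


class Db:
    """One database handle. fail_on_commit is a hypothetical switch that
    simulates a commit-time failure (e.g. an optimistic conflict)."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE log (msg TEXT)")
        self.conn.commit()

    def write(self, msg):
        # sqlite3 opens a transaction implicitly before the INSERT.
        self.conn.execute("INSERT INTO log VALUES (?)", (msg,))

    def commit(self):
        if self.fail_on_commit:
            self.conn.rollback()
            raise RuntimeError("commit failed on " + self.name)
        self.conn.commit()

    def rollback(self):
        self.conn.rollback()

    def rows(self):
        return [r[0] for r in self.conn.execute("SELECT msg FROM log")]


def commit_all(dbs):
    """Best-effort commit in list order. On the first failure, roll back the
    handles that have not committed yet; earlier commits cannot be undone."""
    committed = []
    for i, db in enumerate(dbs):
        try:
            db.commit()
        except RuntimeError:
            for rest in dbs[i + 1:]:
                rest.rollback()
            return committed, False
        committed.append(db.name)
    return committed, True


def run(risky_first):
    """Write the same row to both databases, then commit in the chosen order."""
    wiki = Db("wiki")
    central = Db("centralauth", fail_on_commit=True)
    for db in (wiki, central):
        db.write("rename user")
    order = [central, wiki] if risky_first else [wiki, central]
    committed, ok = commit_all(order)
    return committed, ok, wiki.rows(), central.rows()
```

With `run(False)` the wiki database keeps its row while centralauth loses it, which is the inconsistent (COMMIT, ROLLBACK) case. With `run(True)` the failure happens before anything is committed, so both databases end up unchanged. This only helps when the database most likely to fail is known in advance, which is an assumption of the sketch.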

@aaron Forget about that. Galera multimaster is not an option. A single master, that happens to have a "galera" slave, is (maybe).

In other words, slave1 is a master candidate from which, by using GTIDs or cloning the binary logs (binlog server), we can continue replication automatically. But the idea is for it to be a passive master. Hopefully that will give us automatic master failover.

Galera is only there to provide us more synchronization than regular replication so no transactions are lost, but I am not much of a fan of it given the large transactions that we have. So, maybe try it as active-passive, but I have already discarded it as a multi-master setup.

All other connections are regular slaves/semisync slaves. Maybe even the master candidate.
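
As an illustration of why GTIDs help when promoting the master candidate, the sketch below compares replication positions. It relies on the MariaDB GTID format `domain_id-server_id-seq_no`; the helper names, host names and positions are hypothetical, and it assumes a single replication domain (a real `gtid_slave_pos` can list several domains separated by commas).

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text):
    """Parse a single MariaDB GTID of the form 'domain-server-seq'."""
    domain, server, seq = text.strip().split("-")
    return Gtid(int(domain), int(server), int(seq))


def replicas_ahead_of(candidate_pos, replica_positions):
    """Return the replicas that have applied more transactions than the
    candidate in the candidate's domain. If the list is non-empty, promoting
    the candidate right now would drop transactions those replicas have."""
    cand = parse_gtid(candidate_pos)
    ahead = []
    for name, pos in replica_positions.items():
        gtid = parse_gtid(pos)
        if gtid.domain_id == cand.domain_id and gtid.seq_no > cand.seq_no:
            ahead.append(name)
    return sorted(ahead)
```

For example, `replicas_ahead_of("0-171-1000", {"slave2": "0-171-998", "slave3": "0-171-1002"})` reports `slave3`, meaning the candidate would need to catch up (or slave3 be resynced) before the switch. Stricter synchronization between the master and the candidate is what would keep this list empty.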

If the master is still passive, I assume that means other config changes are still needed to use it when the active one fails, so the title of this task seems a bit strongly worded :)

That's partly what confused me and made me think this involved Galera, though OTOH Galera isn't the only way to avoid a SPOF (the other option being some proxy and auto-switch logic).

Let's forget about this for now and only do the subtask, needed for other reasons.

@aaron, I also want to put an haproxy on every MediaWiki server, I just have not drawn it :-) Galera would give us the strictness to do things automatically, but no SPOF doesn't mean we want everything fully automatic (in which case Galera or anything else would not be needed)!

I focused here on the master failover by (maybe?) having multisource replication from a couple of masters.
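
A rough sketch of the per-application-server proxy idea from the diagram: each proxy routes from a configuration that is pushed to the whole fleet. This is not haproxy or etcd code; the `LocalProxy` class, the config layout and the fallback of sending reads to the master when no replica is pooled are all assumptions made for illustration.

```python
import itertools


class LocalProxy:
    """Routes connections using a fleet-wide config of the assumed form
    {"master": host, "replicas": {host: pooled_flag, ...}}."""

    def __init__(self, config):
        self.reload(config)

    def reload(self, config):
        # Called whenever the shared configuration changes.
        self.master = config["master"]
        pooled = [h for h, up in sorted(config["replicas"].items()) if up]
        self._cycle = itertools.cycle(pooled) if pooled else None

    def route(self, is_write):
        # Writes always go to the current master; reads rotate over pooled
        # replicas, falling back to the master if none is pooled (assumption).
        if is_write or self._cycle is None:
            return self.master
        return next(self._cycle)
```

Switching the master then becomes a configuration change: after `reload({"master": "slave1", "replicas": {"slave2": False, "slave3": True}})`, writes go to `slave1` and reads to `slave3`, with no change on the application side.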

• jcrespo updated the task description. Nov 25 2015, 7:35 PM

• jcrespo mentioned this in T121857: Implement a system to automatically deploy schema changes without needing DBA intervention. Dec 18 2015, 11:31 AM

JanZerebecki subscribed. Jan 6 2016, 10:36 PM

hoo subscribed. Jan 20 2016, 2:43 PM

• jcrespo mentioned this in T70062: Improve how Mediawiki handles a DB host that is flaky rather than completely down. Feb 4 2016, 8:58 PM

• jcrespo updated the task description.

greg subscribed. Feb 4 2016, 10:23 PM

• jcrespo mentioned this in T125215: Prepare db1018 and s2-slaves for s2 master failover. Feb 7 2016, 3:28 PM

• jcrespo closed subtask T119642: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) as Resolved. Mar 16 2016, 11:36 AM

• jcrespo added a parent task: T133337: Automate database datacenter switchover steps. Apr 22 2016, 1:33 PM

• jcrespo created subtask T133385: Implement GTID replication on MariaDB 10 servers. Apr 22 2016, 1:39 PM