Since the introduction of RESTBase, we have been processing asynchronous events (such as storing a new revision in RESTBase after an edit) by querying Parsoid and the MediaWiki API on the same cluster that serves the public, that is, the one that backs, among other things, all our anti-vandalism tools and VisualEditor.
This has caused several incidents, both minor and major, either by triggering obscure bugs or simply by overwhelming the cluster.
While perfect separation of concerns is impossible (after all, everything operates on the same database), a decent amount of it is absolutely mandatory:
a failure or overload of the async processing path should not impact the user experience.
The current plan includes the following steps:
- Enable TLS termination on all the MediaWiki clusters, so that cross-DC queries can be encrypted
- Create new cluster load balancers for the API HTTPS cluster
- Configure ChangePropagation to call RESTBase in the inactive datacenter
- Check that RESTBase in the inactive DC calls Parsoid in the same DC
- Make Parsoid call the MW API via HTTPS from the inactive DC
- Configure the load balancers so that the main API LB has all the appservers as backends, while the async-API LB includes only roughly 50% of them. The appservers that are not shared will then still be able to serve live traffic even if the shared 50% cannot because they are overloaded. This percentage can be adjusted in the future.
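
To make the last step concrete, here is a minimal sketch of the intended pool split. It is only an illustration and not the actual load balancer configuration: the function `split_pools`, the `async_fraction` parameter, and the `appserverNN.example` hostnames are all hypothetical, and picking the first half of the sorted host list is just one arbitrary way of choosing the shared backends.

```python
def split_pools(appservers, async_fraction=0.5):
    """Split appservers into the backends of the two load balancers.

    Returns (main_pool, async_pool, live_only):
      main_pool  - backends of the main API LB (every appserver)
      async_pool - backends of the async-API LB (a subset of main_pool)
      live_only  - appservers that never receive async traffic
    All names here are hypothetical and for illustration only.
    """
    if not 0.0 <= async_fraction <= 1.0:
        raise ValueError("async_fraction must be between 0 and 1")

    # The main API LB keeps all the appservers as backends.
    main_pool = sorted(set(appservers))

    # The async-API LB only gets a fraction of them (rounded down).
    async_count = int(len(main_pool) * async_fraction)
    async_pool = main_pool[:async_count]

    # The rest is reachable only through the main API LB, so an
    # overload caused by async processing cannot hit these hosts.
    live_only = main_pool[async_count:]
    return main_pool, async_pool, live_only


if __name__ == "__main__":
    # Hypothetical hostnames, not real appservers.
    hosts = [f"appserver{n:02d}.example" for n in range(1, 9)]
    main_pool, async_pool, live_only = split_pools(hosts)
    print("main API LB backends: ", len(main_pool))
    print("async-API LB backends:", len(async_pool))
    print("live-traffic only:    ", len(live_only))
```

Changing `async_fraction` corresponds to the later adjustment mentioned above: a lower value protects more appservers from async load, at the cost of less capacity for async processing.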