Type of activity: Pre-scheduled session
Main topic: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt
The problem
At the moment, every change in our operational configuration requires multiple commits in multiple repositories. Even the smallest maintenance on databases/memcached/redis needs a commit to mediawiki-config and a deploy of the code (and in some cases, several other pieces of software need the same change). We need a fast, reliable, consistent way of communicating changes in the state of the cluster, e.g. "a database server is offline for maintenance" or "which search cluster is active in datacenter X".
We already have a system that is used for our load balancers and edge systems; we should expand it to be used for mediawiki-config and most other services.
Expected outcome
Get buy-in and gather requirements from stakeholders; if there is time left, discuss a possible implementation route.
Current status of the discussion
There was some discussion on this topic at the TechOps offsite.
The current implementation idea is, broadly speaking:
- Have the information about lists of hosts (databases, etc.) used by MediaWiki (that we usually find in wmf-config/ProductionServices.php or in wmf-config/db-$site.php) managed in etcd via conftool/puppet
- Have confd running on all the relevant nodes and watching etcd (possibly polling at regular intervals rather than watching continuously, depending on etcd's performance). confd will write templated files on the host (their format is still to be decided)
- Either parse this output from wmf-config/CommonSettings.php where we currently include those files, or have a hook that stores that data into HHVM's APC (some caveats apply)
- For other services, by default we could output a JSON file that the service could parse (possibly without needing to restart the service itself)
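As an illustration of the last point, here is a minimal sketch of how a service could pick up a changed JSON file without restarting. It assumes the file is replaced atomically on each update (write to a temporary file, then rename); the class name, the helper, and the file contents are hypothetical, since neither the format nor the schema has been decided.

```python
import json
import os
import tempfile


class ReloadingConfig:
    """Re-read a JSON file only when it has been replaced or modified."""

    def __init__(self, path):
        self.path = path
        self._signature = None  # (inode, mtime, size) of the last file read
        self._data = {}

    def get(self):
        st = os.stat(self.path)
        signature = (st.st_ino, st.st_mtime_ns, st.st_size)
        if signature != self._signature:
            # The file changed since the last read: parse it again.
            with open(self.path, encoding="utf-8") as f:
                self._data = json.load(f)
            self._signature = signature
        return self._data


def write_atomically(path, data):
    """Hypothetical stand-in for the writer: temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f)
    # os.replace is atomic on POSIX, so readers never see a partial file.
    os.replace(tmp_path, path)
```

A service would call `get()` wherever it needs the host list; the cost when nothing has changed is a single `stat` call. Whether polling like this or an explicit reload signal is preferable is one of the points still open for discussion.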
Many things still need to be addressed: we have not defined a schema for discovery objects (e.g. "the URL of the MediaWiki API cluster I should connect to"), and we have no consensus on how to read such files or on what their format should be.
Links
- Interesting blog post by Stripe on how they implemented something similar using Consul: https://stripe.com/blog/service-discovery-at-stripe
- Task related to this one: T125069 "Create a service location / discovery system for locating local/master resources easily across all WMF applications"