
Integrating MediaWiki (and other services) with dynamic configuration
Closed, Resolved · Public

Authored By: Joe
Oct 31 2016, 7:35 PM

Description

Type of activity: Pre-scheduled session
Main topic: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt

The problem

At the moment every change in our operational configuration requires multiple commits in multiple repositories. Even the smallest maintenance on databases/memcached/redis needs a commit to mediawiki-config and a deploy of the code (and in some cases multiple services need the same change). We need a fast, reliable, consistent way of communicating changes in the state of the cluster, e.g. "a database server is offline for maintenance" or "the active search cluster in datacenter X has changed".

We already have such a system in use for our load balancers and edge systems; we should expand it to cover mediawiki-config and most other services.

Expected outcome

Getting buy-in and gathering requirements from stakeholders; if there is time left, discussing a possible implementation route.

Current status of the discussion

There was some discussion on this topic at the TechOps offsite.

The current implementation idea is, broadly speaking:

  • Manage the information about the lists of hosts (databases, etc.) used by MediaWiki (the lists we usually find in wmf-config/ProductionServices.php or wmf-config/db-$site.php) in etcd, via conftool/puppet.
  • Run confd on all the interested nodes, watching etcd (possibly at regular intervals instead of continuously, depending on etcd's performance). confd will write templated files on the host (whose format is still to be decided).
  • Either parse this output from wmf-config/CommonSettings.php, where we currently include those files, or have a hook that stores the data in HHVM's APC (some caveats apply); see the sketch after this list.
  • For other services, by default we could output a JSON file that can be parsed (possibly without needing to restart the service itself).
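As a rough illustration of the CommonSettings.php option above, here is a minimal sketch of consuming a confd-written JSON file, with a fallback to the static config; the file path, JSON keys, and fallback are assumptions, not a decided format:

```php
<?php
// Hypothetical: confd renders the live cluster state to this file.
$stateFile = '/etc/mediawiki/cluster-state.json';

$raw = @file_get_contents( $stateFile );
$state = ( $raw !== false ) ? json_decode( $raw, true ) : null;

if ( is_array( $state ) && isset( $state['sectionLoads'] ) ) {
	// Override the static DB server lists with the live state.
	$wgLBFactoryConf['sectionLoads'] = $state['sectionLoads'];
} else {
	// Fall back to the static lists shipped in mediawiki-config.
	require __DIR__ . '/ProductionServices.php';
}
```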

There are still a lot of things to be addressed: we haven't defined a schema for discovery objects (e.g. "the URL of the MediaWiki API cluster I should connect to"), nor do we have consensus on how such files should be read or what their format should be.

Links

  • Interesting blog post by Stripe on how they implemented something similar using Consul: https://stripe.com/blog/service-discovery-at-stripe
  • Related task: T125069 "Create a service location / discovery system for locating local/master resources easily across all WMF applications"

Event Timeline


The discussion in T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications is highly relevant here.

I still think that DNS has served us well so far, and provides a reliable and extremely simple solution for node services. While DNS caching *can* be broken in some services, I haven't really seen concrete information on whether this is the case for HHVM, and if so, whether an OS-level configuration change could fix it. Let's at least check these things before writing a custom DNS look-alike at the app level.

This is not limited to a discovery system, and it is focused on MediaWiki (as it's the most common offender in terms of having tons of references to cluster state in its configuration). Most of what MediaWiki needs (and potentially any other service needing not just one URL but, say, the list of servers in a pool) can't be served satisfactorily with DNS alone (I know you can with SRV records, but that's really suboptimal).

DNS can still be a good alternative whenever we don't need complex logic: it was cited in both the linked ticket and the linked article, and I wasn't discarding it. I just think it's only one part of the story, one that doesn't apply well to the MediaWiki use case, and that it's suboptimal whenever we want fast changes - we really don't want to set the DNS TTL too low.

In any case, whatever the means of receiving discovery information, a global mechanism should be able to account for relatively complex cases like the mediawiki-config one.

Most of what MediaWiki needs (and potentially any other service needing not just one URL but, say, the list of servers in a pool) can't be served satisfactorily with DNS alone (I know you can with SRV records, but that's really suboptimal).

There is a long tradition of using multiple A records for such lists. Most clients will round-robin between those addresses, but others, like Cassandra, use the entire list - in Cassandra's case, to initialize its list of contact points.

PHP seems to support retrieving all records quite well with dns_get_record() (http://php.net/manual/en/function.dns-get-record.php). That said, I haven't used this implementation in practice, so I can't vouch for its quality or performance. There is similar built-in support in Node.js. Python requires a separate package (dnspython): https://c0deman.wordpress.com/2014/06/17/find-nameservers-of-domain-name-python/

Edit: It seems that Python's built-in socket.gethostbyname_ex() can return all A records as well, so no external library is needed there either: http://stackoverflow.com/questions/3837744/how-to-resolve-dns-in-python
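For illustration, a minimal PHP sketch of retrieving all A records with dns_get_record(); the hostname is a made-up example:

```php
<?php
// Fetch every A record published for the name, not just the first one.
$records = dns_get_record( 'appservers.svc.example.org', DNS_A );

// Collect the resolved IPs; a client could round-robin over these or,
// like Cassandra, use the whole list as its initial contact points.
$ips = array_map( function ( $rec ) {
	return $rec['ip'];
}, $records );

print_r( $ips );
```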

I'm interested in this topic, particularly because I am developing a wiki farm and I wrote and released a configuration management extension for MediaWiki. For now it is only focused on MediaWiki configuration in the context of a farm (hierarchical configuration similar to InitialiseSettings.php, with multiversion management), and a next step would be exactly what is described in this task: integrating MediaWiki into its environment and managing inter-dependencies.

If we place MediaWiki at the core of the environment, there are "input dependencies" (e.g. MediaWiki depends on PHP) and "output dependencies" (e.g. Parsoid is called by MediaWiki/VisualEditor). The difficulty is to avoid re-developing what Puppet/Ansible/etc. already do (e.g. install PHP, MediaWiki, Parsoid, etc.) while at the same time giving some feedback to these tools (e.g. given that we want to install the VisualEditor feature on MediaWiki 1.28, we have to install Parsoid 0.6.1 and tell MediaWiki which host Parsoid is located on). Hence, somewhere, there should be a database with a map (tool:version <-> tool:version) and a map (service <-> configuration); this second map could depend on context parameters like the datacenter or the realm (prod/labs) (although in my mind, in the MediaWikiFarm extension, something like wmflabs would just be another farm derived from the prod farm, with some overloaded config parameters like URLs and database config).

+1 for the etcd/confd/JSON + APC approach for MediaWiki, at least starting with DB configuration.

What were the APC caveats?

Task description:
  • Either parse this output from wmf-config/CommonSettings.php, where we currently include those files, or have a hook that stores the data in HHVM's APC (some caveats apply)

+1 for the etcd/confd/JSON + APC approach for MediaWiki, at least starting with DB configuration.

What were the APC caveats?

So this means a separate background process, triggered by etcd/confd, that takes the data and stores it as an associative array in APC, which we then read at run-time in wmf-config PHP code.

The benefit would be that it can hot-swap the data (because it's re-read on every request) without any notification or on-disk file for HHVM (with or without RepoAuthoritative enabled). The downside is that APC is considered ephemeral, and storing this there seems a bit fragile.

Aside from potentially unpredictable evictions (which are actually rare in HHVM - a problem in itself, but one that helps us in this case), we'd have to very carefully orchestrate HHVM restarts.

We currently cache the expansion/extraction of SiteConfig from InitialiseSettings.php in a temporary file containing serialised PHP (source). This is invalidated and regenerated at run-time based on a newer filemtime. This could perhaps be moved to APC as a proof of concept.

If that works well, we can do the same with DB configs, e.g. have etcd/confd produce a JSON file that is parsed and cached in APC lazily by wmf-config. That way it falls back gracefully. We could of course also populate that APC key directly from the script that creates the JSON file (e.g. right before the JSON file is written to wmf-config), so that the lazy path is only a fallback.

The background process would write the JSON file. APC caching would be done by wmf-config PHP code, I assume, checking the mtime of the JSON.

The background process would write the JSON file. APC caching would be done by wmf-config PHP code, I assume, checking the mtime of the JSON.

Sounds good.
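Putting the above together, a minimal sketch of the lazy approach with apc_fetch()/apc_store() and mtime-based invalidation; the function and key names are hypothetical:

```php
<?php
// Parse the confd-written JSON once per file change; serve from APC
// otherwise. The mtime is stored alongside the data for invalidation.
function getClusterState( $path ) {
	$mtime = filemtime( $path );
	$key = 'cluster-state:' . $path;

	$cached = apc_fetch( $key );
	if ( is_array( $cached ) && $cached['mtime'] === $mtime ) {
		return $cached['data'];
	}

	$data = json_decode( file_get_contents( $path ), true );
	apc_store( $key, array( 'mtime' => $mtime, 'data' => $data ) );
	return $data;
}
```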

@aaron another possibility is to have the process call a special URL on HHVM to push the APC refresh.

In T149617#2832764, @Krinkle wrote:

We currently cache the expansion/extraction of SiteConfig from InitialiseSettings.php in a temporary file containing serialised PHP (source). This is invalidated and regenerated at run-time based on a newer filemtime. This could perhaps be moved to APC as a proof of concept.

If that works well, we can do the same with DB configs, e.g. have etcd/confd produce a JSON file that is parsed and cached in APC lazily by wmf-config. That way it falls back gracefully. We could of course also populate that APC key directly from the script that creates the JSON file (e.g. right before the JSON file is written to wmf-config), so that the lazy path is only a fallback.

This is roughly what is done in Extension:MediaWikiFarm: the config (originally in YAML, JSON, or PHP arrays) is cached in a PHP file (return array( 'config1' => 'value1', … );) so that its opcode can be cached directly by PHP (APC or Opcache).
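For illustration, a minimal sketch of that pattern: dump the compiled config as a PHP file returning an array, so later requests hit the opcode cache instead of re-parsing YAML/JSON (the path and settings are made-up examples):

```php
<?php
// Compile once: write the configuration out as plain PHP.
$config = array( 'wgSitename' => 'Example', 'wgLanguageCode' => 'en' );
file_put_contents(
	'/tmp/config-cache.php',
	'<?php return ' . var_export( $config, true ) . ';'
);

// On subsequent requests a plain include is enough; the parsed
// opcodes are cached by APC/Opcache.
$config = include '/tmp/config-cache.php';
```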

Note that Parsoid also loads a bunch of configuration from MediaWiki at startup and caches it. Some changes to MediaWiki configuration require a Parsoid restart.

It would be great to have a built-in/standard mechanism for propagating configuration changes - at the very least, a monotonic counter associated with the configuration, so third-party services like Parsoid could easily determine whether they need to reload siteinfo.

Note that Parsoid also loads a bunch of configuration from MediaWiki at startup and caches it. Some changes to MediaWiki configuration require a Parsoid restart.

It would be great to have a built-in/standard mechanism for propagating configuration changes - at the very least, a monotonic counter associated with the configuration, so third-party services like Parsoid could easily determine whether they need to reload siteinfo.

I am completely unaware of this; can you please elaborate?

Also, given that we have no way of signalling Parsoid that the MediaWiki config has changed, this looks like a very, very bad design decision; but I'm sure I misunderstood.
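For illustration, a minimal sketch of the monotonic-counter idea mentioned above, assuming a shared cache client with atomic increment (a BagOStuff-like object; all names are hypothetical):

```php
<?php
// Bumped once per configuration deploy, e.g. by the deploy script.
function bumpConfigGeneration( $cache ) {
	return $cache->incr( 'config-generation' );
}

// A consumer such as Parsoid (via the API) compares the counter it
// last saw against the current value to decide whether to reload siteinfo.
function configChangedSince( $cache, $lastSeen ) {
	return (int)$cache->get( 'config-generation' ) > $lastSeen;
}
```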

To the owner of this session: Here is the link to the session guidelines page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines. We encourage you to recruit note-takers (2 min, 3 max), a remote moderator, and an advocate (optional) on the spot before the beginning of your session. Instructions for each role are outlined in the guidelines. Physical versions of the role cards will be available in all the session rooms. Good luck prepping, see you at the summit! :)

Note-taker(s) of this session: Follow the instructions here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines#NOTE-TAKER.28S.29 After the session, DO NOT FORGET to copy the relevant notes and summary into a new wiki page following the template here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Your_Session and also link this from the All Session Notes page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/All_Session_Notes. The EtherPad links are also now linked from the Schedule page (https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Schedule) for you!

The summary of the session with all the relevant links (detailed notes, slides and recording) is available on MediaWiki: Integrating MediaWiki (and other services) with dynamic configuration.

Joe triaged this task as Medium priority. Jan 11 2017, 7:40 PM

Extracting from the session outcomes:

What we want to do:

  • Stop relying on the configuration repository to store the live state of clusters; that state needs to be treated separately.
  • We already have a tool to store such state reliably: conftool. We have no reason not to expand it to cover these cases too.
  • Use DNS for service location/discovery and for simple data structures, keeping in mind its limitations.
    • We want to integrate conftool with gdnsd, which we see as the most reliable and flexible option, rather than coredns/skydns, which can read from etcd, but only in their own somewhat limited format.
  • Use confd with templates and scripts to manage more complex data structures.
  • For the MediaWiki-specific case there are a bunch of options:
    • generate a JSON file that is read by the application, parsed by MediaWiki, and cached in APC (preferred);
    • generate a PHP file that is included directly by the application.
  • Add safety measures to ensure that the configuration is consistent across the cluster and no host has a stale copy.

Since this is at least partially relevant to the TechOps goal for the current quarter, I'll add subtasks for the actual implementation of the system.

Yes, I am just unsure how / to whom I can attribute the template design. That's what is blocking me at the moment.

Yes, I am just unsure how / to whom I can attribute the template design. That's what is blocking me at the moment.

While we wait for a response from Communications, I decided I'd rather ask forgiveness than wait for permission, and licensed everything under CC BY-SA 3.0:

https://commons.wikimedia.org/wiki/File:Dynamic_configuration_in_MediaWiki_and_other_applications.pdf

Change 339673 had a related patch set uploaded (by Giuseppe Lavagetto):
profile::conftool::client: add default schema

https://gerrit.wikimedia.org/r/339673

Change 339674 had a related patch set uploaded (by Giuseppe Lavagetto):
conftool-data: add first discovery objects

https://gerrit.wikimedia.org/r/339674

Mentioned in SAL (#wikimedia-operations) [2017-02-27T10:11:46Z] <_joe_> upgrading conftool to 0.4.0 across the cluster T149617

Change 339673 merged by Giuseppe Lavagetto:
profile::conftool::client: add default schema

https://gerrit.wikimedia.org/r/339673

Change 339674 merged by Giuseppe Lavagetto:
conftool-data: add first discovery objects

https://gerrit.wikimedia.org/r/339674

Change 340538 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet] profile::discovery::client: create confd-generate files for discovery

https://gerrit.wikimedia.org/r/340538

Change 340538 merged by Giuseppe Lavagetto:
[operations/puppet] profile::discovery::client: create confd-generate files for discovery

https://gerrit.wikimedia.org/r/340538

Change 340935 had a related patch set uploaded (by oblivian):
[operations/puppet] profile::discovery::client: expose services as well

https://gerrit.wikimedia.org/r/340935

So, I just found out that the DNS cache feature we were supposedly using in HHVM was removed some time ago, so while we have the ini setting in our setup, it's not having much effect.

We can therefore probably use DNS for service discovery instead of needing confd for that.

Confd will still be useful for other things, like lists of servers or databases.

Change 340935 merged by Giuseppe Lavagetto:
[operations/puppet] profile::discovery::client: expose services as well

https://gerrit.wikimedia.org/r/340935

Status as of now:

  • DNS-based discovery is live and functioning for most things; it's just pending a couple of merges in MediaWiki.
  • Etcd-based MediaWiki configuration is being worked on.

Change 411296 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Enable EtcdConfig on the debug hosts

https://gerrit.wikimedia.org/r/411296

Change 411296 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable EtcdConfig on the debug hosts

https://gerrit.wikimedia.org/r/411296