
Allow integration of data from etcd into the MediaWiki configuration
Closed, Resolved · Public

Description

Ideally, the process should be as follows:

  1. confd watches an etcd tree and creates one or multiple JSON files containing configuration information, and a version
  2. confd performs some kind of verification on the downloaded file. Maybe a call to a MediaWiki script can be used for this.
  3. It then calls a URL on the local machine that triggers loading the JSON files into APC
  4. On every request, mediawiki-config reads that data from APC or, if something is missing, by parsing the JSON files
  5. The JSON will have some version number that will be stored in APC and that can be retrieved and monitored from a special URL, so that we can verify every HHVM server is in sync.

I am sure there are a few details that I am missing, but the overall idea is to play it as safe as possible, while not sacrificing the performance of each request.
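The read path in steps 4-5 could be sketched roughly as follows. This is an illustrative Python sketch only; the class, key names, and JSON layout are all assumptions, not the actual mediawiki-config code:

```python
import json

# Hypothetical sketch of the per-request read path (steps 4-5 above):
# try "APC" first, fall back to parsing the JSON file written by confd,
# and keep the version alongside the data so a special URL can report it.

class ConfigReader:
    def __init__(self, json_path, apc):
        self.json_path = json_path
        self.apc = apc  # stand-in for APC (a shared in-process cache)

    def get(self, name):
        cached = self.apc.get('mwconfig')
        if cached is None:
            # Cache miss: parse the JSON file and store it, version
            # included, back into the cache.
            with open(self.json_path) as f:
                cached = json.load(f)
            self.apc['mwconfig'] = cached
        return cached['config'][name]

    def version(self):
        # Exposed so each HHVM server's loaded version can be monitored.
        cached = self.apc.get('mwconfig')
        return cached['version'] if cached else None
```

Subsequent requests hit only the in-process cache; the JSON file is parsed once per server until the cache is invalidated.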

Related Objects

Event Timeline


Change 349468 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] profile::conftool::master: make the git root dir a parameter

https://gerrit.wikimedia.org/r/349468

I managed to get a bare-bones working installation of conftool in deployment-prep.

From there, you should connect to https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379. SRV records are not available in labs, sadly.

I cherry-picked https://gerrit.wikimedia.org/r/#/c/347360 on the deployment-prep puppetmaster and now you have two variables defined under

/v2/keys/conftool/v1/mediawiki-config/common

See the current output:

$ curl https://deployment-etcd-01:2379/v2/keys/conftool/v1/mediawiki-config/common/?recursive=true | jq .

{
  "action": "get",
  "node": {
    "key": "/conftool/v1/mediawiki-config/common",
    "dir": true,
    "nodes": [
      {
        "key": "/conftool/v1/mediawiki-config/common/WMFReadOnly",
        "value": "{\"val\": null}",
        "modifiedIndex": 48,
        "createdIndex": 48
      },
      {
        "key": "/conftool/v1/mediawiki-config/common/WMFMasterDatacenter",
        "value": "{\"val\": \"eqiad\"}",
        "modifiedIndex": 53,
        "createdIndex": 49
      }
    ],
    "modifiedIndex": 48,
    "createdIndex": 48
  }
}
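For illustration, the response above could be flattened into a plain mapping of variable names to values. Note that each node's "value" field is itself a JSON document of the form `{"val": ...}`. This Python sketch assumes the etcd v2 response shape shown above:

```python
import json

# Flatten an etcd v2 ?recursive=true response into {name: value}.
# Each node's "value" is a JSON string wrapping the real value as
# {"val": ...}, so it needs a second json.loads().

def flatten_etcd_response(response):
    config = {}
    for node in response['node'].get('nodes', []):
        name = node['key'].rsplit('/', 1)[-1]  # e.g. "WMFMasterDatacenter"
        config[name] = json.loads(node['value'])['val']
    return config
```

Applied to the output above, this yields `WMFReadOnly = None` and `WMFMasterDatacenter = "eqiad"`.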

If you want to add objects:

  1. ssh into deployment-puppetmaster02.deployment-prep.eqiad.wmflabs
  2. Add the new object into /etc/conftool/data-local/mwconfig/common.yaml
  3. Run sudo -i conftool-merge to create the new object in etcd. By default, all objects have a value of null. You will have to edit it to change it to something else

If you want to modify a value of an object:

  1. Make a copy of /root/conftool/mwconfig-variable.yaml in your working directory and edit it to the value you want to set
  2. Run, for example, sudo -i confctl --quiet --object-type mwconfig select 'name=WMFMasterDatacenter' set/@/home/oblivian/mwconfig-variable.yaml, changing the variable name accordingly

To summarize, I think it is possible to test EtcdConfig in beta at this point with the limited deployment I made.

This is awesome! Thanks @Joe for getting this working on beta.

Change 349468 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::conftool::master: make the git root dir a parameter

https://gerrit.wikimedia.org/r/349468

Regarding the implementation of the MW configuration, in particular CR https://gerrit.wikimedia.org/r/#/c/347537 (current patchset is #8), I think that we should first agree on the failure model, because I've seen different comments and approaches.

  1. Should MW work seamlessly if there is a temporary outage on the etcd service?
    • I think it should and we should use the stale cached data if that happens, assuming that all cached configuration is still valid.
  2. Should MW be able to restart when there is a temporary outage on the etcd service?
    • On this one I'm on the fence. On one hand it would be nice; on the other, our load balancers right now cannot be restarted without etcd, meaning that etcd is fully part of our production infrastructure and a dependency on it upon restart might be acceptable. In any case, if we want to be able to restart MW without etcd, I think we should use a disk-based file cache from which to read the data, not default values in the committed configuration files. Default values will drift over time from the live state in etcd, and there is no reliable way to keep them in sync.
  3. Should MW configuration have default values on the configuration file in case etcd is not reachable?
    • I think it should not, for the reasons expressed above: a wrong committed default value is more dangerous than a failure or stale data from a disk cache.
  1. Should MW work seamlessly if there is a temporary outage on the etcd service?
    • I think it should and we should use the stale cached data if that happens, assuming that all cached configuration is still valid.

Agreed.

  1. Should MW be able to restart when there is a temporary outage on the etcd service?
    • On this one I'm on the fence. On one hand it would be nice; on the other, our load balancers right now cannot be restarted without etcd, meaning that etcd is fully part of our production infrastructure and a dependency on it upon restart might be acceptable. In any case, if we want to be able to restart MW without etcd, I think we should use a disk-based file cache from which to read the data, not default values in the committed configuration files. Default values will drift over time from the live state in etcd, and there is no reliable way to keep them in sync.

Previous discussion indicates we want to avoid a disk-based cache. I'm inclined to agree for the sake of simplicity and performance. There are multiple Etcd instances in each data center, and a cluster-wide restart will require Etcd either way. A manual restart of one app server should naturally cause it to be depooled (as far as I know), and if Etcd goes down at that time, it just means the app server can't be repooled until it successfully fetches data from Etcd. For that matter, we should also enforce that it not be repooled until it successfully re-scaps from the deployment host, but that's another matter.

  1. Should MW configuration have default values on the configuration file in case etcd is not reachable?
    • I think it should not, for the reasons expressed above: a wrong committed default value is more dangerous than a failure or stale data from a disk cache.

As long as the cache has no expiry and we use it in all failure scenarios, I agree we don't need default values. However, there is currently one failure scenario that does not result in use of the stale cache but should: if the cache is stale, we get the lock, and the fetch fails, we should use the cache, but currently we don't.

See FIXME at https://gerrit.wikimedia.org/r/#/c/350508/4/tests/phpunit/includes/config/EtcdConfigTest.php.
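The desired behavior described here — win the refresh lock, fail the fetch, and still fall back to the stale cache — could be sketched like this (illustrative Python, not EtcdConfig's actual implementation; the cache layout and names are assumptions):

```python
# Sketch of the stale-cache fallback: a thread that wins the refresh
# lock but fails to fetch from etcd should serve the stale cached data
# rather than error out. Threads that lose the lock also use stale data.

def load_config(cache, lock, fetch):
    entry = cache.get('config')
    if entry is not None and not entry['expired']:
        return entry['data']                    # fresh cache: fast path
    if lock.acquire(blocking=False):
        try:
            try:
                data = fetch()
            except Exception:
                if entry is not None:
                    return entry['data']        # fetch failed: use stale data
                raise
            cache['config'] = {'data': data, 'expired': False}
            return data
        finally:
            lock.release()
    # Lock busy: another thread is refreshing; use stale data if we have it.
    if entry is not None:
        return entry['data']
    raise RuntimeError('no cached config and could not refresh')
```

The key branch is the inner `except`: without it, a failed fetch by the lock holder discards perfectly usable stale data.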

Change 347537 merged by jenkins-bot:
[operations/mediawiki-config@master] Use EtcdConfig in beta cluster only

https://gerrit.wikimedia.org/r/347537

Did the following testing:

  • cherry-picked the proposed conftool-data to the beta cluster puppet master, deployed it, and set ReadOnly to false
  • merged the mediawiki-config change
  • Set ReadOnly to a message "etcd test", checked that it was reflected in the beta cluster website
  • Set WMFMasterDatacenter to "blah"; that it took effect could be seen from JobQueue errors due to the lack of a redis hostname
  • Checked the correct value for $wgReadOnly and $wmfMasterDatacenter using eval.php
  • Confirmed that the APC cache was operational, by sending 100 requests to Special:BlankPage with ab while watching for inbound connections on the etcd server using tcpdump. There were 4 connections in 12 seconds. Connections could be seen coming in pairs with ~10s spacing between pairs.

Change 351132 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Enable EtcdConfig in production

https://gerrit.wikimedia.org/r/351132

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

How could I ensure, after changing the value in Etcd during Switchdc, that all the MW hosts have converged to the new value?
I can run any command locally on the host and/or any remote call.

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

I'll add a siteinfo hook or something.

Did the following testing: [..]

  • Confirmed that the APC cache was operational, by sending 100 requests to Special:BlankPage with ab while watching for inbound connections on the etcd server using tcpdump. There were 4 connections in 12 seconds. Connections could be seen coming in pairs with ~10s spacing between pairs.

In local tests I noticed the cache is fragmented by wiki ID. That's not supposed to happen, right? It's explained by $this->srvCache->makeKey(), given that this is namespaced within the key space of the current wiki (like the legacy wfMemcKey). We should probably use makeGlobalKey() instead, and ideally also add something to the key that relates to "etcd" and the configured "host".

Right now it uses $this->srvCache->makeKey( 'variable', $this->directoryHash ); which seems too generic.
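For illustration, the fragmentation can be shown with hypothetical Python stand-ins for the two key helpers (not the real BagOStuff API): a per-wiki helper prefixes the wiki ID, so every wiki caches and refetches its own copy of identical etcd data, while a global helper produces one shared key.

```python
# Hypothetical illustrations of per-wiki vs global cache keys.
# With the per-wiki variant, enwiki and dewiki get distinct keys for
# the same etcd data and each triggers its own fetch; the global
# variant gives every wiki the same key, so the data is fetched once.

def make_key(wiki_id, *parts):
    # like makeKey(): namespaced within the current wiki's key space
    return ':'.join([wiki_id, *parts])

def make_global_key(*parts):
    # like makeGlobalKey(): shared across all wikis on the server
    return ':'.join(['global', *parts])
```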

Change 351204 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351204

Change 351217 had a related patch set uploaded (by Tim Starling; owner: Krinkle):
[mediawiki/core@wmf/1.29.0-wmf.21] [1.29.0-wmf.21] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351217

Change 351204 merged by jenkins-bot:
[mediawiki/core@master] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351204

Change 351232 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add $wmfMasterDatacenter to meta=siteinfo

https://gerrit.wikimedia.org/r/351232

Change 351217 merged by jenkins-bot:
[mediawiki/core@wmf/1.29.0-wmf.21] [1.29.0-wmf.21] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351217

Change 351232 merged by jenkins-bot:
[operations/mediawiki-config@master] Add $wmfMasterDatacenter to meta=siteinfo

https://gerrit.wikimedia.org/r/351232

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

I'll add a siteinfo hook or something.

Done. Try e.g.

curl -x mwdebug1001.eqiad.wmnet:80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general["wmf-config"].wmfMasterDatacenter'

Change 351539 had a related patch set uploaded (by Volans; owner: Volans):
[operations/mediawiki-config@master] wmf-config: readonly is set in etcd now

https://gerrit.wikimedia.org/r/351539

Change 351132 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable EtcdConfig in production

https://gerrit.wikimedia.org/r/351132

Change 351539 merged by jenkins-bot:
[operations/mediawiki-config@master] wmf-config: readonly is set in etcd now

https://gerrit.wikimedia.org/r/351539

For the record, these were reverted about half an hour later due to some bugs that were discovered. I'll let @tstarling write about what those were (something about the curl timeout not working as expected?).

We tried using it in production yesterday, and tested connection refused and connection timeout failure modes. The following problems were observed:

  • Threads were waiting for the server for minutes, not respecting the 10s timeout
  • 4 worker threads were all waiting for the server, the APC lock was not respected

I've isolated two concrete issues in the code and have a patch for them:

  • When using DNS SRV, a failed server was not removed from the server list, so the failover loop was infinite
  • The TTL on the APC lock was only 10s, so it expired while the thread was still in the failover loop

I'm still working on isolating a third issue which causes a 120s wait time when waiting for the APC lock in WaitConditionLoop.
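The first fix can be illustrated with a small sketch: remove a failed server from the candidate list so the failover loop terminates once every server has been tried, instead of spinning forever (illustrative Python, not the actual DnsSrvDiscoverer/EtcdConfig code):

```python
# Bounded failover: each server is tried at most once, so the loop
# runs at most len(servers) iterations. Without removing failed
# servers from the candidate list, a dead server could be retried
# forever — the infinite-loop bug described above.

def fetch_with_failover(servers, fetch_from):
    candidates = list(servers)   # don't mutate the caller's list
    last_error = None
    while candidates:
        server = candidates.pop(0)
        try:
            return fetch_from(server)
        except ConnectionError as e:
            last_error = e       # remember why, then try the next server
    raise last_error
```

With a bounded loop like this, a lock TTL merely needs to cover `len(servers) * timeout`, rather than an unbounded wait.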

@aaron @tstarling @Joe: here is a minimal list of failure scenarios that I think we should test before getting this into production:

  • Etcd not listening (iptables REJECT)
  • Host not responding (iptables DROP)
  • Host unreachable (iptables REJECT with icmp-host-unreachable)
  • No DNS response for the SRV record (NXDOMAIN)
  • DNS SRV record(s) returning an invalid name (NXDOMAIN)
  • Etcd slow to respond (reaching the configured timeout)
  • High packet loss between MediaWiki and Etcd (i.e. when the master is in the other DC and there is an issue with the cross-DC connection)

In all the above failure scenarios I'd expect that:

  • MediaWiki continues to work as if nothing happened, using stale configuration data from the cache, and no user request gets stuck or times out
  • at most one thread is trying to connect to Etcd at any given time, and it doesn't block any response to the users
  • warning/error logging is limited to at most one line per Etcd refresh retry (i.e. every ~10s), not one for every request

As a side note, mostly for testing purposes, I think we should consider adding another configuration variable that is read from Etcd: a dummy one that can be changed safely at any time, given that we will not be able to change either readonly or masterdatacenter.
Ideally it would be nice to have this variable exposed as well (for example in siteinfo), to be able to test how long it takes the fleet to converge to a new value after it is changed in Etcd.
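The convergence check suggested here could be as simple as comparing the value each host reports against the expected one. In this sketch, `fetch_value` is a placeholder for an HTTP call per host (e.g. the siteinfo query shown later in this task); host names are hypothetical:

```python
# Report which hosts still serve a value other than the expected one.
# fetch_value(host) stands in for querying that host's exposed config
# (e.g. via api.php?action=query&meta=siteinfo through each mw host).

def unconverged_hosts(hosts, expected, fetch_value):
    return [h for h in hosts if fetch_value(h) != expected]
```

Polling this until it returns an empty list gives the fleet's convergence time after an Etcd change.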

Based on the last multi-dc meeting, this should consist of both:
a) Live (possibly scripted) tests in deployment-prep to create these failures
b) Unit tests mocking the results of the dns record method in PHP used by the DnsSrvDiscoverer class

Change 367127 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Add tests for EtcdConfig::fetchAllFromEtcdServer

https://gerrit.wikimedia.org/r/367127

Change 367127 merged by jenkins-bot:
[mediawiki/core@master] config: Add tests for EtcdConfig::fetchAllFromEtcdServer

https://gerrit.wikimedia.org/r/367127

Change 367475 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Fix invalid EtcdConfig return value after JSON parse error

https://gerrit.wikimedia.org/r/367475

Change 367476 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Add more EtcdConfig::fetchAllFromEtcdServer tests

https://gerrit.wikimedia.org/r/367476

Change 367475 merged by jenkins-bot:
[mediawiki/core@master] config: Fix invalid EtcdConfig return value after JSON parse error

https://gerrit.wikimedia.org/r/367475

Change 367476 merged by jenkins-bot:
[mediawiki/core@master] config: Add more EtcdConfig::fetchAllFromEtcdServer tests

https://gerrit.wikimedia.org/r/367476

aaron removed aaron as the assignee of this task. Aug 23 2017, 7:13 PM

Change 375104 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] EtcdConfig: allow slashes in config key names

https://gerrit.wikimedia.org/r/375104

Change 375104 merged by jenkins-bot:
[mediawiki/core@master] EtcdConfig: allow slashes in config key names

https://gerrit.wikimedia.org/r/375104

Change 375108 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Re-enable EtcdConfig in beta cluster

https://gerrit.wikimedia.org/r/375108

Change 375108 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable EtcdConfig in beta cluster

https://gerrit.wikimedia.org/r/375108

Per testing on deployment-prep:

@aaron @tstarling @Joe: here is a minimal list of failure scenarios that I think we should test before getting this into production:

  • Etcd not listening (iptables REJECT)

Web request works normally. Maintenance script fails with an exception after 3 seconds.

  • Host not responding (iptables DROP)

Web request works normally, with additional delay of ~2s. Maintenance script fails with an exception after 20s.

  • Host unreachable (iptables REJECT with icmp-host-unreachable)

Same as a normal reject.

  • No DNS response for the SRV record (NXDOMAIN)
  • DNS SRV record(s) returns an invalid name (NXDOMAIN)

Local testing, by modifying DNS responses in the client code, indicates that this will work as expected. If DNS is changed to something invalid on an existing working cluster, web requests will continue to work using the stale cache entry. Maintenance scripts will throw an exception on startup.

  • Etcd slow to respond (reaches the configured timeout)

This is untested, but the request timeout appears to be correctly configured, so this should be identical to iptables DROP. EtcdConfig does not distinguish between different curl error codes.

  • High packet loss between MediaWiki and Etcd (i.e. when the master is in the other DC and there is an issue in the cross-DC connection)

This is the same as "etcd slow to respond".

  • at most one thread is trying to connect to Etcd at any given time, and it doesn't block any response to the users

The lock which serializes all remote access was previously tested.

  • warning/error logging is limited to at most one line per Etcd refresh retry (i.e. every ~10s), not one for every request

There is INFO level logging without a lock being held, but warning/error level logging is only done after remote access.

@tstarling Thanks for the testing and the recap, in general it looks mostly good to me.

  • Host not responding (iptables DROP)

Web request works normally, with additional delay of ~2s. Maintenance script fails with an exception after 20s.

Is it correct to say that the additional delay of 2s is just on the thread that is trying to connect to Etcd, and all other requests are not affected and will use the stale data?

Regarding maintenance scripts in general, do we run any of them often enough, and/or are any of them critical enough, that caching the Etcd data or using confd instead might be worthwhile?

Joe claimed this task.

Change 449742 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Remove $::mw_primary variable from puppet

https://gerrit.wikimedia.org/r/449742

Change 449742 abandoned by Jcrespo:
Remove $::mw_primary variable from puppet

Reason:
Done in 457491 and related

https://gerrit.wikimedia.org/r/449742

Change 345346 abandoned by Jcrespo:
Remove $::mw_primary variable from puppet

Reason:
Done in 457491 and related

https://gerrit.wikimedia.org/r/345346