
Allow integration of data from etcd into the MediaWiki configuration
Closed, Resolved · Public

Description

Ideally, the process should be as follows:

  1. confd watches an etcd tree and creates one or multiple JSON files containing configuration information, and a version
  2. confd performs some kind of verification on the downloaded file. Maybe a call to a MediaWiki script can be used for this.
  3. It then calls a URL on the local machine that triggers loading the JSON files into APC
  4. On every request, mediawiki-config reads that data from APC or, if something is missing, by parsing the JSON files
  5. The JSON will have some version number that will be stored in APC and that can be retrieved and monitored from a special URL, so that we can verify every HHVM server is in sync.

I am sure there are a few details that I am missing, but the overall idea is to play it as safe as possible, while not sacrificing the performance of each request.
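The read path in steps 4-5 could be sketched roughly as follows. This is an illustrative Python sketch only; the class, key names, and JSON layout are all assumptions, not the actual mediawiki-config code:

```python
import json

# Hypothetical sketch of the per-request read path (steps 4-5 above):
# try "APC" first, fall back to parsing the JSON file written by confd,
# and keep the version alongside the data so a special URL can report it.

class ConfigReader:
    def __init__(self, json_path, apc):
        self.json_path = json_path
        self.apc = apc  # stand-in for APC (a shared in-process cache)

    def get(self, name):
        cached = self.apc.get('mwconfig')
        if cached is None:
            # Cache miss: parse the JSON file and store it, version
            # included, back into the cache.
            with open(self.json_path) as f:
                cached = json.load(f)
            self.apc['mwconfig'] = cached
        return cached['config'][name]

    def version(self):
        # Exposed so each HHVM server's loaded version can be monitored.
        cached = self.apc.get('mwconfig')
        return cached['version'] if cached else None
```

Subsequent requests hit only the in-process cache; the JSON file is parsed once per server until the cache is invalidated.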

Related Objects

Event Timeline


Change 349468 had a related patch set uploaded (by Giuseppe Lavagetto):
[operations/puppet@production] profile::conftool::master: make the git root dir a parameter

https://gerrit.wikimedia.org/r/349468

I managed to get a bare-bones working installation of conftool in deployment-prep.

From there, you should connect to https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379. SRV records are not available in labs, sadly.

I cherry-picked https://gerrit.wikimedia.org/r/#/c/347360 on the deployment-prep puppetmaster and now you have two variables defined under

/v2/keys/conftool/v1/mediawiki-config/common

See the current output:

$ curl https://deployment-etcd-01:2379/v2/keys/conftool/v1/mediawiki-config/common/?recursive=true | jq .

{
  "action": "get",
  "node": {
    "key": "/conftool/v1/mediawiki-config/common",
    "dir": true,
    "nodes": [
      {
        "key": "/conftool/v1/mediawiki-config/common/WMFReadOnly",
        "value": "{\"val\": null}",
        "modifiedIndex": 48,
        "createdIndex": 48
      },
      {
        "key": "/conftool/v1/mediawiki-config/common/WMFMasterDatacenter",
        "value": "{\"val\": \"eqiad\"}",
        "modifiedIndex": 53,
        "createdIndex": 49
      }
    ],
    "modifiedIndex": 48,
    "createdIndex": 48
  }
}
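For illustration, the response above could be flattened into a plain mapping of variable names to values. Note that each node's "value" field is itself a JSON document of the form `{"val": ...}`. This Python sketch assumes the etcd v2 response shape shown above:

```python
import json

# Flatten an etcd v2 ?recursive=true response into {name: value}.
# Each node's "value" is a JSON string wrapping the real value as
# {"val": ...}, so it needs a second json.loads().

def flatten_etcd_response(response):
    config = {}
    for node in response['node'].get('nodes', []):
        name = node['key'].rsplit('/', 1)[-1]  # e.g. "WMFMasterDatacenter"
        config[name] = json.loads(node['value'])['val']
    return config
```

Applied to the output above, this yields `WMFReadOnly = None` and `WMFMasterDatacenter = "eqiad"`.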

If you want to add objects:

  1. ssh into deployment-puppetmaster02.deployment-prep.eqiad.wmflabs
  2. Add the new object into /etc/conftool/data-local/mwconfig/common.yaml
  3. Run sudo -i conftool-merge to create the new object in etcd. By default, all objects have a value of null. You will have to edit it to change it to something else

If you want to modify a value of an object:

  1. Make a copy of /root/conftool/mwconfig-variable.yaml in your working directory and edit it to the value you want to set
  2. Run, for example, sudo -i confctl --quiet --object-type mwconfig select 'name=WMFMasterDatacenter' set/@/home/oblivian/mwconfig-variable.yaml, changing the variable name accordingly

To summarize, I think it is possible to test EtcdConfig in beta at this point with the limited deployment I made.

This is awesome! Thanks @Joe for getting this working on beta.

Change 349468 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::conftool::master: make the git root dir a parameter

https://gerrit.wikimedia.org/r/349468

Regarding the implementation of the MW configuration, in particular CR https://gerrit.wikimedia.org/r/#/c/347537 (current patchset is #8), I think that we should first agree on the failure model, because I've seen different comments and approaches.

  1. Should MW work seamlessly if there is a temporary outage on the etcd service?
    • I think it should and we should use the stale cached data if that happens, assuming that all cached configuration is still valid.
  2. Should MW be able to restart when there is a temporary outage on the etcd service?
    • On this one I'm on the fence. On one hand it would be nice; on the other, our load balancers right now cannot be restarted without etcd, meaning that etcd is fully part of our production infrastructure and a dependency on it upon restart might be acceptable. In any case, if we want to be able to restart MW without etcd, I think we should use a disk-based file cache from which to read the data, not default values in the committed configuration files. Default values will drift over time from the live state in etcd, and there is no reliable way to keep them in sync.
  3. Should MW configuration have default values on the configuration file in case etcd is not reachable?
    • I think it should not, for the reasons expressed above: a wrong committed default value is more dangerous than a failure or stale data from a disk cache.
  1. Should MW work seamlessly if there is a temporary outage on the etcd service?
    • I think it should and we should use the stale cached data if that happens, assuming that all cached configuration is still valid.

Agreed.

  1. Should MW be able to restart when there is a temporary outage on the etcd service?
    • On this one I'm on the fence. On one hand it would be nice; on the other, our load balancers right now cannot be restarted without etcd, meaning that etcd is fully part of our production infrastructure and a dependency on it upon restart might be acceptable. In any case, if we want to be able to restart MW without etcd, I think we should use a disk-based file cache from which to read the data, not default values in the committed configuration files. Default values will drift over time from the live state in etcd, and there is no reliable way to keep them in sync.

Previous discussion indicates we want to avoid a disk-based cache. I'm inclined to agree for the sake of simplicity and performance. There are multiple Etcd instances in each data center, and a cluster-wide restart will require Etcd either way. A manual restart of one app server should naturally cause it to be depooled (as far as I know), and if Etcd goes down at that time, it just means the app server can't be repooled until it successfully fetches data from Etcd. For that matter, we should also enforce that it not be repooled until it successfully re-scaps from the deployment host, but that's another matter.

  1. Should MW configuration have default values on the configuration file in case etcd is not reachable?
    • I think it should not, for the reasons expressed above: a wrong committed default value is more dangerous than a failure or stale data from a disk cache.

As long as the cache has no expiry and we use it in all failure scenarios, I agree we don't need default values. However, there is currently one failure scenario that does not result in use of the stale cache but should: if the cache is stale, we get the lock, and the fetch fails, we should use the cache, but currently we don't.

See FIXME at https://gerrit.wikimedia.org/r/#/c/350508/4/tests/phpunit/includes/config/EtcdConfigTest.php.
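The desired behavior described here — win the refresh lock, fail the fetch, and still fall back to the stale cache — could be sketched like this (illustrative Python, not EtcdConfig's actual implementation; the cache layout and names are assumptions):

```python
# Sketch of the stale-cache fallback: a thread that wins the refresh
# lock but fails to fetch from etcd should serve the stale cached data
# rather than error out. Threads that lose the lock also use stale data.

def load_config(cache, lock, fetch):
    entry = cache.get('config')
    if entry is not None and not entry['expired']:
        return entry['data']                    # fresh cache: fast path
    if lock.acquire(blocking=False):
        try:
            try:
                data = fetch()
            except Exception:
                if entry is not None:
                    return entry['data']        # fetch failed: use stale data
                raise
            cache['config'] = {'data': data, 'expired': False}
            return data
        finally:
            lock.release()
    # Lock busy: another thread is refreshing; use stale data if we have it.
    if entry is not None:
        return entry['data']
    raise RuntimeError('no cached config and could not refresh')
```

The key branch is the inner `except`: without it, a failed fetch by the lock holder discards perfectly usable stale data.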

Change 347537 merged by jenkins-bot:
[operations/mediawiki-config@master] Use EtcdConfig in beta cluster only

https://gerrit.wikimedia.org/r/347537

Did the following testing:

  • cherry-picked the proposed conftool-data to the beta cluster puppet master, deployed it, and set ReadOnly to false
  • merged the mediawiki-config change
  • Set ReadOnly to a message "etcd test", checked that it was reflected in the beta cluster website
  • Set WMFMasterDatacenter to "blah"; that it took effect could be seen from JobQueue errors due to the lack of a redis hostname
  • Checked the correct value for $wgReadOnly and $wmfMasterDatacenter using eval.php
  • Confirmed that the APC cache was operational, by sending 100 requests to Special:BlankPage with ab while watching for inbound connections on the etcd server using tcpdump. There were 4 connections in 12 seconds. Connections could be seen coming in pairs with ~10s spacing between pairs.

Change 351132 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Enable EtcdConfig in production

https://gerrit.wikimedia.org/r/351132

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

How could I ensure, after changing the value in Etcd during Switchdc, that all the MW hosts have converged to the new value?
I can run any command locally on the host and/or any remote call.

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

I'll add a siteinfo hook or something.

Did the following testing: [..]

  • Confirmed that the APC cache was operational, by sending 100 requests to Special:BlankPage with ab while watching for inbound connections on the etcd server using tcpdump. There were 4 connections in 12 seconds. Connections could be seen coming in pairs with ~10s spacing between pairs.

In local tests I noticed the cache is fragmented by wiki ID. That's not supposed to happen, right? It's explained by $this->srvCache->makeKey(), given that this is namespaced within the key space of the current wiki (like the legacy wfMemcKey). We should probably use makeGlobalKey() instead, and ideally also add something to the key that relates to "etcd" and the configured "host".

Right now it uses $this->srvCache->makeKey( 'variable', $this->directoryHash ); which seems too generic.
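For illustration, the fragmentation can be shown with hypothetical Python stand-ins for the two key helpers (not the real BagOStuff API): a per-wiki helper prefixes the wiki ID, so every wiki caches and refetches its own copy of identical etcd data, while a global helper produces one shared key.

```python
# Hypothetical illustrations of per-wiki vs global cache keys.
# With the per-wiki variant, enwiki and dewiki get distinct keys for
# the same etcd data and each triggers its own fetch; the global
# variant gives every wiki the same key, so the data is fetched once.

def make_key(wiki_id, *parts):
    # like makeKey(): namespaced within the current wiki's key space
    return ':'.join([wiki_id, *parts])

def make_global_key(*parts):
    # like makeGlobalKey(): shared across all wikis on the server
    return ':'.join(['global', *parts])
```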

Change 351204 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351204

Change 351217 had a related patch set uploaded (by Tim Starling; owner: Krinkle):
[mediawiki/core@wmf/1.29.0-wmf.21] [1.29.0-wmf.21] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351217

Change 351204 merged by jenkins-bot:
[mediawiki/core@master] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351204

Change 351232 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Add $wmfMasterDatacenter to meta=siteinfo

https://gerrit.wikimedia.org/r/351232

Change 351217 merged by jenkins-bot:
[mediawiki/core@wmf/1.29.0-wmf.21] [1.29.0-wmf.21] config: Use less generic cache key, and not fragmented by wiki

https://gerrit.wikimedia.org/r/351217

Change 351232 merged by jenkins-bot:
[operations/mediawiki-config@master] Add $wmfMasterDatacenter to meta=siteinfo

https://gerrit.wikimedia.org/r/351232

Is there an easy way I could check which version and/or value of an Etcd-driven MW-config variable is actually loaded/cached by the running application?

I'll add a siteinfo hook or something.

Done. Try e.g.

curl -x mwdebug1001.eqiad.wmnet:80 -H'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json&formatversion=2' | jq -r '.query.general["wmf-config"].wmfMasterDatacenter'

Change 351539 had a related patch set uploaded (by Volans; owner: Volans):
[operations/mediawiki-config@master] wmf-config: readonly is set in etcd now

https://gerrit.wikimedia.org/r/351539

Change 351132 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable EtcdConfig in production

https://gerrit.wikimedia.org/r/351132

Change 351539 merged by jenkins-bot:
[operations/mediawiki-config@master] wmf-config: readonly is set in etcd now

https://gerrit.wikimedia.org/r/351539

For the record, these were reverted about half an hour later due to some bugs that were discovered. I'll let @tstarling write about what those were (something about the curl timeout not working as expected?).

We tried using it in production yesterday, and tested connection refused and connection timeout failure modes. The following problems were observed:

  • Threads were waiting for the server for minutes, not respecting the 10s timeout
  • 4 worker threads were all waiting for the server, the APC lock was not respected

I've isolated two concrete issues in the code and have a patch for them:

  • When using DNS SRV, a failed server was not removed from the server list, so the failover loop was infinite
  • The TTL on the APC lock was only 10s, so it expired while the thread was still in the failover loop

I'm still working on isolating a third issue which causes a 120s wait time when waiting for the APC lock in WaitConditionLoop.
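The first fix can be illustrated with a small sketch: remove a failed server from the candidate list so the failover loop terminates once every server has been tried, instead of spinning forever (illustrative Python, not the actual DnsSrvDiscoverer/EtcdConfig code):

```python
# Bounded failover: each server is tried at most once, so the loop
# runs at most len(servers) iterations. Without removing failed
# servers from the candidate list, a dead server could be retried
# forever — the infinite-loop bug described above.

def fetch_with_failover(servers, fetch_from):
    candidates = list(servers)   # don't mutate the caller's list
    last_error = None
    while candidates:
        server = candidates.pop(0)
        try:
            return fetch_from(server)
        except ConnectionError as e:
            last_error = e       # remember why, then try the next server
    raise last_error
```

With a bounded loop like this, a lock TTL merely needs to cover `len(servers) * timeout`, rather than an unbounded wait.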

@aaron @tstarling @Joe: here is a minimal list of failure scenarios that I think we should test before getting this into production:

  • Etcd not listening (iptables REJECT)
  • Host not responding (iptables DROP)
  • Host unreachable (iptables REJECT with icmp-host-unreachable)
  • No DNS response for the SRV record (NXDOMAIN)
  • DNS SRV record(s) returning an invalid name (NXDOMAIN)
  • Etcd slow to respond (reaching the configured timeout)
  • High packet loss between MediaWiki and Etcd (i.e. when the master is in the other DC and there is an issue with the cross-DC connection)

In all the above failure scenarios I'd expect that:

  • MediaWiki continues to work as if nothing happened, using stale configuration data from the cache, and no user request gets stuck or times out
  • at most one thread is trying to connect to Etcd at any given time, and it doesn't block any response to the users
  • warning/error logging is limited to at most one line per Etcd refresh retry (i.e. every ~10s), not one for every request

As a side note, mostly for testing purposes, I think we should consider adding another configuration variable that is read from Etcd: a dummy one that can be changed safely at any time, given that we will not be able to change either readonly or masterdatacenter.
Ideally it would be nice to have this variable exposed as well (for example in siteinfo), to be able to test how long it takes the fleet to converge to a new value after it is changed in Etcd.
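The convergence check suggested here could be as simple as comparing the value each host reports against the expected one. In this sketch, `fetch_value` is a placeholder for an HTTP call per host (e.g. the siteinfo query shown later in this task); host names are hypothetical:

```python
# Report which hosts still serve a value other than the expected one.
# fetch_value(host) stands in for querying that host's exposed config
# (e.g. via api.php?action=query&meta=siteinfo through each mw host).

def unconverged_hosts(hosts, expected, fetch_value):
    return [h for h in hosts if fetch_value(h) != expected]
```

Polling this until it returns an empty list gives the fleet's convergence time after an Etcd change.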

Based on the last multi-dc meeting, this should consist of both:
a) Live (possibly scripted) tests in deployment-prep to create these failures
b) Unit tests mocking the results of the dns record method in PHP used by the DnsSrvDiscoverer class

Change 367127 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Add tests for EtcdConfig::fetchAllFromEtcdServer

https://gerrit.wikimedia.org/r/367127

Change 367127 merged by jenkins-bot:
[mediawiki/core@master] config: Add tests for EtcdConfig::fetchAllFromEtcdServer

https://gerrit.wikimedia.org/r/367127

Change 367475 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Fix invalid EtcdConfig return value after JSON parse error

https://gerrit.wikimedia.org/r/367475

Change 367476 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] config: Add more EtcdConfig::fetchAllFromEtcdServer tests

https://gerrit.wikimedia.org/r/367476

Change 367475 merged by jenkins-bot:
[mediawiki/core@master] config: Fix invalid EtcdConfig return value after JSON parse error

https://gerrit.wikimedia.org/r/367475

Change 367476 merged by jenkins-bot:
[mediawiki/core@master] config: Add more EtcdConfig::fetchAllFromEtcdServer tests

https://gerrit.wikimedia.org/r/367476

aaron removed aaron as the assignee of this task. Aug 23 2017, 7:13 PM

Change 375104 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] EtcdConfig: allow slashes in config key names

https://gerrit.wikimedia.org/r/375104

Change 375104 merged by jenkins-bot:
[mediawiki/core@master] EtcdConfig: allow slashes in config key names

https://gerrit.wikimedia.org/r/375104

Change 375108 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Re-enable EtcdConfig in beta cluster

https://gerrit.wikimedia.org/r/375108

Change 375108 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable EtcdConfig in beta cluster

https://gerrit.wikimedia.org/r/375108

Per testing on deployment-prep:

@aaron @tstarling @Joe: here is a minimal list of failure scenarios that I think we should test before getting this into production:

  • Etcd not listening (iptables REJECT)

Web request works normally. Maintenance script fails with an exception after 3 seconds.

  • Host not responding (iptables DROP)

Web request works normally, with additional delay of ~2s. Maintenance script fails with an exception after 20s.

  • Host unreachable (iptables REJECT with icmp-host-unreachable)

Same as a normal reject.

  • No DNS response for the SRV record (NXDOMAIN)
  • DNS SRV record(s) returns an invalid name (NXDOMAIN)

Local testing, by modifying DNS responses in the client code, indicates that this will work as expected. If DNS is changed to something invalid on an existing working cluster, web requests will continue to work using the stale cache entry. Maintenance scripts will throw an exception on startup.

  • Etcd slow to respond (reaches the configured timeout)

This is untested, but the request timeout appears to be correctly configured, so this should be identical to iptables DROP. EtcdConfig does not distinguish between different curl error codes.

  • High packet loss between MediaWiki and Etcd (i.e. when the master is in the other DC and there is an issue in the cross-DC connection)

This is the same as "etcd slow to respond".

  • at most one thread is trying to connect to Etcd at any given time, and it doesn't block any response to the users

The lock which serializes all remote access was previously tested.

  • warning/error logging is limited to at most one line per Etcd refresh retry (i.e. every ~10s), not one for every request

There is INFO level logging without a lock being held, but warning/error level logging is only done after remote access.

@tstarling Thanks for the testing and the recap, in general it looks mostly good to me.

  • Host not responding (iptables DROP)

Web request works normally, with additional delay of ~2s. Maintenance script fails with an exception after 20s.

Is it correct to say that the additional delay of 2s is just on the thread that is trying to connect to Etcd, and all other requests are not affected and will use the stale data?

Regarding maintenance scripts in general, do we run any of them often enough, and/or are any of them critical enough, that caching the Etcd data or using confd instead might be worthwhile?

Joe claimed this task.

Change 449742 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Remove $::mw_primary variable from puppet

https://gerrit.wikimedia.org/r/449742

Change 449742 abandoned by Jcrespo:
Remove $::mw_primary variable from puppet

Reason:
Done in 457491 and related

https://gerrit.wikimedia.org/r/449742

Change 345346 abandoned by Jcrespo:
Remove $::mw_primary variable from puppet

Reason:
Done in 457491 and related

https://gerrit.wikimedia.org/r/345346