Motivation
Currently, configuration updates are applied immediately. This makes it impossible to do any sanity checks before deploying, e.g. using the canary feature we have for code deployments. Canary has worked perfectly to avoid downtime due to bad code deployments, but on 2020-07-29 we had 6 minutes of downtime due to a mistake in a configuration update.
Proposal
Use the normal deployment process for configuration updates as well.
Process comparison
Currently, to do a configuration update:
- Submit a patch to translatewiki repository
- Have the patch merged
- Log in to web2
- Run twn-update-config (change is now deployed)
- (not enforced) test in production
- (not enforced) monitor logs
If configuration updates went through the normal deployment process:
- Submit a patch to translatewiki repository
- Have the patch merged
- Log in to web2
- Run twn-update-config
- (not enforced) test in canary
- (not enforced) check logs
- cd /srv/mediawiki
- b oregano tag
- b oregano deploy (change is now deployed)
- (not enforced) monitor logs
We could add a single command that does steps 7-9 to make this a bit easier.
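A minimal sketch of what such a wrapper for steps 7-9 could look like. The function name twn_deploy_config and the MW_DIR override are hypothetical, and the sketch assumes that b is available as a regular command (not a shell alias) for the user running it. It stops at the first failing step, so a failed tag never leads to a deploy.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): tag and deploy in one step.
# MW_DIR defaults to the path used in the steps above; override it for testing.
twn_deploy_config() {
    mw_dir="${MW_DIR:-/srv/mediawiki}"
    (
        # Run in a subshell so the caller's working directory is untouched.
        cd "$mw_dir" || exit 1
        # Abort if tagging fails; never deploy an untagged state.
        b oregano tag || exit 1
        b oregano deploy
    )
}
```

After testing on canary and checking logs, running twn_deploy_config would then replace the three manual commands. The canary check itself stays a manual, unenforced step.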
Pros and cons
Current process | New process |
Simple and fast | Will not cause downtime if checked on canary first |
Risky, can cause downtime | Same process, no surprises |
Different, surprising process compared to code deployments | Additional steps |
 | Requires learning how to use canary, and it cannot be enforced |
 | May cause a "split-brain" scenario as caches are shared (this already happens for code, but all such changes (database schemas, message keys) should be made with this in mind) |
List of data that is not part of deployments but is used during PHP web requests
- /resources/caches/translatewiki.net/messagechanges.*
- /resources/caches/translatewiki.net/translate_messageindex.cdb
- /resources/caches/translatewiki.net/translate_groupcache-*
- /www/translatewiki.net/logs/
- /home/betawiki/config/groups/
- /home/betawiki/config/groups/MediaWiki/MediaWikiTopMessageGroup.php
- /home/betawiki/config/webfiles/ (via symlink from workdir)
List of configuration files that are not part of deployments but are used during PHP web requests
- /home/betawiki/config/DevelopmentSettings.php (not in production)
- /home/betawiki/config/ExtensionSettings.php
- /home/betawiki/config/FallbackSettings.php
- /home/betawiki/config/nikext.php
- /home/betawiki/config/nikext.i18n.magic.php
- /home/betawiki/config/PermissionSettings.php
- /home/betawiki/config/SpecialRally.php
- /home/betawiki/config/TranslateSettings.php
- /home/betawiki/config/TranslatewikiSettings.php
- /home/betawiki/config/groups/validation-exclusion-list.php
The scope of this task is PHP configuration files.
Plan
- Empty current workdir/config directory (it's unused)
- Block workdir/config directory in nginx config
- Move all PHP configuration files under [translatewiki-repo]/mw-config for clarity and grouping. Keep existing files as redirects via symlinks
- Update twn-update-config (and twn-update-all??) to rsync [translatewiki-repo]/mw-config to workdir/config
- Update references in configuration files to read from workdir/config
- Remove symlinks
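The rsync step in the plan could be sketched roughly as follows. This is an illustration only: the function name sync_mw_config and its two arguments (repository checkout and workdir) are hypothetical, and using --delete is an assumption, chosen so that files removed from mw-config also disappear from workdir/config.

```shell
#!/bin/sh
# Hypothetical sketch of the sync step inside twn-update-config.
# $1: path to the translatewiki repository checkout
# $2: path to the workdir that gets deployed
sync_mw_config() {
    repo="$1"
    workdir="$2"
    # Refuse to run if the source directory is missing, so that --delete
    # cannot wipe the target because of a typo in the path.
    [ -d "$repo/mw-config" ] || { echo "missing $repo/mw-config" >&2; return 1; }
    mkdir -p "$workdir/config" || return 1
    # Trailing slashes: copy the contents of mw-config into config.
    # --delete (an assumption) removes files that no longer exist in the repo.
    rsync -a --delete "$repo/mw-config/" "$workdir/config/"
}
```

Because the first step of the plan empties workdir/config, nothing else should live there, which is what makes mirroring with --delete reasonable in this sketch.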
Other cleanups to do separately:
- Move nikext, Special:Rally and webfiles to a separate mini-extension