
Reimagine channel configuration (re)loading to avoid need for git pull
Closed, Resolved | Public | Feature

Description

There is some benefit to reading the current config from NFS rather than requiring a restart to apply it, primarily that some messages might otherwise get lost during the restart. But there is also value in reducing or removing the NFS dependency.

One option could be to store the config in Redis? That gets rid of NFS but keeps the live-update logic.

Yeah, we could look at either Redis or ToolsDB for storing the channel mapping config. I also like that we have a "free" audit log for those config changes in the git repo, though. I think we can design an update process such that git is the canonical location but a web hook can trigger updating the running config without requiring a bot restart. Pushing an updated structure into Redis could be one way to do that.

A ConfigMap would be the most Kubernetes-native way to deal with the channel mappings, but right now Toolforge jobs & webservices don't have any built-in magic for using ConfigMaps, so that would mean using Kubernetes directly to manage our Deployments. This is certainly possible (I run several direct Kubernetes tools), but it makes things a bit more involved for long-term maintenance.

Details

Title: Reimagine channel configuration (re)loading to avoid need for git pull
Reference: toolforge-repos/wikibugs2!23
Author: bd808
Source Branch: work/bd808/channels-channels-everywhere
Dest Branch: main

Event Timeline

bd808 triaged this task as High priority.
bd808 moved this task from Need discussion to Doing on the Wikibugs board.

This is the last remaining blocker to switching to a Toolforge Build Service container for the tool. We want to preserve the ability to update the channel mappings for both Phorge and Gerrit without restarting those tasks. We also want to preserve tracking of who changed the config and when.

Today runtime config updates work by having a post-merge job call a web hook running in the tool's namespace. That web hook executes git pull in the $HOME/wikibugs2 git repo. This puts the latest config on the $HOME NFS share. The wikibugs2 irc and wikibugs2 gerrit tasks both use the wikibugs2.channelfilter.ChannelFilter.update() method to check for updated file mtime on their YAML config files and reload configuration from disk when the mtime has changed since the last (re)load.
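The mtime check described above can be sketched roughly as follows. This is a minimal illustration of the general pattern, not the actual wikibugs2.channelfilter.ChannelFilter.update() code; the MtimeReloader class and its attribute names are hypothetical, and it keeps the raw file text rather than parsing YAML so that it needs only the standard library.

```python
from pathlib import Path


class MtimeReloader:
    """Reload a file's contents only when its mtime has changed.

    Hypothetical helper for illustration; not part of wikibugs2.
    """

    def __init__(self, path):
        self.path = Path(path)
        self.mtime = None  # mtime (in ns) seen at the last (re)load
        self.text = None   # raw file contents from the last (re)load

    def update(self):
        """Return True if the file was (re)loaded, False if unchanged."""
        mtime = self.path.stat().st_mtime_ns
        if mtime == self.mtime:
            return False
        # A real implementation would parse the YAML here.
        self.text = self.path.read_text(encoding="utf-8")
        self.mtime = mtime
        return True
```

A long-running task would call update() before handling each event, so a git pull that rewrites the file is picked up without a restart. The pattern depends on the task and the git checkout sharing a filesystem, which is the NFS dependency this task wants to remove.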

I've had several different ideas about how to replace the git + mtime check mechanism. Loading data from Redis in the tasks, with some external service responsible for updating the Redis data, is likely the simplest change using the existing tech stack for the tool. A slight twist on that solution would be for the external service to also act as the configuration provider to the jobs instead of using Redis to pass state. This would avoid adding new Redis dependencies, which seems prudent in light of T360596: Figure out a plan to move forward with regarding Redis License changes.

I'm leaning towards a design that would introduce a new python webservice to the tool. Initially the webservice would be responsible for providing API endpoints for:

  • Fetching the stored Gerrit channel configuration (used by the Gerrit task)
  • Fetching the stored Phorge channel configuration (used by the IRC task)
  • Handling the web hook callback from GitLab to trigger an update of the stored configuration

The webservice could load the configuration from the GitLab repo via HTTP, both as a bootstrap when initially starting and in response to a call to the web hook.

To avoid the complexity of a separate API endpoint for checking whether the configuration has changed, the webservice should support ETag + If-None-Match or, more simply, a query string parameter so the client can request the config only if it has changed from a prior known state. The server should respond with a 304 Not Modified status and no body when the provided config identity is unchanged.
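One way the conditional response could work is sketched below, independent of any web framework. The function names config_etag and respond are hypothetical, and deriving the identity from a SHA-256 hash of the canonical JSON form is an assumption made for illustration; the merged change may compute the identity differently. The sketch only handles a single exact entity tag, not the `*` wildcard, comma-separated lists, or weak validators that If-None-Match also allows.

```python
import hashlib
import json


def config_etag(config):
    """Derive a stable identity for a config from its canonical JSON form."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return '"' + digest + '"'  # entity tags are quoted strings


def respond(config, if_none_match=None):
    """Return (status, headers, body) for a conditional GET of the config."""
    etag = config_etag(config)
    headers = {"ETag": etag}
    if if_none_match is not None and if_none_match == etag:
        # The client already holds this exact config: no body needed.
        return 304, headers, b""
    headers["Content-Type"] = "application/json"
    body = json.dumps(config, sort_keys=True).encode("utf-8")
    return 200, headers, body
```

A client would remember the ETag from its last 200 response and send it back on the next poll; a 304 means the config it holds is still current, so a single endpoint serves both "has it changed?" and "give me the new config".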

wikibugs2.channelfilter.ChannelFilter should support both loading config directly from disk for ease of testing and development as well as the config server solution. The direct loading configuration does not need to support automatic reloading.

This is "fun". asyncssh/connection.py raises an exception if getpass.getuser() fails. The workaround I applied for the moment was toolforge envvars create LOGNAME suchabot. It would be nicer to take care of this inside the app directly. The value really doesn't matter as we are not using ~/.ssh/ files and we are already setting an explicit username for the ssh connection.
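Handling this inside the app could look something like the sketch below. It relies on getpass.getuser() consulting the LOGNAME, USER, LNAME and USERNAME environment variables before falling back to the password database. The ensure_username helper and its "wikibugs" default are hypothetical placeholders, and this is not necessarily how the tool ended up solving it.

```python
import getpass
import os


def ensure_username(fallback="wikibugs"):
    """Make getpass.getuser() succeed even when the UID has no passwd entry.

    Hypothetical helper; the fallback value is an arbitrary placeholder.
    """
    try:
        return getpass.getuser()
    except (KeyError, OSError, ImportError):
        # No usable environment variable and no passwd entry for this UID,
        # as can happen in a container. Set LOGNAME so later calls succeed.
        os.environ["LOGNAME"] = fallback
        return getpass.getuser()
```

Calling ensure_username() once at startup, before the ssh connection is opened, would make the later lookup inside asyncssh succeed without needing an environment variable set through the platform.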

bd808 merged https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/merge_requests/23

Reimagine channel configuration (re)loading to avoid need for git pull

Mentioned in SAL (#wikimedia-cloud) [2024-03-31T23:20:47Z] <wmbot~bd808@tools-bastion-12> Updated all to f31b4bd and buildservice for everything (T360860)