
Maintenance environment needed for running one-off commands
Open, High, Public

Description

MediaWiki has the mwmaint servers available as locations to run maintenance scripts, manual jobs, database queries, and interact with wikis via a REPL.

These same sorts of tasks can be done from within a Toolhub container, but that container needs to be running in an environment where the Django app can connect to the production database, elasticsearch cluster, and memcached servers. Docker is not currently available on mwmaint boxes to run the container, and the permissions for the 'toolhub' user in the Kubernetes clusters do not grant pod/exec[create delete] rights to attach to a running container.

Non-exhaustive list of things that would be easier with CLI access to a running Toolhub container:

  • Database schema maintenance. Django includes a fairly nice system for managing forward and backward schema migrations, including the ability to add migration steps that operate at an object graph level for complex data fills. Running DB_USER=toolhub_admin DB_PASSWORD=... poetry run python3 ./manage.py migrate from a container that can connect to m5-master would make db migrations much easier than having Django export SQL for manual application of the DDL changes (see the sketch after this list).
  • Bootstrapping the database with seed data using Django fixture files
  • Bootstrapping advanced user rights with Django's REPL
  • Rebuilding the elasticsearch index when switching clusters
  • Debugging issues occurring in production which cannot be replicated in a dev environment (connectivity, services only used in production, dataset related issues)
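
For illustration, a rough sketch of what the first three items might look like from a shell inside a container with production connectivity (the credential placeholder and fixture name below are hypothetical, not real values):

# Placeholder values; real credentials would come from the deployment config.
export DB_USER=toolhub_admin
export DB_PASSWORD='<admin password>'

# Schema migrations against m5-master
poetry run python3 ./manage.py migrate

# Seed data from a Django fixture file (hypothetical fixture name)
poetry run python3 ./manage.py loaddata <fixture-name>

# Django REPL for one-off tasks such as granting advanced user rights
poetry run python3 ./manage.py shell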

From an end user perspective, being able to kubectl exec -it toolhub-main-57dcd5447f-rsqk7 -- /bin/bash from a deploy server would be an ideal solution. It would also be reasonable to require some form of sudo for this rather than giving the default credentials provided by kube_env toolhub staging this level of access.

Being able to launch a container in interactive mode on a mwmaint server or equivalent and not in a Kubernetes cluster directly would also be workable, but may have complications in configuration to reach service dependencies.

A nearly worst case solution would be to write action plans and hand them to SREs with container access to implement.

A worst case solution would be to require that all one-off commands actually be baked into Helm charts and deployed.

Event Timeline

bd808 added a subscriber: Legoktm.

@Legoktm has kindly offered to bring up this desire in a future serviceops team meeting.

bd808 triaged this task as High priority. Sep 3 2021, 9:26 PM
bd808 moved this task from Backlog to Research needed on the Toolhub board.

Tangentially related: T290360

I understand the need for this. From past experience with Heroku, the one-off dynos functionality was pretty useful both for debugging and for running maintenance commands.

That being said, we haven't yet implemented this functionality and we'd like to see how best to implement it. kubectl exec is a valid solution, but once one ends up in one of our containers and tries to actually debug something, it turns into a pain. As I point out in T290360#7339450, basic tools for debugging are just not there, as we try to keep our containers slim for more efficient use of the infrastructure (the issues we are currently facing with the large MediaWiki images are a testament to that) as well as for faster deployments. @dancy had to create an image with proper debugging tools especially for MediaWiki because of that. The other 4 bullet points, though, justify the use case, and we should find a way to implement some one-off command functionality.

That being said, we haven't yet implemented this functionality and we'd like to see how best to implement it. kubectl exec is a valid solution, but once one ends up in one of our containers and tries to actually debug something, it turns into a pain.

I'm pretty used to the limited tooling that is available inside a container from many years of work inside pods running on the Toolforge Kubernetes cluster. Sometimes there is no way to proceed without some additions to the containers, but a lot of things can actually be taken care of with a Python3 REPL and a bit of patience.

For deep debugging, using nsenter to become a privileged user inside the container can be valuable. That is, however, the sort of thing that I would typically expect to be limited to those who already have root rights on the exec nodes, to protect against container breakout escalations.
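
For example, a rough sketch assuming root on a Docker-based worker node (the container ID below is a placeholder):

# Find the host PID of the container's main process, then join its namespaces.
PID=$(sudo docker inspect --format '{{.State.Pid}}' <container-id>)
sudo nsenter --target "$PID" --mount --uts --ipc --net --pid -- /bin/bash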

Hi @bd808, sorry for taking so long to answer; this turned out to be a weird issue and discussions took a bit longer than usual.

We've had a brainstorming session with @Joe and @JMeybohm regarding this (after some time spent wondering how best to solve it). We also tried to approach this with a mind for the future, as we anticipate we will have similar needs for MediaWiki in k8s, especially the mwmaint situation. The issue is somewhat thorny, so we are going to split it by the use cases outlined above and attack them one by one.

We did set out some guiding rules though:

  • We are trying to avoid having to allocate a PTY for most use cases.
  • We will support interactivity by allowing commands to be run and their output read.

For the migration and fixture-loading use cases, which are arguably semi-automatic processes requiring zero input from the deployer, we have the following suggestion:

We think that overall the best approach is a helm pre-install hook. The pattern is well understood and used throughout the industry. One just adds a Job resource to the helm chart like so:

apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ .Release.Name }}"
  annotations:
    # This is what defines this resource as a hook
    "helm.sh/hook": pre-install
...

I am skipping the rest for brevity; it's very similar to defining a Deployment resource. The key concepts are that we use a Job [1] with a specific command for the container, and that we tie it to helm so it happens before the upgrade takes place. The one caveat is that if the schema migration/loading of fixtures takes too long (600s in our case) the upgrade will fail (a simple fix is to bump to a larger timeout, which is easily done in helmfile.yaml). I think we won't be having that issue with Toolhub for some time though.

Logs will make it to logstash and allow for inspecting the process and errors. Unless I am mistaken they will also be present in kubectl logs.
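
Roughly, inspecting such a hook Job from a deploy server would look something like this (the release name here is just an example, not an agreed-upon name):

kube_env toolhub staging
kubectl get jobs
kubectl logs job/toolhub-main --follow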

Of course, migrations should be written with the above in mind, i.e. that the migration will happen before the new code is running, so the migrations should be written to be backwards compatible. Most of the time this just boils down to: "Don't add NOT NULL columns, don't rename fields in one go".

We might be wrong here, but we also think that this approach (depending on how the app is written, a post-install hook could be used instead) would also serve the "Rebuilding the elasticsearch index when switching clusters" use case.

We aren't particularly interested in solving the use case of debugging in production via the container at this point, as we (SRE) feel we can handle the load of helping developers with cases where they have production connectivity issues and logging does not help them figure it out. That being said, we are keeping an eye on upstream changes that could support this use case better than we currently can (see https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/). We can always re-evaluate later, of course.

That leaves the use cases of interactively using the Django REPL (or any other interactive tool requiring a PTY), which we are still discussing whether and how to support.

[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/

  • We are trying to avoid having to allocate a PTY for most use cases.

I would be interested in learning more about this constraint and its benefits for stability and support of the Kubernetes cluster. The majority of my Kubernetes experience comes from working in the Toolforge cluster, where PTY attachment to running pods is integral to the workflow of managing a tool (webservice shell).

  • We are trying to avoid having to allocate a PTY for most use cases.

I would be interested in learning more about this constraint and its benefits for stability and support of the Kubernetes cluster. The majority of my Kubernetes experience comes from working in the Toolforge cluster, where PTY attachment to running pods is integral to the workflow of managing a tool (webservice shell).

Oh, apologies, I guess I should have explained why we set that requirement.

In our experience up to now with running services in Kubernetes in production (we are currently at 33 instances of various microservices with more in the pipeline), we've very seldom needed to have a shell in a container. In many cases the platform fixes workload issues on its own, via a number of mechanisms, namely:

  • Depooling workloads until they are ready to receive traffic again
  • Killing workloads that fail their health checks
  • Setting strict limits to memory and killing workloads that grossly violate them
  • Setting strict CPU limits, protecting the infrastructure (although the caveat is that the apps might be starved of CPU if badly configured)
  • Memory/Disk/PID pressure thresholds on the nodes (the defaults actually, we haven't messed with any of that) avoid many of the most common scenarios that require shell access to resolve, like "Disk Full" or "Out Of Memory due to a leak/too many processes"

Also, applications get restarted with every deploy, resetting them back to a "known good state".

All of the above events are viewable via logstash (eventrouter transports the kubernetes events to it) and grafana. I'd argue that those environments are more usable than obtaining shells in limited environments and trying to debug the issue at hand. So, the chances of actually needing a shell are trimmed down to a very small number of cases. Even then, when debugging needs to happen on a specific instance, because we have determined that a specific instance (or set of instances) is problematic, we've arrived at a situation where we will need tooling that is not going to be present in our containers (which, by the way, even lack procps), like gdb, strace, ltrace, tcpdump and friends. At that point, we can run those tools on the node (one does not even need nsenter for some of those; others require root access anyway), move the output into a paste and work with that. But I must stress that I have very few recollections of actually needing to do so.

Finally, we run multiple instances of most workloads, requiring that we first have some knowledge of which instance exhibits an issue before we dive into it. That happens via logstash/grafana to begin with, and they increase the likelihood we'll reach the cause sooner rather than later.

I hope the above covers the main/services cluster experience part; let me now add another 2 points.

First (and the least important one actually) is the immutability part. We're striving towards making our containers as immutable as possible. While there are a number of ways of going about this and we have already implemented some in the pipeline, we aren't there yet. Having a generic PTY/shell solution would open up a way to mutate running containers in a way we would much rather avoid.

Secondly (and more importantly), there is the security aspect of it. We currently have pretty strict RBAC rules for all deployers, allowing only read access to the deployer accounts themselves. The deployments happen via Tiller, a (unfortunately for us?) helm component that sits in every namespace and has the rights to perform the various write actions like updating Deployments/Daemonsets etc. While Tiller could be abused to perform such actions by a malicious attacker who manages to obtain access to parts of our infrastructure, it's a narrower attack vector than generically allowing running a shell in any container, buying us (in the unfortunate scenario of a compromise) time and some auditing capabilities so we can react in the meantime. With the migration to helm3 (see T251305) this is changing; we have some work to do there after the migration is done to make sure we are where we want to be. Adding pod/exec RBAC rights to accounts would open an attack vector that we'd rather not have.

That being said, we aren't oblivious to the need to have some form of interactivity, so we are still investigating what we can do to mitigate the above concerns. The ephemeral containers feature added in 1.22 is an interesting way we might solve this problem, perhaps with kubectl attach allowed (but NOT kubectl exec; and yes, I know it's a PTY allocation, but it's a much stricter one than exec). Or some form of a wrapper that returns the output of commands without actually allocating a PTY and allowing full interactivity.

I'll be keeping this task updated with our findings and proposed solutions.

Oh, apologies, I guess I should have explained why we set that requirement.

Thank you, this additional context is helpful.

In our experience up to now with running services in Kubernetes in production (we are currently at 33 instances of various microservices with more in the pipeline), we've very seldom needed to have a shell in a container.

Toolhub is breaking new ground in being a full stack application and not a microservice, which could be part of the impedance mismatch between my hopes for the k8s service and the current reality.

That being said, we aren't oblivious to the need to have some form of interactivity, so we are still investigating what we can do to mitigate the above concerns. The ephemeral containers feature added in 1.22 is an interesting way we might solve this problem, perhaps with kubectl attach allowed (but NOT kubectl exec; and yes, I know it's a PTY allocation, but it's a much stricter one than exec). Or some form of a wrapper that returns the output of commands without actually allocating a PTY and allowing full interactivity.

My understanding of kubectl attach is that it allows taking over stdin/stdout of the main process in an existing container. I don't think I have ever used it directly in a Kubernetes cluster, but I have used it indirectly via kubectl run, which deep under the hood basically becomes a kubectl create followed by a kubectl attach if the -i and -t flags are passed to run.
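
For example, something roughly like this (the pod name here is arbitrary, and the image is just the published Toolhub image used later in this task):

# `kubectl run -i -t` creates the pod and then attaches to its main process:
kubectl run toolhub-debug -i -t --rm --restart=Never \
    --image=docker-registry.wikimedia.org/wikimedia/wikimedia-toolhub:latest \
    -- /bin/bash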

Ephemeral containers sound like they might help with the immutable container image concern and with anything that gets all the way to needing a bare container. They would likely also be valuable for attaching other debugging tools that are kept out of the base images.

I have a temporary solution that I'm going to document here, but I would really, really like something less gross. My hack is running the published Docker container locally with config pointing the database and elasticsearch connections at ssh-managed port forwards on my local machine. The script includes production secrets on my laptop, but I will omit them from what is published here. It also contains things related to the ssh tunnel creation that are specific to my personal split ssh agent setup, which keeps production ssh keys in a dedicated ssh agent. Don't try this at home. I am only publishing such things to encourage others to help find less disgusting solutions.

The most interesting bit of this script that may not be obvious on an initial read is the use of host.docker.internal as a hostname in the DB_HOST and ES_HOSTS settings. This is a magic hostname that will work on Docker Desktop for MacOS and Windows and can be made to work on Docker for Linux to connect from the container to the host machine.
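
On Docker Engine for Linux the name is not defined by default, but recent versions (20.10+) can map it to the host with an extra flag on docker run:

# Linux only: expose the host under the same magic hostname inside the container.
docker run --add-host=host.docker.internal:host-gateway ...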

prod-maintenance.sh
#!/usr/bin/env bash
# Run the latest docker container connected to prod services
set -Eeuo pipefail

ENV_FILE=.prod-maintenance.env
LOCAL_MYSQL_PORT=13306
LOCAL_ES_PORT=19200

# Setup an env file to pass a lot of values to `docker run`
rm -f ${ENV_FILE}
cat << EOF > ${ENV_FILE}
DJANGO_SECRET_KEY=some0random1string2that3I4hope5doesnt6matter7here8really
DJANGO_SETTINGS_MODULE=toolhub.settings
DJANGO_DEBUG=false
DJANGO_ALLOWED_HOSTS=*
LOGGING_HANDLERS=console
LOGGING_LEVEL=INFO
DB_ENGINE=django.db.backends.mysql
DB_NAME=toolhub
DB_USER=toolhub_admin
DB_PASSWORD=THE_PROD_ADMIN_PASSWORD_GOES_HERE
DB_HOST=host.docker.internal
DB_PORT=${LOCAL_MYSQL_PORT}
ES_HOSTS=host.docker.internal:${LOCAL_ES_PORT}
WIKIMEDIA_OAUTH2_URL=https://meta.wikimedia.org/w/rest.php
WIKIMEDIA_OAUTH2_KEY=bf2b4acc4c6c7b5a34a2ef68ec071b79
# Don't panic! This secret value is not secret--it is the local dev value that is published elsewhere
WIKIMEDIA_OAUTH2_SECRET=d6fa3dc12e064166a9b36ddf8ae37ceb61161e69
EOF
trap "rm $ENV_FILE" EXIT

# Setup ssh tunnels to prod database and elasticsearch
PROD_KEY=${HOME}/.ssh/cluster_rsa
SSH_AUTH_SOCK=${HOME}/.ssh/sockets/prod_auth_sock
ps ax | grep ssh-agent | grep -q $SSH_AUTH_SOCK ||
    ssh-agent -a $SSH_AUTH_SOCK
ssh-add -l | grep -q $PROD_KEY ||
  ssh-add -t 3600 $PROD_KEY
ssh -o ExitOnForwardFailure=yes -N -f \
    -L ${LOCAL_MYSQL_PORT}:m5-master.eqiad.wmnet:3306 \
    -L ${LOCAL_ES_PORT}:search.svc.eqiad.wmnet:9243 \
    deploy1002.eqiad.wmnet

# Run the prod container and drop the user at a bash prompt
docker run -it --rm \
    --name="toolhub-maintenance" \
    --env-file ${ENV_FILE} \
    --entrypoint /bin/bash \
    docker-registry.wikimedia.org/wikimedia/wikimedia-toolhub:latest

# Clean up the ssh tunnels
kill $(ps ax | grep ssh | grep m5-master.eqiad.wmnet | awk '{print $1}')

Bug found in temp solution: the Elasticsearch connection needs to use TLS and also skip verifying the remote cert. See https://github.com/django-es/django-elasticsearch-dsl/pull/114/files for how that could be made possible with some additional Django config settings.

Change 751809 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

Change 751809 merged by jenkins-bot:

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

That leaves the use cases of interactively using the Django REPL (or any other interactive tool requiring a PTY), which we are still discussing whether and how to support.

@akosiaris has there been any more discussion of this use case in your team? One of the findings from T303889: Toolhub broken in prod by memcached client library change was that "run maintenance actions from Bryan's laptop" (T290357#7578866) was slow and hindered our recovery from the outage.

One workable solution for Toolhub could be the ability to run a container outside of the Kubernetes cluster, but inside of the production network. This would remove the RTT issues with a container running remotely with ssh tunnels for connectivity to storage systems.

That leaves the use cases of interactively using the Django REPL (or any other interactive tool requiring a PTY), which we are still discussing whether and how to support.

@akosiaris has there been any more discussion of this use case in your team? One of the findings from T303889: Toolhub broken in prod by memcached client library change was that "run maintenance actions from Bryan's laptop" (T290357#7578866) was slow and hindered our recovery from the outage.

Unfortunately not. Constant re-prioritization of work in the last 4-5 months hasn't allowed for looking into this more. I can't say when we will be able to look into this further. Point taken regarding the workaround being suboptimal.

One workable solution for Toolhub could be the ability to run a container outside of the Kubernetes cluster, but inside of the production network. This would remove the RTT issues with a container running remotely with ssh tunnels for connectivity to storage systems.

Hm, given we might not be able to revisit this properly, as a workaround we might be able to have a container continuously running on deploy1002 and provide you with a specific shell script that allows you to exec into it as an unprivileged user and run the commands you want. Let me take this up with the team.
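
Purely as a strawman, such a wrapper might be little more than the following (the container name and the unprivileged user below are placeholders, not an agreed design):

#!/usr/bin/env bash
# Hypothetical wrapper on a deploy server: attach to a long-running
# maintenance container as an unprivileged user.
exec docker exec --interactive --tty --user nobody toolhub-maintenance /bin/bash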

There is now support for a REPL for MediaWiki on Kubernetes from T341197: Allow deployers to get a php REPL environment inside the mw-debug pods. This might be seen as similar to @akosiaris's idea from T290357#7851407 about a continuous container with a special script to access it.