Enable rolling restart for all MW servers (tracking)
Open, Medium, Public

Description

Rationale and value

We deploy code updates and configuration changes multiple times a day. When a deployment happens and a server running that software is not restarted, we encounter numerous functional and operational problems:

  • Corruption of compilation cache. This poses a significant security threat, and has caused hundreds of time-consuming incidents and error investigations in recent years.
  • Non-atomic deployments. This harms developer productivity, reduces confidence in our staging process, increases complexity of deployment processes, and reduces overall throughput.
  • Stale code and business logic or configuration due to continued handling of production traffic. This creates operational problems, as supposedly inactive/deconfigured hosts for backend services continue to receive traffic for many hours or even days after the configuration change has been deployed.

We've successfully solved the above in T266055 for our web-facing clusters (appservers, and api_appservers) which serve all our public wiki traffic (index.php, api.php, load.php, etc.). This task is to accomplish the same for the remaining clusters:

  • jobrunners (jobs over HTTP)
  • videoscalers (jobs over HTTP, dedicated for TMH transcoding jobs).
  • maintenance host (CLI maintenance scripts).
  • snapshot hosts (CLI maintenance scripts, dedicated for XML dump generation).
Background

Corruption: This refers to corruption of the Opcache module. This has a set amount of memory for storing compiled and optimised source code. When deploying new source code or configuration files, the FPM instance has to compile the new source code but also keep the old source code in memory. Eventually this will run out of space, which, if it happens under live traffic (instead of a restart) often leads to corruption of the compiled source code.

This corruption tends to manifest as subtle off-by-one errors such as a configuration value being flipped by one byte, and poses a significant security threat. E.g. if a configuration value OpenForAll defaults to true for public wikis but we set OpenForAll = false for a private wiki or blocked user group, this can get corrupted to something like PpenForAll = false (O > P) and thus quietly no longer have any effect. There are also more obvious errors such as classes or constants becoming undefined due to being defined under the wrong name, etc.
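The silent failure described above can be illustrated with a minimal Python sketch. It assumes a simple key/value settings lookup with defaults; the `get_setting` and `flip_first_byte` helpers and both dictionaries are hypothetical and are not MediaWiki code. It only shows why a one-byte change in a key name quietly restores the default.

```python
# Hypothetical settings lookup; not MediaWiki's actual configuration code.
DEFAULTS = {"OpenForAll": True}


def get_setting(overrides: dict, name: str):
    """Return the per-wiki override if present, else the default."""
    return overrides.get(name, DEFAULTS[name])


def flip_first_byte(key: str) -> str:
    """Simulate a one-byte corruption: bump the first character by one."""
    return chr(ord(key[0]) + 1) + key[1:]


# Intended override for a private wiki.
private_wiki = {"OpenForAll": False}

# The same override after the key is corrupted (O becomes P).
corrupted = {flip_first_byte(k): v for k, v in private_wiki.items()}

intended_value = get_setting(private_wiki, "OpenForAll")
corrupted_value = get_setting(corrupted, "OpenForAll")
print(intended_value, corrupted_value, list(corrupted))
```

No error is raised in the corrupted case: the misspelled key is simply never looked up, so the permissive default applies.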

For examples and background on the dozens upon dozens of these errors we've had since 2019, refer to T224491, T245183, and T253673.

Non-atomic deployments. These servers are currently not restarted, and thus continue processing requests whilst detecting changes to source code in real-time. This means they can end up mixing two incompatible files in the midst of a larger code deployment. This regularly leads to spikes where thousands of requests fail, with log spam and/or alerts that follow. These are then investigated (if not anticipated), or ignored in the hopes it will recover soon, or worse - the change is reverted, which prolongs the failures through a similar error storm on the way back.

These can currently be avoided by opting for a more complex deployment, where changes are split up over several commits, which we document and have to teach every deployer to (try to) remember, and which developers then have to accommodate by maintaining a stack of separate changes during development. For configuration this is somewhat doable. For code backports this is impractical and generally not done, as it would mismatch with the source commit and likely not be mergeable with passing tests.

Furthermore, we reduce confidence in our staging process as the change will be previewed via WikimediaDebug in whole, but then deployed in parts (reported as T239373).

Stale code: When servers are allowed to continue execution for hours/days (e.g. offline maintenance scripts), this means they also continue to use the same values of configuration they read at start-time. This is especially problematic when it comes to depooling hosts for the core database (MySQL/MariaDB), but also ParserCache, MainStash, Redis, PoolCounter, and more. Some details and frustrations are documented at T298485.
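The stale-configuration problem can be sketched in a few lines of Python. This is an illustration under assumptions, not how MediaWiki loads its configuration: the `LongRunningScript` class, the `db_replicas` key, and the host names are all made up.

```python
# Hypothetical illustration; class, key, and host names are made up.
live_config = {"db_replicas": ["db1", "db2", "db3"]}


class LongRunningScript:
    def __init__(self, config: dict):
        # Configuration is copied once at start-up and never re-read.
        self.replicas = list(config["db_replicas"])

    def pick_replica(self, i: int) -> str:
        return self.replicas[i % len(self.replicas)]


script = LongRunningScript(live_config)

# An operator depools db2 by deploying a configuration change.
live_config["db_replicas"] = ["db1", "db3"]

# The already-running script still sends queries to the depooled host.
stale_targets = {script.pick_replica(i) for i in range(6)}

# A process started after the deployment sees the new configuration.
fresh_targets = {LongRunningScript(live_config).pick_replica(i) for i in range(6)}
print(sorted(stale_targets), sorted(fresh_targets))
```

Only a restart (or an explicit re-read, which this sketch does not model) brings the long-running process in line with the deployed configuration.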

Approach

As mentioned, we've already done this for our web-facing servers. There we solved it with a combination of the following three measures:

  1. Make things faster. Operations often approach timeouts or run too long because of a logical mistake or inefficiency that can be addressed first.
    • Split work. E.g. defer work to later, and as-needed spread over multiple jobs via the job queue.
    • Timeouts. Enforce the expected latency. To handle failure scenarios, e.g. where we fail to process user input in time or where backend services are slower than usual, we impose a limit to avoid resource exhaustion or cascading failures (ref Backend performance guidelines § Latency). We already limit GET navigation requests from browsers and API queries to 60 seconds, which includes page views, History, Search, and most Special pages. We also limit POST submission requests to 200 seconds, which includes edits and other user actions.
  2. Disable opcache revalidation. By setting opcache.validate_timestamps=0 we remove the responsibility of the server to detect new source code and thus it will not start compiling new files on its own. Instead, it will keep running the same version of the code until the server is restarted.
  3. Restart servers. To do this with as little disruption as possible, we first depool the server from the loadbalancer to stop new incoming traffic. We then give on-going requests a few seconds to complete, with any remaining ones terminated. The server is then restarted and re-pooled.
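The depool, drain, restart, repool sequence from the third measure can be sketched as follows. This is a simulation under stated assumptions, not the actual restart script: the `Server` class, its fields, and the `rolling_restart` function are hypothetical, and "restart" is modelled as simply switching to the new code version.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for an app server; all fields are hypothetical."""
    name: str
    pooled: bool = True
    in_flight: int = 0          # requests currently being served
    code_version: str = "old"
    log: list = field(default_factory=list)

    def tick(self) -> None:
        """Pretend one in-flight request finishes."""
        if self.in_flight > 0:
            self.in_flight -= 1


def rolling_restart(servers, new_version, grace_seconds=3.0, poll=0.001):
    for s in servers:                      # one server at a time
        s.pooled = False                   # depool: no new traffic
        s.log.append("depool")
        deadline = time.monotonic() + grace_seconds
        while s.in_flight and time.monotonic() < deadline:
            s.tick()                       # let ongoing requests finish
            time.sleep(poll)
        if s.in_flight:
            s.log.append(f"terminated {s.in_flight}")
            s.in_flight = 0                # terminate anything still running
        s.code_version = new_version       # restart picks up the new code
        s.log.append("restart")
        s.pooled = True                    # repool
        s.log.append("repool")


# One server drains in time; the other has far too much work to finish.
fleet = [Server("mw1", in_flight=2), Server("mw2", in_flight=10**6)]
rolling_restart(fleet, "new", grace_seconds=0.2)
for s in fleet:
    print(s.name, s.code_version, s.log)
```

The grace period bounds how long each server holds up the deployment, which is why the choice of threshold matters for clusters with long-running work.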
Concrete work

Details and resourcing TBD.

Roughly:

  • Decide on the wait threshold.
  • Identify job tasks that have exceeded the threshold recently.
  • Identify scheduled maintenance scripts that have exceeded the threshold recently.
  • For jobs, determine which job types, if any, are at risk of unrecoverable termination through retry exhaustion (e.g. runtime vs regular deployments).
  • For the identified long-running maintenance scripts, ensure they can resume and recover by running them again.
  • Extend the restart script to also cover scheduled maintenance scripts, either by quitting them until the next scheduled run, or explicitly restarting them (might depend on scheduling frequency, and what systemd options we can use).

For web traffic we currently wait three to five seconds. It seems appealing to make this longer for jobrunners, but doing so would also lengthen how long people have to wait during a deployment window, since restarts happen during the deployment. According to Grafana dashboard: JobQueue, of the 100 or so job types, only 5 take more than five seconds. The rest complete well within a second, and many within 0.1s even.
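Identifying the job types that exceed the wait threshold amounts to a simple filter. The sketch below assumes per-job-type durations are available as a mapping; the job names and values are invented for illustration and are not taken from the JobQueue dashboard.

```python
# Hypothetical durations in seconds per job type; illustrative values only,
# not taken from the JobQueue dashboard.
duration_seconds = {
    "exampleFastJob": 0.08,
    "exampleMediumJob": 0.9,
    "exampleSlowJob": 12.0,
    "exampleVideoJob": 1358.5,
}

WAIT_THRESHOLD_SECONDS = 5.0


def jobs_over_threshold(durations: dict, threshold: float) -> list:
    """Return (job type, seconds) pairs that exceed the restart wait
    threshold, slowest first."""
    over = [(name, secs) for name, secs in durations.items() if secs > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


at_risk = jobs_over_threshold(duration_seconds, WAIT_THRESHOLD_SECONDS)
print(at_risk)
```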

Screenshot 2022-07-30 at 22.30.18.png (620×1 px, 725 KB)

Given the jobqueue has retry mechanisms in place, the odd job having to be retried is fine, so I don't think we need a larger threshold, nor are there job types that would require speeding up somehow or splitting up. The exception being the all-important webVideoTranscode jobs, which can currently take up to 24 hours. We'll likely need to find a way to parallelise these in-job (make faster) or chunk-and-rejoin (split) these somehow.

For maintenance scripts, I believe 2 of the 30 scheduled maintenance scripts may need attention, i.e. they are long-running and either can't resume/restart or are not yet cleared as safe to restart once or twice a day if they happen to be running during a deployment:

  • refreshLinkRecommendations.php, T299021.
  • updateSpecialPages.php, T310460.

And lastly, there are the snapshot hosts, which run dump-generation maintenance scripts that produce local file artefacts and do not yet have recovery/resume capabilities (afaik). These can indeed take a day or more to complete, and thus can't be safely stopped on a daily or multiple-times-daily basis without disrupting the dumps schedule and causing major wikis to not complete their snapshots.


Ref T278382: Clean up CirrusSearch job retries.

Event Timeline

Task description by @Krinkle:

The exception being the all-important webVideoTranscode jobs which can currently take up to 24 hours. We'll likely need to find a way to parallelise these in-job (make faster) or chunk-and-rejoin (split) these somehow.

The longest invocation I was able to find looking back six months in Grafana/Prometheus for this job, is 18 minutes.

https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=webVideoTranscode&from=now-6M&to=now&viewPanel=61

I don't think that panel measures what you think it measures. In fact, if we go look at the data in JobExecutor.log, even just this morning we had transcodes lasting more than 20 minutes:

..."job_status":true,"job_duration":1358.4691429138184}

I'll post a longer response below.
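As a rough sketch of how such entries could be pulled out of a JSON-lines log, assuming `job_duration` is in seconds (consistent with the "more than 20 minutes" reading above). Only the `job_status` and `job_duration` fields appear in the excerpt; the `job_type` field and the second record are made up for illustration.

```python
import json

# Hypothetical records: only "job_status" and "job_duration" appear in the
# excerpt above; the "job_type" field and the second record are made up.
sample_lines = [
    '{"job_type": "exampleVideoJob", "job_status": true, "job_duration": 1358.4691429138184}',
    '{"job_type": "exampleFastJob", "job_status": true, "job_duration": 0.12}',
]


def long_jobs(lines, min_minutes):
    """Yield (job_type, minutes) for records longer than min_minutes."""
    for line in lines:
        record = json.loads(line)
        minutes = record["job_duration"] / 60  # job_duration assumed to be seconds
        if minutes > min_minutes:
            yield record["job_type"], round(minutes, 1)


slow = list(long_jobs(sample_lines, min_minutes=20))
print(slow)
```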

Let me preface this whole discussion with the fact that the migration to k8s will force us to revalidate [...]

Let's see the various cases:

Jobs (non-video)

Let me start by saying that for jobrunners I don't think the restarts would be a problem, in fact, the only real reason the jobrunners are currently excluded is because they also run videoscaling.

So we could do it, with a caveat:

  1. We have retry logic in changeprop, and we will keep having it even if one day we dismiss it. So if a job fails it will be retried
  2. Most jobs are short enough that the chance of failure is minimal
  3. We're not sure jobs are idempotent, though. That's my big caveat. We should probably do something about those, as restarts will make the chance of them being interrupted higher.
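The idempotency caveat can be made concrete with a small Python sketch. It assumes a job is interrupted on its first attempt and then retried; `run_with_retry`, `Interrupted`, and both example jobs are hypothetical stand-ins, not changeprop or MediaWiki code.

```python
class Interrupted(Exception):
    """Raised when a hypothetical restart kills a job mid-run."""


def run_with_retry(job, attempts=3):
    """Retry a job until it succeeds; a stand-in for queue retry logic."""
    for attempt in range(1, attempts + 1):
        try:
            job(fail=(attempt == 1))  # simulate an interruption on the first attempt
            return attempt
        except Interrupted:
            continue
    raise RuntimeError("retries exhausted")


counter = {"value": 0}
seen = set()


def non_idempotent_job(fail):
    counter["value"] += 1          # side effect happens before the interruption
    if fail:
        raise Interrupted()


def idempotent_job(fail):
    seen.add("page-123")           # repeating this has no further effect
    if fail:
        raise Interrupted()


attempts_a = run_with_retry(non_idempotent_job)
attempts_b = run_with_retry(idempotent_job)
print(counter["value"], seen)
```

Both jobs succeed on the second attempt, but the non-idempotent one has applied its side effect twice. More frequent restarts make that outcome more likely.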

Videoscaling

As explained, we have a lot of long-running jobs on videoscalers, and they're the real culprit. I would say that the problem is basically that we run ffmpeg on large files sequentially, and while ffmpeg does run in parallel it can still take a long time. The correct way to reduce the duration of a videoscaling job would be to chop a video file into 1-minute chunks, process them in parallel, and then merge the results. Yes, that's basically MapReduce.
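The chunk-and-rejoin idea can be sketched in Python as below. This is only the shape of the approach: `transcode_chunk` is a placeholder that returns a label instead of invoking an encoder, and all function names are hypothetical. Real segment splitting and merging of video files has constraints (e.g. around keyframes and audio) that this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(duration_s: float, chunk_s: float = 60.0):
    """Return (start, end) offsets covering the whole duration."""
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


def transcode_chunk(chunk):
    """Placeholder for real work, e.g. invoking an encoder on one segment."""
    start, end = chunk
    return f"segment[{start:.0f}-{end:.0f}]"


def transcode(duration_s: float, workers: int = 4):
    chunks = split_into_chunks(duration_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so segments rejoin correctly.
        segments = list(pool.map(transcode_chunk, chunks))
    return "+".join(segments)          # "reduce": concatenate the segments


result = transcode(150)
print(result)
```

With chunks of about a minute, an interrupted chunk costs at most a minute of redone work, rather than restarting the whole file.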

Videoscaling is basically abandoned, and right now it's a huge security risk and an operational issue and should be revised completely. We should probably adapt some off-the-shelf software for distributed video transcoding to actually do the transcoding work, and communicate with it from MediaWiki.

As a stopgap solution, we could do as follows:

  1. We dedicate a few servers to videoscaling only
  2. We disable opcache revalidation
  3. We only deploy code to them once a day, unless there's some emergency that explicitly affects them

I came to the conclusion that, given that no one has touched the video transcoding code for anything relevant to transcoding since March 2020, we should mostly be ok with that.

CLI scripts

We don't use opcache at all on CLI AFAIR and I don't think it's a good idea to start doing so. I think they're a non-issue on-prem; we are planning to explicitly exclude running crons and manual jobs from redeployment on mw on k8s. When a new release is available, new jobs will pick it up; currently running jobs should just keep working unfazed.

Dumps

Hic sunt leones. We don't currently have a working plan to move them to k8s, but I also don't think they use opcache at all, so it also doesn't apply.

Conclusions

I think this can be done organically as part of the move to k8s, which will happen over the next year. We might need some help with improving the handling of idempotency in jobs, but more importantly I think we can't defer the work on moving most of the MediaWiki configuration to a database with its own backoffice, which would allow us not to need a code deployment every time someone wants to change the configuration of a single wiki.