Rationale and value
We deploy code updates and configuration changes multiple times a day. When a deployment happens and a server running that software is not restarted, we run into numerous functional and operational problems:
- Corruption of compilation cache. This poses a significant security threat, and has caused hundreds of time-consuming incidents and error investigations in recent years.
- Non-atomic deployments. This harms developer productivity, reduces confidence in our staging process, increases complexity of deployment processes, and reduces overall throughput.
- Stale code and business logic or configuration due to continued handling of production traffic. This creates operational problems, as supposedly inactive/deconfigured hosts for backend services continue to receive traffic for many hours or even days after the configuration change has been deployed.
We've successfully solved the above in T266055 for our web-facing clusters (appservers, and api_appservers) which serve all our public wiki traffic (index.php, api.php, load.php, etc.). This task is to accomplish the same for the remaining clusters:
- jobrunners (jobs over HTTP)
- videoscalers (jobs over HTTP, dedicated for TMH transcoding jobs).
- maintenance host (CLI maintenance scripts).
- snapshot hosts (CLI maintenance scripts, dedicated for XML dump generation).
Background
Corruption: This refers to corruption in the Opcache module. Opcache has a fixed amount of memory for storing compiled and optimised source code. When new source code or configuration files are deployed, the FPM instance has to compile the new source code while also keeping the old source code in memory. Eventually this runs out of space, which, if it happens under live traffic (rather than being cleared by a restart), often leads to corruption of the compiled source code.
This corruption tends to manifest as subtle off-by-one errors, such as a configuration value being flipped by one byte, and poses a significant security threat. E.g. if a configuration value OpenForAll defaults to true for public wikis but we set OpenForAll = false for a private wiki or blocked user group, this can get corrupted to something like PpenForAll = false (O > P) and thus quietly no longer have any effect. There are also more obvious errors, such as classes or constants becoming undefined due to being defined under the wrong name, etc.
For examples and background on the dozens upon dozens of these errors we've had since 2019, refer to T224491, T245183, and T253673.
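To make the OpenForAll example concrete, here is a minimal Python sketch of why a one-character flip in a setting name fails silently. This is only an illustration of the failure mode, not how Opcache or our configuration loading actually works. The helpers `effective_config` and `corrupt_first_char` are made up for this sketch, and OpenForAll is the illustrative name from above.

```python
# Illustrative only: OpenForAll is the made-up setting name from the example above.
DEFAULTS = {"OpenForAll": True}


def effective_config(overrides):
    """Apply per-wiki overrides on top of the defaults."""
    config = dict(DEFAULTS)
    config.update(overrides)
    return config


def corrupt_first_char(name):
    """Simulate a one-byte flip by bumping the first character by one code point."""
    return chr(ord(name[0]) + 1) + name[1:]


# What we meant to deploy for the private wiki.
intended = effective_config({"OpenForAll": False})

# What the server ends up with if the override's name is corrupted (O > P).
corrupted = effective_config({corrupt_first_char("OpenForAll"): False})

print(intended["OpenForAll"])   # False: the wiki is closed, as intended
print(corrupted["OpenForAll"])  # True: the override landed under the wrong name
```

Nothing raises an error in the corrupted case. The override is simply stored under a name nobody reads, so the default quietly applies.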
Non-atomic deployments. These servers are currently not restarted, and thus continue processing requests whilst detecting changes to source code in real-time. This means they can end up mixing two incompatible files in the midst of a larger code deployment. This regularly leads to spikes where thousands of requests fail, with log spam and/or alerts that follow. These are then investigated (if not anticipated), or ignored in the hopes it will recover soon, or worse - the change is reverted and thus prolongs the failures through a similar error storm on the way back.
These can currently be avoided by opting for a more complex deployment, where changes are split up over several commits. We document this and have to teach every deployer to (try to) remember it, and developers then have to accommodate it by maintaining a stack of separate changes during development. For configuration this is somewhat doable. For code backports this is impractical and generally not done, as it would mismatch with the source commit, and would likely not be mergeable with passing tests.
Furthermore, we reduce confidence in our staging process, as the change will be previewed via WikimediaDebug in whole, but then deployed in parts (reported as T239373).
Stale code: When servers are allowed to continue execution for hours/days (e.g. offline maintenance scripts), this means they also continue to use the same values of configuration they read at start-time. This is especially problematic when it comes to depooling hosts for the core database (MySQL/MariaDB), but also ParserCache, MainStash, Redis, PoolCounter, and more. Some details and frustrations are documented at T298485.
Approach
As mentioned, we've already done this for our web-facing servers. There we solved it with a combination of the following measures:
- Make things faster. It's not uncommon that the reason certain operations approach timeouts or are running too long, is due to a logical mistake or inefficiency that can be addressed first.
- Split work. E.g. defer work to later, and as-needed spread over multiple jobs via the job queue.
- Timeouts. Enforce the expected latency. To handle failure scenarios where e.g. we fail to process user input in time, or where backend services are slower than usual, we impose a limit to avoid causing resource exhaustion or cascading failures (ref Backend performance guidelines § Latency). We already limit GET navigation requests from browsers and API queries to 60 seconds, which includes page views, History, Search, and most Special pages. We also limit POST submission requests to 200 seconds, which includes edits and other user actions.
- Disable opcache revalidation. By setting opcache.validate_timestamps=0 we remove the responsibility of the server to detect new source code and thus it will not start compiling new files on its own. Instead, it will keep running the same version of the code until the server is restarted.
- Restart servers. To do this with as little disruption as possible, we first depool the server from the loadbalancer to stop new incoming traffic. We then give on-going requests a few seconds to complete, with any remaining ones terminated. The server is then restarted and re-pooled.
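The restart sequence in the last measure can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not our actual restart script. The callables `depool`, `active_requests`, `terminate`, `restart` and `repool` are hypothetical placeholders for whatever the load balancer and service manager provide, and the 5-second default mirrors the "few seconds" mentioned above.

```python
import time


def restart_with_drain(host, depool, active_requests, terminate, restart, repool,
                       grace_seconds=5.0, poll_interval=0.5, sleep=time.sleep):
    """Depool a host, let requests drain, then restart and repool it.

    All callables are hypothetical hooks supplied by the caller.
    Returns the number of requests that were still running when the
    grace period ended (0 if the host drained in time).
    """
    depool(host)  # stop new incoming traffic

    # Give on-going requests a bounded amount of time to complete.
    waited = 0.0
    remaining = active_requests(host)
    while remaining > 0 and waited < grace_seconds:
        sleep(poll_interval)
        waited += poll_interval
        remaining = active_requests(host)

    if remaining > 0:
        terminate(host)  # anything still running is cut off

    # If restart raises, the host deliberately stays depooled.
    restart(host)
    repool(host)
    return remaining


# Example with fake hooks: the host drains after two polls.
events = []
in_flight = iter([2, 1, 0])
cut_off = restart_with_drain(
    "example-host",
    depool=lambda h: events.append("depool"),
    active_requests=lambda h: next(in_flight),
    terminate=lambda h: events.append("terminate"),
    restart=lambda h: events.append("restart"),
    repool=lambda h: events.append("repool"),
    sleep=lambda seconds: None,  # skip real sleeping in the example
)
print(events, cut_off)
```

The grace period is the only tunable that matters for the remaining clusters: a longer value lets more work finish, but every host then takes longer to restart.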
Concrete work
Details and resourcing TBD.
Roughly:
- Decide on the wait threshold.
- Identify job tasks that have exceeded the threshold recently.
- Identify scheduled maintenance scripts that have exceeded the threshold recently.
- For jobs, determine which job types, if any, are at risk of unrecoverable termination through retry exhaustion (e.g. runtime vs regular deployments).
- For the identified long-running maintenance scripts, ensure they can resume and recover by running them again.
- Extend the restart script to also cover scheduled maintenance scripts, either by quitting them until the next scheduled run, or explicitly restarting them (might depend on scheduling frequency, and what systemd options we can use).
For web traffic we currently wait three to five seconds. It seems appealing to make this longer for jobrunners, but doing so would also lengthen how long people have to wait during a deployment window, since restarts happen during the deployment. According to Grafana dashboard: JobQueue, of the 100 or so job types, only 5 take more than five seconds. The rest complete well within a second, and many within 0.1s even.
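As a rough illustration of the "identify what exceeds the threshold" step, the check boils down to a filter over per-type durations. The sketch below is in Python and uses made-up job names and numbers purely as placeholders. Real figures would have to come from the JobQueue dashboard, and the choice of p99 as the metric is an assumption of this sketch.

```python
GRACE_SECONDS = 5.0  # assumed wait threshold, matching the web-traffic upper bound

# Hypothetical job types and p99 durations in seconds, for illustration only.
p99_seconds = {
    "exampleFastJob": 0.08,
    "exampleMediumJob": 0.9,
    "exampleSlowJob": 12.0,
    "exampleTranscodeJob": 3600.0,
}


def jobs_over_threshold(durations, threshold):
    """Return (job type, seconds) pairs above the threshold, slowest first."""
    over = [(name, secs) for name, secs in durations.items() if secs > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


for name, secs in jobs_over_threshold(p99_seconds, GRACE_SECONDS):
    print(f"{name}: {secs:.1f}s")
```

Only the job types this returns need a closer look at their retry behaviour. Everything else would normally finish within the grace period.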
Given the jobqueue has retry mechanisms in place, the odd job having to be retried is fine, so I don't think we need a larger threshold, nor are there job types that would require speeding up somehow or splitting up. The exception is the all-important webVideoTranscode jobs, which can currently take up to 24 hours. We'll likely need to find a way to parallelise these in-job (make faster) or chunk-and-rejoin (split) these somehow.
For maintenance scripts, I believe 2 of the 30 scheduled maintenance scripts may need attention. That is, they are long-running and either can't resume/restart, or are otherwise not yet cleared as safe to restart once or twice a day if they happen to be running during a deployment:
And lastly, there are the snapshot hosts, which run dump-generation maintenance scripts that produce local file artefacts and do not yet have recovery/resume capabilities (afaik). These would indeed take a day or more to complete, and thus can't be safely stopped on a daily or multiple-times-daily basis without disrupting the dumps schedule and causing major wikis to not complete their snapshots.
