Production Excellence #42: March 2022
We've had quite the month, with 8 documented incidents. That's more than double the two-year median of three a month (Incident graphs).

2022-03-01 ulsfo network
Impact: For 20 minutes, clients normally routed to Ulsfo were unable to reach our projects. This includes New Zealand, parts of Canada, and the United States west coast.

2022-03-04 esams availability banner sampling
Impact: For 1.5 hours, all wikis were largely unreachable from Europe (via Esams), with more limited impact across the globe via other data centers as well.

2022-03-06 wdqs-categories
Impact: For 1.5 hours, some requests to the public Wikidata Query Service API were sporadically blocked.

2022-03-10 site availability
Impact: For 12 minutes, all wikis were unreachable to logged-in users, and to unregistered users trying to access uncached content.

2022-03-27 api
Impact: For ~4 hours, in three segments of 1-2 hours each over two days, there were higher levels of failed or slow MediaWiki API requests.

2022-03-27 wdqs outage
Impact: For 30 minutes, all WDQS queries failed due to an internal deadlock.

2022-03-29 network
Impact: For approximately 5 minutes, Wikipedia and other Wikimedia sites were slow or inaccessible for many users, mostly in Europe/Africa/Asia. (Details not public at this time.)

2022-03-31 api errors
Impact: For 22 minutes, API server and app server availability were slightly decreased (~0.1% errors, all for s7-hosted wikis such as Spanish Wikipedia), and the latency of API servers was elevated as well.


Incident follow-up

Remember to review and schedule Incident Follow-up (Sustainability) tasks in Phabricator. These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech. Some recently completed sustainability work:

Add linecard diversity to router-to-router interconnect at Codfw
Filed by Chris @CDanis (SRE Infra) in 2020 after an incident where all hosts in the Codfw data center lost connectivity at once. Completed by Arzhel @ayounsi and Cathal @cmooney (SRE Infra), and @Papaul (DC Ops), including at Esams, where the same issue existed.

Expand parser tests to cover language conversion variants in table-of-contents output
Suggested and carried out by @cscott (Parsoid) after reviewing an incident in November. The TOC on wikis that rely on the LanguageConverter service (such as Chinese Wikipedia) was no longer localized.

Fix unquoted URL parameters in Icinga health checks
Suggested by Riccardo @Volans (SRE Infra) in response to an early warning signal for TLS certificate expiry. He realized that automated checks for a related cluster were still claiming to be in good health, when they in fact should have been firing a similar warning. Carried out by Filippo @fgiunchedi and Daniel @Dzahn.
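
As a general illustration of this class of problem (not the actual Icinga change, whose details are not given here), the following minimal Python sketch shows how an unquoted parameter value can silently change what a URL-based check asks for. The endpoint, parameter names, and values are all hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameters, for illustration only.
BASE = "https://monitoring.example.org/check"
PARAMS = {"query": "cluster=a&state=expiring", "warn": "1"}


def build_check_url(base_url, params):
    """Return base_url with params percent-encoded into the query string."""
    return f"{base_url}?{urlencode(params)}"


# Naive concatenation: the raw "&" inside the value is read as a
# parameter separator, so the check receives a truncated query.
naive_url = BASE + "?" + "&".join(f"{k}={v}" for k, v in PARAMS.items())
naive_parsed = parse_qs(urlsplit(naive_url).query)

# Quoted version: the value survives the round trip unchanged.
safe_url = build_check_url(BASE, PARAMS)
safe_parsed = parse_qs(urlsplit(safe_url).query)

print(naive_parsed["query"])  # truncated value
print(safe_parsed["query"])   # original value
```

A check built from the naive URL can still return a healthy-looking response, because it is evaluating a different (truncated) query than the one intended.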

Provide automation to quickly show replication status when primary is down
Filed in April by Jaime (SRE Data Persistence), carried out by John @jbond and Amir @Ladsgroup.


Since the last edition, we resolved 34 of the 298 unresolved errors that carried over from previous months.

In March, we created 54 new production errors. That's quite high compared to the twenty-odd reports we find most months. Of these, 17 remain open today, a month later.

In the month of April, so far, we reported 20 new errors, of which 17 also remain open today.

The production error workboard once again adds up to exactly 298 open tasks (spreadsheet).

Take a look at the workboard and look for tasks that could use your help.

View Workboard



– Timo Tijhof

Written by Krinkle on Apr 21 2022, 9:29 PM.
Principal Engineer (WMF Performance Team)

