Production Excellence #41: February 2022
Monthly update on our strive for operational excellence.

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

3 documented incidents last month.

2022-02-01 ulsfo network
Impact: For 3 minutes, clients served by the ulsfo POP were not able to contribute or display un-cached pages.

2022-02-22 wdqs updater codfw
Impact: For 2 hours, WDQS updates failed to be processed. Most bots and tools were unable to edit Wikidata during this time.

2022-02-22 vrts
Impact: For 12 hours, incoming emails to a specific, recently created VRTS queue were not processed. Senders received a bounce with an SMTP 550 error.

[Figure: proderr-incidents 2022-02.png]

Figure from Incident graphs.


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are the preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

Recently conducted incident follow-up:

Create a dashboard for Prometheus metrics about health of Prometheus itself.
Pitched by CDanis after an April 2019 incident, carried out by Filippo (@fgiunchedi).

Improve wording around AbuseFilter messages about throttling functionality.
Originally filed in 2018. This came up last month during an incident where the wording may've led to a misunderstanding. Now resolved by @Daimona.

Exclude restart procedure from automated Elasticsearch provisioning.
There can be too much automation! Filed after an incident last September. Fixed by @RKemper.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

I skip breakdowns most months as each breakdown has its flaws. However, I hear people find them useful, so I'll try to do them from time to time with my noted caveats. The last breakdown was in the December edition, which focussed on throughput during a typical month. It is important to recognise that neither high nor low throughput is per se good or bad. It's good when issues are detected, reported, and triaged correctly. It's also good if a team's components are stable and don't produce any errors. A report may be found to be invalid or a duplicate, which is sometimes only determined a few weeks later.

The below "after six months" breakdown takes more of that into consideration by looking at what's on the table after six months (tasks up to Sept 2021). This may be considered "fairer" in some sense, although it has the drawback of suffering from hindsight bias, and possibly not highlighting current or most urgent areas.

WMF Product:

  • Anti Harassment Tools (3): 1 MW Blocks, 2 SecurePoll.
  • Community Tech (0).
  • Design Systems (1): 1 WVUI.
  • Editing Team (15): 14 VisualEditor, 1 OOUI.
  • Growth Team (13): 11 Flow, 1 GrowthExperiments, 1 MW Recent changes.
  • Language Team (6): 4 ContentTranslation, 1 CX-server, 1 Translate extension.
  • Parsoid Team (9): 8 Parsoid, 1 ParserFunctions extension.
  • Product Infrastructure (4): 2 JsonConfig, 1 Kartographer, 1 WikimediaEvents.
  • Reading Web (0).
  • Structured Data (4): 2 MW Uploading, 1 WikibaseMediaInfo, 1 3D extension.

WMF Tech:

  • Data Engineering: 1 EventLogging.
  • Fundraising Tech: 1 CentralNotice.
  • Performance: 1 Rdbms.
  • Platform MediaWiki Team (19): 4 MW-Page-data, 1 MW-REST-API, 1 MW-Action-API, 1 MW-Snapshots, 1 MW-ContentHandler, 1 MW-JobQueue, 1 MW-libs-RequestTimeout, 9 Other.
  • Search Platform: 1 MW-Search.
  • SRE Service Operations: 1 Other.

WMDE:

  • WMDE-Wikidata (7): 5 Wikibase, 2 Lexeme.
  • WMDE-TechWish: 1 FileImporter.

Other:

  • Missing steward (7): 2 Graph, 2 LiquidThreads, 2 TimedMediaHandler, 1 MW Special-Contributions-page.
  • Individually maintained (2): 1 WikimediaIncubator, 1 Score extension.
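
For readers who want group-level totals, the per-team counts above can be summed with a few lines of Python. This is only a sketch and not part of the report's own tooling: the dictionary is a manual transcription of the list above, the names `breakdown` and `group_totals` are made up for illustration, and the Product Infrastructure figure (4) is the sum of its three listed components.

```python
# Counts transcribed by hand from the "after six months" breakdown above.
# The structure and names here are illustrative assumptions.
breakdown = {
    "WMF Product": {
        "Anti Harassment Tools": 3,
        "Community Tech": 0,
        "Design Systems": 1,
        "Editing Team": 15,
        "Growth Team": 13,
        "Language Team": 6,
        "Parsoid Team": 9,
        "Product Infrastructure": 4,  # 2 JsonConfig + 1 Kartographer + 1 WikimediaEvents
        "Reading Web": 0,
        "Structured Data": 4,
    },
    "WMF Tech": {
        "Data Engineering": 1,
        "Fundraising Tech": 1,
        "Performance": 1,
        "Platform MediaWiki Team": 19,
        "Search Platform": 1,
        "SRE Service Operations": 1,
    },
    "WMDE": {
        "WMDE-Wikidata": 7,
        "WMDE-TechWish": 1,
    },
    "Other": {
        "Missing steward": 7,
        "Individually maintained": 2,
    },
}


def group_totals(data):
    """Return the summed task count for each top-level group."""
    return {group: sum(teams.values()) for group, teams in data.items()}


totals = group_totals(breakdown)
grand_total = sum(totals.values())

for group, count in totals.items():
    print(f"{group}: {count}")
print(f"All groups: {grand_total}")
```

As the caveats above note, these totals say nothing on their own about whether a team's throughput is good or bad.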

Trends

In February, we reported 25 new production errors. Of those, 13 have since been resolved, and 12 remain open as of today (two weeks into the following month). We also resolved 22 errors that remained open from previous months. The overall workboard has grown slightly to a total of 301 outstanding error reports.

[Figure: proderr-unified 2022-02.png]

For the month-over-month numbers, refer to the spreadsheet data.


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Written by Krinkle on Mar 15 2022, 12:59 AM.
Principal Engineer (Wikimedia Performance)