Production Excellence #37: October 2021
Monthly update on our pursuit of operational excellence.

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

There were 4 documented incidents last month. This is about average compared to the past five years (per the Incident graphs).

2021-10-08 network provider
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes and Russia for 1 hour. It was caused by a routing problem with one of several redundant network providers.

2021-10-22 eqiad networking
Impact: For ~40 minutes, clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in ~10 minutes.

2021-10-25 s3 db replica
Impact: For ~30 minutes, MediaWiki backends were slower than usual. For ~12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.

2021-10-29 graphite
Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some were recovered via the redundant server, but others were lost because the redundant server had also been upgraded since then and had lost some data in a similar fashion.

Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

[Image: proderr-incidents 2021-10.png]


Trends
Norwegian blue 🐦

298 bugs were up on the board.
We solved 20 of those over the past thirty days.

How many might now be left unexplored?
We also added new bugs to our database.

Half those bugs are pining for their fjord.
The other 23 carry on, with their dossiers.

All in all, 301 bugs up on the board.

In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.

Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.
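The bookkeeping in the two paragraphs above can be checked with a few lines of Python. This is only an illustrative sketch: the variable and function names are made up for this example and are not part of any workboard tooling; the numbers are the ones quoted in the text.

```python
# Figures quoted in the text for October 2021.
new_reported = 49    # new production-error tasks reported in October
new_resolved = 26    # of those, resolved within the month
prev_open = 298      # unresolved reports on the board beforehand
prev_resolved = 20   # older reports resolved during the month


def carried_forward(reported: int, resolved: int) -> int:
    """New reports that stay open and roll over to the next month."""
    return reported - resolved


def board_total(open_before: int, resolved_old: int, carried: int) -> int:
    """Open reports on the board after this month's changes."""
    return open_before - resolved_old + carried


carried = carried_forward(new_reported, new_resolved)
total = board_total(prev_open, prev_resolved, carried)
print(carried, total)  # 23 301
```

The net change for the month is 23 carried forward minus 20 older reports resolved, so the board grows by 3.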

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Issues carried over from recent months:

Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 12 of 31 issues left.
Aug 2021: 12 of 46 issues left.
Sep 2021: 11 of 24 issues left.
Oct 2021: 23 of 49 new issues are carried forward.
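To compare months with different report volumes, the list above can be turned into a resolved share per month. The snippet below is a rough sketch that assumes each entry reads as "issues left of issues reported"; the dictionary and function names are hypothetical and exist only for this example.

```python
# (issues left, issues reported) per month, as listed above.
carried_over = {
    "Apr 2021": (9, 42),
    "May 2021": (16, 54),
    "Jun 2021": (9, 26),
    "Jul 2021": (12, 31),
    "Aug 2021": (12, 46),
    "Sep 2021": (11, 24),
    "Oct 2021": (23, 49),
}


def resolved_share(left: int, reported: int) -> float:
    """Fraction of a month's reported issues that are no longer open."""
    return (reported - left) / reported


for month, (left, reported) in carried_over.items():
    share = resolved_share(left, reported)
    print(f"{month}: {reported - left} of {reported} resolved ({share:.0%})")

# Open issues from these seven months only; older months are not listed here.
open_recent = sum(left for left, _ in carried_over.values())
```

Note that `open_recent` covers only the months listed, so it is smaller than the 301 total on the board, which also includes reports from earlier months.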

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

Written by Krinkle on Fri, Nov 5, 2:05 AM.
Principal Engineer (Performance)