Production Excellence #17: December 2019
Monthly update on our pursuit of operational excellence.

How did we do in our pursuit of operational excellence in November and December? Read on to find out!

📊 Month in numbers
  • 0 documented incidents in November, 5 incidents in December. [1]
  • 17 new Wikimedia-prod-error reports. [2]
  • 23 Wikimedia-prod-error reports closed. [3]
  • 190 currently open Wikimedia-prod-error reports in total. [4]

November had zero reported incidents. Prior to this, the last month with no documented incidents was December 2017. To read about past incidents and unresolved actionables, check Incident documentation § 2019.

Explore Wikimedia incident graphs (interactive)

📖 Many dots do not a query make!

@dcausse investigated a flood of exceptions from SpecialSearch, which reported “Cannot consume query at offset 0 (need to go to 7296)”. This exception served as a safeguard in the parser for search queries. The code path was not meant to be reached. The root cause was narrowed down to the following regex:


This regex looks complex, but it can actually be simplified to:

/(?:ab|c)+/

This simplified regex still triggers the problematic behavior in PHP. It fails with a PREG_JIT_STACKLIMIT_ERROR when given a long string. Below is a reduced test case:

// Match a long run of "c" characters against the simplified regex.
$ret = preg_match( '/(?:ab|c)+/', str_repeat( 'c', 8192 ) );
if ( $ret === false ) {
    // false means the regex engine failed, not that there was no match.
    print( "failed with: " . preg_last_error() );
}

  • Fails when given 1365 contiguous c characters on PHP 7.0.
  • Fails with 2731 characters on PHP 7.2, PHP 7.1, and PHP 7.0.13.
  • Fails with 8192 characters on PHP 7.3 (might be due to php-src@bb2f1a6).

In the end, the fix we applied was to split the regex into two separate ones, remove the quantified non-capturing group, and loop at the PHP level instead (Gerrit change 546209).
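
As a rough illustration of looping at the PHP level (this is not the code from Gerrit change 546209), the sketch below applies the idea to the pattern from the reduced test case. Each preg_match() call consumes a single "ab" or "c" token, anchored with \G at the current offset, so the regex engine never has to track a long repetition. The function name consumeRun is hypothetical.

```php
<?php
// Hypothetical helper, for illustration only: consume a run of "ab" or "c"
// tokens starting at $offset and return the offset where the run ends.
function consumeRun( string $subject, int $offset = 0 ): int {
    $length = strlen( $subject );
    while ( $offset < $length ) {
        // \G anchors the match at $offset; there is no quantifier, so each
        // call matches at most one token.
        $ret = preg_match( '/\G(?:ab|c)/', $subject, $m, 0, $offset );
        if ( $ret === false ) {
            // Engine failure: surface it instead of treating it as "no match".
            throw new RuntimeException( 'preg_match failed: ' . preg_last_error() );
        }
        if ( $ret === 0 ) {
            // The next characters are not part of the run.
            break;
        }
        $offset += strlen( $m[0] );
    }
    return $offset;
}
```

The trade-off is one preg_match() call per token rather than one call per run, in exchange for regex-engine stack use that no longer grows with the input length.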

The lesson learned here is that the code did not properly check the return value of preg_match. This matters all the more because the size allowed for the JIT stack changes between PHP versions.
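
One way to make such a check hard to forget is a small wrapper that separates "no match" (0) from "the engine failed" (false). The following is a minimal sketch under that assumption. The name checkedPregMatch is hypothetical and not part of MediaWiki or CirrusSearch.

```php
<?php
// Hypothetical wrapper: returns true on a match, false on no match, and
// throws when preg_match() itself fails (for example with
// PREG_JIT_STACKLIMIT_ERROR).
function checkedPregMatch( string $pattern, string $subject ): bool {
    $ret = preg_match( $pattern, $subject );
    if ( $ret === false ) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException(
            'preg_match failed with error code ' . preg_last_error()
        );
    }
    return $ret === 1;
}
```

With a wrapper like this, a JIT stack limit failure shows up as an exception at the call site rather than being read as an ordinary "no match".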

For future reference, @dcausse concluded: The regex could be optimized to support more chars (~3 times more) by using atomic groups, like so /(?>ab|c)+/. — T236419
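
To compare the two variants on a given PHP build, a small probe along these lines could help. It is only a sketch: the function name firstFailingLength and the doubling step sizes are hypothetical, and the results depend on the PHP version, the PCRE build, and the pcre.jit setting, so no particular output is claimed here.

```php
<?php
// Hypothetical probe: doubles the subject length until preg_match() reports
// an engine error, and returns that length (or null if none was seen).
function firstFailingLength( string $pattern, int $max = 16384 ): ?int {
    for ( $n = 1024; $n <= $max; $n *= 2 ) {
        if ( preg_match( $pattern, str_repeat( 'c', $n ) ) === false ) {
            return $n;
        }
    }
    return null;
}

// Compare the non-capturing group with the atomic group.
foreach ( [ '/(?:ab|c)+/', '/(?>ab|c)+/' ] as $pattern ) {
    $n = firstFailingLength( $pattern );
    print( $pattern . ': ' . ( $n === null ? 'no failure seen' : "fails at $n" ) . "\n" );
}
```

Because the probe only tests powers of two, it gives a coarse bound rather than the exact thresholds listed for each PHP version above.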

📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.


Or help someone that’s already started with their patch:

→ Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: Three reports closed, 6 of 14 left.
  • May: (All clear!)
  • June: Three reports closed. 6 of 11 left (unchanged). ⚠️
  • July: One report closed, 12 of 18 left.
  • August: Two reports closed, 4 of 14 left.
  • September: One report closed, 9 of 12 left.
  • October: Four reports closed, 8 of 12 left.
  • November: 5 new reports survived the month of November.
  • December: 9 new reports survived the month of December.

🎉 Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Jan 10 2020, 2:51 AM.
Principal Engineer (WMF Performance Team)