Production Excellence #17: December 2019
Monthly update on our pursuit of operational excellence.

How did we do in our pursuit of operational excellence in November and December? Read on to find out!

📊 Month in numbers
  • 0 documented incidents in November, 5 incidents in December. [1]
  • 17 new Wikimedia-prod-error reports. [2]
  • 23 Wikimedia-prod-error reports closed. [3]
  • 190 currently open Wikimedia-prod-error reports in total. [4]

November had zero reported incidents. Prior to this, the last month with no documented incidents was December 2017. To read about past incidents and unresolved actionables, check Incident documentation § 2019.

Explore Wikimedia incident graphs (interactive)


📖 Many dots, do not a query make!

@dcausse investigated a flood of exceptions from SpecialSearch, which reported “Cannot consume query at offset 0 (need to go to 7296)”. This exception served as a safeguard in the parser for search queries. The code path was not meant to be reached. The root cause was narrowed down to the following regex:

/\G(?<negated>[-!](?=[\w]))?(?<word>(?:\\\\.|[!-](?!")|[^"!\pZ\pC-])+)/u

This regex looks complex, but it can actually be simplified to:

/(?:ab|c)+/

This regex still triggers the problematic behavior in PHP: given a long string, it fails with a PREG_JIT_STACKLIMIT_ERROR. Below is a reduced test case:

$ret = preg_match( '/(?:ab|c)+/', str_repeat( 'c', 8192 ) );
if ( $ret === false ) {
    print( "failed with: " . preg_last_error() );
}

  • Fails when given 1365 contiguous “c” characters on PHP 7.0.
  • Fails with 2731 characters on PHP 7.2, PHP 7.1, and PHP 7.0.13.
  • Fails with 8192 characters on PHP 7.3 (might be due to php-src@bb2f1a6).

In the end, we fixed this by splitting the regex into two separate ones, removing the quantified non-capturing group, and looping at the PHP level instead (Gerrit change 546209).
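
As a rough illustration of what looping at the PHP level can look like, the sketch below consumes one alternative per preg_match call, anchored with \G at the current offset, instead of asking PCRE to repeat the group. This is not the code from the Gerrit change; the function name consumeRun and the simplified /\G(?:ab|c)/ pattern are assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper: consume a run of "ab" or "c" tokens starting at
 * $offset, one token per preg_match() call, and return the end offset.
 */
function consumeRun( string $subject, int $offset = 0 ): int {
    $len = strlen( $subject );
    while ( $offset < $len ) {
        // \G anchors the match at $offset, so each call matches a single token.
        $ret = preg_match( '/\G(?:ab|c)/', $subject, $m, 0, $offset );
        if ( $ret === false ) {
            // false signals a PCRE error, which is different from "no match" (0).
            throw new RuntimeException( 'preg_match failed with code ' . preg_last_error() );
        }
        if ( $ret === 0 ) {
            // The next characters are not part of the run.
            break;
        }
        $offset += strlen( $m[0] );
    }
    return $offset;
}
```

Because the repetition happens in PHP rather than inside the regex engine, the amount of work per preg_match call no longer grows with the length of the run. The trade-off is one function call per token.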

The lesson learned here is that the code did not properly check the return value of preg_match. This matters even more because the size allowed for the JIT stack changes between PHP versions.
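
One way to make such failures visible is to wrap preg_match and treat a false return value as an error rather than as "no match". The following is a minimal sketch; the helper name matchOrThrow is hypothetical and not taken from the actual patch.

```php
<?php
/**
 * Hypothetical helper: run preg_match() and turn a silent PCRE failure
 * into an exception. Returns true on a match and false on no match.
 */
function matchOrThrow( string $pattern, string $subject ): bool {
    $ret = preg_match( $pattern, $subject );
    if ( $ret === false ) {
        // preg_match() returns false on error; 0 only means "no match".
        $code = preg_last_error();
        $name = $code === PREG_JIT_STACKLIMIT_ERROR
            ? 'PREG_JIT_STACKLIMIT_ERROR'
            : "PCRE error code $code";
        throw new RuntimeException( "preg_match failed: $name" );
    }
    return $ret === 1;
}
```

With a strict comparison against false, a stack limit error on a long input surfaces immediately at the call site, whatever limit the running PHP version happens to have.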

For future reference, @dcausse concluded: The regex could be optimized to support more chars (~3 times more) by using atomic groups, like so /(?>ab|c)+/. — T236419
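
To compare the two forms on a given PHP build, a small probe along these lines can report roughly where each pattern starts to fail. This is only a sketch: the function name firstFailingLength is hypothetical, the probe only tests powers of two, and the result depends on the PHP version and on whether the PCRE JIT is enabled, so no particular output is implied.

```php
<?php
/**
 * Hypothetical probe: return the first tested length (powers of two up to
 * $maxLength) at which preg_match() reports an error for a string of "c"
 * characters, or null if no tested length fails.
 */
function firstFailingLength( string $pattern, int $maxLength ): ?int {
    for ( $n = 1; $n <= $maxLength; $n *= 2 ) {
        if ( preg_match( $pattern, str_repeat( 'c', $n ) ) === false ) {
            return $n;
        }
    }
    return null;
}

// Example usage: compare the plain group with the atomic group.
$plain = firstFailingLength( '/(?:ab|c)+/', 8192 );
$atomic = firstFailingLength( '/(?>ab|c)+/', 8192 );
print( 'plain: ' . var_export( $plain, true ) . ', atomic: ' . var_export( $atomic, true ) . "\n" );
```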


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:

→ Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: Three reports closed, 6 of 14 left.
  • May: (All clear!)
  • June: Three reports closed. 6 of 11 left (unchanged). ⚠️
  • July: One report closed, 12 of 18 left.
  • August: Two reports closed, 4 of 14 left.
  • September: One report closed, 9 of 12 left.
  • October: Four reports closed, 8 of 12 left.
  • November: 5 new reports survived the month of November.
  • December: 9 new reports survived the month of December.

🎉 Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Written by Krinkle on Jan 10 2020, 2:51 AM.
Principal Engineer (Wikimedia Performance)