Page MenuHomePhabricator

API cluster failure / OOM
Closed, ResolvedPublic

Description

At around 2.00 UTC on 27/11/2016 the api appservers started using more and more memory and eventually throw enough errors and page at 2.27 UTC 02:27 -icinga-wm:#wikimedia-operations- PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds

I've captured some hhvm backtraces on e.g. mw1279 in /root and roll-restarted api_appcluster which *seems* to have helped a bit.

Taking a closer look there seemed to be failing requests for euwiki from parsoid (from apache access log), among other failing requests. The requests for euwiki seemed to be continuing failing even after the rolling restart.

2016-11-27T02:02:53     119996827       10.64.32.74     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.74
2016-11-27T02:02:55     119997898       10.64.32.89     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.89
2016-11-27T02:03:04     119992080       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:09     119996351       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:09     119991462       10.64.32.84     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.84
2016-11-27T02:03:16     119994184       10.64.32.90     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.90
2016-11-27T02:03:18     119990649       10.64.0.218     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.218
2016-11-27T02:03:19     119986385       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:20     119999445       10.64.32.78     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.78

In logstash there were logs related to dbquery for euwiki, e.g. (shortlink https://logstash.wikimedia.org/goto/4b7ce91a52c4feef185a3075d31162d9)

Database::query: lost connection to 10.64.32.136; reconnected:
#0 /srv/mediawiki/php-1.29.0-wmf.3/includes/libs/rdbms/database/Database.php(1254): Database->query()
#1 /srv/mediawiki/php-1.29.0-wmf.3/includes/libs/rdbms/database/Database.php(1322): Database->select()
#2 /srv/mediawiki/php-1.29.0-wmf.3/includes/cache/LinkCache.php(296): Database->selectRow()
#3 /srv/mediawiki/php-1.29.0-wmf.3/includes/cache/LinkCache.php(267): LinkCache->fetchPageRow()
#4 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(3209): LinkCache->addLinkObj()
#5 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(934): Title->getArticleID()
#6 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(955): Title->getContentModel()
#7 /srv/mediawiki/php-1.29.0-wmf.3/extensions/Scribunto/common/Hooks.php(94): Title->hasContentModel()
#8 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ScribuntoHooks::invokeHook()
#9 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#10 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#11 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#12 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#13 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1448): PPFrame_Hash->expand()
#14 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3253): PPTemplateFrame_Hash->cachedExpand()
#15 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#16 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#17 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#18 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#19 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#20 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParserFunctions/ParserFunctions_body.php(104): PPFrame_Hash->expand()
#21 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ExtParserFunctions::ifObj()
#22 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#23 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#24 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1528): PPFrame_Hash->expand()
#25 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1540): PPTemplateFrame_Hash->getNamedArgument()
#26 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3759): PPTemplateFrame_Hash->getArgument()
#27 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1032): Parser->argSubstitution()
#28 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1528): PPFrame_Hash->expand()
#29 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1540): PPTemplateFrame_Hash->getNamedArgument()
#30 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3759): PPTemplateFrame_Hash->getArgument()
#31 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1032): Parser->argSubstitution()
#32 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#33 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#34 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParserFunctions/ParserFunctions_body.php(106): PPFrame_Hash->expand()
#35 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ExtParserFunctions::ifObj()
#36 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#37 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#38 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#39 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#40 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#41 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#42 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#43 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#44 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(2917): PPFrame_Hash->expand()
#45 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(654): Parser->replaceVariables()
#46 /srv/mediawiki/php-1.29.0-wmf.3/includes/StubObject.php(113): Parser->preprocess()
#47 /srv/mediawiki/php-1.29.0-wmf.3/includes/StubObject.php(139): StubObject->_call()
#48 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(174): StubObject->__call()
#49 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(82): ApiParsoidBatch->preprocess()
#50 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(1434): ApiParsoidBatch->execute()
#51 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(510): ApiMain->executeAction()
#52 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(481): ApiMain->executeActionWithErrorHandling()
#53 /srv/mediawiki/php-1.29.0-wmf.3/api.php(83): ApiMain->execute()
#54 /srv/mediawiki/w/api.php(3): include()
#55 {main}

Details

Related Gerrit Patches:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Joe added a comment.Nov 29 2016, 8:54 AM

Expanding a bit on the concept of having a separate cluster:

  • Whatever spawns from ChangeProp is an async job, not part of serving content to actual, live users
  • For async processing, we have a separate cluster in mediawiki, and that's the jobrunners cluster. We're not doing the same with Changeprop.
  • While rate-limiting on the Changeprop side is probably a good thing to tackle, getting every edge case right is probably very tricky.
  • Separation of concerns will never be fully ironclad (databases will still be shared after all), but it would at least confine events like the current one to async processing

So my personal view is as follows: for most systems, requests originated by changeprop should not be served by the same servers that are serving the traffic from live users.

This can be very tricky and depend on a lot of details about how every call is made, but let's focus on MediaWiki though: how do calls to MW originate indirectly from changeprop? how can we divert them to a different cluster?

In my understanding, what happens most commonly is:

  • ChangeProp asks Restbase to re-render some content
  • Restbase forwards the re-render request to Parsoid
  • Parsoid forwards it to the mediawiki api

So the easiest way to do this would be to separate a few servers (how many, to be determined) from the restbase, parsoid, and mediawiki api clusters and dedicate those to this specific task. I don't see any better solution, to be honest, unless we put a L7 proxy between the service layers, which IMHO would be a worse solution.

I'll use the morning to assess which percentage of requests to the various clusters are originated by changeprop, to understand what the scope of the change might need to be.

Joe added a comment.Nov 29 2016, 9:14 AM

So, to reinforce my argument, this is my extraction of ratio of requests to parsoid from changeprop:

sudo salt 'wtp1*' cmd.run "awk 'BEGIN {total=0; changeprop=0} /started parsing/ {total++; if (/ChangePropagation/) { changeprop++;}} END {print changeprop/total}' /srv/log/parsoid/main.log" 
wtp1005.eqiad.wmnet:
    0.846491
wtp1021.eqiad.wmnet:
    0.846637
wtp1014.eqiad.wmnet:
    0.847237
wtp1022.eqiad.wmnet:
    0.845636
wtp1013.eqiad.wmnet:
    0.84699
wtp1019.eqiad.wmnet:
    0.848367
wtp1017.eqiad.wmnet:
    0.846816
wtp1010.eqiad.wmnet:
    0.849078
wtp1020.eqiad.wmnet:
    0.845065
wtp1016.eqiad.wmnet:
    0.846363
wtp1004.eqiad.wmnet:
    0.846454
wtp1009.eqiad.wmnet:
    0.844574
wtp1006.eqiad.wmnet:
    0.845354
wtp1023.eqiad.wmnet:
    0.845964
wtp1024.eqiad.wmnet:
    0.847731
wtp1003.eqiad.wmnet:
    0.848589
wtp1001.eqiad.wmnet:
    0.848655
wtp1015.eqiad.wmnet:
    0.849198
wtp1011.eqiad.wmnet:
    0.846692
wtp1008.eqiad.wmnet:
    0.846391
wtp1007.eqiad.wmnet:
    0.844595
wtp1002.eqiad.wmnet:
    0.848969
wtp1018.eqiad.wmnet:
    0.846035
wtp1012.eqiad.wmnet:
    0.845007

a staggering 84.5% of rendering requests to parsoid seem to come from changepropagation.

Joe added a comment.Nov 29 2016, 9:34 AM

For the MediaWiki API, the number is way lower in general:

sudo salt -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run " awk 'BEGIN {total=0; changeprop=0} {total++; if (/ChangePropagation/) { changeprop++;}} END {print changeprop/total}' /var/log/apache2/other_vhosts_access.log.1"
mw1227.eqiad.wmnet:
    0.0363048
mw1223.eqiad.wmnet:
    0.0369746
mw1228.eqiad.wmnet:
    0.036771
mw1231.eqiad.wmnet:
    0.036394
mw1224.eqiad.wmnet:
    0.0365077
mw1234.eqiad.wmnet:
    0.0367172
mw1226.eqiad.wmnet:
    0.0366096
mw1233.eqiad.wmnet:
    0.0365922
mw1222.eqiad.wmnet:
    0.0364166
mw1235.eqiad.wmnet:
    0.0368284
mw1232.eqiad.wmnet:
    0.0364804
mw1230.eqiad.wmnet:
    0.0362725
mw1229.eqiad.wmnet:
    0.0363956
mw1221.eqiad.wmnet:
    0.0367211
mw1196.eqiad.wmnet:
    0.0363961
mw1204.eqiad.wmnet:
    0.0357157
mw1193.eqiad.wmnet:
    0.0358446
mw1207.eqiad.wmnet:
    0.0356797
mw1208.eqiad.wmnet:
    0.0358427
mw1191.eqiad.wmnet:
    0.0360581
mw1195.eqiad.wmnet:
    0.0359184
mw1203.eqiad.wmnet:
    0.0360786
mw1189.eqiad.wmnet:
    0.0364143
mw1192.eqiad.wmnet:
    0.0364371
mw1201.eqiad.wmnet:
    0.0355747
mw1206.eqiad.wmnet:
    0.0357697
mw1198.eqiad.wmnet:
    0.0357252
mw1225.eqiad.wmnet:
    0.0363874
mw1200.eqiad.wmnet:
    0.0362528
mw1197.eqiad.wmnet:
    0.0359116
mw1190.eqiad.wmnet:
    0.0358727
mw1194.eqiad.wmnet:
    0.0357255
mw1202.eqiad.wmnet:
    0.035878
mw1199.eqiad.wmnet:
    0.0357621
mw1205.eqiad.wmnet:
    0.03577
mw1279.eqiad.wmnet:
    0.0350745
mw1282.eqiad.wmnet:
    0.0352047
mw1276.eqiad.wmnet:
    0.035097
mw1280.eqiad.wmnet:
    0.0350594
mw1277.eqiad.wmnet:
    0.0352321
mw1286.eqiad.wmnet:
    0.0350941
mw1290.eqiad.wmnet:
    0.0347753
mw1284.eqiad.wmnet:
    0.0347714
mw1283.eqiad.wmnet:
    0.0348901
mw1287.eqiad.wmnet:
    0.0348356
mw1278.eqiad.wmnet:
    0.034635
mw1289.eqiad.wmnet:
    0.034799
mw1285.eqiad.wmnet:
    0.035124
mw1288.eqiad.wmnet:
    0.0344889
mw1281.eqiad.wmnet:
    0.0350896
mark added a subscriber: mark.Nov 29 2016, 11:57 AM
Reedy added a subscriber: Reedy.Nov 29 2016, 1:55 PM

Reporting the patch here too to avoid relying on external paste

diff --git a/api.php b/api.php
index 6e75fb7..f40e235 100644
--- a/api.php
+++ b/api.php
@@ -32,6 +32,15 @@
 use MediaWiki\Logger\LegacyLogger;
+if (
+       ( isset( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'Parsoid' ) !== false ) &&
+       ( isset( $_SERVER['HTTP_HOST'] ) && $_SERVER['HTTP_HOST'] === 'eu.wikipedia.org' )
+
+) {
+       header( $_SERVER['SERVER_PROTOCOL'] . ' 500 T151702 Bandaid', true, 500 );
+       die( 1 );
+}
+
 // So extensions (and other code) can check whether they're running in API mode
 define( 'MW_API', true );

Turned into a commit at https://gerrit.wikimedia.org/r/#/c/324182/1

Joe added a comment.Nov 29 2016, 2:36 PM

It should be noted that we had this issue again, between approx 12:35 UTC and 12:47 UTC today, but the problem has had public-facing effect mostly in the last 3 minutes of the incident. @Reedy has added @ori's patch back and that, toghether with a rolling restart of HHVM, cleared the issue.

It seems this is again due to euwiki, and the stacktraces were similar to what collected earlier

Joe added a comment.Nov 29 2016, 3:02 PM

So, the number of requests that ChangePropagation does to the API is small in itself; but most requests are from parsoid, and they do use the Parsoid/X.Y.Z UA.

Counting those gives pretty different numbers, in particular if we only count POSTs:

sudo salt -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run "fgrep POST /var/log/apache2/other_vhosts_access.log | awk 'BEGIN {total=0; changeprop=0} {total++; if (/Parsoid/) { changeprop++;}} END {print changeprop/total}'"
mw1225.eqiad.wmnet:
    0.718383
mw1223.eqiad.wmnet:
    0.720376
mw1227.eqiad.wmnet:
    0.718998
mw1229.eqiad.wmnet:
    0.718939
mw1222.eqiad.wmnet:
    0.718477
mw1204.eqiad.wmnet:
    0.720225
mw1195.eqiad.wmnet:
    0.720146
mw1224.eqiad.wmnet:
    0.720202
mw1208.eqiad.wmnet:
    0.717703
mw1234.eqiad.wmnet:
    0.71629
mw1199.eqiad.wmnet:
    0.717074
mw1226.eqiad.wmnet:
    0.719125
mw1201.eqiad.wmnet:
    0.716597
mw1228.eqiad.wmnet:
    0.72082
mw1192.eqiad.wmnet:
    0.719594
mw1221.eqiad.wmnet:
    0.718242
mw1202.eqiad.wmnet:
    0.71801
mw1233.eqiad.wmnet:
    0.718829
mw1205.eqiad.wmnet:
    0.720631
mw1191.eqiad.wmnet:
    0.714644
mw1196.eqiad.wmnet:
    0.720405
mw1203.eqiad.wmnet:
    0.717774
mw1190.eqiad.wmnet:
    0.715812
mw1232.eqiad.wmnet:
    0.720945
mw1197.eqiad.wmnet:
    0.71542
mw1235.eqiad.wmnet:
    0.718755
mw1231.eqiad.wmnet:
    0.716894
mw1193.eqiad.wmnet:
    0.715542
mw1200.eqiad.wmnet:
    0.717782
mw1206.eqiad.wmnet:
    0.721007
mw1230.eqiad.wmnet:
    0.718386
mw1189.eqiad.wmnet:
    0.717779
mw1194.eqiad.wmnet:
    0.718455
mw1207.eqiad.wmnet:
    0.715529
mw1279.eqiad.wmnet:
    0.715622
mw1198.eqiad.wmnet:
    0.718593
mw1276.eqiad.wmnet:
    0.712992
mw1277.eqiad.wmnet:
    0.71066
mw1281.eqiad.wmnet:
    0.71177
mw1280.eqiad.wmnet:
    0.71181
mw1287.eqiad.wmnet:
    0.712915
mw1290.eqiad.wmnet:
    0.715581
mw1286.eqiad.wmnet:
    0.71094
mw1289.eqiad.wmnet:
    0.712168
mw1284.eqiad.wmnet:
    0.709995
mw1278.eqiad.wmnet:
    0.712887
mw1288.eqiad.wmnet:
    0.711827
mw1282.eqiad.wmnet:
    0.710432
mw1283.eqiad.wmnet:
    0.713564
mw1285.eqiad.wmnet:
    0.713916

Parsoid accounts for 71% of the POST requests to the mw api, and this of course include requests that come from live users; I'll try to narrow the ones coming from changeProp further.

GWicke added a comment.EditedNov 29 2016, 3:13 PM

While rate-limiting on the Changeprop side is probably a good thing to tackle, getting every edge case right is probably very tricky.

To clarify, ChangeProp uses concurrency limiting rather than rate limiting, as this adapts better to the large spread of per-request costs & varying load. There are separate concurrency limits for first request vs. retries, as well as controls for the amount of retry back-off and number of retries.

Joe added a comment.Nov 29 2016, 3:43 PM

I depooled one API server and tried the request in P4516. Single requests don't seem to have any effect, but as soon as I raised the ab concurrency to a value of 10, I started seeing threads waiting on a lock in HPHP::JSON_parser as during the outage.

Seeing what follows

root@mw1276:~# hhvm-dump-debug | fgrep HPHP::JSON_parser | wc -l
Backtrace saved as /tmp/hhvm.18560.bt.
9

makes me wonder if there is some global lock in the json parse code that allows just one thread to parse json at the same time.

Joe added a comment.Nov 29 2016, 3:55 PM

As I raise the concurrency, more requests seem to be blocked in the same lock.

So, apart from fixing the clear issue we have in HHVM, we should focus on what seems to be a thundering herd effect of some sort:

in the long chain changeprop => restbase => parsoid something times out well before HHVM has responded, and then goes on to retry again and again, probably, causing in the end the api appserver to all grind to a halt because of the global lock (all threads are blocked while waiting the one owning the lock to have finished its job).

greg added a subscriber: greg.Nov 29 2016, 5:01 PM

(just housekeeping: this is now in the "post emergency" phase, right? If so, we should create the follow-ups as specific (sub)tasks now and we can close this one once the root-cause investigation is done)

Joe added a comment.EditedNov 29 2016, 5:04 PM

On mw1276, I did take a full backtrace (not with quickstack) of a running hhvm process while the problematic requests were running. It can be found under /root/T151702

From a quick look, most threads seem effectively blocked in a very simple function:

je_malloc_mutex_lock (mutex=0x7f2b30c004d8) at include/jemalloc/internal/mutex.h:77

and in most cases, what is happening is every thread is trying to free that pointer address:

thread 64, 48, 26, 21 are all trying to call free

#11 free (ptr=0x7f2b30c004d8) at src/jemalloc.c:1308

while thread 84 is running arena_run_tree_insert and that refers to a data structure that references that pointer, apparently.

My C++ is way too bad to be able to follow the jemalloc source code at this time of the day; if someone wants to continue investigating, mw1276 has the backtraces and is currently depooled from traffic for debug purposes. If you want to look at the jemalloc sources, please remember downloading the debian source package for jemalloc for jessie, as that's what we're using now.

tstarling changed the visibility from "Custom Policy" to "Custom Policy".Nov 29 2016, 10:33 PM
fgiunchedi changed the visibility from "Custom Policy" to "Public (No Login Required)".Nov 29 2016, 10:51 PM
fgiunchedi changed Security from Other confidential issue to None.
ssastry added a comment.EditedNov 29 2016, 11:14 PM

Thanks @tstarling and @fgiunchedi for fixing the visibility.

In the long chain changeprop => restbase => parsoid something times out well before HHVM has responded, and then goes on to retry again and again ...

@Joe do you have more info about this? We should figure out where these excessive retries originate.

tstarling added a comment.EditedNov 30 2016, 1:59 AM

From a quick look, most threads seem effectively blocked in a very simple function:

je_malloc_mutex_lock (mutex=0x7f2b30c004d8) at include/jemalloc/internal/mutex.h:77

and in most cases, what is happening is every thread is trying to free that pointer address:

That is a gdb bug, the argument to free() is wrong. The disassembly shows the argument to free() is passed in rdi, but gdb claims it is on the stack. As an argument to je_malloc_mutex_lock() it is more plausible since the disassembly confirms that the argument to that function is on the stack.

The threads are not all trying to free the same pointer, but they are all trying to lock the same arena. I previously believed the jemalloc man page, which says that the default number of arenas is 4 times the number of CPUs. However, HHVM overrides this (in alloc.cpp), setting the number of arenas to 1. This has a catastrophic impact on concurrent performance:

The data above was gathered using ab for the request in P4516. For concurrency 1 to 16, I used n=16. Then I dropped the narenas=1 case and gathered data for concurrency 16 to 64 for narenas=4 and 16, with n=64.

[UPDATE: added narenas=160, fixed a bug in the awk one-liner that I used to do averaging, exclude the highest sample since there seemed to be a warmup effect]

demon added a subscriber: demon.Nov 30 2016, 3:03 AM

Filed upstream bug https://github.com/facebook/hhvm/issues/7515 , but we're not blocked on it, we can use the MALLOC_CONF environment variable to change narenas, like I did for my testing above.

Change 324394 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki::hhvm: allow to override the default jemalloc arenas

https://gerrit.wikimedia.org/r/324394

Change 324394 merged by Giuseppe Lavagetto:
mediawiki::hhvm: allow to override the default jemalloc arenas

https://gerrit.wikimedia.org/r/324394

Mentioned in SAL (#wikimedia-operations) [2016-11-30T11:22:01Z] <_joe_> repooling mw1276, after tests for T151702

Change 324447 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: tweak jemalloc arenas on api, appserver canaries

https://gerrit.wikimedia.org/r/324447

Change 324447 merged by Giuseppe Lavagetto:
mediawiki: tweak jemalloc arenas on api, appserver canaries

https://gerrit.wikimedia.org/r/324447

Change 324462 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: make jemalloc arenas equal to the processorcount

https://gerrit.wikimedia.org/r/324462

Change 324462 merged by Giuseppe Lavagetto:
mediawiki: make jemalloc arenas equal to the processorcount

https://gerrit.wikimedia.org/r/324462

Joe added a comment.Nov 30 2016, 4:15 PM

I have set arenas for jemalloc to be equal to the number of processors seen by the OS, the bandaid fix should be in the process of being removed now.

Joe added a comment.Nov 30 2016, 4:20 PM

So, with the HHVM part "solved" we still should take the prevention measures I named here:

  • Check the concurrency/retry/timeout rates of all components
  • Separate clusters for async processing from the ones for serving the public requests
greg added a comment.Nov 30 2016, 5:16 PM

So, with the HHVM part "solved" we still should take the prevention measures I named here:

  • Check the concurrency/retry/timeout rates of all components
  • Separate clusters for async processing from the ones for serving the public requests

Add to/update https://wikitech.wikimedia.org/wiki/Incident_documentation/20161127-API and file sub-tasks and close this one (it sounds like what needs to be done to me).

Joe added a comment.Nov 30 2016, 5:29 PM

@greg yeah I know, I'll do my homework, promised :)

I'm just waiting to see if the issue happens again in the next couple of days before closing this.

Change 324196 abandoned by Reedy:
Expand ori's bandaid to only work on POST

Reason:
No longer required

https://gerrit.wikimedia.org/r/324196

Mentioned in SAL (#wikimedia-operations) [2016-12-01T12:49:47Z] <reedy@tin> Synchronized php-1.29.0-wmf.3/api.php: Remove oris bandaid T151702 (duration: 00m 46s)

Elitre added a subscriber: Elitre.Dec 1 2016, 5:22 PM
Joe added a comment.Dec 2 2016, 8:49 AM

So, this happened again this morning, and we have good and bad news:

  • Good news is the system, with a larger number of jemalloc arenas, took more time to go down
  • Bad news is Parsoid is absolutely hammering the API with such requests (some more data coming), and that is basically killing us; I'm unsure such rates of requests are manageable in any ways by us.

Change 324866 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki::hhvm: bump up the number of malloc arenas

https://gerrit.wikimedia.org/r/324866

Change 324866 merged by Giuseppe Lavagetto:
mediawiki::hhvm: bump up the number of malloc arenas

https://gerrit.wikimedia.org/r/324866

Joe added a comment.Dec 2 2016, 10:27 AM

Looking at api.log, I found that requests as follows:

  • to euwiki
  • action=parsoid-batch
  • batch-action=preprocess

have had an absurd peak this morning, specifically at 07:50 UTC I count 23737 such requests in one minute. Given some of those are very expensive, it's no surpise. For comparison, normally reqs/min are below 100.

Joe added a comment.Dec 2 2016, 10:51 AM

shows the rate of such requests

Joe added a comment.Dec 2 2016, 11:23 AM

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

More investigation ongoing.

Joe added a comment.Dec 2 2016, 12:19 PM

This brust in traffic is, looking at parsoid logs, due to reqId: 3ff62f51-cd11-4b44-98e4-6a6aa608b600 from ChangePropagation.

I am unsure how to check restbase's access logs.

mobrovac added a subscriber: mobrovac.EditedDec 2 2016, 12:34 PM

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

This is indeed the most likely offender. That page in transcluded in 10k+ pages on euwiki. Looking at the logs, all of the Parsoid errors as seen by RESTBase come from Change Propagation, which was cycling through the pages Modulu:Wikidata is transcluded in and forced HTML updates on RESTBase, which in turn asked Parsoid to render them.

Change 324895 had a related patch set uploaded (by Reedy):
Apply Ori's bandaid for API cluster fails

https://gerrit.wikimedia.org/r/324895

Change 324196 restored by Reedy:
Expand ori's bandaid to only work on POST

https://gerrit.wikimedia.org/r/324196

Change 324196 abandoned by Reedy:
Expand ori's bandaid to only work on POST

https://gerrit.wikimedia.org/r/324196

Reedy added a comment.Dec 2 2016, 1:13 PM

I've indefinitely protected (to sysop) https://eu.wikipedia.org/wiki/Modulu:Wikidata for now, and left a message at https://eu.wikipedia.org/wiki/Modulu_eztabaida:Wikidata to say that they refrain from editing it... And to use another page for development

Joe added a comment.Dec 2 2016, 2:57 PM

I removed the bandaid right now, hoping we didn't miss the origin of the issue.

I would still like the concurrency limit of Change Propagation to be reviewed and tested, as next time something like this happens I'll just turn off changeprop, instead of crippling the requests to the mediawiki API.

Mentioned in SAL (#wikimedia-operations) [2016-12-02T15:02:28Z] <_joe_> rolling restart of API appservers to catch up with the new jemalloc arenas config T151702

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

This is indeed the most likely offender. That page in transcluded in 10k+ pages on euwiki. Looking at the logs, all of the Parsoid errors as seen by RESTBase come from Change Propagation, which was cycling through the pages Modulu:Wikidata is transcluded in and forced HTML updates on RESTBase, which in turn asked Parsoid to render them.

@mobrovac, @GWicke, @Pchelolo: Can ChangeProp spread out requests from edits of a single template? As it is designed now, if that template happens to be expensive to process, then either Parsoid or MediaWiki or both get hammered. In this instance it was MediaWiki. But, I can see this happening to Parsoid instead. So, it is worth thinking about how to mix up the reparse requests, so the expensive ones are spread out over time with the inexpensive ones. Introduce a time lag before insertion into a queue, or add some randomization factor in how requests are picked up .... I am throwing out random crazy ideas without knowing anything about CP's internals.

Separate but related question: given that the PHP parser also needs to refresh the cache, and needs to preprocess all those same pages that RESTBase/Parsoid is updating, where are those requests ending up?

Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is tracked in T97192. This allows requests to consume large amounts of resources, and makes it impossible for clients to distinguish a timed-out request from a temporary issue affecting a single node. As a result, Parsoid ends up hitting its client timeout, and retries those requests once. At the same time, the original request is still consuming resources in MediaWiki.

Joe added a comment.Dec 2 2016, 3:29 PM

Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is tracked in T97192. This allows requests to consume large amounts of resources, and makes it impossible for clients to distinguish a timed-out request from a temporary issue affecting a single node. As a result, Parsoid ends up hitting its client timeout, and retries those requests once. At the same time, the original request is still consuming resources in MediaWiki.

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute, and I see those almost 1:1 between parsoid and mediawiki, so there is no crazy timeout/retry issue at play here, it's just a flood of requests, all coming in at roughly the same time.

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute

Earlier you said Parsoid did. Could you clarify?

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute

Earlier you said Parsoid did. Could you clarify?

Joe meant (AFAICT): CP -> RB -> Parsoid -> MW. So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

Joe added a comment.Dec 2 2016, 3:39 PM

Well, from the MediaWiki prespective, those requests come from parsoid.

From the parsoid prespective, they come from ChangePropagation via restbase.

I compared the number of requests incoming to parsoid at the minute of the flood of requests, and the one parsoid does to mediawiki: we get roughly the same number of preprocess requests to the MW api as the number of requests from changeprop to parsoid.

GWicke added a comment.EditedDec 2 2016, 3:55 PM

So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

I think you mean request rate.

In any case, in https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen, you see that request rates drop quite a bit when processing is slow (compare https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=9&fullscreen). This is concurrency limiting (constant concurrency) at work.

The concrete issue seems to be caused by especially expensive requests from Parsoid to the MW API, which are not aborted quickly, and are retried in Parsoid (as evidenced by the logs). This ties up resources in HHVM, causing outages. There are also a lot of cheap requests, in line with the normal processing rate of ~300/s (that's 18k requests if each of them only hits the api once). Those don't time out, and are not retried.

Re spreading out requests triggered by a specific template more: We did discuss increasing the mixing by interleaving backlink expansion with regular job processing before, and I do think this is something we should do. While not a silver bullet for the concrete issue, it is a good idea to increase the mixing in any case.

Edit: This is now tracked in T152229.

So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

I think you mean request rate.

Yes. :)

In any case, in https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen, you see that request rates drop quite a bit when processing is slow (compare https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=9&fullscreen). This is concurrency limiting (constant concurrency) at work.

Re spreading out requests triggered by a specific template more: We did discuss increasing the mixing by interleaving backlink expansion with regular job processing before, and I do think this is something we should do.

Yes, this would be good to do to prevent expensive parse requests from clustering together in time.

GWicke updated the task description. (Show Details)Dec 2 2016, 4:08 PM

Users are reporting problems with watchlist in #wikipedia-en

GWicke added a comment.EditedDec 2 2016, 4:13 PM

To illustrate using the RESTBase dashboard for the outage time frame:

Update request rates (from ChangeProp) to RESTBase:

Request rate from RESTBase to Parsoid during the outage:

Parsoid request latencies

Average request rate between 2am and 6am is 34/s, about 1/10 of the normal rate. Request rates decrease proportional to increases in response latency, reflecting the constant request concurrency limit enforced in ChangeProp.

@Paladox That is definitely a separate issue, not related.

Theklan added a subscriber: Theklan.EditedDec 3 2016, 10:53 PM

Hello,
I'm Theklan, one of the admins of eu:wp.
We have been making some chages in Modulu:Wikidata, in order to fully use Wikidata's power. This changes have been made using the ca:wp system, so yes, maybe there has been some issues in the midway.

Now we are developing an automated biography template, checking Wikidata for info.

About Template:Frantziako udalerri infotaula INSEE, it uses INSEE code to make some calculations, but this is not parsd from Wikidata. It could be maybe divided by departments, making a previous calculation... but it would be quite difficult now to divide all that information.

We have made changes to Template:Frantziako udalerri infotaula INSEE and now the time for loading each articles is considerably smaller. Please, see if the situation has changed and sorry about the problems!

I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile data showed in a preview. An edit on Module:Wikidata probably triggers an update of this expensive template transclusions, that is 34 600 pages, among other templates of course.

Some data is now fetched from Wikidata and other data has been splitted in subtemplates with shorter switches. Rendering time has been reduced around 1 second per page.

Is it posible to make any test? Save a null edit on eu:Module:Wikidata and track any worth noting effect on servers.

I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile data showed in a preview. An edit on Module:Wikidata probably triggers an update of this expensive template transclusions, that is 34 600 pages, among other templates of course.
Some data is now fetched from Wikidata and other data has been splitted in subtemplates with shorter switches. Rendering time has been reduced around 1 second per page.

Thank you @Vriullop for your effort! This is a significant improvement.

Is it posible to make any test? Save a null edit on eu:Module:Wikidata and track any worth noting effect on servers.

We are still investigating and mitigating the effects of the outage, so performing any changes on Modulu:Wikidata that would prompt all pages to be reparsed is not a good move at this point. Please refrain from doing so.

Change 324895 abandoned by Reedy:
Apply Ori's bandaid for API cluster fails

https://gerrit.wikimedia.org/r/324895

hashar removed a subscriber: hashar.Mar 22 2017, 9:56 AM

Looks like this one can be closed as well since the subtasks as well as parent tasks are resolved. Anything left to do here?

ssastry closed this task as Resolved.Apr 6 2018, 8:40 PM
ssastry claimed this task.

Feel free to re-open / create a new ticket with anything else left to be done here.