API cluster failure / OOM
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	fgiunchedi
	Nov 27 2016, 4:48 AM

Description

At around 2.00 UTC on 27/11/2016 the api appservers started using more and more memory and eventually throw enough errors and page at 2.27 UTC 02:27 -icinga-wm:#wikimedia-operations- PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds

I've captured some hhvm backtraces on e.g. mw1279 in /root and roll-restarted api_appcluster which *seems* to have helped a bit.

Taking a closer look there seemed to be failing requests for euwiki from parsoid (from apache access log), among other failing requests. The requests for euwiki seemed to be continuing failing even after the rolling restart.

2016-11-27T02:02:53     119996827       10.64.32.74     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.74
2016-11-27T02:02:55     119997898       10.64.32.89     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.89
2016-11-27T02:03:04     119992080       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:09     119996351       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:09     119991462       10.64.32.84     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.84
2016-11-27T02:03:16     119994184       10.64.32.90     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.90
2016-11-27T02:03:18     119990649       10.64.0.218     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.218
2016-11-27T02:03:19     119986385       10.64.0.215     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.0.215
2016-11-27T02:03:20     119999445       10.64.32.78     proxy-server/503        50065   GET     http://eu.wikipedia.org/w/api.php
       -       text/html       -       -       Parsoid/0.6.1   -       -       -       10.64.32.78

In logstash there were logs related to dbquery for euwiki, e.g. (shortlink https://logstash.wikimedia.org/goto/4b7ce91a52c4feef185a3075d31162d9)

Database::query: lost connection to 10.64.32.136; reconnected:
#0 /srv/mediawiki/php-1.29.0-wmf.3/includes/libs/rdbms/database/Database.php(1254): Database->query()
#1 /srv/mediawiki/php-1.29.0-wmf.3/includes/libs/rdbms/database/Database.php(1322): Database->select()
#2 /srv/mediawiki/php-1.29.0-wmf.3/includes/cache/LinkCache.php(296): Database->selectRow()
#3 /srv/mediawiki/php-1.29.0-wmf.3/includes/cache/LinkCache.php(267): LinkCache->fetchPageRow()
#4 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(3209): LinkCache->addLinkObj()
#5 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(934): Title->getArticleID()
#6 /srv/mediawiki/php-1.29.0-wmf.3/includes/Title.php(955): Title->getContentModel()
#7 /srv/mediawiki/php-1.29.0-wmf.3/extensions/Scribunto/common/Hooks.php(94): Title->hasContentModel()
#8 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ScribuntoHooks::invokeHook()
#9 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#10 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#11 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#12 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#13 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1448): PPFrame_Hash->expand()
#14 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3253): PPTemplateFrame_Hash->cachedExpand()
#15 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#16 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#17 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#18 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#19 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#20 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParserFunctions/ParserFunctions_body.php(104): PPFrame_Hash->expand()
#21 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ExtParserFunctions::ifObj()
#22 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#23 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#24 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1528): PPFrame_Hash->expand()
#25 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1540): PPTemplateFrame_Hash->getNamedArgument()
#26 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3759): PPTemplateFrame_Hash->getArgument()
#27 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1032): Parser->argSubstitution()
#28 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1528): PPFrame_Hash->expand()
#29 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1540): PPTemplateFrame_Hash->getNamedArgument()
#30 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3759): PPTemplateFrame_Hash->getArgument()
#31 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1032): Parser->argSubstitution()
#32 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3021): PPFrame_Hash->expand()
#33 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#34 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParserFunctions/ParserFunctions_body.php(106): PPFrame_Hash->expand()
#35 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3379): ExtParserFunctions::ifObj()
#36 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3103): Parser->callParserFunction()
#37 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#38 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#39 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#40 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#41 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#42 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(3256): PPFrame_Hash->expand()
#43 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Preprocessor_Hash.php(1015): Parser->braceSubstitution()
#44 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(2917): PPFrame_Hash->expand()
#45 /srv/mediawiki/php-1.29.0-wmf.3/includes/parser/Parser.php(654): Parser->replaceVariables()
#46 /srv/mediawiki/php-1.29.0-wmf.3/includes/StubObject.php(113): Parser->preprocess()
#47 /srv/mediawiki/php-1.29.0-wmf.3/includes/StubObject.php(139): StubObject->_call()
#48 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(174): StubObject->__call()
#49 /srv/mediawiki/php-1.29.0-wmf.3/extensions/ParsoidBatchAPI/includes/ApiParsoidBatch.php(82): ApiParsoidBatch->preprocess()
#50 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(1434): ApiParsoidBatch->execute()
#51 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(510): ApiMain->executeAction()
#52 /srv/mediawiki/php-1.29.0-wmf.3/includes/api/ApiMain.php(481): ApiMain->executeActionWithErrorHandling()
#53 /srv/mediawiki/php-1.29.0-wmf.3/api.php(83): ApiMain->execute()
#54 /srv/mediawiki/w/api.php(3): include()
#55 {main}

Details

Subject	Repo	Branch	Lines +/-
Apply Ori's bandaid for API cluster fails	mediawiki/core	wmf/1.29.0-wmf.4	+15 -0
Expand ori's bandaid to only work on POST	mediawiki/core	wmf/1.29.0-wmf.3	+4 -0
Apply Ori's bandaid for API cluster fails	mediawiki/core	wmf/1.29.0-wmf.3	+11 -0
mediawiki::hhvm: bump up the number of malloc arenas	operations/puppet	production	+5 -3
mediawiki: make jemalloc arenas equal to the processorcount	operations/puppet	production	+1 -4
mediawiki: tweak jemalloc arenas on api, appserver canaries	operations/puppet	production	+2 -0
mediawiki::hhvm: allow to override the default jemalloc arenas	operations/puppet	production	+10 -1
hhvm: fix hhvm-needs-restart logic for memory	operations/puppet	production	+3 -1

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	• demon	T150465 MW-1.29.0-wmf.4 deployment blockers
Resolved	• ssastry	T151702 API cluster failure / OOM
Resolved	• Pchelolo	T152073 Check concurrency/retry/timeout limits and syncronize those between services
Resolved	Joe	T152074 Separate clusters for asynchronous processing from the ones for public consumption
Resolved	Joe	T153042 Enable TLS termination on the MediaWiki clusters

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

So, to reinforce my argument, this is my extraction of ratio of requests to parsoid from changeprop:

sudo salt 'wtp1*' cmd.run "awk 'BEGIN {total=0; changeprop=0} /started parsing/ {total++; if (/ChangePropagation/) { changeprop++;}} END {print changeprop/total}' /srv/log/parsoid/main.log" 
wtp1005.eqiad.wmnet:
    0.846491
wtp1021.eqiad.wmnet:
    0.846637
wtp1014.eqiad.wmnet:
    0.847237
wtp1022.eqiad.wmnet:
    0.845636
wtp1013.eqiad.wmnet:
    0.84699
wtp1019.eqiad.wmnet:
    0.848367
wtp1017.eqiad.wmnet:
    0.846816
wtp1010.eqiad.wmnet:
    0.849078
wtp1020.eqiad.wmnet:
    0.845065
wtp1016.eqiad.wmnet:
    0.846363
wtp1004.eqiad.wmnet:
    0.846454
wtp1009.eqiad.wmnet:
    0.844574
wtp1006.eqiad.wmnet:
    0.845354
wtp1023.eqiad.wmnet:
    0.845964
wtp1024.eqiad.wmnet:
    0.847731
wtp1003.eqiad.wmnet:
    0.848589
wtp1001.eqiad.wmnet:
    0.848655
wtp1015.eqiad.wmnet:
    0.849198
wtp1011.eqiad.wmnet:
    0.846692
wtp1008.eqiad.wmnet:
    0.846391
wtp1007.eqiad.wmnet:
    0.844595
wtp1002.eqiad.wmnet:
    0.848969
wtp1018.eqiad.wmnet:
    0.846035
wtp1012.eqiad.wmnet:
    0.845007

a staggering 84.5% of rendering requests to parsoid seem to come from changepropagation.

For the MediaWiki API, the number is way lower in general:

sudo salt -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run " awk 'BEGIN {total=0; changeprop=0} {total++; if (/ChangePropagation/) { changeprop++;}} END {print changeprop/total}' /var/log/apache2/other_vhosts_access.log.1"
mw1227.eqiad.wmnet:
    0.0363048
mw1223.eqiad.wmnet:
    0.0369746
mw1228.eqiad.wmnet:
    0.036771
mw1231.eqiad.wmnet:
    0.036394
mw1224.eqiad.wmnet:
    0.0365077
mw1234.eqiad.wmnet:
    0.0367172
mw1226.eqiad.wmnet:
    0.0366096
mw1233.eqiad.wmnet:
    0.0365922
mw1222.eqiad.wmnet:
    0.0364166
mw1235.eqiad.wmnet:
    0.0368284
mw1232.eqiad.wmnet:
    0.0364804
mw1230.eqiad.wmnet:
    0.0362725
mw1229.eqiad.wmnet:
    0.0363956
mw1221.eqiad.wmnet:
    0.0367211
mw1196.eqiad.wmnet:
    0.0363961
mw1204.eqiad.wmnet:
    0.0357157
mw1193.eqiad.wmnet:
    0.0358446
mw1207.eqiad.wmnet:
    0.0356797
mw1208.eqiad.wmnet:
    0.0358427
mw1191.eqiad.wmnet:
    0.0360581
mw1195.eqiad.wmnet:
    0.0359184
mw1203.eqiad.wmnet:
    0.0360786
mw1189.eqiad.wmnet:
    0.0364143
mw1192.eqiad.wmnet:
    0.0364371
mw1201.eqiad.wmnet:
    0.0355747
mw1206.eqiad.wmnet:
    0.0357697
mw1198.eqiad.wmnet:
    0.0357252
mw1225.eqiad.wmnet:
    0.0363874
mw1200.eqiad.wmnet:
    0.0362528
mw1197.eqiad.wmnet:
    0.0359116
mw1190.eqiad.wmnet:
    0.0358727
mw1194.eqiad.wmnet:
    0.0357255
mw1202.eqiad.wmnet:
    0.035878
mw1199.eqiad.wmnet:
    0.0357621
mw1205.eqiad.wmnet:
    0.03577
mw1279.eqiad.wmnet:
    0.0350745
mw1282.eqiad.wmnet:
    0.0352047
mw1276.eqiad.wmnet:
    0.035097
mw1280.eqiad.wmnet:
    0.0350594
mw1277.eqiad.wmnet:
    0.0352321
mw1286.eqiad.wmnet:
    0.0350941
mw1290.eqiad.wmnet:
    0.0347753
mw1284.eqiad.wmnet:
    0.0347714
mw1283.eqiad.wmnet:
    0.0348901
mw1287.eqiad.wmnet:
    0.0348356
mw1278.eqiad.wmnet:
    0.034635
mw1289.eqiad.wmnet:
    0.034799
mw1285.eqiad.wmnet:
    0.035124
mw1288.eqiad.wmnet:
    0.0344889
mw1281.eqiad.wmnet:
    0.0350896

mark subscribed.Nov 29 2016, 11:57 AM

In T151702#2828755, @fgiunchedi wrote:

Reporting the patch here too to avoid relying on external paste

diff --git a/api.php b/api.php
index 6e75fb7..f40e235 100644
--- a/api.php
+++ b/api.php
@@ -32,6 +32,15 @@
 
 use MediaWiki\Logger\LegacyLogger;
 
+if (
+       ( isset( $_SERVER['HTTP_USER_AGENT'] ) && strpos( $_SERVER['HTTP_USER_AGENT'], 'Parsoid' ) !== false ) &&
+       ( isset( $_SERVER['HTTP_HOST'] ) && $_SERVER['HTTP_HOST'] === 'eu.wikipedia.org' )
+
+) {
+       header( $_SERVER['SERVER_PROTOCOL'] . ' 500 T151702 Bandaid', true, 500 );
+       die( 1 );
+}
+
 // So extensions (and other code) can check whether they're running in API mode
 define( 'MW_API', true );

Turned into a commit at https://gerrit.wikimedia.org/r/#/c/324182/1

It should be noted that we had this issue again, between approx 12:35 UTC and 12:47 UTC today, but the problem has had public-facing effect mostly in the last 3 minutes of the incident. @Reedy has added @ori's patch back and that, toghether with a rolling restart of HHVM, cleared the issue.

It seems this is again due to euwiki, and the stacktraces were similar to what collected earlier

Reedy mentioned this in T150465: MW-1.29.0-wmf.4 deployment blockers.Nov 29 2016, 2:38 PM

So, the number of requests that ChangePropagation does to the API is small in itself; but most requests are from parsoid, and they do use the Parsoid/X.Y.Z UA.

Counting those gives pretty different numbers, in particular if we only count POSTs:

sudo salt -C 'G@cluster:api_appserver and G@site:eqiad' cmd.run "fgrep POST /var/log/apache2/other_vhosts_access.log | awk 'BEGIN {total=0; changeprop=0} {total++; if (/Parsoid/) { changeprop++;}} END {print changeprop/total}'"
mw1225.eqiad.wmnet:
    0.718383
mw1223.eqiad.wmnet:
    0.720376
mw1227.eqiad.wmnet:
    0.718998
mw1229.eqiad.wmnet:
    0.718939
mw1222.eqiad.wmnet:
    0.718477
mw1204.eqiad.wmnet:
    0.720225
mw1195.eqiad.wmnet:
    0.720146
mw1224.eqiad.wmnet:
    0.720202
mw1208.eqiad.wmnet:
    0.717703
mw1234.eqiad.wmnet:
    0.71629
mw1199.eqiad.wmnet:
    0.717074
mw1226.eqiad.wmnet:
    0.719125
mw1201.eqiad.wmnet:
    0.716597
mw1228.eqiad.wmnet:
    0.72082
mw1192.eqiad.wmnet:
    0.719594
mw1221.eqiad.wmnet:
    0.718242
mw1202.eqiad.wmnet:
    0.71801
mw1233.eqiad.wmnet:
    0.718829
mw1205.eqiad.wmnet:
    0.720631
mw1191.eqiad.wmnet:
    0.714644
mw1196.eqiad.wmnet:
    0.720405
mw1203.eqiad.wmnet:
    0.717774
mw1190.eqiad.wmnet:
    0.715812
mw1232.eqiad.wmnet:
    0.720945
mw1197.eqiad.wmnet:
    0.71542
mw1235.eqiad.wmnet:
    0.718755
mw1231.eqiad.wmnet:
    0.716894
mw1193.eqiad.wmnet:
    0.715542
mw1200.eqiad.wmnet:
    0.717782
mw1206.eqiad.wmnet:
    0.721007
mw1230.eqiad.wmnet:
    0.718386
mw1189.eqiad.wmnet:
    0.717779
mw1194.eqiad.wmnet:
    0.718455
mw1207.eqiad.wmnet:
    0.715529
mw1279.eqiad.wmnet:
    0.715622
mw1198.eqiad.wmnet:
    0.718593
mw1276.eqiad.wmnet:
    0.712992
mw1277.eqiad.wmnet:
    0.71066
mw1281.eqiad.wmnet:
    0.71177
mw1280.eqiad.wmnet:
    0.71181
mw1287.eqiad.wmnet:
    0.712915
mw1290.eqiad.wmnet:
    0.715581
mw1286.eqiad.wmnet:
    0.71094
mw1289.eqiad.wmnet:
    0.712168
mw1284.eqiad.wmnet:
    0.709995
mw1278.eqiad.wmnet:
    0.712887
mw1288.eqiad.wmnet:
    0.711827
mw1282.eqiad.wmnet:
    0.710432
mw1283.eqiad.wmnet:
    0.713564
mw1285.eqiad.wmnet:
    0.713916

Parsoid accounts for 71% of the POST requests to the mw api, and this of course include requests that come from live users; I'll try to narrow the ones coming from changeProp further.

While rate-limiting on the Changeprop side is probably a good thing to tackle, getting every edge case right is probably very tricky.

To clarify, ChangeProp uses concurrency limiting rather than rate limiting, as this adapts better to the large spread of per-request costs & varying load. There are separate concurrency limits for first request vs. retries, as well as controls for the amount of retry back-off and number of retries.

I depooled one API server and tried the request in P4516. Single requests don't seem to have any effect, but as soon as I raised the ab concurrency to a value of 10, I started seeing threads waiting on a lock in HPHP::JSON_parser as during the outage.

Seeing what follows

root@mw1276:~# hhvm-dump-debug | fgrep HPHP::JSON_parser | wc -l
Backtrace saved as /tmp/hhvm.18560.bt.
9

makes me wonder if there is some global lock in the json parse code that allows just one thread to parse json at the same time.

As I raise the concurrency, more requests seem to be blocked in the same lock.

So, apart from fixing the clear issue we have in HHVM, we should focus on what seems to be a thundering herd effect of some sort:

in the long chain changeprop => restbase => parsoid something times out well before HHVM has responded, and then goes on to retry again and again, probably, causing in the end the api appserver to all grind to a halt because of the global lock (all threads are blocked while waiting the one owning the lock to have finished its job).

(just housekeeping: this is now in the "post emergency" phase, right? If so, we should create the follow-ups as specific (sub)tasks now and we can close this one once the root-cause investigation is done)

On mw1276, I did take a full backtrace (not with quickstack) of a running hhvm process while the problematic requests were running. It can be found under /root/T151702

From a quick look, most threads seem effectively blocked in a very simple function:

je_malloc_mutex_lock (mutex=0x7f2b30c004d8) at include/jemalloc/internal/mutex.h:77

and in most cases, what is happening is every thread is trying to free that pointer address:

thread 64, 48, 26, 21 are all trying to call free

#11 free (ptr=0x7f2b30c004d8) at src/jemalloc.c:1308

while thread 84 is running arena_run_tree_insert and that refers to a data structure that references that pointer, apparently.

My C++ is way too bad to be able to follow the jemalloc source code at this time of the day; if someone wants to continue investigating, mw1276 has the backtraces and is currently depooled from traffic for debug purposes. If you want to look at the jemalloc sources, please remember downloading the debian source package for jemalloc for jessie, as that's what we're using now.

Jdforrester-WMF subscribed.Nov 29 2016, 5:38 PM

• demon added a parent task: T150465: MW-1.29.0-wmf.4 deployment blockers.Nov 29 2016, 6:16 PM

• GWicke mentioned this in T151934: Check & possibly improve request id handling in changeprop.Nov 29 2016, 8:31 PM

tstarling changed the visibility from "Custom Policy" to "Custom Policy".Nov 29 2016, 10:33 PM

fgiunchedi changed the visibility from "Custom Policy" to "Public (No Login Required)".Nov 29 2016, 10:51 PM

fgiunchedi changed Security from Other confidential issue to None.

Thanks @tstarling and @fgiunchedi for fixing the visibility.

In T151702#2831201, @Joe wrote:

In the long chain changeprop => restbase => parsoid something times out well before HHVM has responded, and then goes on to retry again and again ...

@Joe do you have more info about this? We should figure out where these excessive retries originate.

Paladox subscribed.Nov 30 2016, 12:14 AM

In T151702#2831448, @Joe wrote:
From a quick look, most threads seem effectively blocked in a very simple function:
je_malloc_mutex_lock (mutex=0x7f2b30c004d8) at include/jemalloc/internal/mutex.h:77
and in most cases, what is happening is every thread is trying to free that pointer address:

That is a gdb bug, the argument to free() is wrong. The disassembly shows the argument to free() is passed in rdi, but gdb claims it is on the stack. As an argument to je_malloc_mutex_lock() it is more plausible since the disassembly confirms that the argument to that function is on the stack.

The threads are not all trying to free the same pointer, but they are all trying to lock the same arena. I previously believed the jemalloc man page, which says that the default number of arenas is 4 times the number of CPUs. However, HHVM overrides this (in alloc.cpp), setting the number of arenas to 1. This has a catastrophic impact on concurrent performance:

hhvm narenas tuning.png (767×1 px, 52 KB)

The data above was gathered using ab for the request in P4516. For concurrency 1 to 16, I used n=16. Then I dropped the narenas=1 case and gathered data for concurrency 16 to 64 for narenas=4 and 16, with n=64.

[UPDATE: added narenas=160, fixed a bug in the awk one-liner that I used to do averaging, exclude the highest sample since there seemed to be a warmup effect]

• demon subscribed.Nov 30 2016, 3:03 AM

Legoktm subscribed.Nov 30 2016, 3:32 AM

Filed upstream bug https://github.com/facebook/hhvm/issues/7515 , but we're not blocked on it, we can use the MALLOC_CONF environment variable to change narenas, like I did for my testing above.

• MZMcBride subscribed.Nov 30 2016, 7:44 AM

Change 324394 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki::hhvm: allow to override the default jemalloc arenas

https://gerrit.wikimedia.org/r/324394

gerritbot added a project: Patch-For-Review.Nov 30 2016, 10:15 AM

Change 324394 merged by Giuseppe Lavagetto:
mediawiki::hhvm: allow to override the default jemalloc arenas

https://gerrit.wikimedia.org/r/324394

Mentioned in SAL (#wikimedia-operations) [2016-11-30T11:22:01Z] <_joe_> repooling mw1276, after tests for T151702

Change 324447 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: tweak jemalloc arenas on api, appserver canaries

https://gerrit.wikimedia.org/r/324447

Change 324447 merged by Giuseppe Lavagetto:
mediawiki: tweak jemalloc arenas on api, appserver canaries

https://gerrit.wikimedia.org/r/324447

Change 324462 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: make jemalloc arenas equal to the processorcount

https://gerrit.wikimedia.org/r/324462

Change 324462 merged by Giuseppe Lavagetto:
mediawiki: make jemalloc arenas equal to the processorcount

https://gerrit.wikimedia.org/r/324462

I have set arenas for jemalloc to be equal to the number of processors seen by the OS, the bandaid fix should be in the process of being removed now.

So, with the HHVM part "solved" we still should take the prevention measures I named here:

Check the concurrency/retry/timeout rates of all components
Separate clusters for async processing from the ones for serving the public requests

elukey awarded a token.Nov 30 2016, 4:22 PM

In T151702#2835255, @Joe wrote:

So, with the HHVM part "solved" we still should take the prevention measures I named here:

Check the concurrency/retry/timeout rates of all components

Separate clusters for async processing from the ones for serving the public requests

Add to/update https://wikitech.wikimedia.org/wiki/Incident_documentation/20161127-API and file sub-tasks and close this one (it sounds like what needs to be done to me).

@greg yeah I know, I'll do my homework, promised :)

I'm just waiting to see if the issue happens again in the next couple of days before closing this.

Joe created subtask T152073: Check concurrency/retry/timeout limits and syncronize those between services.Dec 1 2016, 6:49 AM

Joe created subtask T152074: Separate clusters for asynchronous processing from the ones for public consumption.Dec 1 2016, 6:56 AM

Change 324196 abandoned by Reedy:
Expand ori's bandaid to only work on POST

Reason:
No longer required

https://gerrit.wikimedia.org/r/324196

Mentioned in SAL (#wikimedia-operations) [2016-12-01T12:49:47Z] <reedy@tin> Synchronized php-1.29.0-wmf.3/api.php: Remove oris bandaid T151702 (duration: 00m 46s)

• Elitre subscribed.Dec 1 2016, 5:22 PM

So, this happened again this morning, and we have good and bad news:

Good news is the system, with a larger number of jemalloc arenas, took more time to go down
Bad news is Parsoid is absolutely hammering the API with such requests (some more data coming), and that is basically killing us; I'm unsure such rates of requests are manageable in any ways by us.

Change 324866 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki::hhvm: bump up the number of malloc arenas

https://gerrit.wikimedia.org/r/324866

Change 324866 merged by Giuseppe Lavagetto:
mediawiki::hhvm: bump up the number of malloc arenas

https://gerrit.wikimedia.org/r/324866

Looking at api.log, I found that requests as follows:

to euwiki
action=parsoid-batch
batch-action=preprocess

have had an absurd peak this morning, specifically at 07:50 UTC I count 23737 such requests in one minute. Given some of those are very expensive, it's no surpise. For comparison, normally reqs/min are below 100.

shows the rate of such requests

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

More investigation ongoing.

This brust in traffic is, looking at parsoid logs, due to reqId: 3ff62f51-cd11-4b44-98e4-6a6aa608b600 from ChangePropagation.

I am unsure how to check restbase's access logs.

In T151702#2841250, @Joe wrote:

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

This is indeed the most likely offender. That page in transcluded in 10k+ pages on euwiki. Looking at the logs, all of the Parsoid errors as seen by RESTBase come from Change Propagation, which was cycling through the pages Modulu:Wikidata is transcluded in and forced HTML updates on RESTBase, which in turn asked Parsoid to render them.

Change 324895 had a related patch set uploaded (by Reedy):
Apply Ori's bandaid for API cluster fails

https://gerrit.wikimedia.org/r/324895

Change 324196 restored by Reedy:
Expand ori's bandaid to only work on POST

https://gerrit.wikimedia.org/r/324196

Change 324196 abandoned by Reedy:
Expand ori's bandaid to only work on POST

https://gerrit.wikimedia.org/r/324196

I've indefinitely protected (to sysop) https://eu.wikipedia.org/wiki/Modulu:Wikidata for now, and left a message at https://eu.wikipedia.org/wiki/Modulu_eztabaida:Wikidata to say that they refrain from editing it... And to use another page for development

I removed the bandaid right now, hoping we didn't miss the origin of the issue.

I would still like the concurrency limit of Change Propagation to be reviewed and tested, as next time something like this happens I'll just turn off changeprop, instead of crippling the requests to the mediawiki API.

Mentioned in SAL (#wikimedia-operations) [2016-12-02T15:02:28Z] <_joe_> rolling restart of API appservers to catch up with the new jemalloc arenas config T151702

In T151702#2841428, @mobrovac wrote:

In T151702#2841250, @Joe wrote:

There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's related, but it is included in all the preprocessed pages I looked up until now.

This is indeed the most likely offender. That page in transcluded in 10k+ pages on euwiki. Looking at the logs, all of the Parsoid errors as seen by RESTBase come from Change Propagation, which was cycling through the pages Modulu:Wikidata is transcluded in and forced HTML updates on RESTBase, which in turn asked Parsoid to render them.

@mobrovac, @GWicke, @Pchelolo: Can ChangeProp spread out requests from edits of a single template? As it is designed now, if that template happens to be expensive to process, then either Parsoid or MediaWiki or both get hammered. In this instance it was MediaWiki. But, I can see this happening to Parsoid instead. So, it is worth thinking about how to mix up the reparse requests, so the expensive ones are spread out over time with the inexpensive ones. Introduce a time lag before insertion into a queue, or add some randomization factor in how requests are picked up .... I am throwing out random crazy ideas without knowing anything about CP's internals.

Separate but related question: given that the PHP parser also needs to refresh the cache, and needs to preprocess all those same pages that RESTBase/Parsoid is updating, where are those requests ending up?

matmarex subscribed.Dec 2 2016, 3:10 PM

Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is tracked in T97192. This allows requests to consume large amounts of resources, and makes it impossible for clients to distinguish a timed-out request from a temporary issue affecting a single node. As a result, Parsoid ends up hitting its client timeout, and retries those requests once. At the same time, the original request is still consuming resources in MediaWiki.

In T151702#2841758, @GWicke wrote:

Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is tracked in T97192. This allows requests to consume large amounts of resources, and makes it impossible for clients to distinguish a timed-out request from a temporary issue affecting a single node. As a result, Parsoid ends up hitting its client timeout, and retries those requests once. At the same time, the original request is still consuming resources in MediaWiki.

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute, and I see those almost 1:1 between parsoid and mediawiki, so there is no crazy timeout/retry issue at play here, it's just a flood of requests, all coming in at roughly the same time.

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute

Earlier you said Parsoid did. Could you clarify?

In T151702#2841774, @GWicke wrote:

Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute

Earlier you said Parsoid did. Could you clarify?

Joe meant (AFAICT): CP -> RB -> Parsoid -> MW. So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

Well, from the MediaWiki prespective, those requests come from parsoid.

From the parsoid prespective, they come from ChangePropagation via restbase.

I compared the number of requests incoming to parsoid at the minute of the flood of requests, and the one parsoid does to mediawiki: we get roughly the same number of preprocess requests to the MW api as the number of requests from changeprop to parsoid.

So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

I think you mean request rate.

In any case, in https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen, you see that request rates drop quite a bit when processing is slow (compare https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=9&fullscreen). This is concurrency limiting (constant concurrency) at work.

The concrete issue seems to be caused by especially expensive requests from Parsoid to the MW API, which are not aborted quickly, and are retried in Parsoid (as evidenced by the logs). This ties up resources in HHVM, causing outages. There are also a lot of cheap requests, in line with the normal processing rate of ~300/s (that's 18k requests if each of them only hits the api once). Those don't time out, and are not retried.

Re spreading out requests triggered by a specific template more: We did discuss increasing the mixing by interleaving backlink expansion with regular job processing before, and I do think this is something we should do. While not a silver bullet for the concrete issue, it is a good idea to increase the mixing in any case.

Edit: This is now tracked in T152229.

In T151702#2841818, @GWicke wrote:

So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requests.

I think you mean request rate.

Yes. :)

In any case, in https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=8&fullscreen, you see that request rates drop quite a bit when processing is slow (compare https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=9&fullscreen). This is concurrency limiting (constant concurrency) at work.

Re spreading out requests triggered by a specific template more: We did discuss increasing the mixing by interleaving backlink expansion with regular job processing before, and I do think this is something we should do.

Yes, this would be good to do to prevent expensive parse requests from clustering together in time.

• GWicke updated the task description. (Show Details)Dec 2 2016, 4:08 PM

Users are reporting problems with watchlist in #wikipedia-en

Users have also reported it on-wiki see https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Several_technical_problems please.

To illustrate using the RESTBase dashboard for the outage time frame:

Update request rates (from ChangeProp) to RESTBase:

Request rate from RESTBase to Parsoid during the outage:

Parsoid request latencies

Average request rate between 2am and 6am is 34/s, about 1/10 of the normal rate. Request rates decrease proportional to increases in response latency, reflecting the constant request concurrency limit enforced in ChangeProp.

@Paladox That is definitely a separate issue, not related.

Hello,
I'm Theklan, one of the admins of eu:wp.
We have been making some chages in Modulu:Wikidata, in order to fully use Wikidata's power. This changes have been made using the ca:wp system, so yes, maybe there has been some issues in the midway.

Now we are developing an automated biography template, checking Wikidata for info.

About Template:Frantziako udalerri infotaula INSEE, it uses INSEE code to make some calculations, but this is not parsd from Wikidata. It could be maybe divided by departments, making a previous calculation... but it would be quite difficult now to divide all that information.

Joe mentioned this in T152074: Separate clusters for asynchronous processing from the ones for public consumption.Dec 6 2016, 7:48 AM

We have made changes to Template:Frantziako udalerri infotaula INSEE and now the time for loading each articles is considerably smaller. Please, see if the situation has changed and sorry about the problems!

The Wiki diff

I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile data showed in a preview. An edit on Module:Wikidata probably triggers an update of this expensive template transclusions, that is 34 600 pages, among other templates of course.

Some data is now fetched from Wikidata and other data has been splitted in subtemplates with shorter switches. Rendering time has been reduced around 1 second per page.

Is it posible to make any test? Save a null edit on eu:Module:Wikidata and track any worth noting effect on servers.

In T151702#2859748, @Vriullop wrote:

I try to expand it from the wiki part. A page with this template transclusion was rendered in 5-8 seconds according with parser profile data showed in a preview. An edit on Module:Wikidata probably triggers an update of this expensive template transclusions, that is 34 600 pages, among other templates of course.

Some data is now fetched from Wikidata and other data has been splitted in subtemplates with shorter switches. Rendering time has been reduced around 1 second per page.

Thank you @Vriullop for your effort! This is a significant improvement.

Is it posible to make any test? Save a null edit on eu:Module:Wikidata and track any worth noting effect on servers.

We are still investigating and mitigating the effects of the outage, so performing any changes on Modulu:Wikidata that would prompt all pages to be reparsed is not a good move at this point. Please refrain from doing so.

• GWicke mentioned this in T97192: HHVM request timeouts not working; support lowering the API request timeout per request.Dec 9 2016, 6:29 PM

Change 324895 abandoned by Reedy:
Apply Ori's bandaid for API cluster fails

https://gerrit.wikimedia.org/r/324895

greg moved this task from Active Situation to Follow-up prevention on the Wikimedia-Incident board.Jan 5 2017, 5:32 PM

Joe closed subtask T152074: Separate clusters for asynchronous processing from the ones for public consumption as Resolved.Jan 22 2017, 10:37 AM

greg mentioned this in T148652: 2016-10-17 API cluster overload.Feb 13 2017, 10:22 PM

Krinkle removed projects: Patch-For-Review, WMF-NDA.Mar 22 2017, 1:16 AM

hashar unsubscribed.Mar 22 2017, 9:56 AM

greg edited projects, added Release-Engineering-Team (Watching / External); removed Release-Engineering-Team.May 16 2017, 3:35 PM

• Pchelolo closed subtask T152073: Check concurrency/retry/timeout limits and syncronize those between services as Resolved.Apr 6 2018, 7:53 PM

Looks like this one can be closed as well since the subtasks as well as parent tasks are resolved. Anything left to do here?

Feel free to re-open / create a new ticket with anything else left to be done here.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident.Apr 28 2020, 9:50 PM

	F4941034: pasted_file
	Dec 2 2016, 4:15 PM

	F4941029: pasted_file
	Dec 2 2016, 4:14 PM

	F4941018: pasted_file
	Dec 2 2016, 4:13 PM

	F4941013: pasted_file
	Dec 2 2016, 4:13 PM

API cluster failure / OOMClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

API cluster failure / OOM
Closed, ResolvedPublic
Actions

Related Objects
Search...