
HHVM: segfault when serializing/unserializing large preprocessor cache items
Closed, ResolvedPublic

Description

A lot of long or complicated pages return HTTP status code 503:

namespace / pagetitle / page_length
(wikitext)
: 2 [https://de.wikipedia.org/w/index.php?curid=8360886 Doc_Taxon/who1] 3179409
: 3 [https://de.wikipedia.org/w/index.php?curid=3593098 Hubertl/Archiv] 2047262
: 2 [https://de.wikipedia.org/w/index.php?curid=8360909 Doc_Taxon/Doctor_Who/vh2] 2000639
: 3 [https://de.wikipedia.org/w/index.php?curid=2976487 TUBS/Archiv] 1909020
: 5 [https://de.wikipedia.org/w/index.php?curid=6369360 WikiProjekt_Österreichische_Denkmallisten/Archiv] 1801304
: 2 [https://de.wikipedia.org/w/index.php?curid=5309816 Armin_P._/_ältere_Beiträge] 1620265
: 3 [https://de.wikipedia.org/w/index.php?curid=3539955 Kriddl/Archiv] 1595476
: 3 [https://de.wikipedia.org/w/index.php?curid=7009571 CherryX/Archiv] 1587846
: 2 [https://de.wikipedia.org/w/index.php?curid=7936938 Sebbot/Log/2013-10-21/07:00] 1573086
: 3 [https://de.wikipedia.org/w/index.php?curid=3839871 Dorado/Archiv] 1551258
: 3 [https://de.wikipedia.org/w/index.php?curid=1703543 Schmelzle/Archiv] 1463051
: 4 [https://de.wikipedia.org/w/index.php?curid=6674445 Relevanzcheck/Archiv/2012] 1408176
: 2 [https://de.wikipedia.org/w/index.php?curid=5531914 B3t/Interwikis] 1358623
: 2 [https://de.wikipedia.org/w/index.php?curid=2854727 Stefan64/Archiv] 1347561
: 3 [https://de.wikipedia.org/w/index.php?curid=6026556 Polarlys/Archiv6] 1243238
: 3 [https://de.wikipedia.org/w/index.php?curid=4551146 Freimut_Bahlo/Archiv] 1206084
: 3 [https://de.wikipedia.org/w/index.php?curid=7468090 Hephaion/Archiv/2013] 1202453
: 4 [https://de.wikipedia.org/w/index.php?curid=1835525 Vandalismusmeldung/Archiv/2006/10] 1190127
: 3 [https://de.wikipedia.org/w/index.php?curid=6153460 Nicola/Archiv/2011] 1112779
: 3 [https://de.wikipedia.org/w/index.php?curid=5962185 Merlissimo/Archiv/2011] 1061895
: 4 [https://de.wikipedia.org/w/index.php?curid=1755523 Vandalismusmeldung/Archiv/2006/09] 1017207
: 4 [https://de.wikipedia.org/w/index.php?curid=3110054 Auskunft/Archiv/2007/Dez] 1012768
: 5 [https://de.wikipedia.org/w/index.php?curid=5899054 Hauptseite/Artikel_des_Tages/Archiv/Vorschläge/2011/1] 1009062
: 3 [https://de.wikipedia.org/w/index.php?curid=4177733 Umweltschützen/Archiv] 1000518
: 3 [https://de.wikipedia.org/w/index.php?curid=3703063 Merlissimo] 991284
: 4 [https://de.wikipedia.org/w/index.php?curid=1590732 Vandalismusmeldung/Archiv/2006/07] 975081
: 3 [https://de.wikipedia.org/w/index.php?curid=5622355 Umweltschützen/Archiv_2] 899335
: 101 [https://de.wikipedia.org/w/index.php?curid=7901814 Auto_und_Motorrad/Archiv/2013] 887866
: 3 [https://de.wikipedia.org/w/index.php?curid=2525289 Usquam/Archiv] 875821
: 4 [https://de.wikipedia.org/w/index.php?curid=1669929 Vandalismusmeldung/Archiv/2006/08] 869716
: 3 [https://de.wikipedia.org/w/index.php?curid=1052376 KV_28/Archiv_1] 866489
: 3 [https://de.wikipedia.org/w/index.php?curid=4561916 Hannibal21/Archiv] 828270
: 3 [https://de.wikipedia.org/w/index.php?curid=7592701 Nicola/Archiv/Archiv2013] 807241
: 2 [https://de.wikipedia.org/w/index.php?curid=8134873 Fox122/Fehlende_Artikel] 793347
: 3 [https://de.wikipedia.org/w/index.php?curid=6669982 Hephaion/Archiv/2012] 792193
: 3 [https://de.wikipedia.org/w/index.php?curid=4780766 XenonX3/Archiv/2009] 775658
: 2 [https://de.wikipedia.org/w/index.php?curid=3170744 RevoBot/Log/2007-12-29T06:00:00] 764430
: 2 [https://de.wikipedia.org/w/index.php?curid=8025047 Krdbot/Klammerrotlinks] 722107
: 4 [https://de.wikipedia.org/w/index.php?curid=1532618 Vandalismusmeldung/Archiv/2006/06] 712778
: 3 [https://de.wikipedia.org/w/index.php?curid=6201881 Enomine/Ausrufer] 703680
: 2 [https://de.wikipedia.org/w/index.php?curid=6520971 Flominator/Bilderwunsch/Listeneintrag] 703621
: 2 [https://de.wikipedia.org/w/index.php?curid=5108588 Antonsusi/A-Liste] 698589
: 4 [https://de.wikipedia.org/w/index.php?curid=6318601 Sperrprüfung/Archiv/2011/Juli] 698170
: 3 [https://de.wikipedia.org/w/index.php?curid=5853716 Eingangskontrolle/Archiv2010] 690773
: 3 [https://de.wikipedia.org/w/index.php?curid=788515 Tilla] 679823
: 2 [https://de.wikipedia.org/w/index.php?curid=6330852 Ne_discere_cessa!/Ausrufer] 676629
: 2 [https://de.wikipedia.org/w/index.php?curid=5983458 Akkakk/Unicode3] 647266
: 4 [https://de.wikipedia.org/w/index.php?curid=5806571 Wartungsbausteinwettbewerb/Herbst_2010/Auswertung] 645066
: 4 [https://de.wikipedia.org/w/index.php?curid=7961565 Redaktion_Physik/Kategorienpflege] 644928
: 4 [https://de.wikipedia.org/w/index.php?curid=253737 Löschkandidaten/Archiv:Datei-Logbuch/20040506] 639876
: 2 [https://de.wikipedia.org/w/index.php?curid=7039067 Flominator/ÜA] 636296
: 2 [https://de.wikipedia.org/w/index.php?curid=7449577 Krdbot/ANR-Meta-Links] 633950
: 2 [https://de.wikipedia.org/w/index.php?curid=6631063 Krdbot/Chemredirs] 630035
: 2 [https://de.wikipedia.org/w/index.php?curid=8309162 Dsds55/Russland2/Disk] 554652
: 4 [https://de.wikipedia.org/w/index.php?curid=8358780 Umfragen/Superschutz] 523971
: 2 [https://de.wikipedia.org/w/index.php?curid=7455036 Entbert/Werkstatt] 523509
: 0 [https://de.wikipedia.org/w/index.php?curid=69205 Liste_von_Pkw-Marken] 520035
: 4 [https://de.wikipedia.org/w/index.php?curid=270802 Löschkandidaten/Archiv:Lösch-Logbuch/20040802] 517838
: 4 [https://de.wikipedia.org/w/index.php?curid=7949454 Redaktion_Biologie/Artikelbewertung/Statistik/Lebewesen_–_systematische_Übersicht] 513870
: 0 [https://de.wikipedia.org/w/index.php?curid=7658948 Liste_der_RISM-Bibliothekssigel] 503190
: 0 [https://de.wikipedia.org/w/index.php?curid=6829214 Liste_der_Kulturdenkmäler_in_Bremen-Mitte] 480287
: 0 [https://de.wikipedia.org/w/index.php?curid=1725607 Liste_der_olympischen_Medaillengewinner_aus_den_Vereinigten_Staaten] 479622
: 4 [https://de.wikipedia.org/w/index.php?curid=195222 WikiProjekt_Eishockey/Archiv_Artikelliste] 474099
: 2 [https://de.wikipedia.org/w/index.php?curid=4556999 MiBot/Vorlage_Bahnlinie] 413619
: 2 [https://de.wikipedia.org/w/index.php?curid=7177880 Kaisehr74/In_Arbeit_8] 410605
: 4 [https://de.wikipedia.org/w/index.php?curid=6083830 WikiProjekt_Bremen/WLM2011/Liste_Bremen/Mitte] 403491
: 4 [https://de.wikipedia.org/w/index.php?curid=714308 Soundtrack_der_Wikipedianer] 392282
: 4 [https://de.wikipedia.org/w/index.php?curid=7723683 Bücher/Badminton] 381626
: 2 [https://de.wikipedia.org/w/index.php?curid=8256010 Etmot/Rangliste_des_deutschen_Fußballs/Chronik] 376953
: 0 [https://de.wikipedia.org/w/index.php?curid=2245118 Liste_der_Länderspiele_der_deutschen_Fußballnationalmannschaft] 375085
: 2 [https://de.wikipedia.org/w/index.php?curid=8309138 Dsds55/Russland2] 354030
: 2 [https://de.wikipedia.org/w/index.php?curid=6479743 Merlissimo/Spielwiese8] 343468
: 0 [https://de.wikipedia.org/w/index.php?curid=7318612 Liste_der_Stolpersteine_in_Berlin-Charlottenburg] 341948
: 4 [https://de.wikipedia.org/w/index.php?curid=8101759 Redaktion_Biologie/Artikelbewertung/Statistik/Vielzellige_Tiere] 323498
: 2 [https://de.wikipedia.org/w/index.php?curid=7177539 SteEis./Medaillenspiegel_bei_Weltmeisterschaften] 312789
: 2 [https://de.wikipedia.org/w/index.php?curid=8012814 Atamari/Listeneintrag_fehlt] 309078
: 0 [https://de.wikipedia.org/w/index.php?curid=7640457 Liste_der_Länderspiele_der_sowjetischen_Eishockeynationalmannschaft_1980_bis_1989] 306934
: 2 [https://de.wikipedia.org/w/index.php?curid=7576418 Dsds55/Russland1] 301940
: 2 [https://de.wikipedia.org/w/index.php?curid=3875381 Sokkok/Liste_der_Kirchen_des_Bundes_Evangelisch-Freikirchlicher_Gemeinden] 288433

79 pages of about 2850 return error 503.


Version: 1.24rc
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71385
https://bugzilla.wikimedia.org/show_bug.cgi?id=71519

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:52 AM
bzimport set Reference to bz71486.

Changing component; this doesn't seem related to MediaWiki UI.

More details at https://www.mediawiki.org/wiki/Talk:HHVM#Big_Pages_and_Lua_Timeout

(and I'm going to trim out the CCs of the old component.)

  • Bug 71515 has been marked as a duplicate of this bug.

I couldn't reproduce with an anonymous test script or with my own user account, but with Ori's test script, including login cookies, I reproduced it and got a gdb backtrace. It is infinite (or at least very deep) recursion in serialize(), causing a segfault in HHVM.

The issue is that the peculiar structure of a PPNode_Hash_Tree has references nested very deeply, potentially hundreds of thousands of levels. In the case of

https://de.wikipedia.org/w/index.php?curid=5622355

the resulting PPNode_Hash_Tree for the article itself has nesting at least 13500 levels deep. serialize() was observed to crash after 8584 levels, and unserialize() was observed to crash after about 12500 levels, in both cases due to exhaustion of the 8MB thread stack.

The solution options that occur to me are:

a) Increase the thread stack size with ulimit -Ss. This is tested and works for the above test case, but to work for any conceivable test case would require a very large stack, perhaps 2GB.
b) Introduce a depth limit to serialize() and unserialize(). Probably sensible in any case, but potentially tricky to sell the PR to upstream, and would still leave us with an error message delivered to the user.
c) Rewrite serialize() and unserialize() to be non-recursive. That would be quite a lot of work.
d) Use a different data structure in PPNode_Hash_Tree, for example using integer keys to break references, perhaps in an HHVM-specific subclass.
e) Use a custom serializer in PPNode_Hash_Tree, which can traverse the tree without recursion.
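To illustrate what options (c) and (e) amount to: traversing a deep structure with an explicit work stack, so depth is limited by heap rather than by the 8MB thread stack. A minimal sketch; the function name and structure here are illustrative, not actual HHVM or MediaWiki code:

```php
<?php
// Walk an arbitrarily nested array with an explicit stack instead of
// native recursion. A recursive walk of a structure this deep would
// risk exhausting the C stack; this version only grows the heap.
function iterativeDepth( $node ) {
	$maxDepth = 0;
	$stack = array( array( $node, 1 ) ); // pairs of (value, depth)
	while ( $stack ) {
		list( $value, $depth ) = array_pop( $stack );
		$maxDepth = max( $maxDepth, $depth );
		if ( is_array( $value ) ) {
			foreach ( $value as $child ) {
				$stack[] = array( $child, $depth + 1 );
			}
		}
	}
	return $maxDepth;
}

// Build a structure nested 5000 levels deep and measure it.
$a = null;
for ( $i = 0; $i < 5000; ++$i ) {
	$a = array( $a );
}
echo iterativeDepth( $a ), "\n"; // 5001
```

A custom serializer for PPNode_Hash_Tree could use the same explicit-stack traversal to emit a flat representation without ever recursing.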

I upped the stack size soft limit to 32MiB, which doesn't fix the problem, but certainly makes it a lot more esoteric.

I tried (e), using XML instead of unserialize(): parsing XML was slower than just parsing the original wikitext, and 9 times slower than unserialize().

Are these still occurring in production or can we resolve fixed while working on a better solution? I've not been able to reproduce this with any of the cited pages.

It's much more esoteric but still reproducible:

https://de.wikipedia.org/w/index.php?title=Benutzer:Boshomi/Test2&action=submit

Adding a little bit more content => 503.

On my original test page https://de.wikipedia.org/wiki/Benutzer:Boshomi/testurl I get many more 15-second timeouts compared with PHP.

(In reply to Erik Moeller from comment #9)

Are these still occurring in production or can we resolve fixed while
working on a better solution? I've not been able to reproduce this with any
of the cited pages.

Impact on users is effectively nil, in that the bug is now only reproducible with specially-contrived degenerate input. (I should note that it is possible to make Zend crash this way, too.)

It's not good to segfault on user input, so there is still a bit of work to do before we mark this resolved. I think the sanest approach would be (b) (cf comment 6). From the user's perspective the only thing that would change would be the phrasing of the error message we present.

(In reply to Ori Livneh from comment #11)

It's not good to segfault on user input, so there is still a bit of work to
do before we mark this resolved. I think the sanest approach would be (b)
(cf comment 6). From the user's perspective the only thing that would change
would be the phrasing of the error message we present.

A depth limit on serialize() wouldn't completely fix the problem though. Try something like this and you will get a segfault in both PHP and HHVM:

$a = null;
for ( $i = 0; $i < 100000; ++$i ) {
    $a = (object)array( $a );
}
$a = null;

In the above example, the segfault happens on the last line, when the objects are recursively destroyed and the stack overflows. So I think (d) should be considered.
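For what it's worth, the destructor cascade in the snippet above can be sidestepped by unlinking the chain iteratively before dropping the last reference. A sketch, using a named property for readability rather than the array-cast numeric key of the original repro:

```php
<?php
// Build the same kind of 100000-level reference chain as the repro,
// but with a named property.
$a = null;
for ( $i = 0; $i < 100000; ++$i ) {
	$a = (object)array( 'inner' => $a );
}

// Tear it down from the outside in. Each iteration clears the 'inner'
// link first, so when the outer object loses its last reference it is
// destroyed with no children and the destructor never recurses.
while ( $a !== null ) {
	$next = $a->inner;
	$a->inner = null;
	$a = $next;
}
```

This only helps code that knows it is holding such a chain, which is why a data-structure change as in (d), where no deep reference chain exists in the first place, is the more robust fix.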

Someone's already doing option (b) internally, which will at least turn segfaults into fatal errors. I'll let you know when the fix makes it out to master (it's pretty simple). He's also going to look at (c) but that might not end up happening if it's too complex or makes things slower.

ori set Security to None.

For the record, I just tested this and it's still not solved in HHVM 3.12; I guess we'll keep our hack for the time being.

Reopening since it seems like this is being deliberately exploited as a DoS attack vector (see T135483). Also it was never really resolved in the first place, as was noted at the time.

It's been 8 years since I wrote Preprocessor_Hash, and reviewing the code now gives me a new perspective. I think the rationale for using singly-linked lists does not hold up. I think I did it for the following reasons:

  • O(1) list merge, e.g. line 500. This is cute but doesn't give a better overall time order than, say, array_splice() to append to a child node array, since we already spent O(N) time creating the list, and the merge is only done once.
  • O(1) list truncation, line 605-628. Again, replacing with array_splice() would apparently not affect the overall time order.
  • Using a DOM-like API made it easier to port the code from Preprocessor_DOM and to provide a common interface. For example, we have a public function PPNode::getNextSibling(), which is easier to implement if you have a linked list. An alternative is a cached array index, possibly stored in a temporary object. PPNode_DOM is a temporary object, created when it is needed by the external API, so PPNode_Hash could be similar. Note that the public API is read-only.
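The child-array alternative can be sketched as follows (a hypothetical class, not the actual Preprocessor_Hash code): siblings live in a plain indexed array on the parent, so sibling lookup becomes an index increment and no per-node nextSibling reference chain is ever built, which keeps serialization depth proportional to tree height rather than sibling count.

```php
<?php
// Illustrative child-array node: children are held in an ordinary
// indexed array instead of a singly-linked sibling list.
class HashNode {
	public $name;
	/** @var HashNode[] Children, indexed 0..N-1 */
	public $children = array();

	public function __construct( $name ) {
		$this->name = $name;
	}

	public function addChild( HashNode $child ) {
		$this->children[] = $child;
	}

	// With a cached index, "next sibling" is just index + 1 on the
	// parent's array, rather than following a ->nextSibling reference.
	public function getChild( $i ) {
		return isset( $this->children[$i] ) ? $this->children[$i] : null;
	}
}

$root = new HashNode( 'root' );
$root->addChild( new HashNode( 'a' ) );
$root->addChild( new HashNode( 'b' ) );
echo $root->getChild( 1 )->name, "\n"; // b
```

Appending is amortized O(1) via the array, so the O(1) merge and truncation of the linked list are not missed, matching the rationale above.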

I'm going to try to modify the data structure and see how that goes.

Change 299710 had a related patch set uploaded (by Tim Starling):
Preprocessor_Hash: use child arrays instead of linked lists

https://gerrit.wikimedia.org/r/299710

Re-assigning to Tim, as @ori's assignment was from when he previously closed this and now Tim has a patch up for review. Correct me if I'm wrong.

greg added subscribers: fgiunchedi, bd808, cscott and 10 others.

Change 299710 merged by jenkins-bot:
Preprocessor_Hash: use child arrays instead of linked lists

https://gerrit.wikimedia.org/r/299710

I used a reproduction case from T135483 and it appears to be fixed. It seems https://gerrit.wikimedia.org/r/299710 solved it.

I have opened a few of the URLs from this task's description and they all render.

@tstarling @ori I think you can now close this task.


Two months without reply, hence closing.