Based on logstash records of crashers, investigate and determine what can be done:
- reduce memory usage to avoid OOM
- increase memory allocation to parsoid (by how much)
- reduce mediawiki limits to prevent the creation of pages like these
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Bump wikimedia/parsoid to 0.22.0-a6 | mediawiki/vendor | master | +625 -759 |
| Add mildly hacky --debug-oom flag to parse.php | mediawiki/services/parsoid | master | +31 -2 |
Looking just at enwiki timeouts in our Logstash dashboard for the last 3 months:
So, about 8% of requests are not in the User and Wikipedia namespaces, which is a somewhat good sign.
It is not so good that 27% of these timeouts are from the Wikipedia: namespace, given that those pages tend to be needed for administrative, patrolling, and other wiki maintenance and bureaucratic work.
If we are able to download the list of all these urls from logstash, I expect we'll find a smaller subset of user and wikipedia pages that are contributing to these timeouts.
But, as a start, we should probably focus on the non-User and non-Wikipedia namespaces and see if we can pare down the timeouts further.
I downloaded the logstash data from the last month, extracted the exception URLs that had timeouts, stripped the revision ids, and excluded the File, Template, Category, and *Talk namespaces. That left:
- Ankh_Morpork_City_Watch
- Battle_of_khaybar
- Fairbanks%2C_Morse_and_Company
- Filters_in_topology
- Gagarin%27s_Start
- Good_Morning%2C_Judge
- List_of_Evolve_Tag_Team_Champions
- List_of_set_identities_and_relations
- Magnum_Airlines_Helicopters
- New_Super_Mario_Bros._(series)
- Sputnik_1
- Yuri%27s_Night
That is just 12 titles and we should be able to work through those titles and figure out if we can fix timeouts on those.
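For reference, the normalization described above can be sketched roughly like this. This is an illustrative throwaway, not the exact script I used; the URL shapes and the `normalize` helper are assumptions:

```python
import re
from urllib.parse import unquote

# Namespaces we exclude up front (File, Template, Category, *Talk).
EXCLUDED_NS = ("File:", "Template:", "Category:")

def normalize(url):
    # Keep only the title portion; drop query strings like ?oldid=12345
    # (this is how the revision ids get stripped).
    m = re.search(r"/(?:wiki|rest\.php/v1/page)/([^?#]+)", url)
    if not m:
        return None
    title = unquote(m.group(1)).split("/html")[0]
    first = title.split("/", 1)[0].lower()
    if title.startswith(EXCLUDED_NS) or "talk:" in first:
        return None
    return title

urls = [
    "https://en.wikipedia.org/wiki/Sputnik_1?oldid=1288359999",
    "https://en.wikipedia.org/wiki/Template:Foo",
    "https://en.wikipedia.org/wiki/Talk:Sputnik_1",
    "https://en.wikipedia.org/wiki/Gagarin%27s_Start",
    "https://en.wikipedia.org/wiki/Filters_in_topology",
]
titles = sorted({t for u in urls if (t := normalize(u))})
print(titles)
```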
That turned out to be mostly a nothingburger. Here is the dump of parse.php times on the above titles (after resolving redirects). Except for the two math pages (Filters_in_topology, List_of_set_identities_and_relations), everything parses pretty quickly, and I confirmed with "?action=purge" on two of the pages that they render fine. So, except for those two titles, the timeouts were probably transient. Alternatively, we should look at the timeout logs with rev ids -- specific revisions might have the timeout problems.
*But* the two math pages highlight a real issue -- the wall-clock time is close to 60s or higher. This seems to be something that is busted with Parsoid's interfacing with the Math extension. The legacy parser (based on limit report inspection) doesn't have the same wall-clock time bloat. So, that is worth an investigation!
Ankh-Morpork_City_Watch                real 0m1.134s   user 0m0.903s   sys 0m0.052s
Battle_of_Khaybar                      real 0m2.570s   user 0m2.120s   sys 0m0.113s
Fairbanks-Morse                        real 0m1.355s   user 0m1.114s   sys 0m0.045s
Filters_in_topology                    real 1m4.411s   user 0m13.537s  sys 0m0.608s
Gagarin's_Start                        real 0m1.414s   user 0m1.094s   sys 0m0.080s
Good_Morning,_Judge                    real 0m0.538s   user 0m0.445s   sys 0m0.049s
Evolve_Tag_Team_Championship           real 0m1.418s   user 0m1.241s   sys 0m0.073s
List_of_set_identities_and_relations   real 0m57.846s  user 0m20.256s  sys 0m0.489s
Command_Airways_(South_Africa)         real 0m1.014s   user 0m0.805s   sys 0m0.069s
New_Super_Mario_Bros._(series)         real 0m0.583s   user 0m0.459s   sys 0m0.066s
Sputnik_1                              real 0m2.887s   user 0m2.481s   sys 0m0.081s
Yuri's_Night                           real 0m1.393s   user 0m1.099s   sys 0m0.088s
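As a sanity check, the `time` output above can be folded into seconds to flag anything near the 60s wall-clock limit. A throwaway sketch (the 50s threshold and the inline data subset are my choices, not anything from parse.php):

```python
import re

def to_seconds(t):
    # Turn `time` output like "1m4.411s" or "0m57.846s" into seconds.
    m, s = re.fullmatch(r"(?:(\d+)m)?([\d.]+)s", t).groups()
    return int(m or 0) * 60 + float(s)

# Wall-clock ("real") values copied from the dump above (subset).
timings = {
    "Filters_in_topology": "1m4.411s",
    "List_of_set_identities_and_relations": "0m57.846s",
    "Sputnik_1": "0m2.887s",
}
slow = [title for title, t in timings.items() if to_seconds(t) > 50]
print(slow)  # only the two math-heavy pages stand out
```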
Aha... so, revid 1288359999 on enwiki:Sputnik_1 is a vandalized version and has 15323 uses of Template:Chem_name and 15835 uses of Template:Sic. Using --profile, it turns out that WrapTemplates explodes in time usage on that page and takes 35s! So, that is worth fixing.
It might have been the same thing with https://en.wikipedia.org/w/index.php?title=Yuri%27s_Night&action=history and https://en.wikipedia.org/w/index.php?title=Gagarin%27s_Start&action=history which show a number of deleted revisions.
For all the other titles, the errors are transient. So, we have two issues here -- I'll file separate phab tasks for them tomorrow.
Spot-checking other wikis for last month:
I am going to stop now. But, I think besides the two issues above, we should focus on large pages. The User namespace, Wikipedia namespace, and Project pages tend to be large pages with long lists and/or tables with lots of links. So, picking a few sample pages, analyzing them, and fixing what we can should push these numbers down on those namespaces.
Regarding OOMs, after excluding user pages and FST-based langconversion pages (which have known issues), I found at least two pages that are legitimate OOMs (haven't looked at others closely):
I added "?useparsoid=0" so these won't cause repeated OOMs with Parsoid.
Memory limit is set to 1400 MiB in the config repo.
For enwiki:Wikipedia:User_scripts/Ranking, memory usage is 1553 MiB
ssastry@parsoidtest1001:/srv/parsoid-testing$ time sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php /srv/parsoid-testing/bin/parse.php --wiki=enwiki --integrated --page "Wikipedia:User_scripts/Ranking" --benchmark
Total time: 19625.157197 ms
Peak memory usage: 1553.61 MiB
So, we would have to see what is allocating all this memory and fix that to fix the OOMs. A first step would be to get some insight into memory usage by pipeline stage, similar to how --profile gives us that info for time spent.
Found during personal editing session: enwiki:Remineralisation_of_teeth (https://en.wikipedia.org/wiki/Remineralisation_of_teeth?useparsoid=0) OOMs on Parsoid.
More pages to look at from T366082:
https://en.wikipedia.org/w/rest.php/v1/page/Remineralisation_of_teeth/html
https://crh.wikipedia.org/w/rest.php/v1/page/Suniy_zek%C3%A2/html
https://en.wikipedia.org/w/rest.php/v1/page/Timeline_of_the_COVID-19_pandemic_in_Canada/html
https://sr.wikipedia.org/w/rest.php/v1/page/Nau%C4%8Dno-stru%C4%8Dno_dru%C5%A1tvo_za_upravljanje_rizicima_u_vanrednim_situacijama/html
https://crh.wikipedia.org/w/rest.php/v1/page/Noyabr/html
https://crh.wikipedia.org/w/rest.php/v1/page/Elektrik/html
https://crh.wikipedia.org/w/rest.php/v1/page/Silâ/html
In case it's related, we got InternalServerError three times yesterday for https://fa.wikipedia.org/w/rest.php/v1/revision/41917302/html
Regarding OOMs, I filed T395492: Infinite loop in Cite's linter code. And for a bunch of titles collected from this task and from logstash (I skipped the enwiki User namespace), here is some peak memory usage data (as reported on parsoidtest1001 via Parsoid's parse.php script and the --benchmark option):
enwiki:Template:Syrian_Civil_War_detailed_map ------ 1964.85 MiB
enwiki:Wikipedia:User_scripts/Ranking ------ 1553.93 MiB
enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher1 ------ 1517.26 MiB
enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher2 ------ 1518.95 MiB
enwiki:Wikipedia:WikiProject_Spam/LinkReports/elections.alifailaan.pk ------ 1631.92 MiB
frwiki:Projet:Football/Maintenance ------ 1794.44 MiB
srwiki:Malacostraca ------ 2013.71 MiB
crhwiki:Brânsk_rayonı ------ 1211.06 MiB
crhwiki:Suniy_zekâ ------ 1929.90 MiB
crhwiki:Noyabr ------ 4733.70 MiB
crhwiki:Elektrik ------ 1891.67 MiB
crhwiki:Silâ ------ 1736.65 MiB
Two of the pages from T392261#10849861 above (originally reported in T366082 in May 2024) are no longer a problem. They parse within the memory and time limits.
enwiki:Timeline_of_the_COVID-19_pandemic_in_Canada ------ 255.49 MiB
srwiki:Naučno-stručno_društvo_za_upravljanje_rizicima_u_vanrednim_situacijama ------ 105.36 MiB
From the above list, we should look at pages that peak over 2000 MiB for starters and then look at the others. Separately, it is worth investigating memory use in crhwiki:Brânsk_rayonı since it is a smallish page! Why did it peak at 1200 MiB?
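To make the triage concrete, here is a throwaway sketch that checks the peaks above against the 1400 MiB production limit (numbers copied from the table; subset only):

```python
# Production memory limit from the config repo (see earlier comment).
MEMORY_LIMIT_MIB = 1400

# Peak memory numbers copied from the --benchmark table above (subset).
peaks = {
    "enwiki:Template:Syrian_Civil_War_detailed_map": 1964.85,
    "enwiki:Wikipedia:User_scripts/Ranking": 1553.93,
    "crhwiki:Noyabr": 4733.70,
    "crhwiki:Brânsk_rayonı": 1211.06,
}
# Pages over the limit would OOM in production.
over = {p: mib for p, mib in peaks.items() if mib > MEMORY_LIMIT_MIB}
worst = max(peaks, key=peaks.get)
print(worst, peaks[worst])
```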
Change #1152160 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Ultra hacky --debug-oom flag to parse.php
With an additional hacked up version of that patch on parsoidtest1001, here is some info
enwiki:Wikipedia:User_scripts/Ranking
[dump] HTML5 TreeBuilder-PMU: 83.008453369141
[dump] PEG-PMU: 83.008453369141
[dump] PEG-PMU: 1553.7850875854

enwiki:Template:Syrian_Civil_War_detailed_map
[dump] HTML5 TreeBuilder-PMU: 80.396835327148
[dump] PEG-PMU: 80.396835327148
[dump] PEG-PMU: 1965.0513458252

enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher1
[dump] HTML5 TreeBuilder-PMU: 94.690872192383
[dump] PEG-PMU: 790.2268447876
[dump] PEG-PMU: 1517.1468887329

frwiki:Projet:Football/Maintenance
[dump] HTML5 TreeBuilder-PMU: 98.90210723877
[dump] PEG-PMU: 1794.8142089844

crhwiki:Brânsk_rayonı (now reporting reduced memory usage)
[dump] DOMPasses:TOP-PMU: 105.30983734131
[dump] DOMPasses:TOP-PMU: 589.87114715576

crhwiki:Suniy_zekâ
[dump] DOMPasses:TOP-PMU: 83.635368347168
[dump] DOMPasses:TOP-PMU: 1688.9247894287

crhwiki:Noyabr
[dump] DOMPasses:TOP-PMU: 108.5365447998
[dump] DOMPasses:TOP-PMU: 4733.9775314331

crhwiki:Elektrik
[dump] DOMPasses:TOP-PMU: 78.782569885254
[dump] DOMPasses:TOP-PMU: 1891.4405822754

crhwiki:Silâ
[dump] DOMPasses:TOP-PMU: 79.01033782959
[dump] DOMPasses:TOP-PMU: 1736.4234008789

srwiki:Malacostraca
[dump] HTML5 TreeBuilder-PMU: 95.080284118652
[dump] PEG-PMU: 2014.2462615967
Summary: For a number of wikis, PEG (it looks like it is parsing template output) is the culprit. But, for crhwiki, some DOM pass blows up the memory usage -- if I had to guess, it is probably deep recursion somewhere that leads to a stack blowout. With improved instrumentation in the above patch, we can extract additional info for crhwiki and probably fix it.
Okay, with some more instrumentation, for crhwiki, it looks like AddRedLinks is the culprit:
crhwiki:Brânsk_rayonı
[dump] DOMPasses:TOP-LangConverter-PMU: 105.33726501465
[dump] DOMPasses:TOP-AddRedLinks-PMU: 589.89879608154

crhwiki:Noyabr
[dump] DOMPasses:TOP-LangConverter-PMU: 108.3929901123
[dump] DOMPasses:TOP-AddRedLinks-PMU: 4733.834197998
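For what it's worth, since the PMU samples are cumulative peaks, the per-stage cost falls out of the deltas between successive samples. An illustrative sketch (not part of the patch; data copied from the crhwiki:Noyabr dump above):

```python
import re

# [dump] lines copied from the crhwiki:Noyabr output above.
dump = """\
[dump] DOMPasses:TOP-LangConverter-PMU: 108.3929901123
[dump] DOMPasses:TOP-AddRedLinks-PMU: 4733.834197998
"""

# Each sample is a cumulative peak, so the delta between successive
# samples attributes memory growth to the stage in between.
samples = [(m.group(1), float(m.group(2)))
           for m in re.finditer(r"\[dump\] (\S+)-PMU: ([\d.]+)", dump)]
deltas = []
prev = 0.0
for stage, pmu in samples:
    deltas.append((stage, pmu - prev))
    prev = pmu
for stage, d in deltas:
    print(f"{stage}: +{d:.1f} MiB")
```

Which makes it obvious that AddRedLinks, not LangConverter, accounts for nearly all of the growth on that page.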
Change #1152160 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add mildly hacky --debug-oom flag to parse.php
Change #1154863 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6
Change #1154863 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6
Okay, we've filed separate tasks for specific issues and fixed some of them. The only uninvestigated issue is to look at timeouts on User and Wikipedia namespaces -- both of which point to issues with large pages and not necessarily something substantially broken. I'm going to close this task and when we get to a second round of performance work, we can revisit our state of OOMs and timeouts and file fresh tasks as needed.