
Investigate crashers (out of memory, timeouts)
Closed, ResolvedPublic

Description

Based on logstash records of crashers, investigate and determine what can be done:

  • reduce memory usage to avoid OOM
  • increase memory allocation to parsoid (by how much)
  • reduce mediawiki limits to prevent the creation of pages like these

Related Objects

Event Timeline

Looking just at enwiki timeouts in our Logstash dashboard for the last 3 months,

  • If I exclude the "User:" and "Wikipedia:" namespaces, we have 2072 timeouts and 1629 OOMs.
  • If I look at just the "User:" namespace, we have ~16000 timeouts, and ~19700 OOMs.
  • If I look at just the "Wikipedia:" namespace, we have ~6900 timeouts and ~5800 OOMs.

So, only about 8% of these failures are outside the User and Wikipedia namespaces, which is a somewhat good sign.

It is not so good that 27% of these timeouts are from the Wikipedia: namespace, given that those pages tend to be needed for administrative, patrolling, and other wiki-maintenance and bureaucratic work.

If we are able to download the list of all these urls from logstash, I expect we'll find a smaller subset of user and wikipedia pages that are contributing to these timeouts.

But, as a start, we should probably focus on the non-User and non-Wikipedia namespaces and see if we can pare down the timeouts further.

I downloaded the logstash data from the last month, extracted the exception urls, stripped the revision ids, and excluded the File, Template, Category, and *Talk namespaces. These are the titles that had timeouts in the last month:

Ankh_Morpork_City_Watch
Battle_of_khaybar
Fairbanks%2C_Morse_and_Company
Filters_in_topology
Gagarin%27s_Start
Good_Morning%2C_Judge
List_of_Evolve_Tag_Team_Champions
List_of_set_identities_and_relations
Magnum_Airlines_Helicopters
New_Super_Mario_Bros._(series)
Sputnik_1
Yuri%27s_Night

That is just 12 titles, and we should be able to work through them and figure out if we can fix the timeouts.
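A minimal sketch of that extraction and filtering step (the URL shape, the field layout of the logstash export, and the function names here are all assumptions for illustration; the real export format may differ):

```python
import re

def normalize(url):
    """Pull the page title out of a Parsoid HTML endpoint URL and drop a
    trailing numeric revision id.  The URL shape is an assumption."""
    m = re.search(r"/page/html/([^/?#]+)(?:/\d+)?", url)
    return m.group(1) if m else None

def excluded(title):
    """Filter out the namespaces dropped above (User, Wikipedia, File,
    Template, Category, and any *Talk namespace)."""
    ns, sep, _ = title.partition(":")
    return bool(sep) and (
        ns in ("User", "Wikipedia", "File", "Template", "Category")
        or ns.lower().endswith("talk")
    )

def timeout_titles(urls):
    """Deduplicated, sorted titles after normalization and filtering."""
    titles = {t for t in map(normalize, urls) if t and not excluded(t)}
    return sorted(titles)

# Hypothetical exception URLs of the assumed shape:
urls = [
    "https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Sputnik_1/1288359999",
    "https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/User:Example/123",
    "https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Sputnik_1/999",
    "https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Talk:Foo/1",
]
print(timeout_titles(urls))  # → ['Sputnik_1']
```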

That turned out to be mostly a nothingburger. Here is the dump of parse.php times on the above titles (after resolving redirects). Except for the two math pages (Filters_in_topology, List_of_set_identities_and_relations), everything parses pretty quickly, and I confirmed with an "?action=purge" on two of the pages that they do render fine. So, except for those two titles, the rest were probably transient timeouts. Alternatively, we should look at the timeout logs with rev ids -- specific revisions might have the timeout problems.

*But* the two math pages highlight a real issue -- the wall-clock time is close to or higher than 60s. Something seems to be busted in Parsoid's interfacing with the Math extension. The legacy parser (based on limit report inspection) doesn't show the same wall-clock bloat. So, that is worth an investigation!
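The wall-vs-CPU gap in the dump below is the telling part: Filters_in_topology shows ~64s real but only ~13.5s user, so most of the elapsed time is spent waiting rather than computing. A small generic illustration of that diagnostic (not Parsoid code; `time.sleep` stands in for a blocking call):

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) for fn().  A large wall-minus-cpu
    gap means time went to waiting (I/O, subprocesses, network calls), not
    to computation in this process -- the same signature as real=64s/user=13s."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - w0, time.process_time() - c0

# Simulate a stage that blocks on something external for 200ms.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")  # wall ≈ 0.20s, cpu ≈ 0.00s
```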

Ankh-Morpork_City_Watch

real	0m1.134s
user	0m0.903s
sys	0m0.052s
Battle_of_Khaybar

real	0m2.570s
user	0m2.120s
sys	0m0.113s
Fairbanks-Morse

real	0m1.355s
user	0m1.114s
sys	0m0.045s
Filters_in_topology

real	1m4.411s
user	0m13.537s
sys	0m0.608s
Gagarin's_Start

real	0m1.414s
user	0m1.094s
sys	0m0.080s
Good_Morning,_Judge

real	0m0.538s
user	0m0.445s
sys	0m0.049s
Evolve_Tag_Team_Championship

real	0m1.418s
user	0m1.241s
sys	0m0.073s
List_of_set_identities_and_relations

real	0m57.846s
user	0m20.256s
sys	0m0.489s
Command_Airways_(South_Africa)

real	0m1.014s
user	0m0.805s
sys	0m0.069s
New_Super_Mario_Bros._(series)

real	0m0.583s
user	0m0.459s
sys	0m0.066s
Sputnik_1

real	0m2.887s
user	0m2.481s
sys	0m0.081s
Yuri's_Night

real	0m1.393s
user	0m1.099s
sys	0m0.088s

Aha .. so, revid 1288359999 on enwiki:Sputnik_1 is a vandalized version and has 15323 uses of Template:Chem_name and 15835 uses of Template:Sic. Using --profile, it turns out that WrapTemplates explodes in time usage on that page and takes 35s! So, that is worth fixing.
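Counting transclusions like that is straightforward once you have the revision's wikitext. A rough sketch (the regex mimics MediaWiki's normalization of first-letter case and space/underscore in template names; fetching the wikitext itself is left out, and the function name is invented):

```python
import re

def count_transclusions(wikitext, template):
    """Count {{template ...}} uses, tolerating first-letter case and
    space/underscore variation, with or without pipe-separated arguments."""
    body = "".join("[ _]" if ch in " _" else re.escape(ch) for ch in template[1:])
    first = f"[{template[0].upper()}{template[0].lower()}]"
    pattern = r"\{\{\s*" + first + body + r"\s*(?:\||\}\})"
    return len(re.findall(pattern, wikitext))

wikitext = "{{chem name|H2O}} and {{Chem_name|CO2}} and {{sic}} and {{Sic|x}}"
print(count_transclusions(wikitext, "Chem name"))  # → 2
print(count_transclusions(wikitext, "Sic"))        # → 2
```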

It might have been the same thing with https://en.wikipedia.org/w/index.php?title=Yuri%27s_Night&action=history and https://en.wikipedia.org/w/index.php?title=Gagarin%27s_Start&action=history which show a number of deleted revisions.

For all the other titles, the errors are transient. So, we have two issues here -- I'll file separate phab tasks for them tomorrow.

Spot-checking other wikis for last month:

  • nlwiki: all user pages
  • kowiki: no timeouts
  • jawiki: 14 across all namespaces, one user page & rest wikipedia namespace
  • frwiki: user pages OR project pages like this with large lists
  • itwiki: excluding user pages, wikipedia pages, and project pages, there are 12 entries -- all of them seem to have been transient; they are small pages and all use timeline charts (so this could have been a transient timeline outage).

I am going to stop now. But, besides the two issues above, I think we should focus on large pages. The User namespace, Wikipedia namespace, and Project pages tend to be large pages with long lists and/or tables with lots of links. So, picking a few sample pages, analyzing them, and fixing what we can should push these numbers down in those namespaces.

Regarding OOMs, after excluding user pages and FST-based langconversion pages (which have known issues), I found at least two pages that are legitimate OOMs (I haven't looked at others closely).

I added "?useparsoid=0" so these won't cause repeated OOMs with Parsoid.

Memory limit is set to 1400 MiB in the config repo.

For enwiki:Wikipedia:User_scripts/Ranking, peak memory usage is 1553 MiB:

ssastry@parsoidtest1001:/srv/parsoid-testing$ time sudo -u www-data php /srv/mediawiki/multiversion/MWScript.php /srv/parsoid-testing/bin/parse.php --wiki=enwiki --integrated --page "Wikipedia:User_scripts/Ranking" --benchmark
Total time: 19625.157197 ms
Peak memory usage: 1553.61 MiB

So, we would have to see what is allocating all this memory and fix that. The first step would be to get some insight into memory usage by pipeline stage, similar to how --profile gives us that info for time spent.
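Parsoid is PHP, but the instrumentation idea is generic: record the peak allocation at each pipeline-stage boundary and reset it, so that a jump attributes the memory to the stage that just ran. A sketch of the pattern in Python using tracemalloc (the stage names and pipeline shape are invented; this is an analogy, not Parsoid's actual pipeline):

```python
import tracemalloc

def run_pipeline(stages, doc):
    """Run each (name, fn) stage and record the peak allocation seen while
    that stage ran -- a per-stage --profile, but for memory instead of time."""
    tracemalloc.start()
    peaks = {}
    for name, fn in stages:
        tracemalloc.reset_peak()  # Python 3.9+: restart peak tracking
        doc = fn(doc)
        _, peak = tracemalloc.get_traced_memory()
        peaks[name] = peak
    tracemalloc.stop()
    return doc, peaks

# Toy stages: "expand" builds a large temporary list, making it the hog.
stages = [
    ("tokenize",  lambda d: d.split()),
    ("expand",    lambda d: len(d * 10000)),  # big temporary allocation
    ("serialize", lambda d: str(d)),
]
_, peaks = run_pipeline(stages, "some wikitext " * 100)
hog = max(peaks, key=peaks.get)
print(hog)  # → expand
```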

Found during personal editing session: enwiki:Remineralisation_of_teeth (https://en.wikipedia.org/wiki/Remineralisation_of_teeth?useparsoid=0) OOMs on Parsoid.

ssastry triaged this task as High priority. May 22 2025, 5:23 PM

In case it's related, we got InternalServerError three times yesterday for https://fa.wikipedia.org/w/rest.php/v1/revision/41917302/html

Regarding OOMs, I filed T395492: Infinite loop in Cite's linter code, and for a bunch of titles collected from this task and from logstash (I skipped the enwiki User namespace), here is some data on peak memory usage (as reported on parsoidtest1001 via Parsoid's parse.php script and the --benchmark option):

enwiki:Template:Syrian_Civil_War_detailed_map ------ 1964.85 MiB
enwiki:Wikipedia:User_scripts/Ranking ------ 1553.93 MiB
enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher1 ------ 1517.26 MiB
enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher2 ------ 1518.95 MiB
enwiki:Wikipedia:WikiProject_Spam/LinkReports/elections.alifailaan.pk ------ 1631.92 MiB
frwiki:Projet:Football/Maintenance ------ 1794.44 MiB
srwiki:Malacostraca ------ 2013.71 MiB
crhwiki:Brânsk_rayonı ------ 1211.06 MiB
crhwiki:Suniy_zekâ ------ 1929.90 MiB
crhwiki:Noyabr ------ 4733.70 MiB
crhwiki:Elektrik ------ 1891.67 MiB
crhwiki:Silâ ------ 1736.65 MiB

Two of the pages from T392261#10849861 (originally reported in T366082 in May 2024) above are no longer a problem. They parse within the memory limit and time limit.

enwiki:Timeline_of_the_COVID-19_pandemic_in_Canada ------ 255.49 MiB
srwiki:Naučno-stručno_društvo_za_upravljanje_rizicima_u_vanrednim_situacijama ------ 105.36 MiB

From the above list, we should look at pages that take over 2000 MiB for starters, and then look at the others. Separately, it is worth investigating memory use in crhwiki:Brânsk_rayonı since it is a smallish page! Why did it peak at 1200 MiB?
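For triage, the list above is easy to slice mechanically. A throwaway sketch that parses the `title ------ N MiB` lines and flags the worst offenders (the 2000 MiB cutoff mirrors the suggestion above; the function name is invented):

```python
def over_threshold(report, limit_mib=2000.0):
    """Parse 'wiki:Title ------ 1234.56 MiB' lines and return the
    (title, peak_mib) pairs above limit_mib, worst first."""
    rows = []
    for line in report.strip().splitlines():
        title, _, rest = line.partition(" ------ ")
        rows.append((title.strip(), float(rest.split()[0])))
    return sorted(
        [(t, m) for t, m in rows if m > limit_mib],
        key=lambda r: -r[1],
    )

# Three rows from the measurements above:
report = """\
srwiki:Malacostraca ------ 2013.71 MiB
crhwiki:Noyabr ------ 4733.70 MiB
crhwiki:Silâ ------ 1736.65 MiB"""
print(over_threshold(report))
# → [('crhwiki:Noyabr', 4733.7), ('srwiki:Malacostraca', 2013.71)]
```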

Change #1152160 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Ultra hacky --debug-oom flag to parse.php

https://gerrit.wikimedia.org/r/1152160

With an additional hacked up version of that patch on parsoidtest1001, here is some info

enwiki:Wikipedia:User_scripts/Ranking

[dump] HTML5 TreeBuilder-PMU: 83.008453369141
[dump] PEG-PMU: 83.008453369141
[dump] PEG-PMU: 1553.7850875854

enwiki:Template:Syrian_Civil_War_detailed_map

[dump] HTML5 TreeBuilder-PMU: 80.396835327148
[dump] PEG-PMU: 80.396835327148
[dump] PEG-PMU: 1965.0513458252

enwiki:Wikipedia:WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Publisher1

[dump] HTML5 TreeBuilder-PMU: 94.690872192383
[dump] PEG-PMU: 790.2268447876
[dump] PEG-PMU: 1517.1468887329

frwiki:Projet:Football/Maintenance

[dump] HTML5 TreeBuilder-PMU: 98.90210723877
[dump] PEG-PMU: 1794.8142089844

crhwiki:Brânsk_rayonı (now reporting reduced memory usage)

[dump] DOMPasses:TOP-PMU: 105.30983734131
[dump] DOMPasses:TOP-PMU: 589.87114715576

crhwiki:Suniy_zekâ

[dump] DOMPasses:TOP-PMU: 83.635368347168
[dump] DOMPasses:TOP-PMU: 1688.9247894287

crhwiki:Noyabr

[dump] DOMPasses:TOP-PMU: 108.5365447998
[dump] DOMPasses:TOP-PMU: 4733.9775314331

crhwiki:Elektrik

[dump] DOMPasses:TOP-PMU: 78.782569885254
[dump] DOMPasses:TOP-PMU: 1891.4405822754

crhwiki:Silâ

[dump] DOMPasses:TOP-PMU: 79.01033782959
[dump] DOMPasses:TOP-PMU: 1736.4234008789

srwiki:Malacostraca

[dump] HTML5 TreeBuilder-PMU: 95.080284118652
[dump] PEG-PMU: 2014.2462615967

Summary: For a number of wikis, PEG (probably the parsing of template output, by the looks of it) is the culprit. But, for crhwiki, some DOM pass blows up the memory usage -- if I had to guess, it is probably deep recursion somewhere that leads to a stack blowout. With the improved instrumentation in the above patch, we can extract additional info for crhwiki and probably fix it.
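Since the [dump] lines report peak memory usage (PMU) at stage boundaries, the stage whose reading jumps the most is the likely culprit. A sketch of that diffing (assuming the `[dump] <Stage>-PMU: <MiB>` line format shown above; the stage names in the demo are hypothetical):

```python
def blame(dump):
    """Diff consecutive '[dump] <Stage>-PMU: <MiB>' readings and return
    the stage with the largest increase in peak memory, plus the jump."""
    readings = []
    for line in dump.strip().splitlines():
        stage, _, value = line.removeprefix("[dump] ").partition("-PMU: ")
        readings.append((stage, float(value)))
    prev, worst, jump = 0.0, None, 0.0
    for stage, peak in readings:
        if peak - prev > jump:
            worst, jump = stage, peak - prev
        prev = peak
    return worst, jump

dump = """\
[dump] StageA-PMU: 100.0
[dump] StageB-PMU: 1500.0
[dump] StageC-PMU: 1510.0"""
print(blame(dump))  # → ('StageB', 1400.0)
```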

Okay, with some more instrumentation, for crhwiki, it looks like AddRedLinks is the culprit:

crhwiki:Brânsk_rayonı

[dump] DOMPasses:TOP-LangConverter-PMU: 105.33726501465
[dump] DOMPasses:TOP-AddRedLinks-PMU: 589.89879608154

crhwiki:Noyabr

[dump] DOMPasses:TOP-LangConverter-PMU: 108.3929901123
[dump] DOMPasses:TOP-AddRedLinks-PMU: 4733.834197998

Change #1152160 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add mildly hacky --debug-oom flag to parse.php

https://gerrit.wikimedia.org/r/1152160

Change #1154863 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6

https://gerrit.wikimedia.org/r/1154863

Change #1154863 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a6

https://gerrit.wikimedia.org/r/1154863

Okay, we've filed separate tasks for specific issues and fixed some of them. The only uninvestigated issue is timeouts in the User and Wikipedia namespaces -- both of which point to issues with large pages rather than anything substantially broken. I'm going to close this task; when we get to a second round of performance work, we can revisit our state of OOMs and timeouts and file fresh tasks as needed.