Page MenuHomePhabricator

Detect and monitor against multiple Parser invocation during edit requests
Closed, ResolvedPublic

Description

Background

One of, if not the, most expensive procedures during edit "stash" and edit "save" requests is the page rendering (Parser invocation). It has happened numerous times that an extension directly or indirectly causes the input to be parsed additional times, which significantly slows down the person's interaction with the site.

For reference, this is at least the *fourth* time this has happened:

I believe there have been others, and for the record, the latest (T288639) was in Aug 2021.

Since September 2021 we have a Logstash dashboard for the DuplicateParse logging bucket: https://logstash.wikimedia.org/app/dashboards#/view/11bb9170-1bd2-11ec-85b7-9d1831ce7631

Monitoring

I don't know if we can completely disallow this (e.g. throw an exception when it happens, and catch it early during testing/deployment), but that seems like an "ideal" to explore first.

Short of that, the next thing we can do is detect it at run-time and issue a warning against it (e.g. trigger_error, similar to what we do with Hard-deprecations and PHP's own deprecation warnings), plus a diagnostic message into a relevant logger channel with details about how and what is going on (e.g. stack of first parse, and whatever is different between them that led to a second parse).

This warning will ensure CI fails if it happens during any of the simulated edits from PHPUnit or Selenium integration tests, and ensures it is caught in a likely-train-blocking fashion when happening in production, yet without immediate user impact (no fatal exception), apart from the obvious regression in latency.

Event Timeline

Very much support. Is performance team planning to work on implementing this, or PET is better positioned to work on this in your opinion?

I think we should collaborate on the evaluation, and then it depends on the direction we go.

If we end up going with detection, logging and an Icinga alert; then I think we could do that. But, if we find a viable way to completely avoid this issue through a (minor?) refactor of PageUpdater code, then that might be something you'd be more familiar with and better positioned to carry out.

ok. Added this to our agenda of tomorrow technical discussions meeting. Will update the ticket with our shared thinking on this.

I believe there are two main things that need addressing/investigating:

  • Some logging/monitoring to detect when a page is being double parsed
  • A better API for developers and/or comments to prevent regressions from happening again. My recollection from 2013 is that was caused by some ContentHandler refactoring, reading the 2018 tickets they were also caused by some refactoring, and today's was caused by some other refactoring (since it was a deprecated function at fault).
    • Why is this code constantly being refactored and changed? If people keeping making the same mistakes, the code or its docs must not be clear about how to avoid this problem.
    • For extension developers, it should be very hard and awkward to do the wrong thing (triggering a double parse). Why does this keep cropping up in SpamBlacklist? Is it doing something wrong? Can its usecase be better accommodated in core code?

Related T172941: Track duplicate parses on page save / https://gerrit.wikimedia.org/r/c/mediawiki/core/+/444264/ ("Track duplicate parses on page save via logging") which has some notes from @daniel about how to track:

This is probably obsolete per the introduction of RevisionRenderer, since RenderedRevision has internal caching.

One side effect of this is that there is no good place to detect duplicate parsing any more. It would only happen when two RenderedRevision instances are crated for the same RevisionRecord with the same options. We could track some kind of hash of all the RenderedRevision instances that have been created, and log when more than one is created - but if we do that, it would make more sense to use that hash as a key of an internal cache in RevisionRenderer.

Should we merge this into that as a duplicate and re-open, then?

PET discussed it and we support. We think the best place to put this would be ContentHandler::getParserOutput (being moved over from Content in T287158) Doing it in Parser is too low, Doing it in RenderedRevision and above will basically depend on proper reuse of RenderedRevision, and that' exactly the bug we're trying to catch. ContentHandler is essentially a singleton, so it's easy to track parses.

We can add support in ContentHandler to register an observer on a parse operation so that we don't bring too many new dependencies into ContentHandler, it's already pretty big.

dpifke added a subscriber: dpifke.

Sounds like PET can work on this. If you want help from Performance Team, please let us know.

Let me know if I can help on anything.

If you don't mind, we'd like to keep this issue on our radar. We use that tag for tracking performance-related issues, even if another team is working on them.

If you don't mind, we'd like to keep this issue on our radar. We use that tag for tracking performance-related issues, even if another team is working on them.

Sorry, that was unintentional. I mean to remove it from the PET inbox, not your radar.

We are planning to look at this during our upcoming "Code Jam" the week of Sept. 20.

Discussed this with Petr in preparation for Code Jam, here are some notes from that discussion. @Pchelolo , please correct me if I misunderstood/misstated any of this.

In a comment above, Petr, suggested doing duplicate parse detection in ContentHandler. However, the prerequisite changes aren't done yet, and won't be done in time for Code Jam. So we discussed doing the duplicate parse detection elsewhere for now. It'll be a relatively small patch to move it when ContentHandler is ready.

CURRENT STATE:

  • WikitextContext::fillParserOutput currently has duplicate parse detection
  • this detection only works for this instance of WikitextContent and is therefore unreliable
  • RenderedRevision::getRevisionParserOutput (via the injected combineOutput function) and RenderedRevision::getSlotParserOutput (via $content->getParserOutput) invoke parsing
  • If a RenderedRevision instance were reliably injected everywhere, it could detect duplicate parsing. But that's not currently the case. And even if it were, it will wouldn't offer guarantees other than people using it properly.
  • "Content" is an interface containing (among other things) function getParserOutput.
  • "AbstractContent" is an abstract class (which, interestingly, contains no abstract methods) that provides a concrete implementation of getParserOutput.
  • AbstractContent::getParserOutput would be a good place to put duplicate detection

PROPOSED STATE AFTER THIS CHANGE:

  • create a ParserObserver/ParserMonitor/whatever-we-call-it class as a service. I'll call it ParserObserver in this comment. Note that herein we are using the word "service" in a MediaWikiServices sense, to mean a PHP class within MediaWiki core, not an external application with its own codebase in the "microservice" sense.
  • ParserObserver could be an interface, and could be injected into AbstractContent.
  • statsd can emit to different places. To control this, we could inject a reporter into ParserObserver to control where it outputs info
  • some reparses may be legitimate. So for now at least, we should log duplicates in a way that doesn't cause the appearance of a major production issue.
  • It is a reparse if (1) revid is the same, (2) title is the same (page reference), and (3) parser options match (there are existing functions to check for matching parser options)
  • we'll want to log stack traces, so that we can tell what code triggered the reparse

Some of the above is a bit rough, and is taken directly from my meeting notes. We can figure out the details together, and correct anything I got wrong.

Conversation notes:

  • Daniel suggested an alternative implementation of a Decorator pattern, in which we wrap the content class. He says this might be simpler, but would be more suitable for temporary logging, while Petr's proposal is more theoretically correct for code that will persist for some time.
  • Antoine pointed out that "temporary" code often lives for longer than intended, often much longer, which probably makes the injected ParserObserver a better choice.
  • Daniel also mentioned that there are some things that are so universal that we don't bother injecting them, because we'd have to inject them everywhere. Logging and stats are the main two examples. Nothing says we *can't* inject them if it makes sense in a particular context, we just don't necessarily *have* to.

Change 722648 had a related patch set uploaded (by Cicalese; author: Cicalese):

[mediawiki/core@master] Detect and monitor against multiple Parser invocation during edit requests

https://gerrit.wikimedia.org/r/722648

I found a few via testwiki using:

get.sh
#!/bin/bash

exec ssh mwlog1002.eqiad.wmnet grep -Pzo '(?s)DuplicateParse.*?2021-' /srv/mw-log/testwiki.log > duplicateparse.txt

Then some limited splitter based on the md5sum of the message:

splitter.sh
#!/bin/bash

set -eu -o pipefail

INPUT=duplicateparse.txt
PREFIX=trace-
SUFFIX=.txt

echo "Removing old trace files if any"
rm -f ${PREFIX}*${SUFFIX}
rm -fR md5*
echo "Splitting..."
split --lines=1 --separator='\0' -d --additional-suffix=$SUFFIX $INPUT $PREFIX
for trace_file in "${PREFIX}"*"${SUFFIX}"; do
	md5=$(md5sum "$trace_file" | cut -d\  -f1)
	mkdir -p "md5-$md5"
	mv "$trace_file" "./md5-$md5"
done

Most of them came from https://test.wikipedia.org/wiki/User_talk:Doc_Taxon , potentially while crawling the history or making edits so there are a bunch of traces that look the same. I have pasted two different ones as P17310 P17311

DuplicateParse is now logged for all wikis and if such messages arises they will be searchable via the MediaWiki log dashboard at https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT and searching for channel.keyword:DuplicateParse. There are none so far.

Nice finds!

P17311 - https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SpamBlacklist/+/705854 - I have some ideas different from @daniel there, but it's totally fixable.

P17310 - same idea, different extension. onEditFilterMergedContent hook passes Content to an extension, but doesn't share any of the parsing machinery.

ArielGlenn triaged this task as Medium priority.Sep 22 2021, 3:51 PM

Messages emitted to DuplicateParse logging bucket are now written to Logstash. I have created a basic dashboard which shows the top trace and previous-trace at https://logstash.wikimedia.org/app/dashboards#/view/11bb9170-1bd2-11ec-85b7-9d1831ce763

I could not find a way to create a visualization that would aggregate based on the couple (trace, previous-trace). I would need to invoke someone familiar with Kibana / ElasticSearch. Edit: got assistance from Timo and I have added a visualization that shows the top pair of trace/previous-trace.

Change 722648 merged by jenkins-bot:

[mediawiki/core@master] Detect and monitor against multiple Parser invocation during edit requests

https://gerrit.wikimedia.org/r/722648

Pchelolo claimed this task.

Duplicate parse detection is working all right https://logstash.wikimedia.org/goto/122ea5616538e329a743b46ad88048b8

We now know for example that we parse every page on commons twice on every parse due to CommonsMetadata extension. Good fun.

I'll resolve this ticket and create a new one with subtasks with dashboard analysis.

Histogram of DuplicateParse logs as the patch got rolled out to production last week:

duplicateparse.png (245×659 px, 18 KB)

It is great to see we are detecting a lot more of them and to see sub tasks being filed.