Page MenuHomePhabricator

ParserOptions validation to avoid pollution on ParserCache save
Closed, ResolvedPublic

Description

I noticed that the ApiParse module sets a number of ParserOptions options which are not represented in ParserOptions::optionsHash(), specifically enableLimitReport(), setIsPreview() and setIsSectionPreview(). It calls WikiPage::getParserOutput() with these options, which will cause the parser cache to be polluted with the specified options.

In https://gerrit.wikimedia.org/r/#/c/233344 , in addition to adding setTidy() to the list of modified options, I proposed rectifying the parser cache pollution by having ApiParse skip the parser cache when the relevant API module parameters are specified.

In code review, Brad pointed out that the pollution issue may wider than just ApiParse, and suggested having WikiPage::getParserOutput() validate the ParserOptions object. So I did a caller survey, which I will post below. It shows two callers that are polluting the parser cache: ApiParse and GeoData.

GeoData uses the ParserOptions constructor with no parameters. The ParserOptions constructor is intended for message parsing and the like and so does not call setTidy() and enableLimitReport(), and these options are not in optionsHash(). So GeoData pollutes the parser cache with untidied output.

Design notes and open questions:

  • ParserOptions has a lot of options (~35), and more are frequently added. So validation of the ParserOptions object may need a more abstract concept of an "option" and the ability to iterate over them.
  • When adding new options, adding validation should be simple and obvious, so that developers do not omit it.
  • Is skipping the cache an appropriate response to GeoData's incorrect construction of ParserOptions? We could throw an exception instead. We could provide a separate validation interface, and callers could opt in to skipping the cache.
  • Wikidata uses ParserOptions::addExtraKey() to avoid polluting the cache. How can this be reflected in validation? Should addExtraKey() be replaced with a method which tells us which options are to be excluded from validation?

Event Timeline

tstarling raised the priority of this task from to Needs Triage.
tstarling updated the task description. (Show Details)
tstarling added subscribers: tstarling, Anomie, MaxSem, Jackmcbarn.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 26 2015, 12:59 AM

Here is the caller survey, in core and WMF deployed extensions. "From context" means WikiPage::makeParserOptions() with an IContextSource.

ParserCache::save callers:

  • WikiPage
  • PoolWorkArticleView
  • RefreshLinksJob (canonical options)
  • ApiPurge (canonical options)

shouldCheckParserCache callers:

  • Article
  • WikiPage
  • API
  • RefreshLinksJob
  • TextExtracts

WikiPage::getParserOutput() callers and their ParserOptions sources:

  • Article (from context or $user param)
  • ApiParse (from context but modifies lots of things)
  • DifferenceEngine (from context but calls setEditSection)
  • GeoData (from ParserOptions constructor)
  • GeoCrumbs (from WikiPage::makeParserOptions($wgUser))
  • ApiMobileView (from context)
  • Wikibase\Repo\Diff::EntityContentDiffView (uses ParserOptions::addExtraKey() to avoid polluting cache)
tstarling set Security to None.

To avoid the pollution issue, something somewhere needs to know which options the cache varies on and which it assumes are at default values. It makes most sense to me for that to be checked as close to cache population as possible, i.e. ParserCache::save() callers rather than WikiPage::getParserOutput() callers. As for where the actual checking-function could live, ParserOptions itself would be good for locality, since it already has ParserOptions::optionsHash(). Or we could put it into ParserCache as a parallel to ParserCache::getParserOutputKey(), but I like that slightly less.

Regarding whether failure to cache should be an exception, I'm not sure that something like WikiPage::getParserOutput() should necessarily be throwing on that issue. But I'm not sure it shouldn't, either.

Regarding Wikibase, I see two calls. The one in EntityHandler seems entirely straightforward. The one in EntityContentDiffView is less so: it adds an arbitrary "diff=1" key to fragment the cache on the use of various (unspecified) options that aren't normally varied on. If we want to properly support this, it'd might better be done by giving the ability to tell ParserOptions to vary on usually-unvaried options.

For giving ParserOptions the ability to iterate over options, the most straightforward solution would be to store options internally as an associative array rather than object parameters. I'll submit patches momentarily to clean up users of the existing public fields (there are surprisingly few) to make that sort of thing possible.

Change 233976 had a related patch set uploaded (by Anomie):
Remove direct ParserOptions field access

https://gerrit.wikimedia.org/r/233976

Change 233977 had a related patch set uploaded (by Anomie):
Remove direct ParserOptions field access

https://gerrit.wikimedia.org/r/233977

Change 233976 merged by jenkins-bot:
Remove direct ParserOptions field access

https://gerrit.wikimedia.org/r/233976

Change 233977 merged by jenkins-bot:
Remove direct ParserOptions field access

https://gerrit.wikimedia.org/r/233977

Change 234270 had a related patch set uploaded (by Anomie):
Make ParserOptions fields private

https://gerrit.wikimedia.org/r/234270

Change 234270 merged by jenkins-bot:
Make ParserOptions fields private

https://gerrit.wikimedia.org/r/234270

This could have prevented T165115. I should probably look at finishing this.

Change 354503 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/GeoData@master] Use canonical ParserOptions for WikiPage::getParserOutput()

https://gerrit.wikimedia.org/r/354503

Change 354504 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Try harder to avoid parser cache pollution

https://gerrit.wikimedia.org/r/354504

Change 354505 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Wikibase@master] Use WikiPage::makeParserOptions to get options for caching

https://gerrit.wikimedia.org/r/354505

Change 354505 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use WikiPage::makeParserOptions to get options for caching

https://gerrit.wikimedia.org/r/354505

Change 354503 merged by jenkins-bot:
[mediawiki/extensions/GeoData@master] Use canonical ParserOptions for WikiPage::getParserOutput()

https://gerrit.wikimedia.org/r/354503

Change 354504 merged by jenkins-bot:
[mediawiki/core@master] Try harder to avoid parser cache pollution

https://gerrit.wikimedia.org/r/354504