Page MenuHomePhabricator

Add reply links to the parser cache
Closed, ResolvedPublic

Description

Currently reply links are added in the onOutputPageBeforeHTML hook, and therefore not cached for logged-in users. While the median time spend in this hook is <50ms, on longer pages this can take a few hundred ms[1].

(At https://www.mediawiki.org/wiki/Topic:Vwtdrzx2hszbdul1 someone reported a page where adding reply links takes about 1.5 s.)

We should be able to store the output in the parser cache, so that this method is called much less frequently.

We will need to split the cache on user language as the reply link itself uses the user language.

Done


  1. https://graphite.wikimedia.org/?width=586&height=308&target=MediaWiki.discussiontools.addReplyLinks.mean

Testing

Environment

  • Beta cluster

Behaviors

  • Check that the availability of the tool is unaffected by this patch, that it can be enabled/disabled via preferences or beta feature
  • Check that the reply tool still loads correctly when enabled

From April 14, 2021 meeting between Parsing, Data Persistence and Editing teams, we came up with these action items.

  • [Lukasz] DBAs might want to look at other wikis, assuming investigation has been limited to smaller wikis, as the impact is still possible on others
  • [C Scott] Talk to DBAs about setting up an alarm or monitor that tells us if DT is spiking something
  • [Subbu] Talk to Seve about whether additional hardware is compatible with longterm plans in Technology & get a delivery date if so.
  • [Subbu/Lukasz]: When can we expect new hardware to arrive?
  • [Editing/ Leza/ Peter] Editing Team to talk to Performance Team (we think?) about whether Editing should/can pursue optimizations in the code we've written [or config changes]
  • [C Scott/DBAs/Perf] Test reducing retention and see what happens determine the actual storage and performance implications
  • [Leza] When shall we three meet again? A: Answer early next week.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 641837 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/extensions/DiscussionTools@master] Store reply links in the parser cache

https://gerrit.wikimedia.org/r/641837

CC @LSobanski . We are planning to merge this for next week's train. It will split the parser cache on talk pages for user who have this feature enabled.

@Esanders could you help me understand what impact this is expected to have on the parser cache databases?

@LSobanski we are adding a new parser cache option, which can only ever be set on discussion pages.

I suppose in the worst case, where every talk page has been visited by someone with this feature enabled, we would be doubling the cache size for talk pages. In practice it will less than that as many talk pages receive little to no traffic from real users.

In the long term, should the tool become very widely used, we may consider only generating one version of the pages with the reply links, and hiding them with CSS for the minority that have the feature disabled.

@Esanders is this something that can be disabled/expired if needed?
Parsercache is very stable in terms of disk space usage, and any small variation on the new keys and expiration time can lead to the hosts filling up quite quickly.
This is one of the disk usage graphs for the last month

Captura de pantalla 2021-01-14 a las 7.25.47.png (778×1 px, 90 KB)

As you see there are pretty much no variations of the disk usage as everything is pretty aligned, so it would be great to know:

  • what if once this is deployed it causes disk issues on our parsercache infra? can this be rolled back/disabled/keys deleted
  • Could your team monitor this once this is deployed and for a few days?:

https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=pc1007&var-datasource=thanos&var-cluster=mysql&from=now-24h&to=now
https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=pc1008&var-datasource=thanos&var-cluster=mysql&from=now-24h&to=now
https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&refresh=5m&var-server=pc1009&var-datasource=thanos&var-cluster=mysql&from=now-24h&to=now

  • When would this be deployed?
  • Will this be deployed to all wikis at once or will this be enabled slowly? I would request the latter, so we can see how this affects our servers disk growth.

Thanks

@Esanders is this something that can be disabled/expired if needed?

  • what if once this is deployed it causes disk issues on our parsercache infra? can this be rolled back/disabled/keys deleted

It's pretty easy to rollback this feature, we will also have a flag so it can disabled via config.

Every entry will have dtreply=1 in the key.

  • Could your team monitor this once this is deployed and for a few days?:

Sure.

  • When would this be deployed?

The earliest now would be next week's train (wmf 28)

  • Will this be deployed to all wikis at once or will this be enabled slowly? I would request the latter, so we can see how this affects our servers disk growth.

DiscussionTools is only enabled by default on ar/cs/huwikis, and is a beta feature on ~250 other Wikipedias (but not enwiki). The new cache key will only get used if someone with the feature enabled visits the talk page, so by its nature it will be a gradual rollout.

Thanks for the reply, one more question regarding this comment:

DiscussionTools is only enabled by default on ar/cs/huwikis, and is a beta feature on ~250 other Wikipedias (but not enwiki). The new cache key will only get used if someone with the feature enabled visits the talk page, so by its nature it will be a gradual rollout.

The plan I would assume is to enable it everywhere at some point? (including big wikis like enwiki and commonswiki)
Thank you

ok - thanks. Let's monitor how this first deployment goes and the effect on the disk growth.
Thanks for all the answers!

Change 641837 merged by jenkins-bot:
[mediawiki/extensions/DiscussionTools@master] Store reply links in the parser cache

https://gerrit.wikimedia.org/r/641837

ppelberg updated the task description. (Show Details)

@Marostegui This finally went out in wmf30. I don't see any significant change in disk usage, but I don't know what you consider a significant change :)

Thanks @Esanders I will give it a week and if nothing shows up, I will close this.

So far so good, going to keep looking during this week and if it still good, will close this.

So far so good, going to keep looking during this week and if it still good, will close this.

We appreciate you continuing to monitor this, @Marostegui.

@Esanders can you confirm if this featured was enabled 25th of Feb? I am seeing a worrying pattern of disk growth on parsercache - it would help to know whether this was released around that date or earlier, to try to correlate things.
Thanks

wmf30 which I think was 18th Feb at the latest

Thanks, that sorts of matches the start of growth. Parsercache grows very slowly over the time, but starting in Feb it has grown around 6% which is quite a lot.
For instance from sept to 31st dec it grew 6%.

Can we disable this to see if we notice a change in the growth pattern?

You can set $wgDiscussionToolsUseParserCache = false in the configuration, it won't cause any problems for us (other than somewhat degraded performance for users).

You can set $wgDiscussionToolsUseParserCache = false in the configuration, it won't cause any problems for us (other than somewhat degraded performance for users).

I would appreciate if that can be deployed by your team, we do not deploy MW changes :-)

We (Editing engineers and Scott) talked about this today in a meeting. Some thoughts:

Some growth was obviously expected as a result of splitting the parser cache, and we thought that was okay. It seems no one really had any idea how much it would grow though. 6% seems to be more than I'd expect, but…

Is this just a one-time effect, or is it going to keep growing? It changes so slowly (items are only added when someone views a page, and expire apparently after 30 days), we should probably wait longer to see.

Also, we just enabled the beta feature on en.wp (T273146), which might cause more parser cache growth, but also might make performance degradation by not using parser cache worse, so we don't really want to make that change today.

Anyway, if this becomes an emergency, then you or us can toggle that config option. But if it doesn't, we'd prefer to first learn more about what's in the box before we just turn it off.

Thanks for discussing this within your team.
Keep in mind that I am not 100% sure this is caused by this new feature, but I do want to discard and try to narrow things.

We (Editing engineers and Scott) talked about this today in a meeting. Some thoughts:

Some growth was obviously expected as a result of splitting the parser cache, and we thought that was okay. It seems no one really had any idea how much it would grow though. 6% seems to be more than I'd expect, but…

Is this just a one-time effect, or is it going to keep growing? It changes so slowly (items are only added when someone views a page, and expire apparently after 30 days), we should probably wait longer to see.

Hard to say, I guess it depends on the usage. The expiration date is configured indeed to be 30 days for new items. But it really depends on how much this gets used.
Per T267404#6912946, it looks like we've had this enabled for 27 days. I am fine waiting until Monday/Tuesday next week so we can re-evaluate this.

Also, we just enabled the beta feature on en.wp (T273146), which might cause more parser cache growth, but also might make performance degradation by not using parser cache worse, so we don't really want to make that change today.

I understand that, but getting the hosts close to 80% isn't great as it doesn't give us much margin in case of more unexpected growth (which has happened in the past T167784 ). We are 77-78% on the three hosts. If this doesn't stop the first immediate action we'd need to do would be to reduce the expiration time from 30 to 20 days, which would be affecting everything that goes into parsercache, not just this new feature.

Is splitting the parser cache absolutely necessary? Could you use CSS to hide the links for users who don't have the preference enabled? (at the cost of slightly larger HTML for everyone before the rollout is complete)

Is splitting the parser cache absolutely necessary? Could you use CSS to hide the links for users who don't have the preference enabled? (at the cost of slightly larger HTML for everyone before the rollout is complete)

We could, and we're planning to (T273072), but we wanted to keep it separate for now, so that any bugs in our code would only affect users who enabled the beta feature (and it seems to have been a good decision given T274709 and T275440).

I'll note that we're hoping to eventually use Parsoid output for DiscussionTools, so a similar split-then-merge technique will be helpful as well: first split the cache for "discussion tools using parsoid output", then eventually merge the split and serve parsoid html to all talk page views.

I don't know if that's still the right chart to look at, but it seems to have been decreasing over the last two weeks.

I think it has kept increasing: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=12&orgId=1&from=1616099930204&to=1617688963176&var-server=pc1007&var-datasource=thanos&var-cluster=mysql and it is now over 80% which is what we consider the critical point where we need to start looking at changing hardware, as it doesn't really give us much room in case of unexpected increases.
We need to look at new hardware and/or decrease the retention (it is now set to 30 days), which means possibly degraded performance for everything that lives in parsercache.

Looking at he 90 day view, parsercache usage was at 75% in January, so this is a very gradual increase of about 5% due to discussiontools.

Some options:

  • Decrease retention overall to 29 days, will save roughly 3.3%, or to 28 days, save 6.7%. Problem here is that parsercache is a FIFO cache, not a LRU cache, so we'd be affecting frequently-used entries as well.
  • Provision 5% more storage (but considering that the parsercache might be replaced, this might be new hardware for an soon-obsolete use)
  • Change the critical point to 85% considering that the increase is very slow and there are long-term plans to (a) remove the discussiontools split once it leaves beta, and (b) replace the parsercache entirely.
  • Decrease the retention time for discussiontools-related parser cache entries specifically. Given that usage is ~5%, discussiontools cache time would have to be reduced to 15 days to save ~2.5% parser cache space.
  • Prioritize parsercache improvements, so "v2" parsercache software&hardware can be used for discussiontools (Parsoid is blocked on this too)
  • (Orthogonally) build some tools to walk the parsercache and expand the key structure so we can get more fine-grained insight as to the types of entries in parsercache and validate that discussiontools accounts for the 5% space usage seen. (Perhaps there are other cache splits that can be tweaked to expire sooner which would have more of an effect on space usage.)

Looking at he 90 day view, parsercache usage was at 75% in January, so this is a very gradual increase of about 5% due to discussiontools.

5% in around 2 months is quite a lot for parsercache. I am counting 18 Feb as the latest when this was enabled per T267404#6912946.

Some options:

  • Decrease retention overall to 29 days, will save roughly 3.3%, or to 28 days, save 6.7%. Problem here is that parsercache is a FIFO cache, not a LRU cache, so we'd be affecting frequently-used entries as well.

Correct, this will affect everything stored in parsercache.

  • Provision 5% more storage (but considering that the parsercache might be replaced, this might be new hardware for an soon-obsolete use)

That's not doable short-term. We'd need to allocate budget for it, order, rack, install and populate. That takes a few months.

  • Change the critical point to 85% considering that the increase is very slow and there are long-term plans to (a) remove the discussiontools split once it leaves beta, and (b) replace the parsercache entirel

Unfortunately the critical point isn't something we can move anytime we like. From experience we know when we approach 80%, any mistake or change in parsercache will put us in a very delicate situation and we'd need to act under lots of operational pressure to avoid parsercache to have all its disks filled up.
We don't pick 80% and not 85% just because, we've learned the hard way unfortunately.

  • Decrease the retention time for discussiontools-related parser cache entries specifically. Given that usage is ~5%, discussiontools cache time would have to be reduced to 15 days to save ~2.5% parser cache space.

What's the implications of doing this performance wise? If it is only used on the discussion page, why not starting with this, optimize the tables to reclaim space and see in which position it leaves us?

  • Prioritize parsercache improvements, so "v2" parsercache software&hardware can be used for discussiontools (Parsoid is blocked on this too)

This isn't a short-term solution, but something that would involve many moving parts.

Please keep in mind I am not specifically blaming this feature for causing this, parsercache was already in a non healthy state and we've always warned that adding new stuff to it can cause these issues.
We need to do things short, medium and long term to get this infra in a better state.

Now that annual planning will start soon, I do want to request budget to get bigger disks, but again, this will take months and we cannot only relay on hardware. Until it arrives (if it arrives), we need to make sure we have room for the unexpected, and hence my requests to address this problem.

Do we even know for sure that this is caused by DiscussionTools?

Can you estimate whether disabling our parser cache stuff would even resolve this problem? (That is, if you dropped all parser cache entries with 'dtreply' in the key, how much disk space would that save?)

Do we even know for sure that this is caused by DiscussionTools?

As far as I know, it is the last feature that has been added to parsercache. As I mentioned above, parsercache was already in a bad state, but this might has caused it to go over 80% (or close to it).
To be honest, I am a bit surprised with the level of pushback I am getting :-(
I especially asked whether this easy to revert/disabled and it was said it was pretty easy: T267404#6758561

I don't have any particular interest in disabling this, but we do look after parsercache health and if parsercache crashes/gets filled up, we'll have major problems across the site. It is a delicate piece of infra that isn't in the best state possible. I mentioned these concerns before this was deployed: T267404#6746658

Can you estimate whether disabling our parser cache stuff would even resolve this problem? (That is, if you dropped all parser cache entries with 'dtreply' in the key, how much disk space would that save?)

That's hard to do, with the table structure for parsercache and how we store data on it, makes it difficult do to that from SQL point of view.
Also, we'd need to do the drops on all the tables and then run optimize table to claim the space back anyways.

Personally I'm pushing back because I'm hearing you say that it's not clear whether making this change will improve the parser cache situation, but it is clear to me that doing it will worsen the page load performance for users of DiscussionTools. (At first I got the impression that you were certain that it will, and I was myself suggesting disabling the parser cache use in DiscussionTools with that in mind, but now it seems to me that you don't know.)

I was also hoping that you had ways to debug the problem, and that we'd disable our use of parser cache after it was identified as a problem. I'm somewhat unhappy to be making random changes and waiting a month to see if they had any effect. But if you say that's what we gotta do, then I won't push back.

You also phrased everything as questions, so we didn't see the need for immediate action.

Also, none of us (except you) can do anything about this, other than make MediaWiki configuration changes. A few other solutions were proposed and I have no idea if or when they're being done.

there are long-term plans to (a) remove the discussiontools split once it leaves beta

I hope I'm not talking out of turn here, but @Marostegui, would this problem be solved by turning those "long-term plans" into "short-term plans"? If so, from your perspective, does this need to be turned on for "everyone everywhere", or would it be enough to do that at most wikis, or only enwiki, or any subset of your choice? (It's already default-on at arwiki, cswiki, and huwiki.)

@LSobanski Given a concern (raised in an email thread) that the whole site might crash if parsercache crashes, this issue strikes me as potentially urgent. Can you possibly speak to the imminence of such an event? We have a meeting scheduled for next week, but I want to make sure it is safe to wait that long. :)

@MBinder_WMF We should be OK for now but let's not enable the extension on more wikis before fully understanding what's at play here. Getting some clarity and next steps next week is a reasonable timeline.

Added some notes in ticket description after meeting today with Editing, Parsing and Data Persistence teams.

Possibly related: T278655: Appservers latency spike / parser cache growth 2021-03-28

From the task description:
[…] We will need to split the cache on user language as the reply link itself uses the user language. […]

I think as a general rule, varying parser cache by interface language is not something we can accept for official developments. It's not something totally out of the question to negotiate and plan for, but it's a deviation from the norm, with very significant costs and operational risks associated with them. We support hundreds of languages, that may be used on any wiki. That's two orders of magnitude (more than 400x). And these would be trivially triggerable from URL query parameters (uselang), over Get-HTTP requests, for which we don't have much precedent to rate limit.

We'd have to look very closely to the product need and what alternative technical approaches we have available. I think it's most likely that we'd conclude the same way as we have before, that there are other more cost-effective solutions we should invest in instead.

Both the TOC and Edit section features also utilize the user language, and specifically avoid this ParserCache cost by using "runtime placeholders" (see ParserOutput.php). Given we have invested in that as a technical solution before, that seems worth exploring for DiscussionTools as well. That would bring it down from ~400x to 1x or 2x. E.g. 2x now (feature is off by default and allowed to be used via a query parameter) and ~1x long-term (after the query param is gone, feature on by default, with a user preference only; we'd be close to today's status quo with only a minimal increase for people who opt-out; or not even that if we then feel comfortable using runtime CSS.)

Possibly related: T278655: Appservers latency spike / parser cache growth 2021-03-28

From the task description:
[…] We will need to split the cache on user language as the reply link itself uses the user language. […]

I think as a general rule, varying parser cache by interface language is not something we can accept for official developments.

I wrote this before looking at the code, but as it turns out the parser cache is already varied on user language. However, these are only generated on demand, so probably doesn't account for much of the additional cache usage we have been seeing.

Possibly related: T278655: Appservers latency spike / parser cache growth 2021-03-28

From the task description:
[…] We will need to split the cache on user language as the reply link itself uses the user language. […]

I think as a general rule, varying parser cache by interface language is not something we can accept for official developments.

I wrote this before looking at the code, but as it turns out the parser cache is already varied on user language. […]

If you mean that user language is a standard unconditional part of every cache key, then no, afaik that is not the case. I haven't checked this emperically, but I hope that's not the case for article or discussion pages "in general".

But yes, the Parser does support features such as {{int:}} which can cause a page to vary by user interface language if those features are used. The ParserOptions class tracks which options are used, and only varies the cache if they are used. I generally preceive these hacks as being on the way out and limited to constructs from end-users, not our own software. If I recall correctly, this was mainly to support the uselang hack on Commons, which probably doesn't even get cached (since they are Special-page interface messages, not wiki pages), and is also something I'm looking to deprecate/remove in favour of alternate approaches on Commons (e.g. guided pages, a gadget, or part of UploadWizard).

So we call $popts->getUserLangObj(). It looks like that marks the userlang option as used, and so the getter causes the cache to split (seems like a bit of an anti-pattern there).

We initially discussed using string replacement, and we are using it for other features, so switching back to that won't be a problem.

I'm not sure if this is much of a contributing factor at the moment, but an easy one to fix right now.

Current ongoing work around DiscussionTools' usage of the parser cache is happening in T280599.

I think we should close this task – this has been implemented since January, and was not reverted. The task was reopened when issues surfaced, but there's the new task for them now.