Post-deployment: (partly) ramp parser cache retention back up
Closed, Resolved (Public)

Description

This task covers the Data-Persistence team's evaluation of how the mitigation affected parser cache utilisation during and after the 21-day period following it, and the subsequent ramp-up of non-talkpage retention, depending on whether the reduced talkpage retention freed up the needed space.

Requirements

  • To have done: Ramp parser cache retention back up from 20 to 30 days.
  • To monitor and confirm the same site performance metrics as per T280606#7323626:
    1. "Parser cache disk space available", should remain above 20%. Measured via Grafana: Parser Cache
    2. "Parser cache hit ratio", has been stable around ~80% for article page views. Measured via Grafana: Parser Cache (contenttype; wikitext)
    3. "Backend pageview response time (p75)", has been stable around ~250ms for the past two years. Measured via Grafana: Backend pageview time.
    4. Monitor overall appserver load and internal latencies via the "Application Servers RED Dashboard".
    5. Daily purge of parser cache MUST take less than an actual day to run.
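
The thresholds in the list above can be expressed as a simple check. This is a minimal sketch only: the `ParserCacheSnapshot` type, the function name, and the two tolerance values are hypothetical and not part of any Wikimedia tooling, and the readings would have to be copied by hand from the Grafana dashboards. Item 4 (appserver load) has no numeric threshold in the task, so it is not checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParserCacheSnapshot:
    """Hypothetical container for values read off the dashboards."""
    disk_free_pct: float         # "Parser cache disk space available"
    hit_ratio_pct: float         # "Parser cache hit ratio" (wikitext)
    backend_p75_ms: float        # "Backend pageview response time (p75)"
    purge_duration_hours: float  # wall-clock time of the daily purge


def check_requirements(
    snap: ParserCacheSnapshot,
    hit_ratio_tolerance: float = 5.0,  # assumption: what "around ~80%" allows
    p75_tolerance_ms: float = 50.0,    # assumption: what "around ~250ms" allows
) -> List[str]:
    """Return a list of violated requirements (empty means all met)."""
    problems = []
    if snap.disk_free_pct <= 20:
        problems.append("disk space available is not above 20%")
    if abs(snap.hit_ratio_pct - 80) > hit_ratio_tolerance:
        problems.append("hit ratio has drifted away from ~80%")
    if abs(snap.backend_p75_ms - 250) > p75_tolerance_ms:
        problems.append("backend p75 has drifted away from ~250ms")
    if snap.purge_duration_hours >= 24:
        problems.append("daily purge takes a full day or longer")
    return problems


if __name__ == "__main__":
    # Illustrative readings only, not real measurements.
    print(check_requirements(ParserCacheSnapshot(35.0, 81.0, 245.0, 18.0)))
```

The tolerances are the part most open to debate; the task only says "stable around", so any concrete band is a judgment call.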

Done

  • The Requirements above are met

Event Timeline

Krinkle renamed this task from Post-deployment: evaluate impact on parser cache utilization to Post-deployment: (partly) ramp parser cache retention back up. May 4 2021, 5:33 PM
Krinkle assigned this task to Marostegui.
Krinkle added a project: Data-Persistence.
Krinkle updated the task description.
Marostegui edited projects, added DBA; removed Data-Persistence.
Marostegui moved this task from Triage to Blocked on the DBA board.
Marostegui subscribed.

Not assigning it to me specifically, as anyone could pick this up after the mitigation

  • disk space is still increasing quite rapidly despite shortened retention and daily purging, which suggests we're not going to stay stable for long, given that more data will mean longer purge times.
  • as part of restoring retention, purge time is expected to go up even further.
LSobanski triaged this task as Medium priority. Aug 30 2021, 3:10 PM
Krinkle updated the task description.

Unblocked from perf side per T280606#7323626. Signing over to @Kormat to lead the next steps.

We have some additional margin today even on the old hardware, so we could start ramping up one day at a time now, or we could wait until your team is comfortable taking the old hardware out of rotation. I'll leave that to you.

Let's wait until Editing rolls out the changes to all wikis before doing this.

Krinkle moved this task from Watching to Perf recommendation on the Performance-Team (Radar) board.
Krinkle added a subscriber: Ladsgroup.

[…]

The result of distribution of exptime (per hour) is here:

res.png (480×640 px, 12 KB)

[…] One really interesting observation is the solid hit rate to expiry between 400 and 500 hours. While it's technically 20% of the PC, it provides 34% of the hits.

This probably means we should increase the TTL to thirty days and revisit the value with new sampling.
The optimum seems to be somewhere above 21 days (but hopefully somewhere below 30 days).
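
To make the observation above concrete, the arithmetic can be sketched as follows. The only inputs taken from the comment are the 400 to 500 hour window, the 20% share, and the 34% share of hits; reading "20% of the PC" as a share of cache entries is an assumption, and the function name is hypothetical.

```python
def hit_density(share_of_entries: float, share_of_hits: float) -> float:
    """Hits served per unit of cache share; 1.0 would be exactly average."""
    return share_of_hits / share_of_entries


# Entries whose expiry falls in the 400-500 hour window.
window = hit_density(0.20, 0.34)
# Everything else in the parser cache.
rest = hit_density(1 - 0.20, 1 - 0.34)

# The same window expressed in days.
window_days = (400 / 24, 500 / 24)

if __name__ == "__main__":
    print(f"window density: {window:.3f}")
    print(f"rest density:   {rest:.3f}")
    print(f"window in days: {window_days[0]:.1f} to {window_days[1]:.1f}")
```

Under these assumptions the window is roughly twice as productive per unit of cache as the remainder, and it covers roughly days 17 to 21, which is consistent with the suggestion that the optimum lies above 21 days.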

@Marostegui Afaik we haven't yet ramped retention back up. Our own past analysis and Amir's recent analysis above both lead me to still think this is worthwhile in the long run. Perhaps in January we can give this another go?

Change 877205 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] parsercachepurging.pp: Increase retention back to 30 days

https://gerrit.wikimedia.org/r/877205

Marostegui changed the task status from Open to Stalled. Jan 9 2023, 7:17 PM

Per @Ladsgroup comment on the above patch, we are going to wait a couple of months before pushing it.

Will do it once some clean ups are in place.

We are waiting for some hardware to be installed so we can increase the number of clusters per DC from three to four. Once that has been done and has settled (20 days after the deployment, so that misplaced keys get purged), we should start ramping it up.
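
The settling period mentioned above is plain date arithmetic. A minimal sketch, in which the function name is hypothetical and the deployment date is a made-up example, since the task does not state when the fourth cluster was deployed:

```python
from datetime import date, timedelta


def earliest_ramp_up(deploy_date: date, settle_days: int = 20) -> date:
    """First day on which the ramp-up could start after the settling period."""
    return deploy_date + timedelta(days=settle_days)


if __name__ == "__main__":
    # Hypothetical deployment date, for illustration only.
    print(earliest_ramp_up(date(2023, 6, 1)))
```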

Change 877205 merged by Marostegui:

[operations/puppet@production] parsercachepurging.pp: Increase retention back to 30 days

https://gerrit.wikimedia.org/r/877205

Marostegui added a subscriber: Kormat.

This is done, back to 30 days

Change 979920 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Bump ParserCache TTL back to 30 days

https://gerrit.wikimedia.org/r/979920

There is one thing missing :D I'll get it done.

Change 979920 merged by jenkins-bot:

[operations/mediawiki-config@master] Bump ParserCache TTL back to 30 days

https://gerrit.wikimedia.org/r/979920

Mentioned in SAL (#wikimedia-operations) [2023-12-05T11:30:43Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-05T11:32:01Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-05T11:38:30Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] (duration: 07m 47s)