Re-evaluate purging strategies
Open, Needs Triage, Public

Description

It seems older articles are getting lost in trending, presumably due to the purging strategy.
We should take a day's dump of events from Wikipedia and work out locally which notable articles are being lost during purging.

Currently when the list of articles exceeds max_pages we purge the older articles.
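To make the concern concrete, here is a minimal sketch of an "oldest first" purge. It is not the service's actual code: the `Page` fields, the `purge_oldest` helper and the use of a first-seen timestamp as the definition of "older" are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    title: str
    first_seen: float  # assumption: minutes since tracking began
    edits: int


def purge_oldest(pages: List[Page], max_pages: int) -> List[Page]:
    """Keep at most max_pages pages, dropping the oldest ones first."""
    if len(pages) <= max_pages:
        return list(pages)
    newest_first = sorted(pages, key=lambda p: p.first_seen, reverse=True)
    return newest_first[:max_pages]


pages = [
    Page("Old but busy", first_seen=0, edits=64),
    Page("Newer", first_seen=500, edits=3),
    Page("Newest", first_seen=800, edits=1),
]
kept = purge_oldest(pages, max_pages=2)
print([p.title for p in kept])  # the busy but old page is the one dropped
```

Under these assumptions the page with the most edits is the first to go, which is the kind of loss this task is meant to investigate.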

Open questions

  • Are certain trending pages getting purged? (Use the result of T159967 to debug)
  • What is a typical max_pages for the period of a day?
  • What is an acceptable/performant value of max_pages?
  • What strategies could we use to ensure max_pages is rarely exceeded? Can we purge under other criteria?
Jdlrobson created this task. Thu, Mar 9, 9:54 PM

Still seeing issues today, and after some further investigation the min speed config value of 0.1 is also to blame. This speed is measured in edits per minute.

Typically a trending article on Wikipedia gets up to 100 edits in a day.

Right now Park Geun-hye should be the top trending article - the page has had 64 edits (16 of those by anons) by 46 editors (16 anon) with 3 reverts since about 14 hours ago (last updated about an hour ago). However it doesn't show up when I query the RESTBase service. If you calculate the speed of this activity - that's 64/(14*60) = 64/840 = 0.076 edits per minute - below our threshold.

If that continues for a 24hr period - 64/1440 = 0.044 edits per minute.
Bumping the threshold down to 0.05 might thus help at the current rate... but the threshold aims to get rid of slowly edited pages - e.g. if a page gets only 5 edits in an hour, we can probably safely drop it after 1hr. If the threshold is 0.05 that page will only get dropped after about 2hrs (5/60 = 0.08, 5/120 = 0.04)... and right now speed is the main thing we purge on.
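The arithmetic above can be checked with a short sketch. The helper names are made up for illustration, and it assumes speed is simply total edits divided by minutes elapsed since the first edit, which may not match how the service computes it.

```python
def edits_per_minute(edits: int, minutes: float) -> float:
    """Average speed: edits divided by elapsed minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return edits / minutes


def minutes_until_below(edits: int, threshold: float) -> float:
    """Elapsed minutes after which a page that receives no further edits
    falls below the threshold (edits / minutes < threshold)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return edits / threshold


print(round(edits_per_minute(64, 14 * 60), 3))  # 64 edits over 14 hours
print(round(edits_per_minute(64, 24 * 60), 3))  # the same 64 edits over 24 hours
print(round(minutes_until_below(5, 0.1)))       # slow page, threshold 0.1
print(round(minutes_until_below(5, 0.05)))      # slow page, threshold 0.05
```

With a threshold of 0.1 the 5-edit page drops out after roughly 50 minutes; at 0.05 it lingers for roughly 100 minutes, which is the trade-off described above.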

In Weekipedia what I do to circumvent this issue is mark articles as safe when they reach a certain threshold. I advise we do the same here - basically if a page gets a certain amount of edits we only purge it if it's become inactive or old.
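A rough sketch of what such a "safe" rule could look like follows. Every name and number here (`safe_edits`, `max_idle`, `max_age`, the `TrackedPage` fields) is hypothetical and is not taken from Weekipedia or the trending service.

```python
from dataclasses import dataclass


@dataclass
class TrackedPage:
    title: str
    edits: int
    first_edit: float  # minutes, hypothetical clock
    last_edit: float   # minutes, same clock


def should_purge(page: TrackedPage, now: float, min_speed: float = 0.1,
                 safe_edits: int = 50, max_idle: float = 120,
                 max_age: float = 1440) -> bool:
    """Decide whether to drop a page. All thresholds are illustrative."""
    age = now - page.first_edit
    idle = now - page.last_edit
    if page.edits >= safe_edits:
        # "Safe" pages are only purged once inactive or old.
        return idle > max_idle or age > max_age
    # Other pages are purged on speed, as they are today.
    speed = page.edits / max(age, 1.0)
    return speed < min_speed


busy = TrackedPage("Park Geun-hye", edits=64, first_edit=0, last_edit=780)
slow = TrackedPage("Slow page", edits=5, first_edit=780, last_edit=830)
print(should_purge(busy, now=840))  # False: safe and edited an hour ago
print(should_purge(slow, now=840))  # True: 5 edits in 60 minutes is below 0.1
```

The point of the design is that speed stays the criterion for pages that never took off, while pages that clearly did are judged on inactivity and age instead.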

Still seeing issues today, and after some further investigation the min speed config value of 0.1 is also to blame. This speed is measured in edits per minute.

Typically a trending article on Wikipedia gets up to 100 edits in a day.

Right now Park Geun-hye should be the top trending article - the page has had 64 edits (16 of those by anons) by 46 editors (16 anon) with 3 reverts since about 14 hours ago (last updated about an hour ago). However it doesn't show up when I query the RESTBase service. If you calculate the speed of this activity - that's 64/(14*60) = 64/840 = 0.076 edits per minute - below our threshold.

Is it possible that the page is in memory, but just not in the top 20 returned results?

If that continues for a 24hr period - 64/1440 = 0.044 edits per minute.
Bumping the threshold down to 0.05 might thus help at the current rate... but the threshold aims to get rid of slowly edited pages - e.g. if a page gets only 5 edits in an hour, we can probably safely drop it after 1hr. If the threshold is 0.05 that page will only get dropped after about 2hrs (5/60 = 0.08, 5/120 = 0.04)... and right now speed is the main thing we purge on.

Perhaps then the activity should be calculated per hour regardless of the retention interval? The per-hour values could then be stored in memory and the pages ordered by them.
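As a sketch of that idea, assuming each page keeps a list of its edit times in minutes (the data layout and function names are hypothetical), per-hour activity and the resulting ordering could look like this:

```python
from typing import Dict, List


def edits_in_last_hour(edit_minutes: List[float], now: float) -> int:
    """Count edits made in the 60 minutes before `now`."""
    return sum(1 for t in edit_minutes if 0 <= now - t < 60)


def rank_by_hourly_activity(pages: Dict[str, List[float]], now: float) -> List[str]:
    """Order page titles by edits in the last hour, busiest first."""
    return sorted(pages,
                  key=lambda title: edits_in_last_hour(pages[title], now),
                  reverse=True)


pages = {
    "A": [0, 10, 20],            # active two hours ago, quiet since
    "B": [100, 110, 115, 119],   # active in the last hour
}
print(rank_by_hourly_activity(pages, now=120))
```

Because the window is fixed at one hour, the ranking does not depend on how long a page has been retained.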

In Weekipedia what I do to circumvent this issue is mark articles as safe when they reach a certain threshold. I advise we do the same here - basically if a page gets a certain amount of edits we only purge it if it's become inactive or old.

Sounds like a good idea.

Change 342285 had a related patch set uploaded (by Jdlrobson):
[mediawiki/services/trending-edits] WIP: Do not purge articles which have trended

https://gerrit.wikimedia.org/r/342285

Change 342285 merged by Ppchelko:
[mediawiki/services/trending-edits] Do not purge articles which have trended

https://gerrit.wikimedia.org/r/342285

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:30:30Z] <ppchelko@tin> Started deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:38:28Z] <ppchelko@tin> Finished deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127 (duration: 07m 57s)

Change 343780 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/trending-edits/deploy] Set 'trends_at' property for the purge_strategy

https://gerrit.wikimedia.org/r/343780

Change 343780 merged by Ppchelko:
[mediawiki/services/trending-edits/deploy] Config: Set 'trends_at' property for the purge_strategy

https://gerrit.wikimedia.org/r/343780

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:47:55Z] <ppchelko@tin> Started deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:54:16Z] <ppchelko@tin> Finished deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127 (duration: 06m 20s)