Re-evaluate purging strategies
Closed, Resolved · Public

Description

It seems older articles are getting lost in trending, presumably due to the purging strategy.
We should take a day's dump of events from Wikipedia and work out locally which notable articles are being lost during purging.

Currently when the list of articles exceeds max_pages we purge the older articles.
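
For illustration only, the current behaviour could be pictured with a sketch like the one below. It assumes "older" means least recently edited and that pages are tracked in a simple mapping of title to last edit time; the function name `purge_oldest` and the data layout are hypothetical, not taken from the trending-edits service.

```python
from typing import Dict


def purge_oldest(last_edit_minute: Dict[str, float], max_pages: int) -> Dict[str, float]:
    """Keep only the max_pages most recently edited pages (assumed meaning of 'older')."""
    if max_pages < 0:
        raise ValueError("max_pages must not be negative")
    if len(last_edit_minute) <= max_pages:
        # Under the limit: nothing is purged.
        return dict(last_edit_minute)
    # Sort newest first and drop everything beyond the limit.
    newest_first = sorted(last_edit_minute.items(), key=lambda kv: kv[1], reverse=True)
    return dict(newest_first[:max_pages])


# Three tracked pages with a limit of two: the least recently edited page is dropped.
tracked = {"A": 10, "B": 50, "C": 30}
print(sorted(purge_oldest(tracked, 2)))  # ['B', 'C']
```

Under this assumption a notable article that was busy early in the day but quiet recently is the first to go, which is the kind of loss described above.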

Open questions

  • Are certain trending pages getting purged? (Use the result of T159967 to debug)
  • What is a typical max_pages for the period of a day?
  • What is an acceptable/performant value of max_pages?
  • What strategies could we use to ensure max_pages is rarely exceeded? Can we purge under other criteria?
Jdlrobson created this task. Mar 9 2017, 9:54 PM
Restricted Application added a subscriber: Aklapper. Mar 9 2017, 9:54 PM
Jdlrobson moved this task from Backlog to Next up on the Trending-Service board. Mar 9 2017, 11:46 PM

Still seeing issues today, and after some further investigation the min speed config value of 0.1 is also to blame. This speed is measured in edits per minute.

Typically a trending article on Wikipedia gets up to 100 edits in a day.

Right now Park Geun-hye should be the top trending article - she's had 64 edits (16 of those by anon, updated about an hour ago) by 46 editors (16 anon) with 3 reverts (since about 14 hours ago). However she doesn't show up when I query the RESTBase service. If you calculate the speed of this activity - that's 64/(14*60) = 64/840 ≈ 0.076 edits per minute - below our threshold.

If that continues for a 24hr period - 64/1440 ≈ 0.044 edits per minute.
Bumping the threshold down to 0.05 might help with the 14-hour figure (though not with the 24hr one)... but the threshold aims to get rid of slow edits - e.g. if a page gets only 5 edits in an hour, we can probably safely drop it after 1hr. If the threshold is 0.05 that page will only get dropped after 2hrs (5/60 ≈ 0.08, 5/120 ≈ 0.04)... and right now speed is the main thing we purge on.
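
To make the arithmetic above easy to re-check, here is a minimal Python sketch. It assumes speed is simply total edits divided by elapsed minutes, as in the calculation above; the function names `edits_per_minute` and `minutes_until_below` are illustrative and not part of the trending-edits service.

```python
def edits_per_minute(edits: int, elapsed_minutes: float) -> float:
    """Average edit speed, assuming speed = edits / elapsed minutes."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return edits / elapsed_minutes


def minutes_until_below(edits: int, min_speed: float) -> float:
    """Minutes after which a page with a fixed edit count falls to min_speed."""
    if min_speed <= 0:
        raise ValueError("min_speed must be positive")
    return edits / min_speed


# Park Geun-hye example: 64 edits over about 14 hours, then over a full day.
print(round(edits_per_minute(64, 14 * 60), 3))  # 0.076, below the 0.1 threshold
print(round(edits_per_minute(64, 24 * 60), 3))  # 0.044

# A page with only 5 edits: when does it reach each threshold?
print(round(minutes_until_below(5, 0.1), 1))   # 50.0
print(round(minutes_until_below(5, 0.05), 1))  # 100.0
```

Checked at whole-hour marks this matches the figures above: a 5-edit page is still above 0.05 at 60 minutes and below it by 120 minutes.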

In Weekipedia what I do to circumvent this issue is mark articles as safe when they reach a certain threshold. I advise we do the same here - basically if a page gets a certain number of edits we only purge it if it's become inactive or old.
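
As a rough sketch of the "mark as safe" idea: all names here (`Page`, `SAFE_EDITS`, `MIN_SPEED`, `MAX_IDLE_MINUTES`, `MAX_AGE_MINUTES`) and the threshold values are hypothetical, chosen only for illustration. This is not the code in Weekipedia or in the trending-edits service.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
SAFE_EDITS = 20          # edits after which a page is treated as "safe"
MIN_SPEED = 0.1          # edits per minute
MAX_IDLE_MINUTES = 180   # no edits for this long counts as inactive
MAX_AGE_MINUTES = 1440   # tracked for longer than this counts as old


@dataclass
class Page:
    title: str
    edits: int
    first_edit_minute: float
    last_edit_minute: float


def should_purge(page: Page, now_minute: float) -> bool:
    """Decide whether a page is purged, exempting 'safe' pages from the speed check."""
    age = now_minute - page.first_edit_minute
    idle = now_minute - page.last_edit_minute
    if page.edits >= SAFE_EDITS:
        # Safe pages ignore speed; purge only when inactive or old.
        return idle > MAX_IDLE_MINUTES or age > MAX_AGE_MINUTES
    # All other pages are still purged on speed.
    speed = page.edits / max(age, 1.0)
    return speed < MIN_SPEED


busy = Page("Park Geun-hye", edits=64, first_edit_minute=0, last_edit_minute=780)
slow = Page("Quiet page", edits=5, first_edit_minute=0, last_edit_minute=30)
print(should_purge(busy, 840))  # False: slow on average, but safe and recently edited
print(should_purge(slow, 120))  # True: 5/120 is below the speed threshold
```

With these made-up values, the 64-edit page survives at the 14-hour mark even though its average speed is under 0.1, while a 5-edit page is still dropped on speed.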

Still seeing issues today, and after some further investigation the min speed config value of 0.1 is also to blame. This speed is measured in edits per minute.

Typically a trending article on Wikipedia gets up to 100 edits in a day.

Right now Park Geun-hye should be the top trending article - she's had 64 edits (16 of those by anon, updated about an hour ago) by 46 editors (16 anon) with 3 reverts (since about 14 hours ago). However she doesn't show up when I query the RESTBase service. If you calculate the speed of this activity - that's 64/(14*60) = 64/840 ≈ 0.076 edits per minute - below our threshold.

Would it be possible that that page is in memory, but just not in the top 20 returned results?

If that continues for a 24hr period - 64/1440 ≈ 0.044 edits per minute.
Bumping the threshold down to 0.05 might help with the 14-hour figure (though not with the 24hr one)... but the threshold aims to get rid of slow edits - e.g. if a page gets only 5 edits in an hour, we can probably safely drop it after 1hr. If the threshold is 0.05 that page will only get dropped after 2hrs (5/60 ≈ 0.08, 5/120 ≈ 0.04)... and right now speed is the main thing we purge on.

Perhaps then the activity should be calculated per hour regardless of the retention interval? Those could then be stored in memory and ordered by that.
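
One way to picture this suggestion is the sketch below. It assumes each page keeps a list of edit timestamps in minutes; the function names and the data layout are hypothetical and only meant to illustrate counting activity over the last hour and ordering by it.

```python
from typing import Dict, List, Tuple


def edits_in_last_hour(edit_minutes: List[float], now_minute: float) -> int:
    """Count edits whose timestamp falls within the 60 minutes before now."""
    return sum(1 for t in edit_minutes if now_minute - 60 < t <= now_minute)


def rank_by_hourly_activity(
    pages: Dict[str, List[float]], now_minute: float
) -> List[Tuple[str, int]]:
    """Return (title, edits in the last hour) pairs, busiest first."""
    counts = [(title, edits_in_last_hour(ts, now_minute)) for title, ts in pages.items()]
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Page A has been tracked for much longer, but page B is busier right now.
pages = {
    "A": [10, 700, 800, 830],
    "B": [790, 795, 800, 805, 810],
}
print(rank_by_hourly_activity(pages, now_minute=840))  # [('B', 5), ('A', 2)]
```

Because the window is always one hour, the ordering does not depend on how long a page has been retained.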

In Weekipedia what I do to circumvent this issue is mark articles as safe when they reach a certain threshold. I advise we do the same here - basically if a page gets a certain number of edits we only purge it if it's become inactive or old.

Sounds like a good idea.

Change 342285 had a related patch set uploaded (by Jdlrobson):
[mediawiki/services/trending-edits] WIP: Do not purge articles which have trended

https://gerrit.wikimedia.org/r/342285

Change 342285 merged by Ppchelko:
[mediawiki/services/trending-edits] Do not purge articles which have trended

https://gerrit.wikimedia.org/r/342285

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:30:30Z] <ppchelko@tin> Started deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:38:28Z] <ppchelko@tin> Finished deploy [trending-edits/deploy@5d3eb7f]: Do not purge articles that have trended T160127 (duration: 07m 57s)

Change 343780 had a related patch set uploaded (by Ppchelko):
[mediawiki/services/trending-edits/deploy] Set 'trends_at' property for the purge_strategy

https://gerrit.wikimedia.org/r/343780

Change 343780 merged by Ppchelko:
[mediawiki/services/trending-edits/deploy] Config: Set 'trends_at' property for the purge_strategy

https://gerrit.wikimedia.org/r/343780

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:47:55Z] <ppchelko@tin> Started deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127

Mentioned in SAL (#wikimedia-operations) [2017-03-20T22:54:16Z] <ppchelko@tin> Finished deploy [trending-edits/deploy@e4fa9b8]: Config: Set up 'trends_at' property T160127 (duration: 06m 20s)

Change 345780 had a related patch set uploaded (by Jdlrobson):
[mediawiki/services/trending-edits/deploy@master] Drop min-edits value

https://gerrit.wikimedia.org/r/345780

Change 345780 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy@master] Drop min-edits value

https://gerrit.wikimedia.org/r/345780

Mentioned in SAL (#wikimedia-operations) [2017-03-31T15:49:19Z] <mobrovac@tin> Started deploy [trending-edits/deploy@26b5eb4]: Config change: lower min_edits to 15 T160127

Mentioned in SAL (#wikimedia-operations) [2017-03-31T15:55:56Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@26b5eb4]: Config change: lower min_edits to 15 T160127 (duration: 06m 37s)

Jdlrobson removed a project: Patch-For-Review. Edited Apr 4 2017, 10:22 PM
Jdlrobson added a subscriber: Pchelolo.

Still not sure exactly what's going on here, but I'm not seeing a significant number of pages that I feel I should be seeing. The prototype and the production version give vastly different results for all time periods. Despite clearly being one of the topics of the day, "2017 Khan Shaykhun chemical attack" doesn't appear in the production endpoint at all and I have no idea why.

Let's see if there are complaints in the log thanks to https://gerrit.wikimedia.org/r/#/c/342288/
If not, we're going to need the debugging URLs (T159967). @Pchelolo maybe we could spend several hours working out the issue here, as right now I'm a little stumped about what's going on.

Was in "Needs testing" on Trending-Service workboard as of 04/25/2017.

I think after our debugging session with @Jdlrobson earlier this month everything is fine here? Can you confirm and close the ticket, @Jdlrobson?

Jdlrobson closed this task as Resolved. Apr 26 2017, 3:47 AM
Jdlrobson claimed this task.

After deploying some debugging we were able to confirm that max_pages does not get exceeded, so that's good.
The debugging tools provided in T159967 surfaced bugs in the score calculation and the filtering (see https://gerrit.wikimedia.org/r/#/c/346644/ and https://gerrit.wikimedia.org/r/#/c/346616/).