Compute the trending articles over a period of 24h rather than 1h
Closed, ResolvedPublic

Description

We have deployed the trending edits service with only a one-hour window for computing the articles that trend. Given that replaying the last hour takes less than a second and has had no significant impact on the SCB cluster. It should be safe to bump it up to 24h.

Related: T156680

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a project: Analytics. ยท View Herald TranscriptJan 26 2017, 7:51 PM
Restricted Application added a subscriber: Aklapper. ยท View Herald Transcript
mobrovac changed the title from "Compute the tending articles over a period of 24h rather than 1h" to "Compute the trending articles over a period of 24h rather than 1h".Jan 26 2017, 7:51 PM

Change 334432 had a related patch set uploaded (by Mobrovac):
Config: Bump max_age to 24h (1440 mins)

https://gerrit.wikimedia.org/r/334432

Change 334432 merged by Ppchelko:
Config: Bump max_age to 6h (360 mins)

https://gerrit.wikimedia.org/r/334432

Change 334434 had a related patch set uploaded (by Mobrovac):
Config: Bump max_inactivity to 24h (1440 mins)

https://gerrit.wikimedia.org/r/334434

Change 334434 merged by Mobrovac:
Config: Bump max_inactivity to 24h (1440 mins)

https://gerrit.wikimedia.org/r/334434

Mentioned in SAL (#wikimedia-operations) [2017-01-26T22:04:46Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@e0e32bb]: Bump replay time to 6h for T156411

Mentioned in SAL (#wikimedia-operations) [2017-01-26T22:06:28Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@e0e32bb]: Bump replay time to 6h for T156411 (duration: 01m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-01-27T21:27:12Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@e0e32bb]: Restart the service to assess the load of replaying the last 6h T156411

Mentioned in SAL (#wikimedia-operations) [2017-01-27T21:28:16Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@e0e32bb]: Restart the service to assess the load of replaying the last 6h T156411 (duration: 01m 03s)

Change 334718 had a related patch set uploaded (by Mobrovac):
Config: Bump max_age to 12h (720 mins)

https://gerrit.wikimedia.org/r/334718

Change 334718 merged by Mobrovac:
Config: Bump max_age to 12h (720 mins)

https://gerrit.wikimedia.org/r/334718

Mentioned in SAL (#wikimedia-operations) [2017-01-27T21:34:48Z] <mobrovac@tin> Starting deploy [trending-edits/deploy@0e79bec]: Bump max_age to 12h T156411

Mentioned in SAL (#wikimedia-operations) [2017-01-27T21:36:46Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@0e79bec]: Bump max_age to 12h T156411 (duration: 01m 58s)

Nuria moved this task from Incoming to Radar on the Analytics board.Jan 30 2017, 5:17 PM

Change 335083 had a related patch set uploaded (by Mobrovac):
Config: Bump max_age to 18h (1080 mins)

https://gerrit.wikimedia.org/r/335083

Change 335083 merged by Mobrovac:
Config: Bump max_age to 18h (1080 mins)

https://gerrit.wikimedia.org/r/335083

Mentioned in SAL (#wikimedia-operations) [2017-01-30T20:43:00Z] <mobrovac@tin> Started deploy [trending-edits/deploy@9addcd0]: Bump max_age to 18h for T156411

Mentioned in SAL (#wikimedia-operations) [2017-01-30T20:45:40Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@9addcd0]: Bump max_age to 18h for T156411 (duration: 02m 39s)

Before we continue bumping this to 24hrs I'd like to suggest we address T156666 and T156680 after reviewing the impact these changes are having.

The results on
https://trending.wmflabs.org/wiki/Special:Feed/enwiki/beta
are now no longer as fresh as the initial use case expressed by the apps team.

Before we continue bumping this to 24hrs I'd like to suggest we address T156666 and T156680 after reviewing the impact these changes are having.

I fail to see the connection between this task and the ones you mention (especially T156680). Mind elaborating?

@mobrovac sure. Previously we only replayed only one hour of edits. The benefit of doing this was that if you made a request to see what was trending it would always show you pages that had only been edited in the last hour.

However, now we have bumped it up to 18hours - the meaning of the service has changed - it now returns all articles edited within the last 18 hours. The downside of this is that edits weigh heavily - so if an article was edited say 400 times starting 18 hours ago, an article which has been edited 30 times in the last 30 minutes, is suppressed even though it is extremely significant (someone may have just died for example).

I originally built the service on the assumption we would only have 1 hours worth of edits, but now we have almost 24 we need to ensure there's a way to get more fresher content. The main motivation of building this service after all was that the pages listed in most read were 24hrs old and we wanted something more recent.

(I should add that we do decay scores over time using a half life of 1.5hrs so this does help somewhat but I'm not 100% confident right now to what extent)

I see, thank you @Jdlrobson for the explanation. It sounds like we need considerable revamps in order to progress on this front. However, I am not sure we should bring back max_age to 1h because bumping it up is what prompted this reflection and discussion. Would you agree?

It sounds like we need to decouple the amount of data that we are ingesting from the algorithm itself. Which I think is what @Jdlrobson is getting to with this ticket: T156680

In light of this, it seems like we shouldn't bring max_age back (as @mobrovac is suggestion) but instead allow consumers to specify a window as a parameter and use that in place of "max_age" for the calculation (as @Jdlrobson is suggesting).

This can satisfy a few different use cases.

@mobrovac any concerns here with caching (or anything else) if we allow a parameter?

The results on https://trending.wmflabs.org/wiki/Special:Feed/enwiki/beta are now no longer as fresh as the initial use case expressed by the apps team.

@Jdlrobson this is intentional as an outcome of the meeting we had last week as we want to look at a 24 hour period and evaluate. I assume some tweaking of the algorithm + a larger window will help us hone in on better results.

Jdlrobson moved this task from Backlog to Next up on the Trending-Service board.Feb 1 2017, 9:40 PM

I'm not sure whether intentional is the right word.. but it's something I observed. We are losing freshness by bumping this - so the API result value proposition does change. I'd like to preserve both bits of value - not replace one with another - T156680 should take care of this easily.

Adding the extra parameter shouldn't be a problem from the caching perspective, as we would still just cache the response for 30 minutes as we currently do. However, service load is a concern. By allowing clients to specify max_age, we are effectively increasing the volume of requests that reach the service itself, and at the same same increase the amount of computations the service has to do per request. If most clients use (regularly) the same max_age then that shouldn't be a big concern in practice.

Clients wouldn't specify the value of max_age in config. They would specify a period to filter by (see my patch for T156680).

Yup, lapsus linguae, but the concern still stands.

Right now I envisioned this as an integer but we could use an enumerator instead to limit the options. Right now the main motivation is to just provide the 24hr period and recent (e.g. 1-3hrs)

Fixed integers in the (0,24] range would be better. the more options we introduce, the more we are fragmenting the cache.

I'm ok with integers. The ones we use will be limited in practice and I wouldn't expect the load to change significantly - we just need ways for the clients to experiment with the output while in beta.

I'm ok with just selecting a few for now - AND we can absolutely lock this down even further after the beta testing if we want.

Also to preserve the existing functionality, should we automatically redirect routes that don't pass an integer (trending/edits) and make them "3" hours (trending/edits/3)?

I'm ok with integers. The ones we use will be limited in practice and I wouldn't expect the load to change significantly - we just need ways for the clients to experiment with the output while in beta.

I'm ok with just selecting a few for now - AND we can absolutely lock this down even further after the beta testing if we want.

This is great news, thank you for the explanation, @Fjalapeno !

Also to preserve the existing functionality, should we automatically redirect routes that don't pass an integer (trending/edits) and make them "3" hours (trending/edits/3)?

If we model the route as /trending/edits{/period}, we could set the default period = 3 for clients that don't provide it. However, for any other (invalid) value, I'd prefer to error out, as it gives a clear message to the clients. If we were to just enforce a value in such a case, clients might not know our assumptions and use or present the results in a wrong way.

Done. I've updated the patch and changed period to filter to be more generic. Right now it only takes the value recent. Any other value will throw a 400. I hope that works.

Feel free to continue bumping this up after T156680 is resolved

@mobrovac can we safely bump this up to 24 hours now?

Jdlrobson edited the task description. (Show Details)Mar 6 2017, 6:30 PM

Will try to get it scheduled for this week. Apologies for the delay.

mobrovac changed the status of subtask T156687: Delay repooling trending service after a restart from "Open" to "Stalled".Mar 7 2017, 11:57 PM

Change 341715 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Bump trending period to 24h

https://gerrit.wikimedia.org/r/341715

Change 341715 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy] Bump trending period to 24h

https://gerrit.wikimedia.org/r/341715

Mentioned in SAL (#wikimedia-operations) [2017-03-08T00:16:25Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@88e2f74]: Deploy changes for T156666 T156680 T159486 T156411 (duration: 06m 58s)

mobrovac closed this task as "Resolved".Mar 8 2017, 12:36 AM

Deployed, resolving

Note, until the period parameter is available this is likely to render only things edited within the last hour:
https://trending.wmflabs.org/wiki/Special:Feed/enwiki/beta
If I understand correctly @Pchelolo needs to update REST to expose the halflife parameter - https://en.wikipedia.org/api/rest_v1/feed/trending/edits/24 - at which point we'll be able to see things based on edits per day.

Yup, making a PR as we speak.

If all works correctly I'd expect to see 2017 BNP Paribas Open โ€“ Men's Singles in the api response for the last 24 hours.

We may need to revisit the config values for max_pages
The fact I'm seeing older articles with lots of edits disappear and I'm curious if max_pages is to blame.

Do we have any logging in place that tells us how often we exceed max_pages?
max_pages is meant as a last resort, as it can delete legitimate trending articles so if we're hitting this at all that's bad and we need to find a better purge strategy.

We may need to revisit the config values for max_pages
The fact I'm seeing older articles with lots of edits disappear and I'm curious if max_pages is to blame.

It would be nice if we could determine this accurately instead of shooting blanks of the type maybe it's this.

Do we have any logging in place that tells us how often we exceed max_pages?
max_pages is meant as a last resort, as it can delete legitimate trending articles so if we're hitting this at all that's bad and we need to find a better purge strategy.

Hehe, you are the author of the service, you should know the answer to that ;)

@Jdlrobson What you could do for quick testing of ideas and outputs is the following:

  1. Export all en-wiki edits from production kafka to some file
  2. Set up kafka locally on your machine and produce all the edits in there
  3. Instruct the service to not commit anything and always start from the beginning of the topic

That way you would be able to quickly replay last week of edits and play with parameters locally instead of doing blind shots in production hoping to figure out the right one. I can assist you setting up such a system if you want.

Right now the fact is that older articles are not showing up in results.

Theres only two reasons that can happen - it's either purged or it's score is incorrect. This is not shooting blanks.

The debug task i have opened will help with checking the scoring as will @Pchelolo suggestion (but i will need some help with that).

With regards to the purge question you can best help me understand this - what was your metric you used to bump the value? I assume it was memory consumption and processing and we didn't pay attention to size of pages? If so we were not looking at all the things we should have been. Tbh when we added max_pages as we were only targeting a 1hr period i didnt think that would be ever invoked but requirements change.

I'll also be able to get a quick answer to the purge question after getting a local dump of events as suggested.

There are 10 edits/second across all wikis. If we assume we are tracking all of them (which we are not) and that all of them are edits of different pages (which they are not), we come to 36k pages per hour, or 864k pages per day. Since max_pages = 5000 seemed to have worked well for 1h (I assume this, but am not entirely confident in this statement, TBH), we might want to consider setting max_pages = 20000.

Change 342085 had a related patch set uploaded (by Mobrovac):
[mediawiki/services/trending-edits/deploy] Bump max_pages to 20k

https://gerrit.wikimedia.org/r/342085

@mobrovac that sounds like a good starting point and wouldn't hurt. I'll let you know when I have some more concrete data around how purging is performing and where it can be improved.

Change 342085 merged by Mobrovac:
[mediawiki/services/trending-edits/deploy] Bump max_pages to 20k

https://gerrit.wikimedia.org/r/342085

Mentioned in SAL (#wikimedia-operations) [2017-03-09T21:54:48Z] <mobrovac@tin> Started deploy [trending-edits/deploy@57a654e]: Bump max_pages for T156411

Thanks for the discussion - I've captured all this in a spike - T160127.

Mentioned in SAL (#wikimedia-operations) [2017-03-09T22:00:54Z] <mobrovac@tin> Finished deploy [trending-edits/deploy@57a654e]: Bump max_pages for T156411 (duration: 06m 07s)

I have bumped max_pages, let's see if that helps. The memory has increased but the number of pages in memory seems to have stayed the same after the deploy. Let's monitor this for a while and see.