Yearly endpoint for the /pageviews/top API
Open, Low · Public

Description

I think it would be really interesting to see the most viewed pages for a given year, as a reflection of the major events and topics that brought users to the wiki. Is it feasible to add a yearly endpoint? It seems this would be comparatively inexpensive in terms of storage. Following the pattern of the monthly endpoint, the yearly one could simply be GET /metrics/pageviews/top/{project}/{access}/{year}/all-months/all-days.
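
For illustration, the proposed path could be assembled the same way as the existing monthly one. This is only a sketch: the all-months segment is the proposal in this task and is not something the API serves today, and top_pageviews_path is a hypothetical helper name.

```python
def top_pageviews_path(project, access, year, month="all-months", day="all-days"):
    """Build a /pageviews/top path.

    month="all-months" is the yearly form proposed in this task; it is an
    assumption, not an endpoint that exists today.
    """
    # Numeric months and days are zero-padded; the "all-*" keywords pass through.
    if month != "all-months":
        month = f"{int(month):02d}"
    if day != "all-days":
        day = f"{int(day):02d}"
    return f"/metrics/pageviews/top/{project}/{access}/{int(year)}/{month}/{day}"


# Proposed yearly form, then the monthly form it is modelled on.
print(top_pageviews_path("en.wikipedia", "all-access", 2016))
print(top_pageviews_path("en.wikipedia", "all-access", 2016, month=12))
```

Keeping all-months as the default makes the yearly request the shortest call, while the monthly and daily forms stay reachable through the same helper.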

UPDATE: according to a recent test, our bigger cluster also means this is relatively cheap to compute: T211827#4847998. Specifically, to make this more generic, instead of https://phabricator.wikimedia.org/P7945$14 we could compute total views and total distinct articles per wiki for a specific day or week. Using that and the pigeonhole principle, we should be able to come up with a pretty robust filter to ignore relatively low traffic.
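
One possible reading of the pigeonhole filter, sketched in Python rather than in the query language of P7945: a page that reaches at least min_total views over num_days days must have had at least ceil(min_total / num_days) views on some single day, so pages that never hit that per-day floor can be dropped before the expensive aggregation. The function name, the (page, views) row shape, and the way min_total is chosen are all assumptions here; in practice min_total might be derived from the total views and distinct article counts mentioned above.

```python
from collections import defaultdict
from math import ceil


def yearly_candidates(daily_rows, min_total, num_days):
    """Return total views for pages reaching min_total over num_days days.

    daily_rows is a list of (page, views) pairs, one per page per day.
    Pigeonhole: a page with at least min_total views over num_days days
    must have at least ceil(min_total / num_days) views on some day.
    """
    per_day_floor = ceil(min_total / num_days)
    # Pass 1: keep only pages that hit the per-day floor at least once.
    candidates = {page for page, views in daily_rows if views >= per_day_floor}
    # Pass 2: sum every day's views, but only for the candidate pages.
    totals = defaultdict(int)
    for page, views in daily_rows:
        if page in candidates:
            totals[page] += views
    # Candidates can still fall short of min_total, so check the final sum.
    return {page: total for page, total in totals.items() if total >= min_total}


rows = [("A", 50), ("A", 60), ("B", 1), ("B", 2), ("C", 40), ("C", 5)]
print(yearly_candidates(rows, min_total=100, num_days=2))
```

The filter never discards a page that truly meets min_total, because all of a candidate's days are still summed in the second pass; only pages that provably cannot reach the total are skipped.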

Event Timeline

Restricted Application added a project: Analytics. · Dec 31 2016, 10:13 PM
Restricted Application added a subscriber: Aklapper.
MusikAnimal renamed this task from Most viewed pages of 2016 to Yearly endpoint for the /pageviews/top API.
Shangkuanlc added a subscriber: Shangkuanlc.

I had been working on something similar to this in T139934, but for a monthly endpoint. If the Analytics team approves this idea, I think I can implement it.

Nuria added a subscriber: Nuria. · Jan 10 2017, 4:01 PM

The monthly code still needs to be load tested to make sure we can still deliver data within the SLA. Once we do that, we can see whether that idea can be used elsewhere.

MusikAnimal updated the task description. · Jan 10 2017, 6:01 PM

This is different from the per-article monthly endpoint. The storage needs would indeed be very modest, but with the current pipeline the job to compute the top article for the year would never finish. We tried it when we first launched the API and it just didn't work. We should revisit this when we think about stream processing the webrequest data.

Nuria moved this task from Incoming to Dashiki on the Analytics board. · Jan 23 2017, 4:55 PM
Milimetric triaged this task as Normal priority. · May 8 2017, 2:51 PM
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. · May 16 2017, 12:51 PM
elukey added a subscriber: elukey. · Feb 2 2018, 3:03 PM
4nn1l2 awarded a token. · Jan 4 2019, 7:49 AM
Milimetric moved this task from Deprioritized to Incoming on the Analytics board. · Jan 7 2019, 8:12 PM
Milimetric added a subscriber: Tbayer.

@Tbayer recently showed that performance of this query is pretty good: T211827#4847998, so we should re-consider this. Editing and prepping for grooming again.

Milimetric updated the task description. · Jan 7 2019, 8:14 PM
Milimetric updated the task description. · Jan 7 2019, 8:21 PM

I meant to say this earlier: P7945 did not give me any results for 2016 (adjusting only the year= clause at https://phabricator.wikimedia.org/P7945$10). I tried three times. It worked fine for 2017.

@MusikAnimal that's because the namespace_id field was added later, so the first CTE would just be empty with the >= 100 filter.

Indeed, see also T211827#4822761.

MusikAnimal added a comment. · Edited · Jan 7 2019, 8:44 PM

Eek, so the data I have for 2017 may also be a little off? I see T156993 was resolved in February of 2017.

Nuria added a comment. · Jan 8 2019, 8:38 AM

We can get back to our Pageview API work after we make significant improvements to quality and add new tables to the Data Lake. Moving to Normal priority for Q4, the earliest we could take this work up.

Nuria raised the priority of this task from Normal to High.
Nuria lowered the priority of this task from High to Low.