Yearly endpoint for the /pageviews/top API
Open, Low, Public

Assigned To
None
Authored By
MusikAnimal
Dec 31 2016, 10:13 PM
Referenced Files
None
Tokens
"Like" token, awarded by 4nn1l2. "Like" token, awarded by Shizhao. "Love" token, awarded by Quiddity. "Love" token, awarded by Shangkuanlc.

Description

I think it would be really interesting to see the most viewed pages for a given year, as a sort of reflection of the major events, topics, etc. that brought users to the wiki. Is it feasible to add a yearly endpoint? It seems this would be comparatively inexpensive in terms of storage. Going by the endpoint for monthly stats, the yearly one could simply be GET /metrics/pageviews/top/{project}/{access}/{year}/all-months/all-days.
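A minimal sketch of how a client might build the proposed URL, assuming the yearly route simply reuses the monthly path template with all-months in the month slot. The base URL below is the public Wikimedia REST API root, and yearly_top_url is a hypothetical helper name. The yearly route is only a proposal in this task, so a request to this URL would not necessarily succeed.

```python
from urllib.parse import quote

# Public Wikimedia REST API root (assumed base for this sketch).
BASE = "https://wikimedia.org/api/rest_v1"


def yearly_top_url(project: str, access: str, year: int) -> str:
    """Build the URL for the proposed (not confirmed to exist) yearly top route."""
    # Same segment order as the monthly route, with "all-months" standing in
    # for the month, as suggested in the task description.
    parts = [project, access, f"{year:04d}", "all-months", "all-days"]
    path = "/".join(quote(part, safe="") for part in parts)
    return f"{BASE}/metrics/pageviews/top/{path}"


if __name__ == "__main__":
    print(yearly_top_url("en.wikipedia", "all-access", 2016))
```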

UPDATE: according to a recent test, our bigger cluster also means this is relatively cheap to compute: T211827#4847998. Specifically, to make this more generic, instead of https://phabricator.wikimedia.org/P7945$14 we could compute total views and total distinct articles per wiki for a specific day or week. Using that and the pigeonhole principle, we should be able to come up with a pretty robust filter to ignore relatively low traffic.
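One possible reading of the pigeonhole idea, sketched under loud assumptions: if an article needs at least yearly_floor views in a year to be worth ranking, then on at least one day it must have received at least yearly_floor / days_in_year views. Articles that never reach that daily cutoff can be dropped before the expensive aggregation without losing any article that meets the floor. The function names, the row shape (article, day, views), and the way yearly_floor is chosen are all hypothetical; the task does not say how the floor would be derived from the per-wiki totals, and this is not the query in P7945.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical row shape: (article, day, views), with one row per article per day.
Row = Tuple[str, str, int]


def candidate_articles(rows: List[Row], yearly_floor: int, days_in_year: int = 365) -> Set[str]:
    """Keep articles that reach the pigeonhole cutoff on at least one day."""
    # An article with >= yearly_floor views spread over at most days_in_year
    # days must have some day with >= yearly_floor / days_in_year views.
    daily_cutoff = yearly_floor / days_in_year
    return {article for article, _day, views in rows if views >= daily_cutoff}


def yearly_totals(rows: List[Row], keep: Set[str]) -> Dict[str, int]:
    """Sum views per article, considering only the surviving candidates."""
    totals: Dict[str, int] = defaultdict(int)
    for article, _day, views in rows:
        if article in keep:
            totals[article] += views
    return dict(totals)


if __name__ == "__main__":
    sample = [("A", "d1", 10), ("A", "d2", 10), ("B", "d1", 1), ("B", "d2", 1)]
    keep = candidate_articles(sample, yearly_floor=10, days_in_year=2)
    print(yearly_totals(sample, keep))
```

The cutoff is only safe if days_in_year is at least the number of days actually covered, so a leap year such as 2016 would need 366.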

Event Timeline

Restricted Application added a subscriber: Aklapper.
MusikAnimal renamed this task from Most viewed pages of 2016 to Yearly endpoint for the /pageviews/top API. Jan 2 2017, 11:30 PM

I had been working on something similar to this in T139934, but for a monthly endpoint. If the analytics team approves this idea, I think I can implement it.

The monthly code still needs to be load tested to make sure we can still deliver data within SLA. Once we do that, we can see whether the idea can be used elsewhere.

This is different from the per-article monthly endpoint. The storage needs would indeed be very modest, but with the current pipeline the job to compute the top article for the year would never finish. We tried it when we first launched the API and it just didn't work. We should revisit this when we think about stream processing the webrequest data.

Milimetric triaged this task as Medium priority. May 8 2017, 2:51 PM
Milimetric added a subscriber: Tbayer.

@Tbayer recently showed that performance of this query is pretty good: T211827#4847998, so we should re-consider this. Editing and prepping for grooming again.

I meant to say this earlier: P7945 did not give me any results for 2016 (adjusting only the year= clause at https://phabricator.wikimedia.org/P7945$10). I tried three times. It worked fine for 2017.

@MusikAnimal that's because the namespace_id field was added later, so the first CTE would just be empty with the >= 100 filter.

Indeed, see also T211827#4822761 .

Eek, so the data I have for 2017 may also be a little off? I see T156993 was resolved in February of 2017.

We can get back to our pageviewAPI work after we make significant improvements to quality and add new tables to the Data Lake. Moving to normal priority for Q4, the earliest we could take this work up.

Nuria raised the priority of this task from Medium to High. Jan 8 2019, 8:38 AM
Nuria moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Nuria lowered the priority of this task from High to Low. Feb 11 2019, 4:47 PM