Yearly endpoint for the /pageviews/top API
Open, Low, Public

Description

I think it would be really interesting to see the most viewed pages for a given year, as a reflection of the major events, topics, etc. that brought users to the wiki. Is it feasible to add a yearly endpoint? It seems this would be comparatively inexpensive in terms of storage. Going by the endpoint for monthly stats, the yearly endpoint could simply be GET /metrics/pageviews/top/{project}/{access}/{year}/all-months/all-days.
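To make the proposed path concrete, here is a minimal sketch that builds the existing monthly URL next to the proposed yearly one. The yearly route does not exist yet; the function names and the `BASE` constant are illustrative assumptions, not part of any client library.

```python
# Assumed public base URL of the Wikimedia REST API.
BASE = "https://wikimedia.org/api/rest_v1"


def monthly_top_url(project: str, access: str, year: int, month: int) -> str:
    """URL following the existing monthly pattern (day segment = all-days)."""
    return (
        f"{BASE}/metrics/pageviews/top/"
        f"{project}/{access}/{year}/{month:02d}/all-days"
    )


def proposed_yearly_top_url(project: str, access: str, year: int) -> str:
    """URL for the yearly endpoint as proposed in this task (hypothetical)."""
    return (
        f"{BASE}/metrics/pageviews/top/"
        f"{project}/{access}/{year}/all-months/all-days"
    )


if __name__ == "__main__":
    print(monthly_top_url("en.wikipedia", "all-access", 2016, 12))
    print(proposed_yearly_top_url("en.wikipedia", "all-access", 2016))
```

Reusing the month and day path segments with `all-months` and `all-days` keeps the yearly route structurally identical to the monthly one, so existing clients would only need to change one segment.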

UPDATE: according to a recent test, our bigger cluster also means this is relatively cheap to compute: T211827#4847998. Specifically, to make this more generic, instead of https://phabricator.wikimedia.org/P7945$14 we could compute total views and total distinct articles per wiki for a specific day or week. Using that and the pigeonhole principle, we should be able to come up with a pretty robust filter to ignore relatively low traffic.
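The task does not spell out the filter, so the following is only one possible reading, sketched under assumptions. Given total views V spread over D distinct articles, the pigeonhole principle guarantees that at least one article has at least ceil(V / D) views, so that value is a safe lower bound for the top article. Using a multiple of it as a cutoff for a longer top list is a heuristic, not a guarantee. The function names and the `factor` parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def pigeonhole_floor(total_views: int, distinct_articles: int) -> int:
    """Smallest view count the most-viewed article must reach: ceil(V / D)."""
    if distinct_articles <= 0:
        raise ValueError("distinct_articles must be positive")
    # Ceiling division without floating point.
    return -(-total_views // distinct_articles)


def drop_low_traffic(
    rows: Iterable[Tuple[str, int]],
    total_views: int,
    distinct_articles: int,
    factor: int = 1,
) -> List[Tuple[str, int]]:
    """Keep (title, views) rows at or above factor * ceil(V / D).

    With factor=1 the top article is never dropped; larger factors are a
    heuristic (assumption) and would need checking against real data.
    """
    threshold = factor * pigeonhole_floor(total_views, distinct_articles)
    return [(title, views) for title, views in rows if views >= threshold]


if __name__ == "__main__":
    sample = [("A", 7), ("B", 2), ("C", 1)]
    print(drop_low_traffic(sample, total_views=10, distinct_articles=3))
```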

Event Timeline

Restricted Application added a subscriber: Aklapper.
MusikAnimal renamed this task from Most viewed pages of 2016 to Yearly endpoint for the /pageviews/top API. Jan 2 2017, 11:30 PM

I had been working on something similar to this in T139934, but for a monthly endpoint. If the Analytics team approves this idea, I think I can implement it.

The monthly code still needs to be load tested to make sure we can still deliver data within SLA. Once we do that, we can see whether that idea can be used elsewhere.

This is different from the per-article monthly endpoint. The storage needs would indeed be very modest, but with the current pipeline the job to compute the top article for the year would never finish. We tried it when we first launched the API and it just didn't work. We should revisit this when we think about stream processing the webrequest data.

Milimetric triaged this task as Medium priority. May 8 2017, 2:51 PM
Milimetric added a subscriber: Tbayer.

@Tbayer recently showed that performance of this query is pretty good: T211827#4847998, so we should re-consider this. Editing and prepping for grooming again.

I meant to say this earlier: P7945 did not give me any results for 2016 (adjusting only the year= clause at https://phabricator.wikimedia.org/P7945$10). I tried three times. It worked fine for 2017.

@MusikAnimal that's because the namespace_id field was added later, so the first CTE would just be empty with the >= 100 filter.

Indeed, see also T211827#4822761.

Eek, so the data I have for 2017 may also be a little off? I see T156993 was resolved in February of 2017.

We can get back to our Pageview API work after we make significant improvements to quality and add new tables to the Data Lake. Moving to priority normal for Q4, the earliest we could take this work up.

Nuria raised the priority of this task from Medium to High. Jan 8 2019, 8:38 AM
Nuria moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Nuria lowered the priority of this task from High to Low. Feb 11 2019, 4:47 PM

@VirginiaPoundstonem just pinging you from this task in case you want to prioritize it.

There is one use case in the pageviews tool (yearly top article views) that is the only one requiring manual calculation.
The other use cases get their data from AQS. This one, however, is reported yearly, has not been super high priority, and thus has no AQS endpoint.
So far we've done the data gathering and formatting manually, every January. But it takes an annoying amount of time, since we have to remember
how the query and formatting worked, and the code usually requires changes because of system updates that happened during the year.
For reference, last time @MusikAnimal and I spent about 1-2 days working on this, and adding the AQS endpoint would probably require a couple of weeks of work (3?).