Create a method to fetch page view data
Closed, Resolved | Public | 5 Story Points

Description

We have several reports that request page views and related information.

This task should concentrate on implementing a proper method for fetching views-related data from the relevant API so it can be used in the given reports (either the CSV download, the Wikitext download, or, later, the on-screen report).
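
As a rough illustration of what such a method has to do, the sketch below builds a per-article request URL and pulls the daily counts out of a decoded response. It assumes the "relevant API" is the Wikimedia Pageviews REST API (per-article route). The function names `pageviews_url` and `daily_views` and the sample response are hypothetical, and this is not the code from the eventmetrics repository.

```python
from urllib.parse import quote

# Assumption: the public Wikimedia Pageviews REST API, per-article route.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="user"):
    """Build a daily per-article URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/daily/{start}/{end}"


def daily_views(response):
    """Return the per-day view counts from a decoded JSON response, in order."""
    return [item["views"] for item in response.get("items", [])]


# Hypothetical response fragment with made-up counts, for illustration only.
sample = {"items": [{"timestamp": "2019011100", "views": 5},
                    {"timestamp": "2019011200", "views": 11}]}

print(pageviews_url("en.wikipedia", "Frances Bodomo", "20190111", "20190210"))
print(daily_views(sample))
```

The defaults `all-access` and `user` mirror the platform and agent settings used in the Pageviews tool link quoted later in this thread.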

Pageview metrics defined

For the "Pages Created" reports [metrics defined in T206058]

  • "Pageviews, cumulative"
  • "Avg. daily pageviews"

For the "Pages Improved" reports [metric defined in T210775]

  • "Avg. daily pageviews"

For the "Event summary" report [metrics defined in T205561]

  • "Views to pages created"
  • "Avg. daily views to pages improved"
  • "Plays to uploaded audio/video" (also see T206819, which creates a method for tracking files uploaded to local wikis). [It turns out the API we need for this was not ready as of Jan '19.]
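
To make the two pageview figures concrete, here is a minimal sketch that derives them from a list of daily counts. The exact definitions live in T206058 and T210775, which are not reproduced here, so the averaging rule (total views divided by the number of days in the window, rounded to a whole number) is an assumption, as are the function names and the sample counts.

```python
def cumulative_pageviews(daily_counts):
    """Total views over the whole window."""
    return sum(daily_counts)


def avg_daily_pageviews(daily_counts):
    """Assumed rule: total divided by number of days, rounded to an integer."""
    if not daily_counts:
        return 0  # no data for the window, report zero rather than divide by 0
    return round(sum(daily_counts) / len(daily_counts))


# Made-up daily counts for a three-day window, for illustration only.
counts = [4, 9, 11]
print(cumulative_pageviews(counts))  # 24
print(avg_daily_pageviews(counts))   # 8
```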

Event Timeline

jmatazzoni triaged this task as Medium priority. Oct 15 2018, 9:53 PM
jmatazzoni updated the task description.
jmatazzoni set the point value for this task to 5. Oct 16 2018, 11:13 PM
Samwilson moved this task from Ready to In Development on the Community-Tech-Sprint board.

Do the pageviews for pages created and improved also include Wikidata items and Commons uploads? I assume they do.

Wikidata I'm sure. For uploads, pageviews (to file pages themselves) probably aren't very useful, but let @jmatazzoni decide. Instead we'd want mediacounts. This task mentions getting this data only for playable media (audio/video), but I would side-step that for now. T210313 should happen soon-ish and will provide mediacounts for all media, including static images.

@jmatazzoni confirmed that this is just for pageviews for Wikipedias, not Commons and not Wikidata.

The stat names will be pages-created-pageviews and pages-improved-pageviews.

jmatazzoni updated the task description.

The PR https://github.com/wikimedia/eventmetrics/pull/152 is ready for review.

It does not contain "Plays to uploaded audio/video"; this will come in a subsequent patch.

"Plays to uploaded A/V" can not be done until T198628 is resolved.

jmatazzoni updated the task description.

Code has been merged for the two pageviews stats, and the AV stuff has been removed from this ticket.

This is ready for QA, but the functionality isn't currently exposed in the UI, so QA will have to wait until one of the reports has been implemented.

This can now be reviewed by appending /summary?format=wikitext to the URL of the event summary page. For example https://eventmetrics-dev.wmflabs.org/programs/76/events/134/summary

@Samwilson Could you check the figures for: https://eventmetrics-dev.wmflabs.org/programs/118/events/262/summary

I constructed the event so that it would only cover the creation of the page https://en.wikipedia.org/wiki/Frances_Bodomo and compared the results with the Pageviews tools.

It records 1 created page and 1 improved page (according to the code comments, created pages are also considered improved pages, which may be fair enough).

The cumulative figure ("Views to pages created") is consistent with Pageviews, but the average ("Avg. daily views to pages improved") is 0. According to Pageviews it should be 8:
https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2019-01-11&end=2019-02-10&pages=Frances_Bodomo

I have constructed other events which cover a larger period (including subsequent edits of the created page) and this shows figures for both metrics consistent with Pageviews (https://eventmetrics-dev.wmflabs.org/programs/118/events/264/summary).

Please check my calculations.

@Samwilson should this move back to In Dev?

Yep. I'll have a dig into this later today, maybe add some more tests with the above parameters. Thanks @dom_walden for finding this.

I think the problem here is not with the pages-improved-pageviews stat, but rather with the pages-improved stat. For the period 20170311212100 to 20170311212200 User:Srkrm has a single revision recorded (ID 769825324), and this revision has no parent (i.e. rev_parent_id = 0). So that means that the total pages created is 1, but the total pages improved should actually be 0 — and not 1 as is currently being reported. That's right isn't it? We don't want to treat a page-creation as an edit as well as a creation do we?

Perhaps I'm misunderstanding. Anyway, this is the change that might be required: https://github.com/wikimedia/eventmetrics/pull/183

This was discussed way back in the way back when at T182083#3836513 (December 2017, time flies!). It was decided that a page creation is still an improvement to a page, so it is counted as such. @jmatazzoni Thoughts?

Ah cool, that makes sense.

So does that mean the list of IDs of pages edited should always include everything in the list of IDs of pages created? i.e. this should be modified?

Yes I think so! Surprised that went unnoticed. I think I forgot that Pages Improved should include Created.

In T206817#4950177, @MusikAnimal wrote:

This was discussed way back in the way back when at T182083#3836513 (December 2017, time flies!). It was decided that a page creation is still an improvement to a page, so it is counted as such. @jmatazzoni Thoughts?

Hmm, no, I have always taken it for granted that we would keep Pages Created and Pages Improved separate for purposes of metrics. I think most people would assume that the total of the two would be the total of all pages worked on. Otherwise all the metrics, from bytes changed to pageviews etc., will suffer from similar confusion--you'll always have numbers for Pages Improved that are inflated. This partly comes from the way that organizers work: they have lists of pages to improve and pages to create. Is keeping these separate a problem?

Is keeping these separate a problem?

No, shouldn't be. We actually have been keeping them separate when setting the page IDs (which is used in other queries), but the "Pages improved" count you see on the Event Summary page is including "created". This should be a straightforward fix. Sam's part of the way there with https://github.com/wikimedia/eventmetrics/pull/183

I'm confused! Sorry. :)

a page creation is still an improvement to a page, so it is counted as such.

So that means that when we fetch IDs of pages improved, it should at a minimum include all pages created? At the moment a page can not be in both lists. For example, the test for that method contains this data:

$allPagesExpected     = [368527, 2112961, 368654, 368673];
$pagesCreatedExpected = [368527,          368654, 368673];
$pagesEditedExpected  = [        2112961                ];

i.e. there's no overlap between created and edited.
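
The same property can be checked mechanically. The following is only a sketch that restates the test data above as Python sets (the variable names are transliterations, not identifiers from the repository) and verifies that the two lists are disjoint and together cover every page.

```python
# Page IDs copied from the test data quoted above.
all_pages_expected = {368527, 2112961, 368654, 368673}
pages_created_expected = {368527, 368654, 368673}
pages_edited_expected = {2112961}

# A page may appear in at most one of the two lists.
no_overlap = pages_created_expected.isdisjoint(pages_edited_expected)

# Every page touched is either created or edited.
covers_all = (pages_created_expected | pages_edited_expected) == all_pages_expected

print(no_overlap, covers_all)  # True True
```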

Sorry, disregard me, I just saw the comment on the PR. I'm updating the patch to count correctly.

In T206817#4953119, @Samwilson wrote:

So that means that when we fetch IDs of pages improved, it should at a minimum include all pages created? At the moment a page can not be in both lists....

I'm just checking, since it looks like you didn't see my answer to Leon in T206817#4951457. No, Pages Improved should not include Pages Created. The idea that a page "cannot be in both lists" is correct. Do not change that.

Thanks Joe; I think I'm understanding things now. The patch is updated.

Patch merged. Ready for QA.

| Event | Revisions in event's time span | Pages created | Pages improved | Views to pages created | Avg. daily views to pages improved |
| --- | --- | --- | --- | --- | --- |
| One | Frances_Bodomo created | 1 | 0 | 3,479 | 0 |
| Two | Frances_Bodomo improved | 0 | 1 | 0 | 7 |
| Three | Frances_Bodomo created and improved | 1 | 1 | 3,479 | 7 |

I checked against https://tools.wmflabs.org/pageviews that the total number of views and avg. daily views for that page are correct.

If a page gets moved (by a participant) that would count as 1 page created (the old redirect page, assuming they don't delete it) and 1 page improved (the renamed page).

I could not find appropriate data to see what happens when a page is deleted.

jmatazzoni added a comment. Edited Feb 19 2019, 4:48 PM

In T206817#4964546, @dom_walden wrote:

...If a page gets moved (by a participant) that would count as 1 page created (the old redirect page, assuming they don't delete it) and 1 page improved (the renamed page).

No, as per T206817#4951457, Pages Created and Pages Improved are mutually exclusive. If I create a page and then improve it that does not put that page into the Pages Improved column. All pages that were created during the event (consistent with all filters, e.g. Participants) are Pages Created and only Pages Created. Pages Improved are pages that were edited during the event (consistent with all filters) that either a) existed before the event or b) were created by someone not in the event (so not a Page Created) and then edited during the event.

This is only logical, in my view. I doubt anyone ever creates a page and then never edits it after creation. So if we count those subsequent edits as "improvements", then all Pages Created would also be Pages Improved. Pages Improved seems to me to be a more useful and meaningful category if it is distinct from Pages Created.

Or am I misunderstanding something?
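
The mutually exclusive rule described in this comment can be sketched as follows. This is an illustration, not the eventmetrics implementation: the `Revision` type, `classify_pages` and the IDs are hypothetical, and the input is assumed to be already filtered to the event's time span and participants. It relies on the convention mentioned earlier in the thread that a page-creating revision has rev_parent_id = 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    page_id: int
    rev_parent_id: int  # 0 means this revision created the page


def classify_pages(revisions):
    """Split page IDs into disjoint (created, improved) sets.

    Assumes `revisions` contains only revisions made by participants
    within the event's time span.
    """
    # A page is "created" if any in-scope revision has no parent.
    created = {r.page_id for r in revisions if r.rev_parent_id == 0}
    # Everything else that was edited is "improved"; created pages are
    # removed so that a page can never be in both sets.
    improved = {r.page_id for r in revisions} - created
    return created, improved


# Hypothetical IDs: page 1 is created and then edited again in the same event.
created, improved = classify_pages([Revision(1, 0), Revision(1, 100)])
print(created, improved)  # {1} set()

# The same page edited in a later event, after it already existed.
created_later, improved_later = classify_pages([Revision(1, 100)])
print(created_later, improved_later)  # set() {1}
```

A page created by someone outside the event falls into the improved set automatically under this assumption, because the creating revision is filtered out before classification.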

So how it should behave instead is:

| Event | Revisions in event's time span | Pages created | Pages improved | Views to pages created | Avg. daily views to pages improved |
| --- | --- | --- | --- | --- | --- |
| One | Frances_Bodomo created | 1 | 0 | 3,479 | 0 |
| Two | Frances_Bodomo edited | 0 | 1 | 0 | 7 |
| Three | Frances_Bodomo created and edited | 1 | 0 | 3,479 | 0 |

@jmatazzoni Correct?

If a page gets moved (by a participant) that would count as 1 page created (the old redirect page, assuming they don't delete it) and 1 page improved (the renamed page).

This is a great point. I don't think we're handling redirects at all... probably should exclude those. The renamed page should still have a rev_parent_id = 0 hence it'd be under Pages Created.

So how it should behave instead is:

| Event | Revisions in event's time span | Pages created | Pages improved | Views to pages created | Avg. daily views to pages improved |
| --- | --- | --- | --- | --- | --- |
| One | Frances_Bodomo created | 1 | 0 | 3,479 | 0 |
| Two | Frances_Bodomo edited | 0 | 1 | 0 | 7 |
| Three | Frances_Bodomo created and edited | 1 | 0 | 3,479 | 0 |

@jmatazzoni Correct?

No. There should be 0 Views to Pages Improved, because there is no Page Improved. Only a Page Created. I think that is what Leon is saying as well.

@MusikAnimal, excluding redirects is probably a good idea; do you want to write a task for that?

No. There should be 0 Views to Pages Improved, because there is no Page Improved. Only a Page Created. I think that is what Leon is saying as well.

The start time of the event in the second row is after the time when the page was created. It corresponds to the case where the page "a) existed before the event" that you referred to in T206817#4965299.

Events One and Three cover the period when the page was created.

Oh, I didn't get that those were different events! In that case, yes, we are on the same page.

MusikAnimal, excluding redirects is probably a good idea; do you want to write a task for that?

T216557 :)

jmatazzoni added a comment. Edited Feb 25 2019, 6:23 PM

@dom_walden, same question as on other tickets. We have now seen this in action on the Event Summary report. Should we wait to close it until we see it on Pages Improved and Pages Created? Or should we close it and reopen if we don't like what we see there? (I think I'd prefer not to have it hanging around, if you think it's safe to close.)

jmatazzoni closed this task as Resolved. Feb 25 2019, 11:42 PM
jmatazzoni moved this task from Product sign-off to Q3 2018-19 on the Community-Tech-Sprint board.