'Avg. daily pageviews' methods seem to be dividing by the wrong number of days
Closed, Resolved · Public · 5 Estimated Story Points · Bug Report

Description

I'm noticing discrepancies in Avg. daily pageviews metrics between our calculation and that of Pageview Analysis. These discrepancies appear in both “Avg. daily views to files uploaded” and the “Avg. daily pageviews” metrics.

The errors seem to be caused by the system using the wrong divisors to get averages. There may, however, be multiple errors in play, as the somewhat contradictory examples below suggest.

NOTE: Leon has changed the spec for various reasons (described in T217704#5006661) so that instead of 30 days, Avg. Daily Pageviews averages over 31 days. For QA purposes, the good thing about this is that it matches what is done in the Pageview Analysis tool.

Examples

Avg. daily pageviews, Q61506256

The Pageview Analysis graph gives a clue as to what may be happening. If you look at it, pageviews occurred in only 3 of the last 30 days. The grand total of 14 pvs / 3 = 4 (ish). So it appears possible we are not dividing by the length of the entire period but only by the number of non-zero days.

Avg. daily views to files uploaded, for event Baldwin

"Baldwin" is a one-minute event with 1 uploaded file

The Uploaded file is placed on two pages.

113 + 2 = 115, which is pretty close to the Event Metrics figure of 114.

In conclusion...

As noted above, it looks like the errors are caused by using the wrong divisor to calculate averages. But the particulars of each case are quite different:

  • In the "Avg. daily pageviews" case, the method appears to be dividing only by the number of non-zero days.
  • In the "Avg. views to files uploaded" case, it looks like the problem is that the method doesn't recognize the page-creation date; it's dividing by the full 30 days, including many days of 0 pageviews—which on the surface looks like the opposite of the example above. (Of course, these examples could be providing red herrings, and it could be something completely different.)

What should happen

What should happen, for the record, is that for each unique article:

  • If the article has existed > 30 days
    • Fetch daily pageviews from the present back 30 days.
    • Total those daily numbers
    • Divide by 30
  • If the article has existed < 30 days
    • Determine how many days the article has existed.
    • Fetch daily pageviews from the present back that many days.
    • Total those daily numbers
    • Divide by the number of days the article has existed to get the daily average.

Then, depending on what metric we're reporting, we either total all the daily averages to get the Event Summary grand total, or report individual daily averages for individual Pages Created or Pages Improved.
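The per-article rule above can be sketched roughly as follows. This is only an illustration of the spec as written here, not the Event Metrics implementation; the function name `avg_daily_pageviews`, its parameters, and the sample data are assumptions made for the example.

```python
from typing import Sequence


def avg_daily_pageviews(daily_views: Sequence[int], days_existed: int, window: int = 30) -> float:
    """Average daily pageviews over the window, or over the article's lifetime if shorter.

    daily_views is ordered oldest to newest and includes zero-view days.
    All names here are hypothetical, chosen for illustration.
    """
    # Divide by the length of the period, never by the number of non-zero days.
    divisor = min(window, days_existed)
    if divisor <= 0:
        return 0.0
    recent = daily_views[-divisor:]
    return sum(recent) / divisor


# 14 views spread over only 3 of the last 30 days: the divisor is still 30.
old_article = [0] * 27 + [5, 5, 4]
# An article that has existed for only 2 days: the divisor is 2.
new_article = [10, 20]

print(avg_daily_pageviews(old_article, days_existed=400))
print(avg_daily_pageviews(new_article, days_existed=2))
```

The Event Summary figure would then be the sum of these per-article averages.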

  • If the article no longer exists, I'm not sure what is possible or if you can figure this out
    • If you can figure out the page is deleted, don't include that page in calculations at all. Or, for metrics that give individual article results, report the average as "n/a" for not applicable.
    • If you can't, then just calculate the page average as above and over time the number will diminish. That's fine.

Event Timeline

jmatazzoni changed the subtype of this task from "Task" to "Bug Report".
MBinder_WMF set the point value for this task to 5. Mar 6 2019, 12:17 AM

Alright, some notes on my pull request (#211):

  1. As you know, it can take up to a full 24 hours for pageviews data to become available. For this reason Pageviews Analysis does not attempt to show data for the current day (UTC, or it should be UTC if it's not...). I have added the same functionality to Event Metrics. The end date is always the previous day, UTC. Keep this in mind when testing.
  2. In our meeting I explained how the "Latest 30" feature of Pageviews Analysis actually shows a total of 31 days (yesterday plus the 30 days before it). I'm proposing we do the same in Event Metrics. The reasoning is how normal addition/subtraction works. Let's say you have Day 5, and you subtract 3 days. 5 - 3 = 2. So you end up with days 2, 3, 4 and 5, which is actually 4 days inclusive, not 3. We could address this by simply subtracting one less, but other tools that work with date ranges (such as Pageviews Analysis, XTools, and even MediaWiki) work in the same way, where the given range is inclusive. I think we should keep it consistent. After all, it's not like it is wrong (just an alternate way of interpreting "latest 30 days"), and users might go to these other tools to follow up on the data, and be confused when they get different numbers. So, going this route our equation is actually total pageviews / 31, not 30, as the date range is inclusive (total of 31 days).
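The two points above (end date is yesterday UTC, and the range is inclusive) can be checked with a small sketch. The helper names `latest_range` and `inclusive_days` are hypothetical and are not taken from the pull request.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Tuple


def latest_range(today_utc: date, days_back: int = 30) -> Tuple[date, date]:
    """Return (start, end): end is yesterday (UTC), start is days_back days before end."""
    end = today_utc - timedelta(days=1)
    start = end - timedelta(days=days_back)
    return start, end


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days in the range, counting both endpoints."""
    return (end - start).days + 1


# "Day 5 minus 3 days" gives day 2, but days 2 through 5 are 4 days inclusive.
print(inclusive_days(date(2019, 3, 2), date(2019, 3, 5)))

# Subtracting 30 days from yesterday therefore covers 31 days.
today = datetime.now(timezone.utc).date()
start, end = latest_range(today)
print(start, end, inclusive_days(start, end))
```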

Does this sound okay? If so, I've got a new PR that I'm quite confident in.

In T217704#5006661, @MusikAnimal wrote:

...So, going this route our equation is actually total pageviews / 31, not by 30, as the date range is inclusive (total of 31 days).

Does this sound okay? If so, I've got a new PR that I'm quite confident in.

That's fine with me. The 30-day number is essentially arbitrary anyway—it's just a number of days that is long enough to eliminate daily or weekly fluctuations. I'll update the Description accordingly. Thanks Leon.

@MusikAnimal, I tried to test this just now and something funny is happening. I created a new testing event, called 1-image event. It includes 1 image that was uploaded to 1 page. (You are an organizer.)

So, what is up? Is your method still not counting days when there are no results?

There was only one page improved, which is the same page where the file is being used. So 903 is correct, no?

(You are an organizer.)

Myself and Dom are admins on staging so no need for this :)

In T217704#5018855, @MusikAnimal wrote:

There was only one page improved, which is the same page where the file is being used. So 903 is correct, no?

Yes, it's right that we have the same figure for the file and the page it's on. That's why I created this event so that it has 1 file on 1 page--it's an easy diagnostic.

But you told me the Event Metrics daily average should match Pageview Analysis, right?

So, 903 is the average you get on Pageview Analysis if you count only 29 days, which happen to be the 29 days that have nonzero results. If you set Pageview Analysis to count 30 days, the average is 873, and if you set it to 31 (the 30-day preset, which you told me our system should match), the average is 845.

Bottom line, it looks suspiciously like the divisor is still wrong; it looks like we're using 29 instead of 31.
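The three averages are consistent with a single total divided by 29, 30, and 31. The total itself is not stated in this thread, so the sketch below back-computes an approximate one from the 903 figure; that value is an assumption, not reported data.

```python
# Assumption: the thread does not give the total pageviews. It is approximated
# here from the reported average of 903 over 29 days.
approx_total = 903 * 29

for divisor in (29, 30, 31):
    # Rounded daily average for each candidate divisor.
    print(divisor, round(approx_total / divisor))
```

Dividing the same total by 29, 30, and 31 reproduces 903, 873, and 845 respectively.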

you told me the Event Metrics daily average should match Pageview Analysis, right?

Yes, only in the sense that it averages over 31 days and not 30. I did not mean to imply we would ignore article age like Pageviews Analysis does.

In T217704#5018913, @MusikAnimal wrote:

...I did not mean to imply we would ignore article age like Pageviews Analysis does.

I don't understand what you mean about article age. The article was created in Nov, long before the report period. Why Feb 9 and 10 have 0 pvs is unclear.

Sorry, I guess I meant "available data". I'm not sure why pageviews are missing prior to February 11, but that isn't our problem :)

In T217704#5018934, @MusikAnimal wrote:

Sorry, I guess I meant "available data". I'm not sure why pageviews are missing prior to February 11, but that isn't our problem :)

Ahhhhhhh, I hadn't realized it wasn't just those two days missing but everything before Feb. 11. In fact, there was a page move on Feb. 11, which would be an edge case we can ignore, I think. So in that case your system is doing the right thing and dividing by the number of days the "page" existed. Thanks! I'll now try a few more tests.

there was a page move on Feb. 11, which would be an edge case we can ignore

Ah yes, I looked for a page move and couldn't find one. Indeed that's it. This is a problem for all the tools that query for pageviews that I think we can also ignore. Page moves/redirects/etc. can be complicated to trace and ensure the pageviews are exact. T159046 is the relevant task for the API itself.

@dom_walden, having worked through the conundrum above, I did another test and it looks great! This event tests the system's ability to detect that a page just came into existence, and that we have only 2 days of data about it.

The event 1 Page, 1 File, American Gods
The page the uploaded file was placed on.

Looking good!

@MusikAnimal Attempting to run Pages Created report for https://eventmetrics-dev.wmflabs.org/programs/101/events/205 and https://eventmetrics-dev.wmflabs.org/programs/56/events/110 I get:

500: Internal Server Error

The server said: DateTime::diff() expects parameter 1 to be DateTimeInterface, null given

I cannot work out why it affects those particular events and not others.

I am guessing the problem is src/AppBundle/Repository/PageviewsRepository.php around line 100. If I have understood it correctly, $lastAvgDate is only set if the API returns results and if one of those results is within the last 31 days.
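The suspected failure mode can be illustrated with a hypothetical sketch. It is not the PageviewsRepository code and does not show what the eventual fix does; the function name `days_with_data` and its parameters are invented for the example. The idea is that the "earliest date with data" may not exist, so the caller needs a guard before computing a date difference.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def days_with_data(views_by_day: Dict[date, int], end: date, window: int = 31) -> Optional[int]:
    """Days from the earliest data point inside the window through end, inclusive.

    Returns None when no data point falls inside the window, so the caller
    can handle that case instead of diffing against a missing date.
    """
    window_start = end - timedelta(days=window - 1)
    in_window = [d for d in views_by_day if window_start <= d <= end]
    if not in_window:
        return None  # Guard: there is no date to diff against.
    return (end - min(in_window)).days + 1


end = date(2019, 3, 18)
print(days_with_data({}, end))                        # no API results at all
print(days_with_data({date(2018, 12, 1): 7}, end))    # results, but none in the window
print(days_with_data({date(2019, 3, 17): 5}, end))    # data starting the day before end
```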

@MusikAnimal Could you check my workings of the figures for https://eventmetrics-dev.wmflabs.org/programs/76/events/259/pages-created?format=wikitext

https://en.wikipedia.org/wiki/Gustavo_Baquero
EventMetrics: 13
Pageviews: 15
My calculations: 14.97 (to 2 decimal places) (my calculations based on calling the API)

The page was created over 31 days ago (5th Feb), hasn't been moved as far as the revision history shows, and the API returns results for each day.

If I have understood the above correctly, in these circumstances EventMetrics should return the same figures as the pageviews tool for the "latest-30".

Similarly, in the same report, https://www.wikidata.org/wiki/Q61506256
Eventmetrics: 1
Pageviews: 0
My calculations: 0.29.

Attempting to run Pages Created report for https://eventmetrics-dev.wmflabs.org/programs/101/events/205 and https://eventmetrics-dev.wmflabs.org/programs/56/events/110 I get:

500: Internal Server Error

The server said: DateTime::diff() expects parameter 1 to be DateTimeInterface, null given

I cannot work out why it affects those particular events and not others.

I am guessing the problem is src/AppBundle/Repository/PageviewsRepository.php around line 100. If I have understood it correctly, $lastAvgDate is only set if the API returns results and if one of those results is within the last 31 days.

You are correct. PR at https://github.com/wikimedia/eventmetrics/pull/219

MusikAnimal Could you check my workings of the figures for https://eventmetrics-dev.wmflabs.org/programs/76/events/259/pages-created?format=wikitext

https://en.wikipedia.org/wiki/Gustavo_Baquero
EventMetrics: 13
Pageviews: 15
My calculations: 14.97 (to 2 decimal places) (my calculations based on calling the API)

The page was created over 31 days ago (5th Feb), hasn't been moved as far as the revision history shows, and the API returns results for each day.

If I have understood the above correctly, in these circumstances EventMetrics should return the same figures as the pageviews tool for the "latest-30".

Similarly, in the same report, https://www.wikidata.org/wiki/Q61506256
Eventmetrics: 1
Pageviews: 0
My calculations: 0.29.

Thanks, this is a bug. The pages created part of the code wasn't enforcing pageviews as of yesterday at 00:00 UTC, like the other parts of the app do.

Fixed with https://github.com/wikimedia/eventmetrics/pull/220

@dom_walden Thanks for identifying those two issues! They should both be fixed now.

I am seeing occasional discrepancies for pages with very sparse pageviews.

E.g. https://eventmetrics-dev.wmflabs.org/programs/76/events/259/pages-created?format=wikitext for item Q61506256, compared to https://tools.wmflabs.org/pageviews/?project=wikidata.org&platform=all-access&agent=user&range=latest-30&pages=Q61506256

In the last 31 days it's had 10 views, and the earliest day it has data for in that time-frame is 27th Feb. Therefore, it is taking that as the earliest date and dividing by ~16 days. So it gets 1 (after rounding).

But it has data going back further than 31 days, so should it just divide by 31 (~0.3)?
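The effect of the two divisors on this item can be worked through directly. The numbers come from the comment above; 16 is the approximate ("~16") day count given there.

```python
views_in_window = 10         # views in the last 31 days
days_since_first_data = 16   # approximate days since the earliest data point (27th Feb)
window = 31                  # full inclusive window

# Dividing by days since the first data point, then rounding to a whole number.
print(round(views_in_window / days_since_first_data))

# Dividing by the full window instead.
print(round(views_in_window / window, 2))
```

The first figure rounds up to 1, while the second stays near 0.3 and would round down to 0.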

The API response does not have entries for dates prior to Feb 27 within the 31-day window, so Event Metrics assumes the item was created on the 27th. We're talking about an average here, so I think this particular scenario is fine -- even if it happens a lot. Regardless, we're running out of time for this project so we might turn a blind eye to trivial differences in pageviews, specifically, unless you suspect there is a larger issue that might be more noticeable and widespread.

In T217704#5034011, @MusikAnimal wrote:

The API response does not have entries for dates prior to Feb 27 within the 31-day window, so Event Metrics assumes the item was created on the 27th....Regardless, we're running out of time for this project so we might turn a blind eye to trivial differences in pageviews.

We're talking about getting the right divisor again. I want to be sure I understand the issue we're proposing to ignore.

Between Feb 27 and the day the report was run there were about 19 days. Pageviews were reported for 8 of those days. Leon, if I understand you, you are saying the system sees Feb 27 as the creation day, meaning it is dividing by 19. It's not dividing by 8, correct?

That would mean the method is smart enough to know that the right divisor is not 8 (that the item existed even on days when no results were reported). But it's not smart enough to know that the item existed before the 27th. In other words, it doesn't have a method for determining the true item-creation date when the data is spotty.

Does that sum it up? If so, I think that we can let this go. The errors introduced will be for items with spotty pageviews and low numbers in general, so should be insignificant in the scheme of things.

@dom_walden, does that sound like sound reasoning?

That would mean the method is smart enough to know that the right divisor is not 8 (that the item existed even on days when no results were reported). But it's not smart enough to know that the item existed before the 27th. In other words, it doesn't have a method for determining the true item-creation date when the data is spotty.

Does that sum it up?

Yes, precisely. Going by available pageviews data (in period) instead of true creation date was discussed somewhere, I don't recall which ticket. It's certainly possible to use the creation date but the expense and engineering effort probably is not worth it. Indeed this should only ever be a problem for articles with low traffic, since anything remotely popular will have at least 1 pageview a day.

The API response does not have entries for dates prior to Feb 27 within the 31-day window, so Event Metrics assumes the item was created on the 27th. We're talking about an average here, so I think this particular scenario is fine -- even if it happens a lot. Regardless, we're running out of time for this project so we might turn a blind eye to trivial differences in pageviews, specifically, unless you suspect there is a larger issue that might be more noticeable and widespread.

I don't know. You see discrepancies for pages with low page views (e.g. wikidata items). As long as we are aware of this.

Taking this into account...

If the article has existed > 30 days
If the article has existed < 30 days

No discrepancies seen for events inside and outside of 31 days (compared to my own scripts which call the API), apart from the fact I think the tool rounds the average page views figure for each article before summing, so rounding errors become more noticeable.
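The rounding effect mentioned above can be shown with a small example. The list of per-article averages is hypothetical (only 14.97 appears earlier in this thread), and whether the tool really rounds before summing is the commenter's guess, not something confirmed here.

```python
# Hypothetical per-article daily averages for one event.
averages = [0.4, 0.4, 0.4, 14.97]

# Round each article's average first, then total them.
round_then_sum = sum(round(a) for a in averages)

# Total the unrounded averages, then round once.
sum_then_round = round(sum(averages))

print(round_then_sum, sum_then_round)
```

With many low-traffic pages, the gap between the two totals grows, because each small average is rounded down to 0 before it can contribute.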

If the article no longer exists, I'm not sure what is possible or if you can figure this out

  • If you can figure out the page is deleted, don't include that page in calculations at all. Or, for metrics that give individual article results, report the average as "n/a" for not applicable.
  • If you can't, then just calculate the page average as above and over time the number will diminish. That's fine.

I don't believe we include deleted files in either the Pages Created report or the summary report.

Where the event ended today, page views show as "0".

jmatazzoni moved this task from Product sign-off to Q3 2018-19 on the Community-Tech-Sprint board.

Thanks for detecting and explaining those possible sources of error Dom. Here's my thinking: No one is making important decisions based on these metrics, nor are they used, for example, for legal or financial purposes. They are designed to give an idea of the impact of event participants' contributions. Keeping that in mind, it seems to me they are accurate enough to fulfill that goal, and the level of error is acceptable. If anyone disagrees, please speak up. Resolving this.