
Investigate how to make it so that the Pages Created metric remains stable after the event is over
Open, Needs Triage, Public

Description

In the Estimation meeting yesterday, we discussed various ways to address an issue in data accuracy that could occur if the user did not Update event data at or near the close of the event. Namely, if articles that were created during the event get deleted between the event's close and the update, the system will not count them as ever having been created. (I don't know if Pages Improved is subject to the same error.)

We should find a way to make this number remain stable. E.g., this ticket previously suggested querying the archive table instead of revision for deleted pages—see the description of that solution preserved in T216158#5006630.

There are various ways to solve this problem, and there are also a variety of features adjacent to this issue, such as being able to report on a pages-created Survival Rate. While that and other ideas might be desirable, the minimum requirement here is only that fixed event data not be changeable.

Event Timeline

Via @MusikAnimal : This would give you things like "survival rate." This will also tell you if a report has been deleted in the interim. This task fixes issues when you update an event after it is done, or even during.

jmatazzoni subscribed.

I'm rewriting this ticket, as discussed at yesterday's Estimation meeting. For the record, here is the original Description:

If a page is deleted, it should have the same page ID except the revisions will be in the archive table instead of revision table. archive will contain all the same relevant information, so it's a matter of tweaking our queries accordingly.

The first step is probably to introduce something like deletedPageIds (perhaps one for "created" and another for "improved"). This would get set in the same place we set the live page IDs. From there, the other methods can have a bool $archive argument telling it to query archive instead of revision.
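A minimal sketch of that idea, in Python with an in-memory SQLite database rather than the tool's own code. The two tables are heavily simplified stand-ins for MediaWiki's revision and archive tables, and created_page_ids, build_demo_db, and the sample rows are hypothetical names and data made up for illustration. It also assumes that a page creation can be recognized as a revision with a parent ID of 0.

```python
import sqlite3

# Simplified stand-ins for MediaWiki's revision and archive tables (assumption:
# only the columns needed for this illustration are included).
SCHEMA = """
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER,
                       rev_parent_id INTEGER, rev_timestamp TEXT);
CREATE TABLE archive (ar_rev_id INTEGER, ar_page_id INTEGER,
                      ar_parent_id INTEGER, ar_timestamp TEXT);
"""


def created_page_ids(conn, start, end, archive=False):
    """Return IDs of pages whose creating revision falls within [start, end].

    With archive=True the same query runs against archive, i.e. it finds
    pages that were created in the window but have since been deleted.
    """
    if archive:
        table, page, parent, ts = "archive", "ar_page_id", "ar_parent_id", "ar_timestamp"
    else:
        table, page, parent, ts = "revision", "rev_page", "rev_parent_id", "rev_timestamp"
    # Table and column names come from the fixed tuples above, never from user input.
    sql = (
        f"SELECT DISTINCT {page} FROM {table} "
        f"WHERE {parent} = 0 AND {ts} BETWEEN ? AND ?"
    )
    return {row[0] for row in conn.execute(sql, (start, end))}


def build_demo_db():
    """Create a tiny demo database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?)", [
        (1, 10, 0, "20190301120000"),  # page 10 created during the event
        (2, 10, 1, "20190301130000"),  # later edit to page 10
        (3, 11, 0, "20190215000000"),  # page 11 created before the event
    ])
    conn.executemany("INSERT INTO archive VALUES (?, ?, ?, ?)", [
        (4, 12, 0, "20190301140000"),  # page 12 created during the event, then deleted
    ])
    return conn


conn = build_demo_db()
start, end = "20190301000000", "20190302000000"
live_page_ids = created_page_ids(conn, start, end)
deleted_page_ids = created_page_ids(conn, start, end, archive=True)
pages_created = len(live_page_ids | deleted_page_ids)
```

With the deleted IDs included, the count no longer drops when page 12 is deleted after the event. Querying revision alone would report one page created instead of two.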

jmatazzoni renamed this task from "Query archive table instead of revision for deleted pages" to "Investigate how to make it so that the Pages Created metric remains stable after the event is over". Mar 6 2019, 9:32 PM
jmatazzoni updated the task description.

@jmatazzoni We talked about this in the engineering meeting and I will try to explain what the proposed change means from a product standpoint.

  • When we fetch page IDs, we could also look for deleted page IDs. Depending on which metrics you want to remain static, other queries may be slowed down too. Overall I don't think it will be too bad; I'm just making it clear that there is a performance impact.
  • This will allow us to keep figures like "Pages Created" permanent, as you are requesting (there are edge cases, but let's ignore those for now to keep things simple).
  • If you want, we can also expose the survival rate in the Event Summary report, Pages Created, etc.
  • All the other metrics are subject to the same data loss as Pages Created, unless we check the archive table (as proposed). This includes:
    • Edits
    • Byte difference
    • Pages created
    • Pages improved
    • Pageviews -- The average pageviews metric doesn't make sense because we only look at the past 30 days, which may not exist because the page was deleted. Historical data is there, however, so we can say how many times the page was viewed before it was deleted.
    • Files uploaded -- Though we currently are checking the image table. We'd need to start checking the page table, looking within the file namespace. I think that's a good idea regardless, but it's for a separate task.
    • Wikidata items created
    • Wikidata items improved
  • Metrics that will not benefit from this change (meaning if the relevant pages were deleted, we lose the information and it cannot be recovered):
    • Pages using files -- If the file is deleted, all references to it will be removed.
    • Pageviews of pages using files -- Same as above.
    • "Incoming links" in the Pages Created report -- It is common that links to a deleted page will be removed.
  • Important: If there are categories on an event, the proposed change won't help. Once a page is deleted, we can't tell what categories it used to be in. The Pages Created figure along with all the others could not feasibly remain static.
  • It is possible to only keep the "Pages Created" figure static, and let all others change naturally (they would go down if the related pages were deleted).
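To make the "survival rate" idea concrete, here is a rough sketch in Python. The thread does not define the term, so this assumes it means the fraction of pages created during the event that have not been deleted since. The function name and the sample page IDs are hypothetical.

```python
def survival_rate(live_page_ids, deleted_page_ids):
    """Fraction of created pages that still exist, or None if nothing was created.

    Both arguments are sets of page IDs created during the event: those still
    live, and those found only among deleted pages.
    """
    # Union guards against an ID appearing in both sets (e.g. delete then restore).
    total = len(live_page_ids | deleted_page_ids)
    if total == 0:
        return None
    return len(live_page_ids) / total


# Made-up example: four pages created, one of them later deleted.
rate = survival_rate({10, 11, 13}, {12})
```

Under this definition, Pages Created stays fixed at the size of the union, while the survival rate is the part that changes as pages are deleted.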

Consider if we really want some figures to remain static. For instance, "Pages Improved" gives you an idea of the impact an event had. If those pages that were improved are now deleted, maybe those edits didn't really have an impact, in a literal sense.

Finally, the biggest thing to understand is that this is not a trivial change. My guess is 8 points for the whole lot of the above. If we only did "Pages Created", that might bring it down a bit. Please weigh the engineering cost against the benefit accordingly.

Hope this helps!

Thanks for your note @MusikAnimal. We are not in a position to do anything complex here. I'll talk to @Mooeypoo about what, if anything, she feels we need to take care of to avoid any major inaccuracies.