
Don't refresh fixed data if last updated datestamp is after end date of the event
Open, Needs Triage, Public

Description

If the last update of the event is after the end date, we don't need to update some statistics, instead using what is already stored in the database:

  • Participants
  • New editors
  • Pages created
  • Pages improved
  • Edits
  • Bytes changed
  • Uploaded files
  • Wikidata items created
  • Wikidata items improved

The most important are the page IDs, which are used for several of the above metrics.
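The proposed guard amounts to a single timestamp comparison. Here is a minimal Python sketch of the idea (the function and variable names are illustrative only; the actual EventMetrics code is PHP/Symfony and will differ). Note that it compares the last-updated timestamp against the event's end date, not the current date:

```python
from datetime import datetime, timezone

def should_refresh(last_updated: datetime, event_end: datetime) -> bool:
    """Return False when the stored stats are already final.

    If the event was last updated *after* it ended, the stored
    statistics already cover the full date range and can be reused.
    """
    return last_updated <= event_end

end = datetime(2019, 2, 1, tzinfo=timezone.utc)

# Last update was mid-event: a refresh is still needed.
print(should_refresh(datetime(2019, 1, 30, tzinfo=timezone.utc), end))  # True

# Last update was after the event ended: serve stored values.
print(should_refresh(datetime(2019, 2, 3, tzinfo=timezone.utc), end))   # False
```

Because the check uses the last-updated timestamp rather than the current date, the first update that happens after the event ends still refreshes everything one final time; only subsequent updates are skipped.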

Event Timeline

Yes, should be a quick and easy fix too.

This is great. @MusikAnimal question -- for old events, do we want to create a check that redoes this? If we kept fetching page IDs for events that had already ended, we might have extra page IDs in the DB that are redundant and unnecessary for that event. Do we want to add a small check so that, if/when people update older events, we make sure the stored page IDs only contain the pages that are in the range?

I'm not sure if this already happens (that is, do we even fetch anything after the end of the event? Just retention, no?)

I think we constrain everything to the timestamp of the event. But, we unnecessarily re-fetch page IDs after the event is over.

That said, I just realized that T206695: Create a method for 'New page survival rate' and 'Still exists' metrics is probably a prerequisite to T206058: Implement 'Pages Created' downloadable csv. Basically when generating the report, if we can't find anything in page with a given ID, we need to check archive (which should be there, unless it was suppressed). We might want to talk more about that.

Moriel requested I move this to estimation. After (during?) estimation we can decide if this should be Release I or Release II.

@aezell @MusikAnimal I understand that rounding up the page IDs is a foundational task that underlies many metrics we present. But as we've talked about, for each report we create, there is a set of metrics that remain fixed once the event is over. See the lists below for our first two reports.

So my question is, do we want to handle the storing of page IDs completely separately from possibly storing some of these other fixed metrics? Or is it all part of the same project of making this more efficient?

Event Summary report

Remain fixed once event ends

  • Participants
  • New editors
  • Pages created
  • Pages improved
  • Edits
  • Bytes changed
  • Uploaded files
  • Wikidata items created

Continue to develop

  • Views to pages created
  • Views to pages improved
  • Views to uploaded files
  • Pages with uploaded files
  • Uploaded files in use

Pages Created report

Remain fixed once event ends

  • Wiki
  • Creator
  • Edits during event
  • Bytes changed during event

Continue to develop

  • Title
  • URL
  • Pageviews, cumulative
  • Avg. daily pageviews
  • Incoming links
  • More page metrics [URL]

In reference to Joe's question --

For now, we won't change the way we store things; whatever is already being stored will continue to be stored, and whatever needs refreshing of data will be refreshed -- the pageIDs however, will no longer be refreshed after the end of the event. Any method that needs to recalculate will do so based on the existing (non-updated) page IDs.

@jmatazzoni Is this the question we talked about earlier this week? It seems like it. In that case, Moriel said it better than I could have. If you were asking me something different, let me know.

Yes, that is what we were talking about. Moriel and I discussed today, so you don't need to look into this further. We'll wait and see how performance is and look into storing some of the "fixed" data only if necessary.

MusikAnimal renamed this task from "Don't refresh page IDs if current date is after end date of the event" to "Don't refresh page IDs if last updated datestamp is after end date of the event". Feb 7 2019, 12:58 AM
MusikAnimal updated the task description.
MusikAnimal renamed this task from "Don't refresh page IDs if last updated datestamp is after end date of the event" to "Don't refresh fixed data if last updated datestamp is after end date of the event". Feb 12 2019, 7:15 PM
MusikAnimal updated the task description.

I've updated the task to include all the affected stats (based on T210682#4930075).

Note that the stats in the Pages Created report are not stored, so they will be fetched every time you generate that report. It is, however, idempotent -- meaning you should get the same data anyway, because the queries are still limited by the date range of the event.
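The idempotency argument can be sketched as follows. This is a hypothetical, heavily simplified query builder (the real EventMetrics SQL is more involved); the table and column names follow MediaWiki's `revision` schema, where a first revision has `rev_parent_id = 0`:

```python
# Hedged sketch: simplified, hypothetical version of a date-bounded
# "pages created" query. Because the BETWEEN clause pins the result
# to the event's own date range -- not to NOW() -- re-running the
# query after the event ends returns the same rows.
def pages_created_query(event_start: str, event_end: str):
    sql = (
        "SELECT rev_page FROM revision "
        "WHERE rev_parent_id = 0 "  # first revision = page creation
        "AND rev_timestamp BETWEEN %s AND %s"
    )
    return sql, (event_start, event_end)
```

Since the bounds come from the event record rather than the current time, generating the report twice should yield identical data (deleted or suppressed pages aside, as discussed below in this thread).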

In T210682#4948965, @MusikAnimal wrote:

I've updated the task to include all the affected stats (based on T210682#4930075).

No, you have to keep reading. I asked the question in the comment above and @Mooeypoo answered in T210682#4932764, saying

For now, we won't change the way we store things; whatever is already being stored will continue to be stored, and whatever needs refreshing of data will be refreshed -- the pageIDs however, will no longer be refreshed after the end of the event.

Please put the ticket back to just being about page IDs.

What Moriel said is correct. We aren't changing the way we store things. This ticket is purely about avoiding redundant queries (since the data we need is already stored). Or I'm still misunderstanding? After this task is implemented as currently written, the only change you should notice is that post-event updates will be faster.

There is however the issue of what to do when pages were deleted after the event. We talked about this a while back. To put it simply, there currently is no logic to check archived revisions. So keeping the other related stats "fixed" will actually help this problem, since we won't need to check archive because the relevant data (edits, bytes changed, etc.) will already be stored. The "pages created" report will still be broken in this scenario, but this would still be the case if we only did the page IDs thing (as this task was originally written).

I only expanded this task because it came up in https://github.com/wikimedia/eventmetrics/pull/173. We didn't want to forget about it. The implementation is identical, just add the same "is last updated timestamp after end date" check to the relevant methods.
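"Add the same check to the relevant methods" could be factored out as a shared guard. The sketch below is Python with entirely hypothetical names (the real code base is PHP/Symfony, so the actual implementation will look different); it shows one way the guard could wrap each per-stat refresh method:

```python
from datetime import datetime
from functools import wraps

class Event:
    """Minimal stand-in for an event record (hypothetical shape)."""
    def __init__(self, end, last_updated=None):
        self.end = end
        self.last_updated = last_updated
        self.stats = {}  # stored statistics, keyed by method name

def skip_if_final(refresh):
    """Skip the expensive refresh once stats can no longer change."""
    @wraps(refresh)
    def wrapper(event):
        if event.last_updated is not None and event.last_updated > event.end:
            # Last update happened after the event ended: the stored
            # value already covers the full range, so serve it as-is.
            return event.stats.get(refresh.__name__)
        result = refresh(event)
        event.stats[refresh.__name__] = result
        return result
    return wrapper

@skip_if_final
def refresh_edits(event):
    return 42  # placeholder for the real wiki queries

@skip_if_final
def refresh_bytes_changed(event):
    return 1024  # placeholder
```

The same decorator (or, in PHP, a shared early-return check) would then be applied to each of the "fixed once the event ends" stats listed in the description.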

Additionally, to be fully transparent, there are edge cases where edits could later be suppressed or revision-deleted. These edge cases exist now and can skew the data. We need a separate discussion on what to do in that case, but I don't think we need to worry about it right now.

The accuracy of the data is important, but only to the point where a non-wiki-expert human would care. That is, it might be more confusing to try to explain to a user what we mean by "suppressed or revision-deleted" than to say something like "due to the intricacies of how wiki edits work, these numbers have a 3-5 point margin of error."

This is Joe's call but for simplicity of not just the code but also the brains of our users, my opinion is to opt for an approach that acknowledges that edge cases exist without necessarily solving for or explaining exactly what they are.

Sorry for the confusion; the same edge cases can exist (in some form) whether we do only page IDs or not. Let's discuss this more in a meeting.

I will be very happy if we can avoid looking up and recalculating stuff that is fixed once the event is over. As Leon points out, it will not only make the updates faster, it will also make the numbers more accurate (in cases like the one he cites, where an uploaded file was later changed and so had a date outside the event timeframe).

In terms of the pages created report being "broken," @MusikAnimal is presumably referring to the case where a stored page ID refers to a page that was subsequently deleted? (If that is the issue, I'm sure we can find a way to handle such errors....E.g., by listing the page IDs that no longer connect to a title at the bottom of report or even just omitting them....)

That would mean another calculation methodology; it is not trivial to do this. Just for the sake of avoiding complexities right now, I would advise against doing that for the MVP.

If we stop refreshing PageIDs after the event is over, then we incidentally solve for that. That said, I'm concerned about cases where we didn't have actual valid information and now the data we have isn't updated. For example:

  • An event lasts a week - Monday to Friday
  • Event organizer updates on Monday and goes away
  • Event organizer comes back on Saturday and hits "update" after the event ended ---> Data is not updated since Monday

We might want to do a quick calculation about when the event data was last updated and make sure the "last update" is sensible for when the event was running (say, at the last day or something?)

In T210682#4951818, @Mooeypoo wrote:

....That would mean another calculation methodology; it is not trivial to do this. Just for the sake of avoiding complexities right now, I would advise against doing that for the MVP.

I'm not too worried about this, which is not uncommon but more or less an edge case. So if we don't find a way to handle such errors, what would happen in the event, as I said, of a "case where a stored page ID refers to a page that was subsequently deleted"? (As I say, I assume that is what Leon was referring to by "broken," but I might be wrong.) What would the system do?

... I'm concerned about cases where we didn't have actual valid information and now the data we have isn't updated. For example:

  • An event lasts a week - Monday to Friday
  • Event organizer updates on Monday and goes away
  • Event organizer comes back on Saturday and hits "update" after the event ended ---> Data is not updated since Monday

We might want to do a quick calculation about when the event data was last updated and make sure the "last update" is sensible for when the event was running (say, at the last day or something?)

Indeed. The logic is to check whether the last-updated timestamp is after the end date of the event, so the first post-event update will refresh everything one last time.

....That would mean another calculation methodology; it is not trivial to do this. Just for the sake of avoiding complexities right now, I would advise against doing that for the MVP.

I'm not too worried about this, which is not uncommon but more or less an edge case. So if we don't find a way to handle such errors, what would happen in the event, as I said, of a "case where a stored page ID refers to a page that was subsequently deleted"? (As I say, I assume that is what Leon was referring to by "broken," but I might be wrong.) What would the system do?

It's a matter of saying "this page isn't in revision, so use archive instead". Easier said than done, but it shouldn't be too bad. If we don't do anything, it might error out (since the data it wants isn't there), or you just get blank values in the report, not sure.
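The "isn't in revision, so use archive" fallback could look roughly like this. In MediaWiki, a deleted page's revisions move from the `revision` table to `archive` (with `ar_`-prefixed columns); the helper and `db` interface below are hypothetical, and this glosses over the real schema details:

```python
# Hedged sketch: fall back to MediaWiki's `archive` table when a
# stored page ID no longer resolves in `revision`. The `db` object
# and its query_one(sql, params) method are hypothetical.
def fetch_page_data(db, page_id):
    row = db.query_one(
        "SELECT rev_len, rev_timestamp FROM revision WHERE rev_page = %s",
        (page_id,),
    )
    if row is None:
        # Page was deleted: look in archive, keyed by ar_page_id.
        row = db.query_one(
            "SELECT ar_len, ar_timestamp FROM archive WHERE ar_page_id = %s",
            (page_id,),
        )
    return row  # may still be None if the revisions were suppressed
```

Without such a fallback, a deleted page yields no row at all, which matches the "error out or blank values" behavior described above.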

I think what Joe meant is that in case the page is deleted, don't reference the archive table but omit the page in the revision list. Add text to the bottom of the table saying something like: 5 pages are not in this list because they were deleted.

Can do, but I'll note we should be able to get all the same info from archive. Doing this might be just as easy. We need a separate task about that. This one is solely about a performance improvement, methinks.

In T210682#4951881, @MusikAnimal wrote:

Can do, but I'll note we should be able to get all the same info from archive. Doing this might be just as easy. We need a separate task about that. This one is solely about a performance improvement, methinks.

@MusikAnimal, can you say specifically what the new task would be for? Maybe you can write that task?

Highly technical, but here it is: T216158