Create a method for 'New page survival rate' and 'Still exists' metrics
Open, Needs Triage, Public


Organizers and their sponsors and partners want to understand the impact and quality of articles created during an event. Organizers who are focused on creating content also have a desire to know precisely which of the articles created during an event have been deleted, so they can attempt to rescue those articles.

This metric will be used (initially) in the Event Summary (CSV only, not Wikitext; T205561) and Pages Created (T206058 and T205502) reports. The data will be somewhat different in each of the reports.

  • In Event Summary reports: here, the metric will be the overall "New Page Survival Rate" for the whole event, expressed as a percentage. E.g., if 100 articles were created during the event and 15 deleted afterwards, the New Page Survival Rate is 85%.
  • In Pages Created reports: this report is a list of articles created. So the metric here will be an indication, for each individual article, of whether it "Still Exists?". Possible answers = "yes" or "deleted".
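For illustration, the arithmetic behind both forms of the metric can be sketched in Python. This is a hypothetical helper, not EventMetrics code; the function name and input shape are assumptions made for the example:

```python
# Illustrative sketch (not EventMetrics code): computing the event-level
# "New Page Survival Rate" and the per-page "Still Exists?" values from a
# mapping of created pages to their deletion status.

def survival_metrics(pages_deleted: dict[str, bool]) -> tuple[str, dict[str, str]]:
    """Return (overall survival rate, per-page "Still Exists?" values).

    `pages_deleted` maps each page created during the event to True if it
    has since been deleted.
    """
    total = len(pages_deleted)
    surviving = sum(1 for deleted in pages_deleted.values() if not deleted)
    # Event Summary report: overall rate as a percentage ("n/a" if no pages).
    rate = f"{100 * surviving / total:.0f}%" if total else "n/a"
    # Pages Created report: per-page indicator, "yes" or "deleted".
    still_exists = {
        title: "deleted" if deleted else "yes"
        for title, deleted in pages_deleted.items()
    }
    return rate, still_exists
```

With 100 created articles of which 15 were deleted, this yields the 85% figure from the example above.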

Proposed method and its strengths/weaknesses

  • Based on the Archive log: After discussing various approaches, the method we've selected uses the Archive log. This has certain strengths and weaknesses.
  • Won't work with Category filter: if the organizer has used the Category filter to define the event, no Survival Rate figures will be possible.
    • In such cases, the system will post an answer of "n/a" (not available); i.e., we will not simply omit the column from spreadsheets or the screen display.
  • Will track articles deleted during the event: If an article is created during the event and then deleted before the end of the event, we can still track that article.
  • Metric can be run at any time: This metric will be available during the time period of the event (i.e., organizers will not have to wait until the event is over to see the metric).
  • Metric continues to develop: The number of articles created during an event is fixed once the event ends. However, the metric for survival rate can continue to go up or down, as created articles are either deleted or restored.

Event Timeline

Method—make a record at close of event: To track pages created and lost with perfect accuracy, we would have to make a record of all existing articles every time the data was updated and keep all the diffs for comparison. This is not practical for our system. Instead, we propose to automatically trigger a data update at the close of the event period and save a snapshot of extant articles at that time. All future statistics and reports will be compared to that.

@Mooeypoo @MusikAnimal I was thinking about this a bit. If we are saving the page titles, we might as well do it every time someone updates data, because we are fetching them every time anyway. We can update the stored titles array with new titles we find in the update without deleting any already in the list. That will allow us to be more accurate about survival rate. Thoughts?

Yeah if we do store titles (IDs, rather), we'd probably want to do something like that. But for now I'd assume we won't be tracking pages in this way. It'd be a huge infrastructural change and there are also concerns with storage.

@Mooeypoo and I were just discussing this with @jmatazzoni. I think we were both confused about exactly what we will or will not store.

So, we should come to a consensus here quickly as this work will be starting soon.

Can you describe the "infrastructural change" you foresee? I'm also curious about storage. Are there known/published limitations of the sizes of the MySQL DBs we are using?

I'm a little confused, too! =P I think I may not have read the task description in full, my apologies. It makes it pretty clear that we want to store some things. The second bullet says "Save as much data as possible...". Unless you want to create a chart of how figures changed during the event, most of these things don't require any sort of tracking. "Article class" and "incoming links" would be the two where historical data does not already exist. Note also that "Words added during event" can't apply to already-deleted pages because we can't see the content.

On the surface, storing page IDs/titles, for every event, forever and always, feels wrong to me. I like the single isolated queries that are easily testable. That being said, as an aside, having the page IDs on hand would seemingly make all the other queries go faster, which is definitely appealing!

"Infrastructural change" may be the wrong term, but implementation would involve storing/retrieving the page IDs, rewriting all the other queries to use them (we can't JOIN on our db and the replicas, so you know), and figuring out how test it. I'm not even sure how much it will improve accuracy. It varies by project, but on English Wikipedia I suspect many articles are deleted as "no indication of importance" (super common), which qualifies for speedy deletion. The likelihood they click on "Update data", or even with our automated system (T189911), during this short time seems slim. Then you have figures like "words added during event". If we want to know these numbers on a per-page basis, we need to store them associated with each individual page. That multiplies our storage needs and would be difficult to implement.

So allow me to propose an alternative MVP. Checking the archive table to see deleted pages created by the users in the given timeframe is easy. It's true this won't work when we need to go by a category, but it should otherwise be accurate. So really I see two distinct tasks -- add survival rate (with the caveat that this doesn't work for categories) and improving accuracy/performance by storing page IDs. I think we can do these separately, and focus on pushing out the slightly less accurate MVP. It's the difference between a few 8-point tasks and a single 3-pointer.
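The archive-table check described here might look roughly like the following sketch. It is illustrative only: it assumes the pre-actor-migration `ar_user_text` column and uses `ar_parent_id = 0` to identify page-creating revisions; the current MediaWiki schema and the actual EventMetrics implementation may differ:

```python
# Rough sketch of the proposed MVP query against the replicas (assumptions:
# the pre-actor-migration `ar_user_text` column exists, and `ar_parent_id = 0`
# marks the revision that created a page). Not actual EventMetrics code.

def archived_pages_query(usernames: list[str], start: str, end: str) -> tuple[str, list]:
    """Build a parameterized query for pages created by the given users
    during the event window that now sit in the archive table
    (i.e., have been deleted)."""
    placeholders = ", ".join(["%s"] * len(usernames))
    sql = (
        "SELECT DISTINCT ar_namespace, ar_title "
        "FROM archive "
        f"WHERE ar_user_text IN ({placeholders}) "
        "AND ar_parent_id = 0 "  # first revision, i.e., the page creation
        "AND ar_timestamp BETWEEN %s AND %s"
    )
    return sql, [*usernames, start, end]
```

Parameterizing the user list and timestamps keeps this a single isolated, testable query, in line with the MVP framing above.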

I don't know about any enforced size limitations on Toolforge DBs, but assuming we'd only be storing page IDs (no associated per-page data), I was thinking we might store an md5 hash of all the page IDs for an event?
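That hashing idea could be sketched like this (illustrative only; one caveat worth noting is that a single digest can tell us the set of page IDs changed, but not which pages were lost):

```python
import hashlib

# Sketch of the idea above: one md5 digest over the sorted page IDs for an
# event. Sorting first makes the digest independent of fetch order.
# Caveat: a hash can detect that the set changed, but cannot recover
# which pages were deleted.

def page_ids_digest(page_ids: list[int]) -> str:
    joined = ",".join(str(pid) for pid in sorted(page_ids))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```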

Thanks for the explanation. That helps a great deal.

I don't think I'd considered the issue of the data living both in our DB and in the replicas before. I guess I knew it worked that way, but I hadn't really visualized it yet. That presents a decent-sized hurdle for a lot of the performance improvements we'd consider. It seems that no matter how we improve performance, it's going to require us to store more data.

I like the idea you have for the MVP. It seems like a decent compromise so we can get this out in the world and start getting user feedback.

I have rewritten the Description of this task above to base this metric on the Archive log method we talked about. Under this scheme, it won't be available with Categories, but we don't have to do the more difficult snapshot thing. I am therefore putting it in for estimation.

@MusikAnimal @Mooeypoo, please read through the Description. Does it agree with your understanding of what we decided? If not (if you have, in fact, decided to do the superior but more difficult snapshot method), please take this out of estimation and let's talk.

jmatazzoni renamed this task from Put 'New page survival rate' into 'Event Summary' and 'Pages Created' reports to Create a method for 'New page survival rate' and 'Still exists' metrics.Nov 29 2018, 10:02 PM

@jmatazzoni Is this an epic? It's in the epics column.

No, this should be in backlog. Thanks.