
Measurement for AfC improvement (April 2018)
Open, Low, Public

Description

Product Analytics --

I have my first request for the team. I am working on a potential improvement to the Articles for Creation (AfC) process in English Wikipedia. You can read about the challenge on the project page, and the likely improvements here on the talk page.

I know which metrics the improvement is aimed at, but I have had a lot of trouble calculating them (and associated reports) myself using the MediaWiki core data. This task is about either (a) building lightweight reporting for AfC and its relevant metrics (lightweight in that it will not need to go up the chain to executives) or (b) helping me do this myself -- whichever turns out to be easier.

Here are the main metrics that we'll be measuring (potentially broken out by some other dimensions):

  • 60-day rolling mainspace rate: for a given cohort of drafts submitted to AfC, what percent of them are in the main namespace 60 days after their submission?
  • 90-day rolling survival rate: for a given cohort of articles that came from AfC, what percent of them are nominated for deletion 90 days after being moved to main namespace?
  • Quality article waiting period: for those drafts that were accepted to main namespace on their first review, how long did they have to wait for that first review after being submitted to AfC?

These can likely all be calculated using templates and categories in the MediaWiki history data, and I have a lot more information (and some code that I've written) that should give us a head start!

The engineering work will likely begin in two weeks and end in six weeks. Therefore, I would like to be able to start producing these numbers within about three weeks. Please let me know how we should proceed, and whether you need additional clarity to inform that decision.

Thank you,

Marshall

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 19 2018, 12:39 AM
Tbayer added subscribers: Halfak, Tbayer. (Edited) · Apr 19 2018, 3:39 AM

Regarding the second metric (survival rate), does it have to be defined via "nominated for deletion", or could it also be based on actual deletions? The latter might be much easier to instrument.

Also, have you already looked at previous AfC research to see whether there may be definitions and/or code that could be adapted? E.g. this by @Halfak and others:
https://meta.wikimedia.org/wiki/Research:AfC_processes_and_productivity (see also review/summary of the corresponding paper),
https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation
(See also the talk page for each [edit to clarify: click "Expand" under "Work log"]. The code there may be outdated in parts; in particular, it predates the introduction of the Data Lake tables. But this research should still be worth a look for comparison.)

@Tbayer -- on the second metric, I don't think it will make a substantial difference to use actual deletions, but we should discuss if there are any pros/cons to that choice.

And yes, @Halfak did point me toward those pages where he has existing code; I forgot to call that out on this task. Thanks for adding it. I haven't had a chance to check out that code yet, but hopefully it will help.

Neil_P._Quinn_WMF triaged this task as High priority.
Neil_P._Quinn_WMF moved this task from Triage to Next Up on the Product-Analytics board.

As promised, here are some very rough estimates of how long these might take:

  • 60-day rolling mainspace rate: for a given cohort of drafts submitted to AfC, what percent of them are in the main namespace 60 days after their submission?

1 day. The mediawiki_page_history table in the Analytics Data Lake contains all the necessary information.
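
For concreteness, a rough, untested HiveQL sketch of this metric. The column names follow the mediawiki_page_history schema as documented on Wikitech, but the snapshot value is a placeholder, and "created in the Draft namespace" is only a proxy for "submitted to AfC" (the very gap discussed further down in this task):

```sql
-- Rough sketch (untested): 60-day mainspace rate for enwiki pages created
-- in the Draft namespace in March 2018. Snapshot value is a placeholder;
-- timestamp parsing may need adjusting to the table's actual string format.
WITH cohort AS (
  SELECT page_id, page_creation_timestamp AS created
  FROM wmf.mediawiki_page_history
  WHERE snapshot = '2018-05'              -- assumed latest snapshot
    AND wiki_db = 'enwiki'
    AND caused_by_event_type = 'create'
    AND page_namespace_historical = 118   -- Draft
    AND page_creation_timestamp LIKE '2018-03%'
)
SELECT COUNT(DISTINCT m.page_id) / COUNT(DISTINCT c.page_id) AS mainspace_rate_60d
FROM cohort c
LEFT JOIN wmf.mediawiki_page_history m
  ON  m.page_id = c.page_id
  AND m.snapshot = '2018-05'
  AND m.wiki_db = 'enwiki'
  AND m.page_namespace_historical = 0     -- a state row where the page is in Main
  -- ...and that state covers the 60-day mark after creation:
  AND unix_timestamp(m.start_timestamp) <= unix_timestamp(c.created) + 60 * 86400
  AND (m.end_timestamp IS NULL
       OR unix_timestamp(m.end_timestamp) > unix_timestamp(c.created) + 60 * 86400);
```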

  • 90-day rolling survival rate: for a given cohort of articles that came from AfC, what percent of them are nominated for deletion 90 days after being moved to main namespace?

1 day if tracking actual deletions rather than deletion nominations is sufficient. In this case, mediawiki_page_history also contains all the data we need.
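
Under the same assumptions as the sketch above, the "1 day" variant might look roughly like this (a real version would also confirm that the state before each move was the Draft namespace):

```sql
-- Rough sketch (untested): share of pages moved into Main in March 2018
-- that were deleted within 90 days of the move. Snapshot is a placeholder.
WITH accepted AS (
  SELECT page_id, start_timestamp AS moved
  FROM wmf.mediawiki_page_history
  WHERE snapshot = '2018-07' AND wiki_db = 'enwiki'
    AND caused_by_event_type = 'move'
    AND page_namespace_historical = 0     -- state after the move is Main
    AND start_timestamp LIKE '2018-03%'
)
SELECT COUNT(DISTINCT d.page_id) / COUNT(DISTINCT a.page_id) AS deleted_within_90d
FROM accepted a
LEFT JOIN wmf.mediawiki_page_history d
  ON  d.page_id = a.page_id
  AND d.snapshot = '2018-07' AND d.wiki_db = 'enwiki'
  AND d.caused_by_event_type = 'delete'
  AND unix_timestamp(d.start_timestamp) <= unix_timestamp(a.moved) + 90 * 86400;
```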

1 week if we want to look for deletion nominations specifically. In this case, we'll have to parse the actual page content using the API or the dumps to look for templates and categories (the templatelinks and categorylinks application tables contain data on current templates and categories, but nothing on historical status changes).
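
To make the limitation concrete, here's a small MariaDB sketch against the replicas. It can only tell us which articles carry a deletion-nomination template right now; treating {{Article for deletion/dated}} as the marker is an assumption to verify:

```sql
-- Sketch (MariaDB, analytics replicas): articles *currently* tagged for AfD.
-- templatelinks only reflects the present revision, so there is no way to ask
-- "was this page nominated at time T?" -- hence the need to parse old content.
SELECT p.page_id, p.page_title
FROM page p
JOIN templatelinks tl
  ON tl.tl_from = p.page_id
WHERE p.page_namespace = 0
  AND tl.tl_namespace = 10
  AND tl.tl_title = 'Article_for_deletion/dated';  -- assumed marker template
```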

@Halfak's code would probably help a lot here, but of course there would be a learning curve in working with it. For me particularly, it would be more significant because I haven't worked much with the dumps and API before.

  • Quality article waiting period: for those drafts that were accepted to main namespace on their first review, how long did they have to wait for that first review after being submitted to AfC?

1 week (which may partly overlap with the 1 week above). In this case, there's no way around manually building a submission history by looking at historical templates and categories, since this metric requires tracking the times of both AfC submissions and reviews.

Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. · Apr 26 2018, 7:58 PM

@Neil_P._Quinn_WMF and I just talked about this, and we have an additional idea.

First, I want to add as additional background that while the three metrics listed in the description of this task are the essentials for measurement in this project, we don't want to end up in a situation where it's difficult to interrogate those metrics to figure out what is causing them to behave in certain ways -- especially unexpected ways. To that end, the ideal data scenario would be to know, for all submitted AfC drafts, where they were in the AfC process at every point, based on what templates and categories they contained.

To that end, the idea is this: in addition to calculating the two metrics above from the namespace history of a page, we would also start storing the templatelinks and categorylinks records for those pages every day in their own tables. Though this dataset would only go back as far as when we start storing the records, it would let us easily look back and investigate the AfC status of a page at any given point.
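
As a rough illustration of what one day's snapshot could look like (the staging table name here is hypothetical, and a parallel query would cover templatelinks):

```sql
-- Sketch: append today's categorylinks state for Draft-namespace pages to a
-- dated snapshot table (afc_categorylinks_snapshot is a hypothetical name).
INSERT INTO staging.afc_categorylinks_snapshot
SELECT CURDATE() AS snapshot_date, cl.cl_from AS page_id, cl.cl_to AS category
FROM categorylinks cl
JOIN page p ON p.page_id = cl.cl_from
WHERE p.page_namespace = 118;  -- Draft
```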

Interested in thoughts from @Tbayer and other product analysts on this.

@Neil_P._Quinn_WMF -- I thought of something that might present a challenge. Using the mediawiki_page_history table, we'll know when pages were created in the Draft namespace and then moved to the main namespace. But we won't know when they were submitted to AfC, because that's indicated by a category placed on the page. We do need the measurements to start from that moment, because AfC isn't reviewing drafts until they are submitted. Do you see what I mean? Would that constraint force us to use the "1 week" version to get the basic metric?

@MMiller_WMF and I just discussed this. Our current understanding:

  • We can't see from the Data Lake when and how often drafts were submitted to AfC, which is necessary to calculate all three metrics Marshall has suggested.
  • The dumps alone don't contain all the necessary information either, because they omit at least the content of deleted pages (which we need to see their templates and categories). @Halfak addressed this in his research by first parsing a dump and then requesting the content of deleted pages from the API (which requires the wmf-researcher user group).
  • Marshall already wrote some code to build an AfC table from the categorylinks and templatelinks tables and ran it for the first day. It's all currently on his laptop :)
  • It looks like @Nettrom may already have the exact kind of AfC table we want (staging.nettrom_afc_submissions on analytics-store). The code to generate it is probably in this GitHub repo.

@Halfak and @Nettrom -- @Neil_P._Quinn_WMF and I are really interested in your recommendations here before we start coding, given your experience in this domain.

I see you're running into some of the same challenges that I had with getting good data on this for ACTRIAL, and that you've found some of the code and data that I have. Since I'm currently working on T192574, there's also some newer code and data available.

The key challenge here is that a draft is submitted to AfC by transcluding the relevant template[1]. This information disappears when either the template or the page containing it is deleted. The former happens after the draft is accepted and moved into Main, and is a natural part of the AfC process. When it comes to page deletions, it seems to me that drafts generally are deleted either quickly (e.g. copyright, vandalism) or around the 6-month mark when they're eligible for G13.

The second part is that drafts tend to be submitted from both the User and Draft namespaces. For ACTRIAL, I chose to ignore the User namespace because I wasn't interested in processing it to find historical template usage; the main problem is that we would need to go through the history of both live and deleted user pages. In that case I could argue it was a reasonable decision, because the landing page directed users to create Draft pages. For this particular study, it might be useful to get some statistics on AfC submissions from each of those namespaces, partly because there are likely differences in acceptance rates between them[2].

The data I have in the staging database is an updated version of the data from ACTRIAL that I'm working on in order to complete T192574. It's the AfC history of all pages created in the Draft namespace between 2014-07-01 and 2018-03-31, and the data gathering was done last week. The code and database schema behind it are both in the ACTRIAL repo that Neil pointed to: afc_draft_predictions.py is the Python code, staging_draft_predictions.sql is the schema. As you can see, there are three tables: nettrom_drafts has page ID, creation timestamp, publication timestamp (if it was moved into Main), and deletion timestamp (if deleted); nettrom_draft_predictions has ORES predictions for the revision that was submitted for review; and lastly there's nettrom_afc_submissions, containing info about each submission -- basically each transclusion of the AfC template.
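
For example, assuming column names like page_id and publication_timestamp (the real ones are in staging_draft_predictions.sql), the acceptance share of submitted drafts could be sketched as:

```sql
-- Sketch: of drafts with at least one AfC submission, what share was
-- eventually published to Main? Column names are assumptions; check
-- staging_draft_predictions.sql for the real schema.
SELECT SUM(d.publication_timestamp IS NOT NULL) / COUNT(*) AS published_share
FROM staging.nettrom_drafts d
WHERE EXISTS (
  SELECT 1
  FROM staging.nettrom_afc_submissions s
  WHERE s.page_id = d.page_id
);
```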

Hope some of this is helpful. I'm not sure how to solve the User drafts problem. Feel free to ask about any of this and I'll do my best to help!

Footnotes:
1: "Template:AFC submission" is typically used. "Template:Articles for creation" is a redirect to it.
2: This difference can lead to long discussions, see for example this thread on the talk page of the ACTRIAL report.

Hey folks. I've been asked to comment. It seems like you have discovered the same problems that I did with measurement work. Morten has offered some relevant notes. I don't think there will be an easy way to get around the issue of deleted pages and historical template usage. In the end, you'll need to process text and, worse, you'll need to process deleted text.

The problem isn't intractable, but we should consider whether we want a one-off solution or some long-term monitoring. If we want long-term monitoring in place, our approach will be very different. Maybe @Milimetric has an idea for how we could turn this type of historical template tracking on deleted pages into something that could be queried more easily.

In a meeting today, @Nuria suggested we look into the following additional options:

Just a quick follow-up here. Since we don't have the bandwidth to do this in a completely thorough way using the MediaWiki dumps, I've set up some scripts to export the relevant parts of the MediaWiki database daily. This should allow us to track the status of drafts over time -- but only starting from when I started running the scripts (see the sketch after this list):

  • I have been exporting the following things daily since 2018-04-29:
    • Drafts awaiting review, along with their categories and templates.
    • Articles that were promoted through AfC that are nominated for deletion.
  • I have been exporting the following things four times daily since 2018-05-31 (this addition gives a fuller picture of a draft's lifecycle):
    • All drafts and their categories, along with their status: unsubmitted, awaiting a first review, awaiting a subsequent review, or declined.
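
To sketch how the waiting-period metric could eventually come out of these exports (the afc_draft_status table and its columns are hypothetical stand-ins for whatever the scripts actually write):

```sql
-- Sketch: approximate first-review waiting time from the periodic status
-- snapshots; resolution is limited to the snapshot interval, and drafts
-- deleted while awaiting review are not counted here.
SELECT w.page_id,
       TIMESTAMPDIFF(HOUR, MIN(w.snapshot_ts), MIN(r.snapshot_ts)) AS wait_hours
FROM afc_draft_status w
JOIN afc_draft_status r
  ON  r.page_id = w.page_id
  AND r.snapshot_ts > w.snapshot_ts
  AND r.status <> 'awaiting_first_review'
WHERE w.status = 'awaiting_first_review'
GROUP BY w.page_id;
```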

I want to leave this task open in case we need to do the more thorough version of these metrics in the future.

I missed this task for a long time, but I just caught up. I consider it very important to parse each revision of each page, including deleted pages, and extract template and category information. The resulting data would be critical to many projects that have become stuck in its absence. Sadly, as always, we have too many things on our plate.

Marshall, regarding your solution: it seems like a good way to move forward. The only nagging question is: is four times daily too slow in some cases? That is, do pages get submitted for review and promoted within the six-hour windows your scripts can't see? If so, we could look at the change-prop.backlinks.continue schema and/or make a new schema and publish what we need via EventBus.

Thanks for bringing this up, @Milimetric. Yes, I think it is possible that the six-hour snapshots will miss some of the action, and I definitely want to capture all of it.

How much is what you're saying related to T186559?

I'm not familiar with the change-prop.backlinks.continue schema or what we would do with EventBus. What are the options there?

Nuria added a comment. · Jun 13 2018, 6:50 PM

I'm not familiar with the change-prop.backlinks.continue schema or what we would do with EventBus. What are the options there?

MediaWiki produces events describing what is happening (like pages being moved from one namespace to another). So if you are looking for changes in some value X, there are (abstractly speaking) two ways to go about it if your data is not in the Data Lake: 1) take snapshots of the application database and check whether X has changed, or 2) send an event from MediaWiki when X changes.

See events being sent by mediawiki right now: https://github.com/wikimedia/mediawiki-event-schemas/blob/master/config/eventbus-topics.yaml

See schemas for those events: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema

Take a look at the event database in Hadoop; the tables prefixed with "mediawiki_job_" or "mediawiki_" contain those events. This will probably help with some of your questions, but it is not going to solve the "deleted page content" problem.
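
For example, page-move events already land in Hadoop, so Draft-to-Main moves can be pulled from the event stream. A sketch, with field names per the mediawiki/page/move schema linked above (worth double-checking against the jsonschema repo):

```sql
-- Sketch (Hive): Draft-to-Main moves on enwiki from the page-move event
-- stream. year/month are the usual event-table partitions.
SELECT meta.dt AS move_time, page_id, page_title
FROM event.mediawiki_page_move
WHERE `database` = 'enwiki'
  AND prior_state.page_namespace = 118  -- Draft
  AND page_namespace = 0                -- Main
  AND year = 2018 AND month = 6;
```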

How much is what you're saying related to T186559?

Only generally. That ask from Diego is more static, getting dumps periodically for analysis. I think if you want to not miss any activity that happens on AfCs, you need to stream changes to categories/templates. I think the schema I linked is a good start, but it might need more contextual data to do what you want.

I'm not familiar with the change-prop.backlinks.continue schema or what we would do with EventBus. What are the options there?

Basically what @Nuria said: EventBus and EventLogging are two ways of doing the same thing, and the Modern Event Platform will essentially unify how we think about this type of data: streaming, schema-backed, easy. I think the next step is to see if any schemas there contain the kind of data you need, or fire events at a point in MediaWiki processing where you *could* get the right data. If so, that would be one step closer to having real-time data that can answer your questions.

Neil_P._Quinn_WMF moved this task from Next Up to Tracking on the Product-Analytics board.
MMiller_WMF moved this task from Tracking to Triage on the Product-Analytics board.

I am assigning this to @Nettrom, as he is the new product analyst for the Growth team.

The most important thing to do in the near term is to make sure that the data currently being recorded will be usable for analysis in the future, and to change any of the recording scripts if needed. @Nettrom, I recommend reading the comments on this task because they reference other potential ways of acquiring the data we need.

MMiller_WMF edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

@Nettrom -- putting this in the "To Do" column of the current Growth sprint because I think you'll probably be able to get started on this in the next few days.

Nettrom moved this task from Triage to Next Up on the Product-Analytics board. · Aug 16 2018, 8:17 PM
nettrom_WMF removed a subscriber: Nettrom.
MMiller_WMF edited projects, added Growth-Team; removed Growth-Team (Current Sprint).

We have decided not to pursue this work for now because of other Growth team priorities. We are still collecting the relevant data.

MMiller_WMF moved this task from Inbox to FY 2019-20 on the Growth-Team board. · Sep 18 2018, 10:31 PM
Neil_P._Quinn_WMF lowered the priority of this task from High to Low. · Sep 20 2018, 8:24 PM
MMiller_WMF moved this task from FY 2019-20 to Revisit on the Growth-Team board. · Oct 3 2018, 6:03 PM

Moving to Revisit, as we continue to collect data, but have not prioritized the analysis.