Create an extension for storing WikiProject article assessment metadata in a database table via a parser function
Closed, Resolved | Public | 13 Estimated Story Points

Description

Right now WikiProject article assessment metadata is all stored in WikiText in templates on article talk pages (although some of it can also be accessed via categories). This means that there are no easy ways to query this data and it has to be aggregated and reported through scripts and bots.

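The proposal is an extension that provides a parser function for the assessment templates to call. Each time such a template is transcluded on an article talk page, the parser function would add an entry to a table like so: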

+-------------------------------------------------------------------------+
| Page       | Namespace  | Project    | Class   | Importance | Revision  |
+-------------------------------------------------------------------------+
| First aid  | 0          | Medicine   | C       | High       | 73264     |
+-------------------------------------------------------------------------+

It should also record a log entry each time an assessment is updated.
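A minimal sketch of how the table and the log-on-update behavior described above might fit together, using Python's sqlite3 purely for illustration. The table names, column names, and the `record_assessment` helper are assumptions, not the extension's actual schema; a real MediaWiki extension would be written in PHP against the wiki's own database.

```python
import sqlite3

# In-memory database for illustration only; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assessments (
        page          TEXT    NOT NULL,
        namespace     INTEGER NOT NULL,
        project       TEXT    NOT NULL,
        quality_class TEXT,
        importance    TEXT,
        revision      INTEGER NOT NULL,
        PRIMARY KEY (page, namespace, project)
    );
    CREATE TABLE assessment_log (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        page          TEXT    NOT NULL,
        namespace     INTEGER NOT NULL,
        project       TEXT    NOT NULL,
        quality_class TEXT,
        importance    TEXT,
        revision      INTEGER NOT NULL
    );
""")


def record_assessment(conn, page, namespace, project, quality_class, importance, revision):
    """Store one rating; write a log row only when class or importance changed.

    Returns True if the rating was inserted or updated, False if it was unchanged.
    """
    old = conn.execute(
        "SELECT quality_class, importance FROM assessments "
        "WHERE page = ? AND namespace = ? AND project = ?",
        (page, namespace, project),
    ).fetchone()
    if old == (quality_class, importance):
        return False  # same rating as before: nothing to store or log
    row = (page, namespace, project, quality_class, importance, revision)
    conn.execute("INSERT OR REPLACE INTO assessments VALUES (?, ?, ?, ?, ?, ?)", row)
    conn.execute(
        "INSERT INTO assessment_log "
        "(page, namespace, project, quality_class, importance, revision) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        row,
    )
    return True
```

Calling `record_assessment(conn, "First aid", 0, "Medicine", "C", "High", 73264)` would store the example row from the table above. Keying the table on (page, namespace, project) is a design choice in this sketch: it allows one rating per WikiProject per page.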

Once this table exists, developers (WMF or volunteer) can create tools that help WikiProjects organize their ratings.

Event Timeline

kaldari raised the priority of this task from to Needs Triage.
kaldari updated the task description. (Show Details)
kaldari subscribed.
DannyH triaged this task as Medium priority. Oct 30 2015, 4:29 PM
DannyH moved this task from New & TBD Tickets to Older: Team Work on the Community-Tech board.
DannyH set Security to None.

If portability between projects is a concern, I would recommend using a term other than "WikiProject," which as far as I know is only used on a handful of Wikimedia projects (and is itself a fairly ambiguous term if you don't already know what it means). I would recommend something like "editorial review".

We should talk to WikiProject (or equivalent) users on non-English Wikipedia sites to see what would work best for them (as far as what metadata to store), but we need to make sure we keep it simple as this could easily spiral into Wikidata-lite.

Assessment for prioritization:
Support: Medium (came out of discussions with WikiProject X, but unclear how much demand there is for this outside of en.wiki)
Feasibility: Medium (fairly clear scope, but involves creating a new extension and probably a new database table)
Impact: High (would replace a lot of existing bot/script functionality like WP1.0 Bot, and would be available to all projects)
Risk: Medium (need to make sure that it will work for all projects, but need to keep scope under control)

Priority: Normal

@Harej: Do you know any community folks who are active in non-English Wikipedia WikiProjects that we could talk to?

@yuvipanda: Do you think that using a parser function (rather than encoding data in the HTML and reparsing it out), is a good solution? I imagine it might have a tiny negative effect on parsing performance, but since assessment templates are only included on talk pages, I don't think anyone would notice.

Looks like a great idea to me. I'm going to describe a niche use-case that is probably out of scope, but I figured it may be valuable to record it in case it's easy to lump on.

I've been building machine learning models to predict quality assessments. I use the current template systems as training data for the models. In order to do this effectively, I need to work out which revision of what page was assessed. This is complex and error-prone due to the inconsistent use of templates and the presence of the template on the talk page rather than on the article itself. There are a surprising number of *broken redirects* that get assessed as Featured Articles in English Wikipedia if you follow that strategy naively. ;)

If your implementation can save assessments historically and include information about the relevant revision of the article at the time of assessment, I would find that very valuable. This wouldn't need to be a substantial part of the feature; it could be done with the logging table. I imagine something like log_type="assessment" and log_action="updated" that would behave like log_type="rights" and log_action="rights" and have the log_page field set to the page that was assessed (not the talk page please).
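A rough sketch of what such a log entry might carry, assuming the field names of MediaWiki's logging table (log_type, log_action, log_namespace, log_title, log_page, log_params). The `build_log_entry` helper and the keys inside log_params are hypothetical. The talk-to-subject mapping relies on MediaWiki's convention that talk namespaces are odd and sit one above their subject namespace.

```python
def subject_namespace(namespace: int) -> int:
    """Map a talk namespace to its subject namespace.

    MediaWiki talk namespaces are odd (1 = Talk, 5 = Project talk, ...),
    and the matching subject namespace is one lower.
    """
    if namespace > 0 and namespace % 2 == 1:
        return namespace - 1
    return namespace


def build_log_entry(talk_namespace, title, article_page_id, article_revision,
                    project, quality_class, importance):
    """Build a log record that points at the assessed article, not its talk page.

    Hypothetical helper: article_page_id and article_revision are assumed to be
    looked up by the caller at the time the assessment is saved.
    """
    return {
        "log_type": "assessment",
        "log_action": "updated",
        "log_namespace": subject_namespace(talk_namespace),
        "log_title": title,
        "log_page": article_page_id,
        "log_params": {
            "project": project,
            "class": quality_class,
            "importance": importance,
            "revision": article_revision,
        },
    }
```

For example, `build_log_entry(1, "First aid", 4242, 73264, "Medicine", "C", "High")` (the page ID 4242 is made up) yields an entry whose log_namespace is 0, the article namespace.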

@kaldari yes, parser functions definitely sound better. I can't think of any reason to use HTML reparsing, other than that Yuvi in 2011 was more stupid...

@Halfak: Good point. I think it would be useful to at least record the revision ID as part of the assessment data. Using the logging table to log changes is also a good idea. I'll update the task description.

Note from the sprint planning meeting: After we finish this task, there are two follow-up tasks -- Adding the tables in production and a security review.

> Note from the sprint planning meeting: After we finish this task, there are two follow-up tasks -- Adding the tables in production and a security review.

There are significantly more steps required to deploy a new extension, see the checklist on https://www.mediawiki.org/wiki/Review_queue

@Harej, I got the basic extension implementation in place, yay. I had a question: how often do people remove the assessment template from a page, so that we'd need to remove the record from the DB? Do you have any idea how significant this problem is?

Right now the extension is triggered every time a page having the assessment parser function is saved, but in order to take care of removed templates, we'd need to trigger the extension on every (Talk) page save.

Another thing, if you have other ideas for other features we could have in this extension, feel free to open tickets so we can work on them as we go along. Thank you!

@NiharikaKohli, regarding removal, I don't know that it happens a lot but it does happen from time to time (usually when a WikiProject is deemed defunct) so it is a situation worth accounting for. In lieu of triggering the extension on every talk page save, could you have a maintenance script that checks talk pages for deleted parser functions?
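A minimal sketch of the comparison such a maintenance script might perform, assuming it can obtain (a) the stored (talk page, project) pairs and (b) the set of projects each talk page currently declares through the parser function. Both inputs and the `find_stale_assessments` name are hypothetical; how they would be gathered is left open here.

```python
def find_stale_assessments(stored, current):
    """Return stored (talk_page, project) pairs that no longer appear on the page.

    stored:  iterable of (talk_page, project) pairs taken from the database.
    current: dict mapping talk_page -> set of projects found in its current text.
             A page missing from the dict is treated as having no assessments.
    """
    stale = []
    for talk_page, project in stored:
        if project not in current.get(talk_page, set()):
            stale.append((talk_page, project))
    return sorted(stale)
```

The returned pairs are the records a cleanup pass would delete. Running this periodically trades some staleness for not having to hook every talk page save.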

For other ideas: The most useful thing for me is the association between pages and WikiProjects. That information is the backbone of the other WikiProject X reports. What I am wondering, though, is if this extension could be used to generate the reports directly. I don't think it would be sensible to bake these reports directly into the extension, but rather, a generic framework for producing reports based on the data.

Some approaches to reports:

  • Intersections between the WikiProject table and a category (WikiProject Chemistry pages in need of Expert Attention, for example)
  • Subsets of a WikiProject table (all WikiProject Chemistry pages missing quality assessments, or for showcasing jobs well done, all the WikiProject Chemistry articles that are featured articles)
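The two report shapes above can be sketched as plain SQL queries. This assumes a simplified assessments table and a simplified stand-in for categorylinks, both with hypothetical column names, and uses made-up sample rows; it only illustrates the kind of query a report framework would need to express.

```python
import sqlite3

# Hypothetical, simplified tables with made-up sample rows.
report_db = sqlite3.connect(":memory:")
report_db.executescript("""
    CREATE TABLE assessments (page TEXT, project TEXT, quality_class TEXT, importance TEXT);
    CREATE TABLE categorylinks (page TEXT, category TEXT);
    INSERT INTO assessments VALUES
        ('Benzene',   'Chemistry', 'FA', 'High'),
        ('Toluene',   'Chemistry', NULL, 'Mid'),
        ('First aid', 'Medicine',  'C',  'High');
    INSERT INTO categorylinks VALUES
        ('Toluene',   'Articles needing expert attention'),
        ('First aid', 'Articles needing expert attention');
""")


def project_pages_in_category(db, project, category):
    """Intersection report: pages of one WikiProject that are also in a category."""
    rows = db.execute(
        "SELECT a.page FROM assessments a "
        "JOIN categorylinks c ON c.page = a.page "
        "WHERE a.project = ? AND c.category = ? ORDER BY a.page",
        (project, category),
    )
    return [page for (page,) in rows]


def project_pages_missing_quality(db, project):
    """Subset report: pages of one WikiProject with no quality assessment."""
    rows = db.execute(
        "SELECT page FROM assessments "
        "WHERE project = ? AND quality_class IS NULL ORDER BY page",
        (project,),
    )
    return [page for (page,) in rows]


def project_pages_with_class(db, project, quality_class):
    """Subset report: pages of one WikiProject at a given quality class."""
    rows = db.execute(
        "SELECT page FROM assessments "
        "WHERE project = ? AND quality_class = ? ORDER BY page",
        (project, quality_class),
    )
    return [page for (page,) in rows]
```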

Building a report generation function directly into this extension makes it easier for others to produce reports; currently, if you want a report, you have to either (a) know that wikiproject.json exists and that you can edit it or (b) get me to do it, and that's only for the specific set of reports my tools make available. Building in report generation also helps this extension realize its potential as an editorial curation tool; not only is the information stored, it can be analyzed and presented in meaningful ways.

@Harej: One of the things we're talking about adding to this extension is an API for querying the data (T119997). That should make report generation a lot easier. We're hoping that the community will use this API to generate a wide variety of different reports. Your input on what types of queries it should support would be useful. So far, we only have 2 use cases listed: Returning all the assessment data for an article, and returning all the assessments for a WikiProject.

Consider also allowing people to add (optionally) the reason (a short text) why they think an article must have assessment X instead of Y.

@He7d3r: That's not a bad idea, although our current plan is to piggyback on the existing assessment templates, which don't have such a parameter (at least on English and French Wikipedias).

I've been thinking about this over the past few days, and I don't think adding a parser function to store this info in a database table is a substantial benefit, given that most of it is already available through categorylinks. A parser function has the downsides of requiring wikitext and complex wrapper templates, meaning that any tool/human that wants to write data or make an assessment still needs to navigate the mess of templates. There's also the whole awkwardness that the parser function goes on the talk page instead of the actual page.

An alternative idea I've been thinking of would be to move the ratings to Wikidata (property like "English Wikipedia WikiProject rating" -> "B-class" qualified by "WikiProject" -> "WikiProject Birds"). A MediaWiki extension might provide some simplified Lua bindings, an api.php query/prop module, and a special page to generate WP:1.0-style reports using the Wikidata query service. It would also provide a JavaScript/OOUI widget of some kind that would let you edit the ratings directly on the Wikipedia itself (like the current sitelinks widget). I think this would give us a lot of flexibility in how the data can be read and written, not locking us into wikitext banner templates or something.

The main problems I foresee with such a solution would be 1) Whether Wikidata would accept this kind of data and 2) Initial import of said data. At least for #2, I can volunteer to help with that :)

As I commented on IRC, it would require Wikidata to play along. This cannot be guaranteed. However, a proposal they might consider would entail a generic “Wikimedia editorial assessment” property, with qualifiers for project (English Wikipedia) and WikiProject (for which there should be Wikidata items).

Another problem is that as a meta property, it may not necessarily belong on a Wikidata item. After all, the Wikidata item for “San Francisco” is not about the Wikipedia article on San Francisco, but is about the city (and county) in California. Items are necessarily tied to concepts, rather than specific articles.

I do know that Legoktm has used Wikidata to power an extension to great success, so it’s certainly something we can consider.

I'm surprised the Wikidata people didn't create a "MetaWikidata" for these purposes.

I am thinking of an extension that would allow sysops and 'quality rating administrators' to edit the quality status of pages; such badges would then be displayed in the 'Other languages' section when the page is linked from another language version.

@Legoktm: The idea of storing assessments in Wikidata has been suggested many times, but consistently rejected by the Wikidata community. The badges feature is for storing a single quality assessment for a page (such as "Good Article"). That's great for interlanguage links but not really useful for WikiProjects (the target user group for this feature). There is no plan for adding WikiProject-level assessments to badges, i.e. triplets of [ project : importance : quality ]. Keep in mind that the importance assessment in particular typically varies by project.

Using the categorylinks is a possibility (that's what WP1.0 bot uses), but it's English Wikipedia specific as you have to know the naming convention for the categories as set by the templates, including which categories are importance related and which are quality related.
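To illustrate why the category route is wiki-specific, here is a small sketch that recovers ratings from category names, assuming the English Wikipedia pattern of names like "C-Class medicine articles" and "High-importance medicine articles". The regular expressions and the `parse_assessment_category` helper are assumptions written for this example; they are not taken from WP1.0 bot.

```python
import re

# Patterns assume the English Wikipedia naming convention only.
QUALITY_RE = re.compile(r"(?P<value>[A-Za-z]+)-Class (?P<topic>.+) articles")
IMPORTANCE_RE = re.compile(r"(?P<value>[A-Za-z]+)-importance (?P<topic>.+) articles")


def parse_assessment_category(name):
    """Return (kind, value, topic) for a recognized category name, else None.

    kind is "quality" or "importance". Any category that does not follow
    the assumed naming pattern is not recognized.
    """
    match = QUALITY_RE.fullmatch(name)
    if match:
        return ("quality", match.group("value"), match.group("topic"))
    match = IMPORTANCE_RE.fullmatch(name)
    if match:
        return ("importance", match.group("value"), match.group("topic"))
    return None
```

A category named under any other convention (for instance a hypothetical "Artikel der Klasse C (Medizin)") returns None, so each wiki would need its own patterns. A parser function sidesteps this by receiving project, class, and importance as explicit parameters.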

I'm open to other ideas for how to accomplish this, but so far I haven't been able to come up with any realistic ideas other than using a template-embedded parser function.

The first iteration for this is basically done, but is awaiting security review.

Adding Job Queue functionality is in T121069. Adding an API is in T119997.

> @Legoktm: The idea of storing assessments in Wikidata has been suggested many times, but consistently rejected by the Wikidata community.

Is there evidence to support the "consistently rejected" claim? I see two discussions on Wikidata, both from March 2014 and both have limited participation. And even in those small discussions, people were somewhat amenable to the idea, but there was a desire to start with badges and then re-evaluate in the future. Are there other discussions/rejections from Wikidata?

@MZMcBride: Even if Wikidata didn't reject storing article assessments, this is still a simpler solution (especially for querying the data), and it doesn't preclude the option of using Wikidata in the future. This is just a simple hack to get the assessment data from the existing templates into a structured and consistent format so that community developers, such as yourself, have easy access to the data. It doesn't require any changes to anyone's workflow, it doesn't require any modifications to existing software, and it only requires a small bit of developer effort to implement (most of which is already done). I'm not sure why you are so opposed to the implementation of this extension, as I really don't see any downside to it. If you really think it would make more sense to store this data in Wikidata, you are welcome to try to convince the community there. The "perfect" solution should not, however, be the enemy of the "good" solution.

@MZMcBride: These discussions are getting hard to keep track of. Please direct additional comments about the architecture, implementation, and/or need for this extension to T120219.

> @MZMcBride: These discussions are getting hard to keep track of.

I agree, though I didn't file all of these tasks, fracturing the discussion.