Investigate RESTBase as a possible storage solution for wikitext "errors" and issues that are found by Parsoid
Closed, Declined (Public)

Description

T48705 is a proposal to expose information available in Parsoid so that editors can start fixing up pages. As part of the GSoC project, Hardik prototyped the solution with a MongoDB-based store in Labs. We had initial conversations with the WikiProject Check Wikipedia folks about how to integrate this with their workflows, but we didn't get very far at the time, and the project has languished since, waiting for someone to resolve this.

On the Parsoid end, the linting code has been around for a long time now but has not been enabled because of this last-mile problem of where to dump this info. One possibility is for Parsoid to dump this information in RESTBase and let bots and other Check Wikipedia tools use it to start fixing things. This ticket is to explore that possibility.

If this seems feasible, we need to develop an API to add / fetch / purge / de-dupe these entries. We also need to figure out consistency requirements and a storage schema that enables this API. We might be able to base this on what Hardik used for the MongoDB instance.
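To make the add / fetch / purge / de-dupe operations concrete, here is a minimal in-memory sketch. It is not the RESTBase API and not the schema Hardik used; the class names, the field names, and the choice to de-dupe on the full tuple of fields are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LintEntry:
    # All field names are hypothetical; the real schema is still to be decided.
    wiki: str        # wiki identifier
    page: str        # page title
    revision: int    # revision in which the issue was found
    category: str    # kind of issue reported by the linter
    start: int       # source offset where the issue starts
    end: int         # source offset where the issue ends


class LintStore:
    """Toy store keyed by (wiki, page); stands in for a real backend."""

    def __init__(self) -> None:
        # A dict is used as an ordered set, so identical entries collapse.
        self._by_page: Dict[Tuple[str, str], Dict[LintEntry, None]] = {}

    def add(self, entry: LintEntry) -> bool:
        """Store an entry; return False if an identical one already exists."""
        bucket = self._by_page.setdefault((entry.wiki, entry.page), {})
        if entry in bucket:
            return False
        bucket[entry] = None
        return True

    def fetch(self, wiki: str, page: str,
              category: Optional[str] = None) -> List[LintEntry]:
        """Return entries for a page, optionally filtered by category."""
        bucket = self._by_page.get((wiki, page), {})
        return [e for e in bucket if category is None or e.category == category]

    def purge(self, wiki: str, page: str, before_revision: int) -> int:
        """Drop entries from revisions older than before_revision."""
        bucket = self._by_page.get((wiki, page), {})
        stale = [e for e in bucket if e.revision < before_revision]
        for e in stale:
            del bucket[e]
        return len(stale)
```

In this sketch, purging by revision is one possible answer to the consistency question: when a page is re-parsed at a newer revision, entries from older revisions are dropped.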

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description.
ssastry added projects: Parsoid, RESTBase.
ssastry added subscribers: ssastry, tstarling.

In the context of T120256: Add tracking category to pages that generate empty <li> elements, something like this might be a way to do a similar thing on the Parsoid end instead of adding a tracking category in the MySQL table. But this wouldn't be as immediately usable by editors whose workflow is based on tracking categories. So that part needs thinking through.

It would be really handy to see this concept extended to a check-as-you-type kind of feature, so that possible errors and issues can be corrected on the spot by the user editing the article.

@ssastry, I think the main question for RB would be the number of dimensions by which this needs to be queried. If this number is fairly low & fits a hierarchical model, then RB with the table storage backend could be a good fit. If it's a random combination of a large number of criteria, then elasticsearch indexing would likely be better.
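As a rough illustration of the distinction above, the sketch below models a hierarchical key as a sorted list of tuples. This is not RESTBase's table storage API; the key order (wiki, category, page) and the sample values are assumptions. A query that fixes the leading key components is a contiguous range scan, while a query on a trailing component alone has to examine every key, which is the case where a search index would likely serve better.

```python
import bisect
from typing import List, Tuple

Key = Tuple[str, str, str]  # (wiki, category, page): hypothetical key order


def prefix_scan(sorted_keys: List[Key], prefix: Tuple[str, ...]) -> List[Key]:
    """Return keys whose leading components equal prefix, via a range scan."""
    # A shorter tuple sorts before any longer tuple that it prefixes,
    # so bisect_left lands on the first candidate key.
    lo = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[lo:]:
        if key[:len(prefix)] != prefix:
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


def full_scan_by_page(sorted_keys: List[Key], page: str) -> List[Key]:
    """Query that skips the leading components: every key must be checked."""
    return [key for key in sorted_keys if key[2] == page]


# Sample data; wiki, category, and page names are illustrative only.
keys = sorted([
    ("enwiki", "missing-end-tag", "Paris"),
    ("dewiki", "missing-end-tag", "Berlin"),
    ("enwiki", "bogus-image-options", "Paris"),
    ("enwiki", "missing-end-tag", "London"),
])
```

Here `prefix_scan(keys, ("enwiki", "missing-end-tag"))` touches only the matching run of keys, whereas `full_scan_by_page(keys, "Paris")` walks the whole list.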

I have been wondering about integrating elasticsearch indexing with RESTBase before, but haven't actually written the code or drafted an actual design. Lots of handwaving, basically.

GWicke lowered the priority of this task from Medium to Lowest. Oct 12 2016, 11:29 PM
GWicke edited projects, added Services (watching); removed Services (later).

I actually don't think this has seen any movement since 2015, and we don't currently plan to work on this.

@ssastry, should we close this?

Legoktm subscribed.

The current plan is to store these in a database table, so closing this as declined.