
Implement wikicode linting framework / library
Closed, ResolvedPublic

Description

This could be a Google-Code / Outreachy (11) project

Description

Right now there is no simple way to detect syntax errors in wikicode (the text a user enters as the source of a wiki page). Such a tool would make it simple for tools and bots to validate wikitext (tools like ORES, Huggle, AWB and many others), and it could enable new features such as real-time syntax checking in MediaWiki (so that the user is notified about syntax errors while typing the source code or upon saving the page).

Designs

Library

The "linting framework" should probably be implemented as a library that is as portable as possible, so that tools can use it locally without having to contact an external web server. There could also be a web service, similar to ORES or Parsoid, that accepts wikitext and validates it using this library.
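To make the idea concrete, below is a minimal JavaScript sketch of what such a portable library's surface might look like. All names (lintWikitext, the issue fields) and the single naive rule are purely illustrative, not an existing API:

```
// Hypothetical sketch of a portable linting library's public surface.
// lintWikitext() and the issue shape are illustrative, not an existing API.

/**
 * Lint a piece of wikitext and return a list of issues.
 * @param {string} wikitext - raw page source
 * @param {Object} [options] - e.g. which rule set to apply
 * @returns {Array<{rule: string, message: string, start: number, end: number, severity: string}>}
 */
function lintWikitext(wikitext, options = {}) {
    const issues = [];
    // Example rule (naive and purely illustrative): unclosed <ref> tags.
    const openRefs = (wikitext.match(/<ref[^/>]*>/g) || []).length;
    const closeRefs = (wikitext.match(/<\/ref>/g) || []).length;
    if (openRefs > closeRefs) {
        issues.push({
            rule: 'unclosed-ref',
            message: `${openRefs - closeRefs} <ref> tag(s) never closed`,
            start: 0,
            end: wikitext.length,
            severity: 'error'
        });
    }
    return issues;
}

module.exports = { lintWikitext };
```

A tool could call this locally, and a web service would simply wrap the same function behind an HTTP endpoint.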

Potential issues with this design

If this service really was external, e.g. a library running on a different system than the MediaWiki installation, it would probably be very hard to adjust it to all the customizations of the target wiki whose text is being linted. Rules for wikitext may differ based on the MediaWiki version, configuration and installed parser extensions, so an external service would probably only be able to validate basic syntax.

Pros:

  • Very fast (linting can be run locally, even for a large number of edits)
  • More scalable and secure (linting is not performed on the same machine where the wiki is installed, so there is no way to DDoS the server hosting the wiki)

Cons:

  • No way to check for broken links
  • Doesn't work with the parser extensions of the target wiki

Implement this in MediaWiki

There would be a linting interface / API within MediaWiki itself. The wikitext would be evaluated by the wiki, so all extensions would be taken into account, and even things like broken links / non-existent targets (images or wiki pages) could be detected, since the linter would have direct access to the SQL database of the target wiki. This would be significantly more effective, but probably also somewhat harder to implement. Third-party tools would have no way to use this service other than contacting the web server somehow.
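For illustration only, here is how a bot written in JavaScript might call such a wiki-side linting API. The action=wikilint module is invented for this sketch; no such module exists today:

```
// Hypothetical sketch: a bot calling a wiki-side linting API over HTTP.
// The action=wikilint module name is invented purely for illustration.

const https = require('https');

function lintOnWiki(apiUrl, wikitext, title) {
    const params = new URLSearchParams({
        action: 'wikilint',   // hypothetical module name
        title: title,         // lets the wiki resolve links and templates
        text: wikitext,
        format: 'json'
    });
    return new Promise((resolve, reject) => {
        const req = https.request(apiUrl, {
            method: 'POST',
            headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
        }, res => {
            let body = '';
            res.on('data', chunk => { body += chunk; });
            res.on('end', () => resolve(JSON.parse(body)));
        });
        req.on('error', reject);
        req.write(params.toString());
        req.end();
    });
}

// Usage (hypothetical endpoint):
// lintOnWiki('https://example.org/w/api.php', '[[Broken link', 'Sandbox')
//     .then(issues => console.log(issues));
```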

Pros:

  • Detects broken links
  • Works with parser extensions

Cons:

  • Requires third-party tools to access this service through web requests, which is slower

Mix of both: library and MediaWiki integration

There is also the option that the code integrating linting into MediaWiki would itself use the library mentioned above, extending it with all parser extensions and with checks for missing targets (a rough sketch follows the pros list below). This is the most complex design, but it has pretty much all of the benefits:

Pros:

  • Very fast (linting can be run locally, even for a large number of edits)
  • Scalable
  • Detects broken links (if used through MediaWiki)
  • Works with all extensions
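As a rough illustration of this mixed design, the MediaWiki-side integration could reuse the portable library sketched earlier and layer wiki-aware checks (such as missing link targets) on top. Everything here is hypothetical, including the pageExists lookup, which stands in for whatever database query MediaWiki would provide:

```
// Rough idea of the mixed design: the MediaWiki-side integration reuses the
// portable library and adds wiki-aware checks. All names are hypothetical.

const { lintWikitext } = require('./lint-lib'); // the portable library sketched above

function lintWithWikiContext(wikitext, pageExists) {
    const issues = lintWikitext(wikitext); // basic, wiki-independent checks
    // Wiki-aware check: internal links pointing to pages that do not exist.
    const linkRe = /\[\[([^\]|#]+)/g;
    let m;
    while ((m = linkRe.exec(wikitext)) !== null) {
        const target = m[1].trim();
        if (!pageExists(target)) {
            issues.push({
                rule: 'missing-target',
                message: `Link target "${target}" does not exist`,
                start: m.index,
                end: linkRe.lastIndex,
                severity: 'warning'
            });
        }
    }
    return issues;
}

module.exports = { lintWithWikiContext };
```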

This linter can be written in any language, but I strongly recommend PHP (or possibly JavaScript), just to stay consistent with MediaWiki itself.

See Also

  • Primary mentor: (Phabricator username)
  • Co-mentor: (Phabricator username)
  • Other mentors: (optional, Phabricator username)
  • Skills: (Phabricator tags are welcome)
  • Estimated project time for a senior contributor: (must be 2-3 weeks)
  • Microtasks: (links to Phabricator tasks that must be completed in order to become a strong candidate)

Event Timeline

Petrb raised the priority of this task from to Needs Triage.
Petrb updated the task description. (Show Details)
Petrb added a project: Possible-Tech-Projects.
Petrb added subscribers: Petrb, Ricordisamoa, Halfak and 2 others.
Petrb set Security to None.

I think it's a great idea and I can immediately see the benefits of having a linter for wikitext!

Some thoughts on it from my side:

  • Officially, every wikitext is valid, so in theory there is nothing to lint. In practice, there are a few things that are definitely broken, or at least very likely broken. But this property of wikitext will make developing a linter somewhat hard, as there are no official, hard constraints defined.
  • If you want front-end linting, a JavaScript implementation would be the best choice, especially since it would support real-time linting (as you type); a minimal sketch follows below. But it also has the disadvantage that PHP cannot directly make use of it, unless the server has a JS engine installed, like Node.js. Then you would end up with a setup similar to VisualEditor/Parsoid, which is not trivial to set up, especially on shared hosts.

Best,
Simon
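To illustrate the real-time ("as you type") option mentioned above, here is a minimal browser-side sketch. It assumes a client-side lintWikitext() like the library sketched earlier; the output element id and the debounce delay are arbitrary, and wpTextbox1 is the id of the classic MediaWiki edit box:

```
// Minimal sketch of real-time linting in the browser, assuming a
// client-side lintWikitext() like the library sketched earlier.

const textarea = document.getElementById('wpTextbox1'); // classic edit box id
const output = document.getElementById('lint-output');  // hypothetical container

let timer = null;
textarea.addEventListener('input', () => {
    clearTimeout(timer);
    // Debounce so we only lint once the user pauses typing.
    timer = setTimeout(() => {
        const issues = lintWikitext(textarea.value);
        output.textContent = issues.length
            ? issues.map(i => `${i.rule}: ${i.message}`).join('\n')
            : 'No issues found.';
    }, 500);
});
```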

Is every wikitext really valid? <strike>Even wikitext that results in MediaWiki's large red syntax errors after saving?</strike> OK, I take that back. I wasn't able to get testwiki to yell at me, but I swear I saw this on some MediaWiki installations.

https://test.wikipedia.org/wiki/Broken_text

@Petrb
That's what I meant: you may freely write broken wikitext and it is still accepted, because it is technically valid. There are no "syntax" errors that would prevent you from saving broken markup. But of course there is still wikitext that produces broken output. Your link gives a nice summary of this!

Is every wikitext really valid?

Yes, it is. That said, Cite gives you the red errors in some scenarios (ex: T85386) and maybe some other extensions do as well. But in general, "errors" in wikitext are recognized mostly because they break expectations of how the page is supposed to render. The parsing infrastructure recovers from these "errors" automatically, but there is no spec for what should or might happen, so Parsoid and the PHP parser can, and sometimes do, do different things with such broken wikitext.

See https://phabricator.wikimedia.org/T94370#1701988 for the current status of this project in Parsoid. I think the ideal next step would be to take this forward and finish the integration with project wikicheck and other tools / bots that might benefit from this information.

Also, the other aspect where linting is useful is in enforcing wiki-specific coding norms. See this wikitech-l mail where I broach the issue. That is another possible source of lint rules, but it would be advanced usage for the linter.

As @ssastry said, there is the basic infrastructure for this already in Parsoid, which gets automatically run after every edit.

The missing components are:

  1. Integration with project wikicheck and/or other UI which makes it as easy as possible for human editors to go through the list of errors and correct them (or flag them as reviewed). This could include visualization of the "lint queue" so we can see how well we're doing.
  2. Make it as easy as possible to write new rules. Currently the rules live in the Parsoid repository; this could be factored out and documented (a possible rule shape is sketched below).
  3. Write tools that act on the lint queue to automatically correct errors when possible. This is especially important long-term, to allow us to make subtle changes to wikitext parsing and automatically fix up all articles with deprecated constructs.

Of these, the first is the most important right now. Parsoid already knows about a huge number of wikitext errors but we cannot communicate that list to human editors.
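As an illustration of point 2 (making new rules easy to write), a pluggable rule might look something like the sketch below. This is not Parsoid's actual rule API; the token shape, rule name and report callback are invented:

```
// Purely illustrative sketch of what a pluggable lint rule could look like
// if rules were factored out of the Parsoid repository.

module.exports = {
    name: 'obsolete-tag',
    description: 'Flags HTML tags that are deprecated in wikitext',
    // `tokens` is assumed to be a flat token stream produced by the linter core.
    check(tokens, report) {
        const obsolete = new Set(['center', 'font', 'tt', 'big']);
        for (const tok of tokens) {
            if (tok.type === 'html-tag' && obsolete.has(tok.name)) {
                report({
                    rule: 'obsolete-tag',
                    message: `<${tok.name}> is obsolete; use CSS instead`,
                    start: tok.start,
                    end: tok.end
                });
            }
        }
    }
};
```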

Parsoid already knows about a huge number of wikitext errors but we cannot communicate that list to human editors.

Right, but can you communicate them to machines? I don't particularly care about human editors here. Is there any API / interface that would let me obtain the list of errors, or have Parsoid run this sort of lint on text I provide to it?

Can you provide examples of "broken wikitext"?

Can you provide examples of "broken wikitext"?

T94370#1701988

Can you provide examples of "broken wikitext"?

T94370#1701988

or any number of cases where quotes are missing / mismatched around attributes, or missing '' or ''', or misnesting of links [http://foo.bar this is company [[Foo]]'s website], to name a few more.

So I'm guessing the best way to go about this is to make a list of the tokens used in wikicode and then build an AST out of them.
The tokens could then be used to do some static code analysis? Something like PHPCS could be used as a model.

Or was a simple regex-based parser planned? I think regexes may be insufficient to capture the complexity of wikitext.
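To make the token-based suggestion concrete, here is a very rough JavaScript sketch: a tiny tokenizer for link delimiters plus a check for the misnested-link case mentioned above. It also shows why a single regex pass is awkward for nesting problems. Everything here (token names, rule name) is illustrative only:

```
// Sketch of the token-based approach: tokenize first, then run checks over
// the token stream instead of raw regexes.

function tokenize(wikitext) {
    // Extremely simplified: only recognizes link delimiters and plain text.
    const tokens = [];
    const re = /\[\[|\]\]|\[|\]/g;
    let last = 0, m;
    while ((m = re.exec(wikitext)) !== null) {
        if (m.index > last) {
            tokens.push({ type: 'text', value: wikitext.slice(last, m.index), start: last });
        }
        tokens.push({ type: m[0], start: m.index });
        last = re.lastIndex;
    }
    if (last < wikitext.length) {
        tokens.push({ type: 'text', value: wikitext.slice(last), start: last });
    }
    return tokens;
}

// A check that single regexes handle poorly: an internal link opened inside
// an external link, e.g. [http://foo.bar this is company [[Foo]]'s website]
function checkLinkNesting(tokens) {
    const issues = [];
    const stack = [];
    for (const tok of tokens) {
        if (tok.type === '[' || tok.type === '[[') {
            if (stack.length > 0) {
                issues.push({ rule: 'misnested-link', start: tok.start });
            }
            stack.push(tok.type);
        } else if (tok.type === ']' || tok.type === ']]') {
            stack.pop();
        }
    }
    return issues;
}

console.log(checkLinkNesting(tokenize("[http://foo.bar this is company [[Foo]]'s website]")));
// -> flags the [[Foo]] opened inside the external link
```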

This is solely a proposal with no exact plan. It's up to the person who grabs this task to implement it in whatever way they prefer. But any ideas are welcome.

IMPORTANT: This is a message posted to all tasks under "Need Discussion" at Possible-Tech-Projects. Wikimedia has been accepted as a mentor organization for GSoC '16. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

@ssastry, would the Parsing-Team--ARCHIVED be willing to push this forward in the current round of GSoC '16 / Outreachy 12?

If this is intended to be portable, then the natural choice is a basic C library, so bindings can be added for any language and it can be incorporated into lint frameworks in any language. If there is any interest in a basic C library, I would be happy to be a mentor.

@Sumit We already have a basic functional linter within Parsoid. The part that still needs work, which we haven't found the resources for, is integrating it with the workflow of existing fixup tools, bots, etc. Hardik did write a demo tool (as part of the GSoC work that built this linter) that recorded this information in an external webapp, but we didn't get far enough to integrate it with the existing workflows. That is the reason why we haven't put this up for GSoC yet.

The bulk of the work that needs to be done is planning and figuring out the details of (a) where the information from the linter will be stored, (b) the format of that storage, (c) the API for bots and other tools to use it, and (d) whether we can introduce a fixup mode within Parsoid itself to fix these errors. These issues need to be worked through before code can be written, and I am not sure this design and planning work is GSoC / Outreachy material.
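Purely to make questions (b) and (c) concrete, one possible shape for a stored lint record and a bot-side consumer is sketched below. The field names and the /lint endpoint are invented for illustration, and fetch assumes Node 18+ or a browser:

```
// (b) One possible storage format: one record per issue, keyed by page + revision.
const exampleRecord = {
    pageid: 12345,
    revid: 678901,
    rule: 'misnested-link',
    severity: 'error',
    // offsets into the wikitext, so bots can locate and fix the problem
    range: [120, 168],
    details: { innerLink: 'Foo' }
};

// (c) A bot might page through records for a given rule like this:
async function* lintRecords(apiBase, rule) {
    let offset = 0;
    while (true) {
        // Hypothetical endpoint; the real integration would define the URL and paging.
        const res = await fetch(`${apiBase}/lint?rule=${rule}&offset=${offset}`);
        const batch = await res.json();
        if (batch.length === 0) {
            return;
        }
        yield* batch;
        offset += batch.length;
    }
}

// Example (hypothetical service):
// for await (const rec of lintRecords('https://lint.example.org', 'misnested-link')) {
//     console.log(rec.pageid, rec.range);
// }
```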

Is there a reason not to close this ticket? We now have https://www.mediawiki.org/wiki/Extension:Linter, which is deployed and used on the Wikimedia cluster.

ssastry claimed this task.

I am going to close this as resolved. Please file tickets against the Linter project if there are feature requests, ideas for enhancements, etc.