
Implement a mechanism that recognizes syntax errors of wikicode as a Huggle extension
Open, Stalled, Low, Public

Description

This is a potential task for Google Summer of Code / Outreachy.

About Huggle

Huggle is a fast diff browser application intended for dealing with vandalism or other unconstructive edits on Wikimedia projects, written in C++.

Huggle is able to load and review edits made to Wikipedia in real time, helps users identify unconstructive edits, and allows them to be reverted quickly. Various mechanisms are used to draw conclusions as to whether an edit is constructive or not. It uses a semi-distributed model where edits are retrieved using a "provider" (this can be anything that is capable of distributing a stream of edit information, such as the Wikipedia API or the IRC recent changes feed), pre-parsed and analyzed. This information is then shared with other anti-vandalism tools, such as ClueBot NG. Huggle also uses a number of self-learning mechanisms, including a global white-list (users that are considered trusted) and user-badness scores that are stored locally on the client's computer.

Description of this task

Create a mechanism that recognizes common mistakes in wikitext syntax and implement it in Huggle as an extension. For every syntax error found in the new text of a diff, raise the edit score by a certain value. It could also look up Commons files: if someone replaced a file name with that of a nonexistent file, it should raise the score as well.
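The scoring idea above can be sketched roughly as follows. This is only a minimal illustration, not the extension's design: the function names, the set of delimiter pairs and the penalty of 50 per error are all assumptions, and the check is a crude heuristic that ignores `<nowiki>` sections and nesting order.

```cpp
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of `token` in `text`.
static std::size_t countToken(const std::string &text, const std::string &token)
{
    std::size_t count = 0;
    std::size_t pos = text.find(token);
    while (pos != std::string::npos) {
        ++count;
        pos = text.find(token, pos + token.size());
    }
    return count;
}

// Hypothetical heuristic: every surplus opener or closer counts as one error.
int countSyntaxErrors(const std::string &newText)
{
    static const char *const pairs[][2] = {
        {"[[", "]]"}, {"{{", "}}"}, {"<!--", "-->"}};
    std::size_t errors = 0;
    for (const auto &pair : pairs) {
        const std::size_t opened = countToken(newText, pair[0]);
        const std::size_t closed = countToken(newText, pair[1]);
        errors += opened > closed ? opened - closed : closed - opened;
    }
    return static_cast<int>(errors);
}

// Converts the error count into a score increase; the penalty is an assumed value.
long scoreIncrease(const std::string &newText, long penaltyPerError = 50)
{
    return countSyntaxErrors(newText) * penaltyPerError;
}
```

With these assumptions, a new text that leaves one link and one template unclosed would raise the score by two penalties, while balanced text would leave it unchanged.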

A summary of the error checks should be written to the MetaLabels or PropertyBag of every edit (see the header file wikiedit.hpp for details), so that it is visible in the edit details in the Huggle interface. The mechanism to detect syntax errors should be designed, if possible, so that it can be reused by other tools (it must at least be open source under a GNU-compatible license).
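As a rough illustration of what such a summary could look like, the sketch below formats a list of detected issues into a single label. It deliberately uses a plain `std::map` as a stand-in: the real MetaLabels and PropertyBag types are defined in wikiedit.hpp and are not reproduced here, and the key name "Syntax errors" is made up for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the edit's label storage; NOT Huggle's actual type.
using LabelMap = std::map<std::string, std::string>;

// Writes a human-readable summary of the issues under a hypothetical key.
// Nothing is written when no issues were found.
void writeSyntaxSummary(LabelMap &labels, const std::vector<std::string> &issues)
{
    if (issues.empty())
        return;
    std::string joined;
    for (const std::string &issue : issues) {
        if (!joined.empty())
            joined += ", ";
        joined += issue;
    }
    labels["Syntax errors"] = std::to_string(issues.size()) + " (" + joined + ")";
}
```

A real extension would replace `LabelMap` with whatever wikiedit.hpp actually exposes; only the formatting step is meant to carry over.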

External linting service (optional)

If you don't feel that strong in C++, there is also the option of designing an external linting service that could be hosted on Tool Labs, written in a language of your choice. The extension would then use this service to validate edits. This is an option, not a requirement.

Hints

You can reuse code from existing tools that do a similar job, for example AWB or CheckWiki: https://tools.wmflabs.org/checkwiki/

You can use the source code of this extension as a reference, because it already re-scores edits in Huggle: https://github.com/huggle/extension-scoring

Basic information about Huggle can be found at http://enwp.org/WP:HG; documentation for developers can be found at http://tools.wmflabs.org/huggle/docs/head and on the wiki at https://github.com/huggle/huggle3-qt-lx/wiki

It's strongly recommended to discuss any potential issues or questions regarding the Huggle code on our IRC channel Huggle on freenode.net

  • Primary mentor: Petrb (petan on freenode)
  • Co-mentor: (Phabricator username)
  • Other mentors: (optional, Phabricator username)
  • Skills: C++ (optionally PHP, Perl, .NET or Python, in order to analyze the existing tools that do this job)
  • Estimated project time for a senior contributor: 3 weeks
  • Microtasks: (links to Phabricator tasks that must be completed in order to become a strong candidate)

Event Timeline

Petrb created this task. Mar 29 2015, 10:33 PM
Petrb raised the priority of this task from to Needs Triage.
Petrb updated the task description.
Petrb added a project: Huggle.
Petrb added a subscriber: Petrb.
Restricted Application added a subscriber: Aklapper. Mar 29 2015, 10:33 PM
Petrb moved this task from Backlog to Need volunteer on the Huggle board. Apr 2 2015, 11:10 AM
Petrb triaged this task as Low priority. Apr 10 2015, 11:47 AM
Restricted Application added a subscriber: Luke081515. Aug 17 2015, 3:03 PM

I wish to know what type of errors need to be detected. Since the extension-splitter-helper file has solved the problem of taking into account the added text, I wish to know what further work needs to be done.

A reference point could be to see whether an edit caused any of these errors https://tools.wmflabs.org/checkwiki/cgi-bin/checkwiki.cgi?project=enwiki&view=project which would be cause for some alarm, and we could raise score points in Huggle for a specific edit.

@Shrutika719 Pretty much every syntax error in MediaWiki page source. If someone edits a page by removing a random part of it, or by overwriting part of it with nonsense, and so on, this should all be detected somehow. Vandals very often break the syntax of a page, but these edits are not always easy to filter out from regular "proper" edits.
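One way to separate edits that break a page from edits to a page that was already broken is to compare the error counts of the old and new revisions and penalize only the increase. The sketch below assumes this comparison approach, which the thread does not prescribe; `unbalancedTemplates` is a toy checker invented for the example and both function names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

using ErrorCounter = std::function<int(const std::string &)>;

// Toy checker: scans for "{{" and "}}" and counts unclosed openers
// plus closers that have no matching opener.
int unbalancedTemplates(const std::string &text)
{
    int depth = 0;
    int stray = 0;
    for (std::size_t i = 0; i + 1 < text.size();) {
        if (text[i] == '{' && text[i + 1] == '{') {
            ++depth;
            i += 2;
        } else if (text[i] == '}' && text[i + 1] == '}') {
            if (depth > 0)
                --depth;
            else
                ++stray;
            i += 2;
        } else {
            ++i;
        }
    }
    return depth + stray;
}

// Returns how many errors the edit introduced; fixes and neutral edits yield 0.
int introducedErrors(const std::string &oldText, const std::string &newText,
                     const ErrorCounter &count)
{
    const int delta = count(newText) - count(oldText);
    return delta > 0 ? delta : 0;
}
```

Under this scheme an edit that removes a closing `}}` is flagged, while an edit that repairs one, or that touches an already broken page without making it worse, is not.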

This comment was removed by Shrutika719.

This is a message posted to all tasks under "Backlog" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

I volunteer to work on this project for the Outreachy round in 2015. I would like to know whether the mentors see this as a potential project for the Outreachy program.

Qgil renamed this task from Recognize syntax errors to Recognize syntax errors with Huggle. Oct 1 2015, 8:01 AM
Qgil updated the task description.
Qgil set Security to None.

@Petrb are you interested in mentoring this project? Please update the description, I have added required fields.

Also, what do you mean by "This is WMF specific task"?

Petrb added a comment. Oct 1 2015, 9:13 AM

Hi,

yes, I can mentor this task, but I need to add 2 important things before anyone is going to take it:

  • This task is NOT trivial
  • This task requires knowledge of C++; it's not a PHP-only project, although PHP could be used as well

Given that this task is not trivial and that it covers only Huggle, while this particular functionality ("MediaWiki syntax-checking") is something that many other projects could benefit from, I would like to discuss the design of a potential solution with other developers if possible.

I believe it should be possible to create some sort of framework, or a library that would be able to check mediawiki syntax so that other tools could access this functionality as well. At some point this functionality could be implemented into mediawiki core / API instead and huggle would just retrieve the information about syntax validity through API, it could even be some sort of revision attribute, whether syntax is broken or not.

My point is, I don't want to waste someone's time just to create a complex (yet useful) feature like this, that could only be used by huggle and nothing else. If such effort was taken by someone to do this, I would prefer if the solution was sort of portable, or easily reusable by other tools.

Having said that, the scope of this task depends on the design that is chosen. There are the following options:

Implement this in huggle only

  • Deep knowledge of C++ required, non trivial task
  • Can't be easily reused by other tools

Create this feature on the MediaWiki side, and implement an interface in Huggle that accesses it

Mediawiki side

  • Deep knowledge of PHP required, non trivial task
  • Can be reused by other tools

Huggle side

  • Elementary knowledge of C++ required, simple task

One more thing: I haven't done any research so far, so maybe there already is a similar library / tool that is able to do this. At least MediaWiki itself is already able to throw syntax errors (but only for fatal syntax issues, not minor ones). Maybe some of that code could be (re)used.

Petrb updated the task description. Oct 1 2015, 9:14 AM
Petrb added a comment. Oct 1 2015, 9:22 AM

Another design option:

Create separate instance of mw syntax validator, similar to parsoid

It could be written in any language and would run as a service provider (running on a server, allowing clients to submit MW code for validation). The basic idea is that the client would submit wiki source text to the service and the validator would return the results of the syntax check. That way it could be used by other tools, could be written in any language and wouldn't require modifications of MW core. The only issue I see here is performance: the other 2 designs would have direct access to the information on whether the syntax is broken, whereas here you would need to issue an extra query to a 3rd server just to check the syntax.
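Whichever design is picked, the client side can be kept independent of it by coding against a small interface. The sketch below is one possible shape for that, assuming nothing about Huggle's real classes: `WikitextValidator`, `Issue`, `LocalValidator` and `issueCount` are all hypothetical names, and the single check in `LocalValidator` is only a placeholder. A service-backed validator would implement the same interface and perform the extra network query inside `validate`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One reported problem: a type name and the offset where it starts.
struct Issue {
    std::string type;
    std::size_t offset;
};

// Hypothetical interface that both an in-process checker and a
// client for a remote validation service could implement.
class WikitextValidator {
public:
    virtual ~WikitextValidator() = default;
    virtual std::vector<Issue> validate(const std::string &wikitext) const = 0;
};

// Placeholder in-process implementation: reports the last "[[" if
// no "]]" follows it.
class LocalValidator : public WikitextValidator {
public:
    std::vector<Issue> validate(const std::string &wikitext) const override
    {
        std::vector<Issue> issues;
        const std::size_t open = wikitext.rfind("[[");
        if (open != std::string::npos &&
            wikitext.find("]]", open) == std::string::npos)
            issues.push_back({"unclosed-link", open});
        return issues;
    }
};

// Calling code depends only on the interface, not on where the check runs.
std::size_t issueCount(const WikitextValidator &validator, const std::string &text)
{
    return validator.validate(text).size();
}
```

The trade-off discussed above then becomes an implementation detail: swapping the local checker for a remote one changes latency but not the calling code.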

I don't know how to handle errors in the syntax of code that would be parsed by MediaWiki extensions, as this service probably wouldn't know much about the target MediaWiki installation for which it would check the source text. Therefore this would need to be "WMF specific", i.e. designed for WMF installations of MediaWiki, which have a uniform extension setup.

Petrb added a comment. Oct 1 2015, 9:25 AM

BTW if we really had this feature it could be implemented into MediaWiki's edit form, so that it would pop out a warning if user wanted to save a page with syntax errors, preventing users from saving broken pages by mistake.

  • This task is NOT trivial

Equivalent to 2-3 weeks of an experienced developer?

Petrb added a comment. Oct 1 2015, 9:58 AM
  • This task is NOT trivial

Equivalent to 2-3 weeks of an experienced developer?

Yes, for it to cover at least the basic syntax errors, which should be sufficient for a start, 2-3 weeks is enough.

I am interested in this project and I want to contribute to it. I also have a good knowledge of C++.

@Petrb, I cannot help you find the right strategy for this project idea. I can remind you that in order to feature this project in Outreachy we need an up-to-date description and two mentors. :)

Petrb added a comment. Oct 1 2015, 8:22 PM

OK

@Qgil I want to make 100% sure there isn't some redundancy (there might already be some tool that does the same or a similar thing) and I also want to make this feature as reusable as possible, given the resources invested in it. That is why I would rather do some research before making a final decision on how it should work.

@Ricordisamoa I guess that thing is a wish, rather than working feature?

Does any MediaWiki dev know if there is any code in core that could be used to validate wikitext syntax?

Petrb added a comment. Oct 1 2015, 8:24 PM

I will add some devs so that they can join this as well

Petrb added a subscriber: Halfak.

At some point this feature might be used / provided by ORES as well

Petrb updated the task description. Oct 1 2015, 8:30 PM
Petrb renamed this task from Recognize syntax errors with Huggle to Implement a mechanism that recognizes syntax errors of wikicode and create a Huggle extension that would utilize it. Oct 1 2015, 8:32 PM
Petrb added a comment. Oct 1 2015, 8:40 PM

@Iamneha & @Shrutika719 I do welcome your interest in this task, but please wait a bit until the discussion on the design of this feature settles down :) I am sorry for the original description being so confusing; I drafted it rather quickly and didn't really expect any big interest in this task. Now that I see people want to work on it, I would like to polish the description as much as possible so that the final product can be as useful as possible, yet doable within the Outreachy difficulty level :)

I will try to make this task ready to take ASAP. Thanks for your patience guys.

Petrb added a comment. Oct 1 2015, 8:44 PM

I found this page https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia which lists some of the existing and working tools that are able to check some syntax issues. I believe they are all open source and some parts and algorithms could be used from these.

Josve05a added a comment. Edited Oct 1 2015, 8:45 PM

I found this page https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia which lists some of the existing and working tools that are able to check some syntax issues. I believe they are all open source and some parts and algorithms could be used from these.

The tool database-thingy is at https://tools.wmflabs.org/checkwiki/ which is maintained by @Bgwhite who I cc'ed earlier.

Anomie removed a subscriber: Anomie. Oct 1 2015, 8:54 PM

I found this page https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia which lists some of the existing and working tools that are able to check some syntax issues. I believe they are all open source and some parts and algorithms could be used from these.

The tool database-thingy is at https://tools.wmflabs.org/checkwiki/ which is maintained by @Bgwhite who I cc'ed earlier.

The code for Check Wikipedia is GPL and written in Perl. Check Wikipedia can work on a dump file, a list of articles or an article on-the-fly. Go ahead and steal any code. Check Wikipedia is used within AWB and WPCleaner. Both these tools also have ways to check and fix issues. AWB is GPL and written in .Net. WPCleaner is Apache and written in Java.

@Petrb thank you so much for your support. Waiting until the project is fully prepared.

Petrb renamed this task from Implement a mechanism that recognizes syntax errors of wikicode and create a Huggle extension that would utilize it to Implement a mechanism that recognizes syntax errors of wikicode as a Huggle extension. Oct 2 2015, 2:54 PM
Petrb updated the task description.
Petrb added a comment. Oct 2 2015, 2:59 PM

OK, I have updated the task information, I think we can drop the idea of creating some widely available service that would allow for these checks, because that is probably overkill and wouldn't even fit within Outreachy scope, it's just a huge task.

So, I guess creating an extension for huggle that would use similar mechanisms as in other mentioned tools would be more than enough. In the end, if the source code was written in a sane way, it could be probably reused as well.

The task needs a co-mentor though (@Addshore maybe :)?), not sure what microtasks are needed. I suppose it needs to be approved by someone as well later (@Qgil?). Until that is done I am afraid it can't be taken yet by any candidate, but isn't Phabricator supposed to be used just for proposals? I thought candidates were assigned in Melange.

Petrb updated the task description. Oct 2 2015, 3:03 PM

I think we can drop the idea of creating some widely available service that would allow for these checks

What a wonderful waste of interns' time.

Petrb added a comment. Oct 4 2015, 8:17 PM

@Ricordisamoa: can you please elaborate on this?

The task is not yet confirmed for anyone to take; it's still being prepared and I wanted to discuss this with other developers. Yet nobody voiced any opinions whatsoever (only silently removed themselves from this ticket to avoid any need to talk). So maybe if, instead of complaining about how bad an idea it is, you gave us some real input on how it should actually be done, that would be really helpful.

If it's preferred that this be done as an external service so that others can utilize it, I can change the description of the task so that it requires it. But to be honest I didn't receive any input from anyone on whether that would actually be useful or not. Would AWB, ORES or anything else benefit from having it?

I am waiting for your responses guys. Thank you

What a wonderful waste of interns' time.

No intern has been assigned to this task and I'm not sure how this comment provided some help to others...

not sure what microtasks are needed.

Any smaller task that helps getting prepared:

Applicants must provide a small contribution prior to applying. (Known as "microtask".)
Assessing newcomers based on a proposal and a code repository out of context is risky.
The first contribution must be related to the type of project proposed.

(quoted from Lessons learned)

I suppose it needs to be approved by someone as well later (@Qgil?).

See https://www.mediawiki.org/wiki/Outreachy/Round_11 and its links for the application and selection process.

If a task gets 'dumbed down' for the sake of Outreachy applicants, it becomes more of a void exercise for one of them than a real benefit for the movement.

Petrb changed the task status from Open to Stalled. Oct 5 2015, 8:01 AM

That's true. On the other hand, if we really want to make this an external service, there should probably be a different, Huggle-unrelated task for this.

I don't know if it would take just 3 weeks for a developer to implement it though. I am afraid that for this to be done properly it might be just as complex to create as ORES was and I think it took more than 3 weeks to create.

But yes, I recently figured out the scope of Outreachy and the fact that there is a big monetary award, I thought this is more or less something like Code-In. Given the significant monetary award I think this task should be much broader than what it is now.

So let's scratch this task for now, I will create a new one

Petrb added a comment. Oct 5 2015, 9:31 AM

Ok, I created https://phabricator.wikimedia.org/T114631. If we decide to do that one, this task should probably be reduced to a wrapper extension that would utilize that framework, which is a pretty trivial task that could probably be a Code-In task or similar.

Discussion is needed

But yes, I recently figured out the scope of Outreachy and the fact that there is a big monetary award, I thought this is more or less something like Code-In. Given the significant monetary award I think this task should be much broader than what it is now.
So let's scratch this task for now, I will create a new one

Possible-Tech-Projects tag removed for now. Thank you @Petrb for your perseverance. :)

ssastry added a subscriber: ssastry. Edited Oct 5 2015, 2:14 PM

Create separate instance of mw syntax validator, similar to parsoid

T48705: Parsoid-based wikitext "linting" tool for "buggy" / "deprecated" wikitext usage; keywords: broken wikitext information
.....
In T94370#1694819, @Petrb wrote:

@Ricordisamoa I guess that thing is a wish, rather than working feature?

No, it is not a wish. Hardik Juneja did work on the project as part of GSoC 2014 based on plans here. Parsoid has some initial functionality to detect a bunch of broken wikitext scenarios (which you can test via the parse.js script on the commandline).

[subbu@earth tests] echo "<table>foo<td>bar</td></table> and <div>boo" | node parse --lint
...
[ { type: 'fostered',
    wiki: 'enwiki',
    page: 'Main Page',
    revision: undefined,
    wikiurl: 'https://en.wikipedia.org/',
    location: '[enwiki/Main Page]',
    dsr: [ 0, 30, 7, 8 ],
    src: 'foo' },
  { type: 'missing-end-tag',
    wiki: 'enwiki',
    page: 'Main Page',
    revision: undefined,
    wikiurl: 'https://en.wikipedia.org/',
    location: '[enwiki/Main Page]',
    dsr: [ 35, 43, 5, 0 ],
    src: '<div>boo' } ]
....
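For a consumer such as a Huggle extension, the `dsr` arrays in the output above appear to be the most useful part. Judging from the two results shown (an assumption based on this sample, not on Parsoid documentation quoted in this thread), the first two numbers are start and end offsets into the submitted wikitext and the last two are the widths of the opening and closing tags. A minimal sketch of mapping a result back to the offending source, with `LintResult` and `flaggedSource` as hypothetical names:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Mirrors two fields of the lint output shown above.
// Assumption: dsr = {start, end, openingTagWidth, closingTagWidth}.
struct LintResult {
    std::string type;
    std::array<std::size_t, 4> dsr;
};

// Returns the slice of the submitted wikitext covered by the result,
// or an empty string if the offsets do not fit the text.
std::string flaggedSource(const std::string &wikitext, const LintResult &result)
{
    const std::size_t start = result.dsr[0];
    const std::size_t end = result.dsr[1];
    if (start > end || end > wikitext.size())
        return std::string();
    return wikitext.substr(start, end - start);
}
```

Applied to the input from the command line above, the `missing-end-tag` range 35 to 43 selects `<div>boo`, matching the reported `src`, and the `fostered` range 0 to 30 spans the whole table.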

We haven't enabled this in production because we haven't done the last-mile piece of how to integrate this with editing tools, bots, project wikicheck, etc. We had initial conversations as part of the GSoC internship, but we haven't had the time to follow through on it.

I'll move further conversation about this to T114631, but it would be great to take this project further.