
Integrate revscoring and/or wikilabels into Huggle
Closed, Resolved (Public)

Description

Huggle is probably the most widely used anti-vandalism tool across Wikimedia, and it could be used to gather new training data. When the user presses revert for vandalism, the diff could be labelled as vandalism, and a new button could be added to label an edit as good faith and move the active page to the next one in the queue. If T108304 were implemented, there would already be a button for test edits.

Event Timeline

Alchimista raised the priority of this task from to Needs Triage.
Alchimista updated the task description. (Show Details)
Alchimista subscribed.
He7d3r renamed this task from Integrate revscoring into huggle to Integrate revscoring and/or wikilabels into Huggle. Aug 7 2015, 3:39 PM
He7d3r awarded a token.
He7d3r set Security to None.
He7d3r subscribed.

I believe there are two fronts in which this integration with Huggle could happen:

  1. Using Huggle's interface to constantly label the stream of new edits according to criteria defined for previous campaigns. E.g. in the just-finished "edit quality" campaign, users labeled each edit according to two different aspects:
    1. goodfaith vs badfaith; and
    2. damaging vs constructive.
    This could then be used to retrain the existing machine learning models from time to time, to keep them updated and reflecting the current wiki environment.
  2. Using scores provided by ORES to show relevant information to users of Huggle. For example, highlighting likely vandalism (i.e. edits whose "goodfaith" score is low and whose "damaging" score is high), or highlighting productive new users (whose last N edits have a high "goodfaith" score and a low "damaging" score, on average).

    This was one of the motivations for the edit quality campaigns: https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_quality#Why.3F
Petrb triaged this task as Medium priority. Aug 24 2015, 11:49 AM
Petrb subscribed.

I suppose you are talking about this: https://github.com/wiki-ai/revscoring? If not, please provide me with links to what you mean and documentation on how to use it.

Is there any working instance I could use?

Is there a way to get a more Qt-friendly format, such as XML? This format is extremely hard to parse.

Is there any documentation for the whole thing?

I submitted a request for XML format (in addition to the existing JSON format) to https://github.com/wiki-ai/ores/issues/81.

Other than the page linked in previous comments, you may find some general information at
https://meta.wikimedia.org/wiki/Grants:IEG/Revision_scoring_as_a_service
and details about the Python library at http://pythonhosted.org/revscoring/

@Petrb, what is difficult about parsing JSON?

Is this what you are looking for? http://doc.qt.io/qt-5/json.html
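
For illustration, here is a minimal sketch of what reading the probabilities out of an ORES response could look like with Qt5's QJsonDocument (the response shape assumed here, a top-level object keyed by revision ID with a "probability" object containing "true" and "false" entries, should be verified against the live service):

#include <QJsonDocument>
#include <QJsonObject>
#include <QJsonParseError>
#include <QByteArray>
#include <QString>
#include <QDebug>

// Sketch only: assumes the ORES reply looks roughly like
// { "12345": { "prediction": false, "probability": { "true": 0.1, "false": 0.9 } } }
static void ParseOresReply(const QByteArray &payload, const QString &revid)
{
    QJsonParseError error;
    QJsonDocument document = QJsonDocument::fromJson(payload, &error);
    if (error.error != QJsonParseError::NoError || !document.isObject())
    {
        qWarning() << "Failed to parse ORES reply:" << error.errorString();
        return;
    }
    QJsonObject probability = document.object().value(revid).toObject()
                                               .value("probability").toObject();
    qDebug() << "true:" << probability.value("true").toDouble()
             << "false:" << probability.value("false").toDouble();
}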

Huggle is using Qt4 on Linux because Qt5 is not widely supported, especially on Debian and Ubuntu. I can write some pseudo-parser that would scrape the scoring out of that, but it may easily break should the format change in the future.

At some point I could create this extension for Windows only, if it's so much of a problem to provide an XML version of the output; we use Qt5 there, and in the end there aren't that many Linux users of Huggle.

Where is the documentation on how to submit data to this service? There is an option in Huggle to classify an edit as a good edit, and you probably also would like to know if an edit was reverted or considered harmful. How do I do that?

Screen Shot 2015-08-26 at 14.19.06.png (1×1 px, 368 KB)

Here we go. The scores need to be slightly adjusted though; right now they are too positive, at least when compared with ClueBot. It seems that ORES predicts TRUE for most edits (or at least that's how it looks to me).

I've been running Huggle with ORES for a while and I must say its predictions are very often wrong. I made Huggle display all 3 scores (Huggle / ORES / CB), and for many edits both ORES and CB fail to predict vandalism, even for quite obvious vandalism which is easily detected by Huggle (such as CAPITALS or obvious swear words).

But in some cases it seems to be useful. I implemented an amplifier constant, adjustable in the configuration for each wiki, which gives a weight to the ORES scores. Right now it's rather low, so ORES doesn't contribute much to the scoring of edits, but once its internal database becomes more efficient I think we could change the default value to something higher.

But it seems that it produces similar output to ClueBot, so together these two make a decent addition to Huggle's scoring mechanisms. I think that after a little more testing I can make this extension a default part of the Windows and MacOS distributions. I just need to somehow figure out how to determine whether a wiki is supported by ORES or not. I guess for unknown wikis it just returns 404 errors.

FYI, this is how I convert the ORES score to a Huggle score:

long final = (long)((probability["true"].toDouble() - probability["false"].toDouble()) * this->GetAmplifier(wiki_edit->GetSite()));

You can see that I basically create a number between 200 and -200 (depending on the size of the amplifier constant) that is later added to the overall score.
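
For example, with made-up numbers (an amplifier of 200 and an edit that ORES is fairly sure will be reverted):

// Hypothetical values, only to illustrate the range of the conversion above
double trueProbability = 0.95;  // ORES probability that the edit will be reverted
double falseProbability = 0.05; // ORES probability that it will not
long amplifier = 200;           // per-wiki constant from Huggle's configuration
long final = (long)((trueProbability - falseProbability) * amplifier); // = 180, added to the overall edit score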

@Petrb, I'm sorry that you are having so much trouble processing JSON, but JSON has been a common data format since the early 2000s and it seems that everyone is happy with it but you, so I'd like to not build additional data formats. If you think that writing your own 3rd party parser is more desirable than using someone else's 3rd party parser, then I welcome you to it. FWIW, I suspect that JSON is going to be quite easy to parse due to its extremely simple, local syntax. Further, we don't include any fancy unicode escapes in our strings, so you'll probably get away with a minimalistic implementation.

Reading a probability from an SVM is a complex thing. The model you are using will try to predict True at 50% probability because it was trained with a balanced set. This probability is not as intuitive as you might like, but it is useful. We usually put the threshold of *needs review* at around 80%. Either way, this still allows the scores/revisions to be sorted/triaged, and that seems to work as intended in practice.
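
In client terms that could be a simple triage rule; a sketch (the 80% figure is only the rough guideline mentioned above, not an API constant):

// Sketch: flag an edit for review once the "true" (reverted) probability crosses
// the ~80% guideline. The threshold is a client-side choice, not part of the ORES API.
static bool NeedsReview(double revertedProbability)
{
    const double threshold = 0.80;
    return revertedProbability >= threshold;
}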

Petrb claimed this task.

I don't have a problem with JSON, but Qt4 does. So far the extension is going to be available only to those who have Huggle compiled with Qt5 (which is all Windows and MacOS users). I think that is sufficient, so I am not going to write any parser in the end.

I forgot to comment on sending us data from Huggle. Since Huggle shows users a biased set of revisions (not a random sample -- which is the point, after all), we can't make direct use of the evaluations. For the moment, we can learn which revisions were reverted through Huggle by processing revision comments post-hoc. So, when we want to run another random sample through Wiki labels, we can use the information about which tool was used to revert the edit to supplement our dataset.

BTW, before I flag this for deployment I would like to know if the web server of ORES is handling all these requests fine. Right now it processes around 100 HTTP requests per minute during peak hours per Huggle client, which in the future could be somewhere around 800 or more HTTP requests per minute during normal hours and about 2000 HTTP requests per minute during peak hours, when most users are online at the same moment.

I hope it's OK.

OK, so do you want any feedback at all? I still believe that knowing if an edit was good or not is useful for you, even if it wasn't a random edit. In case ORES predicted it's a BAD edit, it was a false positive which needs to be fixed, and in case it predicted it's good, you can at least track how many good predictions you get.

Yes. We welcome the additional load that Huggle will place on the servers.

We are intending to serve well beyond this type of capacity. We have a load balancer and a small processing cluster behind the service, so it should easily handle the additional load. This service is still young though, so you should expect a little bit of downtime in the early days. FWIW, we've managed nearly no downtime in the last 3 months.

In case ORES predicted it's a BAD edit, it was a false positive which needs to be fixed, and in case it predicted it's good, you can at least track how many good predictions you get.

ORES won't be able to use them directly. In this case, we want to manually inspect these edits and iterate on ORES's feature set. It would be good to have a library of these to draw from when doing this review. They could be posted to a wiki page or captured in a data file. If it's easier for you to send them directly to us, I can make a simple HTTP end point that will record them.

Yes, that would be much simpler. It can be some dumb PHP script that doesn't do anything now but would do something in the future. Huggle can perform some simple HTTP requests with POST data if you want, but you shouldn't expect it to handle complex stuff like OAuth authentication, so this interface probably wouldn't be very secure. I can imagine something like

domain/feedback/good/{revid}
domain/feedback/revert/{revid}
domain/feedback/suspicious/{revid}

or something like that. I already have a similar interface to collect suspicious edits; it's defunct since the last wmflabs outage, but I will eventually recover it. As a security measure I track IP addresses of users who submit data there, so should there be any issue I can easily nuke all contributions by problematic users or revoke their access.
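
On the Huggle side something like this could be enough; a sketch only, assuming the hypothetical /feedback/{verdict}/{revid} routes proposed above actually existed:

#include <QNetworkAccessManager>
#include <QNetworkRequest>
#include <QNetworkReply>
#include <QObject>
#include <QUrl>
#include <QString>

// Sketch: fire-and-forget feedback request for one reviewed edit.
// verdict would be one of "good", "revert" or "suspicious".
static void SendFeedback(QNetworkAccessManager *manager, const QString &baseUrl,
                         const QString &verdict, long revid)
{
    QUrl url(baseUrl + "/feedback/" + verdict + "/" + QString::number(revid));
    QNetworkReply *reply = manager->get(QNetworkRequest(url));
    // No authentication and no retry; just release the reply once it finishes
    QObject::connect(reply, SIGNAL(finished()), reply, SLOT(deleteLater()));
}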

woah, woah woah...

As a security measure I track IP addresses of users who submit data

@Petrb Are you saying you collect IPs of Huggle users? Is it specified/presented to the end user of the product that you will do this? Otherwise it seems like a serious privacy breach.

While I agree that it is good to have privacy policies posted in plain view, you should always assume that a service you interact with is performing some type of logging of your request data. Logging IP/User-agent pairs is a common way to perform basic analytics and to address DOS attacks.

The WMF labs terms of use does not prevent this, but strongly suggests that if personal information (IP included) is recorded, that the tool author should "Clearly communicate to End Users a) that Private Information is being collected, b) how you will use it, and c) when you will delete it;" along with some other, related information. See https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use

FWIW, ORES does not track any information about requests other than raw counts ATM. We only have plans to do breakdowns by User-Agent.

For auth/security issues with the revisions that were inappropriately flagged for review, we might just have users append to a page on-wiki. That would use whatever auth strategy you are already using in Huggle to revert edits.

... for many edits both ORES and CB fail to predict vandalism, even for quite obvious vandalism

I should add a note that the model called "reverted" was trained to detect edits which are likely to be reverted, not to detect vandalism (which are not the same). Once the models "damaging" and "goodfaith" are trained (e.g. T108679), we can use the new scores on Huggle/ScoredRevisions instead of the "reverted" scores. That should produce better predictions for vandalism.

I just need to somehow figure out how to determine whether a wiki is supported by ORES or not. I guess for unknown wikis it just returns 404 errors.

You can check the URL http://ores.wmflabs.org/scores/ to get a list of the supported wikis.
And given a wiki, you can check
http://ores.wmflabs.org/scores/ptwiki/
http://ores.wmflabs.org/scores/enwiki/
etc... to see which models are available for that wiki
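
For illustration, a sketch of how Huggle could check that list once at startup instead of relying on 404 errors (the exact JSON layout of the /scores/ listing is an assumption here and should be checked against the service):

#include <QJsonDocument>
#include <QJsonObject>
#include <QJsonArray>
#include <QJsonValue>
#include <QByteArray>
#include <QString>

// Sketch: decide whether a wiki (e.g. "enwiki") is supported, given the body
// returned by http://ores.wmflabs.org/scores/. The listing is assumed to name
// the supported wikis either as top-level keys or in a "contexts" array.
static bool WikiIsSupported(const QByteArray &scoresListing, const QString &wiki)
{
    QJsonDocument document = QJsonDocument::fromJson(scoresListing);
    if (!document.isObject())
        return false;
    QJsonObject root = document.object();
    if (root.contains(wiki))
        return true;
    return root.value("contexts").toArray().contains(QJsonValue(wiki));
}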

BTW, before I flag this for deployment I would like to know if the web server of ORES is handling all these requests fine. Right now it processes around 100 HTTP requests per minute during peak hours per Huggle client, which in the future could be somewhere around 800 or more HTTP requests per minute during normal hours and about 2000 HTTP requests per minute during peak hours, when most users are online at the same moment.

I hope it's OK.

Also, I see that the code at scoring.cpp gets only a single score per request:

query->URL = this->GetServer() + WikiEdit->GetSite()->Name + "/reverted/" + QString::number(WikiEdit->RevID) + "/";

Maybe this can be optimized, since ORES can provide the scores of say 50 revisions in a single request (see the first URL above, at T108305#1569796).
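
For what it's worth, a sketch of how such a batched URL could be built on the Huggle side (the "revids" parameter and the "|" separator are assumptions based on the batch example referenced above, so they should be checked against the ORES documentation):

#include <QString>
#include <QStringList>
#include <QList>

// Sketch: one URL for several revisions instead of one request per revision,
// assuming ORES accepts something like /scores/enwiki/reverted/?revids=1|2|3
static QString BuildBatchUrl(const QString &server, const QString &wiki,
                             const QList<long> &revids)
{
    QStringList ids;
    foreach (long revid, revids)
        ids.append(QString::number(revid));
    return server + wiki + "/reverted/?revids=" + ids.join("|");
}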

It would be good to have a library of these to draw from when doing this review. They could be posted to a wiki page or captured in a data file. If it's easier for you to send them directly to us, I can make a simple HTTP end point that will record them.

E.g.: a user from ptwiki has reported some at m:Research talk:Revision scoring as a service#Misclassifications.

woah, woah woah...

As a security measure I track IP addresses of users who submit data

@Petrb Are you saying you collect IPs of Huggle users? Is it specified/presented to the end user of the product that you will do this? Otherwise it seems like a serious privacy breach.

Of course I do: https://en.wikipedia.org/wiki/Wikipedia:Huggle/Privacy

BTW, it's not Huggle that would collect the IP, it's the website that is being accessed by Huggle. It's your responsibility to understand that visiting a website allows the owner of that website to collect your IP.

I can't change or do anything about the privacy policies of websites (Wikipedia included) that are accessed by Huggle. If you have a problem with that, you probably shouldn't use the internet in the first place.

I should add a note that the model called "reverted" was trained to detect edits which are likely to be reverted, not to detect vandalism (which are not the same).

It seems to be a funny fact that most edits to Wikipedia are supposed to be reverted :P This probably isn't a problem with ORES but with people in general.

Maybe this can be optimized, since ORES can provide the scores of say 50 revisions in a single request (see the first URL above, at T108305#1569796).

Regarding "optimizations": that is not easily doable, as Huggle works with single edits and not groups of them. If it had to wait and group them, it would significantly affect performance, which is probably not wanted.