Maniphest T120170

[Epic] Paid editing (COI) detection model
Open, LowPublic

Assigned To

None

Authored By

	Halfak
	Dec 3 2015, 5:54 AM

Description

Could we build a model that detects paid (conflict-of-interest) editing? There's probably a general set of positive tone words that are likely in paid editing scenarios that we can pick up on. There are also likely some spatio-temporal patterns (e.g. an editor who edits just one page, or only pages that are closely linked).

Getting a training set should be relatively easy, as we can look for structured edit comments and CSD tags. halfak is releasing a dataset of COI editors; we'll model them against normal editors.

Related Objects

Status	Subtype	Assigned	Task
Open		None	T120170 [Epic] Paid editing (COI) detection model
Declined		None	T157041 Address memory usage issues for deploying PCFG-based features
Declined	Spike	None	T155111 [Spike] Investigate use of Apertium LTtoolbox API in labs/production

Event Timeline

Halfak created this task. Dec 3 2015, 5:54 AM

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 3 2015, 5:54 AM

Halfak set Security to None. Dec 3 2015, 5:55 AM

Halfak added a subscriber: Doc_James.

We have a number of large groups of confirmed paid editors. We have these 381 Orangemoody socks for example https://en.wikipedia.org/wiki/Orangemoody_editing_of_Wikipedia

I'm very unclear what this is intending to achieve?
Is it intended to find socks of Orangemoody specifically? If so why - is there evidence that it is necessary?
Is it to find socks of any known paid editor? If so why restrict it to paid editors? Why not socks of any prolific sockpuppeteer?
Is it to detect paid editing generally? If so, how do you propose to distinguish legitimate (i.e. disclosed) paid editors from those who are not disclosing their paid status, particularly as disclosure can be made on the userpage, edit summary or article talk page and there is no standard wording? How and why do you propose to distinguish between paid editors and non-paid editors who make similar edits (e.g. fans/enthusiasts of a subject)?

Paid editors often have a similar editing style. Checking disclosure would be done by humans. I guess it would more pick up promotional editing rather than paid editing. And promotional editing is undesirable regardless of whether or not payment is being received.

With respect to socks, the paid socks have a similar pattern that would be easier to work on than trying to pick up all socks.

Jbhunley subscribed. Dec 3 2015, 11:48 AM

Not a comment on this but why does Phabricator report my full name on the subscriber list? I provided it when I signed the confidentiality agreement but not for general publication.

On this - great idea! I have a few COI/Paid editors, some of which are likely SOCKS. It is not a big list but I can provide it for the training set.

I think there are probably some strong indicators of paid/COI/socking editors. Unusually strong editor maturity compared to account maturity is one. COI-prone topics is another. I'd say roughly in order of significance:

Large (>2k) new article created by an editor with one or more of the following indicators:
- Less than 50 mainspace edits
- Little or no talk history
- Little or no userpage content
- No other article creations
Referencing to prnewswire or other well known press release sources (see [https://www.dmoz.org/Computers/Internet/Web_Design_and_Development/Promotion/Press_Release_Services/ list])
Editors with a name strongly correlated to article title (e.g. XYZEditor or XYZWiki on article XYZ)
Editors with a name strongly correlated with any name in the infobox (trying to look for corp officers' names)
New articles with a "products" section or the like
New articles with an "awards" section or the like
New articles with a name similar to a deleted article (e.g. XYZ Inc. when XYZ was deleted)
Perennial problem subjects including professionals (lawyers, doctors); Internet companies; Asian media companies, film, television; music celebrities; real estate; consumer-grade investment services

Probably no one indicator alone is golden, but a weighted scoring system might work.
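
A weighted scoring system along these lines could be sketched as follows. This is only an illustration: the indicator names, the weights, and the cut-offs standing in for "little or no" talk history and userpage content are all made up here and would need tuning against real cases.

```python
from typing import Mapping

# Hypothetical weights, loosely following the "order of significance" above.
# None of these numbers come from data.
WEIGHTS = {
    "large_article_by_thin_account": 3.0,
    "press_release_refs": 2.5,
    "username_matches_title": 2.0,
    "username_matches_infobox_name": 2.0,
    "has_products_section": 1.0,
    "has_awards_section": 1.0,
    "title_similar_to_deleted": 1.5,
    "coi_prone_topic": 1.0,
}


def large_article_by_thin_account(article_bytes: int, mainspace_edits: int,
                                  talk_edits: int, userpage_bytes: int,
                                  other_creations: int) -> bool:
    """Large (>2k) new article plus at least one 'thin account' sign.

    The talk_edits and userpage_bytes cut-offs are assumed stand-ins for
    "little or no" history/content.
    """
    if article_bytes <= 2000:
        return False
    return (mainspace_edits < 50
            or talk_edits <= 2
            or userpage_bytes <= 100
            or other_creations == 0)


def coi_score(indicators: Mapping[str, bool]) -> float:
    """Sum the weights of every indicator that fired."""
    return sum(w for name, w in WEIGHTS.items() if indicators.get(name, False))
```

A reviewer-facing tool would then sort new articles by `coi_score` and leave the judgement call to a human.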

Wow! This task got a lot of attention. :)

Thank you @Brianhe for proposing a list of features that we could explore. It seems likely that these features will be useful for our first experimentation. This is exactly the kind of help we need to work effectively on this problem.

I'm hoping to discuss the project with the team at our meeting tomorrow. We'll come back with ideas on what else we need to get started.

To the list of indicators Brianhe gave I would add a large number of citations/references in proportion to the article size. I see a lot of citation overload on likely COI articles, particularly biographies. The paid editors are learning. I think most have figured out how to avoid a CSD by making a claim of significance, and now many have learned to avoid a BLPPROD by citing non-RS, and they tend to go overboard.
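
One rough way to measure that proportion, assuming we only have the raw wikitext, is references per kilobyte. The function name and the regex are my own; the regex counts every opening `<ref ...>` tag (including reuses of a named reference) and ignores `<references />`.

```python
import re

# Matches "<ref>" and "<ref name=...>", but not "<references />".
REF_TAG = re.compile(r"<ref[\s>]", re.IGNORECASE)


def refs_per_kb(wikitext: str) -> float:
    """Number of <ref> tags per kilobyte of UTF-8 encoded wikitext."""
    size_kb = len(wikitext.encode("utf-8")) / 1024
    if size_kb == 0:
        return 0.0
    return len(REF_TAG.findall(wikitext)) / size_kb
```

What counts as "overload" would have to come from comparing against ordinary articles of similar size.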

There is an academic paper which concluded that "stylometric features ... work very well for detecting promotional articles in Wikipedia" with machine learning (trained on articles tagged as {{Advert}} etc., with peacock and weasel terms turning out to be useful features for the classifier, among others). See my short summary here: https://meta.wikimedia.org/wiki/Research:Newsletter/2013/September#Briefly

https://en.wikipedia.org/wiki/User:COIBot is also related.

Personal opinion, as an editor who has spent quite a bit of time mitigating problems caused by paid/COI editing: Anything that effectively lightens the workload / force-multiplies the efforts of the Wikipedians taking care of these unpleasant tasks would be super useful.

And we have https://en.wikipedia.org/wiki/Wikipedia:Conflict_of_interest/Noticeboard which of course has more data / people with experience in this area.

He7d3r subscribed. Dec 3 2015, 7:36 PM

Jbribeiro1 awarded a token. Dec 3 2015, 8:49 PM

Jbribeiro1 subscribed.

Smartse subscribed. Dec 3 2015, 9:02 PM

Teles subscribed. Dec 3 2015, 9:38 PM

Thanks for getting the ball rolling on this. Here are a few of my ideas:

*Does anyone know a way to filter contributions to only include removals similar to how contribution surveyor sorts major additions? Most of the removals I make are tidying up after paid editors and this would help find diffs containing problematic content.
*If we can filter the deletion log by G11 articles, we could get a large data set together. With some curation we could select the less clear-cut articles, which are more likely to be written by professional paid editors.
*The ratio of business and BLP articles edited relative to the site average might be a good metric, since these are the predominant problem areas. I'm sure we can think of keywords associated with these which, in combination with Brianhe's new article criteria, would catch a lot.
*The citation/link discussion is interesting - how about a metric like containing few whitelisted reliable sources (do we know which are the most cited websites on WP?) or many references to sites that are rarely cited?
*There were more ideas in the discussion in September.
*Edit filters 148, 149 & 354 were set up to try and detect some similar types of editing but they aren't very sophisticated.
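
The citation/link metric could look something like the sketch below: the share of an article's external links that point at whitelisted sources versus press-release services. The domain sets are placeholders I made up for illustration (only prnewswire comes from the discussion above); real lists would have to be curated.

```python
import re
from urllib.parse import urlparse

# Crude URL matcher for wikitext; stops at whitespace and wiki/HTML delimiters.
URL = re.compile(r"https?://[^\s\]|<>\"]+")


def domain(url: str) -> str:
    """Hostname with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def source_mix(wikitext: str, reliable: set, press_release: set) -> dict:
    """Fraction of external links that fall in each domain set."""
    domains = [domain(u) for u in URL.findall(wikitext)]
    n = len(domains)
    if n == 0:
        return {"reliable": 0.0, "press_release": 0.0}
    return {
        "reliable": sum(d in reliable for d in domains) / n,
        "press_release": sum(d in press_release for d in domains) / n,
    }


# Placeholder lists, for illustration only.
RELIABLE = {"reliable-news.example"}
PRESS_RELEASE = {"prnewswire.com"}
```

For example, `source_mix(text, RELIABLE, PRESS_RELEASE)` on an article with many links but a "reliable" share near zero would be one of the signals worth a second look.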

Finally, can we make this list invite-only? It's a bit pointless to share our tactics in the open when we're up against deceptive organisations.

If you need more training data, many archived COIN cases already have a list of correlated editors & articles. I put a sample list of 10 cases on my en.wp user talk in reply to Doc James.

• gpaumier updated the task description. Dec 3 2015, 11:29 PM

• gpaumier subscribed.

Ixocactus subscribed. Dec 4 2015, 12:38 AM

In T120170#1850186, @Smartse wrote:

*Does anyone know a way to filter contributions to only include removals similar to how contribution surveyor sorts major additions? Most of the removals I make are tidying up after paid editors and this would help find diffs containing problematic content.

You can use the "rightsfilter" gadget for that. E.g. search for \) \. \. \(-:
https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Stefan2&offset=&limit=500&target=Stefan2&withJS=MediaWiki%3AGadget-rightsfilter.js&lifilter=1&lifilterexpr=%5C)+%5C.+%5C.+%5C(-
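
To show what that expression is doing: it looks for the `) . . (-` sequence that precedes a negative size change in a rendered contributions line. The same pattern can be tried offline in Python. The sample lines below are invented approximations of the rendered list, and this assumes the size is shown with an ASCII hyphen, as in the expression above.

```python
import re

# Same expression as in the gadget URL: ") . . (-" before a negative byte count.
REMOVAL = re.compile(r"\) \. \. \(-")

# Made-up lines imitating Special:Contributions output.
lines = [
    "10:02, 3 December 2015 (diff | hist) . . (-1,204) . . Example Corp",
    "09:41, 3 December 2015 (diff | hist) . . (+87) . . Example Corp",
    "09:30, 3 December 2015 (diff | hist) . . (0) . . Example Person",
]

removals = [line for line in lines if REMOVAL.search(line)]
```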

*If we can filter the deletion log by G11 articles, we could get a large data set together. With some curation we could select the less clear-cut articles, which are more likely to be written by professional paid editors.

Filter edits containing G11
https://en.wikipedia.org/w/index.php?title=Special:Log/delete&offset=&limit=250&type=delete&user=&withJS=MediaWiki%3AGadget-rightsfilter.js&lifilter=1&lifilterexpr=G11
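
For building a data set rather than browsing, the same filter could be applied to deletion log entries pulled through the MediaWiki Action API instead of the gadget. This is a sketch of that alternative, not what the gadget does: the parameters are for the API's `list=logevents` module, the filtering runs on already-fetched entries, and the sample entries are invented.

```python
import re

# Query parameters for api.php (list=logevents); fetching is left out here.
LOGEVENTS_PARAMS = {
    "action": "query",
    "list": "logevents",
    "letype": "delete",
    "leprop": "title|user|timestamp|comment",
    "lelimit": "max",
    "format": "json",
}

# Whole-token match so that e.g. "G110" does not count.
G11 = re.compile(r"\bG11\b")


def g11_deletions(log_entries):
    """Keep log entries whose deletion comment cites G11."""
    return [e for e in log_entries if G11.search(e.get("comment", ""))]


# Invented entries shaped like logevents results.
SAMPLE = [
    {"title": "Example Corp", "comment": "[[WP:CSD#G11|G11]]: Unambiguous advertising or promotion"},
    {"title": "Example Band", "comment": "[[WP:CSD#A7|A7]]: No credible indication of importance"},
    {"title": "Example Page", "comment": ""},
]
```

`g11_deletions(SAMPLE)` keeps only the first entry; the curation step Smartse describes would still be manual.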

JEumerus subscribed. Dec 9 2015, 1:40 PM

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). Mar 30 2016, 2:54 PM

Halfak moved this task from Unsorted to Ideas on the Machine-Learning-Team board. Mar 30 2016, 3:11 PM

Halfak renamed this task from [Discuss] Potential for paid editing detection model to Paid editing (COI) detection model. Aug 18 2016, 9:27 PM

Halfak triaged this task as Low priority.

Halfak updated the task description.

Halfak moved this task from Ideas to Research & analysis on the Machine-Learning-Team board. Sep 22 2016, 2:55 PM

Halfak moved this task from Research & analysis to Ideas on the Machine-Learning-Team board. Sep 22 2016, 2:58 PM

Halfak added projects: artificial-intelligence, research-ideas. Jan 20 2017, 8:40 PM

I've trained a PCFG on content from Spam articles and it seems to work pretty well. See https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-01 and https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_draft_quality/Work_log/2016-12-03

Ricordisamoa subscribed. Jan 22 2017, 12:47 AM

• Tbayer mentioned this in T158476: [Open Question] Automatically detecting accounts that do paid editing activity. Mar 23 2017, 4:42 PM

awight renamed this task from Paid editing (COI) detection model to [Epic] Paid editing (COI) detection model. Mar 13 2018, 8:16 PM

awight added a subtask: T157041: Address memory usage issues for deploying PCFG-based features.

awight edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

• gpaumier unsubscribed. Mar 13 2018, 8:19 PM

awight moved this task from Parked to Non-Epic on the Machine-Learning-Team (Active Tasks) board. Mar 26 2018, 2:38 PM

awight updated the task description. Mar 26 2018, 3:02 PM

Capankajsmilyo subscribed. Apr 18 2018, 2:38 PM

Cirdan subscribed. May 9 2018, 2:38 PM

The title of this epic should be changed to reference paid advocacy editing, which is in infringement of Wikimedia policies, as opposed to (self-disclosed) paid editing or COI-editing, which the above policies allow under some conditions.

Although there is also the closely related issue of undisclosed paid editing, which is in infringement of Wikimedia's Terms of Use.

@DarTar, from a modeling perspective, I don't think we care whether it is disclosed or not. The model should not stand in the place of policy but rather be a tool to help policy enforcement.

@Halfak I see your point. I also feel this is going to be a much more delicate issue (compared to other applications of ORES) which may produce a significant negative impact on policy-compliant editors if say, third-party tools use a model as is, without realizing that potentially a very large fraction of cases flagged by this model are legitimate.

Can you expand as to why from a modeling perspective you don't care about disclosed vs undisclosed? The training data you'll use will depend directly on the answer to this question, right? For example, WiR edits are often considered disclosed (thereby acceptable) examples of paid editing. Would they be included in a training set of positive examples of paid editing?

All the details being fed in are undisclosed paid promotional editors. Thus it will be geared towards the undisclosed type. I do not consider WiR to be typically "paid editors" and this is reflected in the Q and A of our TOU.

James

Re. "disclosed vs. undisclosed" -- I think disclosure is a problem that is better handled by structured data than modeling. In this case, it should be trivial to detect and scope someone's disclosure. Thus detecting an undisclosed paid editor would be an intersection of the "is this editor advocating some point of view" model and the "have they disclosed a COI" model.

Generally, I expect that properly disclosed COI editors will rarely make edits that appear as though they are advocating some point of view, and that in some cases people without a clear COI will accidentally (via non-encyclopedic tone) end up advocating some point of view.
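
As a toy illustration of that intersection: take an advocacy score from a model and a disclosure flag from structured data, and surface only editors who score high and have no disclosure on record. The function name, the tuple layout, and the 0.8 threshold are all assumptions for the sake of the example.

```python
from typing import Iterable, List, Tuple


def undisclosed_advocacy(candidates: Iterable[Tuple[str, float, bool]],
                         threshold: float = 0.8) -> List[str]:
    """Editors whose advocacy score is high AND who have no disclosure.

    Each candidate is (editor, advocacy_score, has_disclosed). The score
    would come from a model; the flag from structured disclosure data.
    """
    return [editor for editor, score, disclosed in candidates
            if score >= threshold and not disclosed]
```

With this split, a disclosed editor is never flagged by the combination, however promotional the model thinks the text is; that case would be handled separately as a tone problem.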

I've been looking into https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch and I think that's going to be a really interesting source of signal for ORES :) If nothing else, it'll help us flag language that we'd rather avoid.
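
A first cut at turning that page into a signal might be a simple rate of flagged terms per token. The word set below is a tiny hand-picked sample of puffery-style terms for illustration, not the page's actual list, and the tokenizer is deliberately naive.

```python
import re

# Illustrative sample only; a real feature would use a curated list.
PUFFERY = {"legendary", "renowned", "award-winning", "world-class",
           "prestigious", "leading"}

# Lowercase words, keeping internal hyphens ("world-class").
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def puffery_rate(text: str) -> float:
    """Share of tokens that appear in the PUFFERY set."""
    tokens = TOKEN.findall(text.lower())
    if not tokens:
        return 0.0
    return sum(t in PUFFERY for t in tokens) / len(tokens)
```

For instance, "Acme is a world-class, award-winning firm." has two flagged tokens out of six.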

We have a dataset of known paid editors: https://figshare.com/articles/Known_Undisclosed_Paid_Editors_English_Wikipedia_/6176927

Also some thoughts on feature engineering:

Build from the types in the dataset
Articles about western men seem to dominate paid editor activities
Articles that are very unpopular also dominate paid editor activities
The words to watch noted above
Creating a bio/company as a first article (look for founder-company pairs, and they link together)
Picture of company/founder uploaded to commons with the article around the same time of article creation
Picture of professional quality (high resolution and well-framed [not a side-angle at a podium])
Image of a logo
Presence of infobox (newcomers don't usually add infoboxes)
Prevalence of external links
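
The last two items are easy to pull straight from wikitext. A minimal sketch, with my own function and key names, and with "prevalence" interpreted as external links per 100 whitespace-separated words:

```python
import re

INFOBOX = re.compile(r"\{\{\s*infobox\b", re.IGNORECASE)
EXTERNAL_LINK = re.compile(r"https?://", re.IGNORECASE)


def wikitext_features(wikitext: str) -> dict:
    """Infobox presence and external-link prevalence for one revision."""
    words = len(wikitext.split())
    links = len(EXTERNAL_LINK.findall(wikitext))
    return {
        "has_infobox": bool(INFOBOX.search(wikitext)),
        "external_links": links,
        "links_per_100_words": 100 * links / words if words else 0.0,
    }
```

The image-related features (upload timing on Commons, resolution, logos) would need data from outside the article text and are not covered here.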

saurabhbatra96 subscribed. Oct 24 2018, 4:54 PM

nettrom_WMF subscribed. Jan 8 2019, 3:41 PM

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). Apr 2 2019, 9:24 PM

Harej moved this task from Ideas to Epic on the Machine-Learning-Team board. Apr 3 2019, 4:47 AM

I've been working on detecting UPE in new articles with some researchers for the last few months and we have some promising results from the training of the model. We're now testing on some new articles and will manually review the positive results to judge how well it does in practice. There are some extra features that I am looking to add myself that I have held back for now as the researchers would like to publish their findings.

Wonderful news Smartse. Anything further you need help on? I will connect you with a student who is also interested in this work.

Not for now, but there may be work to do at some point to evaluate the output. It should be possible to use manual reviewing to improve the model (I think this was how cluebot was trained). Yes please put us in touch.

I take it that you are going to consider both false positives and false negatives - especially when it comes to declared paid editors - when training this, yes?

Yes of course. To be clear though, the model would only ever be a tool to assist manual reviewing, so there is no need to distinguish undisclosed and declared paid editors. My interest is in being able to assess thousands of articles and find those most likely to have been written by UPEs.

I am concerned that it will result in a lot of false positives that way.

Is this tool really detecting paid editing or actually promotional editing? The two are not the same thing, merely overlapping sets (not all promotional editing is paid, not all paid editing is promotional).

So the data set is

Newly created accounts that make around 10 minor edits, then go dormant for a period of time (at least a week)
They then create a perfect article about a non-notable / barely notable subject, and the article is promotional in nature
The account then never edits again.

These accounts / articles typically are paid for in nature and the accounts involved are socks.
James
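
The account pattern described above can be written down as a simple rule over an account's edit history, which may help when labelling candidates. This is a sketch under assumptions: the `Edit` record, the cap of 15 warm-up edits, and the one-day grace period after creation are invented, and it ignores edit size and whether the article is actually promotional.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Edit:
    timestamp: datetime
    creates_article: bool = False


def matches_sleeper_pattern(edits: List[Edit],
                            min_gap: timedelta = timedelta(days=7),
                            max_warmup: int = 15,
                            grace: timedelta = timedelta(days=1)) -> bool:
    """A few warm-up edits, a dormant gap, one article creation, then silence."""
    ordered = sorted(edits, key=lambda e: e.timestamp)
    creations = [i for i, e in enumerate(ordered) if e.creates_article]
    if len(creations) != 1:
        return False
    idx = creations[0]
    warmup, creation, later = ordered[:idx], ordered[idx], ordered[idx + 1:]
    if not (1 <= len(warmup) <= max_warmup):
        return False
    # Dormancy: time between the last warm-up edit and the creation.
    if creation.timestamp - warmup[-1].timestamp < min_gap:
        return False
    # "Never edits again", allowing touch-ups shortly after creation.
    return all(e.timestamp - creation.timestamp <= grace for e in later)
```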

(EC) @JEumerus/@Thryduulf, false positives/negatives don't come into play until there is a model. We'll certainly be looking at fitness statistics and manually reviewing false positives once that model is first built. This has been the pattern for vandalism fighting ORES models and I think it is wholly appropriate for this modeling work as well. Once we know what the model is able to detect and what it can be useful for, then we can discuss "tools" and use cases.

ToBeFree awarded a token. Jun 23 2019, 5:56 PM

Chtnnh subscribed. Oct 2 2020, 5:50 AM

• ACraze moved this task from Epic to Backlog/Other on the Machine-Learning-Team board. Jan 19 2021, 8:40 PM

Vahurzpu subscribed. Feb 22 2021, 10:21 PM

I have created a model that ranks vulnerable articles from most to least promotional. It seems to be very functional. Sign-up here if you'd like to try it out: https://en.wikipedia.org/w/index.php?title=User_talk:Sam_at_Megaputer&oldid=1015888690.

elukey closed subtask T157041: Address memory usage issues for deploying PCFG-based features as Declined. May 29 2023, 9:20 AM

Harej added a project: Wikimedia-Medicine. Jan 17 2024, 3:29 AM

Harej moved this task from Backlog: Other to Backlog: COI Detection on the Wikimedia-Medicine board. Jan 23 2024, 5:54 PM

Harej removed a project: Wikimedia-Medicine. Jan 29 2024, 10:28 PM
