
Spam blacklist by-pass right for agreed batch upload projects
Closed, Duplicate · Public

Description

I have been repeatedly hitting the spam blacklist with disallowed links when running the Portable Antiquities Scheme batch upload project (around 400,000 files; see link below). The reason the links are used is that curators and reporters of finds have been using link shortening when referring to well-respected sites, such as the British Museum. These links are then carried over when importing to Commons, as we are using the curators' descriptions to help describe the imported photographs on Commons.

Though it is possible to trap specific links (like bit.ly) and reword them to "hide" them from the spam filter, the source database is not a risk to Commons. Because the curators are free to add any links they find useful to the descriptions, it is unpredictable for a batch uploader to try to work around them, and "hiding" the links does not actually solve any real problem.

This proposal is that we should either whitelist an upload source site when using API upload calls, so that at the point of image upload the associated image text is exempt from the spam filter, or allow the uploading user to apply for a temporary account exemption from the spam blacklist for the duration of their upload project. The exemptions could be managed via bureaucrat approval and would pose a very low risk to Wikimedia Commons, compared to the lost value when we have failed uploads from sites like the Portable Antiquities Scheme that are unlikely to be re-done due to a lack of skilled volunteers to address the blacklist 'bounces' on a case-by-case basis.

Event Timeline

Fae created this task. · Feb 12 2017, 1:10 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 12 2017, 1:10 PM
Yann awarded a token. · Feb 12 2017, 1:51 PM

This is impossible. It would make the file descriptions non-editable, since any later edit would be blocked by the spam blacklist again.

With respect, nothing is impossible, it's just a matter of how much effort it would require.

I agree with Fae that it's worth looking into. There really needs to be a way to make the system smarter by allowing a workaround. Maybe the blacklist process could look at the status of the person and then allow the edit to proceed. I'm not really sure what the right way to do this is, since I suspect there are multiple ways to do it, but it would be a shame if we excluded good content because of an inflexible process.

Fae added a comment. · Feb 12 2017, 7:41 PM

With regard to later edits, my edit today to https://commons.wikimedia.org/w/index.php?title=File:Marcha_das_Mulheres_Negras_(23137414611).jpg&action=history would have been impossible, as the text contained a bit.ly link. So, I dispute "impossible".

With respect, nothing is impossible, it's just a matter of how much effort it would require.

For the former suggestion: Impossible isn't the right word; the right word is "terrible", having very bad consequences if it's done.
For the latter: see T36928: Create a user right that allows ignoring the spam blacklist, an already declined task.

Instead, I'll suggest expanding the bit.ly short URLs in your code.

With regard to later edits, my edit today to https://commons.wikimedia.org/w/index.php?title=File:Marcha_das_Mulheres_Negras_(23137414611).jpg&action=history would have been impossible, as the text contained a bit.ly link. So, I dispute "impossible".

That would be a bug, not a shiny example to follow.

Billinghurst added a subscriber: Billinghurst. · Edited · Feb 13 2017, 9:41 AM

Why can't you resolve all the underlying URLs prior to loading them into your upload process? That process could be done by you today without others having to change any MediaWiki code (a rough sketch follows at the end of this comment).

With respect, nothing is impossible, it's just a matter of how much effort it would require.
I agree with Fae that it's worth looking into. There really needs to be a way to make the system smarter by allowing a workaround. Maybe the blacklist process could look at the status of the person and then allow the edit to proceed. I'm not really sure what the right way to do this is, since I suspect there are multiple ways to do it, but it would be a shame if we excluded good content because of an inflexible process.

These are (abused/abusable) URL shorteners; they are not the root URLs of the product. The product is not blocked, and the base URL presumably is not blocked (though it is possible); the target is the URL shortener. Blacklisted URLs are blocked all the time for many reasons, and there is a message to the user. Please don't hyperextend the consequences.
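For illustration, a minimal sketch of the pre-upload resolution suggested above, assuming the Python `requests` library; the shortener list, regex, and function name are illustrative and not taken from any existing upload tool.

```python
# A minimal sketch of resolving shortened URLs in description text before
# upload. The shortener list and regex are illustrative only.
import re
import requests

SHORTENER_HOSTS = ("bit.ly", "tinyurl.com", "goo.gl")  # illustrative list

def expand_short_urls(text):
    """Replace known shortener links in wikitext with their final targets."""
    pattern = re.compile(
        r"https?://(?:%s)/[^\s\]]+" % "|".join(re.escape(h) for h in SHORTENER_HOSTS)
    )

    def _resolve(match):
        url = match.group(0)
        try:
            # HEAD with redirects followed; resp.url is the final destination.
            resp = requests.head(url, allow_redirects=True, timeout=10)
            return resp.url
        except requests.RequestException:
            return url  # leave the link untouched if resolution fails

    return pattern.sub(_resolve, text)

# Usage: run over each description before passing it to the upload API.
# description = expand_short_urls(description)
```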

Fae added a comment. · Feb 13 2017, 4:30 PM

@Billinghurst that's the point of this task: to avoid volunteers like me having to write ever-expanding amounts of code to bypass blacklists. We have the same problem with filename blacklists. I already have around 10 types of error trap in my upload process, and I see little benefit in creating my own unique parser for all metadata fields on a GLAM import when the results pose absolutely no risk whatsoever to Wikimedia Commons or our reusers and readers.

As a design and operating principle, it is common sense to put the workload on the system, rather than on each volunteer on every content-generating project that helps our mission by running approved and well-managed batch uploads.

To repeat, the risk here is zero. The blacklist in this situation is an obstacle, not an aid.

To repeat, the risk here is zero. The blacklist in this situation is an obstacle, not an aid.

The risk here is that the file page cannot be edited at all. If it can be edited, that's a bug and should not happen.

As a design and operating principle, it is common sense to put the workload on the system, rather than on each volunteer on every content-generating project that helps our mission by running approved and well-managed batch uploads.

The basic design principle is to have unified pass/fail logic for all cases, whether they are uploaded via the API or Special:Upload, and whether they are batch uploads or single uploads. Having exceptions is against this principle.

To repeat, the risk here is zero. The blacklist in this situation is an obstacle, not an aid.

The risk here is that the file page cannot be edited at all. If it can be edited, that's a bug and should not happen.

This is not a bug: SpamBlacklist looks for added (clickable) URLs and checks only those against the blacklist, to avoid making pages uneditable when new URLs get added to the blacklist.

zhuyifei1999 added a comment. · Edited · Feb 13 2017, 6:42 PM

This is not a bug: SpamBlacklist looks for added (clickable) URLs and checks only those against the blacklist, to avoid making pages uneditable when new URLs get added to the blacklist.

Would you mind pointing to a ticket or patchset that changed this behavior? According to T36928, T134453, and my personal experience, this wasn't true a few years ago; SpamBlacklist checked all URLs on a page.

With respect, nothing is impossible, it's just a matter of how much effort it would require.
I agree with Fae that it's worth looking into. There really needs to be a way to make the system smarter by allowing a workaround. Maybe the blacklist process could look at the status of the person and then allow the edit to proceed. I'm not really sure what the right way to do this is, since I suspect there are multiple ways to do it, but it would be a shame if we excluded good content because of an inflexible process.

Right, okay, I said "impossible" as a shorthand for "requiring an unreasonable amount of effort". This task was linked on T157436 and I got the impression you're (you in general, not you personally) trying to make me (or another developer) put in all that effort, which triggered an immediate objection :)

If you're willing to put in an unreasonable amount of effort into this yourself (you personally :) ), I can even suggest a workaround for you: list every link you want to be allowed in MediaWiki:Spam-whitelist. But this would also require effort from all the maintainers of MediaWiki:Spam-whitelist later to deal with all your links.

Although, if SpamBlacklist actually ignores links that already exist on the page, this might be doable. I was not aware of that behavior. You might want to reopen T36928 with new facts :)

matmarex removed a subscriber: matmarex. · Feb 13 2017, 9:55 PM
Fae added a subscriber: matmarex. · Feb 13 2017, 10:13 PM

@matmarex This task is not about adding links to the spam-whitelist for a single GLAM upload project that has already completed. It is pointless to list individual bit.ly links, when what is requested is a generic solution in order to support and encourage "officially agreed" batch upload projects.

You're free to submit a patch to this extension. Otherwise, unless SpamBlacklist ignores the links already present, I'm inclined to close this ticket as declined for the same reason as T36928.

IMO, the proper solution: just expand the URLs before uploading...

This is not a bug: SpamBlacklist looks for added (clickable) URLs and checks only those against the blacklist, to avoid making pages uneditable when new URLs get added to the blacklist.

Would you mind pointing to a ticket or patchset that changed this behavior? According to T36928, T134453, and my personal experience, this wasn't true a few years ago; SpamBlacklist checked all URLs on a page.

FYI @zhuyifei1999: the blacklist checks the addition of a new URL; when it already exists on the page it is bypassed. BUT if someone moves the URL within the context of the other text on the page, that can often be seen as a change of text and flagged as a new addition. That is my practical experience: there are many talk pages on enWP that are successfully edited, yet try to archive parts of the page containing a blacklisted URL and NADA, it fails.

With respect, nothing is impossible, it's just a matter of how much effort it would require.
I agree with Fae that it's worth looking into. There really needs to be a way to make the system smarter by allowing a workaround. Maybe the blacklist process could look at the status of the person and then allow the edit to proceed. I'm not really sure what the right way to do this is, since I suspect there are multiple ways to do it, but it would be a shame if we excluded good content because of an inflexible process.

Right, okay, I said "impossible" as a shorthand for "requiring an unreasonable amount of effort". This task was linked on T157436 and I got the impression you're (you in general, not you personally) trying to make me (or another developer) put in all that effort, which triggered an immediate objection :)
If you're willing to put in an unreasonable amount of effort into this yourself (you personally :) ), I can even suggest a workaround for you: list every link you want to be allowed in MediaWiki:Spam-whitelist. But this would also require effort from all the maintainers of MediaWiki:Spam-whitelist later to deal with all your links.
Although, if SpamBlacklist actually ignores links that already exist on the page, this might be doable. I was not aware of that behavior. You might want to reopen T36928 with new facts :)

Can I say that there is not even community consensus for such a change, nor any discussion. There is no point defining a process that involves rights allocations at a particular wiki when that wiki may not wish to have any part in the process, in part or in full. The bug may stay open as a task, though it should have no priority, nor should it be assigned to a Wikimedia wiki at this point in time.

If this is desired, there is a mechanism and process for allocating resources to community requirements, which takes place annually. That would identify community consensus and need.

This can be resolved immediately by resolving the underlying URLs (which is surely not an impossible task, nor even a complex one), rather than writing workarounds for blacklisted URLs for a small subgroup of people at one wiki, where there is neither process, identified community need, nor consensus for the suggested change.

Fae added a comment. · Edited · Feb 14 2017, 12:31 PM

@Billinghurst See the discussion at https://commons.wikimedia.org/wiki/Commons:Bureaucrats%27_noticeboard#Upload_project_spam_blacklist_exception_.27right.27 which was started at the same time this task was opened.

With regard to expanding URLs: this is an unnecessary burden on batch uploaders, realistically adds no protection against any threat to Commons, and cannot be done when uploaders are using tools that they have not personally written.

You may recall that I originally led the volunteer-based funding for the GLAM wiki toolset and stayed on the steering group throughout its development. At that time there was, and remains, an overwhelming consensus that mass uploads by institutions and volunteers to Wikimedia Commons should be possible for the untrained amateur, not just the domain of experienced programmers. As a design objective for the Wikimedia Foundation, I believe this is still central to the values that drive our future development decisions and the funding strategy.

It is not yet a proposal of the community, and it belongs to a larger scope than bureaucrats.

With regard to expanding URLs: this is an unnecessary burden on batch uploaders, realistically adds no protection against any threat to Commons, and cannot be done when uploaders are using tools that they have not personally written.

I disagree on that point. There are URL shorteners that are not wanted, and there are original sources that are not wanted. The global community has been blocking these for years, to the point that URL shorteners are blocked on sight.

Resolving to the base URL should clearly be the preferred source, though that argument belongs at Commons, not here.

You may recall that I originally led the volunteer-based funding for the GLAM wiki toolset and stayed on the steering group throughout its development. At that time there was, and remains, an overwhelming consensus that mass uploads by institutions and volunteers to Wikimedia Commons should be possible for the untrained amateur, not just the domain of experienced programmers. As a design objective for the Wikimedia Foundation, I believe this is still central to the values that drive our future development decisions and the funding strategy.

All of our uploaders are untrained volunteers. That is not an argument for the ticket. The community also talks about good practice and reliable sources; URL shorteners are not reliable or original sources.

Please stop adding me back to this task, I regret having commented on it.

matmarex removed a subscriber: matmarex. · Feb 14 2017, 2:36 PM

[offtopic]

In T157897#3025207, matmarex wrote:

Please stop adding me back to this task, I regret having commented on it.

That's due to T96464: Upon edit, a task description which mentions a Phab user (re)adds that Phab user to CC/Subscribers field.
Removing the @ in front of the user name is a workaround.

Fae added a comment. · Feb 15 2017, 11:35 AM

I suggest this task is closed. There's too much pushback against the task description for this to be realistic. If someone wishes to propose a new task aimed at the blacklist filter process helpfully parsing URL redirects, perhaps for a limited number of iterations, and checking those against the blacklist again before rejecting a text, that would be positive (a rough sketch of that idea follows below).

In the meantime I'll default to skipping images where this is a problem and ignore this issue. The quantities are not worth the investment of Commons volunteer time needed to parse all incoming metadata for potential short URLs, the nature of which will vary on each batch upload.
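For illustration only, a rough sketch of the bounded-iteration check proposed above; this is not how SpamBlacklist currently behaves, and the function names, pattern format, and hop limit are assumptions.

```python
# A rough sketch: follow redirects for a bounded number of hops and re-check
# each hop against the blacklist before rejecting. `requests` is assumed.
import re
import requests

MAX_HOPS = 3  # "a limited number of iterations"

def matches_blacklist(url, patterns):
    """True if the URL matches any blacklist regex fragment."""
    return any(re.search(p, url) for p in patterns)

def reject_url(url, patterns, max_hops=MAX_HOPS):
    """Reject only if the URL and every resolved hop stay blacklisted."""
    current = url
    for _ in range(max_hops):
        if not matches_blacklist(current, patterns):
            return False  # a clean hop was found, do not reject
        try:
            resp = requests.head(current, allow_redirects=False, timeout=10)
        except requests.RequestException:
            break
        target = resp.headers.get("Location")
        if not target:
            break  # no further redirect to follow
        current = target
    return matches_blacklist(current, patterns)

# e.g. reject_url("https://bit.ly/xyz", [r"\bbit\.ly\b"]) would not reject
# if the link resolves to a host that is not itself blacklisted.
```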

FYI: video2commons had two related issues, #72 and #65, although the solution is entirely different.

This is not a bug: SpamBlacklist looks for added (clickable) URLs and checks only those against the blacklist, to avoid making pages uneditable when new URLs get added to the blacklist.

Would you mind pointing to a ticket or patchset that changed this behavior? According to T36928, T134453, and my personal experience, this wasn't true a few years ago; SpamBlacklist checked all URLs on a page.

Seems to be T3505

This is not a bug: SpamBlacklist looks for added (clickable) URLs and checks only those against the blacklist, to avoid making pages uneditable when new URLs get added to the blacklist.

Would you mind pointing to a ticket or patchset that changed this behavior? According to T36928, T134453, and my personal experience, this wasn't true a few years ago; SpamBlacklist checked all URLs on a page.

FYI @zhuyifei1999: the blacklist checks the addition of a new URL; when it already exists on the page it is bypassed. BUT if someone moves the URL within the context of the other text on the page, that can often be seen as a change of text and flagged as a new addition. That is my practical experience: there are many talk pages on enWP that are successfully edited, yet try to archive parts of the page containing a blacklisted URL and NADA, it fails.

Text moves are not a problem. The spam blacklist is using the externallinks table to determine the old links. It is not using text comparison to determine added links.

It is possible that links added by templates are newly checked when someone edits the page, because sometimes the externallinks table is not up to date (long job queue) and those links then count as part of the added links, or when T19154 is in effect for the page.

Just for the record: there is a spamblacklist API module where URLs can be checked against the blacklist.
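For illustration, a minimal sketch of querying that module on Commons, assuming it accepts a pipe-separated url parameter and reports result/matches fields as documented for the extension; the exact response shape may differ.

```python
# A minimal sketch of checking candidate URLs against the wiki's blacklist
# via the spamblacklist API module mentioned above. `requests` is assumed,
# and the response fields ("result", "matches") are taken on trust from the
# extension's documentation.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def blacklist_matches(urls):
    """Return the blacklist entries matched by any of the given URLs."""
    resp = requests.get(API, params={
        "action": "spamblacklist",
        "url": "|".join(urls),
        "format": "json",
    }, timeout=10)
    data = resp.json().get("spamblacklist", {})
    # "result" is "ok" when nothing matches, "blacklisted" otherwise.
    return data.get("matches", [])

# Usage: screen description links before an upload attempt.
# print(blacklist_matches(["https://bit.ly/example", "https://finds.org.uk/"]))
```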

Seems to be T3505

This was fixed in 2008, earlier than the given tasks. Reading related tickets, I also see T18325#212771, which once again indicates the opposite of the behavior you have mentioned.

Text moves are not a problem. The spam blacklist is using the externallinks table to determine the old links. It is not using text comparison to determine added links.

Again, do you have a link to a patchset or ticket, or the exact lines in which the extension does this? I fail to find any code that filters out existing URLs.

Text moves are not a problem. The spam blacklist is using the externallinks table to determine the old links. It is not using text comparison to determine added links.

Again, do you have a link to a patchset or ticket, or the exact lines in which the extension does this? I fail to find any code that filters out existing URLs.

https://github.com/wikimedia/mediawiki-extensions-SpamBlacklist/blob/6b419e553e72be9448b8926d4fd5785687f80f62/SpamBlacklist_body.php#L102

Hmm. Thanks. That part of the code seems to have been added at least 4 years ago, no later than a3defb8. Any ideas about the recent claims (made after that part of the code was added) that saving such a page often errors out? This task should be dependent on SpamBlacklist not yelling when such a page is saved.

Seems to be T3505

This was fixed in 2008, earlier than the given tasks. Reading related tickets, I also see T18325#212771, which once again indicates the opposite of the behavior you have mentioned.

Text moves are not a problem. The spam blacklist is using the externallinks table to determine the old links. It is not using text comparison to determine added links.

Again, do you have a link to a patchset or ticket, or the exact lines in which the extension does this? I fail to find any code that filters out existing URLs.

Just for the record, to find the changes: the linked T3505 has in one comment a hint to r34769, which is old SVN. This can be seen with Special:Code on mediawiki.org (https://www.mediawiki.org/wiki/Special:Code/MediaWiki/r34769), and there is the code: with an array_diff, the old links are removed from the list of new links before the list is checked. That is the magic.
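For readers who do not want to dig through the old revision, a Python analogue of that array_diff step (not the actual PHP extension code); the URLs are placeholders.

```python
# Only links present in the new revision but absent from the old one are
# checked against the blacklist.
old_links = {"https://bit.ly/abc", "https://finds.org.uk/database"}
new_links = {"https://bit.ly/abc", "https://example.org/new-page"}

added_links = new_links - old_links  # the only URLs the filter examines
print(added_links)  # {'https://example.org/new-page'}
```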

Hmm. Thanks. That part of the code seems to have been added at least 4 years ago, no later than a3defb8. Any ideas about the recent claims (made after that part of the code was added) that saving such a page often errors out? This task should be dependent on SpamBlacklist not yelling when such a page is saved.

a3defb8 does not add the array_diff; it only moves it.
Do you actually have a report where a save failed because of existing links on the page, to be sure it is "often"?
And are you sure those are not affected by the bugs mentioned in my comment (job queue or rollback problems)?

T18325#212771 is from 2011, and it seems the user was in the same state as you: he/she did not know about the fixed bug T3505.

That is the magic.

That is, assuming everything works as expected :)

a3defb8 does not add the array_diff; it only moves it.

Yes, I did not have the time to look into that patch closely. That search was a result of git blame.

Do you actually have a report where a save failed because of existing links on the page, to be sure it is "often"?

This should be one:

In T134453, Magog_the_Ogre wrote:

Thus it is impossible to change the page without removing or altering the links.

I do believe few reports, if any, have been filed, because this is the expected behavior.

And are you sure those are not affected by the bugs mentioned in my comment (job queue or rollback problems)?

Nope. I currently have neither the interest nor the time to watch the job queue.

In the same ticket mentioned above:

In T134453, Magog_the_Ogre wrote:

For some odd reason, this also has the side effect of not refreshing categories on the page when the batch process runs around 12AM Eastern time.

... and a new question is: does the job queue run at all for these file pages?

T18325#212771 is from 2011, and it seems the user was in the same state as you: he/she did not know about the fixed bug T3505.

T3505 was fixed in 2008, 3 years earlier than T18325#212771. The time gap seems sufficient.