CAPTCHA required for edits that do not add new external links
Open, Needs Triage, Public

Description

A user editing without an account is asking on Commons:Village_pump/Technical (permalink) why they have to fill in a CAPTCHA for edits that do not add external links. I can't answer this question.

The error message they got is: "Your edit includes new external links. To protect the wiki against automated spam, we kindly ask you to enter the words that appear below in the box [...]"

This message appears to be Captcha-addurl which is not customized at WM Commons.

Expected behaviour: a CAPTCHA is required only when external links were added, or a more specific error message is shown.

Event Timeline

Basically, these edits must have added a new (or changed) external URL; otherwise ConfirmEdit would not fire the addurl trigger. As far as I understand the code, the link does not need to be added by the edit itself; it could have been added by a new or updated template or something like that.
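To make the behaviour described above concrete, here is a minimal sketch in Python (not ConfirmEdit's actual PHP code). It assumes the trigger boils down to a set difference between the links found in the rendered page and the links already recorded for the page; the function name and the sample data are hypothetical.

```python
def addurl_triggers(rendered_links, stored_links):
    """Return the links present in the rendered page but absent from the stored set."""
    return sorted(set(rendered_links) - set(stored_links))


# Hypothetical data: the edit itself adds no link, but a template on the
# page contributes a URL that was never recorded for this page.
template_links = [
    "https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses"
]
links_typed_in_edit = []  # the user did not type any link
stored_links = []         # stand-in for the recorded links of the page (stale)

rendered_links = links_typed_in_edit + template_links
new_links = addurl_triggers(rendered_links, stored_links)
needs_captcha = bool(new_links)  # True here, although the edit added no link
```

Under these assumptions the user is asked for a CAPTCHA purely because of the template's URL; once that URL is part of the stored set, the difference is empty and no CAPTCHA is needed.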

However, it's kind of difficult to debug this without access to the debug logs. ConfirmEdit logs every captcha trigger with some additional info, e.g., which URL fired the addurl trigger.

@Reedy or @Jdforrester-WMF Do you know how we can access these logs and see what ConfirmEdit logged for these edits? :)

They are stored in LogStash, like (almost?) all logs :-). An NDA is required for access. Anyway, I see two log entries relevant to https://commons.wikimedia.org/w/index.php?title=File:WMDE_republica_2012.png&diff=prev&oldid=357147960, which are pasted below:

Time 	server 	channel 	reqId 	message 
2019-07-05T12:53:28	commons.wikimedia.org	captcha	XR9ISApAIC0AAHTEvJIAAABC	ConfirmEdit: passed; 2x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:WMDE republica 2012.png]]: https://creativecommons.org/licenses/by-sa/3.0/deed.en, https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses
2019-07-05T12:33:11	commons.wikimedia.org	captcha	XR9DhwpAICsAAJnFd@sAAACI	ConfirmEdit: new captcha session; 2x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:WMDE republica 2012.png]]: https://creativecommons.org/licenses/by-sa/3.0/deed.en, https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses

That edit doesn't seem to add the links mentioned in the logs.

Another example mentioned on Village pump is https://commons.wikimedia.org/w/index.php?title=File:BKS_als_Vorlage_falsch.png&diff=357269295&oldid=123439819, logs below:

Time 	server 	channel 	reqId 	message 
2019-07-06T16:16:21	commons.wikimedia.org	captcha	XSDJVQpAMEoAAIZybzEAAACL	ConfirmEdit: passed; 1x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:BKS als Vorlage falsch.png]]: https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses
2019-07-06T16:15:40	commons.wikimedia.org	captcha	XSDJLApAME4AAEzQNhgAAACO	ConfirmEdit: new captcha session; 1x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:BKS als Vorlage falsch.png]]: https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses

Logs for https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Deutschland_bei_der_Frankfurter_Buchmesse_2018.jpg&diff=prev&oldid=356906358:

Time 	server 	channel 	reqId 	message 
2019-07-03T08:34:49	commons.wikimedia.org	captcha	XRxoqQpAMF0AAKtT0RwAAAAJ	ConfirmEdit: passed; 1x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:Wikimedia Deutschland bei der Frankfurter Buchmesse 2018.jpg]]: https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses
2019-07-03T08:34:21	commons.wikimedia.org	captcha	XRxojQpAICIAALq8d9EAAABJ	ConfirmEdit: new captcha session; 1x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' at [[File:Wikimedia Deutschland bei der Frankfurter Buchmesse 2018.jpg]]: https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses

@Florian All URLs I found seem to share the same pattern, so it should be enough to help you move forward. Let me know if you need anything else!
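Since the pasted messages all follow the same shape, they can be compared mechanically. The following is a small sketch that parses one such message into its parts; the regular expression is inferred only from the entries quoted above, so other ConfirmEdit message variants may not match, and the function name is hypothetical.

```python
import re

# Pattern inferred from the log entries pasted in this task (an assumption,
# not a documented format).
LOG_PATTERN = re.compile(
    r"^ConfirmEdit: (?P<status>[^;]+); "
    r"(?P<count>\d+)x url trigger by '(?P<user>[^']+)' "
    r"at \[\[(?P<title>.+?)\]\]: (?P<urls>.+)$"
)


def parse_captcha_message(message):
    """Split a ConfirmEdit 'url trigger' message into its fields, or return None."""
    match = LOG_PATTERN.match(message.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["count"] = int(fields["count"])
    # Assumes URLs are separated by ", " as in the pasted entries.
    fields["urls"] = fields["urls"].split(", ")
    return fields


message = (
    "ConfirmEdit: passed; 2x url trigger by '2A02:908:D83:E460:216:CBFF:FEAD:FF9' "
    "at [[File:WMDE republica 2012.png]]: "
    "https://creativecommons.org/licenses/by-sa/3.0/deed.en, "
    "https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses"
)
parsed = parse_captcha_message(message)
```

Collecting `parsed["urls"]` over all entries makes it easy to see that every trigger points at the same creativecommons.org links.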

They are stored in LogStash, as (almost?) all logs :-). NDA is required to access.

I know, sorry for my confusing post :P I meant that I do not have permission to access the logs, and I did not know whom to ask to take a look :D Thanks for looking into the logs and pasting the relevant parts!

The logs confirm what I had in mind already: the captcha was triggered by a URL added by a template. I'm not sure why the captcha was triggered, though; the templates look clean and the URLs in them do not seem to have changed for a while. However, the logic in ConfirmEdit looks clean, so my assumption is that the links had not been added to the externallinks table before (that table is used to get the links already present on the page). I'm not sure if this is a bug, and if so, I'm not sure what we can do in ConfirmEdit to fix it :/ Has this happened again recently?

Ah, I see. Well, then the answer is "anybody in the nda or wmf LDAP groups should have Logstash access". https://tools.wmflabs.org/ldap/group/nda and https://tools.wmflabs.org/ldap/group/wmf show the lists.

I'd say this is a bug, since it isn't desired behaviour to throw captcha when no link was added.

One "solution" would be to run refreshLinks.php, but that will take a while (and if it is run only for commonswiki, the problem will still exist on other wikis). Before running it, I'd like to ask: why weren't the links in the externallinks table? Maybe the table didn't exist when the links were added? In that case, it would happen only in a small subset of cases.

Is there a simple way to trigger addurl only if the URL was added directly by the edit? If the link comes from a template or an edit to any other wiki page, that edit should have triggered a CAPTCHA too, and as such, the URL is already verified. What do you think, @Florian?

Is there a simple way to trigger addurl only if the URL was added directly by the edit?

As far as I know: no, not without parsing the wikitext again. The original intention of using the externallinks table, if I understood the commit correctly, was to avoid a second parse of the wikitext.

If the link comes from a template or an edit to any other wiki page, that edit should have triggered a CAPTCHA too, and as such, the URL is already verified. What do you think,

I agree. However, with the current way of parsing the wikitext, an extension cannot (as far as I can see) know whether a link was added by a template or by the wikitext of the edit itself. From my point of view, it shouldn't even need to: the link is new on the page, it was not there before. It doesn't really matter whether it was added by a template or not, even given this somewhat problematic case :/

Given both things, I would rather look into why the external links were not added to the table in the first place.

What I could think of (this is a wild guess): if the link was added by a template update, the table entry should probably be added by a deferred update of the pages using that template, but I'm not sure whether this is actually the case. However, I do not really have deep knowledge of that.

The only thing ConfirmEdit could do here is to always read the content of the old revision from the database, parse it, and get the external links from it that way. However, I'm not even sure whether that would load the templates as well :(
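The idea of comparing the parsed old revision against the parsed new revision can be illustrated with a toy model. This sketch is not MediaWiki's parser: it assumes both revisions are rendered with the current template content, handles only parameterless, non-nested `{{Name}}` transclusions, and uses a hypothetical template name and body.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s\]|}]+")
TEMPLATE_PATTERN = re.compile(r"\{\{([^{}|]+)\}\}")


def expand_templates(wikitext, templates):
    """Replace each {{Name}} with its body (toy: non-recursive, no parameters)."""
    return TEMPLATE_PATTERN.sub(
        lambda m: templates.get(m.group(1).strip(), ""), wikitext
    )


def external_links(wikitext, templates):
    """Collect the external URLs of the text after template expansion."""
    return set(URL_PATTERN.findall(expand_templates(wikitext, templates)))


def links_added_by_edit(old_text, new_text, templates):
    """Links in the new revision that the old revision did not already render."""
    return external_links(new_text, templates) - external_links(old_text, templates)


# Hypothetical template and revisions.
templates = {
    "Example-license": "[https://creativecommons.org/licenses/by-sa/3.0/deed.en CC BY-SA 3.0]"
}
old_text = "A photo.\n{{Example-license}}"
new_text = "A photo of the booth.\n{{Example-license}}"

unchanged = links_added_by_edit(old_text, new_text, templates)
with_link = links_added_by_edit(
    old_text, new_text + "\nSee https://example.org/more", templates
)
```

In this model the template's URL appears on both sides and cancels out, so only a link typed in the edit itself remains. The cost is the extra parse of the old revision that the externallinks approach was meant to avoid.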

This can happen when the refresh job has not finished after a URL was added to a template.
Go through the list of templates used on the reported pages and try to find such an edit.

Or try a null edit as an anonymous user on similar pages. I am not sure whether the URL is included in the text.
Once the URL is known, it should be added to the whitelist, since it comes from a template.

Judging from the two URLs used on the example pages, it seems that this message change added the new URL: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/477507/6/i18n/cclicensetexts/en.json

Someone has to purge (with a links update) all the pages using Template:Cc-by-sa-layout (over 30 million), or the URL has to be whitelisted in https://commons.wikimedia.org/wiki/MediaWiki:Captcha-addurl-whitelist
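As a rough illustration of the whitelist option, the sketch below filters candidate URLs against a list of regex fragments. The matching rules are an assumption (one fragment per line, `#` starting a comment, fragments matched after the scheme and optional subdomains); the real handling of MediaWiki:Captcha-addurl-whitelist in ConfirmEdit may differ in detail, and the whitelist entry shown is hypothetical.

```python
import re


def build_whitelist_regex(whitelist_text):
    """Combine non-empty, non-comment lines into one case-insensitive regex."""
    fragments = []
    for line in whitelist_text.splitlines():
        fragment = re.sub(r"#.*$", "", line).strip()
        if fragment:
            fragments.append(fragment)
    if not fragments:
        return None
    return re.compile(
        r"^https?://+[a-z0-9_\-.]*(" + "|".join(fragments) + ")",
        re.IGNORECASE,
    )


def filter_whitelisted(urls, whitelist_text):
    """Return only the URLs that the whitelist does not cover."""
    regex = build_whitelist_regex(whitelist_text)
    if regex is None:
        return list(urls)
    return [url for url in urls if not regex.search(url)]


# Hypothetical whitelist content.
whitelist_text = r"""
# license links contributed by templates
creativecommons\.org
"""
urls = [
    "https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses",
    "https://example.org/spam",
]
remaining = filter_whitelisted(urls, whitelist_text)
```

With such an entry in place, the creativecommons.org links no longer count as new, while unrelated links still trigger the CAPTCHA.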