
Citoid is overwriting editor provided values without notification (was "Bloomberg - Are you a robot?")
Open, Medium, Public

Description

  1. The URL a user enters in Citoid is not the URL that gets saved to the source.

Steps to reproduce:
a) Use the Visual Editor to create a citation: click Cite, enter the URL under Automatic, then click Generate
a1) Example URL used: https://www.bloomberg.com/news/articles/2018-03-10/south-africa-court-dismisses-gigaba-appeal-over-lying-under-oath
b) Click Insert
c) Click Publish

Expected result: the user-provided URL is included in the edit, and the title of the page at that URL is saved as the citation title.
Observed result: Citoid changes the user-provided value (in the example above, to https://www.bloomberg.com/tosv2.html?vid=&uuid=d2ffa440-326b-11ea-95c2-518ae4217675&url=L25ld3MvYXJ0aWNsZXMvMjAxOC0wMy0xMC9zb3V0aC1hZnJpY2EtY291cnQtZGlzbWlzc2VzLWdpZ2FiYS1hcHBlYWwtb3Zlci1seWluZy11bmRlci1vYXRo ). The title returned is "Bloomberg - Are you a robot?"

Impact: A valid editor-provided reference is overwritten with values the editor has not vetted, including, in this example, a useless link.


Example:
Attempting to use Citoid to generate a reference to https://www.bloomberg.com/news/articles/2018-03-10/south-africa-court-dismisses-gigaba-appeal-over-lying-under-oath resulted in a title "Bloomberg - Are you a robot?"

I would guess this is suboptimal. Is there anything that can be done from a technical perspective to prevent that response?

Event Timeline

Well- we ARE a robot. So probably technically the things that can be done would involve pretending we are not a robot...

Mvolz triaged this task as Low priority. Dec 11 2018, 6:47 PM

Well- we ARE a robot. So probably technically the things that can be done would involve pretending we are not a robot...

:)

In the meantime, we should at least force the user to fill in the citation manually, much like what happens when Washingtonpost.com fails to generate.

Notably, the behavior is different (and much worse) in VE than in the source editor.

Xaosflux renamed this task from Bloomberg - Are you a robot? to Citoid is changing the editors provided URL without notification. Jan 8 2020, 11:12 PM
Xaosflux raised the priority of this task from Low to Medium.
Xaosflux updated the task description.
Xaosflux renamed this task from Citoid is changing the editors provided URL without notification to Citoid is overwriting editor provided values without notification. Jan 8 2020, 11:14 PM
Izno renamed this task from Citoid is overwriting editor provided values without notification to Citoid is overwriting editor provided values without notification (was "Bloomberg - Are you a robot?"). Jan 8 2020, 11:47 PM
Izno updated the task description.

Notably, the behavior is different (and much worse) in VE than in the source editor.

I'm not seeing any difference in Citoid's behavior based on whether the editing environment is in visual or wikitext mode: https://en.wikipedia.org/w/index.php?title=User:Whatamidoing_(WMF)/sandbox&diff=935026476&oldid=923476899&diffmode=source

Did you mean "Citoid gets different results than RefToolbar"?

@Whatamidoing-WMF Yes, I think so. It is very confusing to know which tool is in play at any time, as there is no branding attached to any of the tools' front ends!

So yes: on enwiki, the enabled-by-default RefToolbar (v2) gadget does not have this problem, so our average editor using source edit mode is not affected. But if they move to VE and get Citoid, their web references may get corrupted.

Has there been any progress on this recently?

This bug continues to spread bogus links and titles in references throughout the corpus. Sure, that's a problem caused by users who trust the script too much and don't review the changes they're making, but it could be mitigated very simply with a fix in this code. Why not specifically block "are you a robot?" titles from being created until a better fix can be developed?

For reference, we're now categorizing a few different titles that clearly indicate an issue into CS1 errors: generic title; Bloomberg's is one of them. The patterns checked are in the CS1 configuration as:

['generic_titles'] = {
	-- patterns in this table to be lowercase only
	-- leave ['local'] nil except when there is a matching generic title in your language
	{['en'] = {'^wayback%s+machine$', false},		['local'] = nil},
	{['en'] = {'are you a robot', true},			['local'] = nil},
	{['en'] = {'hugedomains.com', true},			['local'] = nil},
	{['en'] = {'^[%(%[{<]?no +title[>}%]%)]?$', false},	['local'] = nil},
	{['en'] = {'page not found', true},			['local'] = nil},
	{['en'] = {'^[%(%[{<]?unknown[>}%]%)]?$', false},	['local'] = nil},
	{['en'] = {'website is for sale', true},		['local'] = nil},
	{['en'] = {'^404', true},				['local'] = nil},
}

That's in addition to the generic title associated with IABot, which is tracked in CS1 maint: archived copy as title:

['archived_copy'] = { -- used with CS1 maint: Archive[d] copy as title
	['en'] = '^archived?%s+copy$', -- for English; translators: keep this because templates imported from en.wiki
	['local'] = nil, -- translators: replace ['local'] = nil with lowercase translation only when bots or tools create generic titles in your language
	},
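A check along the lines of the generic_titles table could be sketched as follows. This is only an illustration in Python, not Citoid or CS1 code: the function name `is_generic_title` is hypothetical, and the patterns are a hand translation of the Lua patterns quoted above into Python regular expressions, with every entry treated as a search against the lowercased title.

```python
import re

# Hand-translated from the CS1 'generic_titles' Lua patterns quoted above.
# The translation to Python regex syntax is an assumption of this sketch.
GENERIC_TITLE_PATTERNS = [
    re.compile(r"^wayback\s+machine$"),
    re.compile(re.escape("are you a robot")),
    re.compile(re.escape("hugedomains.com")),
    re.compile(r"^[(\[{<]?no +title[>}\])]?$"),
    re.compile(re.escape("page not found")),
    re.compile(r"^[(\[{<]?unknown[>}\])]?$"),
    re.compile(re.escape("website is for sale")),
    re.compile(r"^404"),
]


def is_generic_title(title):
    """Return True if the scraped title matches any known placeholder pattern."""
    normalized = title.strip().lower()  # the patterns are lowercase only
    return any(p.search(normalized) for p in GENERIC_TITLE_PATTERNS)


if __name__ == "__main__":
    print(is_generic_title("Bloomberg - Are you a robot?"))
    print(is_generic_title("South Africa Court Dismisses Gigaba Appeal"))
```

A citation service could run a check like this before accepting a scraped title and fall back to manual entry when it matches.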

Maybe you could incorporate this list into Citoid and use it, at the very least, to skip the URL replacement. Bloomberg is also replacing the URL with a fake one, so once you run automatic generation you have to find the correct URL again:
https://www.bloomberg.com/tosv2.html?vid=&uuid=8c153f6a-955e-11ec-b858-6c5a6b626661&url=L25ld3MvZmVhdHVyZXMvMjAxOS0xMi0wMy9tZXJjay1jeWJlcmF0dGFjay1zLTEtMy1iaWxsaW9uLXF1ZXN0aW9uLXdhcy1pdC1hbi1hY3Qtb2Ytd2Fy
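The "skip the URL replacement" idea might look roughly like the sketch below. Everything here is hypothetical: `Citation`, `merge_scrape_result`, and the `needs_review` flag are illustrative names, not part of Citoid, and `title_is_generic` stands for whatever generic-title check is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    url: str
    title: Optional[str]
    needs_review: bool  # hypothetical flag a front end could surface to the editor


def merge_scrape_result(requested_url, final_url, scraped_title, title_is_generic):
    """Decide which URL and title to keep after scraping (illustrative only)."""
    if title_is_generic:
        # The scrape hit a placeholder page: keep what the editor typed,
        # drop the bogus title, and ask the editor to fill it in manually.
        return Citation(url=requested_url, title=None, needs_review=True)
    # Otherwise accept the scraped values, but flag any URL change so it
    # does not happen without notification.
    return Citation(
        url=final_url,
        title=scraped_title,
        needs_review=(final_url != requested_url),
    )
```

The design choice here is that a redirect is only trusted when the scraped title looks real, and even then the change is flagged rather than applied silently.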

It would be useful to keep the original URL in case the editor has already closed the tab with the article.