
Citoid inserts bad information from Indian news sites
Open, Needs Triage · Public

Description

See these search results:

https://en.wikipedia.org/w/index.php?sort=relevance&search=%28+%22Times+of+India%22+OR+%22India+Today%22+%29+insource%3A%2F%5C%7C+%2Alast3+%2A%3D+%2AIst%2F&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

and this discussion thread:

https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:RefToolbar&oldid=940509175#Indian_sources_mis-read

This problem is currently affecting over 1,000 articles on the English Wikipedia. If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 13 2020, 12:54 AM
Mvolz added a subscriber: Mvolz. · Edited · Feb 13 2020, 11:51 AM

I've opened a new translator request to Zotero for this one: https://github.com/zotero/translators/issues/2118

What's going on is that Zotero's Embedded Metadata translator is putting the entire "byline" into the author field and then splitting it into separate names. PTI stands for Press Trust of India, I think.
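To make the failure mode concrete, here is a minimal Python sketch of what naive byline splitting does. It is not Zotero's actual code: the function `naive_split_byline` and the sample byline string are made up for illustration, and the real translator logic may differ. It only shows how treating a whole byline as an author list can leave a timezone abbreviation in the third author's last-name slot, which matches the `last3=Ist` pattern in the search above.

```python
import re


def naive_split_byline(byline: str) -> list[dict]:
    """Hypothetical stand-in for byline handling (not Zotero's code).

    Splits the byline on '|' and ',' and treats the final word of
    each piece as a surname, the way a generic name parser might.
    """
    pieces = [p.strip() for p in re.split(r"[|,]", byline) if p.strip()]
    authors = []
    for piece in pieces:
        words = piece.split()
        authors.append({"first": " ".join(words[:-1]), "last": words[-1]})
    return authors


# Made-up byline in the general style of a wire-service credit plus timestamp.
byline = "PTI | Updated: Feb 13, 2020 10:30 IST"

for number, author in enumerate(naive_split_byline(byline), start=1):
    print(f"last{number}={author['last']!r} first{number}={author['first']!r}")
```

With this sample input the third "author" ends up with the timezone as its last name, so any fix has to happen before the byline is handed to the name splitter.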

> If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them.

Unfortunately, in the short term, it's easier to just write a translator than to exclude all malformed metadata; that's kind of a can of worms. I've opened https://github.com/zotero/translators/issues/2119, but that's kind of a band-aid. See also https://github.com/zotero/translators/blob/master/Embedded%20Metadata.js#L590

We could fork our version of the Zotero translators to disallow using the byline entirely, or to disable the entire "allow low quality metadata" function (a very self-aware function name!). But we did that before, and it was a bit of a maintenance headache, so we switched to using their repo directly rather than a fork: they have a lot more people working on it than we do, and it's generally better maintained. There's also https://github.com/zotero/translators/issues/1092, which might address that better, but it's been stalled, I think.

tl;dr: the fastest way to fix this is to fix Times of India explicitly. The other approaches are all a bit stalled, because we no longer run the Citoid native scraper but use Zotero's.

AlanM1 added a subscriber: AlanM1. · Feb 15 2020, 3:32 PM

Two more examples of widely cited American sites:

Change 574504 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/zotero@master] Squashed commit of the following:

https://gerrit.wikimedia.org/r/574504

Change 574504 merged by jenkins-bot:
[mediawiki/services/zotero@master] Update Zotero to f0cff95

https://gerrit.wikimedia.org/r/574504

I just wanted to do a noop deploy of zotero as part of T235411 and figured that the change never made it to production. It is deployed to staging though.
Is there anything blocking this or is it okay to deploy?

(Adding @Pchelolo, just because you triggered the merge :-) )

Mvolz added a comment. · May 14 2020, 9:26 AM

> I just wanted to do a noop deploy of zotero as part of T235411 and figured that the change never made it to production. It is deployed to staging though.
> Is there anything blocking this or is it okay to deploy?
>
> (Adding @Pchelolo, just because you triggered the merge :-) )

This change is only part of the changes needed to fix the bug. It also requires https://github.com/zotero/translators/pull/2122, which is not yet merged, and then the submodule that points to that repo needs to be updated in the translation-server (zotero) repository. So basically I never bothered deploying it, but deploying it is harmless.

Deployed that as said on IRC. As far as I can tell it looks good...

Izno added a subscriber: Izno. · Jul 8 2020, 6:17 PM

"All instances" is a bit misleading. That's searching for the "last3=Ist|" string.

Other instances of garbage/other timezones could be around.
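As a rough illustration of widening the check beyond `last3=Ist`, the Python sketch below scans wikitext for any `lastN=` citation parameter whose whole value is a timezone abbreviation. The abbreviation list, the function name `find_suspect_authors`, and the sample wikitext are assumptions chosen for the example, not an agreed cleanup rule, and a short real surname such as "Est" would be a false positive.

```python
import re

# Illustrative and deliberately incomplete list of timezone abbreviations.
TZ_ABBREVIATIONS = ("IST", "GMT", "UTC", "EST", "EDT", "PST", "PDT", "CST", "CDT", "BST")

# Matches "|lastN = <abbr>" where the abbreviation is the whole parameter
# value, i.e. it is followed only by whitespace and then "|" or "}".
SUSPECT_LAST = re.compile(
    r"\|\s*last(\d*)\s*=\s*(" + "|".join(TZ_ABBREVIATIONS) + r")\s*(?=[|}])",
    re.IGNORECASE,
)


def find_suspect_authors(wikitext: str) -> list[tuple[str, str]]:
    """Return (parameter number, value) pairs that look like timezone debris."""
    return [(m.group(1), m.group(2)) for m in SUSPECT_LAST.finditer(wikitext)]


# Made-up citation for demonstration purposes.
sample = "{{cite news |last1=PTI |last2=13 |last3=Ist |title=Example}}"
print(find_suspect_authors(sample))
```

Because the parameter number is captured rather than fixed at 3, the same pattern would also catch the debris landing in `last2` or `last4`.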

AlanM1 added a comment. · Edited · Jul 8 2020, 10:58 PM

@Jonesey95: That's not surprising, as I don't see any change at all on enwiki in the way it handles the ToI or IT examples I gave in the original discussion there or the USA Today and LA Times examples given here.