Citoid inserts bad information from Indian news sites
Open, Needs Triage, Public

Assigned To

None

Authored By

	Jonesey95
	Feb 13 2020, 12:54 AM

Description

See these search results:

https://en.wikipedia.org/w/index.php?sort=relevance&search=%28+%22Times+of+India%22+OR+%22India+Today%22+%29+insource%3A%2F%5C%7C+%2Alast3+%2A%3D+%2AIst%2F&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

and this discussion thread:

https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:RefToolbar&oldid=940509175#Indian_sources_mis-read

This problem is currently affecting over 1,000 articles on the English Wikipedia. If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them.

Details

	Subject	Repo	Branch	Lines +/-
	Update Zotero to f0cff95	mediawiki/services/zotero	master	+10 -4


Related Objects

Mentioned Here: T235411: Add TLS termination to services running on kubernetes
rMWf0cff95fed17: Update patch set 1

Event Timeline

Jonesey95 created this task. Feb 13 2020, 12:54 AM

Restricted Application added a subscriber: Aklapper. Feb 13 2020, 12:54 AM

I've opened a new translator request to Zotero for this one: https://github.com/zotero/translators/issues/2118

What's going on is Zotero's embedded metadata translator is putting the entire "byline" in the author field, and splitting it. PTI stands for Press Trust of India, I think.
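As a rough illustration of how a byline can end up as several bogus authors, here is a minimal Python sketch. It is not Zotero's code: the byline string, the comma-splitting rule and the function name `naive_split_byline` are all assumptions made for this example. It only shows one plausible way a trailing "IST" could turn into a third author whose last name is "Ist".

```python
def naive_split_byline(byline: str) -> list[dict[str, str]]:
    """Split a byline on commas and treat every piece as a person's name.

    Hypothetical helper for illustration only; not Zotero's implementation.
    """
    authors = []
    for piece in byline.split(","):
        words = piece.split()
        if not words:
            continue
        # The last word becomes the surname, everything before it the given name.
        authors.append({
            "first": " ".join(words[:-1]),
            "last": words[-1].capitalize(),
        })
    return authors


# Hypothetical byline; the text on a real article page may differ.
byline = "PTI | Updated: Feb 13, 2020, 00:54 IST"
authors = naive_split_byline(byline)

for number, author in enumerate(authors, start=1):
    print(f"last{number}={author['last']} | first{number}={author['first']}")
```

Under these assumptions the third "author" comes out with the last name "Ist", which is the string the search in the task description looks for.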

"If Citoid can't find proper author information on these sites, it should not attempt to retrieve author data from them."

Unfortunately, in the short term, it's easier to just write a translator than to exclude all malformed metadata; excluding it is kind of a can of worms. I've opened https://github.com/zotero/translators/issues/2119 but that's kind of a band-aid. See also https://github.com/zotero/translators/blob/master/Embedded%20Metadata.js#L590

We could fork our version of the Zotero translators to disallow using the byline entirely, or to disable the entire "allow low quality metadata" function (a very self-aware function name!). But we did that before, and it was a bit of a maintenance headache, so we switched to using their repo directly rather than a fork, since they have a lot more people working on it than we do and it's better maintained, generally speaking. There's also https://github.com/zotero/translators/issues/1092, which might address this better, but I think it's been stalled.

tl;dr: the fastest way to fix this is to fix Times of India explicitly. The other approaches are all a bit stalled, because we no longer run the Citoid native scraper but use Zotero's.
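To make the "can of worms" concrete, the request to drop author data that does not look like a real name could be sketched roughly as below. This is a hedged illustration only: `NON_NAME_TOKENS`, `looks_like_person` and `drop_suspect_authors` are hypothetical names, the deny-list is invented for this example, and nothing here reflects what Citoid or Zotero actually do.

```python
# Hypothetical deny-list of tokens seen in datelines rather than in names.
NON_NAME_TOKENS = {"ist", "updated", "pti"}


def looks_like_person(first: str, last: str) -> bool:
    """Heuristic check: reject candidates containing digits or dateline tokens."""
    tokens = f"{first} {last}".split()
    if not tokens:
        return False
    for token in tokens:
        if any(ch.isdigit() for ch in token):
            return False
        if token.strip(".:|").lower() in NON_NAME_TOKENS:
            return False
    return True


def drop_suspect_authors(authors: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the author candidates that pass the heuristic."""
    return [a for a in authors if looks_like_person(a["first"], a["last"])]


# Hypothetical candidates: three pieces of a split byline plus one real name.
candidates = [
    {"first": "PTI | Updated: Feb", "last": "13"},
    {"first": "", "last": "2020"},
    {"first": "00:54", "last": "Ist"},
    {"first": "Jane", "last": "Doe"},
]
kept = drop_suspect_authors(candidates)
print(kept)
```

A deny-list like this only covers the patterns someone has already noticed. Other timezones, agency names or site-specific labels would each need their own entry, which is part of why a per-site translator is the quicker fix.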

Two more examples, from widely cited American sites:

The date is not retrieved from this "USA Today" article.
The author is not retrieved from this "Los Angeles Times" article.

Change 574504 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/zotero@master] Squashed commit of the following:

https://gerrit.wikimedia.org/r/574504

gerritbot added a project: Patch-For-Review. Feb 24 2020, 4:59 PM

Change 574504 merged by jenkins-bot:
[mediawiki/services/zotero@master] Update Zotero to rMWf0cff95fed17

https://gerrit.wikimedia.org/r/574504

I just wanted to do a noop deploy of zotero as part of T235411 and figured that the change never made it to production. It is deployed to staging though.
Is there anything blocking this or is it okay to deploy?

(Adding @Pchelolo, just because you triggered the merge :-) )

In T245092#6136256, @JMeybohm wrote:

I just wanted to do a noop deploy of zotero as part of T235411 and figured that the change never made it to production. It is deployed to staging though.
Is there anything blocking this or is it okay to deploy?

(Adding @Pchelolo, just because you triggered the merge :-) )

This change is only part of what is needed to fix the bug. It also requires https://github.com/zotero/translators/pull/2122, which is not yet merged, and then the submodule that points to that repo needs to be updated in the translation-server (zotero) repository. So basically I never bothered deploying it, but deploying it is harmless.

Deployed that as said on IRC. As far as I can tell it looks good...

JMeybohm unsubscribed. May 31 2020, 12:45 PM

This problem is not (completely) fixed. Here's an edit from 7 May 2020 (source: Times of India) that shows the problem:

https://en.wikipedia.org/w/index.php?title=COVID-19_pandemic_in_Pakistan&diff=prev&oldid=955426641

And here's an edit (source: India Today) from 21 April 2020:

https://en.wikipedia.org/w/index.php?title=Avika_Gor&type=revision&diff=952297461&oldid=940079779

Here's an insource search on en.WP that finds all current instances of the problem:

https://en.wikipedia.org/w/index.php?title=Special:Search&limit=500&offset=0&ns0=1&ns118=1&search=insource%3A%2Flast3%3DIst%5C%7C%2F&advancedSearch-current=%7B%7D&searchToken=b7llu595zmfq7cmqj8k4wj8lc

Izno subscribed. Jul 8 2020, 6:17 PM

"All instances" is a bit misleading. That's searching for the "last3=Ist|" string.

Other instances of garbage/other timezones could be around.
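The difference between the two searches linked in this task can be shown with a short Python sketch. The two patterns are decoded from the search URLs; the wikitext snippets are hypothetical, and Python's `re` module stands in for the on-wiki `insource:` regex engine, which uses a different syntax in general but should behave the same for patterns this simple.

```python
import re

# Pattern from the later search: no spaces allowed around "=".
STRICT = re.compile(r"last3=Ist\|")
# Pattern from the search in the task description: optional spaces.
LENIENT = re.compile(r"\| *last3 *= *Ist")

# Hypothetical citation snippets, written in two common spacing styles.
samples = [
    "{{cite news|last3=Ist|first3=21:14|title=Example}}",
    "{{cite news | last3 = Ist | first3 = 21:14 | title = Example}}",
    "{{cite news|last3=Est|first3=09:00|title=Example}}",
]

results = [(bool(STRICT.search(s)), bool(LENIENT.search(s))) for s in samples]
for sample, (strict_hit, lenient_hit) in zip(samples, results):
    print(f"strict={strict_hit} lenient={lenient_hit} {sample}")
```

The strict pattern misses the spaced-out template, and neither pattern catches a timezone other than "Ist", so the result count from either search is a lower bound rather than a full list.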

@Jonesey95: That's not surprising, as I don't see any change at all on enwiki in the way it handles the ToI or IT examples I gave in the original discussion there or the USA Today and LA Times examples given here.

diegodlh subscribed. Aug 23 2021, 7:55 PM

Curb_Safe_Charmer subscribed. Feb 9 2022, 2:35 PM

Hi folks!

I hope it's appropriate to do this on a Phabricator thread, but we've identified a couple of threads such as this one that discuss issues with Citoid and the automatic extraction and generation of metadata. @diegodlh has been working on a community-based solution for this problem, called Web2Cit. Web2Cit aims to solve some of those problems without requiring users to fiddle with Zotero translators or to have a lot of technical skills.

On May 11 at 4 PM UTC we will be running a workshop to show the tool and allow users to test the early adopters version. If you're interested, you can register here: https://us06web.zoom.us/meeting/register/tZIpfu2upj4sE9ZrqblmM3-QujaeqekAAINK

If you want to know more about Web2Cit or the workshop, check here: https://meta.wikimedia.org/wiki/Web2Cit/Workshops

We would also greatly appreciate it if you happen to know anyone that might be interested in attending such a workshop and can handle some technical complexity.

cheers,
scann

Is this still an issue?
