In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there
Closed, Declined · Public · 8 Estimated Story Points
Assigned To

None

Authored By

	rugk
	Nov 16 2015, 7:54 PM

Description

E.g. when using this URL http://www.3cx.com/blog/webrtc/webrtc-secure/ it added https://plus.google.com/+3cx-Global/ as a publisher.

Both Facebook (Open Graph) and Google have been trending towards putting links into metadata, with the idea that the linked page holds the metadata rather than the tag itself (the 'knowledge graph' approach). So, for instance, the "publisher" field will contain a link to the publisher's Google Plus or Facebook page rather than the publisher's name as text. We should follow these links to get the publisher name. For the time being, finding the title is probably sufficient, although in some cases, particularly with Facebook, we can expect these pages to have the Open Graph type "profile", in which case the page is pointing to a person.
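A minimal sketch of the "follow the link and read the title" idea, assuming the linked page's HTML has already been fetched. It is written in Python for illustration only and is not Citoid code; the names `describe_linked_page` and `_TitleAndTypeParser` are hypothetical.

```python
from html.parser import HTMLParser


class _TitleAndTypeParser(HTMLParser):
    """Collects the <title> text and the og:type meta value of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.og_type = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:type":
            self.og_type = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_linked_page(html):
    """Return (name, is_person) for the page a URL-valued field points to.

    The page title stands in for the publisher name; an og:type of
    "profile" is treated as a sign that the page describes a person.
    """
    parser = _TitleAndTypeParser()
    parser.feed(html)
    return parser.title.strip(), parser.og_type == "profile"
```

The caller would still have to decide whether the returned name goes into a publisher or an author field, depending on `is_person`.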

Details

	Subject	Repo	Branch	Lines +/-
	Update Zotero	operations/deployment-charts	master	+1 -1
	Update translators submodule	mediawiki/services/zotero	master	+1 -1

Related Objects

Mentioned In: T99091: Citoid should not add website name (punknews.org) as a first name

Event Timeline

rugk created this task.Nov 16 2015, 7:54 PM

rugk raised the priority of this task from to Needs Triage.

rugk updated the task description.

rugk added a project: VisualEditor.

rugk subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 16 2015, 7:54 PM

Yes, this is per the website's claim:

<link rel="publisher" href="https://plus.google.com/+3cx-Global/">

But this does not make any sense.

It seems to be related to a stupid Google feature in which this tag can be used to add an authorship link to the Google+ page:

However I don't think that's how we define a "publisher" in Wikipedia and adding these Google+ links everywhere is not only advertising - it's just not adding any more information to the reference...
So maybe add an exclusion or something like this for it?
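One way such an exclusion could look, as a rough Python sketch rather than anything Citoid actually does. The host list and the function name `is_excluded_publisher` are assumptions made for illustration; only the Google+ and Facebook hosts come from this task.

```python
from urllib.parse import urlsplit

# Hypothetical exclusion list: social-profile hosts that should never be
# used as a publisher value.
EXCLUDED_PUBLISHER_HOSTS = ("plus.google.com", "facebook.com")


def is_excluded_publisher(value):
    """Return True if a publisher value is a URL on an excluded host."""
    host = (urlsplit(value.strip()).hostname or "").lower()
    return any(
        host == excluded or host.endswith("." + excluded)
        for excluded in EXCLUDED_PUBLISHER_HOSTS
    )
```

A plain-text publisher name has no hostname and is therefore never excluded by this check.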

@Mvolz couldn't remember if a similar task had already been filed, so I'm changing the scope of this one for her.

Jdforrester-WMF renamed this task from Follow links in non-link field to scrape more info to In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there.Nov 24 2015, 8:26 PM

Jdforrester-WMF triaged this task as Low priority.

Jdforrester-WMF edited a custom field.

Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board.

This:

{{Cite web|title = Feels Like The First Time: Cursive's Tim Kasher reconvenes the Good Life|url = http://substreammagazine.com/2015/11/feels-like-the-first-time-cursives-tim-kasher-reconvenes-the-good-life/|website = Substream Magazine|publisher = https://plus.google.com/107858022853661098790|accessdate = 2015-11-29|language = en-US}}

is undesirable. If we're going to have a link to the publisher's website, then could we at least make it a link with a label, e.g., `[https://plus.google.com/spam Name]` ?
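As an illustration of the labelled form, a small Python sketch that builds the wikitext once a name is known. The helper name `labelled_external_link` is hypothetical, and how the name is obtained is left open.

```python
def labelled_external_link(url, label):
    """Build wikitext of the form [url label], or the bare URL if no label."""
    # Collapse whitespace and escape "]" so the label cannot end the link early.
    clean = " ".join(label.split()).replace("]", "&#93;")
    return f"[{url} {clean}]" if clean else url
```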

Raising to Normal as the community seems to think this is more of a priority :)

Mvolz updated the task description. Nov 30 2015, 8:43 AM

See discussion here:

https://en.wikipedia.org/wiki/Wikipedia:VisualEditor/Feedback#VE_appears_to_put_external_URLs_in_citation_publisher.3D_parameters

Keith_D subscribed.Jan 22 2016, 3:02 AM

Symac subscribed.Jan 26 2016, 11:25 PM

Mvolz moved this task from Backlog to Service on the Citoid board.Jul 29 2016, 3:21 PM

If this is still happening once the Embedded Metadata translator is up and running on the server, ping me and I'll look into it.

It isn't ideal to manually reformat what the website itself is giving us as its name, but if there are any patterns (e.g., if all Google Plus sites follow this pattern) I can help either write a translator to cover these cases or update Embedded Metadata to accommodate.

Mvolz moved this task from Service to Service: Scraper & Validation on the Citoid board.Oct 14 2016, 1:26 PM

Mvolz mentioned this in T99091: Citoid should not add website name (punknews.org) as a first name.Oct 28 2016, 3:07 PM

Mvolz removed Mvolz as the assignee of this task.Jan 15 2018, 11:15 AM

Mvolz moved this task from Service: Scraper & Validation to Service on the Citoid board.Dec 11 2018, 11:30 AM

This is still happening daily, with the Author field containing a Facebook link, e.g. https://fr.wikipedia.org/w/index.php?title=Lekima_Tagitagivalu&diff=prev&oldid=205847137 because of <meta property="article:author" content="https://www.facebook.com/ActuRugby">.
I've made hundreds of edits to remove these errors on frwiki: https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Contributions/Framawiki&target=Framawiki&offset=20230706113753
At the very least, Citoid should avoid using field values that are URLs.
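A minimal Python sketch of that last suggestion, assuming the scraped citation is a flat dictionary. The field names in `NAME_FIELDS` and both function names are hypothetical and do not reflect Citoid's or Zotero's actual data model.

```python
from urllib.parse import urlsplit

# Hypothetical names of fields that should hold a name, never a URL.
NAME_FIELDS = ("author", "publisher")


def looks_like_url(value):
    """Return True for absolute http(s) URLs such as a Facebook page link."""
    parts = urlsplit(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def drop_url_valued_names(citation):
    """Return a copy of the citation without name fields that hold a URL."""
    cleaned = dict(citation)
    for field in NAME_FIELDS:
        value = cleaned.get(field)
        if isinstance(value, str) and looks_like_url(value):
            del cleaned[field]
    return cleaned
```

Fields that are meant to hold a URL, such as the citation's own `url`, are left untouched because they are not listed in `NAME_FIELDS`.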

• Elitre unsubscribed.Jul 10 2023, 1:10 PM

Framawiki renamed this task from In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there to Citoid adds facebook url as publisher.Jul 18 2023, 12:42 PM

Framawiki moved this task from Freezer to To Triage on the VisualEditor board.

LD subscribed.Jul 19 2023, 12:09 AM

In T118773#9000769, @Framawiki wrote:

This is still happening daily, with the Author field containing a Facebook link, e.g. https://fr.wikipedia.org/w/index.php?title=Lekima_Tagitagivalu&diff=prev&oldid=205847137 because of <meta property="article:author" content="https://www.facebook.com/ActuRugby">.
I've made hundreds of edits to remove these errors on frwiki: https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Contributions/Framawiki&target=Framawiki&offset=20230706113753
At the very least, Citoid should avoid using field values that are URLs.

Yeah, wow, that's really annoying.

This seems to have been introduced when we updated the Zotero translators. It's not supposed to happen, though, so that's a bug.

I've opened a PR upstream. I'm resetting the task title because this issue is related to the original task but, strictly speaking, not the same: https://github.com/zotero/translators/pull/3103/files

Mvolz renamed this task from Citoid adds facebook url as publisher to In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there. Aug 11 2023, 9:05 AM

Mvolz closed this task as Declined.

Change 949936 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/zotero@master] Update translators submodule

https://gerrit.wikimedia.org/r/949936

gerritbot added a project: Patch-For-Review.Aug 17 2023, 10:35 AM

Change 949936 merged by jenkins-bot:

[mediawiki/services/zotero@master] Update translators submodule

https://gerrit.wikimedia.org/r/949936

Change 952149 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update Zotero

https://gerrit.wikimedia.org/r/952149

Change 952149 merged by jenkins-bot: