In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there
Closed, Declined | Public | 8 Estimated Story Points

Description

E.g. when using this URL http://www.3cx.com/blog/webrtc/webrtc-secure/ it added https://plus.google.com/+3cx-Global/ as a publisher.

Both Facebook (Open Graph) and Google have been trending towards putting links in metadata, with the idea that the linked page, rather than the tag itself, carries the metadata (the 'knowledge graph' approach). For instance, the "publisher" field will contain a link to the publisher's Google Plus or Facebook page rather than the publisher's name as text. We should follow these links to get the publisher name. For the time being, fetching the title of the linked page is probably sufficient, although in some cases, particularly with Facebook, we can expect the linked page to have the Open Graph type "profile", in which case the page represents a person.
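As a rough illustration of the proposal above, here is a minimal Python sketch of what "following the link" could extract once the linked page has been fetched. It is not Citoid's code: the names `ProfileMetaParser` and `describe_linked_page` are hypothetical, the network fetch is deliberately left out, and it assumes the linked page exposes a `<title>` and/or `og:title` and `og:type` meta tags.

```python
from html.parser import HTMLParser


class ProfileMetaParser(HTMLParser):
    """Collects the <title> text and the og:title / og:type meta values."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attr.get("property") in ("og:title", "og:type"):
            self.og[attr["property"]] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_linked_page(html):
    """Return (name, is_person) for the HTML of a page a metadata URL points to.

    Hypothetical helper: prefers og:title, falls back to <title>, and treats
    the Open Graph type "profile" as meaning the page describes a person.
    """
    parser = ProfileMetaParser()
    parser.feed(html)
    name = parser.og.get("og:title") or parser.title.strip() or None
    is_person = parser.og.get("og:type") == "profile"
    return name, is_person
```

A caller would fetch the URL found in the "publisher" field, pass the response body to `describe_linked_page`, and use the returned name only if it is not `None`.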

Event Timeline

rugk raised the priority of this task to Needs Triage.
rugk updated the task description. (Show Details)
rugk added a project: VisualEditor.
rugk subscribed.
Jdforrester-WMF claimed this task.
Jdforrester-WMF subscribed.

Yes, this is per the website's claim:

<link rel="publisher" href="https://plus.google.com/+3cx-Global/">

But this does not make any sense.

It seems to be related to a stupid feature by Google that uses this meta tag to add an authorship link to the Google+ page:

However I don't think that's how we define a "publisher" in Wikipedia and adding these Google+ links everywhere is not only advertising - it's just not adding any more information to the reference...
So maybe add an exclusion or something like this for it?

Elitre renamed this task from Visual Editor often adds Google Plus links as a publisher to Follow links in non-link field to scrape more info. Nov 18 2015, 2:57 PM
Elitre reopened this task as Open.
Elitre removed Jdforrester-WMF as the assignee of this task.
Elitre added a project: Citoid.
Elitre set Security to None.
Elitre added subscribers: Mvolz, Elitre.

@Mvolz couldn't remember if a similar task had already been filed, so I'm changing the scope of this one for her.

Jdforrester-WMF renamed this task from Follow links in non-link field to scrape more info to In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there. Nov 24 2015, 8:26 PM
Jdforrester-WMF triaged this task as Low priority.
Jdforrester-WMF edited a custom field.
Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board.

This:

{{Cite web|title = Feels Like The First Time: Cursive's Tim Kasher reconvenes the Good Life|url = http://substreammagazine.com/2015/11/feels-like-the-first-time-cursives-tim-kasher-reconvenes-the-good-life/|website = Substream Magazine|publisher = https://plus.google.com/107858022853661098790|accessdate = 2015-11-29|language = en-US}}

is undesirable. If we're going to have a link to the publisher's website, then could we at least make it a link with a label, e.g., `[https://plus.google.com/spam Name]` ?
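The labelled-link form suggested here could be produced along the following lines. This is only a sketch: `labelled_link` is a hypothetical helper, not part of Citoid or VisualEditor, and it assumes a human-readable publisher name has already been obtained from somewhere.

```python
def labelled_link(url, name):
    """Format a wikitext external link `[url name]`, or return None if unusable."""
    # Without a label, or with a URL that wikitext would not treat as a single
    # external link target, there is nothing sensible to emit.
    if not name or not url.startswith(("http://", "https://")):
        return None
    if any(ch.isspace() for ch in url):
        return None
    # Collapse whitespace (a newline would break the link) and escape "]",
    # which would otherwise end the link early.
    label = " ".join(name.split()).replace("]", "&#93;")
    return f"[{url} {label}]"
```

For the citation above, and using the website name purely as an illustrative label, `labelled_link("https://plus.google.com/107858022853661098790", "Substream Magazine")` gives `[https://plus.google.com/107858022853661098790 Substream Magazine]`.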

Mvolz raised the priority of this task from Low to High.
Mvolz lowered the priority of this task from High to Medium.

Raising to Normal as the community seems to think this is more of a priority :)

If this is still happening once the Embedded Metadata translator is up and running on the server, ping me and I'll look into it.

It isn't ideal to manually reformat what the website itself is giving us as its name, but if there are any patterns (e.g., if all Google Plus sites follow this pattern) I can help either write a translator to cover these cases or update Embedded Metadata to accommodate.

Mvolz removed Mvolz as the assignee of this task. Jan 15 2018, 11:15 AM

This is still happening daily with the Author field containing a Facebook link, e.g. https://fr.wikipedia.org/w/index.php?title=Lekima_Tagitagivalu&diff=prev&oldid=205847137 because the page has <meta property="article:author" content="https://www.facebook.com/ActuRugby">.
I've made hundreds of edits to remove these errors on frwiki: https://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial:Contributions/Framawiki&target=Framawiki&offset=20230706113753
At the very least, Citoid should avoid using field values that are URLs.
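The minimal safeguard requested here (dropping name fields whose value is a URL) might look like the sketch below. It is an assumption-laden illustration, not the fix that was eventually made: `looks_like_url` and `drop_url_names` are hypothetical names, and the citation is modelled as a plain dict with made-up field names rather than Citoid's actual response format.

```python
from urllib.parse import urlparse


def looks_like_url(value):
    """True if the whole string parses as an http(s) URL with a host."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def drop_url_names(citation, name_fields=("publisher", "author")):
    """Return a copy of the citation without name fields whose value is a URL.

    Fields such as "url" are left alone; only the fields listed in
    name_fields (hypothetical names) are checked.
    """
    return {
        key: value
        for key, value in citation.items()
        if not (
            key in name_fields
            and isinstance(value, str)
            and looks_like_url(value)
        )
    }
```

With this, a citation whose author is `https://www.facebook.com/ActuRugby` would simply come back without an author, which is easier for editors to notice and fill in than a URL sitting in a name field.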

Framawiki renamed this task from In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there to Citoid adds facebook url as publisher. Jul 18 2023, 12:42 PM
Framawiki moved this task from Freezer to To Triage on the VisualEditor board.

Yeah, wow, that's really annoying.

This seems to have been introduced when we updated the Zotero translators. It's not supposed to happen, though, so that's a bug.

I've opened a PR upstream: https://github.com/zotero/translators/pull/3103/files. Resetting this task, because the bug is related to the original task but, strictly speaking, not the same.

Mvolz renamed this task from Citoid adds facebook url as publisher to In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there. Aug 11 2023, 9:05 AM
Mvolz closed this task as Declined.

Change 949936 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/zotero@master] Update translators submodule

https://gerrit.wikimedia.org/r/949936

Change 949936 merged by jenkins-bot:

[mediawiki/services/zotero@master] Update translators submodule

https://gerrit.wikimedia.org/r/949936

Change 952149 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update Zotero

https://gerrit.wikimedia.org/r/952149

Change 952149 merged by jenkins-bot:

[operations/deployment-charts@master] Update Zotero

https://gerrit.wikimedia.org/r/952149