
In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there
Open, Medium | Public | 8 Estimated Story Points


E.g. when using this URL it added as a publisher.

Both Facebook (Open Graph) and Google have been trending towards putting links into metadata fields, the idea being that the link target carries the metadata rather than the tag itself (the 'knowledge graph' approach). So for instance, the "publisher" field will contain a link to the publisher's Google Plus or Facebook page rather than the publisher's name as text. We should follow these links to get the publisher name. For the time being, finding the title is probably sufficient, although in some cases, particularly with Facebook, we can expect these pages to have the Open Graph type "profile", in which case the page is pointing to a person.
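A minimal sketch of the idea above, in Python for illustration only (it is not Citoid's code). It assumes the page fetch is supplied by the caller as a hypothetical `fetch_html` callable, and that the name is taken from `og:title` first and the `<title>` element second. All function and class names here are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _MetaScraper(HTMLParser):
    """Collects <title>, og:title and og:type from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") in ("og:title", "og:type"):
            self.meta[attrs["property"]] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_url(value):
    """True when the field value is an http(s) URL rather than a plain name."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def resolve_publisher(value, fetch_html):
    """Return (name, og_type) for a "publisher" field value.

    fetch_html is a hypothetical callable taking a URL and returning the
    page's HTML as a string. Plain-text values are returned unchanged.
    """
    if not looks_like_url(value):
        return value, None
    scraper = _MetaScraper()
    scraper.feed(fetch_html(value))
    # Prefer og:title, fall back to <title>, and keep the URL as a last resort.
    name = scraper.meta.get("og:title") or scraper.title.strip() or value
    return name, scraper.meta.get("og:type")
```

A returned `og_type` of `"profile"` would signal that the linked page describes a person, so the caller could treat the name differently in that case.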

Event Timeline

rugk raised the priority of this task from to Needs Triage.
rugk updated the task description. (Show Details)
rugk added a project: VisualEditor.
rugk added a subscriber: rugk.
Jdforrester-WMF claimed this task.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Yes, this is per the website's claim:

<link rel="publisher" href="">

But this does not make any sense.

It seems to be related to a stupid feature by Google, where this link tag can be used to add an authorship link to the Google+ page:

However, I don't think that's how we define a "publisher" in Wikipedia, and adding these Google+ links everywhere is not only advertising, it also adds no further information to the reference...
So maybe add an exclusion or something like this for it?
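One way such an exclusion could look, as a rough Python sketch rather than anything Citoid actually does. The host list and the function name are assumptions made for illustration. The sketch drops a publisher value only when it is a URL on an excluded host.

```python
from urllib.parse import urlparse

# Assumption: hosts whose links should never be used as a publisher value.
EXCLUDED_PUBLISHER_HOSTS = {"plus.google.com"}


def clean_publisher(value):
    """Return None when value is a URL on an excluded host, else value unchanged."""
    parts = urlparse(value.strip())
    host = (parts.hostname or "").lower()
    if parts.scheme in ("http", "https") and host in EXCLUDED_PUBLISHER_HOSTS:
        return None
    return value
```

Returning `None` here means "leave the publisher field empty", which avoids putting a bare social-media link into the citation.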

Elitre renamed this task from Visual Editor often adds Google Plus links as a publisher to Follow links in non-link field to scrape more info. Nov 18 2015, 2:57 PM
Elitre reopened this task as Open.
Elitre removed Jdforrester-WMF as the assignee of this task.
Elitre added a project: Citoid.
Elitre set Security to None.
Elitre added subscribers: Mvolz, Elitre.

@Mvolz couldn't remember if a similar task had already been filed, so I'm changing the scope of this one for her.

Jdforrester-WMF renamed this task from Follow links in non-link field to scrape more info to In Citoid, magically follow URLs provided in non-link fields (e.g. link rel="publisher") to scrape more info from there. Nov 24 2015, 8:26 PM
Jdforrester-WMF triaged this task as Low priority.
Jdforrester-WMF edited a custom field.
Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board.


{{Cite web|title = Feels Like The First Time: Cursive's Tim Kasher reconvenes the Good Life|url =|website = Substream Magazine|publisher =|accessdate = 2015-11-29|language = en-US}}

is undesirable. If we're going to have a link to the publisher's website, then could we at least make it a link with a label, e.g., `[ Name]` ?

Mvolz raised the priority of this task from Low to High.
Mvolz lowered the priority of this task from High to Medium.

Raising to Normal as the community seems to think this is more of a priority :)

If this is still happening once the Embedded Metadata translator is up and running on the server, ping me and I'll look into it.

It isn't ideal to manually reformat what the website itself is giving us as its name, but if there are any patterns (e.g., if all Google Plus sites follow this pattern) I can help either write a translator to cover these cases or update Embedded Metadata to accommodate.

Mvolz removed Mvolz as the assignee of this task. Jan 15 2018, 11:15 AM