Merge together the results from Zotero and the fallback HTML scraper?
Open, Medium · Public · 8 Estimated Story Points

Assigned To

None

Authored By

	Jdforrester-WMF
	May 25 2017, 10:18 AM

Description

For example, https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/http%3A%2F%2Fwww.telegraph.co.uk%2Fnews%2F2017%2F05%2F24%2Ftesco-plans-could-spell-end-5p-carrier-bag%2F
returns:

[
  {
    "url": "http://www.telegraph.co.uk/news/2017/05/24/tesco-plans-could-spell-end-5p-carrier-bag/",
    "itemType": "newspaperArticle",
    "title": "Tesco plans could spell the end of the 5p carrier bag",
    "abstractNote": "The days of the 5p supermarket carrier bag could soon be over, as Tesco is piloting a plan to scrap them and force shoppers who forget their own bags to buy a &quot;bag for life&quot;.",
    "publicationTitle": "The Telegraph",
    "language": "en-GB",
    "accessDate": "2017-05-25",
    "source": [
      "citoid"
    ]
  }
]

This is lacking any author data, but the page itself has a couple of HTML hints that the HTML scraper would presumably pick up, including a schema.org itemType. Though eventually we should get each Zotero translator to be as good as possible, maybe we should run things through both (somehow without fetching the content twice to avoid trouble for sources?).
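One way to run both extractors without hitting the source site twice might be to fetch the page once and hand the same body to each. The sketch below is only an illustration under that assumption: `fetch`, `zotero_extract`, and `scraper_extract` are hypothetical callables, not citoid or Zotero APIs, and it assumes both extractors can work from an already-fetched body.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_both(url, fetch, zotero_extract, scraper_extract):
    """Fetch `url` once, then run both extractors on the same body in parallel.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    html = fetch(url)  # the only request made to the source site
    with ThreadPoolExecutor(max_workers=2) as pool:
        zotero_future = pool.submit(zotero_extract, url, html)
        scraper_future = pool.submit(scraper_extract, url, html)
        # .result() re-raises any exception thrown inside an extractor.
        return zotero_future.result(), scraper_future.result()
```

Passing the extractors in as arguments keeps the sketch independent of how either service is actually called.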

Related Objects

Status	Assigned	Task
Declined	Mvolz	T93579 Restructure so that citoid can be run without Zotero
Open	None	T114907 Parallelize scraper and Zotero requests
Open	None	T166297 Merge together the results from Zotero and the fallback HTML scraper?

Event Timeline

Jdforrester-WMF created this task. May 25 2017, 10:18 AM

Restricted Application added a subscriber: Aklapper. May 25 2017, 10:18 AM

Jdforrester-WMF moved this task from To Triage to TR1: Releases on the VisualEditor board. May 25 2017, 10:18 AM

This task is more or less T114907, or at least blocked by it.

In this example the data is only coming from citoid, not Zotero, so this particular case is more a result of our lack of support for the types of metadata used here. I've never seen DCSext before. It seems to be some sort of .NET thing? https://github.com/StefanOssendorf/DCSExt

And we actually don't support schema.org except for title, that's T87331.

Mvolz added a parent task: T114907: Parallelize scraper and Zotero requests. May 25 2017, 10:40 AM

Ha, OK.

Looks like Zotero does have a Telegraph translator, it's just not enabled: https://github.com/zotero/translators/blob/e04633baeaac2c62eddb4435e517ae7d51c89a73/The%20Telegraph.js

T166305

Mvolz created subtask T166305: Test and potentially enable 'v' flag on Telegraph translator. May 25 2017, 12:05 PM

Mvolz removed a subtask: T166305: Test and potentially enable 'v' flag on Telegraph translator.

Apart from the concrete Telegraph case discussed in the comments, I think it would be valuable to consider the general question posed in this task's title:

Merge together the results from Zotero and the fallback HTML scraper?

Yes, I think that would be a good idea in general. As a first step we could consider interpolating only the fields that are not provided by Zotero, and in a second step merge all fields, removing duplicates.
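A minimal sketch of that first step, assuming both results arrive as flat dictionaries shaped like the JSON in the description. `fill_missing` is a hypothetical helper, not an existing citoid function, and the author value is made-up sample data in the `[first, last]` pair shape.

```python
# Hypothetical helper; not part of citoid. Sample data is illustrative only.
EMPTY_VALUES = (None, "", [], {})


def fill_missing(zotero, scraped):
    """Return a copy of the Zotero result with gaps filled from the scraper."""
    merged = dict(zotero)
    for field, value in scraped.items():
        # Only fill fields Zotero left absent or empty; never overwrite.
        if merged.get(field) in EMPTY_VALUES and value not in EMPTY_VALUES:
            merged[field] = value
    return merged


zotero_result = {
    "itemType": "newspaperArticle",
    "title": "Tesco plans could spell the end of the 5p carrier bag",
    "publicationTitle": "The Telegraph",
}
scraped_result = {
    "title": "Tesco plans could spell the end of the 5p carrier bag - Telegraph",
    "author": [["Jane", "Doe"]],  # placeholder name, not the real byline
}
merged = fill_missing(zotero_result, scraped_result)
```

Here the Zotero title is kept and only `author` is taken from the scraper, so nothing Zotero returned can get worse.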

In T166297#3298954, @mobrovac wrote:

Apart from the concrete Telegraph case discussed in the comments, I think it would be valuable to consider the general question posed in this task's title:

Merge together the results from Zotero and the fallback HTML scraper?

Yes, I think that would be a good idea in general. As a first step we could consider interpolating only fields that are not provided by Zotero, and in a second step merge all fields removing duplicates

I guess we'd treat Zotero as a "better" source where it responds and we also get a result from the HTML scraper?

In T166297#3301458, @Jdforrester-WMF wrote:

I guess we'd treat Zotero as a "better" source where it responds and we also get a result from the HTML scraper?

In general, yes, but I separated it as a second step because I believe there are a lot of irregularities and edge cases we need to keep in mind when doing this. A non-exhaustive list is:

same Zotero and HTML-scraped info, but different format/syntax (e.g. in one the author's name is shortened, in the other it is not)
info updated in one place, but not the other
etc...
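To illustrate the first edge case, here is a rough sketch of how shortened and full author names might be matched when merging. It assumes authors are `[first, last]` pairs and compares surname plus first initial. `author_key` and `merge_authors` are hypothetical names, and the heuristic is deliberately crude: it would also conflate, say, "John Smith" and "Jane Smith".

```python
def author_key(first, last):
    """Reduce a name to (lowercased surname, first initial) for comparison."""
    return (last.strip().lower(), first.strip()[:1].lower())


def merge_authors(zotero_authors, scraped_authors):
    """Union of both lists, keeping Zotero's spelling when the keys match."""
    merged = [list(author) for author in zotero_authors]
    seen = {author_key(first, last) for first, last in zotero_authors}
    for first, last in scraped_authors:
        key = author_key(first, last)
        if key not in seen:
            seen.add(key)
            merged.append([first, last])
    return merged


# "J. Smith" is treated as the same person as "John Smith"; "Ann Lee" is new.
authors = merge_authors(
    [["John", "Smith"]],
    [["J.", "Smith"], ["Ann", "Lee"]],
)
```

The second edge case (info updated in one place but not the other) cannot be resolved by normalisation alone, so it would need a separate rule about which source wins.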

Mvolz moved this task from Backlog to Service on the Citoid board. Jun 29 2017, 1:58 PM

Deskana moved this task from TR1: Releases to External and Administrivia on the VisualEditor board. Jul 26 2017, 9:28 AM

Deskana removed a project: VisualEditor. Aug 21 2018, 1:14 PM
