
Merge together the results from Zotero and the fallback HTML scraper?
Open, Medium, Public, 8 Estimated Story Points

Description

For example, https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/http%3A%2F%2Fwww.telegraph.co.uk%2Fnews%2F2017%2F05%2F24%2Ftesco-plans-could-spell-end-5p-carrier-bag%2F
returns:

[
  {
    "url": "http://www.telegraph.co.uk/news/2017/05/24/tesco-plans-could-spell-end-5p-carrier-bag/",
    "itemType": "newspaperArticle",
    "title": "Tesco plans could spell the end of the 5p carrier bag",
    "abstractNote": "The days of the 5p supermarket carrier bag could soon be over, as Tesco is piloting a plan to scrap them and force shoppers who forget their own bags to buy a \"bag for life\".",
    "publicationTitle": "The Telegraph",
    "language": "en-GB",
    "accessDate": "2017-05-25",
    "source": [
      "citoid"
    ]
  }
]

This is lacking any author data, but the page itself has a couple of HTML hints that the HTML scraper would presumably pick up, including a schema.org itemType. Though eventually we should get each Zotero translator to be as good as possible, maybe we should run each URL through both Zotero and the HTML scraper (ideally without fetching the content twice, to avoid putting extra load on the sources)?

Event Timeline

This task is sort of T114907. Or at least, blocked by it.

In this example the result is only coming from citoid, not Zotero, so this particular case is more a result of lack of support for the types of metadata here. I've never seen DCSext before. Seems to be some sort of .NET thing? https://github.com/StefanOssendorf/DCSExt

And we actually don't support schema.org except for title, that's T87331.

Apart from the concrete Telegraph case discussed in the comments, I think it would be valuable to consider the general question posed in this task's title:

Merge together the results from Zotero and the fallback HTML scraper?

Yes, I think that would be a good idea in general. As a first step we could consider interpolating only fields that are not provided by Zotero, and in a second step merge all fields removing duplicates.
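
To make the first step concrete, here is a minimal Python sketch of "only fill in what Zotero did not provide". It is not citoid's actual code: the function name `fill_missing`, the idea that both results arrive as flat dicts, and the sample author value are all assumptions made for illustration.

```python
from typing import Any

# Values treated as "field not provided" (an assumption for this sketch).
EMPTY = (None, "", [])


def fill_missing(zotero: dict[str, Any], scraped: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the Zotero record with gaps filled from the scraper.

    Zotero values always win; a scraped value is used only when the
    Zotero record has no usable value for that field.
    """
    merged = dict(zotero)
    for field, value in scraped.items():
        if merged.get(field) in EMPTY and value not in EMPTY:
            merged[field] = value
    return merged


# Hypothetical inputs: the author entry is a placeholder, not real data.
zotero_result = {
    "itemType": "newspaperArticle",
    "title": "Tesco plans could spell the end of the 5p carrier bag",
    "author": [],
}
scraped_result = {
    "title": "Tesco plans could spell the end of the 5p carrier bag - Telegraph",
    "author": [["Example", "Author"]],
    "language": "en-GB",
}

merged = fill_missing(zotero_result, scraped_result)
```

Here `merged` keeps Zotero's `title` and `itemType` and picks up `author` and `language` from the scraper. A field such as `source` would probably need special handling (for example, listing both sources), which this sketch leaves out.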

Apart from the concrete Telegraph case discussed in the comments, I think it would be valuable to consider the general question posed in this task's title:

Merge together the results from Zotero and the fallback HTML scraper?

Yes, I think that would be a good idea in general. As a first step we could consider interpolating only fields that are not provided by Zotero,

+1

and in a second step merge all fields removing duplicates

I guess we'd treat Zotero as a "better" source where it responds and we also get a result from the HTML scraper?

I guess we'd treat Zotero as a "better" source where it responds and we also get a result from the HTML scraper?

In general, yes, but I separated it as a second step because I believe there are a lot of irregularities and edge cases we need to keep in mind when doing this. A non-exhaustive list is:

  • same Zotero and HTML-scraped info, but different format/syntax (e.g. in one the author's name is shortened, in the other it is not)
  • info updated in one place, but not the other
  • etc...
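
As a rough illustration of the first edge case, the second step could compare authors on a normalised key rather than on the raw strings. The Python sketch below assumes authors are given as `[first, last]` pairs and that a matching surname plus a matching first initial means "same person"; both are assumptions for this example, and the heuristic would wrongly collapse, say, "John Smith" and "Jane Smith". The names `author_key` and `merge_authors` are hypothetical.

```python
def author_key(first: str, last: str) -> tuple[str, str]:
    """Key used to decide whether two author entries are duplicates."""
    initial = first.strip()[:1].casefold()
    return (last.strip().casefold(), initial)


def merge_authors(preferred: list[list[str]], other: list[list[str]]) -> list[list[str]]:
    """Merge two author lists, dropping duplicates.

    Entries from `preferred` (e.g. Zotero) come first. When two entries
    share a key, the one with the longer given name is kept, so "John"
    replaces "J." regardless of which source supplied it.
    """
    merged: dict[tuple[str, str], list[str]] = {}
    for first, last in list(preferred) + list(other):
        key = author_key(first, last)
        kept = merged.get(key)
        if kept is None or len(first.strip()) > len(kept[0].strip()):
            merged[key] = [first, last]
    # Dicts keep insertion order, so the preferred source's order is preserved.
    return list(merged.values())


# Placeholder names, not taken from any real citation.
from_zotero = [["J.", "Smith"]]
from_scraper = [["John", "Smith"], ["Ann", "Lee"]]
authors = merge_authors(from_zotero, from_scraper)
```

With these inputs, `authors` contains one entry for Smith (using the fuller "John") followed by the entry for Lee. The second edge case, info updated in one place but not the other, cannot be resolved by normalisation alone and would still need a rule for which source wins.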