
Documentation for citoid in general, but dwindled down info type handling
Closed, Resolved · Public · 0 Estimated Story Points

Description

The documentation for citoid is a rather bare-bones "we have a service called citoid", and it is nowhere near what's needed to track down problems with the live service. Trying to figure out what works and what doesn't, I have spent several days and come away with only a marginally better idea of what's working.

There is no clear information on how and why some metadata parses and some doesn't. I assume that these translators are supposed to work, but they only do so sometimes. So, to be sure that I in fact had a valid page, I wrote a server that does no caching at all and also tells everything downstream not to do any caching. Then I used the turtle_movie.html test file and checked the response. What I got back was this

[
	{
		"url":"https://my.own.web.server",
		"itemType":"webpage",
		"language":"en-US",
		"title":"Turtles of the Jungle",
		"abstractNote":"A 2008 film about jungle turtles.",
		"websiteTitle":"Awesome Turtle Movies Website",
		"date":"2012-02-04",
		"accessDate":"2016-04-25",
		"author":[
			["","http://www.example.com/turtlelvr"]
		]
	}
]

Clearly some fields are missing, and the itemType is "webpage". Author is also broken, but hey, it's a test file.
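For anyone who wants to reproduce this, a no-cache test server of the kind described above could look roughly like the following. This is only a minimal sketch using Python's standard library, not the server actually used for the report; the names `NoCacheHandler` and `make_server` and the port are assumptions made for illustration.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class NoCacheHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory and forbid caching downstream."""

    def send_head(self):
        # Drop conditional-request headers so the server never answers
        # "304 Not Modified" and always sends the full page.
        for name in ("If-Modified-Since", "If-None-Match"):
            del self.headers[name]  # no error if the header is absent
        return super().send_head()

    def end_headers(self):
        # Tell browsers, proxies and services not to store the response.
        self.send_header("Cache-Control", "no-store, no-cache, must-revalidate, max-age=0")
        self.send_header("Pragma", "no-cache")
        self.send_header("Expires", "0")
        super().end_headers()


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), NoCacheHandler)

# To run it from the directory holding turtle_movie.html:
#   make_server("0.0.0.0", 8000).serve_forever()
```

Whether citoid or anything between it and the test server honours these headers is a separate question; the sketch only guarantees what the test server itself sends.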

Starting to get a bit frustrated, I found a page at Time, which I copied locally. I chose this page because citoid said it was something other than a "webpage". Using their own server I got this

[
	{
		"itemType":"magazineArticle",
		"notes":[],
		"tags":[
			{"tag":"al-qaeda","type":1},
			{"tag":"crime","type":1},
			{"tag":"hizballah","type":1},
			{"tag":"narcotrafficking","type":1},
			{"tag":"syria","type":1},
			{"tag":"breaking bad","type":1},
			{"tag":"captagon","type":1},
			{"tag":"meth","type":1},
			{"tag":"syria","type":1}
		],
		"title":"Syria’s Breaking Bad: Are Amphetamines Funding the War?",
		"publicationTitle":"Time",
		"url":"http://world.time.com/2013/10/28/syrias-breaking-bad-are-amphetamines-funding-the-war/",
		"abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war",
		"libraryCatalog":"world.time.com",
		"accessDate":"2016-04-24",
		"ISSN":["0040-718X"],
		"shortTitle":"Syria’s Breaking Bad",
		"author":[
			["Aryn","Baker"]
		]
	}
]

Note that itemType is now "magazineArticle".

My exact copy of the same page turned up this

[
	{
		"url":"https://my.own.server",
		"itemType":"webpage",
		"websiteTitle":"TIME.com",
		"title":"Syria’s Breaking Bad: Are Amphetamines Funding the War? | TIME.com",
		"abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war",
		"accessDate":"2016-04-25"
	}
]

Note that the itemType is now "webpage". Something is clearly wrong, unless citoid sees something different from what I see. And yes, I have checked that the metadata that parses and gets reported back from citoid does in fact reflect my changes to the same page on my server.
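To make the difference between the two responses explicit, they can be compared field by field. The helper below is a hypothetical sketch (`compare_citations` is not part of citoid); the sample dictionaries are abridged from the two responses quoted above, with tags and abstract left out.

```python
def compare_citations(reference, candidate, ignore=("url", "accessDate")):
    """Return (fields missing from candidate, fields whose values differ)."""
    missing = sorted(
        key for key in reference
        if key not in candidate and key not in ignore
    )
    changed = {
        key: (reference[key], candidate[key])
        for key in reference
        if key in candidate and key not in ignore
        and reference[key] != candidate[key]
    }
    return missing, changed


# Abridged from the response for the page on world.time.com.
live = {
    "itemType": "magazineArticle",
    "title": "Syria’s Breaking Bad: Are Amphetamines Funding the War?",
    "publicationTitle": "Time",
    "libraryCatalog": "world.time.com",
    "accessDate": "2016-04-24",
    "ISSN": ["0040-718X"],
    "shortTitle": "Syria’s Breaking Bad",
    "author": [["Aryn", "Baker"]],
}

# Abridged from the response for the local copy of the same page.
local_copy = {
    "url": "https://my.own.server",
    "itemType": "webpage",
    "websiteTitle": "TIME.com",
    "title": "Syria’s Breaking Bad: Are Amphetamines Funding the War? | TIME.com",
    "accessDate": "2016-04-25",
}

missing, changed = compare_citations(live, local_copy)
print(missing)          # ['ISSN', 'author', 'libraryCatalog', 'publicationTitle', 'shortTitle']
print(sorted(changed))  # ['itemType', 'title']
```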

We need a better description of the parsing process, how it chooses the translator to use, and why it typically fails. As it is now, it is very difficult to tell external parties how they should change their pages so that they parse correctly; we more or less only have a black box that does "something", and often not even that.

Sorry for the tone, I'm a bit frustrated over this.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Just a guess: are you running a separate translator for Time.com? That could be the reason for some of the weirdness. It still does not explain why local newspapers give results different from what I expect.

The reason for the discrepancy is that citoid uses Zotero, but if Zotero
isn't running it falls back on native citoid code. It looks like your local
version doesn't have Zotero working so you're only getting the native
parser results.

Ah yeah, so Zotero uses regex against the submitted URL to determine which
Zotero translator to use. If you're using a different URL with the same
metadata you'll get the metadata from the citoid translators instead.
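The selection logic described in the two comments above can be pictured roughly as follows. This is only an illustrative sketch of the behaviour as described in this thread, not citoid's or Zotero's actual code; the translator table, the pattern, and the function name `choose_translator` are all hypothetical.

```python
import re

# Hypothetical table: (URL pattern, translator name, hard-coded itemType).
SITE_TRANSLATORS = [
    (re.compile(r"^https?://([^/]+\.)?time\.com/"), "Time.com", "magazineArticle"),
]


def choose_translator(url, zotero_available=True):
    """Pick a site-specific translator by URL, else fall back to generic parsing."""
    if zotero_available:
        for pattern, name, item_type in SITE_TRANSLATORS:
            if pattern.search(url):
                # The URL matched, so the item type can be hard-coded.
                return {"source": "Zotero", "translator": name, "itemType": item_type}
    # No Zotero, or no pattern matched: only the page's own metadata is left,
    # and "webpage" is the default type.
    return {"source": "citoid", "translator": "generic metadata", "itemType": "webpage"}


print(choose_translator(
    "http://world.time.com/2013/10/28/syrias-breaking-bad-are-amphetamines-funding-the-war/"
)["itemType"])  # magazineArticle
print(choose_translator("https://my.own.server")["itemType"])  # webpage
```

Under this reading, an exact copy of the page served from another host never reaches the site-specific translator, which would explain the "webpage" result.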

I made a stripped-down HTML page like the following, but it still turns up as "webpage". [No fields are missing in this case.]

<html lang="en">
	<head>
		<meta charset="utf-8">
		<title>Turtles are AWESOME!!1 | Awesome Turtles Website</title>
		
		<meta name="author" content="Turtle Lvr">
		<meta name="robots" content="we welcome our robot overlords"/>
		<meta name="description" content="Exposition on the awesomeness of turtles"/>
		<meta name="keywords" content="turtle, movie" />
		
		<link rel="canonical" href="http://example.com/turtles" />
		<link rel="publisher" href="https://mediawiki.org"/>
		<link rel="author" href="http://examples.com/turtlelvr"/>
		<link rel="shortlink" href="http://example.com/c" />
		
		<!--Dublin Core-->
		<meta name="DC.Title" content="Turtles of the Jungle" >
		<meta name="DC.Creator" content="http://www.example.com/turtlelvr" >
		<meta name="DC.Description" content="A 2008 film about jungle turtles." >
		<meta name="DC.Date" content="2012-02-04 12:00:00" >
		<meta name="DC.Type" content="Image.Moving" >
	</head>
	<body>
	</body>
</html>

It will always turn out to be webpage, because you can't tell it's a
magazineArticle from the metadata alone. You need to know it's from time.com
for that. This is part of the reason the Zotero translators give much
better results, because the itemType can be hard coded based on the URL.

Webpage is the default itemType because technically, everything is a
webpage :)

The page says explicitly which type of item it describes, but that information is ignored?

Hmm, I don't think we're using Dublin Core types to determine type. Only
Open Graph. That could be filed as a separate bug.
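To check which type hints a test page actually exposes, independently of what citoid does with them, the `<meta>` tags can be read directly. A minimal sketch using Python's standard `html.parser`; the class and function names are made up for this example, and it deliberately does not try to map the values to citoid item types.

```python
from html.parser import HTMLParser


class TypeHintParser(HTMLParser):
    """Collect the og:type and DC.Type values found in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key and key.lower() in ("og:type", "dc.type"):
            self.hints[key.lower()] = attr.get("content")


def type_hints(html):
    """Return a dict like {'og:type': ..., 'dc.type': ...} for the given HTML."""
    parser = TypeHintParser()
    parser.feed(html)
    return parser.hints


print(type_hints('<meta name="DC.Type" content="Image.Moving" >'))
# {'dc.type': 'Image.Moving'}
```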

Check the translator for Dublin Core; it clearly shows that the type is in use.

Checked OpenGraph, page still fails to give correct type.

This page

<html lang="en">

<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# video: http://ogp.me/ns/video#">

<meta charset="utf-8">

<title>Turtles are AWESOME!!1 | Awesome Turtles Website</title>

<meta name="author" content="Turtle Lvr">
<meta name="robots" content="we welcome our robot overlords"/>
<meta name="description" content="Exposition on the awesomeness of turtles"/>
<meta name="keywords" content="turtle, movie" />

<link rel="canonical" href="http://example.com/turtles" />
<link rel="publisher" href="https://mediawiki.org"/>
<link rel="author" href="http://examples.com/turtlelvr"/>
<link rel="shortlink" href="http://example.com/c" />

<!--Open Graph-->

<meta property="og:locale" content="en_US" />
<meta property="og:type" content="video.movie" />
<meta property="og:title" content="Turtles of the Jungle" />
<meta property="og:description" content="A 2008 film about jungle turtles." />
<meta property="og:url" content="http://example.com" />
<meta property="og:site_name" content="Awesome Turtle Movies Website" />
<meta property="og:image" content="http://example.com/turtle.jpg" />
<meta property="og:image" content="http://example.com/shell.jpg" />

<meta property="video:tag" content="turtle" />
<meta property="video:tag" content="movie" />
<meta property="video:tag" content="awesome" />
<meta property="video:director" content="http://www.example.com/PhilTheTurtle" />
<meta property="video:actor" content="http://www.example.com/PatTheTurtle" />
<meta property="video:actor:role" content="Turtle #3" /> <!-- Currently ignored -->
<meta property="video:actor" content="http://www.example.com/SaminaTheTurtle" />
<meta property="video:writer" content="http://www.example.com/TinaTheTurtle" />
<meta property="video:release_date" content="2015-01-14T19:14:27+00:00" />
<meta property="video:duration"  content="1000000" />

</head>

<body>
</body>

</html>

Gives this result

[{
"url":"https://this.is.my.shit",
"itemType":"webpage",
"language":"en-US",
"title":"Turtles of the Jungle",
"abstractNote":"A 2008 film about jungle turtles.",
"websiteTitle":"Awesome Turtle Movies Website",
"accessDate":"2016-04-25",
"author":[["Turtle","Lvr"]]
}]

It is run through citoid.wikimedia.org

I wonder if this happens for all of the citoid translators. Will check...

  • Base data is picked up from turtle_movie.html
  • Type from OpenGraph fails
  • Type from DublinCore fails
  • Nothing additional is picked up from Twitter
  • Nothing additional is picked up from AL
jeblad renamed this task from Documentation for citoid to Documentation for citoid in general, but dwindled down info type handling. May 12 2016, 1:50 AM

Some examples

Wild guess, type handling is still broken? That is, the code does not fall back to use Twitter and OGP.

Another thing that might help is that responses now include the metadata source in the source field, so you can tell if any given response is coming from a zotero translator or not:

See:

https://citoid.wikimedia.org/api?format=mediawiki&search=example.com

vs.

https://citoid.wikimedia.org/api?format=mediawiki&search=http%3A%2F%2Flink.springer.com%2Fchapter%2F10.1007%2F11926078_68

(We do not support Twitter and AL currently in citoid.)

I've added a section here with a little more information, including docs about the new source field: https://www.mediawiki.org/wiki/Citoid/API#Response
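A quick way to look at the source field from a script might be something like the sketch below. The endpoint and the `format` and `search` parameters are taken from the two URLs above; the exact shape of the `source` value is not spelled out in this thread, so the code simply returns whatever is there. The function names and the User-Agent string are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://citoid.wikimedia.org/api"


def citoid_url(search, fmt="mediawiki"):
    """Build the request URL, percent-encoding the search term."""
    return API + "?" + urlencode({"format": fmt, "search": search})


def sources(citations):
    """Return the 'source' value of each citation (None if the field is absent)."""
    return [citation.get("source") for citation in citations]


def fetch_sources(search):
    """Query the live service (needs network access) and report the sources."""
    request = Request(
        citoid_url(search),
        # Placeholder User-Agent; replace it with something that identifies you.
        headers={"User-Agent": "citoid-source-check/0.1 (example script)"},
    )
    with urlopen(request, timeout=30) as response:
        return sources(json.load(response))


print(citoid_url("http://link.springer.com/chapter/10.1007/11926078_68"))
```

Calling `fetch_sources("example.com")` and `fetch_sources("http://link.springer.com/chapter/10.1007/11926078_68")` should then show directly whether a given result came from a Zotero translator or not.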

Too much to hope for that this resolves this ticket? :D

Some examples

  • NRK: Ny reality-serie (citoid) really strange author entry, but it is correct given the source. This type of error is pretty common. If the publisher is used, then clean out the same entries from author.
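The clean-up suggested in that bullet could be sketched as a post-processing step on a citation. This is purely illustrative and not something citoid does; the function name, the list of fields treated as "publisher", and the sample data are all assumptions (the sample is invented, not the actual NRK response).

```python
def drop_publisher_from_authors(
    citation,
    publisher_fields=("publisher", "publicationTitle", "websiteTitle"),
):
    """Return a copy of the citation without authors that just repeat the publisher."""
    publishers = {
        str(citation[field]).strip().casefold()
        for field in publisher_fields
        if citation.get(field)
    }
    kept = [
        name for name in citation.get("author", [])
        # Authors are [first, last] pairs; join the non-empty parts to compare.
        if " ".join(part for part in name if part).strip().casefold() not in publishers
    ]
    cleaned = dict(citation)
    if kept:
        cleaned["author"] = kept
    else:
        cleaned.pop("author", None)
    return cleaned


# Invented sample in the same shape as the responses quoted in this task.
sample = {"websiteTitle": "NRK", "author": [["", "NRK"], ["Kari", "Nordmann"]]}
print(drop_publisher_from_authors(sample)["author"])  # [['Kari', 'Nordmann']]
```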

Can't do much about that in citoid, since it's a problem with the underlying metadata. This kind of issue is best solved by creating a Zotero translator which can deal with issues specific to particular sites.

There are at least two bugs encompassed in that one, created: T147079

Wild guess, type handling is still broken? That is, the code does not fall back to use Twitter and OGP.

This is a problem with a Zotero translator, probably could be filed upstream.

I'm going to mark this as resolved, since the service now reports the source, which should make it much easier to debug, and the documentation on wikitech has been updated. Happy to re-open if it's not satisfactory though.